Explain how Q-learning fits in with the agent architecture of Section 2.2.1 of the textbook. Suppose that the Q-learning agent has discount factor $\gamma$, a step size of $\alpha$, and is carrying out an $\epsilon$-greedy exploration strategy.
What are the percepts of an agent?
What are the components of the belief state of the Q-learning agent?
What is the command function of the Q-learning agent?
What is the belief-state transition function of the Q-learning agent?
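As a starting point for the questions above, here is a minimal sketch of how a tabular Q-learning agent could be split into a belief state, a command function, and a belief-state transition function. The class name `QLearningAgent`, the method names `command`, `remember`, and `step`, and the default parameter values are assumptions made for illustration; they are not taken from the textbook or its code.

```python
import random
from collections import defaultdict


class QLearningAgent:
    """Illustrative tabular Q-learning agent (hypothetical names throughout)."""

    def __init__(self, actions, gamma=0.9, alpha=0.1, epsilon=0.1,
                 q_init=0.0, seed=0):
        self.actions = list(actions)
        self.gamma = gamma      # discount factor
        self.alpha = alpha      # step size
        self.epsilon = epsilon  # exploration probability
        self.rng = random.Random(seed)
        # Belief state: the Q-table plus the previous state and action.
        self.q = defaultdict(lambda: q_init)
        self.prev_state = None
        self.prev_action = None

    def command(self, state):
        """Command function: epsilon-greedy choice given the belief state."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def remember(self, reward, state):
        """Belief-state transition: fold the percept (reward, state) into Q."""
        if self.prev_state is not None:
            key = (self.prev_state, self.prev_action)
            best_next = max(self.q[(state, a)] for a in self.actions)
            target = reward + self.gamma * best_next
            self.q[key] += self.alpha * (target - self.q[key])
        self.prev_state = state

    def step(self, reward, state):
        """One cycle: update the belief state, then issue a command."""
        self.remember(reward, state)
        self.prev_action = self.command(state)
        return self.prev_action
```

In this sketch the percept is the pair `(reward, state)` passed to `step`; whether that matches the decomposition intended by Section 2.2.1 is for you to argue in your answer.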
Compare the different parameter settings for the game of Example 11.8 of the textbook, as at http://artint.info/demos/rl/sGameQ.html. In particular compare the following situations:
$\alpha$ varies, and the Q-values are initialized to 0.0.
$\alpha$ varies, and the Q-values are initialized to 5.0.
$\alpha$ is fixed to 0.1, and the Q-values are initialized to 0.0.
$\alpha$ is fixed to 0.1, and the Q-values are initialized to 5.0.
Some other parameter settings.
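Before running the demo, it may help to see how the four settings differ on a much smaller problem. The sketch below is not the game of Example 11.8: it is a hypothetical one-state problem with two actions and deterministic rewards of 1.0 and 2.0, the agent is purely greedy (no exploration) so that runs are repeatable, and "varies" is assumed to mean a step size of 1/k, where k counts how often the action has been tried. All names are illustrative.

```python
def run_greedy(q_init, step_size, rewards=(1.0, 2.0), steps=200):
    """Greedy Q-learning on a hypothetical one-state, two-action problem."""
    q = [q_init] * len(rewards)
    counts = [0] * len(rewards)
    for _ in range(steps):
        # Greedy choice; max() returns the lowest index when values tie.
        action = max(range(len(q)), key=lambda a: q[a])
        counts[action] += 1
        alpha = step_size(counts[action])
        # With a single state there is no successor value to bootstrap from.
        q[action] += alpha * (rewards[action] - q[action])
    return q, counts


# (initial Q-value, step-size schedule) for each of the four settings.
settings = {
    "alpha=1/k, Q0=0.0": (0.0, lambda k: 1.0 / k),
    "alpha=1/k, Q0=5.0": (5.0, lambda k: 1.0 / k),
    "alpha=0.1, Q0=0.0": (0.0, lambda k: 0.1),
    "alpha=0.1, Q0=5.0": (5.0, lambda k: 0.1),
}

results = {name: run_greedy(q0, sched) for name, (q0, sched) in settings.items()}
for name, (q, counts) in results.items():
    print(f"{name}: Q={[round(v, 3) for v in q]} counts={counts}")
```

With an initial value of 0.0 the greedy agent never tries the second action, while an initial value of 5.0 forces it to try both. In the demo, exploration and stochastic outcomes will blur this contrast, so treat the sketch only as a guide to what to look for.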
Consider four different ways to derive the value of $\alpha_k$ from $k$ in Q-learning (note that for Q-learning with varying $\alpha_k$, there must be a different count $k$ for each state–action pair).
Let
.
Let
.
Let
.
Let
for the first 10,000 steps,
for the next 10,000 steps,
for the next 10,000 steps,
for the next 10,000 steps, and so on.
Which of these will converge to the true Q-value in theory?
Which converges to the true Q-value in practice (i.e., in a reasonable number of steps)? Try it at least for the simple game at http://artint.info/demos/rl/sGameQ.html. Note that for the code,
is set in
in
.
Which can adapt when the environment changes slowly?
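To build intuition for these questions before experimenting with the game, the following sketch applies the same kind of incremental update to a stream of noisy samples rather than to full Q-learning. The two schedules shown, 1/k and a constant 0.1, are chosen for illustration only, and the sample streams (means of 2.0 and 4.0, unit Gaussian noise) are invented for the demonstration.

```python
import random


def running_estimate(samples, step_size, initial=0.0):
    """Apply estimate += alpha_k * (sample - estimate) over all samples."""
    estimate = initial
    for k, value in enumerate(samples, start=1):
        estimate += step_size(k) * (value - estimate)
    return estimate


# Illustrative schedules mapping the count k to a step size.
schedules = {
    "1/k": lambda k: 1.0 / k,
    "constant 0.1": lambda k: 0.1,
}

rng = random.Random(0)
# Stationary case: every sample is drawn around a mean of 2.0.
stationary = [2.0 + rng.gauss(0.0, 1.0) for _ in range(5000)]
# Changing case: the mean moves from 2.0 to 4.0 halfway through.
shifted = ([2.0 + rng.gauss(0.0, 1.0) for _ in range(2500)]
           + [4.0 + rng.gauss(0.0, 1.0) for _ in range(2500)])

for name, schedule in schedules.items():
    print(f"{name:>12}: stationary={running_estimate(stationary, schedule):.3f} "
          f"shifted={running_estimate(shifted, schedule):.3f}")
```

With a step size of 1/k the estimate is exactly the average of all samples seen so far, so it settles down on the stationary stream but is slow to follow the shift. A constant step size weights recent samples more heavily, so it follows the shift but keeps fluctuating with the noise.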
How long did this assignment take? What did you learn? Was it reasonable?
One way to understand the effect of different values of $\alpha$ in temporal difference learning is to plot the values and the estimates through time. What happens if the values change (e.g., in a step function)? What happens if the values keep rising? What happens if there is noise in the values? Plot some scenarios and explain to the class what you learned from doing this.
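One possible way to generate the data for such plots is sketched below. It records the estimate after every update so that the whole trajectory can be plotted against the values. The three scenarios, the choice of step sizes 0.01, 0.1, and 0.5, and all names are assumptions made for illustration; the plotting itself is left to whatever tool you prefer.

```python
import random


def td_trace(values, alpha, initial=0.0):
    """Return the estimate after each update estimate += alpha * (v - estimate)."""
    estimate = initial
    trace = []
    for v in values:
        estimate += alpha * (v - estimate)
        trace.append(estimate)
    return trace


rng = random.Random(1)
n = 200
scenarios = {
    "step": [0.0] * (n // 2) + [10.0] * (n // 2),              # sudden change
    "rising": [0.1 * t for t in range(n)],                     # steady increase
    "noisy": [5.0 + rng.gauss(0.0, 1.0) for _ in range(n)],    # fixed mean plus noise
}
alphas = [0.01, 0.1, 0.5]

# One trace per (scenario, step size) pair, ready to be plotted against time.
traces = {(name, a): td_trace(vals, a)
          for name, vals in scenarios.items() for a in alphas}

for (name, a), trace in traces.items():
    print(f"{name:>6} alpha={a:<4} final estimate={trace[-1]:7.3f} "
          f"final value={scenarios[name][-1]:7.3f}")
```

When comparing the curves, look at how quickly each estimate reacts to the step, how far it lags behind the rising values, and how much of the noise it passes through.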