# CPSC 422 Spring 2012: Assignment 2

Due: 10:00am, Wednesday 18 January 2012.

You are encouraged to discuss this assignment and collaborate with your classmates, as long as (a) you list the people with whom you discussed the assignment and (b) you give your own answers to the questions. Feel free to ask questions on the Vista bulletin board.

# Question 1

Explain how Q-learning fits in with the agent architecture of Section 2.2.1 of the textbook. Suppose that the Q-learning agent has discount factor γ, a step size of α, and is carrying out an ε-greedy exploration strategy.

What are the percepts of the agent?

What are the components of the belief state of the Q-learning agent?

What is the command function of the Q-learning agent?

What is the belief-state transition function of the Q-learning agent?
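To make the correspondence concrete, here is a minimal sketch in Python of the two functions for a tabular Q-learning agent. The dictionary `Q`, the `(state, action)` keys, and the default parameter values are illustrative assumptions, not code from the textbook or the demo.

```python
import random
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Belief-state transition: update Q[(s, a)] from the percept (r, s_next)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Command function: pick an action given the current belief state."""
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a2: Q[(s, a2)])       # exploit
```

Note that if α varies with experience, the belief state would also have to include a visit count for each state-action pair, not just the table `Q`.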

# Question 2

Compare the different parameter settings for the game of Example 11.8 of the textbook, as at http://artint.info/demos/rl/sGameQ.html. In particular, compare the following situations:

- α varies, and the Q-values are initialized to 0.0.
- α varies, and the Q-values are initialized to 5.0.
- α is fixed to 0.1, and the Q-values are initialized to 0.0.
- α is fixed to 0.1, and the Q-values are initialized to 5.0.
- Some other parameter settings.

For each of these, carry out multiple runs and compare the distributions of minimum values, zero crossings, the asymptotic slope for the policy that includes exploration, and the asymptotic slope for the policy that does not include exploration. To measure the last of these, after the algorithm has converged, set the exploitation parameter to 100% and run a large number of additional steps.
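These statistics can all be read off the accumulated-reward trace of a run. The helper below is a hypothetical sketch, assuming the trace is available as a list of floats (the demo itself only plots it); the function name and `tail_fraction` parameter are illustrative.

```python
def run_statistics(cumulative_reward, tail_fraction=0.2):
    """Summarise one run: minimum value, zero crossing, asymptotic slope.

    cumulative_reward: the accumulated reward after each step.
    """
    n = len(cumulative_reward)
    minimum = min(cumulative_reward)
    i_min = cumulative_reward.index(minimum)
    # First step at or after the minimum where accumulated reward reaches zero.
    zero_crossing = next(
        (i for i in range(i_min, n) if cumulative_reward[i] >= 0.0), None)
    # Least-squares slope over the final tail_fraction of the run.
    tail = cumulative_reward[int(n * (1 - tail_fraction)):]
    xs = list(range(len(tail)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(tail) / len(tail)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, tail))
             / sum((x - mean_x) ** 2 for x in xs))
    return minimum, zero_crossing, slope
```

Running this over the traces of many runs gives the distributions the question asks you to compare.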

# Question 3

Consider four different ways to derive the value of α_k from k in Q-learning (note that for Q-learning with varying α_k, there must be a different count k for each state-action pair).

- Let α_k = 1/k.
- Let α_k = 10/(9 + k).
- Let α_k = 0.1.
- Let α_k = 0.1 for the first 10,000 steps, α_k = 0.01 for the next 10,000 steps, α_k = 0.001 for the next 10,000 steps, α_k = 0.0001 for the next 10,000 steps, and so on.
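Assuming the four schedules are the standard ones from this exercise (α_k = 1/k, α_k = 10/(9 + k), a fixed α_k = 0.1, and a staircase that divides α by 10 every 10,000 steps), they can be written as a single function of the visit count k; the scheme names below are illustrative.

```python
def alpha(scheme, k):
    """Step size alpha_k as a function of the visit count k (k >= 1)."""
    if scheme == "1/k":
        return 1.0 / k
    if scheme == "10/(9+k)":
        return 10.0 / (9.0 + k)
    if scheme == "fixed":
        return 0.1
    if scheme == "staircase":
        # 0.1 for steps 1..10000, 0.01 for 10001..20000, and so on.
        return 0.1 / (10 ** ((k - 1) // 10000))
    raise ValueError(scheme)
```

The theoretical convergence conditions to check against each schedule are that the α_k sum to infinity while their squares sum to a finite value.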

Which of these will converge to the true Q-value in theory?

Which converges to the true Q-value in practice (i.e., in a reasonable number of steps)? Try it at least for the simple game at http://artint.info/demos/rl/sGameQ.html. Note how α is set in the code.

Which can adapt when the environment changes slowly?

# Question 4

How long did this assignment take? What did you learn? Was it reasonable?

# Mini-project suggestions

One way to understand the effect of different values of α in temporal difference learning is to plot the target values and the estimates through time. What happens if the values change (e.g., in a step function)? What happens if the values keep rising? What happens if there is noise in the values? Plot some scenarios and explain to the class what you learned from doing this.
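As a starting point, the temporal difference update for tracking a single value can be simulated directly. The sketch below is illustrative: the function name and the two scenario definitions (a step-function target and a noisy constant target) are assumptions for the plots, not prescribed by the assignment.

```python
import random

def td_track(targets, alpha=0.1):
    """Track a sequence of target values with the TD update v += alpha*(target - v)."""
    v, estimates = 0.0, []
    for target in targets:
        v += alpha * (target - v)
        estimates.append(v)
    return estimates

# Step-function target: the value jumps from 0 to 10 halfway through.
step = [0.0] * 50 + [10.0] * 50

# Noisy constant target around 5.0.
random.seed(0)
noisy = [5.0 + random.gauss(0.0, 1.0) for _ in range(100)]
```

Plotting `targets` against `td_track(targets, alpha)` for several values of α shows the trade-off: a large α tracks changes quickly but amplifies noise, while a small α smooths noise but lags behind a step or a rising target.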