# CPSC 422 Spring 2012 Assignment 3

Due: 10:00am, Wednesday 25 January 2012.

You are encouraged to discuss this assignment and collaborate with your classmates, as long as (a) you list the people with whom you discussed the assignment and (b) you give your own answers to the questions. Feel free to ask questions on the Vista bulletin board.

# Question 1

Consider the feature-based reinforcement learner at: http://artint.info/demos/rl/sGameFA.html.

1. What happens if the gradient descent step size is too large? (What values are too large?)

2. What happens if the gradient descent step size is too small? (What values are too small?)

3. What gradient descent step size gives the best performance? (Tell us how you measure performance, and how you found this step size).

#### Solution:

1. If the gradient descent step size is too large, the magnitude of the parameters keeps increasing and quickly overflows. 0.1 is too big.

2. If the gradient descent step size is too small, the convergence is very slow. 0.001 is too small.

3. To find the best step size, run the learner for a number of steps at each candidate value to get a reasonably accurate measure of whatever you want to optimise (accumulated reward, perhaps), and then do a binary search over step sizes.
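The qualitative behaviour in parts 1 and 2 can be seen even outside the applet. Here is a hypothetical sketch (not the applet's code) of plain gradient descent on the one-dimensional objective f(w) = w², whose gradient is 2w; the step size `alpha` plays the same role as the applet's step size:

```python
# Hypothetical sketch: the effect of the gradient descent step size alpha
# when minimizing f(w) = w^2, whose gradient is 2w.
def run_gd(alpha, steps=50, w0=10.0):
    w = w0
    for _ in range(steps):
        w -= alpha * 2 * w  # gradient descent update: w <- w - alpha * f'(w)
    return w

print(abs(run_gd(0.3)))    # moderate step size: converges quickly toward 0
print(abs(run_gd(1.5)))    # too large: |w| grows without bound (diverges)
print(abs(run_gd(0.001)))  # too small: barely moves from w0 after 50 steps
```

The divergence case mirrors what you see in the applet: when the step size is too large, each update overshoots and the magnitude of the parameters keeps increasing.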

# Question 2

Consider the feature-based reinforcement learner at: http://artint.info/demos/rl/sGameFA.html. In this question you are going to play with the set of features. You can change the set of features in SGameFeatureSet.java; to remove a feature just comment it out (and it will have value 0).

1. Give a minimal subset of the existing features for which the agent can eventually learn a policy with a positive average reward. Explain what each of the features does.

2. What happens when redundant features are added?

3. What happens if some features are multiplied by a constant?

4. What happens if some features are negative?

5. Does it perform better if the squares or the products of existing features are added as new features?

6. Give two (new or existing) features that, when added to your minimal set of features, together give the most improvement. Bonus marks will be given to the student(s) who find the features with the best performance.

#### Solution:

For this question you were to report what you observed.

1. A minimal set that I found to work includes:

• feature 0, which is the constant 1

• feature 1, which checks if there is a monster ahead

• feature 2, which checks if there is a wall ahead; this is useful when there is no prize. When there is a prize, heading towards the prize keeps the agent away from the wall.

• feature 3, which checks if there is a prize ahead.

If feature 0 is not used, the even features from 10 to 24, which record the x and y positions for the cases where there are monsters, are useful. Note that these do not affect the policy, as they don't depend on the actions; but they do let the agent estimate the Q-values better.

2. Redundant features can be useful. E.g., adding the (odd) features from 10 to 24, which check the distance from the other wall, is useful in that it increases the reward.

3. If some features are multiplied by a constant, the weights get bigger when the constant is less than 1. Sometimes, when the features are scaled up, it doesn't work; this may be because scaling a feature implicitly changes the effective step size. However, scaling up the feature for the reward seems to make it worse; I am not sure why.

4. If some features are negated, the agent seems to perform the same; the weights just get negated.

5. Yes, they seem to have some effect. To get marks for this, you needed to provide evidence for your claims.

6. This question asked you to be creative in finding features.
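The effect of scaling or negating a feature (points 3 and 4 above) can be seen in any linear learner trained by gradient descent, not just the applet. The following hypothetical sketch fits a single-feature linear predictor to the target y = 3x with a least-mean-squares update; scaling the feature by a constant c makes the learned weight shrink by 1/c (and implicitly changes the effective step size, since the update is proportional to the feature squared), while negating the feature just negates the weight:

```python
# Hypothetical sketch: least-mean-squares (LMS) on a single feature, showing
# that a feature scaled by c yields a weight scaled by 1/c, and a negated
# feature yields a negated weight.
def lms_fit(scale, alpha=0.001, steps=5000):
    w = 0.0
    xs = [0.5, 1.0, 1.5, 2.0]                  # raw inputs; target is y = 3*x
    for _ in range(steps):
        for x in xs:
            f = scale * x                      # the feature the learner sees
            w += alpha * (3 * x - w * f) * f   # LMS gradient update
    return w

print(lms_fit(1.0))    # weight near 3
print(lms_fit(10.0))   # weight near 0.3: scaled-up feature, scaled-down weight
print(lms_fit(-1.0))   # weight near -3: negated feature, negated weight
```

Note that with a much larger feature scale, the term `alpha * f * f` in the update can exceed the stable range, which is one way a rescaled feature can make a previously working step size diverge.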

# Question 3

Consider the effect of increasing the discount factor in reinforcement learning.

1. What is the effect on the values?

2. What is the effect on the optimal policy?

3. What is the effect on the rate at which the agent learns?

#### Solution:

1. As the discount factor increases (within the range [0,1)), the values increase.

2. As the discount factor increases (within the range [0,1)), the optimal policy gets more fixated on long-term rewards. E.g., in the sGame, the agent is more likely to visit the repair station as this has a long-term improvement, whereas for a small discount factor, the long-term advantage of being undamaged is not worth the cost.

3. As the discount factor increases (within the range [0,1)), the agent learns more slowly. It takes longer for the Q-values to converge.

Hint: it may be insightful to play with the applet for a grid domain at: http://artint.info/demos/mdp/vi.html.
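All three effects can be checked on a tiny MDP. The following hypothetical sketch (not the applet's domain) runs value iteration on a two-state MDP: in s0, "stay" earns reward 1 and remains in s0, while "go" earns 0 but moves to s1, where every step thereafter earns reward 2. With a small discount factor the agent prefers the immediate reward; with a large one it gives up short-term reward to reach the better state, and the values themselves are larger:

```python
# Hypothetical sketch: value iteration on a two-state MDP.
# s0: "stay" -> reward 1, stay in s0; "go" -> reward 0, move to s1.
# s1: single action -> reward 2, stay in s1.
def value_iteration(gamma, sweeps=2000):
    v0, v1 = 0.0, 0.0
    for _ in range(sweeps):
        q_stay = 1 + gamma * v0          # keep taking the small reward
        q_go = 0 + gamma * v1            # give it up to reach the better state
        v0, v1 = max(q_stay, q_go), 2 + gamma * v1
    policy = "stay" if 1 + gamma * v0 >= gamma * v1 else "go"
    return v0, policy

print(value_iteration(0.2))  # myopic: prefers the immediate reward ("stay")
print(value_iteration(0.9))  # far-sighted: sacrifices short-term gain ("go")
```

With gamma = 0.2 the value of s0 is 1/(1 - 0.2) = 1.25 under "stay"; with gamma = 0.9 it is 0.9 × 2/(1 - 0.9) = 18 under "go", illustrating both the larger values and the switch to the long-term policy. More sweeps are also needed for convergence as gamma grows, since the error contracts by a factor of gamma per sweep.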

# Question 4

How long did this assignment take? What did you learn? Was it reasonable?

It should have taken around 4 hours to play with the applets and record some statistics on the variants. You should have learned about the art of designing features, and how to evaluate and optimize algorithms that give different results on each run.