Due: 10:00am, Monday 6 February 2012.
Consider the model-based reinforcement learner at: http://artint.info/demos/rl/sGameModel.html, and the controller in the dostep method of SGameModelController.java.
The current applet uses the empirical proportion as the probability. Does adding pseudo-counts improve the performance? Can you explain why or why not?
Suppose we wanted to measure the performance in terms of the number of asynchronous value iteration updates. Does doing more than one update per step improve the performance of the learner in terms of the number of updates? (The current applet records the number of actions taken, not the number of updates. You do not need to change the code to answer this question, but you need to report in terms of the number of updates, not the number of steps.)
To make it use pseudo-counts, we change line 139 to:
qs1a1 += (transitions[s1][a1][s2] + 1.0)  // Laplace smoothing: a pseudo-count of 1 for each possible next state
         * (rewards[s1][a1] + discount * value(s2))
         / (STATE_SIZE + (double) count_actions_in_state[s1][a1]);
We can then remove the surrounding conditional. This performs much worse because the prior does not reflect how the world works: not every transition is possible, and it takes a long time for the evidence to overcome this prior.
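For concreteness, here is a minimal Python sketch of the same smoothed update; the names transitions, rewards, counts, value, discount, and state_size mirror the Java code (counts standing in for count_actions_in_state), and the surrounding learner is assumed rather than shown:

def q_update(s1, a1, transitions, rewards, counts, value, discount, state_size):
    """One asynchronous value-iteration update of Q(s1, a1), using
    Laplace-smoothed transition probabilities (a pseudo-count of 1
    for each of the state_size possible next states)."""
    q = 0.0
    for s2 in range(state_size):
        # (observed count + 1) / (total count + state_size): every
        # transition gets probability > 0, even if never observed.
        prob = (transitions[s1][a1][s2] + 1.0) / (counts[s1][a1] + state_size)
        q += prob * (rewards[s1][a1] + discount * value(s2))
    return q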
To check this, I compared 100,000 steps at 1 update per step with 50,000 steps at 2 updates per step and 20,000 steps at 5 updates per step (each configuration does the same total number of updates, namely 100,000).
The 100,000 steps at 1 update per step did much better, with total rewards of approximately 9,000 to 12,000 (taking the 10th and 90th percentiles of the rewards).
The 50,000 steps at 2 updates per step did much worse, with rewards of 2,600 to 3,900.
The 20,000 steps at 5 updates per step did worse again (and was much more variable), with rewards of -2,000 to 200.
So if experiences are cheap, it is better to spend our time collecting experiences than doing extra Q-value updates.
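In outline, the comparison looks like the following sketch, where do_step and do_update are hypothetical hooks standing in for the applet's action and update code (the numbers above came from running the applet itself):

def run(learner, env, steps, updates_per_step):
    """Take the given number of steps, doing updates_per_step asynchronous
    value-iteration updates after each action; return the total reward."""
    total_reward = 0.0
    for _ in range(steps):
        total_reward += learner.do_step(env)  # act, observe, update the model counts
        for _ in range(updates_per_step):
            learner.do_update()               # one asynchronous value-iteration update
    return total_reward

# Each configuration does 100,000 updates in total:
# run(learner, env, 100000, 1); run(learner, env, 50000, 2); run(learner, env, 20000, 5)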
To determine the best prediction for a single Boolean random variable on test data, assume that the data cases are generated stochastically according to some true parameter p. Try the following a number of times (say 1,000):
Choose a number p in [0, 1] uniformly at random. This is going to be the “ground truth”.
Generate n training examples (try various values for n, some small, say 5, 10, 20, and some large, say 1,000) by sampling true with probability p.
From the training set, compute n0, the number of false instances in the training set, and n1, the number of true instances.
Generate a test set that contains many (say 1,000) test cases using the same parameter p.
For each of the optimality criteria – sum of absolute values, sum of squares, and likelihood (or entropy) – which of the following gives the lowest error on the test set:
the mode (predict 1 if n1 > n0, else predict 0)
the empirical average n1 / (n0 + n1)
if n1 = 0, use 0.001; if n0 = 0, use 0.999; otherwise use n1 / (n0 + n1). (Try this for different numbers when the counts are zero.)
(n1 + c) / (n0 + n1 + 2c) for different values of the pseudo-count c
another predictor that is a function of n0 and n1.
The Python file http://cs.ubc.ca/~poole/cs422/2012/as4/testPredictionSol.py produces the data needed. This gives the sum of errors for 1,000,000 test cases (1,000 test cases for each of 1,000 different values of p).
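For reference, here is a minimal sketch of this kind of experiment; the predictors and error measures follow the question above, but the exact code in testPredictionSol.py may differ:

import math
import random

def errors(predict, n_train=10, n_test=1000, repeats=1000):
    """Sum each error measure over repeats runs of n_test test cases,
    where predict(n0, n1) maps training counts to a probability of true."""
    abs_err = sq_err = log_loss = 0.0
    for _ in range(repeats):
        p = random.random()                    # the "ground truth" parameter
        n1 = sum(random.random() < p for _ in range(n_train))
        n0 = n_train - n1
        q = predict(n0, n1)
        for _ in range(n_test):
            t = 1.0 if random.random() < p else 0.0
            abs_err += abs(t - q)
            sq_err += (t - q) ** 2
            if q in (0.0, 1.0) and t != q:
                log_loss = float("inf")        # a 0 or 1 prediction can be infinitely wrong
            else:
                log_loss += -math.log(q if t == 1.0 else 1.0 - q)
    return abs_err, sq_err, log_loss

mode = lambda n0, n1: 1.0 if n1 > n0 else 0.0
empirical = lambda n0, n1: n1 / (n0 + n1)
laplace = lambda n0, n1: (n1 + 1.0) / (n0 + n1 + 2.0)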
For the sum of absolute values, the mode is best whenever n1 ≠ n0. When n0 = 0 or n1 = 0, the empirical average is the same as the mode. Any pseudo-count makes the prediction worse.
For the sum-of-squares error, Laplace smoothing (a pseudo-count of 1) works best, although the pseudo-count doesn't make much difference for more than 20 training examples (it only changes the third decimal place). Just avoiding the 0 and 1 predictions is not as good as using pseudo-counts.
For the entropy, Laplace smoothing (a pseudo-count of 1) works best. The mode always has infinite error, as a single test case that disagrees with a prediction of 0 or 1 makes the log loss infinite, and the empirical average has infinite error for small training set sizes, where n0 = 0 or n1 = 0 is common. Just avoiding the 0 and 1 predictions avoids the infinite error, but isn't as good as adding pseudo-counts.
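A quick numerical check of why the winners differ (this is not part of the solution file): for a known p, the expected absolute error of predicting q is p(1-q) + (1-p)q, which is minimized at q = 0 or q = 1 (the mode), whereas the expected squared error p(1-q)^2 + (1-p)q^2 is minimized at q = p, as is the expected log loss.

p = 0.7
qs = [i / 100.0 for i in range(101)]  # candidate predictions 0.00 .. 1.00
best_abs = min(qs, key=lambda q: p * (1 - q) + (1 - p) * q)           # -> 1.0 (the mode)
best_sq = min(qs, key=lambda q: p * (1 - q) ** 2 + (1 - p) * q ** 2)  # -> 0.7 (= p)

Pseudo-counts pull the prediction away from 0 and 1 (towards 0.5), which is why they help for squared error and entropy but hurt for absolute error.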
How long did this assignment take? What did you learn? Was it reasonable?
This should have taken a couple of hours. You should have learned about the role of training and test sets, and the role of pseudo-counts.