This applet shows how value iteration works for a simple 10x10 grid world. The number in the bottom left of each square shows the value of that grid point. The blue arrows show the optimal action based on the current value function (when the symbol looks like a star, all actions are optimal). To start, press "step".
In this example, there are three absorbing states: one worth +10, one worth -5, and one worth -10. (You can see these when you press "step" once, or when you press "reset".)
There are four actions available: up, down, left, and right. Carrying out one of these actions in a non-absorbing state has a 0.7 chance of moving one step in the desired direction and a 0.1 chance of moving one step in each of the other three directions. If the agent bumps into the outside wall (i.e., the square computed as above is outside the grid), there is a penalty of 1 (i.e., a reward of -1) and the agent doesn't actually move.
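The dynamics just described can be sketched as a small value-iteration loop. This is an illustrative reconstruction, not the applet's actual code: the positions chosen for the three absorbing states are assumptions (the applet's real placements may differ), and absorbing states are modelled simply by pinning their value to their reward.

```python
import numpy as np

N = 10          # the grid is 10x10
GAMMA = 0.9     # the applet's initial discount rate
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

# Absorbing states and their rewards. These positions are illustrative
# placeholders, not the applet's actual placements.
ABSORBING = {(2, 7): +10.0, (7, 2): -5.0, (5, 5): -10.0}

def backup(V, r, c):
    """One Bellman backup at a non-absorbing cell: take the max over the
    four actions, where each action goes the intended way with probability
    0.7 and each of the other three ways with probability 0.1. Bumping the
    outside wall gives reward -1 and leaves the agent where it is."""
    best = -np.inf
    for intended in MOVES:
        q = 0.0
        for actual, (dr, dc) in MOVES.items():
            p = 0.7 if actual == intended else 0.1
            nr, nc = r + dr, c + dc
            if 0 <= nr < N and 0 <= nc < N:
                q += p * GAMMA * V[nr, nc]
            else:
                # Bumped the outside wall: penalty of 1, no movement.
                q += p * (-1.0 + GAMMA * V[r, c])
        best = max(best, q)
    return best

V = np.zeros((N, N))
for _ in range(50):                     # each sweep corresponds to "step"
    V_new = V.copy()
    for r in range(N):
        for c in range(N):
            if (r, c) in ABSORBING:
                V_new[r, c] = ABSORBING[(r, c)]
            else:
                V_new[r, c] = backup(V, r, c)
    V = V_new
```

Changing `GAMMA` to 0.7, 0.4, or 1.0 mimics the "Increment Discount" / "Decrement Discount" experiments described below: a lower discount makes the agent more short-sighted, so distant rewards influence the greedy policy less.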
The initial discount rate is 0.9. (This is the number displayed at the top right of the grid.) It is interesting to try value iteration at different discount rates, using the "Increment Discount" and "Decrement Discount" buttons. Try 0.7, 0.4, and 1.0. Look, in particular, at the policy for the point 3-across and 3-down, and at the point 2-across and 6-down. Can you explain why the direction is the way it is for the optimal policy (and why it changes as the value function is built)?
The commands "Brighter" and "Dimmer" change the contrast (the mapping between non-extreme values and colour). "Grow" and "Shrink" change the size of the grid.