12 Learning to Act

12.3 Temporal Differences

To understand how reinforcement learning works, consider how to average values that arrive to an agent sequentially.

Suppose there is a sequence of numerical values, $v_1, v_2, v_3, \dots$, and the goal is to predict the next value, given the previous values. One way to do this is to have a running approximation of the expected value of $v_i$. For example, given a sequence of students' grades and the aim of predicting the next grade, a reasonable prediction may be to predict the average grade. This can be implemented by maintaining a running average, as follows:

Let $A_k$ be an estimate of the expected value based on the first $k$ data points $v_1, \dots, v_k$. A reasonable estimate is the sample average:

$$A_k = \frac{v_1 + \cdots + v_k}{k}.$$

Thus,

$$\begin{aligned}
k A_k &= v_1 + \cdots + v_{k-1} + v_k \\
      &= (k-1) A_{k-1} + v_k.
\end{aligned}$$

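Dividing by $k$ gives

$$A_k = \left(1 - \frac{1}{k}\right) A_{k-1} + \frac{v_k}{k}.$$

Let $\alpha_k = \frac{1}{k}$; then

$$\begin{aligned}
A_k &= (1 - \alpha_k) A_{k-1} + \alpha_k v_k \\
    &= A_{k-1} + \alpha_k (v_k - A_{k-1}).
\end{aligned} \tag{12.1}$$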

The difference, $v_k - A_{k-1}$, is called the temporal difference error or TD error; it specifies how different the new value, $v_k$, is from the old prediction, $A_{k-1}$. The old estimate, $A_{k-1}$, is updated by $\alpha_k$ times the TD error to get the new estimate, $A_k$. The qualitative interpretation of the temporal difference formula is that if the new value is higher than the old prediction, increase the predicted value; if the new value is less than the old prediction, decrease the predicted value. The change is proportional to the difference between the new value and the old prediction. Note that this equation is still valid for the first value, $k=1$: because $\alpha_1 = 1$, it gives $A_1 = v_1$ whatever the value of $A_0$.
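The update with $\alpha_k = 1/k$ can be sketched in a few lines of Python. This is a minimal illustration, not code from the text: the function name `running_average` and the grade values are made up for the example.

```python
def running_average(values):
    """Return the estimates A_1, ..., A_n from the TD update with alpha_k = 1/k."""
    estimate = 0.0  # A_0; its value is irrelevant because alpha_1 = 1
    estimates = []
    for k, v in enumerate(values, start=1):
        alpha = 1 / k
        td_error = v - estimate          # new value minus old prediction
        estimate = estimate + alpha * td_error
        estimates.append(estimate)
    return estimates


grades = [70, 80, 90, 60]  # hypothetical grades
for k, a in enumerate(running_average(grades), start=1):
    print(f"A_{k} = {a:.2f}")
```

Each estimate equals the sample average of the grades seen so far, but the loop only stores the previous estimate and the count, never the whole history.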

This analysis assumes that all of the values have an equal weight. However, suppose you are keeping an estimate of the expected price of some item in the grocery store. Prices go up and down in the short term, but tend to increase slowly; the newer prices are more useful for the estimate of the current price than older prices, and so they should be weighted more in predicting new prices.

In reinforcement learning, the values are estimates of the effects of actions; more recent values are more accurate than earlier values because the agent is learning, and so they should be weighted more. One way to weight later examples more is to use Equation 12.1, but with $\alpha$ as a constant ($0 < \alpha \le 1$) that does not depend on $k$. Unfortunately, this does not converge to the average value when there is variability in the values in the sequence, but it can track changes when the underlying process generating the values changes.
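Both effects of a constant step size can be seen in a small experiment. The following is a sketch under assumed data: the helper `td_final_estimate`, the two sequences, and the step sizes 0.1 and 0.5 are all chosen for illustration only.

```python
def td_final_estimate(values, step_size, initial=0.0):
    """Apply A_k = A_{k-1} + alpha_k * (v_k - A_{k-1}) and return the last estimate.

    step_size is a function mapping k (starting at 1) to alpha_k.
    """
    estimate = initial
    for k, v in enumerate(values, start=1):
        estimate += step_size(k) * (v - estimate)
    return estimate


# Hypothetical changing process: the value shifts from 10 to 20 halfway through.
shifted = [10.0] * 50 + [20.0] * 50
print("shifted, alpha_k = 1/k:", td_final_estimate(shifted, lambda k: 1 / k))
print("shifted, alpha = 0.1:  ", td_final_estimate(shifted, lambda k: 0.1))

# Hypothetical variable process: values alternate between 0 and 2 (mean 1).
alternating = [0.0, 2.0] * 500
print("alternating, alpha_k = 1/k:", td_final_estimate(alternating, lambda k: 1 / k))
print("alternating, alpha = 0.5:  ", td_final_estimate(alternating, lambda k: 0.5))
```

On the shifted sequence, the sample average ends at 15, halfway between the old and new values, while the constant step size ends close to 20. On the alternating sequence, the sample average settles at the mean of 1, while the constant step size keeps swinging between roughly 2/3 and 4/3 and never settles.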

You could reduce α more slowly and potentially have the benefits of both approaches: weighting recent observations more and still converging to the average. You can guarantee convergence if

$$\sum_{k=1}^{\infty} \alpha_k = \infty \quad\text{and}\quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty.$$

The first condition is to ensure that random fluctuations and initial conditions get averaged out, and the second condition guarantees convergence.
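As a quick check, the two step-size choices discussed so far can be tested against these conditions. The sample-average step size satisfies both, whereas a constant step size satisfies only the first, which is consistent with its failure to converge:

```latex
\begin{align*}
\alpha_k = \tfrac{1}{k}: &\quad
  \sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad
  \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty \\
\alpha_k = \alpha \ \text{(constant, } \alpha > 0\text{)}: &\quad
  \sum_{k=1}^{\infty} \alpha = \infty, \qquad
  \sum_{k=1}^{\infty} \alpha^2 = \infty
\end{align*}
```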

One way to give more weight to more recent experiences, but also converge to the average, is to set $\alpha_k = (r+1)/(r+k)$ for some $r > 0$. For the first experience $\alpha_1 = 1$, so it ignores the prior $A_0$. If $r = 9$, after 11 experiences, $\alpha_{11} = 0.5$, so it weights that experience as equal to all of its prior experiences. The parameter $r$ should be set to be appropriate for the domain.
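This schedule is easy to try out. The sketch below is illustrative only: the names `step_size` and `weighted_estimate` and the test sequence are assumptions, not part of the text.

```python
def step_size(k, r):
    """alpha_k = (r + 1) / (r + k) for the k-th value (k starts at 1), with r > 0."""
    return (r + 1) / (r + k)


def weighted_estimate(values, r):
    """Run the TD update with the step size above and return the final estimate."""
    estimate = 0.0  # A_0 is ignored because alpha_1 = 1
    for k, v in enumerate(values, start=1):
        estimate += step_size(k, r) * (v - estimate)
    return estimate


print(step_size(1, 9))    # first experience
print(step_size(11, 9))   # eleventh experience with r = 9

# Hypothetical sequence whose value shifts from 10 to 20 halfway through.
shifted = [10.0] * 50 + [20.0] * 50
print(weighted_estimate(shifted, r=9))
```

With $r = 9$ the two step sizes printed are 1 and 0.5, matching the values above. On the shifted sequence the final estimate lies between the plain sample average (15) and the most recent value (20), because later values receive larger weights.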

Note that guaranteeing convergence to the average is not compatible with being able to adapt to make better predictions when the underlying process generating the values keeps changing.

For the rest of this chapter, α without a subscript is assumed to be a constant. With a subscript, it is a function of the number of cases used for the particular estimate.