### 11.3.7 Assigning Credit and Blame to Paths

In Q-learning and SARSA, only the previous state-action pair has its value revised when a reward is received. Intuitively, when an agent takes a number of steps that lead to a reward, all of the steps along the way could be held responsible and so receive some of the credit or the blame for a reward. This section gives an algorithm that assigns the credit and blame for all of the steps that lead to a reward.

Example 11.13: Suppose there is an action right that visits the states s1, s2, s3, and s4 in this order and a reward is only given when the agent enters s4 from s3, and any action from s4 returns to state s1. There is also an action left that moves to the left except in state s4. In Q-learning and SARSA, after traversing right through the states s1, s2, s3, and s4 and receiving the reward, only the value of Q[s3,right] is updated. If the same sequence of states is visited again, the value of Q[s2,right] will be updated when it transitions into s3. The value of Q[s1,right] is only updated after the next transition from state s1 to s2. In this sense, we say that Q-learning does a one-step backup.
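The one-step credit propagation in this example can be simulated directly. The following is a minimal sketch, not the book's code: the values γ = 0.9 and α = 0.5 are assumed, and the dictionary `Q` stands for the values Q[si, right] along the chain.

```python
# Sketch of Example 11.13: a 4-state chain where only the transition
# from s3 into s4 is rewarded. With one-step Q-learning, each full
# traversal propagates credit back by one more state.
gamma, alpha = 0.9, 0.5          # assumed discount and step size
Q = {s: 0.0 for s in range(1, 4)}  # Q[s] stands for Q[s, right]

def episode():
    # take "right" in s1, s2, s3; reward 1.0 only on entering s4 (from s3)
    for s in (1, 2, 3):
        reward = 1.0 if s == 3 else 0.0
        next_value = 0.0 if s == 3 else Q[s + 1]  # s4's value taken as 0
        Q[s] += alpha * (reward + gamma * next_value - Q[s])

episode()
print(Q)  # only Q[3] has changed
episode()
print(Q)  # now Q[2] has changed as well, via the updated Q[3]
```

After the first traversal only `Q[3]` is nonzero; the second traversal pushes credit back to `Q[2]`, and `Q[1]` would only change on a third pass.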

Consider updating the value of Q[s3,right] based on the reward for entering state s4. From the perspective of state s4, the algorithm is doing a one-step backup. From the perspective of state s3, it is doing a one-step look-ahead. To allow the blame to be associated with more than the previous step, the reward from entering state s4 could be used in a two-step backup to update Q[s2,right] or, equivalently, in a two-step look-ahead from s2, updating s2's value when the reward from entering s4 is received. We will describe the algorithm in terms of a look-ahead but implement it using a backup.

With a two-step look-ahead, suppose the agent is in state $s_t$, does action $a_t$, ends up in state $s_{t+1}$, and receives reward $r_{t+1}$; it then does action $a_{t+1}$, resulting in state $s_{t+2}$ and reward $r_{t+2}$. A two-step look-ahead at time $t$ gives the return $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})$, thus giving the TD error

$$\delta_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2}) - Q[s_t,a_t],$$

where $V(s_{t+2})$ is an estimate of the value of $s_{t+2}$. The two-step update is

$$Q[s_t,a_t] \leftarrow Q[s_t,a_t] + \alpha\,\delta_t.$$
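The two-step update can be sketched in a few lines of Python. This is an illustrative sketch, assuming a tabular `Q` stored as a dict keyed by (state, action) pairs, with $V(s_{t+2})$ estimated by the Q-value of the action actually taken there.

```python
# Hedged sketch of the two-step SARSA-style update (assumed dict-based Q).
def two_step_update(Q, s_t, a_t, r1, r2, s2, a2, gamma=0.9, alpha=0.1):
    """Update Q[s_t, a_t] from a two-step look-ahead:
    delta = r_{t+1} + gamma*r_{t+2} + gamma^2 * Q[s_{t+2}, a_{t+2}] - Q[s_t, a_t]."""
    target = r1 + gamma * r2 + gamma**2 * Q[(s2, a2)]
    delta = target - Q[(s_t, a_t)]
    Q[(s_t, a_t)] += alpha * delta
    return delta

# Usage with hypothetical states: the look-ahead bootstraps from Q[s3, right].
Q = {("s1", "right"): 0.0, ("s3", "right"): 1.0}
delta = two_step_update(Q, "s1", "right", 0.0, 0.0, "s3", "right")
```

With zero intermediate rewards, the whole correction comes from the bootstrapped term $\gamma^2 Q[s_{t+2}, a_{t+2}]$.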

Unfortunately, this is not a good estimate of the optimal Q-value, $Q^*$, because action $a_{t+1}$ may not be an optimal action. For example, if $a_{t+1}$ was the action that takes the agent into a position with a reward of $-10$, and better actions were available, the agent should not use it to update $Q[s_t,a_t]$. However, this multiple-step backup provides an improved estimate of the policy that the agent is actually following: if the agent is following policy $\pi$, the backup gives an improved estimate of $Q^\pi$. Thus a multiple-step backup can be used in an on-policy method such as SARSA.

Suppose the agent is in state $s_t$ and performs action $a_t$, resulting in reward $r_{t+1}$ and state $s_{t+1}$. It then does action $a_{t+1}$, resulting in reward $r_{t+2}$ and state $s_{t+2}$, and so forth. An n-step return at time $t$, where $n \ge 1$, written $R_t^{(n)}$, is a data point for the estimated future value of the action at time $t$, obtained by looking $n$ steps ahead:

$$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}).$$

This could be used to update $Q[s_t,a_t]$ using the TD error $R_t^{(n)} - Q[s_t,a_t]$. However, it is difficult to know which $n$ to use. Instead of selecting one particular $n$ and looking forward $n$ steps, it is possible to average a number of n-step returns. One way to do this is to take a weighted average of all n-step returns, in which returns farther in the future are exponentially decayed, with a decay of $\lambda$. This is the intuition behind the method called SARSA(λ): when a reward is received, the values of all of the visited states are updated, and those states farther in the past receive less of the credit or blame for the reward.
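The n-step return and its λ-weighted average can be computed directly for a recorded finite trajectory. The sketch below is an assumption-laden illustration: `rewards[k]` stands for $r_{t+k+1}$, `values[k]` for $V(s_{t+k})$, and the infinite sum is truncated at the end of the trajectory, with the final n-step return given all of the remaining weight (a standard implementation choice, not from the text).

```python
# Sketch: n-step returns and their exponentially decayed average.
def n_step_return(rewards, values, n, gamma):
    """R_t^(n) = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n})."""
    R = sum(gamma**k * rewards[k] for k in range(n))
    return R + gamma**n * values[n]

def lambda_return(rewards, values, gamma, lam):
    """Finite-horizon version of R_t^lambda = (1-lam) * sum_n lam^(n-1) R_t^(n).

    The sum is truncated at the end of the recorded trajectory; the final
    n-step return receives the whole remaining weight lam^(N-1)."""
    N = len(rewards)
    total = (1 - lam) * sum(lam**(n - 1) * n_step_return(rewards, values, n, gamma)
                            for n in range(1, N))
    return total + lam**(N - 1) * n_step_return(rewards, values, N, gamma)
```

Setting `lam=0` recovers the one-step return $r_{t+1} + \gamma V(s_{t+1})$, while `lam=1` gives the full return to the end of the trajectory.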

Let

$$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)},$$

where $(1-\lambda)$ is a normalizing constant that makes the weights sum to 1, so that $R_t^{\lambda}$ is a weighted average. The following table gives the details of the sum:

| look-ahead | weight | return |
|---|---|---|
| 1 step | $1-\lambda$ | $r_{t+1} + \gamma V(s_{t+1})$ |
| 2 step | $(1-\lambda)\lambda$ | $r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})$ |
| 3 step | $(1-\lambda)\lambda^2$ | $r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V(s_{t+3})$ |
| 4 step | $(1-\lambda)\lambda^3$ | $r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \gamma^4 V(s_{t+4})$ |
| $\cdots$ | $\cdots$ | $\cdots$ |
| $n$ step | $(1-\lambda)\lambda^{n-1}$ | $r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$ |
| $\cdots$ | $\cdots$ | $\cdots$ |
| total | 1 | |
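As a quick numeric sanity check (an illustration, not from the text, with λ = 0.8 assumed), the weights in the table form a geometric series whose truncated sum approaches 1, which is why $R_t^{\lambda}$ is a genuine weighted average:

```python
# The weights (1 - lam) * lam**(n - 1) sum to 1 - lam**N over n = 1..N,
# so the total weight tends to 1 as the horizon N grows.
lam = 0.8
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 200)]
print(sum(weights))  # very close to 1.0
```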

Collecting together the common rt+i terms gives

$$\begin{aligned}
R_t^{\lambda} = {} & r_{t+1} + \gamma V(s_{t+1}) - \lambda\gamma V(s_{t+1}) \\
& + \lambda\gamma r_{t+2} + \lambda\gamma^2 V(s_{t+2}) - \lambda^2\gamma^2 V(s_{t+2}) \\
& + \lambda^2\gamma^2 r_{t+3} + \lambda^2\gamma^3 V(s_{t+3}) - \lambda^3\gamma^3 V(s_{t+3}) \\
& + \lambda^3\gamma^3 r_{t+4} + \lambda^3\gamma^4 V(s_{t+4}) - \lambda^4\gamma^4 V(s_{t+4}) + \cdots.
\end{aligned}$$

This will be used in a version of SARSA in which the future estimate $V(s_{t+i})$ is the value of $Q[s_{t+i},a_{t+i}]$. The TD error, the return minus the current estimate, is

$$\begin{aligned}
R_t^{\lambda} - Q[s_t,a_t] = {} & r_{t+1} + \gamma Q[s_{t+1},a_{t+1}] - Q[s_t,a_t] \\
& + \lambda\gamma \left( r_{t+2} + \gamma Q[s_{t+2},a_{t+2}] - Q[s_{t+1},a_{t+1}] \right) \\
& + \lambda^2\gamma^2 \left( r_{t+3} + \gamma Q[s_{t+3},a_{t+3}] - Q[s_{t+2},a_{t+2}] \right) \\
& + \lambda^3\gamma^3 \left( r_{t+4} + \gamma Q[s_{t+4},a_{t+4}] - Q[s_{t+3},a_{t+3}] \right) + \cdots.
\end{aligned}$$
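This identity, that the λ-return error is a sum of one-step TD errors decayed by $(\lambda\gamma)^k$, can be checked numerically. The sketch below uses made-up rewards and Q estimates for a finite trajectory whose terminal value is taken to be 0, with the final n-step return given the remaining weight $\lambda^{N-1}$; none of these numbers are from the text.

```python
# Numeric check that the forward view (lambda-return) matches the
# backward view (decayed sum of one-step TD errors) on a short trajectory.
gamma, lam = 0.9, 0.7
r = [1.0, 0.0, 2.0, 0.5]       # r[k] stands for r_{t+k+1}
q = [0.3, 1.1, 0.4, 0.8, 0.0]  # q[k] stands for Q[s_{t+k}, a_{t+k}]; terminal 0

def n_step(n):
    """R_t^(n) with V(s_{t+n}) estimated by q[n]."""
    return sum(gamma**k * r[k] for k in range(n)) + gamma**n * q[n]

N = len(r)
# forward view: weighted average of n-step returns, truncated at the end
forward = ((1 - lam) * sum(lam**(n - 1) * n_step(n) for n in range(1, N))
           + lam**(N - 1) * n_step(N))

# backward view: current estimate plus decayed one-step TD errors
backward = q[0] + sum((lam * gamma)**k * (r[k] + gamma * q[k + 1] - q[k])
                      for k in range(N))
```

The two quantities agree exactly, which is what lets SARSA(λ) apply each one-step TD error to all earlier state-action pairs instead of waiting for future returns.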

Instead of waiting until the end of the sum, which may never occur, SARSA(λ) updates the value of $Q[s_t,a_t]$ at every future time step. When the agent receives reward $r_{t+i}$, it can use the appropriate term in the preceding equation to update $Q[s_t,a_t]$. Each term applies to all earlier times, so, for example, the one-step TD error $r_{t+3} + \gamma Q[s_{t+3},a_{t+3}] - Q[s_{t+2},a_{t+2}]$ can be used to update all previously visited state-action pairs. An agent can do this by keeping an eligibility trace that specifies how much a state-action pair should be updated at each time step. When a state-action pair is first visited, its eligibility is set to 1. At each subsequent time step its eligibility is reduced by a factor of $\lambda\gamma$; when the state-action pair is visited again, 1 is added to its eligibility.

The eligibility trace is implemented by an array e[S,A], where S is the set of all states and A is the set of all actions. After every action is carried out, the Q-value for every state-action pair is updated.

```
controller SARSA(λ, S, A, γ, α)
inputs:
    S is a set of states
    A is a set of actions
    γ is the discount
    α is the step size
    λ is the decay rate
internal state:
    real array Q[S, A]
    real array e[S, A]
    previous state s
    previous action a
begin
    initialize Q[S, A] arbitrarily
    initialize e[s, a] = 0 for all s, a
    observe current state s
    select action a using a policy based on Q
    repeat forever:
        carry out action a
        observe reward r and state s'
        select action a' using a policy based on Q
        δ ← r + γQ[s', a'] − Q[s, a]
        e[s, a] ← e[s, a] + 1
        for all s'', a'':
            Q[s'', a''] ← Q[s'', a''] + αδe[s'', a'']
            e[s'', a''] ← γλe[s'', a'']
        s ← s'
        a ← a'
    end-repeat
end
```
Figure 11.14: SARSA(λ)

The algorithm, known as SARSA(λ), is given in Figure 11.14.
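Figure 11.14 can be translated into executable form. The following is a sketch, not the book's code: the chain environment (standing in for Example 11.13, with states 0..3 playing the roles of s1..s4), the ε-greedy policy, and all numeric parameters are illustrative assumptions.

```python
import random

STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]

def step(s, a):
    """Assumed environment: reward 1.0 only for entering state 3 from
    state 2 by going right; any action in state 3 returns to state 0."""
    if s == 3:
        return 0.0, 0
    if a == "right":
        return (1.0, 3) if s == 2 else (0.0, s + 1)
    return 0.0, max(s - 1, 0)

def sarsa_lambda(gamma=0.9, alpha=0.1, lam=0.8, epsilon=0.3,
                 steps=20000, seed=1):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    e = {(s, a): 0.0 for s in STATES for a in ACTIONS}

    def policy(s):
        # epsilon-greedy policy based on Q
        if rng.random() < epsilon:
            return rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(s, a)])

    s = 0
    a = policy(s)
    for _ in range(steps):
        r, s2 = step(s, a)                # carry out action a
        a2 = policy(s2)                   # select a' using a policy based on Q
        delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]
        e[(s, a)] += 1.0                  # bump eligibility of (s, a)
        for sa in Q:                      # update every state-action pair
            Q[sa] += alpha * delta * e[sa]
            e[sa] *= gamma * lam          # decay every trace
        s, a = s2, a2
    return Q

Q = sarsa_lambda()
```

After training, going right should be valued more highly than going left in the states along the rewarded path, since each received reward is spread back along the eligibility trace.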

Although this algorithm specifies that Q[s,a] is updated for every state s and action a whenever a new reward is received, it may be much more efficient, and only slightly less accurate, to update only those values whose eligibility is above some threshold.
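One way to realize this optimization is to store only the non-negligible traces and drop entries as they decay below a cutoff. This is a sketch under assumed details: the dict-based trace and the threshold value are not from the figure.

```python
# Hedged sketch of the thresholding idea: the inner sweep touches only
# pairs with a live trace, not all of S x A.
def traced_update(Q, e, delta, gamma, lam, alpha, threshold=1e-4):
    """Apply one SARSA(lambda) sweep over the pairs with a live trace.

    e maps state-action pairs to eligibilities; entries whose trace
    decays below `threshold` are removed."""
    for sa in list(e):
        Q[sa] += alpha * delta * e[sa]
        e[sa] *= gamma * lam
        if e[sa] < threshold:
            del e[sa]           # drop negligible traces

# Usage with hypothetical pairs: the tiny trace is discarded after decay.
Q = {("s1", "right"): 0.0, ("s2", "right"): 0.0}
e = {("s1", "right"): 1.0, ("s2", "right"): 2e-4}
traced_update(Q, e, delta=1.0, gamma=0.9, lam=0.5, alpha=0.1)
```

The cost of each sweep then grows with the number of recently visited state-action pairs rather than with the size of the whole Q table.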