foundations of computational agents
Q-learning is an off-policy learner. An off-policy learner learns the value of an optimal policy independently of the agent’s actions, as long as it explores enough. An off-policy learner can learn the optimal policy even if it is acting randomly. An off-policy learner that is exploring does not learn the value of the policy it is following, because that policy includes exploration steps.
There may be cases, such as where there are large negative rewards, where ignoring what the agent actually does is dangerous. An alternative is to learn the value of the policy the agent is actually carrying out, which includes exploration steps, so that that policy can be iteratively improved. The learner can thus take into account the costs associated with exploration. An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps.
SARSA (so called because it uses state–action–reward–state–action experiences to update the $Q$-values) is an on-policy reinforcement learning algorithm that estimates the value of the policy being followed. An experience in SARSA is of the form $\langle s, a, r, s', a' \rangle$, which means that the agent was in state $s$, did action $a$, received reward $r$, and ended up in state $s'$, from which it decided to do action $a'$. This provides a new experience to update $Q[s, a]$. The new value that this experience provides is $r + \gamma Q[s', a']$.
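As a rough illustration of the update described above, the following Python sketch applies one SARSA update to a table of Q-values and, for contrast, the corresponding Q-learning update. The step size `alpha`, the discount `gamma`, the dictionary layout, and the state and action names are assumptions made for this example; they are not taken from Figure 13.5.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Move Q[s, a] toward r + gamma * Q[s_next, a_next] (on-policy target)."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Move Q[s, a] toward r + gamma * max_b Q[s_next, b] (off-policy target)."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Hypothetical Q-values for the successor state s1.
actions = ["left", "right"]
Q_sarsa = defaultdict(float, {("s1", "left"): -10.0, ("s1", "right"): 5.0})
Q_qlearn = defaultdict(float, Q_sarsa)

# The agent was in s0, did "right", received reward 1, reached s1,
# and (as an exploration step) decided to do "left" there.
sarsa_update(Q_sarsa, "s0", "right", 1.0, "s1", "left")
q_learning_update(Q_qlearn, "s0", "right", 1.0, "s1", actions)
```

In this made-up experience, the exploratory choice of `left` in `s1` pulls the SARSA estimate for `("s0", "right")` down, whereas the Q-learning estimate uses the best action in `s1` and ignores what the agent actually decided to do.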
Figure 13.5 gives the SARSA algorithm. The Q-values that SARSA computes depend on the current exploration policy which, for example, may be greedy with random steps. SARSA can find a different policy than Q-learning in situations where exploring may incur large penalties. For example, going near the top of a flight of stairs may be part of an optimal policy for a robot, but it is dangerous if the robot then takes an exploration step. SARSA will discover this and adopt a policy that keeps the robot away from the stairs. It will find a policy that is optimal, taking into account the exploration inherent in the policy.
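The following is a minimal sketch of a SARSA learner of this kind; it is not a transcription of Figure 13.5. It assumes an episodic toy corridor environment (`step`), a "greedy with random steps" selection rule (`select_action`), and fixed values for `alpha`, `gamma`, and `epsilon`. All of these names and values are hypothetical and were chosen only to keep the example self-contained.

```python
import random

ACTIONS = ["left", "right"]
GOAL = 3  # hypothetical terminal state at the right end of a corridor 0..3


def step(state, action):
    """Toy deterministic environment (an assumption, not from the text)."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    reward = 10.0 if nxt == GOAL else -1.0
    return reward, nxt


def select_action(Q, state, epsilon, rng):
    """Greedy with random steps: explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def sarsa(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = select_action(Q, s, epsilon, rng)
        while s != GOAL:
            r, s_next = step(s, a)
            a_next = select_action(Q, s_next, epsilon, rng)
            # On-policy target: uses the action the agent will actually do next.
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q


Q = sarsa()
# Greedy action in each non-terminal state after learning.
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

Because the next action `a_next` is chosen by the same exploring rule that the agent acts with, the learned values reflect the cost of the random steps, which is the sense in which the Q-values depend on the exploration policy.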
In Example 13.1, the optimal policy is to go up from state $s_0$ in Figure 13.1. However, if the agent is exploring, this action may be bad because exploring from state $s_2$ is very dangerous.
If the agent is carrying out the policy that includes exploration, “when in state $s$, 80% of the time select the action $a$ that maximizes $Q[s, a]$, and 20% of the time select an action at random,” going up from $s_0$ is not optimal. An on-policy learner will try to optimize the policy the agent is following, not the optimal policy that does not include exploration.
The $Q$-values of the optimal policy are less in SARSA than in Q-learning. The values for Q-learning and for SARSA (the exploration rate in parentheses) for the domain of Example 13.1, for a few state–action pairs, are:
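| Algorithm | $Q[s_0, \textit{right}]$ | $Q[s_0, \textit{up}]$ | $Q[s_2, \textit{upC}]$ | $Q[s_2, \textit{up}]$ | $Q[s_4, \textit{left}]$ |
|---|---|---|---|---|---|
| Q-learning | 19.48 | 23.28 | 26.86 | 16.9 | 30.95 |
| SARSA (20%) | 9.27 | 7.90 | 14.80 | 4.43 | 18.09 |
| SARSA (10%) | 13.04 | 13.95 | 18.90 | 8.93 | 22.47 |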
The optimal policy using SARSA with 20% exploration is to go right in state $s_0$, but with 10% exploration the optimal policy is to go up in state $s_0$. With 20% exploration, going right is the optimal policy because exploration is dangerous. With 10% exploration, going into state $s_2$ is less dangerous. Thus, if the rate of exploration is reduced, the optimal policy changes. However, with less exploration, it would take longer to find an optimal policy. The value Q-learning converges to does not depend on the exploration rate.
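The comparison just described can be checked directly from the table. The short Python sketch below copies the first two value columns and picks the action with the larger Q-value for each learner; treating those two columns as the "right" and "up" actions in the state under discussion is an assumption based on the surrounding text.

```python
# Q-values copied from the first two value columns of the table.
# Reading these columns as the "right" and "up" actions is an assumption.
q_values = {
    "Q-learning": {"right": 19.48, "up": 23.28},
    "SARSA (20%)": {"right": 9.27, "up": 7.90},
    "SARSA (10%)": {"right": 13.04, "up": 13.95},
}

# For each learner, the greedy action is the one with the larger Q-value.
greedy_action = {name: max(q, key=q.get) for name, q in q_values.items()}
```

Under these values, only SARSA with 20% exploration prefers going right, which matches the change of policy described above as the exploration rate is reduced.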
SARSA is useful when deploying an agent that is exploring in the world. If you want to do offline learning, and then use that policy in an agent that does not explore, Q-learning may be more appropriate.