Artificial Intelligence 2E

foundations of computational agents

Dimensions: flat, states, infinite horizon, fully observable, stochastic, utility, learning, single agent, online, bounded rationality

In *Q*-learning and related RL algorithms, an agent tries to learn the
optimal policy from its history of
interaction with the environment. A history of an agent is a sequence of state–action–rewards:

$$s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, \ldots$$

which means that the agent was in state ${s}_{0}$ and did action ${a}_{0}$, which resulted in it receiving reward ${r}_{1}$ and being in state ${s}_{1}$; then it did action ${a}_{1}$, received reward ${r}_{2}$, and ended up in state ${s}_{2}$; then it did action ${a}_{2}$, received reward ${r}_{3}$, and ended up in state ${s}_{3}$; and so on.

We treat this history of interaction as a sequence of experiences, where an experience is a tuple

$$\langle s, a, r, s' \rangle$$

which means that the agent was in state $s$, it did action $a$, it received reward $r$, and it went into state ${s}^{\prime}$. These experiences will be the data from which the agent can learn what to do. As in decision-theoretic planning, the aim is for the agent to maximize its value, which is usually the discounted reward.

Recall that ${Q}^{*}(s,a)$, where $a$ is an action and $s$ is a state, is the expected value (cumulative discounted reward) of doing $a$ in state $s$ and then following the optimal policy.

*Q*-learning uses temporal differences to estimate the
value of ${Q}^{*}(s,a)$.
In *Q*-learning, the agent
maintains a table of $Q[S,A]$, where $S$ is the set of states and $A$
is the set of actions. $Q[s,a]$ represents its current estimate of
${Q}^{*}(s,a)$.

An experience $\langle s,a,r,s'\rangle$ provides one data point for the value of $Q(s,a)$. The data point is that the agent received the future value of $r+\gamma V(s')$, where $V(s')=\max_{a'} Q(s',a')$; this is the actual current reward plus the discounted estimated future value. This new data point is called a return. The agent can use the temporal difference equation (12.1) to update its estimate for $Q(s,a)$:

$$Q[s,a] := Q[s,a]+\alpha\left(r+\gamma \max_{a'} Q[s',a']-Q[s,a]\right)$$

or, equivalently,

$$Q[s,a] := (1-\alpha)\,Q[s,a]+\alpha\left(r+\gamma \max_{a'} Q[s',a']\right).$$
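In Python, a single update can be sketched as follows (a minimal sketch; the dictionary-based $Q$ table and the function name `q_update` are illustrative, not from the text):

```python
# One Q-learning (temporal-difference) update for an experience (s, a, r, s').
# Q is a dict mapping (state, action) pairs to current value estimates.

def q_update(Q, s, a, r, s_prime, actions, alpha=0.3, gamma=0.8):
    """Move Q[s, a] toward the return r + gamma * max_a' Q[s', a']."""
    v_prime = max(Q[(s_prime, ap)] for ap in actions)  # estimated future value
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * v_prime)
    return Q[(s, a)]
```

The two equivalent forms of the update correspond to the last line: adding $\alpha$ times the temporal difference is the same as mixing the old estimate and the new return with weights $1-\alpha$ and $\alpha$.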

Figure 12.3 shows a *Q*-learning
controller, where the agent is acting and learning at the same
time. The $do(a)$ on line 15
specifies that the action $a$ is the command the
controller sends to the body. The reward and the resulting state are
the percepts the controller receives from the body.
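Since Figure 12.3 is not reproduced here, the following is an assumed reconstruction of such a controller in Python, using $\varepsilon$-greedy exploration (a common choice; the figure's actual exploration strategy may differ). The `env.do(a)` method standing in for the body's command interface, and the `env.state` attribute, are hypothetical names:

```python
import random

def q_learning_controller(env, states, actions, alpha=0.3, gamma=0.8,
                          epsilon=0.1, steps=1000):
    """Act and learn at the same time.

    env.do(a) is assumed to send command a to the body and return the
    percepts (reward, next state).
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    s = env.state
    for _ in range(steps):
        if random.random() < epsilon:                   # explore
            a = random.choice(actions)
        else:                                           # exploit current estimates
            a = max(actions, key=lambda ap: Q[(s, ap)])
        r, s_prime = env.do(a)                          # percepts from the body
        # Temporal-difference update toward r + gamma * max_a' Q[s', a']
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_prime, ap)] for ap in actions)
                              - Q[(s, a)])
        s = s_prime
    return Q
```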

The *Q*-learner learns (an approximation of) the optimal $Q$-function as long
as the agent explores sufficiently: there must be no bound on the
number of times it tries each action in each state (it cannot, for example,
settle into always doing the same subset of actions in a state).

Consider the two-state MDP of Example 9.27. The agent knows there are two states $\{healthy, sick\}$ and two actions $\{relax, party\}$. It does not know the model and it learns from the $\langle s,a,r,s'\rangle$ experiences. With a discount $\gamma=0.8$, $\alpha=0.3$, and $Q$ initially 0, the following is a possible trace (to a few significant digits and with the states and actions abbreviated):

$s$ | $a$ | $r$ | $s'$ | Update $=(1-\alpha)\,Q[s,a]+\alpha\,(r+\gamma\max_{a'}Q[s',a'])$
---|---|---|---|---
$he$ | $re$ | $7$ | $he$ | $Q[he,re]=0.7*0+0.3*(7+0.8*0)=2.1$
$he$ | $re$ | $7$ | $he$ | $Q[he,re]=0.7*2.1+0.3*(7+0.8*2.1)=4.07$
$he$ | $pa$ | $10$ | $he$ | $Q[he,pa]=0.7*0+0.3*(10+0.8*4.07)=3.98$
$he$ | $pa$ | $10$ | $si$ | $Q[he,pa]=0.7*3.98+0.3*(10+0.8*0)=5.79$
$si$ | $pa$ | $2$ | $si$ | $Q[si,pa]=0.7*0+0.3*(2+0.8*0)=0.6$
$si$ | $re$ | $0$ | $si$ | $Q[si,re]=0.7*0+0.3*(0+0.8*0.6)=0.144$
$si$ | $re$ | $0$ | $he$ | $Q[si,re]=0.7*0.144+0.3*(0+0.8*5.79)=1.49$
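The trace can be replayed programmatically to check the arithmetic (a minimal sketch; the state and action abbreviations follow the table, and the dictionary-based $Q$ table is illustrative):

```python
# Replay the sequence of experiences from the trace above and
# apply the Q-learning update to each one.
alpha, gamma = 0.3, 0.8
states, actions = ["he", "si"], ["re", "pa"]
Q = {(s, a): 0.0 for s in states for a in actions}

experiences = [("he", "re", 7, "he"), ("he", "re", 7, "he"),
               ("he", "pa", 10, "he"), ("he", "pa", 10, "si"),
               ("si", "pa", 2, "si"), ("si", "re", 0, "si"),
               ("si", "re", 0, "he")]

for s, a, r, sp in experiences:
    Q[(s, a)] = ((1 - alpha) * Q[(s, a)]
                 + alpha * (r + gamma * max(Q[(sp, ap)] for ap in actions)))

for (s, a), q in sorted(Q.items()):
    print(s, a, round(q, 3))
```

Running this reproduces the final estimates in the trace (the table rounds intermediate values, so the last digits differ slightly).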

With $\alpha$ fixed, the *Q*-values will approximate, but not converge to, the values obtained with
value iteration in Example 9.31. The smaller
$\alpha$ is, the closer the estimates get to the actual *Q*-values, but the
slower the convergence.

The controller of Figure 12.3 has $\alpha$ fixed. If $\alpha_k$ decreased
appropriately, the estimates would converge to the actual *Q*-values. To
implement this, there needs to be a separate $\alpha_k$ for each state–action
pair, which can be implemented using an array,
$visits[S,A]$, that counts the number of times action $A$ has been carried
out in state $S$. Before
line 17 of Figure 12.3, $visits[s,a]$ can be
incremented and $\alpha$ set to, say, $10/(9+visits[s,a])$; see Exercise 5.
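This decaying-learning-rate scheme can be sketched as follows (a minimal sketch; the $10/(9+visits[s,a])$ schedule is the one suggested above, and the function name is illustrative):

```python
from collections import defaultdict

def q_update_decaying(Q, visits, s, a, r, s_prime, actions, gamma=0.8):
    """Q-learning update with a per-(state, action) learning rate that
    decays as 10 / (9 + visits[s, a])."""
    visits[(s, a)] += 1
    alpha = 10.0 / (9.0 + visits[(s, a)])
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_prime, ap)] for ap in actions)
                          - Q[(s, a)])
    return Q[(s, a)]
```

On the first visit $\alpha=10/10=1$, so the estimate is set to the observed return; later visits get progressively smaller updates, which is what allows convergence.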