9.5.1 Policies

In a fully-observable Markov decision process, the agent gets to observe its current state before deciding which action to carry out. For now assume that the Markov decision process is fully observable. A policy specifies what the agent should do as a function of the state it is in. A stationary policy is a function $\pi:S\rightarrow A$. In a non-stationary policy the action is a function of the state and the time; we assume policies are stationary.

Given a reward criterion, a policy has an expected value for every state. Let $V^{\pi}(s)$ be the expected value of following $\pi$ in state $s$. This specifies how much value the agent expects to receive from following the policy in that state. Policy $\pi$ is an optimal policy if there is no policy $\pi^{\prime}$ and no state $s$ such that $V^{\pi^{\prime}}(s)>V^{\pi}(s)$. That is, it is a policy that has a greater or equal expected value at every state than any other policy.

Example 9.29.

For Example 9.27, with two states and two actions, there are $2^{2}=4$ policies:

• Always relax.

• Always party.

• Relax if healthy and party if sick.

• Party if healthy and relax if sick.

The total reward for all of these is infinite because the agent never stops, and can never continually get a reward of 0. To determine the average reward is left as an exercise (Exercise 14). How to compute the discounted reward is discussed in the next section.

Example 9.30.

In the MDP of Example 9.28 there are 100 states and 4 actions, therefore there are $4^{100}\approx 10^{60}$ stationary policies. Each policy specifies an action for each state.

For infinite horizon problems, a stationary MDP always has an optimal stationary policy. However for finite-stage problems, a non-stationary policy might be better than all stationary policies. For example, if the agent had to stop at time $n$, for the last decision in some state, the agent would act to get the largest immediate reward without considering the future actions, but for earlier decisions it may decide to get a lower reward immediately to obtain a larger reward later.

Value of a Policy

Consider how to compute the expected value, using the discounted reward of a policy, given a discount factor of $\gamma$. The value is defined in terms of two interrelated functions:

• $V^{\pi}(s)$ is the expected value of following policy $\pi$ in state $s$.

• $Q^{\pi}(s,a)$, is the expected value, starting in state $s$ of doing action $a$, then following policy $\pi$. This is called the Q-value of policy $\pi$.

$Q^{\pi}$ and $V^{\pi}$ are defined recursively in terms of each other. If the agent is in state $s$, performs action $a$, and arrives in state $s^{\prime}$, it gets the immediate reward of $R(s,a,s^{\prime})$ plus the discounted future reward, $\gamma V^{\pi}(s^{\prime})$. When the agent is planning it does not know the actual resulting state, so it uses the expected value, averaged over the possible resulting states:

 $\displaystyle Q^{\pi}(s,a)$ $\displaystyle=\sum_{s^{\prime}}P(s^{\prime}\mid s,a)(R(s,a,s^{\prime})+\gamma V% ^{\pi}(s^{\prime}))$ $\displaystyle=R(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}\mid s,a)V^{\pi}(s^{% \prime})$ (9.2)

where $R(s,a)=\sum_{s^{\prime}}P(s^{\prime}\mid s,a)R(s,a,s^{\prime})$.

$V^{\pi}(s)$ is obtained by doing the action specified by $\pi$ and then following $\pi$:

 $\displaystyle V^{\pi}(s)$ $\displaystyle=Q^{\pi}(s,\pi(s)).$

Value of an Optimal Policy

Let $Q^{*}(s,a)$, where $s$ is a state and $a$ is an action, be the expected value of doing $a$ in state $s$ and then following the optimal policy. Let $V^{*}(s)$, where $s$ is a state, be the expected value of following an optimal policy from state $s$.

$Q^{*}$ can be defined analogously to $Q^{\pi}$:

 $\displaystyle Q^{*}(s,a)$ $\displaystyle=\sum_{s^{\prime}}P(s^{\prime}\mid s,a)(R(s,a,s^{\prime})+\gamma V% ^{*}(s^{\prime}))$ $\displaystyle=R(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}\mid s,a)\gamma V^{*}(% s^{\prime}).$

$V^{*}(s)$ is obtained by performing the action that gives the best value in each state:

 $\displaystyle V^{*}(s)$ $\displaystyle=\max_{a}Q^{*}(s,a).$

An optimal policy $\pi^{*}$ is one of the policies that gives the best value for each state:

 $\displaystyle\pi^{*}(s)$ $\displaystyle=\arg\max_{a}Q^{*}(s,a)$

where $\arg\max_{a}Q^{*}(s,a)$ is a function of state $s$, and its value is an action $a$ that results in the maximum value of $Q^{*}(s,a)$.