9.5 Decision Processes

9.5.3 Policy Iteration

Policy iteration starts with a policy and iteratively improves it. It starts with an arbitrary policy π0 (an approximation to the optimal policy works best) and carries out the following steps starting from i=0:

  • Policy evaluation: determine Vπi(S). The definition of Vπ is a set of |S| linear equations in |S| unknowns. The unknowns are the values of Vπi(S). There is an equation for each state. These equations can be solved by a linear equation solution method (such as Gaussian elimination) or they can be solved iteratively.

  • Policy improvement: choose πi+1(s)=argmaxaQπi(s,a), where the Q-value can be obtained from V using Equation 9.2. To detect when the algorithm has converged, it should only change the policy if the new action for some state improves the expected value; that is, it should set πi+1(s) to be πi(s) if πi(s) is one of the actions that maximizes Qπi(s,a).

  • Stop if there is no change in the policy, if πi+1=πi, otherwise increment i and repeat.

1:procedure Policy_iteration(S,A,P,R)
2:      Inputs
3:            S is the set of all states
4:            A is the set of all actions
5:            P is state transition function specifying P(ss,a)
6:            R is a reward function R(s,a)       
7:      Output
8:            optimal policy π
9:      Local
10:            action array π[S]
11:            Boolean variable noChange
12:            real array V[S]       
13:      set π arbitrarily
14:      repeat
15:            noChange:=true
16:            Solve V[s]=R(s,a)+γ*sSP(ss,π[s])*V[s]
17:            for each sS do
18:                 QBest:=V[s]
19:                 for each aA do
20:                       Qsa:=R(s,a)+γ*sSP(ss,a)*V[s]
21:                       if Qsa>QBest then
22:                             π[s]:=a
23:                             QBest:=Qsa
24:                             noChange:=false                                                     
25:      until noChange
26:      return π
Figure 9.18: Policy iteration for MDPs

The algorithm is shown in Figure 9.18. Note that it only keeps the latest policy and notices if it has changed. This algorithm always halts, usually in a small number of iterations. Unfortunately, solving the set of linear equations is often time consuming.

A variant of policy iteration, called modified policy iteration, is obtained by noticing that the agent is not required to evaluate the policy to improve it; it can just carry out a number of backup steps using Equation 9.2 and then do an improvement.

Policy iteration is useful for systems that are too big to be represented directly as MDPs. Suppose a controller has some parameters that can be varied. An estimate of the derivative of the cumulative discounted reward of a parameter a in some context s, which corresponds to the derivative of Q(a,s), can be used to improve the parameter. Such an iteratively improving controller can get into a local maximum that is not a global maximum. Policy iteration for state-based MDPs does not result in non-optimal local maxima, because it is possible to improve an action for a state without affecting other states, whereas updating parameters can affect many states at once.