foundations of computational agents
Policy iteration starts with a policy and iteratively improves it. It starts with an arbitrary policy ${\pi}_{0}$ (an approximation to the optimal policy works best) and carries out the following steps starting from $i=0$:
Policy evaluation: determine ${V}^{{\pi}_{i}}(S)$. The definition of ${V}^{\pi}$ is a set of $|S|$ linear equations in $|S|$ unknowns. The unknowns are the values of ${V}^{{\pi}_{i}}(S)$. There is an equation for each state. These equations can be solved by a linear equation solution method (such as Gaussian elimination) or they can be solved iteratively.
Policy improvement: choose ${\pi}_{i+1}(s)=\mathrm{arg}{\mathrm{max}}_{a}{Q}^{{\pi}_{i}}(s,a)$, where the $Q$-value can be obtained from $V$ using Equation 9.2. To detect when the algorithm has converged, it should only change the policy if the new action for some state improves the expected value; that is, it should set ${\pi}_{i+1}(s)$ to be ${\pi}_{i}(s)$ if ${\pi}_{i}(s)$ is one of the actions that maximizes ${Q}^{{\pi}_{i}}(s,a)$.
Stop if there is no change in the policy, if ${\pi}_{i+1}={\pi}_{i}$, otherwise increment $i$ and repeat.
The algorithm is shown in Figure 9.18. Note that it only keeps the latest policy and notices if it has changed. This algorithm always halts, usually in a small number of iterations. Unfortunately, solving the set of linear equations is often time consuming.
A variant of policy iteration, called modified policy iteration, is obtained by noticing that the agent is not required to evaluate the policy to improve it; it can just carry out a number of backup steps using Equation 9.2 and then do an improvement.
Policy iteration is useful for systems that are too big to be represented directly as MDPs. Suppose a controller has some parameters that can be varied. An estimate of the derivative of the cumulative discounted reward of a parameter $a$ in some context $s$, which corresponds to the derivative of $Q(a,s)$, can be used to improve the parameter. Such an iteratively improving controller can get into a local maximum that is not a global maximum. Policy iteration for state-based MDPs does not result in non-optimal local maxima, because it is possible to improve an action for a state without affecting other states, whereas updating parameters can affect many states at once.