# 3.8.3 Dynamic Programming

Dynamic programming is a general method for optimization that involves computing and storing partial solutions to problems. Solutions that have already been found can be retrieved rather than being recomputed. Dynamic programming algorithms are used throughout AI and computer science.

Dynamic programming can be used for finding paths in finite graphs by constructing a cost-to-goal function for nodes that gives the exact cost of a minimal-cost path from the node to a goal.

Let $\mbox{cost\_to\_goal}(n)$ be the actual cost of a lowest-cost path from node $n$ to a goal; $\mbox{cost\_to\_goal}(n)$ can be defined as

 $\mbox{cost\_to\_goal}(n)=\left\{\begin{array}[]{ll}0&\mbox{if }\mbox{goal}(n),% \\ \min_{\left\in A}(\mbox{cost}(\left)+\mbox{cost\_to\_% goal}(m))&\mbox{otherwise.}\end{array}\right.$

The general idea is to build a table offline of the $\mbox{cost\_to\_goal}(n)$ value for each node. This is done by carrying out a lowest-cost-first search, with multiple-path pruning, from the goal nodes in the inverse graph, which is the graph with all arcs reversed. Rather than having a goal to search for, the dynamic programming algorithm records the cost_to_goal values for each node found. It uses the inverse graph to compute the costs from each node to the goal and not the costs from the goal to each node. In essence, dynamic programming works backward from the goal, building the lowest-cost paths to the goal from each node in the graph.

###### Example 3.20.

For the graph given in Figure 3.2, $r123$ is a goal, so

 $\mbox{cost\_to\_goal}(r123)=0.$

Continuing with a lowest-cost-first search from $r123$:

 $\displaystyle{\mbox{cost\_to\_goal}(o123)=4}$ $\displaystyle{\mbox{cost\_to\_goal}(o119)=13}$ $\displaystyle{\mbox{cost\_to\_goal}(o109)=29}$ $\displaystyle{\mbox{cost\_to\_goal}(b4)=36}$ $\displaystyle{\mbox{cost\_to\_goal}(b2)=39}$ $\displaystyle{\mbox{cost\_to\_goal}(o103)=41}$ $\displaystyle{\mbox{cost\_to\_goal}(b3)=43}$ $\displaystyle{\mbox{cost\_to\_goal}(b1)=45}$

At this stage the backward search halts. Notice that, if a node does not have a cost_to_goal value, there is no path to the goal from that node.

A policy is a specification of which arc to take from each node. An optimal policy is a policy such that the cost of following that policy is not worse than the cost of following any other policy. Given a cost_to_goal function, which is computed offline, a policy can be computed as follows: From node $n$ it should go to a neighbor $m$ that minimizes $\mbox{cost}(\left)+\mbox{cost\_to\_goal}(m)$. This policy will take the agent from any node to a goal along a lowest-cost path.

Either this neighbor can be recorded for all nodes offline, and the mapping from node to node is provided to the agent for online action, or the cost_to_goal function is given to the agent and each neighbor can be computed online.

Dynamic programming takes time and space linear in the size of the graph to build the cost_to_goal table. Once the cost_to_goal function has been built, even if the policy has not been recorded, the time to determine which arc is optimal depends only on the number of neighbors for the node.

###### Example 3.21.

Given the cost-to-goal of Example 3.20 for the goal of getting to $r123$, if the agent is at $o103$, it compares $4+43$ (the cost of going via $b3$) with $12+29$ (the cost of going straight to $o109$) and can determine to go next to $o109$. It does not need to consider going to ts as it knows there is no path from ts to $r123$.

Optimality of the $A^{*}$ algorithm A search algorithm is optimal if no other search algorithm uses less time or space or expands fewer paths, while still guaranteeing solution quality. The optimal search algorithm would be one that picks the correct node at each choice. However, this specification is not effective because we cannot directly implement it. Whether such an algorithm is possible is related to whether $P\neq NP$. There seems to be a statement that can be proved. Optimality of $A^{*}$: Among search algorithms that only use arc costs and a consistent heuristic, no algorithm expands fewer paths than $A^{*}$ and guarantees to find a lowest-cost path. Proof sketch: Given only the information about the arc costs and the heuristic information, unless the algorithm has expanded each path $p$, where $f(p)$ is less than the cost of an optimal path, it does not know whether $p$ leads to a lower-cost path. More formally, suppose an algorithm $A^{\prime}$ found a path for a problem $P$ where some path $p$ was not expanded such that $f(p)$ was less than the solution found. Suppose there was another problem $P^{\prime}$, which was the same as $P$, except that there really was a path via $p$ with cost $f(p)$. The algorithm $A^{\prime}$ cannot tell $P^{\prime}$ from $P$, because it did not expand the path $p$, so it would report the same solution for $P^{\prime}$ as for $P$, but the solution found for $P$ would not be optimal for $P^{\prime}$ because the solution found has a higher cost than the path via $p$. Therefore, an algorithm is not guaranteed to find a lowest-cost path unless it explores all paths with $f$-values less than the value of an optimal path; that is, it must explore all the paths that $A^{*}$ explores. Counterexample: Although this proof seems reasonable, there are algorithms that explore fewer nodes. Consider an algorithm that does a forward $A^{*}$-like search and a backward dynamic programming search, where the steps are interleaved in some way (e.g., by alternating between the forward steps and the backward steps). The backward search builds a table of $\mbox{cost\_to\_goal}(n)$ values of the actual discovered cost from $n$ to a goal, and it maintains a bound $b$, where it has explored all paths of cost less than $b$ to a goal. The forward search uses a priority queue on $cost(p)+c(n)$, where $n$ is the node at the end of the path $p$, and $c(n)$ is $\mbox{cost\_to\_goal}(n)$ if it has been computed; otherwise, $c(n)$ is $max(h(n),b)$. The intuition is that, if a path exists from the end of path $p$ to a goal node, either it uses a path that has been discovered by the backward search or it uses a path that costs at least $b$. This algorithm is guaranteed to find a lowest-cost path and often expands fewer paths than $A^{*}$ (see Exercise 11). Conclusion: Having a counterexample would seem to mean that the optimality of $A^{*}$ is false. However, the proof has some appeal and should not be dismissed outright. $A^{*}$ is not optimal out of the class of all algorithms, but the proof is correct for the class of algorithms that only do forward search (with conditions related to tie-breaking). Lakatos [1976] discusses how proofs and refutations are common. Dechter and Pearl [1985] give a detailed analysis of conditions when $A^{*}$ is optimal.

Dynamic programming has the advantage that it specifies what to do for every node, and so can be used given a goal for any starting position.

###### Example 3.22.

In route planning (see Example 3.1), we also might want to find a robust solution that can allow for dynamic online replanning by quickly adapting when the user deviates (intentionally or unintentionally) from the best route. It should be able to give the best route from the user’s current location and driving direction.

It is even possible to adapt $A^{*}$ for dynamic programming for cases where there is a known goal, and an initial starting position, but where the agent can deviate from the optimal path. One way this can be done is to carry out an $A^{*}$ search (with multiple-path pruning) from the destination to the current location, in the inverse graph. When the user deviates from an optimal route, the other paths to the goal have often been explored, or can be generated to find a path from the current location.

Dynamic programming can be used to construct heuristic functions which can be used for $A^{*}$ and branch-and-bound searches. One way to build a heuristic function is to simplify the problem (e.g., by leaving out some details) until the simplified problem is small enough. Dynamic programming can be used to find the cost of an optimal path to a goal in the simplified problem. This information forms a pattern database that can then be used as a heuristic for the original problem.

Dynamic programming is useful when

• the goal nodes are explicit (the previous methods only assumed a function that recognizes goal nodes)

• a lowest-cost path is needed

• the graph is finite and small enough to be able to store the cost_to_goal value for each node

• the goal does not change very often, and

• the policy is used a number of times for each goal, so that the cost of generating the cost_to_goal values can be amortized over many instances of the problem.

The main problems with dynamic programming are that

• it only works when the graph is finite and the table can be made small enough to fit into memory

• an agent must recompute a policy for each different goal, and

• the time and space required is linear in the size of the graph, where the graph size for finite graphs is typically exponential in the path length.