3.8.3 Dynamic Programming
Dynamic programming is a general method for optimization that involves
computing and storing partial solutions to problems.
Solutions that have already been found
can be retrieved rather than being recomputed. Dynamic programming
algorithms are used throughout AI and computer science.
Dynamic programming can be
used for finding paths in finite graphs by constructing a cost-to-goal function for nodes that gives the exact cost of a lowest-cost
path from the node to a goal.
Let $\text{cost\_to\_goal}(n)$ be the actual cost of a lowest-cost path from node
$n$ to a goal; $\text{cost\_to\_goal}(n)$ can be defined as

$$\text{cost\_to\_goal}(n) = \begin{cases} 0 & \text{if } \mathit{is\_goal}(n)\\ \min_{\langle n,m\rangle \in A} \left(\mathit{cost}(\langle n,m\rangle) + \text{cost\_to\_goal}(m)\right) & \text{otherwise} \end{cases}$$

where $A$ is the set of arcs.
The general idea is to build a table offline of
the $\text{cost\_to\_goal}(n)$ value for each node. This is done by carrying out a lowest-cost-first search, with multiple-path pruning, from
the goal nodes in the inverse graph, which is the graph with all arcs
reversed. Rather than having a goal to search for, the dynamic
programming algorithm records the cost_to_goal values for each node found.
It uses the inverse graph to compute the costs from each
node to the goal and not the costs from the goal to each node.
In essence, dynamic programming works backward from the goal,
building the lowest-cost paths to the goal from each node in the graph.
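The offline table construction described above can be sketched in Python. The graph representation (a list of weighted arcs) is an assumption for illustration, since the text leaves the data structures unspecified:

```python
import heapq

def cost_to_goal_table(arcs, goals):
    """Build the cost_to_goal table by lowest-cost-first search from
    the goals in the inverse graph, with multiple-path pruning.

    arcs: iterable of (from_node, to_node, cost) triples (a
    hypothetical representation chosen for this sketch).
    goals: the goal nodes.
    """
    # Inverse graph: for each node, the arcs coming into it.
    inverse = {}
    for frm, to, cost in arcs:
        inverse.setdefault(to, []).append((frm, cost))

    table = {}                              # node -> cost_to_goal(node)
    frontier = [(0, g) for g in goals]      # priority queue ordered by cost
    heapq.heapify(frontier)
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node in table:                   # multiple-path pruning: the
            continue                        # first removal is the lowest cost
        table[node] = cost
        for pred, arc_cost in inverse.get(node, []):
            if pred not in table:
                heapq.heappush(frontier, (cost + arc_cost, pred))
    return table
```

Nodes absent from the returned table have no path to a goal.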
Example 3.20.
For the graph
given in Figure 3.2,
$r123$ is a goal, so

$$\text{cost\_to\_goal}(r123)=0.$$

Continuing with a lowest-cost-first search from $r123$:

$$\begin{aligned}
\text{cost\_to\_goal}(o123)&=4\\
\text{cost\_to\_goal}(o119)&=13\\
\text{cost\_to\_goal}(o109)&=29\\
\text{cost\_to\_goal}(b4)&=36\\
\text{cost\_to\_goal}(b2)&=39\\
\text{cost\_to\_goal}(o103)&=41\\
\text{cost\_to\_goal}(b3)&=43\\
\text{cost\_to\_goal}(b1)&=45
\end{aligned}$$

At this stage the backward search halts. Notice that, if a node does not have a cost_to_goal
value, there is no path to the goal from that node.
A policy is a specification of which arc to take from each
node.
An optimal policy is a policy such that the cost of following
that policy is not worse than the cost of following any other policy.
Given a cost_to_goal function,
which is computed offline, a policy can be computed as follows:
from node $n$ it should go to
a neighbor $m$ that minimizes $\mathit{cost}(\langle n,m\rangle) + \text{cost\_to\_goal}(m)$. This policy will take the agent from any node to a goal
along a
lowest-cost path.
Either this neighbor can be recorded for all nodes offline, and
the mapping from node to node is provided to the agent for online action, or the
cost_to_goal function is given to the agent and each
neighbor can be
computed online.
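Computing the policy's next step online from the cost_to_goal table can be sketched as follows; the `neighbors` interface, mapping a node to its outgoing (neighbor, arc cost) pairs, is a hypothetical one:

```python
def next_node(node, neighbors, cost_to_goal):
    """Return the neighbor m of node minimizing
    cost(<node, m>) + cost_to_goal(m), following an optimal policy.

    neighbors: function from a node to its (neighbor, arc_cost) pairs
    (a hypothetical interface).  Returns None if node is a goal or if
    no neighbor has a cost_to_goal value (i.e., no path to a goal).
    """
    if cost_to_goal.get(node) == 0:
        return None                     # already at a goal
    best, best_cost = None, float("inf")
    for m, arc_cost in neighbors(node):
        if m in cost_to_goal and arc_cost + cost_to_goal[m] < best_cost:
            best, best_cost = m, arc_cost + cost_to_goal[m]
    return best
```

Each call inspects only the current node's neighbors, so the per-step work is independent of the size of the graph.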
Dynamic programming takes time and space linear in the size
of the graph to build the cost_to_goal table. Once the
cost_to_goal function has been built, even if the policy has not
been recorded, the time to determine which
arc is optimal depends only on the number
of neighbors for the node.
Example 3.21.
Given the cost_to_goal values of Example 3.20 for the
goal of getting to $r123$, if the agent is at $o103$, it compares $4+43$ (the cost of going via $b3$) with
$12+29$ (the cost of going straight to $o109$) and can
determine to go next to $o109$. It does not need to consider going to
$ts$ as it knows there is no path from $ts$ to $r123$.
Optimality of the $A^*$ algorithm
A search algorithm is optimal if no other search algorithm uses less
time or space or expands fewer paths, while still guaranteeing
solution quality. The optimal search algorithm would be one that picks the correct node at each choice. However, this specification is not
effective because we cannot directly
implement it. Whether such an algorithm is possible is related to
whether $P \neq NP$. There is, however, a statement that seems to be provable.
Optimality of $A^*$:
Among search algorithms that only use arc costs
and a consistent heuristic, no
algorithm expands fewer paths than $A^*$ and guarantees to find a
lowest-cost path.
Proof sketch:
Given only the information about the arc costs and the heuristic
information, unless the algorithm has expanded each path $p$, where
$f(p)$ is less than the cost of an optimal path, it does not know
whether $p$ leads to a lower-cost path. More formally, suppose an
algorithm $A'$ found a path for a problem $P$ where some path $p$ was not
expanded such that $f(p)$ was less than the solution found. Suppose there
was another problem $P'$, which was the same as $P$, except that there
really was a path via $p$ with cost $f(p)$. The algorithm $A'$ cannot tell
$P'$ from $P$, because it did not expand the path $p$, so it would report
the same solution for $P'$ as for $P$, but the solution found for $P$
would not be optimal for $P'$ because the solution found has a higher cost than the
path via $p$. Therefore, an algorithm is not guaranteed to find
a lowest-cost path unless it explores all paths with $f$-values less than
the value of an optimal path; that is, it must explore all the paths that
$A^*$ explores.
Counterexample: Although this proof seems reasonable,
there are algorithms that explore fewer nodes. Consider an algorithm that does a forward $A^*$-like search and
a backward dynamic programming search, where the steps are
interleaved in some way (e.g., by alternating between the forward
steps and the backward steps). The backward search builds a table of
$\text{cost\_to\_goal}(n)$ values of the actual discovered cost from $n$ to a goal, and it maintains a
bound $b$, where it has explored all paths of cost less than $b$ to a
goal. The forward search uses a priority queue on
$\mathit{cost}(p)+c(n)$, where $n$ is
the node at the end of the path $p$, and $c(n)$ is $\text{cost\_to\_goal}(n)$
if it has been computed; otherwise, $c(n)$ is $\max(h(n),b)$. The intuition is that, if a path
exists from the end of path $p$ to a goal node, either it uses a path
that has been discovered by the backward search or it uses a path that
costs at least $b$. This algorithm is guaranteed to find a lowest-cost
path and often expands fewer paths than $A^*$ (see
Exercise 11).
Conclusion: Having a counterexample would seem to mean
that the optimality of $A^*$ is false. However, the proof has some appeal and should not be dismissed
outright. $A^*$ is not optimal out of the class of all algorithms, but
the proof is correct for the class of algorithms that only do forward
search (with conditions related to
tie-breaking).
Lakatos [1976] discusses how such proofs and refutations are
common in mathematics.
Dechter and Pearl [1985] give a detailed analysis of conditions under which
$A^*$ is optimal.
Dynamic programming has the advantage that it specifies what to do for
every node, and so can be used given a goal for any starting position.
Example 3.22.
In route planning (see Example 3.1),
we also might want to find a robust solution that can allow for
dynamic online replanning by quickly adapting
when the user deviates (intentionally or unintentionally) from the
best route. It should be able to give the best route from the user’s
current location and driving direction.
It is even possible to adapt $A^*$ for dynamic programming for
cases where there is a known goal, and an initial starting position,
but where the agent can deviate from the optimal path.
One way this can be done is to carry out an $A^*$ search (with
multiple-path pruning) from the
destination to the current location, in the inverse graph. When the
user deviates from an optimal route,
the other paths to the goal have often been explored, or can be
generated to find a path from the current location.
Dynamic programming can be used to construct heuristic functions which can be
used for $A^*$ and
branch-and-bound searches. One way to build a heuristic function is to simplify the problem (e.g.,
by leaving out some details) until the simplified problem is
small enough. Dynamic
programming can be used to find the cost of an optimal path
to a goal in
the simplified problem. This information forms a pattern
database that can then be used as a heuristic for the
original problem.
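The pattern-database idea can be sketched as follows; the `abstraction` function and the table of abstract optimal costs are hypothetical placeholders, since how the problem is simplified is problem-specific:

```python
def pattern_database_heuristic(abstract_costs, abstraction):
    """Turn a table of optimal costs for a simplified problem into a
    heuristic for the original problem.

    abstract_costs: dict mapping each abstract state to the optimal
    cost-to-goal in the simplified problem (e.g., computed offline by
    backward dynamic programming).
    abstraction: hypothetical function mapping an original state to
    its simplified (abstract) state.
    """
    def h(state):
        # States whose abstraction is missing from the table get 0,
        # which keeps h admissible as long as simplifying the problem
        # never increases the cost of a path.
        return abstract_costs.get(abstraction(state), 0)
    return h
```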
Dynamic programming is useful when

•
the goal nodes are explicit (the previous methods only assumed
a function that recognizes goal nodes)

•
a lowestcost path is needed

•
the graph is finite and small enough to be able to store the cost_to_goal value for each node

•
the goal does not change very often, and

•
the policy is used a number of times for each goal, so that the cost of
generating the cost_to_goal values can be amortized over many instances of the problem.
The main problems with dynamic programming are that

•
it only works when the graph is finite and the table can be made
small enough to fit into memory

•
an agent must recompute a
policy for each different goal, and

•
the time and space required are linear
in the size of the graph, where the graph size for finite graphs is
typically exponential in the path length.