foundations of computational agents
One way to solve reinforcement algorithms is to treat this as an optimization problem, with the aim of selecting a policy that maximizes the expected reward collected. This can be done with a policy search through the space of policies to find the best policy. A policy is a controller that can be evaluated by running it in the agent acting in the environment.
Policy search is often solved as a stochastic local search algorithm by searching in the space of policies. A policy can be evaluated by running it in the environment a number of times.
Starting from an initial policy, the policy can be repeatedly evaluated in the environment and iteratively improved. This process is called an evolutionary algorithm because the agent, as a whole, is evaluated on how well it survives. This is often combined with genetic algorithms, which take us one step closer to the biological analogy of competing agents mutating genes.
Evolutionary algorithms have a number of issues. The first is the size of the state space. If there are states and actions, there are policies. For example, for the game described in Example 12.1, there are different policies. For the game of Example 12.2, there are 250 states, and so policies. This is a very small game, but it has more policies than there are particles in the universe.
Second, evolutionary algorithms use experiences very wastefully. If an agent was in state of Example 12.1 and it moved left, you would like it to learn that it is bad to go left from state . But evolutionary algorithms wait until the agent has finished and judge the policy as a whole. Stochastic local search will randomly try doing something else in state and so may eventually determine that that action was not good, but it is very indirect. Genetic algorithms are slightly better in that the policies that have the agent going left in state will die off, but again this is very indirect.
Third, the performance of evolutionary algorithms can be very sensitive to the representation of the policy. The representation for a genetic algorithm should be such that crossover preserves the good parts of the policy. The representations are often tuned for the particular domain.
An alternative pursued in the rest of this chapter is to learn after every action. The components of the policy are learned, rather than the policy as a whole. By learning what do in each state, learning can have linear or polynomial time and space complexity in the number of states, rather than exponential in the number of states.