11.3.8 Model-Based Methods

In many applications of reinforcement learning, plenty of time is available for computation between each action. For example, a physical robot may have many seconds between each action. Q-learning, which only does one backup per action, will not make full use of the available computation time.

An alternative to just learning the Q-values is to use the data to learn the model. That is, an agent uses its experience to explicitly learn P(s'|s,a) and R(s,a,s'). For each action that the agent carries out in the environment, the agent can then do a number of steps of asynchronous value iteration to give a better estimate of the Q-function.
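The learned model can be as simple as a counting estimate: observed transition counts plus a prior pseudo-count give the probability estimate. A minimal sketch, with illustrative names (the two-state example at the end is hypothetical):

```python
from collections import defaultdict

def transition_prob(T, c, s, a, s2, states):
    """Estimate P(s2 | s, a) from transition counts T plus a prior count c.
    T maps (s, a, s_next) triples to observed counts; c acts as a
    Dirichlet-style pseudo-count for every possible successor state."""
    total = sum(T[(s, a, x)] + c for x in states)
    return (T[(s, a, s2)] + c) / total

# Hypothetical two-state example: from state 0, action "go" reached
# state 1 three times and state 0 never.
T = defaultdict(int)
T[(0, "go", 1)] = 3
p = transition_prob(T, c=1, s=0, a="go", s2=1, states=[0, 1])
# (3 + 1) / ((0 + 1) + (3 + 1)) = 4/5
```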

controller ModelBasedReinforcementLearner(S, A, γ, c)
  inputs:
    S is a set of states
    A is a set of actions
    γ the discount
    c is the prior count
  internal state:
    real array Q[S, A]
    real array R[S, A, S]
    integer array T[S, A, S]
    state s, s'
    action a
  begin:
    initialize Q[S, A] arbitrarily
    initialize R[S, A, S] arbitrarily
    initialize T[S, A, S] to zero
    observe current state s
    select and carry out action a
    repeat forever:
      observe reward r and state s'
      T[s, a, s'] ← T[s, a, s'] + 1
      R[s, a, s'] ← R[s, a, s'] + (r − R[s, a, s'])/T[s, a, s']
      s ← s'
      select and carry out action a
      repeat:
        select state s1, action a1
        let P = ∑_{s2} (T[s1, a1, s2] + c)
        Q[s1, a1] ← ∑_{s2} ((T[s1, a1, s2] + c)/P) × (R[s1, a1, s2] + γ max_{a2} Q[s2, a2])
      until an observation arrives

Figure 11.15: Model-based reinforcement learner

Figure 11.15 shows a generic model-based reinforcement learner. As with other reinforcement learning programs, it keeps track of Q[S,A], but it also maintains a model of the dynamics, represented here as T, where T[s,a,s'] is the count of the number of times that the agent has done a in state s and ended up in state s'. The counts are added to prior counts, as in a Dirichlet distribution, to compute probabilities. The algorithm assumes a common prior count. The R[s,a,s'] array maintains the average reward for transitioning from state s, doing action a, and ending up in state s'.

After each action, the agent observes the reward r and the resulting state s'. It updates the transition-count array T and the running average reward R, and then does a number of steps of asynchronous value iteration, using the updated probability model derived from T and the updated reward model. Three main parts of this algorithm are left undefined:
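Under the assumption of random updates between actions (one of the choices discussed below), the learner of Figure 11.15 can be sketched as follows. The class and method names are illustrative, not from the text:

```python
import random
from collections import defaultdict

class ModelBasedLearner:
    """Sketch of the controller of Figure 11.15: counts and average
    rewards form the model; backups use the count-based probabilities."""

    def __init__(self, states, actions, gamma, prior_count):
        self.states, self.actions = states, actions
        self.gamma, self.c = gamma, prior_count
        self.Q = defaultdict(float)   # Q[s, a], initialized to 0
        self.R = defaultdict(float)   # average reward R[s, a, s']
        self.T = defaultdict(int)     # transition counts T[s, a, s']

    def observe(self, s, a, r, s2):
        # Update the transition count and the running average reward.
        self.T[(s, a, s2)] += 1
        n = self.T[(s, a, s2)]
        self.R[(s, a, s2)] += (r - self.R[(s, a, s2)]) / n

    def backup(self, s1, a1):
        # One step of asynchronous value iteration on Q[s1, a1],
        # using counts plus the prior count c as probabilities.
        total = sum(self.T[(s1, a1, s2)] + self.c for s2 in self.states)
        self.Q[(s1, a1)] = sum(
            (self.T[(s1, a1, s2)] + self.c) / total
            * (self.R[(s1, a1, s2)]
               + self.gamma * max(self.Q[(s2, a2)] for a2 in self.actions))
            for s2 in self.states)

    def plan(self, steps):
        # Between actions, do some number of backups; random selection
        # is the simplest of the update-ordering choices discussed below.
        for _ in range(steps):
            self.backup(random.choice(self.states),
                        random.choice(self.actions))
```

A driving loop would call `observe` after each action and `plan` while waiting for the next observation.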

  • Which Q-values should be updated? It seems reasonable that the algorithm should at least update Q[s,a], because more data have been received on the transition probability and reward. From there it can either do random updates or determine which Q-values would change the most. The elements that potentially have their values changed the most are the Q[s1,a1] with the highest probability of ending up at a Q-value that has changed the most (i.e., where maxa2 Q[s2,a2] has changed the most). This can be implemented by keeping a priority queue of Q-values to consider.
  • How many steps of asynchronous value iteration should be done between actions? An agent should continue doing steps of value iteration until it has to act or until it gets new information. Figure 11.15 assumes that the agent acts and then waits for an observation to arrive. When an observation arrives, the agent acts as soon as possible. There are many variants, including a more relaxed agent that runs the repeat loop in parallel with observing and acting. Such an agent acts when it must, and it updates the transition and reward model when it observes.
  • What should be the initial values for R[S,A,S] and Q[S,A]? Once the agent has observed a reward for a particular ⟨s,a,s'⟩ transition, it will use the average of all of the rewards received for that transition. However, when updating Q it requires some value for the transitions it has never experienced. If it is using the exploration strategy of optimism in the face of uncertainty, it can use Rmax, the maximum reward possible, as the initial value for R, to encourage exploration. As in value iteration, it is best to initialize Q to be as close as possible to the final Q-value.
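The priority-queue ordering in the first point is often called prioritized sweeping. A minimal sketch, assuming a `backup` function that recomputes one Q-value in place and a hypothetical `predecessors` table mapping each state s1 to (s0, a0, p) triples, where doing a0 in s0 reaches s1 with probability p:

```python
import heapq

def prioritized_sweeping(Q, backup, predecessors, s, a,
                         theta=1e-3, max_steps=50):
    """Do backups in order of how much the Q-values are expected to change.
    Q maps (state, action) to a value; backup(s1, a1) recomputes
    Q[(s1, a1)] in place and returns the new value. heapq is a min-heap,
    so priorities are negated. Returns the number of backups done."""
    heap = [(0.0, (s, a))]   # start with the pair whose data just changed
    steps = 0
    while heap and steps < max_steps:
        _, (s1, a1) = heapq.heappop(heap)
        old = Q[(s1, a1)]
        change = abs(backup(s1, a1) - old)
        steps += 1
        if change > theta:
            # Predecessors likely to reach s1 are queued with priority
            # proportional to the transition probability times the change.
            for (s0, a0, p) in predecessors.get(s1, []):
                heapq.heappush(heap, (-p * change, (s0, a0)))
    return steps
```

A production version would also avoid queuing the same (state, action) pair twice; this sketch omits that bookkeeping.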

The algorithm in Figure 11.15 assumes that the prior count is the same for all ⟨s,a,s'⟩ transitions. If some prior knowledge exists that some transitions are impossible or some are more likely, the prior count should not be uniform.
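For instance, transitions known to be impossible can be given a prior pseudo-count of zero, so their estimated probability stays at zero until (never) observed. A minimal sketch, where the `impossible` set is a hypothetical encoding of such prior knowledge:

```python
def prior_count(s, a, s2, impossible, c=1):
    """Non-uniform prior counts: a transition in the hypothetical
    `impossible` set of (s, a, s2) triples gets pseudo-count 0;
    all other transitions share the common prior count c."""
    return 0 if (s, a, s2) in impossible else c
```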

This algorithm assumes that the rewards depend on the initial state, the action, and the final state. Moreover, it assumes that the reward for a ⟨s,a,s'⟩ transition is unknown until that exact transition has been observed. If the reward only depends on the initial state and the action, it is more efficient to have an R[S,A]. If there are separate action costs and rewards for entering a state, and the agent can separately observe the costs and rewards, the reward function can be decomposed into C[A] and R[S], leading to more efficient learning.
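The decomposed case can be sketched as follows, assuming the agent observes each action's cost and each state's reward separately (class and method names are illustrative). Only |A| + |S| averages are learned, instead of |S| × |A| × |S|:

```python
from collections import defaultdict

class DecomposedReward:
    """Learn action costs C[A] and state rewards R[S] separately,
    so reward(a, s') = R[s'] - C[a] (cost counts as negative reward)."""

    def __init__(self):
        self.C = defaultdict(float)   # average cost of each action
        self.R = defaultdict(float)   # average reward for entering each state
        self.nC = defaultdict(int)
        self.nR = defaultdict(int)

    def observe(self, a, cost, s2, reward):
        # Running averages of the separately observed cost and reward.
        self.nC[a] += 1
        self.C[a] += (cost - self.C[a]) / self.nC[a]
        self.nR[s2] += 1
        self.R[s2] += (reward - self.R[s2]) / self.nR[s2]

    def reward(self, a, s2):
        # Estimated reward for doing a and ending up in state s2.
        return self.R[s2] - self.C[a]
```

Because a cost observation for action a updates the estimate for every transition using a, each experience is shared across many ⟨s,a,s'⟩ triples, which is why learning is more efficient.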

It is difficult to directly compare the model-based and model-free reinforcement learners. Typically, model-based learners are much more efficient in terms of experience; many fewer experiences are needed to learn well. However, the model-free methods often use less computation time. If experience were cheap, a different comparison would be needed than if experience were expensive.