12 Learning to Act

12.8 Model-Based Reinforcement Learning

Dimensions: flat, states, infinite horizon, fully observable, stochastic, utility, learning, single agent, online, bounded rationality

In many applications of reinforcement learning, plenty of time is available for computation between each action. For example, a physical robot may have many seconds between each action. Q-learning, which only does one backup per action, will not make full use of the available computation time.

An alternative to doing one Q-value update after each action is to use the experiences to learn a model. An agent can explicitly learn P(ss,a) and R(s,a). For each action that the agent carries out in the environment, the agent can do a number of steps of asynchronous value iteration to give a better estimate of the Q-function.

1:controller Model_based_reinforcement_learner(S,A,γ)
2:      Inputs
3:            S is a set of states
4:            A is a set of actions
5:            γ the discount       
6:      Local
7:            real array Q[S,A]
8:            real array R[S,A]
9:            integer array T[S,A,S]
10:            integer array C[S,A]       
11:      initialize Q[S,A] arbitrarily
12:      initialize R[S,A] arbitrarily
13:      initialize T[S,A,S] to zero
14:      initialize C[S,A] to zero
15:      observe current state s
16:      select action a
17:      do(a)
18:      repeat
19:            observe reward r and state s
20:            T[s,a,s]:=T[s,a,s]+1
21:            C[s,a]:=C[s,a]+1
22:            R[s,a]:=R[s,a]+r-R[s,a]C[s,a]
23:            s:=s
24:            select action a
25:            do(a)
26:            repeat
27:                 select state s1, action a1 such that C[s1,a1]0
28:                 Q[s1,a1]:=R[s1,a1]+γ*s2T[s1,a1,s2]C[s1,a1]*maxa2Q[s2,a2]
29:            until  an observation arrives
30:      until termination
Figure 12.6: Model-based reinforcement learner

Figure 12.6 shows a generic model-based reinforcement learner. As with other reinforcement learning programs, it keeps track of Q[S,A], but it also maintains a model of the dynamics, represented here as T, where T[s,a,s] is the count of the number of times that the agent has done a in state s and ended up in state s. This also maintains C[s,a], which is the count of the number of times action a was carried out in state s. Note that C[s,a]=sT[s,a,s], and so we could save space but increase runtime by not storing C, but computing it when needed. The R[s,a] array maintains the average reward obtained when doing action a in state s.

After each action, the agent observes the reward r and the resulting state s. It then updates the transition-count matrices T and C as well as the average reward R. It then does a number of steps of asynchronous value iteration, using the updated probability model derived from T and the updated reward model.

There are three main undefined parts to this algorithm:

  • Which Q-values should be updated? It seems reasonable that the algorithm should at least update Q[s,a], because more data have been received on the transition probability and reward. From there it can either do random updates or determine which Q-values would change the most. The elements that potentially have their values changed the most are the Q[s1,a1] with the highest probability of ending up at a Q-value that has changed the most (i.e., where Q[s2,a2] has changed the most). This can be implemented by keeping a priority queue of Q-values to consider. To ensure these is no divide-by-zero error, it should only choose s1,a1 state–action pair for which C[s1,a1]0, or include pseudocounts for the transitions.

  • How many steps of asynchronous value iteration should be done between actions? An agent could continue doing Q-updates until it has to act or until it gets new information. Figure 12.6 assumes that the agent acts and then does Q-updates until an observation arrives. When an observation arrives, the agent acts as soon as possible. There are may variants, including doing a fixed number of updates, which may be appropriate in games where it can act at any time. It is also possible to run the update in parallel with observing and acting.

  • What should be the initial values for Q[S,A]? It requires some value for the transitions it has never experienced when updating Q. If it is using the exploration strategy of optimism in the face of uncertainty, it can use Rmax, the maximum reward possible, as the initial value for R, to encourage exploration. However, as in value iteration, the algorithm converges faster if it initializes Q to be as close as possible to the final Q-value.

This algorithm assumes that the rewards depend on the initial state and the action. If there are separate action costs and rewards for being in a state, and the agent can separately observe the costs and rewards, the reward function can be decomposed into C[A] and R[S], leading to more efficient learning.

It is difficult to directly compare the model-based and model-free reinforcement learners. Typically, model-based learners are much more efficient in terms of experience; many fewer experiences are needed to learn well. However, the model-free methods use less memory and often use less computation time. If experience was cheap, such as in a computer game, a different comparison would be needed than if experience was expensive, such as for a robot.