foundations of computational agents
Agents are situated in time; they receive sensory data in time and do actions in time. The action that an agent does at a particular time is a function of its inputs. We first consider the notion of time.
Let $T$ be the set of time points. Assume that $T$ is totally ordered and has some metric that can be used to measure the temporal distance between any two time points. Basically, we assume that $T$ can be mapped to some subset of the real line.
$T$ is discrete if there are only a finite number of time points between any two time points; for example, there is a time point every hundredth of a second, or every day, or there may be time points whenever interesting events occur. $T$ is dense if there is always another time point between any two time points; this implies there must be infinitely many time points between any two points. Discrete time has the property that, for all times, except perhaps a last time, there is always a next time. Dense time does not have a “next time.” Initially, we assume that time is discrete and goes on forever. Thus, for each time there is a next time. We write $t+1$ as the next time after time $t$. The time points do not need to be equally spaced.
Assume that $T$ has a starting point, which we arbitrarily call $0$.
Suppose $P$ is the set of all possible percepts. A percept trace, or percept stream, is a function from $T$ into $P$. It specifies what is observed at each time.
Suppose $C$ is the set of all commands. A command trace is a function from $T$ into $C$. It specifies the command for each time point.
Example 2.1. Consider a household trading agent that monitors the price of some commodity (e.g., it checks online for special deals and for price increases for snacks or toilet paper) and how much the household has in stock. It must decide whether to order more and how much to order. The percepts are the price and the amount in stock. The command is the number of units the agent decides to order (which is zero if the agent does not order any). A percept trace specifies for each time point (e.g., each day) the price at that time and the amount in stock at that time. Percept traces are given in Figure 2.2. A command trace specifies how much the agent decides to order at each time point. An example command trace is given in Figure 2.3.
The action of actually buying depends on the command but may be different. For example, the agent could issue a command to buy 12 rolls of paper at a particular price. This does not mean that the agent actually buys 12 rolls because there could be communication problems, the store could have run out of paper, or the price could change between deciding to buy and actually buying. However, in this example we can see that the buy orders are all successfully executed, as the amount in stock went up immediately after the order to buy.
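As a minimal sketch, the traces in this example can be written as functions from time points to values. The names `Percept` and `trace_from_list` are illustrative assumptions, and the numbers are made up rather than taken from the figures. Only a finite prefix of each trace is stored here, whereas the text assumes time goes on forever.

```python
from typing import Callable, List, NamedTuple, TypeVar

V = TypeVar("V")


class Percept(NamedTuple):
    price: float   # price of the commodity at this time point
    in_stock: int  # units the household has in stock


def trace_from_list(values: List[V]) -> Callable[[int], V]:
    """Return a function of time t that looks up the t-th recorded value.

    Only times 0 .. len(values) - 1 are defined (a finite prefix of a trace).
    """
    def trace(t: int) -> V:
        if not 0 <= t < len(values):
            raise ValueError(f"no value recorded for time {t}")
        return values[t]
    return trace


# Made-up data for three days (hypothetical, not the values in the figures).
percept_trace = trace_from_list(
    [Percept(2.10, 30), Percept(2.00, 25), Percept(1.80, 20)]
)
command_trace = trace_from_list([0, 0, 12])  # units ordered at each time

print(percept_trace(2))
print(command_trace(2))
```

Representing a trace as a function rather than a list mirrors the definition in the text: a trace assigns one percept (or command) to each time point in $T$.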
A percept trace for an agent is thus the sequence of all past, present, and future percepts received by the controller. A command trace is the sequence of all past, present, and future commands issued by the controller. The commands can be a function of the history of percepts. This gives rise to the concept of a transduction, a function from percept traces into command traces.
Because all agents are situated in time, an agent cannot actually observe full percept traces; at any time it has only experienced the part of the trace up to now. At time $t\in T$, an agent can only observe the value of the trace up to time $t$, and its commands cannot depend on percepts after time $t$.
A transduction is causal if, for all times $t$, the command at time $t$ depends only on percepts up to and including time $t$. The causality restriction is needed because agents are situated in time; their command at any time cannot depend on future percepts.
A controller is an implementation of a causal transduction.
The history of an agent at time $t$ is the percept trace of the agent for all times before or at time $t$ and the command trace of the agent before time $t$.
Thus, a causal transduction maps the agent’s history at time $t$ into the command at time $t$. It can be seen as the most general specification of a controller.
Example 2.2. Continuing Example 2.1, a causal transduction specifies, for each time, how much of the commodity the agent should buy depending on the price history, the history of how much of the commodity is in stock (including the current price and amount in stock) and the past history of buying.
An example of a causal transduction is as follows: buy four dozen rolls if there are fewer than five dozen in stock and the price is less than 90% of the average price over the last 20 days; buy a dozen rolls if there are fewer than a dozen in stock; otherwise, do not buy any.
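The rule above can be sketched as a function of the percept history up to the current time. This is one possible reading, not a definitive implementation: it assumes that "the last 20 days" means the 20 most recent prices including today's, and that all available prices are averaged when fewer than 20 exist. The name `transduction` and the tuple layout are illustrative.

```python
from typing import List, Tuple

DOZEN = 12


def transduction(history: List[Tuple[float, int]]) -> int:
    """Return the number of rolls to order at time t.

    `history` holds the percepts (price, in_stock) for times 0..t, so the
    result cannot depend on future percepts: the transduction is causal.
    """
    price, in_stock = history[-1]
    # Assumption: average over the 20 most recent prices, today included.
    recent_prices = [p for p, _ in history[-20:]]
    average = sum(recent_prices) / len(recent_prices)

    if in_stock < 5 * DOZEN and price < 0.9 * average:
        return 4 * DOZEN
    if in_stock < DOZEN:
        return DOZEN
    return 0


# Nineteen days at price 2.0, then a drop to 1.0 with 50 rolls in stock.
history = [(2.0, 100)] * 19 + [(1.0, 50)]
print(transduction(history))
```

Note that this function needs the whole price history as an argument, which is exactly what the following paragraphs argue an agent does not have direct access to.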
Although a causal transduction is a function of an agent’s history, it cannot be directly implemented because an agent does not have direct access to its entire history. It has access only to its current percepts and what it has remembered.
The memory or belief state of an agent at time $t$ is all the information the agent has remembered from the previous times. An agent has access only to the history that it has encoded in its belief state. Thus, the belief state encapsulates all of the information about its history that the agent can use for current and future commands. At any time, an agent has access to its belief state and its current percepts.
The belief state can contain any information, subject to the agent’s memory and processing limitations. This is a very general notion of belief.
Some instances of belief state include the following:
The belief state for an agent that is following a fixed sequence of instructions may be a program counter that records its current position in the sequence.
The belief state can contain specific facts that are useful – for example, where the delivery robot left a parcel when it went to find a key, or where it has already checked for the key. It may be useful for the agent to remember any information that it might need for the future that is reasonably stable and that cannot be immediately observed.
The belief state could encode a model or a partial model of the state of the world. An agent could maintain its best guess about the current state of the world or could have a probability distribution over possible world states; see Section 8.5.2.
The belief state could be a representation of the dynamics of the world – how the world changes – and the meaning of its percepts. Given its percepts, the agent could reason about what is true in the world.
The belief state could encode what the agent desires, the goals it still has to achieve, its beliefs about the state of the world, and its intentions, or the steps it intends to take to achieve its goals. These can be maintained as the agent acts and observes the world, for example, removing achieved goals and replacing intentions when more appropriate steps are found.
A controller maintains the agent’s belief state and determines what command to issue at each time. The information it has available when it must do this is its belief state and its current percepts.
A belief state transition function for discrete time is a function
$$\text{remember}:S\times P\to S$$
where $S$ is the set of belief states and $P$ is the set of possible percepts; ${s}_{t+1}=\text{remember}({s}_{t},{p}_{t})$ means that ${s}_{t+1}$ is the belief state following belief state ${s}_{t}$ when ${p}_{t}$ is observed.
A command function is a function
$$\text{command}:S\times P\to C$$
where $S$ is the set of belief states, $P$ is the set of possible percepts, and $C$ is the set of possible commands; ${c}_{t}=\text{command}({s}_{t},{p}_{t})$ means that the controller issues command ${c}_{t}$ when the belief state is ${s}_{t}$ and when ${p}_{t}$ is observed.
The belief-state transition function and the command function together specify a causal transduction for the agent. Note that a causal transduction is a function of the agent’s history, which the agent does not necessarily have access to, but a command function is a function of the agent’s belief state and percepts, which it does have access to.
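A minimal sketch of how the two functions induce a transduction: the loop below issues $c_t = \text{command}(s_t, p_t)$ and then moves to $s_{t+1} = \text{remember}(s_t, p_t)$. The name `run_controller` and the toy agent (which remembers the highest price seen so far and orders one unit whenever the current price is below it) are assumptions made purely for illustration.

```python
from typing import Callable, Iterable, List, TypeVar

S = TypeVar("S")  # belief states
P = TypeVar("P")  # percepts
C = TypeVar("C")  # commands


def run_controller(
    remember: Callable[[S, P], S],
    command: Callable[[S, P], C],
    initial_state: S,
    percepts: Iterable[P],
) -> List[C]:
    """Map a (finite prefix of a) percept trace to a command trace."""
    state = initial_state
    commands = []
    for percept in percepts:
        # The command uses only the belief state and the current percept.
        commands.append(command(state, percept))
        state = remember(state, percept)
    return commands


# Toy agent (hypothetical): belief state is the highest price seen so far.
def remember_max(state: float, price: float) -> float:
    return max(state, price)


def order_if_cheaper(state: float, price: float) -> int:
    return 1 if price < state else 0


print(run_controller(remember_max, order_if_cheaper, 0.0, [3.0, 2.0, 4.0, 3.0]))
```

Because each command is computed before any later percept is read, the resulting transduction is causal by construction.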
To implement the causal transduction of Example 2.2, a controller must keep track of the rolling history of the prices for the previous 20 days. If it also keeps track of the average ($\mathit{average}$), it can update the average using
$$\mathit{average} := \mathit{average} + \frac{\mathit{new} - \mathit{old}}{20}$$
where $\mathit{new}$ is the new price and $\mathit{old}$ is the oldest price remembered. It can then discard $\mathit{old}$. The first 20 days need special treatment, because fewer than 20 prices have been observed.
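The rolling update can be sketched as follows. The belief state here is the queue of remembered prices together with the current average. The text does not say how the first 20 days are handled, so the warm-up rule (a plain mean of the prices seen so far) is an assumption, as is the class name `RollingAverage`.

```python
from collections import deque


class RollingAverage:
    """Average of the most recent `window` prices, updated incrementally."""

    def __init__(self, window: int = 20) -> None:
        self.window = window
        self.prices = deque()  # remembered prices, oldest on the left
        self.average = 0.0

    def update(self, new: float) -> float:
        if len(self.prices) < self.window:
            # Warm-up (assumption): plain mean of the prices seen so far.
            self.prices.append(new)
            self.average += (new - self.average) / len(self.prices)
        else:
            # Steady state: average := average + (new - old) / window
            old = self.prices.popleft()
            self.prices.append(new)
            self.average += (new - old) / self.window
        return self.average


# A window of 3 keeps the arithmetic easy to follow by hand.
rolling = RollingAverage(window=3)
for price in [1.0, 2.0, 3.0, 6.0]:
    print(rolling.update(price))
```

The memory cost is the whole window of prices, which is what motivates the cheaper alternative discussed next.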
A simpler controller could, instead of remembering a rolling history in order to maintain the average, remember just a running estimate of the average and use that value as a surrogate for the oldest item. The belief state can then contain one real number (ave), with the state transition function
$$\mathit{ave} := \mathit{ave} + \frac{\mathit{new} - \mathit{ave}}{20}.$$
This controller is much easier to implement and is not as sensitive to what happened exactly 20 time units ago. It does not actually compute the average, as it is biased towards recent data. This way of maintaining estimates of averages is the basis for temporal differences in reinforcement learning.
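A minimal sketch of this simpler state transition function, where the entire belief state is a single number. The function name `remember_ave` and the parameter `n` are illustrative assumptions. The update is algebraically the same as $(1 - 1/n)\,\mathit{ave} + \mathit{new}/n$, which is why older prices fade out gradually instead of dropping out after exactly 20 steps.

```python
def remember_ave(ave: float, new: float, n: int = 20) -> float:
    """One step of the running estimate: ave := ave + (new - ave) / n."""
    return ave + (new - ave) / n


# One step: the estimate moves 1/20 of the way from 2.0 towards 4.0.
print(remember_ave(2.0, 4.0))

# If the price stays constant, the estimate approaches that price.
ave = 0.0
for _ in range(200):
    ave = remember_ave(ave, 1.0)
print(ave)
```

No warm-up logic is needed beyond choosing an initial value for the estimate. Starting from 0.0 as above is an arbitrary choice for the demonstration.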
If there are a finite number of possible belief states, the controller is called a finite state controller or a finite state machine. A factored representation is one in which the belief states, percepts, or commands are defined by features. If there are a finite number of features, and each feature can only have a finite number of possible values, the controller is a factored finite state machine. Richer controllers can be built using an unbounded number of values or an unbounded number of features. A controller that has an unbounded but countable number of states can compute anything that is computable by a Turing machine.
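When the sets of belief states and percepts are both finite, the transition and command functions can be written out as lookup tables. The sketch below is a hypothetical two-state ordering controller. The state names, percept names, and commands are invented for illustration and do not come from the text.

```python
# Hypothetical finite state controller: all names below are illustrative.
# Belief states: "waiting", "ordered". Percepts: "low", "ok" (stock level).
REMEMBER = {
    ("waiting", "low"): "ordered",
    ("waiting", "ok"): "waiting",
    ("ordered", "low"): "ordered",
    ("ordered", "ok"): "waiting",
}

COMMAND = {
    ("waiting", "low"): "order",  # stock is low and nothing is on order
    ("waiting", "ok"): "none",
    ("ordered", "low"): "none",   # already ordered, do not order again
    ("ordered", "ok"): "none",
}


def run(percepts, state="waiting"):
    """Run the table-driven controller; return the commands and final state."""
    commands = []
    for percept in percepts:
        commands.append(COMMAND[(state, percept)])
        state = REMEMBER[(state, percept)]
    return commands, state


print(run(["ok", "low", "low", "ok", "low"]))
```

The tables have one entry per pair of belief state and percept, so this style only works when both sets are finite and small. A factored representation would instead describe states and percepts by features.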