The belief network probability models of Chapter 8 were defined in terms of features. Many domains are best modeled in terms of individuals and relations. Agents must often build models before they know what individuals are in the domain and, therefore, before they know what random variables exist. When the probabilities are being learned, the probabilities often do not depend on the individuals. Although it is possible to learn about an individual, an agent must also learn general knowledge that it can apply when it finds out about a new individual.
Consider the problem of predicting how well students will do in courses they have not taken. Figure 15.6 shows some fictional data designed to show what can be done. Students $s_3$ and $s_4$ have the same averages, on courses with the same averages. However, we may be able to distinguish them as we know something about the courses they have taken.
This is a different problem from the cases considered in Chapter 7 because the values of the properties $Student$ and $Course$ are individuals, and we want to make predictions based on the properties of the individuals. None of the methods in that chapter would work on such data.
Consider the problem of an intelligent tutoring system diagnosing students’ arithmetic errors. From observing a student’s performance on a number of examples, the tutor should try to determine whether or not the student understands the task and, if not, work out what the student is doing wrong so that appropriate remedies can be applied.
Consider the case of diagnosing two-digit addition of the form
$$\begin{array}{rrr} & x_1 & x_0 \\ + & y_1 & y_0 \\ \hline z_2 & z_1 & z_0 \end{array}$$
The student is given the values for the $x$s and the $y$s and provides values for the $z$s.
Students’ answers depend on the problem (the $x$s and $y$s), on whether they know basic addition, and on whether they know how to carry.
A belief network for this example is shown in Figure 15.7. The carry into digit $i$, given by the variable $C_i$, depends on $X_{i-1}$, $Y_{i-1}$, and $C_{i-1}$, the values and the carry for the previous digit (except for the initial case), and on whether the student knows how to carry. The $z$-value for digit $i$, given by the variable $Z_i$, depends on $X_i$, $Y_i$, $C_i$, and whether the student knows basic addition.
By observing the values of the $x$s and the $y$s in the problem and the values of the $z$s given by the student, the posterior probability that the student knows addition and knows how to carry can be inferred. One feature of the belief network model is that it allows students to make random errors; even though they know how to perform two-digit arithmetic, they can still get the wrong answer occasionally.
The problem with this representation is that it is inflexible. A flexible representation would allow for the addition of multiple digits, multiple problems, multiple students, and multiple times. Multiple digits require the replication of the network for the digits. Multiple times allow for modeling how students’ knowledge and their answers change over time, even if the problems do not change over time.
If the conditional probabilities were stored as tables, the size of those tables would be enormous. For example, if the $X$, $Y$, and $Z$ variables each have a domain size of 11 (the digits 0 to 9 or the blank), and the $C$ and $Knows\_Add$ variables are binary, a tabular representation of
$$P(Z_1 \mid X_1, Y_1, C_1, Knows\_Add)$$
would have $11^3 \times 2^2 = 5324$ entries, a size greater than 4000. There is much more structure in the conditional probability than is expressed in the tabular representation. Tabular representations are not the only representation of conditional probabilities. We present a probabilistic extension of logic programs below that allows for both relational probabilistic models and compact descriptions of conditional probabilities.
A relational probability model (RPM) or probabilistic relational model is a model in which the probabilities are specified on the relations, independently of the actual individuals. Different individuals share the probability parameters.
A parameterized random variable is of the form $F(t_1, \dots, t_k)$, where each $t_i$ is a term (a logical variable or a constant). Thus it corresponds to either an atomic symbol or a term. The parameterized random variable is said to be parameterized by the logical variables that appear in it. A ground instance of a parameterized random variable is obtained by substituting constants for the logical variables in the parameterized random variable. The ground instances of a parameterized random variable correspond to random variables. The domain of the random variable is the range of $F$. A Boolean parameterized random variable corresponds to a predicate symbol.
We use the Datalog convention that logical variables start with an upper-case letter and constants start with a lower-case letter. Random variables and functions are written starting with an upper-case letter, with the corresponding proposition in lower case (e.g., $Diff(c_1) = true$ is written as $diff(c_1)$, and $Diff(c_1) = false$ is written as $\neg diff(c_1)$).
For a relational probability model of the multidigit arithmetic problem outlined above, there is a separate $x$-variable for each digit $D$ and for each problem $P$, represented by the parameterized random variable $X(D, P)$. Thus, for example, $X(1, prob17)$ may be a random variable representing the $x$-value of the first digit of problem 17. Similarly there is a parameterized random variable, $Y(D, P)$, which represents a random variable for each digit $D$ and problem $P$.
There is a variable for each student $S$ and time $T$ that represents whether $S$ knows how to add properly at time $T$. The parameterized random variable $Knows\_add(S, T)$ represents whether student $S$ knows addition at time $T$. The random variable $Knows\_add(fred, mar23)$ is true if Fred knows addition on March 23. Similarly, there is a parameterized random variable $Knows\_carry(S, T)$.
There is a different $z$-value and a different carry for each digit, problem, student, and time. These values are represented by the parameterized random variables $Z(D, P, S, T)$ and $Carry(D, P, S, T)$. So, $Z(1, prob17, fred, mar23)$ is a random variable representing the answer Fred gave on March 23 for digit 1 of problem 17. Function $Z$ has range $\{0, \dots, 9\}$, so this set is the domain of the random variables that are the ground instances of $Z(D, P, S, T)$.
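The relationship between a parameterized random variable and its ground instances can be sketched in a few lines of Python. This is only an illustration: the populations, the individual names, and the helper `ground_instances` are assumptions made for the example, not part of any particular system.

```python
from itertools import product

# Hypothetical populations for each logical variable (illustrative names only).
populations = {
    "D": [1, 2],                # digits
    "P": ["prob17", "prob18"],  # problems
    "S": ["fred", "wilma"],     # students
    "T": ["mar23"],             # times
}

def ground_instances(name, logical_vars, populations):
    """Return every ground instance of the parameterized random variable
    name(logical_vars): one per combination of individuals."""
    domains = [populations[v] for v in logical_vars]
    return [(name, combo) for combo in product(*domains)]

x_vars = ground_instances("X", ["D", "P"], populations)
z_vars = ground_instances("Z", ["D", "P", "S", "T"], populations)
knows_add_vars = ground_instances("Knows_add", ["S", "T"], populations)

print(len(x_vars), len(z_vars), len(knows_add_vars))
```

Each tuple such as `("Z", (1, "prob17", "fred", "mar23"))` stands for one ordinary random variable; the number of ground instances is the product of the population sizes of the logical variables involved.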
A plate model consists of
a directed graph in which the nodes are parameterized random variables,
a population of individuals for each logical variable, and
a conditional probability of each node given its parents.
We draw a rectangle – a plate – around the parameterized random variables that share a logical variable. There is a plate for each logical variable. A plate model means its grounding – the belief network in which nodes are all ground instances of the parameterized random variables (each logical variable replaced by an individual in its population). That is, the variables in each plate are replicated for each individual. The conditional probabilities of the grounded belief network are the same as the corresponding instances of the plate model. This notation is redundant, as the logical variables are specified in both the plates and the arguments. Sometimes one of these is omitted; often the arguments are omitted when they can be inferred from the plates.
Figure 15.8 gives a plate model for predicting student grades. There is a plate for the courses and a plate for the students. The parameterized random variables are
$Intelligent(S)$, which represents whether student $S$ is intelligent,
$Difficult(C)$, which represents whether course $C$ is difficult, and
$Grade(S, C)$, which represents the grade of student $S$ in course $C$.
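The probabilities $P(Intelligent(S))$, $P(Difficult(C))$, and $P(Grade(S, C) \mid Intelligent(S), Difficult(C))$ need to be specified. If $Intelligent$ and $Difficult$ are Boolean (with range $true$ and $false$) and $Grade$ has range $\{A, B, C\}$ then there are 10 parameters that define the probability distribution: one for $P(Intelligent(S))$, one for $P(Difficult(C))$, and eight for the grade table. Suppose $P(Intelligent(S))$ and $P(Difficult(C))$ are given and $P(Grade(S, C) \mid Intelligent(S), Difficult(C))$ is defined by a table with one row for each assignment to $Intelligent(S)$ and $Difficult(C)$.
Eight parameters are required to define $P(Grade(S, C) \mid Intelligent(S), Difficult(C))$ because there are four cases, and each case requires two numbers to be specified; the third can be inferred to ensure the probabilities sum to one.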
Figure 15.9 shows a grounding for 3 students and 2 courses. If there were $n$ students and $m$ courses, in the grounding there would be $n$ instances of $Intelligent(S)$, $m$ instances of $Difficult(C)$, and $n \times m$ instances of $Grade(S, C)$. So there would be $n + m + n \times m$ random variables in the grounding.
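The grounding of this plate model is small enough to build directly. The following is a minimal sketch; the function name `ground_grade_model` and the individual names are assumptions chosen for illustration. It constructs the ground belief network structure as a dictionary from each ground variable to its parents.

```python
from itertools import product

def ground_grade_model(students, courses):
    """Return {ground variable: list of ground parents} for the plate model
    with Intelligent(S), Difficult(C) and Grade(S, C)."""
    network = {}
    for s in students:
        network[("Intelligent", s)] = []
    for c in courses:
        network[("Difficult", c)] = []
    for s, c in product(students, courses):
        # Each Grade instance has the matching student and course as parents.
        network[("Grade", s, c)] = [("Intelligent", s), ("Difficult", c)]
    return network

small = ground_grade_model(["s1", "s2", "s3"], ["c1", "c2"])
large = ground_grade_model(["s1", "s2", "s3", "s4"], ["c1", "c2", "c3", "c4"])
print(len(small), len(large))
```

With 3 students and 2 courses the grounding has 3 + 2 + 6 = 11 variables, and with 4 students and 4 courses it has 4 + 4 + 16 = 24.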
Consider conditioning on the data given in Figure 15.6, and querying the variables corresponding to the last two rows. There are 4 courses and 4 students, and so there would be 24 variables in the grounding. All of the instances of $Grade$ that are not observed or queried can be pruned or never constructed in the first place, resulting in the belief network of Figure 15.10. From this network, conditioned on $obs$, the observed grades of Figure 15.6, and using the probabilities above, the posterior probabilities of the two queried grades can be derived.
Thus, this model predicts that $s_3$ is likely to do better than $s_4$ in course $c_4$.
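A query of this kind can be answered on the ground network by brute-force enumeration. The sketch below is illustrative only: the prior probabilities, the grade table, the observations, and the names `s1` to `s4` and `c1` to `c4` are all assumptions made up for the example, chosen so that two students have the same grade on courses with the same average. They should not be read as the values behind the figures.

```python
from itertools import product

# Assumed, illustrative parameters (not the values behind the figures).
P_INTELLIGENT = 0.5
P_DIFFICULT = 0.5
# P(Grade | Intelligent, Difficult), keyed by (intelligent, difficult).
P_GRADE = {
    (True, True): {"A": 0.5, "B": 0.4, "C": 0.1},
    (True, False): {"A": 0.9, "B": 0.09, "C": 0.01},
    (False, True): {"A": 0.01, "B": 0.1, "C": 0.9},
    (False, False): {"A": 0.1, "B": 0.4, "C": 0.5},
}
# Assumed observations: (student, course, grade).
OBSERVED = [("s1", "c1", "A"), ("s2", "c1", "C"), ("s1", "c2", "B"),
            ("s2", "c3", "B"), ("s3", "c2", "B"), ("s4", "c3", "B")]
STUDENTS = ["s1", "s2", "s3", "s4"]
COURSES = ["c1", "c2", "c3", "c4"]

def posterior_grade(student, course):
    """P(Grade(student, course) | observations), summing over every assignment
    to the ground Intelligent and Difficult variables."""
    totals = {"A": 0.0, "B": 0.0, "C": 0.0}
    n = len(STUDENTS)
    for values in product([True, False], repeat=n + len(COURSES)):
        intel = dict(zip(STUDENTS, values[:n]))
        diff = dict(zip(COURSES, values[n:]))
        weight = 1.0
        for s in STUDENTS:
            weight *= P_INTELLIGENT if intel[s] else 1 - P_INTELLIGENT
        for c in COURSES:
            weight *= P_DIFFICULT if diff[c] else 1 - P_DIFFICULT
        for s, c, g in OBSERVED:
            weight *= P_GRADE[(intel[s], diff[c])][g]
        for g in totals:
            totals[g] += weight * P_GRADE[(intel[student], diff[course])][g]
    norm = sum(totals.values())
    return {g: p / norm for g, p in totals.items()}

post_s3 = posterior_grade("s3", "c4")
post_s4 = posterior_grade("s4", "c4")
print(post_s3, post_s4)
```

Under these assumed numbers, `s3` shared a course with a student who did well elsewhere, so a B in that course is evidence that the course is difficult and that `s3` is intelligent; the opposite holds for `s4`. Enumeration is exponential in the number of unobserved variables, so it is only a way to check small examples.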
A plate model for the multidigit addition problem of Example 15.19 is shown in Figure 15.11. The rectangles correspond to plates. For the plate labeled with $D, P$, an instance of each variable exists for each digit $D$ and problem $P$. One way to view this is that the instances come out of the page, like a stack of plates. Similarly, for the plate labeled $S, T$, there is a copy of the variables for each student $S$ and each time $T$. For the variables in the intersection of the plates, there is a random variable for each digit $D$, problem $P$, student $S$, and time $T$.
The plate representation denotes the same independence as a belief network; each node is independent of its non-descendants given its parents. This independence is inherited by a corresponding ground belief network. Thus, for particular values $d$, $p$, $s$, and $t$, $Z(d, p, s, t)$ is a random variable, with parents $X(d, p)$, $Y(d, p)$, $Carry(d, p, s, t)$, and $Knows\_add(s, t)$. There is a loop in the plate model on the parameterized random variable $Carry(D, P, S, T)$ because the carry for one digit depends on the carry for the previous digit for the same problem, student, and time. Similarly, whether students know how to carry at some time depends on whether they knew how to carry at the previous time. The ground network needs to be acyclic.
There is a conditional probability of each parameterized random variable, given its parents. This conditional probability is shared among its ground instances.
Unfortunately, the plate representation is not adequate when the dependency occurs among different instances of the same relation. In the preceding example, $Carry(D, P, S, T)$ depends, in part, on $Carry(D-1, P, S, T)$, that is, on the carry from the previous digit (and there is some other case for the first digit). To represent such examples, it is useful to be able to specify how the logical variables interact, as is done in logic programs.
One representation that combines the ideas of belief networks, plates, and logic programs is the independent choice logic (ICL). The ICL consists of a set of independent choices, a logic program that gives the consequences of the choices, and probability distributions over the choices. In more detail, the ICL is defined as follows:
An alternative is a set of atoms all sharing the same logical variables. A choice space is a set of alternatives such that none of the atoms in the alternatives unify with each other. An ICL theory contains
a choice space $C$. Let $C'$ be the set of ground instances of the alternatives. Thus, $C'$ is a set of sets of ground atoms.
an acyclic logic program (that can include negation as failure), in which the head of the clauses does not unify with an element of an alternative in the choice space.
a probability distribution over each alternative. All instances of an alternative have the same probability.
The atoms in the logic program and the choice space can contain constants, variables, and function symbols.
A selector function selects a single element from each alternative in the set of ground instances of the alternatives. There is a possible world for each selector function. The logic program specifies what is true in each possible world. Atom $a$ is true in a possible world if it follows from the atoms selected by the selector function added to the logic program. The probability of proposition $g$ is given by a measure over sets of possible worlds, where the atoms in different ground instances of the alternatives are probabilistically independent. The instances of an alternative share the same probabilities, and the probabilities of different instances are multiplied.
Consider the choice space , the logic program :
and the distribution over the first alternative, , and over the second alternative , , .
There are 6 possible worlds:
so, under this model, , and .
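The possible-worlds semantics can be made concrete by enumerating selector functions for a small ground theory. The following sketch uses a hypothetical theory: the atoms `a1`, `a2`, `a3`, `b1`, `b2`, `e`, `f`, the rules, and the probabilities are assumptions invented for illustration, and the evaluator relies on the program being acyclic.

```python
from itertools import product

# A hypothetical ground ICL theory; atoms, rules and numbers are assumptions.
ALTERNATIVES = [
    {"a1": 0.5, "a2": 0.3, "a3": 0.2},
    {"b1": 0.9, "b2": 0.1},
]
# Each rule is (head, body). A body literal is (atom, is_positive), where
# is_positive=False means negation as failure.
RULES = [
    ("f", [("a1", True), ("b1", True)]),
    ("f", [("a3", True), ("b2", True)]),
    ("e", [("f", False), ("a2", True)]),
]

def holds(atom, selected):
    """True if atom follows from the selected atoms plus the acyclic program."""
    if atom in selected:
        return True
    return any(
        all(holds(b, selected) == positive for b, positive in body)
        for head, body in RULES if head == atom
    )

def possible_worlds():
    """Yield (selected atoms, probability), one pair per selector function."""
    for choice in product(*(alt.items() for alt in ALTERNATIVES)):
        atoms = frozenset(atom for atom, _ in choice)
        prob = 1.0
        for _, p in choice:
            prob *= p  # choices from different alternatives are independent
        yield atoms, prob

def probability(atom):
    """Sum the probabilities of the possible worlds in which atom is true."""
    return sum(p for world, p in possible_worlds() if holds(atom, world))

print(len(list(possible_worlds())), probability("f"), probability("e"))
```

With one alternative of three atoms and one of two atoms there are six selector functions, hence six possible worlds. In this assumed theory, `f` is true in the worlds selecting `a1` and `b1` or `a3` and `b2`, so its probability is 0.5 × 0.9 + 0.2 × 0.1 = 0.47.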
You do not need to enumerate all possible worlds to compute probabilities. Abduction can be used to find descriptions of the sets of worlds in which a proposition $g$ is true. The atoms in the alternatives are made assumable, with different atoms in the same alternative declared to be inconsistent. If the explanations are pairwise inconsistent, the probability of $g$ can be computed by adding the probabilities of the explanations. If they are not pairwise inconsistent, they can be made pairwise inconsistent.
An ICL theory can be seen as a causal model in which the causal mechanism is specified as a logic program and the background variables, corresponding to the alternatives, have independent probability distributions over them. It may seem that this logic, with only unconditionally independent atoms and a deterministic logic program, is too weak to represent the sort of knowledge required. However, even without logical variables, the independent choice logic can represent anything that can be represented in a belief network, as in the following example:
Consider representing the belief network of Example 8.15 in the ICL. The same technique works for any belief network.
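Fire and tampering have no parents, so they can be represented directly as alternatives:
$$\{fire, nofire\}$$
$$\{tampering, notampering\}$$
The probability distribution over the first alternative is $P(fire) = 0.01$, $P(nofire) = 0.99$. Similarly, $P(tampering) = 0.02$, $P(notampering) = 0.98$.
The dependence of $Smoke$ on $Fire$ can be represented using two alternatives:
$$\{smokeWhenFire, nosmokeWhenFire\}$$
$$\{smokeWhenNoFire, nosmokeWhenNoFire\}$$
with $P(smokeWhenFire) = 0.9$ and $P(smokeWhenNoFire) = 0.01$. Two rules can be used to specify when there is smoke:
$$smoke \leftarrow fire \wedge smokeWhenFire.$$
$$smoke \leftarrow \sim fire \wedge smokeWhenNoFire.$$
where $\sim$ is negation as failure, and so these clauses mean their completion.
To represent how $Alarm$ depends on $Fire$ and $Tampering$, there are four alternatives:
$$\{alarmWhenTamperingFire, noalarmWhenTamperingFire\}$$
$$\{alarmWhenNoTamperingFire, noalarmWhenNoTamperingFire\}$$
$$\{alarmWhenTamperingNoFire, noalarmWhenTamperingNoFire\}$$
$$\{alarmWhenNoTamperingNoFire, noalarmWhenNoTamperingNoFire\}$$
where $P(alarmWhenTamperingFire) = 0.5$, $P(alarmWhenNoTamperingFire) = 0.99$, and similarly for the other atoms using the probabilities from Example 8.15. There are also rules specifying when $alarm$ is true, depending on tampering and fire:
$$alarm \leftarrow tampering \wedge fire \wedge alarmWhenTamperingFire.$$
$$alarm \leftarrow \sim tampering \wedge fire \wedge alarmWhenNoTamperingFire.$$
$$alarm \leftarrow tampering \wedge \sim fire \wedge alarmWhenTamperingNoFire.$$
$$alarm \leftarrow \sim tampering \wedge \sim fire \wedge alarmWhenNoTamperingNoFire.$$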
Other random variables are represented analogously, using the same number of alternatives as there are assignments of values to the parents of a node.
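That this encoding preserves the belief network's distribution can be checked numerically for a single node. The sketch below handles only the dependence of smoke on fire. The atom names and the three probabilities are assumptions used for illustration, and the rules are written out as a Python function rather than interpreted by a logic programming system.

```python
from itertools import product

# Assumed numbers for illustration: P(fire), P(smoke | fire), P(smoke | no fire).
P_FIRE, P_SMOKE_IF_FIRE, P_SMOKE_IF_NO_FIRE = 0.01, 0.9, 0.01

# One alternative for the root variable and one per parent assignment of smoke.
ALTERNATIVES = [
    {"fire": P_FIRE, "nofire": 1 - P_FIRE},
    {"smokeWhenFire": P_SMOKE_IF_FIRE,
     "nosmokeWhenFire": 1 - P_SMOKE_IF_FIRE},
    {"smokeWhenNoFire": P_SMOKE_IF_NO_FIRE,
     "nosmokeWhenNoFire": 1 - P_SMOKE_IF_NO_FIRE},
]

def smoke(world):
    """The two rules for smoke, using negation as failure on fire."""
    return (("fire" in world and "smokeWhenFire" in world)
            or ("fire" not in world and "smokeWhenNoFire" in world))

def prob_smoke():
    """P(smoke) as the total probability of the worlds where smoke holds."""
    total = 0.0
    for choice in product(*(alt.items() for alt in ALTERNATIVES)):
        world = {atom for atom, _ in choice}
        weight = 1.0
        for _, p in choice:
            weight *= p
        if smoke(world):
            total += weight
    return total

# The same marginal computed directly from the conditional probabilities.
expected = P_FIRE * P_SMOKE_IF_FIRE + (1 - P_FIRE) * P_SMOKE_IF_NO_FIRE
print(prob_smoke(), expected)
```

The two numbers agree because, in each world, only the alternative matching the actual value of fire affects whether smoke is true; the other alternative is summed out.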
An ICL representation of a conditional probability can be seen as a rule form of a decision tree with probabilities at the leaves. There is a rule and an alternative for each branch. Non-binary alternatives are useful when non-binary variables are involved.
The independent choice logic may not seem very intuitive for representing standard belief networks, but it can make complicated relational models much simpler, as in the following example.
Consider the parameterized version of the multidigit addition of Example 15.19. The plates correspond to logical variables.
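There are three cases for the value of $Z(D, P, S, T)$. The first is when the student knows addition at this time, and the student did not make a mistake. In this case, they get the correct answer:
$$\begin{aligned} z(D, P, S, T, V) \leftarrow{} & x(D, P, V_x) \wedge y(D, P, V_y) \wedge carry(D, P, S, T, V_c) \wedge {} \\ & knows\_add(S, T) \wedge no\_mistake(D, P, S, T) \wedge V \text{ is } (V_x + V_y + V_c) \bmod 10. \end{aligned}$$
We use the convention that the last variable in the atom corresponds to the value. Thus the atom $z(D, P, S, T, V)$ is true when the parameterized random variable $Z(D, P, S, T)$ has value $V$, and similarly for the other atoms.
There is an alternative for whether or not the student happened to make a mistake in this instance:
$$\forall D\, \forall P\, \forall S\, \forall T\; \{no\_mistake(D, P, S, T),\ mistake(D, P, S, T)\}$$
where the probability of $no\_mistake(D, P, S, T)$ is $0.95$, assuming students make an error in 5% of the cases even when they know how to do arithmetic.
The second case is when the student knows addition at this time but makes a mistake. In this case, we assume that the students are equally likely to pick each of the digits:
$$z(D, P, S, T, V) \leftarrow knows\_add(S, T) \wedge mistake(D, P, S, T) \wedge select\_digit(D, P, S, T, V).$$
There is an alternative that specifies which digit the student chose:
$$\forall D\, \forall P\, \forall S\, \forall T\; \{select\_digit(D, P, S, T, V) \mid V \in \{0, \dots, 9\}\}$$
Suppose that, for each $v$, the probability of $select\_digit(D, P, S, T, v)$ is $0.1$.
The final case is when the student does not know addition. In this case, the student selects a digit at random:
$$z(D, P, S, T, V) \leftarrow \sim knows\_add(S, T) \wedge select\_digit(D, P, S, T, V).$$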
These three rules cover all of the rules for $z$; it is much simpler than the table of size greater than 4000 that was required for the tabular representation and it also allows for arbitrary digits, problems, students, and times. Different digits and problems give different values for $x(D, P)$, and different students and times have different values for whether they know addition.
The rules for $carry$ are similar. The main difference is that the carry in the body of the rule depends on the previous digit.
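The conditional distribution that the three cases define can be written down directly. The following is a minimal sketch under stated assumptions: the function name `z_distribution` is made up for illustration, the digits are assumed to be non-blank, and the correct answer for a digit is taken to be the sum of the two digits and the carry, modulo 10.

```python
P_MISTAKE = 0.05  # from the text: an error in 5% of cases when addition is known
P_SELECT = 0.1    # each of the ten digits is equally likely to be picked

def z_distribution(x, y, carry, knows_add):
    """P(Z = v | x, y, carry, knows_add) for v in 0..9, combining the three cases."""
    correct = (x + y + carry) % 10  # assumed definition of the correct digit
    dist = {}
    for v in range(10):
        if knows_add:
            # Case 1: no mistake, so the answer is the correct digit.
            no_mistake_part = (1 - P_MISTAKE) if v == correct else 0.0
            # Case 2: a mistake, followed by a uniformly selected digit.
            mistake_part = P_MISTAKE * P_SELECT
            dist[v] = no_mistake_part + mistake_part
        else:
            # Case 3: the student does not know addition and picks at random.
            dist[v] = P_SELECT
    return dist

knows = z_distribution(7, 8, 0, True)
guesses = z_distribution(7, 8, 0, False)
print(knows[5], knows[3], guesses[5])
```

A student who knows addition gives the correct digit with probability 0.95 + 0.05 × 0.1 = 0.955, since a mistaken random pick can still land on the right digit. Two numbers and one uniform distribution replace the table of thousands of entries.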
Whether a student knows addition at any time depends on whether they knew addition at the previous time. Presumably, the student’s knowledge also depends on what actions occur (what the student and the teacher do). Because the ICL allows standard logic programs (with “noise”), either of the representations for modeling change introduced at the start of this chapter can be used.
AILog, as used in the previous chapters, also implements ICL.