Full text of the second edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2017 is now available.

## 6.3 Belief Networks

The notion of conditional independence can be used to give
a concise representation of many domains. The idea is that, given a
random variable *X*, a small set of variables may exist that
directly affect the variable's value in the sense that *X* is
conditionally independent of
other variables given values for the directly
affecting variables. The set of locally affecting variables is called
the **Markov blanket**. This locality is what is exploited in
a belief
network. A **belief network** is a directed model of conditional dependence among a set of random
variables. The precise statement of conditional independence in a
belief network takes into account the directionality.

To define a belief network, start with a set of random variables
that represent all of the features of the model. Suppose
these variables are *{X _{1},...,X_{n}}*.
Next, select a total ordering of the variables, say *X_{1},...,X_{n}*. The chain rule (Proposition 6.3) shows how to decompose a conjunction into conditional probabilities:

P(X_{1}=v_{1} ∧ X_{2}=v_{2} ∧ ··· ∧ X_{n}=v_{n}) = ∏_{i=1}^{n} P(X_{i}=v_{i} | X_{1}=v_{1} ∧ ··· ∧ X_{i-1}=v_{i-1}).

Or, in terms of random variables and probability distributions,

P(X_{1}, X_{2}, ···, X_{n}) = ∏_{i=1}^{n} P(X_{i} | X_{1}, ···, X_{i-1}).

Define the **parents** of random variable *X_{i}*, written *parents(X_{i})*, to be a minimal set of predecessors of *X_{i}* in the total ordering such that the other predecessors of *X_{i}* are conditionally independent of *X_{i}* given *parents(X_{i})*. That is, *parents(X_{i}) ⊆ {X_{1},...,X_{i-1}}* such that

P(X_{i} | X_{i-1}, ..., X_{1}) = P(X_{i} | parents(X_{i})).

If more than one minimal set exists, any minimal set can be chosen to be the parents. There can be more than one minimal set only when some of the predecessors are deterministic functions of others.

We can put the chain rule and the definition of parents together, giving

P(X_{1}, X_{2}, ···, X_{n}) = ∏_{i=1}^{n} P(X_{i} | parents(X_{i})).

The probability over all of the variables, *P(X_{1}, X_{2}, ···, X_{n})*, is called the **joint probability distribution**. A belief network defines a **factorization** of the joint probability distribution, where the conditional probabilities form factors that are multiplied together.
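This factorization can be illustrated with a small sketch. The Python below uses a hypothetical three-variable chain A → B → C with made-up numbers (none of these values come from the text); the product of the conditional factors defines a proper joint distribution:

```python
from itertools import product

# A minimal sketch of the factorization P(X1,...,Xn) = prod_i P(Xi | parents(Xi)),
# for a hypothetical chain A -> B -> C (illustrative numbers only).
p_a = {True: 0.3, False: 0.7}           # P(A)
p_b = {True: {True: 0.8, False: 0.1},   # P(B=b | A=a), indexed as p_b[b][a]
       False: {True: 0.2, False: 0.9}}
p_c = {True: {True: 0.5, False: 0.4},   # P(C=c | B=b), indexed as p_c[c][b]
       False: {True: 0.5, False: 0.6}}

def joint(a, b, c):
    """Joint probability as the product of the conditional factors."""
    return p_a[a] * p_b[b][a] * p_c[c][b]

# The factors define a proper joint distribution: its 8 entries sum to 1.
total = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))
print(round(total, 10))  # 1.0
```

Each factor is a (conditional) probability distribution that sums to 1 over its variable, so the product automatically sums to 1 over all complete assignments.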

A **belief network**, also called a **Bayesian network**, is an acyclic directed graph (DAG), where the nodes are random variables. There is an arc from each element of *parents(X_{i})* into *X_{i}*. Associated with the belief network is a set of conditional probability distributions - the conditional probability of each variable given its parents (which includes the prior probabilities of those variables with no parents).

Thus, a belief network consists of

- a DAG, where each node is labeled by a random variable;
- a domain for each random variable; and
- a set of conditional probability distributions giving *P(X|parents(X))* for each variable *X*.

A belief network is acyclic by construction. The way the chain rule decomposes the conjunction gives the ordering. A variable can have only predecessors as parents. Different decompositions can result in different belief networks.

**Example 6.10:** Suppose we want to use the diagnostic assistant to diagnose whether there is a fire in a building based on noisy sensor information and possibly conflicting explanations of what could be going on. The agent receives a report about whether everyone is leaving the building. Suppose the report sensor is noisy: It sometimes reports leaving when there is no exodus (a false positive), and it sometimes does not report when everyone is leaving (a false negative). Suppose the fire alarm going off can cause the leaving, but this is not a deterministic relationship. Either tampering or fire could affect the alarm. Fire also causes smoke to rise from the building.

Suppose we use the following variables, all of which are Boolean, in the following order:

- *Tampering* is true when there is tampering with the alarm.
- *Fire* is true when there is a fire.
- *Alarm* is true when the alarm sounds.
- *Smoke* is true when there is smoke.
- *Leaving* is true if there are many people leaving the building at once.
- *Report* is true if there is a report given by someone of people leaving. *Report* is false if there is no report of leaving.

The variable *Report* denotes the sensor report that people are
leaving. This information is unreliable because the person issuing such a report could be playing a practical
joke, or no one who could have given such a report may have been paying attention. This variable is introduced to allow conditioning on
unreliable sensor data. The agent knows what the sensor reports,
but it only has unreliable evidence about people leaving the building.
As part of the domain, assume the following conditional independencies:

- *Fire* is conditionally independent of *Tampering* (given no other information).
- *Alarm* depends on both *Fire* and *Tampering*. That is, we are making no independence assumptions about how *Alarm* depends on its predecessors given this variable ordering.
- *Smoke* depends only on *Fire* and is conditionally independent of *Tampering* and *Alarm* given whether there is a *Fire*.
- *Leaving* only depends on *Alarm* and not directly on *Fire* or *Tampering* or *Smoke*. That is, *Leaving* is conditionally independent of the other variables given *Alarm*.
- *Report* only directly depends on *Leaving*.

The belief network of Figure 6.1 expresses these dependencies.

This network represents the factorization

P(Tampering, Fire, Alarm, Smoke, Leaving, Report) = P(Tampering) × P(Fire) × P(Alarm | Tampering, Fire) × P(Smoke | Fire) × P(Leaving | Alarm) × P(Report | Leaving).

We also must define the domain of each variable. Assume that the
variables are Boolean; that is, they have domain *{true,false}*. We
use the lower-case variant of the variable to represent the true value
and use negation for the false value. Thus, for example,
*Tampering=true* is written as *tampering*, and
*Tampering=false* is written as *¬tampering*.

The examples that follow assume the following conditional probabilities:

*P(tampering) = 0.02*

*P(fire) = 0.01*

*P(alarm | fire ∧ tampering) = 0.5*

*P(alarm | fire ∧ ¬tampering) = 0.99*

*P(alarm | ¬fire ∧ tampering) = 0.85*

*P(alarm | ¬fire ∧ ¬tampering) = 0.0001*

*P(smoke | fire) = 0.9*

*P(smoke | ¬fire) = 0.01*

*P(leaving | alarm) = 0.88*

*P(leaving | ¬alarm) = 0.001*

*P(report | leaving) = 0.75*

*P(report | ¬leaving) = 0.01*
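These conditional probabilities can be encoded directly and multiplied together, as the factorization prescribes. The Python sketch below (the dict-based encoding is one representation among many) computes the probability of one complete assignment of values:

```python
from itertools import product

# The conditional probabilities of Example 6.10, encoded as Python dicts.
# All variables are Boolean; each entry gives P(variable = true | parents).
P_tampering = 0.02
P_fire = 0.01
P_alarm = {(True, True): 0.5, (True, False): 0.99,      # key: (fire, tampering)
           (False, True): 0.85, (False, False): 0.0001}
P_smoke = {True: 0.9, False: 0.01}       # key: fire
P_leaving = {True: 0.88, False: 0.001}   # key: alarm
P_report = {True: 0.75, False: 0.01}     # key: leaving

def p(prob_true, value):
    """P(variable = value), given the probability that the variable is true."""
    return prob_true if value else 1 - prob_true

def joint(ta, fi, al, sm, le, re):
    """Joint probability of one complete assignment, via the factorization."""
    return (p(P_tampering, ta) * p(P_fire, fi) *
            p(P_alarm[(fi, ta)], al) * p(P_smoke[fi], sm) *
            p(P_leaving[al], le) * p(P_report[le], re))

# For example, P(¬tampering ∧ fire ∧ alarm ∧ smoke ∧ leaving ∧ report)
# = 0.98 × 0.01 × 0.99 × 0.9 × 0.88 × 0.75 ≈ 0.00576
print(joint(False, True, True, True, True, True))
```

Summing `joint` over all 64 complete assignments gives 1, confirming that the six small tables define a proper joint distribution.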

- For each wire *w_{i}*, there is a random variable, *W_{i}*, with domain *{live,dead}*, which denotes whether there is power in wire *w_{i}*. *W_{i}=live* means wire *w_{i}* has power. *W_{i}=dead* means there is no power in wire *w_{i}*.
- *Outside_power*, with domain *{live,dead}*, denotes whether there is power coming into the building.
- For each switch *s_{i}*, variable *S_{i}_pos* denotes the position of *s_{i}*. It has domain *{up,down}*.
- For each switch *s_{i}*, variable *S_{i}_st* denotes the state of switch *s_{i}*. It has domain *{ok,upside_down,short,intermittent,broken}*. *S_{i}_st=ok* means switch *s_{i}* is working normally. *S_{i}_st=upside_down* means switch *s_{i}* is installed upside-down. *S_{i}_st=short* means switch *s_{i}* is shorted and acting as a wire. *S_{i}_st=broken* means switch *s_{i}* is broken and does not allow electricity to flow.
- For each circuit breaker *cb_{i}*, variable *Cb_{i}_st* has domain *{on,off}*. *Cb_{i}_st=on* means power can flow through *cb_{i}* and *Cb_{i}_st=off* means that power cannot flow through *cb_{i}*.
- For each light *l_{i}*, variable *L_{i}_st* with domain *{ok,intermittent,broken}* denotes the state of the light. *L_{i}_st=ok* means light *l_{i}* will light if powered, *L_{i}_st=intermittent* means light *l_{i}* intermittently lights if powered, and *L_{i}_st=broken* means light *l_{i}* does not work.

**Example 6.11:** Consider the wiring example of Figure 1.8. Suppose we decide to have variables for whether lights are lit, for the switch positions, for whether lights and switches are faulty or not, and for whether there is power in the wires. The variables are defined in Figure 6.2.

Let's select an ordering where the causes of a variable are before the variable in the ordering. For example, the variable for whether a light is lit comes after variables for whether the light is working and whether there is power coming into the light.

Whether light *l_{1}* is lit depends only on whether there is power in wire *w_{0}* and whether light *l_{1}* is working properly. Other variables, such as the position of switch *s_{1}*, whether light *l_{2}* is lit, or who is the Queen of Canada, are irrelevant. Thus, the parents of *L_{1}_lit* are *W_{0}* and *L_{1}_st*.
_{1}_stConsider variable *W _{0}*, which represents whether there is power in
wire

*w*. If we knew whether there was power in wires

_{0}*w*and

_{1}*w*, and we knew the position of switch

_{2}*s*and whether the switch was working properly, the value of the other variables (other than

_{2}*L*) would not affect our belief in whether there is power in wire

_{1}_lit*w*. Thus, the parents of

_{0}*W*should be

_{0}*S*,

_{2}_Pos*S*,

_{2}_st*W*, and

_{1}*W*.

_{2}Figure 6.2 shows the resulting belief network after the independence of each variable has been considered. The belief network also contains the domains of the variables, as given in the figure, and conditional probabilities of each variable given its parents.

For the variable *W_{1}*, the following conditional probabilities must be specified:

P(W_{1}=live | S_{1}_pos=up ∧ S_{1}_st=ok ∧ W_{3}=live)
P(W_{1}=live | S_{1}_pos=up ∧ S_{1}_st=ok ∧ W_{3}=dead)
P(W_{1}=live | S_{1}_pos=up ∧ S_{1}_st=upside_down ∧ W_{3}=live)
...
P(W_{1}=live | S_{1}_pos=down ∧ S_{1}_st=broken ∧ W_{3}=dead).

There are two values for *S_{1}_pos*, five values for *S_{1}_st*, and two values for *W_{3}*, so there are *2 × 5 × 2 = 20* different cases where a value for *W_{1}=live* must be specified. As far as probability theory is concerned, the probability for *W_{1}=live* for these 20 cases could be assigned arbitrarily. Of course, knowledge of the domain constrains what values make sense. The values for *W_{1}=dead* can be computed from the values for *W_{1}=live* for each of these cases.

Because the variable *S_{1}_st* has no parents, it requires a prior distribution, which can be specified as the probabilities for all but one of the values; the remaining value can be derived from the constraint that all of the probabilities sum to 1. Thus, to specify the distribution of *S_{1}_st*, four of the following five probabilities must be specified:

P(S_{1}_st=ok)
P(S_{1}_st=upside_down)
P(S_{1}_st=short)
P(S_{1}_st=intermittent)
P(S_{1}_st=broken)

The other variables are represented analogously.

A belief network is a graphical representation of conditional independence. The independence allows us to depict direct effects within the graph and prescribes which probabilities must be specified. Arbitrary posterior probabilities can be derived from the network.

The independence assumption embedded in a belief network is as follows:
*Each random variable is conditionally independent of its non-descendants given its parents*. That is, if *X* is a random variable with parents *Y_{1},...,Y_{n}*, all random variables that are not descendants of *X* are conditionally independent of *X* given *Y_{1},...,Y_{n}*:

P(X | Y_{1},...,Y_{n}, R) = P(X | Y_{1},...,Y_{n}),

if *R* does not involve a descendant of *X*. For this definition, we
include *X* as a descendant of itself. The
right-hand side of this equation is the form of the probabilities that
are specified as part of the belief network.
*R* may involve ancestors of *X* and other nodes as long as they are
not descendants of *X*. The independence
assumption states that all of the influence of non-descendant
variables is captured by knowing the value of *X*'s parents.
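The independence assumption can be checked numerically. In the network of Example 6.10, *Tampering* is a non-descendant of *Smoke*, and *Fire* is *Smoke*'s only parent, so P(smoke | fire ∧ tampering) must equal P(smoke | fire) = 0.9. A brute-force sketch, summing the joint of Example 6.10 over all worlds:

```python
from itertools import product

# CPTs of Example 6.10 (Boolean variables; entries give P(var = true | parents)).
P_tampering, P_fire = 0.02, 0.01
P_alarm = {(True, True): 0.5, (True, False): 0.99,      # key: (fire, tampering)
           (False, True): 0.85, (False, False): 0.0001}
P_smoke = {True: 0.9, False: 0.01}
P_leaving = {True: 0.88, False: 0.001}
P_report = {True: 0.75, False: 0.01}

def p(prob_true, value):
    return prob_true if value else 1 - prob_true

def joint(ta, fi, al, sm, le, re):
    return (p(P_tampering, ta) * p(P_fire, fi) * p(P_alarm[(fi, ta)], al) *
            p(P_smoke[fi], sm) * p(P_leaving[al], le) * p(P_report[le], re))

def prob(event):
    """P(event), summing the joint over all 64 worlds that satisfy it."""
    return sum(joint(*w) for w in product([True, False], repeat=6) if event(*w))

# Given the parent Fire, the non-descendant Tampering is irrelevant to Smoke.
lhs = (prob(lambda ta, fi, al, sm, le, re: sm and fi and ta) /
       prob(lambda ta, fi, al, sm, le, re: fi and ta))
rhs = (prob(lambda ta, fi, al, sm, le, re: sm and fi) /
       prob(lambda ta, fi, al, sm, le, re: fi))
print(round(lhs, 6), round(rhs, 6))  # 0.9 0.9
```

The same check fails if the conditioning set includes a descendant of the variable, which is exactly the restriction in the independence statement.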

Often, we refer to just the labeled DAG as a belief network. When this is done, it is important to remember that a domain for each variable and a set of conditional probability distributions are also part of the network.

The number of probabilities that must be specified for each variable is exponential in the number of parents of the variable. The independence assumption is useful insofar as the number of variables that directly affect another variable is small. You should order the variables so that nodes have as few parents as possible.
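For Boolean variables this arithmetic is easy to make concrete. In the network of Example 6.10 (a sketch; the counts follow directly from the parent sets), each node needs one free number per assignment to its parents:

```python
# Each Boolean node needs 2**(number of parents) free numbers,
# one P(X = true | parent values) per assignment to its parents.
num_parents = {"Tampering": 0, "Fire": 0, "Alarm": 2,
               "Smoke": 1, "Leaving": 1, "Report": 1}
table_sizes = {v: 2 ** k for v, k in num_parents.items()}
print(table_sizes["Alarm"])       # 4
print(sum(table_sizes.values()))  # 12
```

Twelve numbers specify the whole network, versus 2^6 - 1 = 63 for an unstructured joint distribution over six Boolean variables; the gap widens exponentially as variables are added, which is the practical payoff of keeping parent sets small.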

**Belief Networks and Causality**

Belief networks have often been called **causal
networks** and
have been claimed to be a good representation of causality.
Recall that a causal model predicts the result of interventions.
Suppose you have in mind
a causal model of a domain, where the domain is specified in terms of
a set of random variables. For each pair of random variables *X_{1}* and *X_{2}*, if a direct causal connection exists from *X_{1}* to *X_{2}* (i.e., intervening to change *X_{1}* in some context of other variables affects *X_{2}*, and this cannot be modeled by having some intervening variable), add an arc from *X_{1}* to *X_{2}*. You would expect that the causal model would obey the independence assumption of the belief network. Thus, all of the conclusions of the belief network would be valid.

You would also expect such a graph to be acyclic; you do not
want something eventually causing itself. This assumption is reasonable if you
consider that the random variables represent particular events rather than
event types. For example, consider a causal chain that "being
stressed" causes you to "work inefficiently," which, in turn, causes
you to "be stressed." To break the apparent cycle, we can represent "being stressed" at different stages
as different random variables that refer to different times. Being
stressed in the past causes you to not work well at the moment, which
causes you to be stressed in the future. The variables should
satisfy the **clarity principle** and
have a well-defined meaning. The variables
should not be seen as event types.

The belief network itself has nothing to say about causation, and it can represent non-causal independence, but it seems particularly appropriate when there is causality in a domain. Adding arcs that represent local causality tends to produce a small belief network. The belief network of Figure 6.2 shows how this can be done for a simple domain.

A
causal network models **interventions**. If someone were to artificially force a variable to have a particular value, the variable's descendants - but no other nodes - would be
affected. Finally, you can see how the causality in belief networks relates to
the causal and evidential reasoning discussed in
Section 5.7. A causal belief network can be seen as a
way of axiomatizing in a causal direction. Reasoning in belief networks
corresponds to abducing to causes and then
predicting from these. A direct mapping exists between the logic-based
abductive view discussed in Section 5.7 and belief
networks: Belief networks can be modeled as logic programs with
probabilities over possible hypotheses. This is described in Section 14.3.

Note the restriction "each random variable is conditionally independent of its non-descendants given its
parents" in the definition of the independence
encoded in a belief network. If
*R* contains a descendant of variable *X*, the independence assumption is not
directly applicable.

**Example 6.12:** In Figure 6.2, variables *S_{3}_pos*, *S_{3}_st*, and *W_{3}* are the parents of variable *W_{4}*. If you know the values of *S_{3}_pos*, *S_{3}_st*, and *W_{3}*, knowing whether or not *l_{1}* is lit or knowing the value of *Cb_{1}_st* will not affect your belief in whether there is power in wire *w_{4}*. However, even if you knew the values of *S_{3}_pos*, *S_{3}_st*, and *W_{3}*, learning whether *l_{2}* is lit potentially changes your belief in whether there is power in wire *w_{1}*. The independence assumption is not directly applicable.

The variable *S_{1}_pos* has no parents. Thus, the independence embedded in the belief network specifies that *P(S_{1}_pos=up | A) = P(S_{1}_pos=up)* for any *A* that does not involve a descendant of *S_{1}_pos*. If *A* includes a descendant of *S_{1}_pos=up* - for example, if *A* is *S_{2}_pos=up ∧ L_{1}_lit=true* - the independence assumption cannot be directly applied.

This network can be used in a number of ways:

- By conditioning on the knowledge that the switches and circuit breakers are ok, and on the values of the outside power and the position of the switches, this network can simulate how the lighting should work.
- Given values of the outside power and the position of the switches, the network can infer the likelihood of any outcome - for example, how likely it is that *l_{1}* is lit.
*l*is lit._{1} - Given values for the switches and whether the lights are lit, the posterior probability that each switch or circuit breaker is in any particular state can be inferred.
- Given some observations, the network can be used to reason backward to determine the most likely position of switches.
- Given some switch positions, some outputs, and some intermediate values, the network can be used to determine the probability of any other variable in the network.

A belief network specifies a joint probability distribution from which arbitrary conditional probabilities can be derived. A network can be queried by asking for the conditional probability of any variables conditioned on the values of any other variables. This is typically done by providing observations on some variables and querying another variable.

**Example 6.13:** Consider Example 6.10. The prior probabilities (with no evidence) of each variable can be computed using the methods of the next section. The following conditional probabilities follow from the model of Example 6.10, to about three decimal places:

*P(tampering) = 0.02*

*P(fire) = 0.01*

*P(report) = 0.028*

*P(smoke) = 0.0189*

Observing the report gives the following:

*P(tampering | report) = 0.399*

*P(fire | report) = 0.2305*

*P(smoke | report) = 0.215*

As expected, the probabilities of both *tampering* and *fire* are increased by the report. Because the probability of *fire* is increased, so is the probability of *smoke*.

Suppose instead that *smoke* were observed:

*P(tampering | smoke) = 0.02*

*P(fire | smoke) = 0.476*

*P(report | smoke) = 0.320*

Note that the probability of *tampering* is not affected by observing
*smoke*; however, the probabilities of *report* and *fire* are increased.

Suppose that both *report* and *smoke* were observed:

*P(tampering | report ∧ smoke) = 0.0284*

*P(fire | report ∧ smoke) = 0.964*

Observing both makes *fire* even more likely. However, in the
context of the *report*, the presence of *smoke* makes *tampering* less
likely. This is because the *report* is **explained away** by *fire*, which
is now more likely.

Suppose instead that *report*, but not *smoke*, was observed:

*P(tampering | report ∧ ¬smoke) = 0.501*

*P(fire | report ∧ ¬smoke) = 0.0294*

In the context of the *report*, *fire* becomes much less likely and so
the probability of *tampering* increases to explain the *report*.

This example illustrates how the belief net independence assumption gives commonsense conclusions and also demonstrates how explaining away is a consequence of the independence assumption of a belief network.
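The posteriors of Example 6.13 can be reproduced by brute-force enumeration of the joint factorization. The Python sketch below is a naive method (not the efficient algorithms of the next section), but it is enough to exhibit explaining away numerically:

```python
from itertools import product

# CPTs of Example 6.10. Variable order: tampering, fire, alarm, smoke,
# leaving, report (all Boolean; entries give P(var = true | parents)).
P_tampering, P_fire = 0.02, 0.01
P_alarm = {(True, True): 0.5, (True, False): 0.99,      # key: (fire, tampering)
           (False, True): 0.85, (False, False): 0.0001}
P_smoke = {True: 0.9, False: 0.01}
P_leaving = {True: 0.88, False: 0.001}
P_report = {True: 0.75, False: 0.01}

def p(prob_true, value):
    return prob_true if value else 1 - prob_true

def joint(ta, fi, al, sm, le, re):
    return (p(P_tampering, ta) * p(P_fire, fi) * p(P_alarm[(fi, ta)], al) *
            p(P_smoke[fi], sm) * p(P_leaving[al], le) * p(P_report[le], re))

def query(target, evidence):
    """P(target = true | evidence); evidence maps variable index -> value."""
    num = den = 0.0
    for world in product([True, False], repeat=6):
        if all(world[i] == v for i, v in evidence.items()):
            pr = joint(*world)
            den += pr
            if world[target]:
                num += pr
    return num / den

TAMPERING, FIRE, SMOKE, REPORT = 0, 1, 3, 5
print(round(query(TAMPERING, {REPORT: True}), 3))                # 0.399
print(round(query(TAMPERING, {REPORT: True, SMOKE: True}), 4))   # 0.0284
print(round(query(TAMPERING, {REPORT: True, SMOKE: False}), 3))  # 0.501
print(round(query(FIRE, {REPORT: True, SMOKE: True}), 3))        # 0.964
```

Observing *smoke* alongside the *report* drops the posterior of *tampering* from 0.399 to 0.028 while raising *fire* to 0.964: the *report* is explained away by *fire*, exactly as described above.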