# 8.4.1 Variable Elimination for Belief Networks

The variable elimination (VE) algorithm, as used for finding solutions to CSPs and for optimization with soft constraints, can be adapted to find the posterior distribution for a variable in a belief network given conjunctive evidence. Many of the efficient exact methods are variants of this algorithm.

The algorithm is based on the notion that a belief network specifies a factorization of the joint probability distribution.

Before we provide the algorithm, we define factors and the operations that will be performed on them. Recall that $P(X\mid Y)$ is a function from variables (or sets of variables) $X$ and $Y$ into the real numbers that, given a value for $X$ and a value for $Y$, returns the conditional probability of the value for $X$, given the value for $Y$. A function of variables is called a factor. The VE algorithm for belief networks manipulates factors to compute posterior probabilities.

## Conditional Probability Tables

A conditional probability, $P(Y\mid X_{1},\dots,X_{k})$ is a function from the variables $Y,X_{1},\dots,X_{k}$ into non-negative numbers that satisfies the constraints that for each assignment of values to all of $X_{1},\dots,X_{k}$ the values for $Y$ sum to 1. That is, given values to all of the variables, the function returns a number that satisfies the constraint:

 $\forall x_{1}\dots\forall x_{k}\;\sum_{y\in domain(Y)}P(Y{=}y\mid X_{1}{=}x_{1},\dots,X_{k}{=}x_{k})=1$ (8.1)

With a finite set of variables with finite domains, conditional probabilities can be implemented as arrays. If there is an ordering of the variables (e.g., alphabetical) and the values in the domains are mapped into non-negative integers, there is a unique representation of each factor as a one-dimensional array that is indexed by natural numbers. This representation for a conditional probability is called a conditional probability table or CPT.

If the child variable is treated the same as the parent variables, the representation is redundant; more numbers are specified than are required, and a table could be inconsistent if it does not satisfy the above constraint. Using the redundant representation is common, but the following two methods are also used to specify and store probabilities:

• Store unnormalized probabilities, which are non-negative numbers that are proportional to the probabilities. The probabilities can be computed by normalizing: dividing each value by the sum of the values over the domain of $Y$.

• Store the probability for all-but-one of the values of $Y$. In this case, the probability of the remaining value can be computed to obey the constraint above. In particular, if $Y$ is binary, only the probability for one value, say $Y{=}true$, needs to be represented, and the probability for the other value, $Y{=}false$, can be computed from it.
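Both storage schemes above can be sketched in a few lines of Python; the function name and the numbers below are illustrative, not from the text:

```python
def normalize(unnorm):
    """Recover a distribution over domain(Y) from unnormalized,
    non-negative values proportional to the probabilities."""
    total = sum(unnorm.values())
    return {y: w / total for y, w in unnorm.items()}

# Unnormalized storage: values proportional to P(Y).
dist = normalize({'true': 3.0, 'false': 1.0})

# All-but-one storage for a Boolean Y: keep only P(Y=true).
p_true = 0.02
p_false = 1.0 - p_true  # recovered from the sum-to-1 constraint
```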

###### Example 8.18.

Figure 8.6 shows three conditional probability tables. On the top left is $P(Smoke\mid Fire)$ and on the top right is $P(Alarm\mid Fire,Tampering)$, from Example 8.15, both of which use Boolean variables.

These tables do not specify the probability for the child being false. This can be computed from the given probabilities, for example,

 $P(Alarm{=}false\mid Fire{=}false,Tampering{=}{true})=1-0.85=0.15$

On the bottom is a simple example, with domains $\{t,f\}$, which will be used in the following examples.

Given a total ordering of the parents, such as $Fire$ is before $Tampering$ in the right table, and a total ordering of the values, such as $true$ is before $false$, the table can be specified by giving the array of numbers in lexicographic order, such as $[0.5,0.99,0.85,0.0001]$.
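The lexicographic layout in this example can be sketched as follows; the helper name `flat_index` is an illustrative assumption, not from the text:

```python
def flat_index(values, domain_sizes):
    """Map a tuple of value indices to a single array index,
    using lexicographic (row-major) order."""
    idx = 0
    for v, size in zip(values, domain_sizes):
        idx = idx * size + v
    return idx

# P(Alarm=true | Fire, Tampering), with Fire ordered before Tampering
# and true (index 0) before false (index 1), as in the example:
cpt = [0.5, 0.99, 0.85, 0.0001]

# Look up P(Alarm=true | Fire=false, Tampering=true):
p = cpt[flat_index((1, 0), (2, 2))]  # Fire=false -> 1, Tampering=true -> 0
# p == 0.85
```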

## Factors

A factor is a function from a set of random variables into a number. A factor $f$ on variables $X_{1},\ldots,X_{j}$ is written as $f(X_{1},\ldots,X_{j})$. The variables $X_{1},\ldots,X_{j}$ are the variables of the factor $f$, and $f$ is a factor on $X_{1},\ldots,X_{j}$.

Conditional probabilities are factors that also obey the constraint of Equation 8.1. This section describes some operations on factors, including conditioning, multiplying factors and summing out variables. The operations can be used for conditional probabilities, but do not necessarily result in conditional probabilities.

Suppose $f(X_{1},\ldots,X_{j})$ is a factor and each $v_{i}$ is an element of the domain of $X_{i}$. $f(X_{1}{\,{=}\,}v_{1},X_{2}{\,{=}\,}v_{2},\ldots,X_{j}{\,{=}\,}v_{j})$ is a number that is the value of $f$ when each $X_{i}$ has value $v_{i}$. Some of the variables of a factor can be assigned to values to make a new factor on the other variables. This operation is called conditioning on the values of the variables that are assigned. For example, $f(X_{1}{\,{=}\,}v_{1},X_{2},\ldots,X_{j})$, sometimes written as $f(X_{1},X_{2},\ldots,X_{j})_{X_{1}{\,{=}\,}v_{1}}$, where $v_{1}$ is an element of the domain of variable $X_{1}$, is a factor on $X_{2},\ldots,X_{j}$.

###### Example 8.19.

Figure 8.7 shows a factor $r(X,Y,Z)$ on variables $X$, $Y$ and $Z$ as a table. This assumes that each variable is binary with domain $\{t,f\}$. This factor could be obtained from the last conditional probability table given in Figure 8.6. Figure 8.7 also gives a table for the factor $r(X{=}t,Y,Z)$, which is a factor on $Y,Z$. Similarly, $r(X{=}t,Y,Z{=}f)$ is a factor on $Y$, and $r(X{=}t,Y{=}f,Z{=}f)$ is a number.
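Conditioning can be sketched with a factor stored as a pair of a variable list and a table mapping value tuples to numbers; this representation, and the numbers in the small factor below, are illustrative assumptions (Figure 8.7's values are not reproduced in the text):

```python
def condition(factor, var, value):
    """Assign var=value in a factor, returning a factor
    on the remaining variables."""
    variables, table = factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i+1:]
    new_table = {key[:i] + key[i+1:]: p
                 for key, p in table.items() if key[i] == value}
    return new_vars, new_table

# An illustrative factor f(X1, X2) on Boolean variables:
f = (['X1', 'X2'], {('t', 't'): 0.1, ('t', 'f'): 0.9,
                    ('f', 't'): 0.2, ('f', 'f'): 0.8})

f_x1_t = condition(f, 'X1', 't')  # a factor on X2 alone
```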

Factors can be multiplied together. Suppose $f_{1}$ and $f_{2}$ are factors, where $f_{1}$ is a factor that contains variables $X_{1},\ldots,X_{i}$ and $Y_{1},\ldots,Y_{j}$, and $f_{2}$ is a factor with variables $Y_{1},\ldots,Y_{j}$ and $Z_{1},\ldots,Z_{k}$, where $Y_{1},\ldots,Y_{j}$ are the variables common to $f_{1}$ and $f_{2}$. The product of $f_{1}$ and $f_{2}$, written $f_{1}*f_{2}$, is a factor on the union of the variables, namely $X_{1},\ldots,X_{i},Y_{1},\ldots,Y_{j},Z_{1},\ldots,Z_{k}$, defined by:

 $\displaystyle(f_{1}*f_{2})(X_{1},\ldots,X_{i},Y_{1},\ldots,Y_{j},Z_{1},\ldots,Z_{k})=f_{1}(X_{1},\ldots,X_{i},Y_{1},\ldots,Y_{j})*f_{2}(Y_{1},\ldots,Y_{j},Z_{1},\ldots,Z_{k}).$
###### Example 8.20.

Figure 8.8 shows the product of $f_{1}(A,B)$ and $f_{2}(B,C)$, which is a factor on $A,B,C$. Note that $(f_{1}*f_{2})(A{=}t,B{=}f,C{=}f)=f_{1}(A{=}t,B{=}f)*f_{2}(B{=}f,C{=}f)=0.9*0.4=0.36$.
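Multiplication can be sketched in the same pair-of-variable-list-and-table representation. Only the entries $f_{1}(A{=}t,B{=}f)=0.9$ and $f_{2}(B{=}f,C{=}f)=0.4$ come from the text; the other table entries below are illustrative:

```python
def multiply(f1, f2):
    """Pointwise product of two factors, on the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    shared = [v for v in vars1 if v in vars2]
    out = {}
    for k1, p1 in t1.items():
        a = dict(zip(vars1, k1))
        for k2, p2 in t2.items():
            b = dict(zip(vars2, k2))
            if all(a[v] == b[v] for v in shared):  # agree on shared vars
                merged = {**a, **b}
                out[tuple(merged[v] for v in out_vars)] = p1 * p2
    return out_vars, out

f1 = (['A', 'B'], {('t', 't'): 0.1, ('t', 'f'): 0.9,
                   ('f', 't'): 0.2, ('f', 'f'): 0.8})
f2 = (['B', 'C'], {('t', 't'): 0.3, ('t', 'f'): 0.7,
                   ('f', 't'): 0.6, ('f', 'f'): 0.4})

prod = multiply(f1, f2)  # a factor on A, B, C
# prod[1][('t', 'f', 'f')] == 0.9 * 0.4 == 0.36
```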

The remaining operation is to sum out a variable in a factor. Given factor $f(X_{1},\ldots,X_{j})$, summing out a variable, say $X_{1}$, results in a factor on the other variables, $X_{2},\ldots,X_{j}$, defined by

 $(\sum_{X_{1}}f)(X_{2},\ldots,X_{j})=f(X_{1}{\,{=}\,}v_{1},X_{2},\ldots,X_{j})+\cdots+f(X_{1}{\,{=}\,}v_{k},X_{2},\ldots,X_{j}),$

where $\{v_{1},\ldots,v_{k}\}$ is the set of possible values of variable $X_{1}$.

###### Example 8.21.

Figure 8.9 gives an example of summing out variable $B$ from a factor $f_{3}(A,B,C)$, which is a factor on $A,C$. Notice how

 $\displaystyle(\sum_{B}f_{3})(A{\,{=}\,}t,C{\,{=}\,}f)=f_{3}(A{\,{=}\,}t,B{\,{=}\,}t,C{\,{=}\,}f)+f_{3}(A{\,{=}\,}t,B{\,{=}\,}f,C{\,{=}\,}f)=0.07+0.36=0.43.$
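Summing out can be sketched in the same way. The entries $f_{3}(A{=}t,B{=}t,C{=}f)=0.07$ and $f_{3}(A{=}t,B{=}f,C{=}f)=0.36$ come from Example 8.21; the remaining values of $f_{3}$ below are illustrative:

```python
def sum_out(factor, var):
    """Sum a variable out of a factor, giving a factor on the rest."""
    variables, table = factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i+1:]
    out = {}
    for key, p in table.items():
        k = key[:i] + key[i+1:]
        out[k] = out.get(k, 0.0) + p
    return new_vars, out

f3 = (['A', 'B', 'C'], {('t', 't', 't'): 0.03, ('t', 't', 'f'): 0.07,
                        ('t', 'f', 't'): 0.54, ('t', 'f', 'f'): 0.36,
                        ('f', 't', 't'): 0.06, ('f', 't', 'f'): 0.14,
                        ('f', 'f', 't'): 0.48, ('f', 'f', 'f'): 0.32})

g = sum_out(f3, 'B')  # a factor on A, C
# g[1][('t', 'f')] == 0.07 + 0.36 == 0.43
```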

## Variable Elimination

Given evidence $Y_{1}{\,{=}\,}v_{1}$, …, $Y_{j}{\,{=}\,}v_{j}$, and query variable or variables $Q$, the problem of computing the posterior distribution on $Q$ can be reduced to the problem of computing the probability of conjunctions:

 $\displaystyle P(Q\mid Y_{1}{\,{=}\,}v_{1},\ldots,Y_{j}{\,{=}\,}v_{j})=\frac{P(Q,Y_{1}{\,{=}\,}v_{1},\ldots,Y_{j}{\,{=}\,}v_{j})}{P(Y_{1}{\,{=}\,}v_{1},\ldots,Y_{j}{\,{=}\,}v_{j})}=\frac{P(Q,Y_{1}{\,{=}\,}v_{1},\ldots,Y_{j}{\,{=}\,}v_{j})}{\displaystyle\sum_{Q}P(Q,Y_{1}{\,{=}\,}v_{1},\ldots,Y_{j}{\,{=}\,}v_{j})}.$

The algorithm computes the factor $P(Q,Y_{1}{\,{=}\,}v_{1},\ldots,Y_{j}{\,{=}\,}v_{j})$ and normalizes. Note that this is a factor only of $Q$; given a value for $Q$, it returns a number that is the probability of the conjunction of the evidence and the value for $Q$.

Suppose the variables of the belief network are $X_{1},\ldots,X_{n}$. To compute the factor $P(Q,Y_{1}{\,{=}\,}v_{1},\ldots,Y_{j}{\,{=}\,}v_{j})$, sum out the other variables from the joint distribution. Suppose $Z_{1},\ldots,Z_{k}$ is an enumeration of the other variables in the belief network, that is,

 $\{Z_{1},\ldots,Z_{k}\}=\{X_{1},\ldots,X_{n}\}\setminus\{Q,Y_{1},\ldots,Y_{j}\}$

and the variables $Z_{i}$ are ordered according to an elimination ordering.

The probability of $Q$ conjoined with the evidence is

 $P(Q,Y_{1}{\,{=}\,}v_{1},\ldots,Y_{j}{\,{=}\,}v_{j})=\sum_{Z_{k}}\cdots\sum_{Z_{1}}P(X_{1},\ldots,X_{n})_{Y_{1}{\,{=}\,}v_{1},\ldots,Y_{j}{\,{=}\,}v_{j}}.$

By the chain rule and the definition of a belief network,

 $\displaystyle P(X_{1},\ldots,X_{n})=\prod_{i=1}^{n}P(X_{i}\mid parents(X_{i}))$

where $parents(X_{i})$ is the set of parents of variable $X_{i}$.

The belief network inference problem is thus reduced to the problem of summing out a set of variables from a product of factors. The distributive law specifies that a sum of products such as $xy+xz$ can be simplified by distributing out the common factors (here $x$), which results in $x(y+z)$. The resulting form is more efficient to compute. Distributing out common factors is the essence of the VE algorithm. The elements multiplied together are called “factors” because of the use of the term in algebra. Initially, the factors represent the conditional probability distributions, but the intermediate factors are just functions on variables, created by multiplying factors and summing out variables.

To compute the posterior distribution of a query variable given observations:

1. Construct a factor for each conditional probability distribution.

2. Eliminate each of the non-query variables:

• if the variable is observed, its value is set to the observed value in each of the factors in which the variable appears,

• otherwise the variable is summed out.

3. Multiply the remaining factors and normalize.

To sum out a variable $Z$ from a product $f_{1}*\cdots*f_{k}$ of factors, first partition the factors into those not containing $Z$, say $f_{1},\ldots,f_{i}$, and those containing $Z$, $f_{i+1},\ldots,f_{k}$; then distribute the common factors out of the sum:

 $\sum_{Z}\>f_{1}*\cdots*f_{k}=f_{1}*\cdots*f_{i}*\left(\sum_{Z}\>f_{i+1}*\cdots*f_{k}\right).$

VE explicitly constructs a representation (in terms of a multidimensional array, a tree, or a set of rules) of the rightmost factor.

Figure 8.10 gives pseudocode for the VE algorithm. The elimination ordering could be given a priori or computed on the fly. It is worthwhile to select observed variables first in the elimination ordering, because eliminating these simplifies the problem.

This algorithm assumes that the query variable is not observed. If it is observed to have a particular value, its posterior probability is just 1 for the observed value and 0 for the other values.
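The three steps can be sketched end-to-end as a minimal implementation, assuming a dict-based factor representation; the helper names and the toy two-variable network below are illustrative assumptions, not the pseudocode of Figure 8.10:

```python
def condition(factor, var, value):
    """Set var=value, giving a factor on the remaining variables."""
    variables, table = factor
    i = variables.index(var)
    return (variables[:i] + variables[i+1:],
            {k[:i] + k[i+1:]: p for k, p in table.items() if k[i] == value})

def multiply(f1, f2):
    """Pointwise product of two factors on the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    shared = [v for v in vars1 if v in vars2]
    out = {}
    for k1, p1 in t1.items():
        a = dict(zip(vars1, k1))
        for k2, p2 in t2.items():
            b = dict(zip(vars2, k2))
            if all(a[v] == b[v] for v in shared):
                m = {**a, **b}
                out[tuple(m[v] for v in out_vars)] = p1 * p2
    return out_vars, out

def sum_out(factor, var):
    """Sum a variable out of a factor."""
    variables, table = factor
    i = variables.index(var)
    out = {}
    for k, p in table.items():
        kk = k[:i] + k[i+1:]
        out[kk] = out.get(kk, 0.0) + p
    return variables[:i] + variables[i+1:], out

def ve(query, evidence, factors, ordering):
    """Condition on the evidence, sum out non-query variables in the
    given order, multiply what remains, and normalize."""
    fs = [condition(f, v, val) if v in f[0] else f
          for f in factors for v, val in [*evidence.items(), (None, None)]
          if v is None or v in f[0]] if False else list(factors)
    for var, val in evidence.items():
        fs = [condition(f, var, val) if var in f[0] else f for f in fs]
    for var in ordering:
        if var == query or var in evidence:
            continue
        with_var = [f for f in fs if var in f[0]]
        if not with_var:
            continue
        fs = [f for f in fs if var not in f[0]]
        prod = with_var[0]
        for f in with_var[1:]:
            prod = multiply(prod, f)
        fs.append(sum_out(prod, var))
    result = fs[0]
    for f in fs[1:]:
        result = multiply(result, f)
    total = sum(result[1].values())
    return {k[0]: p / total for k, p in result[1].items()}

# Toy network A -> B with illustrative numbers:
fA = (['A'], {('t',): 0.3, ('f',): 0.7})
fBA = (['B', 'A'], {('t', 't'): 0.9, ('f', 't'): 0.1,
                    ('t', 'f'): 0.2, ('f', 'f'): 0.8})

post = ve('B', {}, [fA, fBA], ['A', 'B'])
# post['t'] == 0.3*0.9 + 0.7*0.2 == 0.41
```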

###### Example 8.22.

Consider Example 8.15 with the query

 $P(Tampering\mid Smoke{\,{=}\,}true\wedge Report{\,{=}\,}true).$

Suppose it first eliminates the observed variables, $Smoke$ and $Report$. After these are eliminated, the following factors remain:

 $\begin{array}{l|l}\textit{Conditional Probability}&\textit{Factor}\\ \hline P(Tampering)&f_{0}(Tampering)\\ P(Fire)&f_{1}(Fire)\\ P(Alarm\mid Tampering,Fire)&f_{2}(Tampering,Fire,Alarm)\\ P(Smoke{=}true\mid Fire)&f_{3}(Fire)\\ P(Leaving\mid Alarm)&f_{4}(Alarm,Leaving)\\ P(Report{=}true\mid Leaving)&f_{5}(Leaving)\end{array}$

Suppose $Fire$ is next in the elimination ordering. To eliminate $Fire$, collect all of the factors containing $Fire$, namely $f_{1}(Fire)$, $f_{2}(Tampering,Fire,Alarm)$, and $f_{3}(Fire)$, multiply them together, and sum out $Fire$ from the resulting factor. Call this factor $f_{6}(Tampering,Alarm)$. At this stage, $Fs$ contains the factors:

 $f_{0}(Tampering),f_{4}(Alarm,Leaving),f_{5}(Leaving),f_{6}(Tampering,Alarm).$

Suppose $Alarm$ is eliminated next. VE multiplies the factors containing $Alarm$ and sums out $Alarm$ from the product, giving a factor, call it $f_{7}$:

 $f_{7}(Tampering,Leaving)=\sum_{Alarm}f_{4}(Alarm,Leaving)*f_{6}(Tampering,Alarm)$

$Fs$ then contains the factors:

 $f_{0}(Tampering),f_{5}(Leaving),f_{7}(Tampering,Leaving).$

Eliminating $Leaving$ results in the factor

 $f_{8}(Tampering)=\sum_{Leaving}f_{5}(Leaving)*f_{7}(Tampering,Leaving).$

To determine the distribution over $Tampering$, multiply the remaining factors, giving

 $f_{9}(Tampering)=f_{0}(Tampering)*f_{8}(Tampering).$

The posterior distribution over tampering is given by

 $\frac{f_{9}(Tampering)}{\sum_{Tampering}f_{9}(Tampering)}~{}.$

Note that the denominator is the prior probability of the evidence, namely $P(Smoke{\,{=}\,}true\wedge Report{\,{=}\,}true)$.

###### Example 8.23.

Consider the same network as in the previous example but with the following query:

 $P(Alarm\mid Fire{=}true).$

When $Fire$ is eliminated, the factor $P(Fire)$ becomes a factor of no variables; it is just a number, $P(Fire{=}true)$.

Suppose $Report$ is eliminated next. It is in one factor, which represents $P(Report\mid Leaving)$. Summing over all of the values of $Report$ gives a factor on $Leaving$, all of whose values are 1. This is because $P(Report{=}true\mid Leaving{=}v)+P(Report{=}false\mid Leaving{=}v)=1$ for any value $v$ of $Leaving$.

If $Leaving$ is eliminated next, a factor that is all 1 is multiplied by a factor representing $P(Leaving\mid Alarm)$ and $Leaving$ is summed out. This, again, results in a factor all of whose values are 1.

Similarly, eliminating $Smoke$ results in a factor of no variables, whose value is $1$. Note that even if smoke had also been observed, eliminating $Smoke$ would result in a factor of no variables, which would not affect the posterior distribution on $Alarm$.

Eventually, there is only the factor on $Alarm$ that represents its prior probability and a constant factor that will cancel in the normalization.
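The all-1s factor from eliminating $Report$ can be checked directly. The CPT numbers below are illustrative, not from the text; only the structure matters:

```python
from collections import defaultdict

# P(Report | Leaving) as a table keyed by (Report, Leaving);
# the probabilities are illustrative.
cpt = {('t', 't'): 0.75, ('f', 't'): 0.25,
       ('t', 'f'): 0.01, ('f', 'f'): 0.99}

# Summing out the child variable Report:
ones = defaultdict(float)
for (report, leaving), p in cpt.items():
    ones[leaving] += p

# Each column of a CPT sums to 1, so the result is a factor of all 1s.
```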

To speed up the inference, variables that are irrelevant to answering a query given the observations can be pruned. In particular, any node that has no observed or queried descendants and is itself not observed or queried may be pruned. This may result in a smaller network with fewer factors and variables. For example, to compute $P(Alarm\mid Fire{=}true)$, the variables $Report$, $Leaving$, and $Smoke$ may be pruned.

The complexity of the algorithm depends on a measure of complexity of the network. The size of a tabular representation of a factor is exponential in the number of variables in the factor. The treewidth of a network, given an elimination ordering, is the maximum number of variables in a factor created by summing out a variable when using the elimination ordering. The treewidth of a belief network is the minimum treewidth over all elimination orderings. The treewidth depends only on the graph structure and is a measure of the sparseness of the graph. The complexity of VE is exponential in the treewidth and linear in the number of variables. Finding the elimination ordering with minimum treewidth is NP-hard, but there are some good elimination ordering heuristics, as discussed for CSP VE.
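One common greedy ordering heuristic is min-degree: repeatedly eliminate the variable with the fewest neighbors in the current (moralized) graph. The text only alludes to CSP-style heuristics, so the specific heuristic and function name here are assumptions:

```python
def min_degree_ordering(neighbors):
    """Greedy min-degree elimination ordering. Repeatedly pick the node
    with the fewest remaining neighbors, connect its neighbors (they end
    up together in the factor created by summing it out), and remove it.
    Returns the ordering and the largest factor size it would create."""
    nbrs = {v: set(ns) for v, ns in neighbors.items()}
    order, width = [], 0
    while nbrs:
        v = min(nbrs, key=lambda u: len(nbrs[u]))
        width = max(width, len(nbrs[v]))
        for a in nbrs[v]:
            for b in nbrs[v]:
                if a != b:
                    nbrs[a].add(b)
        for a in nbrs[v]:
            nbrs[a].discard(v)
        order.append(v)
        del nbrs[v]
    return order, width

# A chain A - B - C - D has treewidth 1: every elimination step
# creates a factor on at most one other variable.
chain = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'C'}}
order, width = min_degree_ordering(chain)
```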

###### Example 8.24.

Consider the belief network of Figure 8.4. To compute the probability of $Sneezing$, the variables $Fever$ and “Achoo” sound may be pruned, as they have no children and are not observed or queried. Summing out $Season$ involves multiplying the factors

 $P(Season),P(Pollen\mid Season),P(Influenza\mid Season)$

and results in a factor on $Influenza$ and $Pollen$. The treewidth of this belief network is two; there is an ordering of the variables that only constructs factors of size one or two, and there is no ordering of the variables that has a smaller treewidth.

The moral graph of a belief network is the undirected graph where there is an arc between any two nodes that appear in the same initial factor. This is obtained by “marrying” the parents of a node, and removing the directions. If we prune as outlined in the previous paragraph, moralize the graph, and remove all observed variables, only those variables connected to the query in this graph are relevant to answering the query. The other variables can be pruned.
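The prune–moralize–search procedure just described can be sketched as follows; the function name and graph encoding are assumptions, while the network structure is that of the fire-alarm example:

```python
def prune_and_relevant(parents, query, observed):
    """Prune barren nodes, moralize the remaining graph, and return the
    variables connected to the query without passing through observed nodes."""
    children = {n: set() for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].add(n)
    # repeatedly remove nodes with no remaining children that are
    # neither observed nor the query (barren-node pruning)
    keep = set(parents)
    changed = True
    while changed:
        changed = False
        for n in list(keep):
            if n != query and n not in observed and not (children[n] & keep):
                keep.discard(n)
                changed = True
    # moralize: connect each node to its parents and "marry" the parents
    edges = {n: set() for n in keep}
    for n in keep:
        ps = [p for p in parents[n] if p in keep]
        for p in ps:
            edges[n].add(p)
            edges[p].add(n)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                edges[ps[i]].add(ps[j])
                edges[ps[j]].add(ps[i])
    # graph search from the query, not passing through observed nodes
    seen, frontier = {query}, [query]
    while frontier:
        n = frontier.pop()
        for m in edges[n]:
            if m not in seen and m not in observed:
                seen.add(m)
                frontier.append(m)
    return seen

# The structure of the fire-alarm network of Example 8.15:
parents = {'Tampering': [], 'Fire': [], 'Alarm': ['Tampering', 'Fire'],
           'Smoke': ['Fire'], 'Leaving': ['Alarm'], 'Report': ['Leaving']}

relevant = prune_and_relevant(parents, 'Alarm', {'Fire'})
# relevant == {'Alarm', 'Tampering'}
```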

Many modern exact algorithms are, in essence, variants of variable elimination, sped up by preprocessing as much as possible into a secondary structure before any evidence arrives. This is appropriate when, for example, the same belief network may be used for many different queries, and observations are added incrementally. The algorithms save intermediate results so that evidence can be added incrementally. Unfortunately, extensive preprocessing, which allows arbitrary sequences of observations and derives the posterior on each variable, precludes pruning the network. So, for each application, you need to choose whether you will save more by pruning irrelevant variables for each query and observation, or by preprocessing before you have any observations.