foundations of computational agents
Suppose a learning agent has complete data and no hidden variables, but is not given the structure of the belief network. This is the setting for structure learning of belief networks.
There are two main approaches to structure learning:
The first is to use the definition of a belief network in terms of conditional independence. Given a total ordering of variables, the parents of a variable $X$ are defined to be a subset of the predecessors of $X$ in the total ordering that render the other predecessors independent of $X$. Using the definition directly has two main challenges: the first is to determine the best total ordering; the second is to find a way to measure independence. It is difficult to determine conditional independence when there is limited data.
The second method is to have a score for networks, for example, using the MAP model, which takes into account fit to the data and model complexity. Given such a measure, it is feasible to search for the structure that minimizes this error.
This section presents the second method, often called a search and score method.
Assume that the data is a set $E$ of examples, where each example has a value for each variable. The aim of the search and score method is to choose a model $m$ that maximizes
$$P(m\mid E)\propto P(E\mid m)*P(m).$$ |
The likelihood, $P(E\mid m)$, is the product of the probability of each example. Using the product decomposition, the product of each example given the model is the product of the probability of each variable given its parents in the model. Thus,
$P(E\mid m)*P(m)$ | $=\left({\displaystyle \prod _{e\in E}}P(e\mid m)\right)*P(m)$ | ||
$=\left({\displaystyle \prod _{e\in E}}{\displaystyle \prod _{{X}_{i}}}{P}_{m}^{e}({X}_{i}\mid par({X}_{i},m))\right)*P(m)$ |
where $par({X}_{i},m)$ denotes the parents of ${X}_{i}$ in the model $m$, and ${P}_{m}^{e}(\cdot )$ denotes the probability of example $e$ as specified in the model $m$.
This is maximized when its logarithm is maximized. When taking logarithms, products become sums:
$\mathrm{log}P(E\mid m)+\mathrm{log}P(m)=\left({\displaystyle \sum _{e\in E}}{\displaystyle \sum _{{X}_{i}}}\mathrm{log}{P}_{m}^{e}({X}_{i}\mid par({X}_{i},m))\right)+\mathrm{log}P(m).$ |
To make this approach feasible, assume that the prior probability of the model decomposes into components for each variable. That is, we assume the probability of the model decomposes into a product of probabilities of local models for each variable. Let $m({X}_{i})$ be the local model for variable ${X}_{i}$.
Thus, we want to maximize
$\left({\displaystyle \sum _{e\in E}}{\displaystyle \sum _{{X}_{i}}}\mathrm{log}{P}_{m}^{e}({X}_{i}\mid par({X}_{i},m))\right)+{\displaystyle \sum _{{X}_{i}}}\mathrm{log}P(m({X}_{i}))$ | ||
$={\displaystyle \sum _{{X}_{i}}}\left({\displaystyle \sum _{e\in E}}\mathrm{log}{P}_{m}^{e}({X}_{i}\mid par({X}_{i},m))\right)+{\displaystyle \sum _{{X}_{i}}}\mathrm{log}P(m({X}_{i}))$ | ||
$={\displaystyle \sum _{{X}_{i}}}({\displaystyle \sum _{e\in E}}\mathrm{log}{P}_{m}^{e}({X}_{i}\mid par({X}_{i},m))+\mathrm{log}P(m({X}_{i}))).$ |
Each variable could be optimized separately, except for the requirement that a belief network is acyclic. However, if you had a total ordering of the variables, there is an independent supervised learning problem to predict the probability of each variable given the predecessors in the total ordering. To approximate $\mathrm{log}P(m({X}_{i})$, the BIC score is suitable. To find a good total ordering of the variables, a learning agent could search over total orderings, using search techniques such as local search or branch-and-bound search.