Full text of the second edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2017 is now available.

#### 7.5.1.1 MAP Learning of Decision Trees

To understand MAP learning, consider how it can be used to learn decision trees. If there are no examples with the same values for the input features and different values for the target features, there are always decision trees that fit the data perfectly. If the training examples do not cover all of the assignments to the input variables, multiple trees will fit the data perfectly. However, with noise, none of these may be the best model. Not only do we want to compare the models that fit the data perfectly; we also want to compare those models with the models that do not necessarily fit the data perfectly. MAP learning provides a way to compare these models.

Suppose there are multiple decision trees that accurately fit the data.
If *model*
denotes one of those decision trees,
*P(data|model)=1*. The preference for one decision tree over another
depends on the prior probabilities of the decision trees; the prior probability
encodes the learning bias. The preference for simpler decision trees
over more complicated decision trees occurs because simpler decision trees
have a higher prior probability.

Bayes' rule gives a way to trade off simplicity and ability to handle noise. Decision trees can handle noisy data by having probabilities at the leaves. When there is noise, larger decision trees fit the training data better, because the tree can account for random regularities (noise) in the training data. In decision-tree learning, the likelihood favors bigger decision trees; the more complicated the tree, the better it can fit the data. The prior distribution can favor smaller decision trees. When there is a prior distribution over decision trees, Bayes' rule specifies how to trade off model complexity and accuracy: The posterior probability of the model given the data is proportional to the product of the likelihood and the prior.

**Example 7.18:**Consider the data of Figure 7.1, where the learner is to predict the user's actions.

One possible decision tree is the one given on the left of
Figure 7.4. Call this
decision tree *d _{2}*. The likelihood of the data is

*P(data|d*. That is,

_{2})=1*d*accurately fits the data.

_{2}Another possible decision tree is one with no internal nodes, and a
leaf that says to predict *reads* with probability *(1)/(2)*. This
is the most likely tree with no internal nodes, given the data. Call
this decision tree *d _{0}*. The likelihood of the data given this model
is

P(data|d_{0}) = ((1)/(2))^{9}×((1)/(2))^{9}approx 0.00000149.

Another possible decision tree is one on the right of
Figure 7.4, which just splits on *Length*,
and with probabilities on the leaves given by *P(reads|Length=long)=0*
and *P(reads|Length=short)=(9)/(11)*. Note that *(9)/(11)*
is the empirical frequency of *reads* among the training set with
*Length=short*. Call this decision tree
*d _{1a}*. The likelihood of the data given this model is

P(data|d_{1a}) = 1^{7}×((9)/(11))^{9}×((2)/(11))^{2}approx 0.0543.

Another possible decision tree is one that just splits on *Thread*,
and with probabilities on the leaves given by
*P(reads|Thread=new)=(7)/(10)* (as 7 out of the 10 examples with
*Thread=new* have *User Action=reads*), and
*P(reads|Thread=follow Up)=(2)/(8)*. Call this decision tree
*d _{1t}*. The likelihood of the data given

*d*is

_{1t}P(data|d_{1t}) = ((7)/(10))^{7}×((3)/(10))^{3}×((6)/(8))^{6}×((2)/(8))^{2}approx 0.000025.

These are just four of the possible decision trees. Which is best depends on the prior on trees. The likelihood of the data is multiplied by the prior probability of the decision trees to determine the posterior probability of the decision tree.