10.1 Probabilistic Learning

10.1.3 MAP Learning of Decision Trees

The previous examples did not need the prior on the structure of models, as all the models were equally complex. However, for learning decision trees, you need a bias, typically in favor of smaller decision trees. The prior probability provides this bias.

If there are no examples with the same values for the input features but different values for the target feature, there are always multiple decision trees that fit the data perfectly. If the training examples do not cover every assignment to the input variables, multiple trees will fit the data perfectly. Moreover, for every assignment of values not observed, there are decision trees that perfectly fit the training set, and make opposite predictions on the unseen examples.

If there is a possibility of noise, none of the trees that perfectly fit the training set may be the best model. Not only do we want to compare the models that fit the data perfectly; we also want to compare those models with the models that do not necessarily fit the data perfectly. MAP learning provides a way to compare these models.

Suppose there are multiple decision trees that accurately fit the data. If m denotes one of those decision trees, P(Em)=1. The preference for one decision tree over another depends on the prior probabilities of the decision trees; the prior probability encodes the learning bias. The preference for simpler decision trees over more complicated decision trees reflect the fact that simpler decision trees have a higher prior probability.

Bayes’ rule gives a way to trade off simplicity and the ability to handle noise. Decision trees handle noisy data by having probabilities at the leaves. When there is noise, larger decision trees fit the training data better, because the tree can model random regularities (noise) in the training data. In decision tree learning, the likelihood favors bigger decision trees; the more complicated the tree, the better it can fit the data. The prior distribution typically favors smaller decision trees. When there is a prior distribution over decision trees, Bayes’ rule specifies how to trade off model complexity and accuracy. The posterior probability of the model given the data is proportional to the product of the likelihood and the prior.

Example 10.6.

Consider the data of Figure 7.1, where the learner is required to predict the user’s actions.

One possible decision tree is the one given on the left of Figure 7.6. Call this decision tree d2. The likelihood of the data is P(Ed2)=1. That is, d2 accurately fits the data.

Another possible decision tree is one with no internal nodes, and a single leaf that predicts reads with probability 12. This is the most likely tree with no internal nodes, given the data. Call this decision tree d0. The likelihood of the data given this model is

P(Ed0)=(12)9*(12)90.00000149.

Another possible decision tree is one on the right of Figure 7.6, with one split on Length, and with probabilities on the leaves given by P(readsLength=long)=0 and P(readsLength=short)=911. Note that 911 is the empirical frequency of reads among the training set with Length=short. Call this decision tree d1a. The likelihood of the data given this model is

P(Ed1a)=17*(911)9*(211)20.0543.

Another possible decision tree is one that just splits on Thread, and with probabilities on the leaves given by P(readsThread=new)=710 (as 7 out of the 10 examples with Thread=new have User_action=reads), and P(readsThread=follow_up)=28. Call this decision tree d1t. The likelihood of the data given d1t is

P(Ed1t)=(710)7*(310)3*(68)6*(28)20.000025.

These are just four of the possible decision trees. Which is best depends on the prior on trees. The likelihood of the data is multiplied by the prior probability of the decision trees to determine the posterior probability of the decision tree.