#### 7.3.1.1 Searching for a Good Decision Tree

1: Procedure DecisionTreeLearner(X,Y,E)
2:           Inputs
3:                     X: set of input features, X={X1,...,Xn}
4:                     Y: target feature
5:                     E: set of training examples
6:           Output
7:                     decision tree
8:           if stopping criterion is true then
9:                     return pointEstimate(Y,E)
10:           else
11:                     Select feature Xi∈X, with domain {v1,v2}
12:                     let E1={e∈E: val(e,Xi)=v1}
13:                     let T1=DecisionTreeLearner(X \ {Xi},Y,E1)
14:                     let E2={e∈E: val(e,Xi)=v2}
15:                     let T2=DecisionTreeLearner(X \ {Xi},Y,E2)
16:                     return ⟨Xi=v1, T1, T2
17:
18: Procedure DecisionTreeClassify(e,X,Y,DT)
19:           Inputs
20:                     X: set of input features, X={X1,...,Xn}
21:                     Y: target feature
22:                     e: example to classify
23:                     DT: decision tree
24:           Output
25:                     prediction on Y for example e
26:           Local
27:                     S subtree of DT
28:           S←DT
29:           while S is an internal node of the form ⟨Xi=v,T1,T2 do
30:                     if val(e,Xi)=v then
31:                               S←T1
32:                     else
33:                               S←T2
34:
35:
36:           return S
Figure 7.5: Decision tree learning and classification for binary features

A decision tree can be incrementally built from the top down by recursively selecting a feature to split on and partitioning the training examples with respect to that feature. In Figure 7.5, the procedure DecisionTreeLearner learns a decision tree for binary attributes. The decisions regarding when to stop and which feature to split on are left undefined. The procedure DecisionTreeClassify takes in a decision tree produced by the learner and makes predictions for a new example.

The algorithm DecisionTreeLearner builds a decision tree from the top down as follows: The input to the algorithm is a set of input features, a target feature, and a set of examples. The learner first tests if some stopping criterion is true. If the stopping criterion is true, it returns a point estimate for Y, which is either a value for Y or a probability distribution over the values for Y. If the stopping criterion is not true, the learner selects a feature Xi to split on, and for each value v of this feature, it recursively builds a subtree for those examples with Xi=v. The returned tree is represented here in terms of triples representing an if-then-else structure.

Example 7.7: Consider applying DecisionTreeLearner to the classification data of Figure 7.1. The initial call is
[e1,e2,...,e18]).

Suppose the stopping criterion is not true and the algorithm selects the feature Length to split on. It then calls

[e1,e3,e4,e6,e9,e10,e12]).

All of these examples agree on the user action; therefore, the algorithm returns the prediction skips. The second step of the recursive call is

[e2,e5,e7,e8,e11,e13,e14,e15,e16,e17,e18]).

Not all of the examples agree on the user action, so the algorithm selects a feature to split on. Suppose it selects Thread. Eventually, this recursive call returns the subtree for the case when Length is short, such as

The final result is

<Length=long,skips,

which is a representation of the tree of Figure 7.4.

The learning algorithm of Figure 7.5 leaves three choices unspecified:

• The stopping criterion is not defined. The learner should stop when there are no input features, when all of the examples have the same classification, or when no splitting would improve the classification ability of the resulting tree. The last is the most difficult criterion to test for; see below.
• What should be returned at the leaves is not defined. This is a point estimate because, at this step, all of the other input features are ignored. This prediction is typically the most likely classification, the median or mean value, or a probability distribution over the classifications. [See Exercise 7.9.]
• Which feature to select to split on is not defined. The aim is to choose the feature that will result in the smallest tree. The standard way to do this is to choose the myopically optimal split: if the learner were only allowed one split, which single split would result in the best classification? With the sum-of-squares error, for each feature, determine the error of the resulting tree based on a single split on that feature. For the likelihood or the entropy, the myopically optimal split is the one that gives the maximum information gain. Sometimes information gain is used even when the optimality criterion is the sum-of-squares error. An alternative, the Gini index, is investigated in Exercise 7.10.
Example 7.8: Consider learning the user action from the data of Figure 7.1, where we split on the feature with the maximum information gain or we myopically choose the split that minimizes the entropy or maximizes the likelihood of the data. See Section 6.1.5 for the definition of information used here.

The information content of all examples with respect to feature User Action is 1.0, because there are 9 examples with User Action=reads and 9 examples with User Action=skips.

Splitting on Author partitions the examples into [e1, e4, e5, e6, e9, e10, e12, e13, e14, e15, e16, e17] with Author=known and [e2, e3, e7, e8, e11, e18] with Author=unknown, each of which is evenly split between the different user actions. The information gain for the test Author is zero. In this case, finding out whether the author is known, by itself, provides no information about what the user action will be.

Splitting on Thread partitions the examples into [e1, e2, e5, e8, e10, e12, e14, e15, e17, e18] and [e3, e4, e6, e7, e9, e11, e13, e16]. The first set of examples, those with Thread=new, contains 3 examples with User Action=skips and 7 examples with User Action=reads; thus, the information content of this set with respect to the user action is

-0.3 × log 2 0.3 - 0.7 × log 2 0.7 = 0.881

and so the information gain is 0.119.

Similarly, the examples with Thread=old divide up 6:2 according to the user action and thus have information content 0.811. The expected information gain is thus 1.0 - [(10/18)×0.881 + (8/18) ×0.811] = 0.150.

The test Length divides the examples into [e1, e3, e4, e6, e9, e10, e12] and [e2, e5, e7, e8, e11, e13, e14, e15, e16, e17, e18]. The former all agree on the value of User Action and so have information content zero. The user action divides the second set 9:2, and so the information is 0.684. Thus, the expected information gain by the test length is 1.0 - 11/18 ×0.684 = 0.582. This is the highest information gain of any test and so Length is chosen to split on.

In choosing which feature to split on, the information content before the test is the same for all tests, and so the learning agent can choose the test that results in the minimum expected information after the test.

The algorithm of Figure 7.5 assumes each input feature has only two values. This restriction can be lifted in two ways:

• Allow a multiway split. To split on a multivalued variable, there would be a child for each value in the domain of the variable. This means that the representation of the decision tree becomes more complicated than the simple if-then-else form used for binary features. There are two main problems with this approach. The first is what to do with values for which there are no training examples. The second is that, for most myopic splitting heuristics, including information gain, it is generally better to split on a variable with a larger domain because it produces more children and so can fit the data better than splitting on a feature with a smaller domain. [See Exercise 7.8.] However, splitting on a feature with a smaller domain keeps the representation more compact.
• Partition the domain into two disjoint subsets. When the domain is totally ordered, such as if the domain is a subset of the real numbers, the domain can be split into values less than some threshold and those greater than the threshold. For example, the children could correspond to X<v and X ≥ v for some value v in the domain of X. A myopically optimal value for v can be chosen in one sweep through the data by sorting the data on the value of X and considering each split that partitions the values. When the domain does not have a natural ordering, a split can be performed on arbitrary subsets of the domain. In this case, the myopically optimal split can be found by sorting values that appear in the data on the probability of classification.

If there is noise in the data, a major problem of the preceding algorithm is overfitting the data. Overfitting occurs when the algorithm tries to fit distinctions that appear in the training data but do not appear in the unseen examples. This occurs when random correlations exist in the training data that are not reflected in the data set as a whole. Section 7.5 discusses ways to detect overfitting. There are two ways to overcome the problem of overfitting in decision trees:

• Restrict the splitting to split only when the split is useful.
• Allow unrestricted splitting and then prune the resulting tree where it makes unwarranted distinctions.

The second method seems to work better in practice. One reason is that it is possible that two features together predict well but one of them, by itself, is not very useful, as shown in the following example.

Example 7.9: Suppose the aim is to predict whether a game of matching pennies is won or not. The input features are A, whether the first coin is heads or tails; B, whether the second coin is heads or tails; and C, whether there is cheering. The target feature, W, is true when there is a win, which occurs when both coins are heads or both coins are tails. Suppose cheering is correlated with winning. This example is tricky because A by itself provides no information about W, and B by itself provides no information about W. However, together they perfectly predict W. A myopic split may first split on C, because this provides the most myopic information. If all the agent is told is C, this is much more useful than A or B. However, if the tree eventually splits on A and B, the split on C is not needed. Pruning can remove C as being useful, whereas stopping early will keep the split on C.

A discussion of how to trade off model complexity and fit to the data is presented in Section 7.5.