7.3 Basic Models for Supervised Learning

7.3.2 Linear Regression and Classification

Linear functions provide a basis for many learning algorithms. This section first covers regression – the problem of predicting a real-valued function from training examples – then considers the discrete case of classification.

Linear regression is the problem of fitting a linear function to a set of training examples, in which the input and target features are numeric.

Suppose the input features, X1,,Xn, are all numeric and there is a single target feature Y. A linear function of the input features is a function of the form

Y^w¯(e) =w0+w1*X1(e)++wn*Xn(e)
=i=0nwi*Xi(e)

where w¯=w0,w1,,wn is a tuple of weights. To make w0 not be a special case, we invent a new feature, X0, whose value is always 1.

Suppose E is a set of examples. The sum-of-squares error on examples E for target Y is

error(E,w¯) =eE(Y(e)-Y^w¯(e))2
=eE(Y(e)-i=0nwi*Xi(e))2. (7.1)

In this linear case, the weights that minimize the error can be computed analytically (see Exercise 10). A more general approach, which can be used for wider classes of functions, is to compute the weights iteratively.

Gradient descent is an iterative method to find the minimum of a function. Gradient descent for minimizing error starts with an initial set of weights; in each step, it decreases each weight in proportion to its partial derivative:

wi:=wi-η*wierror(E,w¯)

where η, the gradient descent step size, is called the learning rate. The learning rate, as well as the features and the data, is given as input to the learning algorithm. The partial derivative specifies how much a small change in the weight would change the error.

The sum-of-squares error for a linear function is convex and has a unique local minimum, which is the global minimum. As gradient descent with small enough step size will converge to a local minimum, this algorithm will converge to the global minimum.

Consider minimizing the sum-of-squares error. The partial derivative of the error in Equation 7.1 with respect to weight wi is

wierror(E,w¯)=eE-2*δ(e)*Xi(e) (7.2)

where δ(e)=Y(e)-Y^w¯(e).

Gradient descent will update the weights after sweeping through all examples. An alternative is to update each weight after each example. Each example e can update each weight wi using:

wi:=wi+η*δ(e)*Xi(e), (7.3)

where we have ignored the constant 2, because we assume it is absorbed into the learning rate η.

1:procedure Linear_learner(Xs,Y,Es,η)
2:      Inputs
3:            Xs: set of input features, Xs={X1,,Xn}
4:            Y: target feature
5:            Es: set of training examples
6:            η: learning rate       
7:      Output
8:            function to make prediction on examples
9:      Local
10:            w0,,wn: real numbers       
11:      initialize w0,,wn randomly
12:      define pred(e)=iwi*Xi(e)
13:      repeat
14:            for each example e in Es do
15:                 error:=Y(e)-pred(e)
16:                 update:=η*error
17:                 for each i[0,n] do
18:                       wi:=wi+update*Xi(e)                              
19:      until termination
20:      return pred
Figure 7.8: Incremental gradient descent for learning a linear function

Figure 7.8 gives an algorithm, Linear_learner(Xs,Y,Es,η), for learning the weights of a linear function that minimize the sum-of-squares error. This algorithm returns a function that makes predictions on examples. In the algorithm, X0(e) is defined to be 1 for all e.

Termination is usually after some number of steps, when the error is small or when the changes get small.

Updating the weights after each example does not strictly implement gradient descent because the weights are changing beween examples. To implement gradient descent, we should save up all of the changes and update the weights after all of the examples have been processed. The algorithm presented in Figure 7.8 is called incremental gradient descent because the weights change while it iterates through the examples. If the examples are selected at random, this is called stochastic gradient descent. These incremental methods have cheaper steps than gradient descent and so typically become more accurate more quickly when compared to saving all of the changes to the end of the examples. However, it is not guaranteed that they will converge as individual examples can move the weights away from the minimum.

Batched gradient descent updates the weights after a batch of examples. The algorithm computes the changes to the weights after every example, but only applies the changes after the batch. If a batch is all of the examples, it is equivalent to gradient descent. If a batch consists of just one example, it is equivalent to incremental gradient descent. It is typical to start with small batches to learn quickly and then increase the batch size so that it converges.

A similar algorithm can be used for other error functions that are (almost always) differentiable and where the derivative has some signal (is not 0). For the absolute error, which is not differentiable at zero, the derivative can be defined to be zero at that point because the error is already at a minimum and the weights do not have to change. See Exercise 9. It does not work for other errors such as the 0/1 error, where the derivative is either 0 (almost everywhere) or undefined.

Squashed Linear Functions

Consider binary classification, where the domain of the target variable is {0,1}. Multiple binary target variables can be learned separately.

The use of a linear function does not work well for such classification tasks; a learner should never make a prediction of greater than 1 or less than 0. However, a linear function could make a prediction of, say, 3 for one example just to fit other examples better.

A squashed linear function is of the form

Y^w¯(e) =f(w0+w1*X1(e)++wn*Xn(e))
=f(iwi*Xi(e))

where f, an activation function, is a function from the real line [-,] into some subset of the real line, such as [0,1].

A prediction based on a squashed linear function is a linear classifier.

A simple activation function is the step function, step0(x), defined by

step0(x)={1 if x00 if x<0.

A step function was the basis for the perceptron [Rosenblatt, 1958], which was one of the early methods developed for learning. It is difficult to adapt gradient descent to step functions because gradient descent takes derivatives and step functions are not differentiable.

If the activation is (almost everywhere) differentiable, gradient descent can be used to update the weights. The step size might need to converge to zero to guarantee convergence.

Figure 7.9: The sigmoid or logistic function

One differentiable activation function is the sigmoid or logistic function:

sigmoid(x)=11+e-x.

This function, depicted in Figure 7.9, squashes the real line into the interval (0,1), which is appropriate for classification because we would never want to make a prediction of greater than 1 or less than 0. It is also differentiable, with a simple derivative – namely, ddxsigmoid(x)=sigmoid(x)*(1-sigmoid(x)).

The problem of determining weights for the sigmoid of a linear function that minimize an error on a set of examples is called logistic regression.

To optimize the log loss error for logistic regression, minimize the negative log-likelihood

LL(E,w¯)=-(eE(Y(e)*logY^(e)+(1-Y(e))*log(1-Y^(e))))

where Y^(e)=sigmoid(i=0nwi*Xi(e)).

wiLL(E,w¯)=eE-δ(e)*Xi(e)

where δ(e)=Y(e)-Y^w¯(e). This is, essentially, the same as Equation 7.2, the only differences being the definition of the predicted value and the constant “2” which can be absorbed into the step size.

The Linear_learner algorithm of Figure 7.8 can be modified to carry out logistic regression to minimize log loss by changing the prediction to be sigmoid(iwi*Xi(e)). The algorithm is show in Figure 7.10.

1:procedure Logistic_regression_learner(Xs,Y,Es,η)
2:      Inputs
3:            Xs: set of input features, Xs={X1,,Xn}
4:            Y: target feature
5:            Es: set of training examples
6:            η: learning rate       
7:      Output
8:            function to make prediction on examples
9:      Local
10:            w0,,wn: real numbers       
11:      initialize w0,,wn randomly
12:      define pred(e)=sigmoid(iwi*Xi(e))
13:      repeat
14:            for each example e in Es in random order do
15:                 error:=Y(e)-pred(e)
16:                 update:=η*error
17:                 for each i[0,n] do
18:                       wi:=wi+update*Xi(e)                              
19:      until termination
20:      return pred
Figure 7.10: Stochastic gradient descent for logistic regression
Example 7.11.

Consider learning a squashed linear function for classifying the data of Figure 7.1. One function that correctly classifies the examples is

Reads^(e)=sigmoid(-8+7*Short(e)+3*New(e)+3*Known(e)),

where f is the sigmoid function. A function similar to this can be found with about 3000 iterations of gradient descent with a learning rate η=0.05. According to this function, Reads^(e) is true (the predicted value for example e is closer to 1 than 0) if and only if Short(e) is true and either New(e) or Known(e) is true. Thus, the linear classifier learns the same function as the decision tree learner. To see how this works, see the “mail reading” example of the Neural AIspace.org applet.

To minimize sum-of-squares error, instead, the prediction is the same, but the derivative is different. In particular, line 16 of Figure 7.10 should become

update:=η*error*pred(e)*(1-pred(e)).

Consider each input feature as a dimension; if there are n features, there will be n dimensions. A hyperplane in an n-dimensional space is a set of points that all satisfy a constraint that some linear function of the variables is zero. The hyperplane forms an (n-1)-dimensional space. For example, in a (two-dimensional) plane, a hyperplane is a line, and in a three-dimensional space, a hyperplane is a plane. A classification is linearly separable if there exists a hyperplane where the classification is true on one side of the hyperplane and false on the other side.

The Logistic_regression_learner algorithm can learn any linearly separable classification. The error can be made arbitrarily small for arbitrary sets of examples if, and only if, the target classification is linearly separable. The hyperplane is the set of points where iwi*Xi=0 for the learned weights w¯. On one side of this hyperplane, the prediction is greater than 0.5; on the other side, the prediction is less than 0.5.

Figure 7.11: Linear separators for Boolean functions
Example 7.12.

Figure 7.11 shows linear separators for “or” and “and”. The dashed line separates the positive (true) cases from the negative (false) cases. One simple function that is not linearly separable is the exclusive-or (xor) function. There is no straight line that separates the positive examples from the negative examples. As a result, a linear classifier cannot represent, and therefore cannot learn, the exclusive-or function.

Consider a learner with three input features x, y and z, each with domain {0,1}. Suppose that the ground truth is the function “if x then y else z”. This is depicted on the right of Figure 7.11 by a cube in the standard coordinates, where the x, y and z range from 0 to 1. This function is not linearly separable.

Often it is difficult to determine a priori whether a data set is linearly separable.

Example 7.13.

Consider the data set of Figure 7.12 (a), which is used to predict whether a person likes a holiday as a function of whether there is culture, whether the person has to fly, whether the destination is hot, whether there is music, and whether there is nature. In this data set, the value 1 means true and 0 means false. The linear classifier requires the numerical representation.

Culture Fly Hot Music Nature Likes lin Likes^
0 0 1 0 0 0 -9.09 0.00011
0 1 1 0 0 0 -9.08 0.00011
1 1 1 1 1 0 -4.48 0.01121
0 1 1 1 1 0 -6.78 0.00113
0 1 1 0 1 0 -2.28 0.09279
1 0 0 1 1 1 4.61 0.99015
0 0 0 0 0 0 0.01 0.50250
0 0 0 1 1 1 2.31 0.90970
1 1 1 0 0 0 -6.78 0.00113
1 1 0 1 1 1 4.62 0.99024
1 1 0 0 0 1 2.32 0.91052
1 0 1 0 1 1 0.01 0.50250
0 0 0 1 0 0 -4.49 0.01110
1 0 1 1 0 0 -11.29 0.00001
1 1 1 1 0 0 -11.28 0.00001
1 0 0 1 0 0 -2.19 0.10065
1 1 1 0 1 0 0.02 0.50500
0 0 0 0 1 1 6.81 0.99890
0 1 0 0 0 1 0.02 0.50500

(a)
(b)
Figure 7.12: Predicting what holiday a person likes (a) training data (b) prediction

After 10,000 iterations of gradient descent with a learning rate of 0.05, the prediction found is (to one decimal point)

lin(e)= 2.3*Culture(e)+0.01*Fly(e)-9.1*Hot(e)
-4.5*Music(e)+6.8*Nature(e)+0.01
Likes^(e)= sigmoid(lin(e)).

The linear function lin and the prediction for each example are shown in Figure 7.12 (b). All but four examples are predicted reasonably well, and for those four it predicts a value of approximately 0.5. This function is quite stable with different initializations. Increasing the number of iterations makes it predict the other tuples more accurately, but does not improve on these four. This data set is not linearly separable.

When the domain of the target variable has more than two values – there are more than two classes – indicator variables can be used to convert the classification to binary variables. These binary variables could be learned separately. The predictions of the individual classifiers can be combined to give a prediction for the target variable. Because exactly one of the values must be true for each example, a learner should not predict that more than one will be true or that none will be true. Suppose we separately learned the predicted values for Y1Yk are q1qk, where the qi0. A learner that predicts a probability distribution could predict Y=yi with probability qi/jqj. A learner that must make a definitive prediction can predict the mode, a yi for which qi is the maximum.