Full text of the second edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2017 is now available.

### 7.2.2 Point Estimates with No Input Features

The simplest case for learning is when there are no input features and where there is a single target feature. This is the base case for many of the learning algorithms and corresponds to the case where all inputs are ignored. In this case, a learning algorithm predicts a single value for the target feature for all of the examples. The prediction that minimizes the error depends on the error that is being minimized.

Suppose *E* is a set of examples and *Y* is a numeric
feature. The best an agent can do is to make a single point estimate
for all examples. Note that it is possible for the agent to make
stochastic predictions, but these are not better; see Exercise 7.2.

The sum-of-squares error on *E* of prediction *v* is

∑_{e∈E}(val(e,Y)-v)^{2}.

The absolute error on *E* of prediction *v* is

∑_{e∈E}|val(e,Y)-v|.

The worst-case error on *E* of prediction *v* is

max_{e∈E}|val(e,Y)-v|.

**Proposition.**

*Suppose*

*V*is the multiset of values of*val(e,Y)*for*e∈E*.- The prediction that minimizes the sum-of-squares error on
*E*is the mean of*V*(the average value). - The value that minimizes the absolute error is the median of
*V*. In particular, any number*v*such that there is the same number of values of*V*less than*v*as there are values greater than*v*minimizes the error. - The value that minimizes the worst-case error is
*(max+min)/2*, where*max*is the maximum value and*min*is the minimum value.

**Proof.**

*The details of the proof are left as an exercise. The basic idea follows:*

- Differentiate the formula for the sum-of-squares error with
respect to
*v*and set to zero. This is elementary calculus. To make sure the point(s) with a derivative of zero is(are) a minimum, the end points also must be checked. - The absolute error is a piecewise linear function of
*v*. The slope for a value that is not in*V*depends on the number of elements greater minus the number of elements less than that value:*v*is a minimum if there are the same number of elements greater than*v*as there are less than*v*. - This prediction has an error of
*(max-min)/2*; increasing or decreasing the prediction will increase the error.

When the target feature has domain
*{0,1}*, the training examples can be summarized in two
numbers: *n _{0}*, the number of examples with the value 0, and

*n*, the number of examples with value 1. The prediction for each new case is the same number,

_{1}*p*.

The optimal prediction *p* depends on the optimality
criteria. The value of the optimality criteria for the training examples can be computed
analytically and can be optimized analytically. The results are
summarized in Figure 7.3.

Prediction | Measure of | Optimal | ||||||

measure | prediction p for | prediction for | ||||||

the training data | training data | |||||||

absolute error | n _{0} p+n_{1}(1-p) | median(n _{0},n_{1}) | ||||||

sum squares | n _{0} p^{2}+n_{1}(1-p)^{2} | (n _{1})/(n_{0}+n_{1}) | ||||||

worst case | {
| {
| ||||||

likelihood | p ^{n1}(1-p)^{n0} | (n_{1})/(n_{0}+n_{1}) | ||||||

entropy | -n _{1} log p-n_{0} log (1-p) |
(n _{1})/(n_{0}+n_{1}) | ||||||

Notice that optimizing the absolute error means predicting the median,
which in this case is also the mode;
this should not be surprising because the error is linear in *p*.

The
optimal prediction for the training data for the other criteria is to
predict the **empirical frequency**: the proportion of 1's in the
training data, namely *(n _{1})/(n_{0}+n_{1})*. This can be seen as a prediction
of the

**probability**. The empirical frequency is often called the

**maximum-likelihood estimate**.

This analysis does
not specify the optimal prediction for the test data.
We would *not* expect the empirical frequency of the training
data to be the optimal prediction for
the test data for maximizing the likelihood or minimizing the entropy. If *n _{0}=0* or if

*n*, all of the training data are classified the same. However, if just one of the test examples is not classified in this way, the likelihood would be 0 (its lowest possible value) and the entropy would be infinite. See Exercise 7.1.

_{1}=0