### 7.2.3 Learning Probabilities

For many of the prediction measures, the optimal prediction on the training data is the empirical frequency. Thus, making a point estimate can be interpreted as learning a probability. However, the empirical frequency is typically not a good estimate of the probability of new cases; just because an agent has not observed some value of a variable does not mean that the value should be assigned a probability of zero. A probability of zero means that the value is impossible.

Typically, we do not have data without any prior knowledge. There is typically a great deal of knowledge about a domain, either in the meaning of the symbols or in experience with similar examples that can be used to improve predictions.

A standard way both to solve the zero-probability problem and to take prior knowledge into account is to use a pseudocount or prior count for each value to which the training data is added.

Suppose there is a binary feature Y, and an agent has observed n0 cases where Y=0 and n1 cases where Y=1. The agent can use a pseudocount c0 ≥ 0 for Y=0 and a pseudocount c1 ≥ 0 for Y=1 and estimate the probability as

P(Y=1)=(n1+c1)/(n0+c0+n1+c1).

This takes into account both the data and the prior knowledge. This formula can be justified in terms of a prior on the parameters (see Section 7.8). Choosing pseudocounts is part of designing the learner.

More generally, suppose Y has domain {y1,...,yk}. The agent starts with a pseudocount ci for each yi. These counts are chosen before the agent has seen any of the data. Suppose the agent observes some training examples, where ni is the number of data points with Y=yi. It can then use

P(Y=yi)=(ci+ni)/(∑i' ci'+ni').

To determine the pseudocounts, consider the question, "How much more should an agent believe yi if it had seen one example with yi true than if it had seen no examples with yi true?" If, with no examples of yi true, the agent believes that yi is impossible, ci should be zero. If not, the ratio chosen in answer to that question should be equal to the ratio (1+ci):ci. If the pseudocount is 1, a value that has been seen once would be twice as likely as one that has been seen no times. If the pseudocount is 10, a value observed once would be 10% more likely than a value observed no times. If the pseudocount is 0.1, a value observed once would be 11 times more likely than a value observed no times. If there is no reason to choose one value in the domain of Y over another, all of the values of ci should be equal.

If there is no prior knowledge, Laplace(1812) suggested that it is reasonable to set ci=1. See Section 7.8 for a justification of why this may be reasonable. We will see examples where it is not appropriate, however.

The same idea can be used to learn a conditional probability distribution. To estimate a conditional distribution, P(Y|X), of variable Y conditioned on variable(s) X, the agent can maintain a count for each pair of a value for Y and a value for X. Suppose cij is a non-negative number that will be used as a pseudocount for Y=yi ∧X=xj. Suppose nij is the number of observed cases of Y=yi ∧X=xj. The agent can use

P(Y=yi|X=xj)=(cij+nij)/(∑i' ci'j+ni'j),

but this does not work well when the denominator is small, which occurs when some values of X are rare. When X has structure - for example, when it is composed of other variables - it is often the case that some assignments to X are very rare or even do not appear in the training data. In these cases, the learner must use other methods that are discussed in this chapter.