10.1 Probabilistic Learning

10.1.1 Learning Probabilities

The above formulation of probabilities of models can be used to learn probabilities, as in the following example.

Example 10.1.

Consider the problem of predicting the next toss of a thumbtack (drawing pin), where we define the outcomes Tails and Heads as follows:

Suppose you tossed a thumbtack a number of times and observed E, a particular sequence of n0 instances of Tails and n1 instances of Heads. Assume the tosses are independent, and that Heads occurs with probability p. The likelihood is

P(Ep)=pn1*(1-p)n0.

This is a maximum when the log-likelihood

logP(Ep)=n1*logp+n0*log(1-p)

is a maximum, which occurs when p=n1n0+n1.

Note that if n0=0 or n1=0, this predicts a zero probability for an event that is possible.

The maximum likelihood estimator on the training data is not necessarily the best predictor on the test set.

A simple way both to solve the zero-probability problem and to take prior knowledge into account is to use a real-valued pseudocount or prior count to which the training data are added.

Suppose an agent must predict a value for Y with domain {y1,,yk}, and there are no inputs. The agent starts with a pseudocount ci for each yi. These counts are chosen before the agent has seen any of the data. Suppose the agent observes some training examples, where ni is the number of data points with Y=yi. The probability of Y is estimated using

P(Y=yi)=ci+niici+ni.

This may be used to estimate probabilities even before any data have been seen, when the ni are all 0.

If there is no prior knowledge, Laplace [1812] suggested that it is reasonable to set ci=1. This method with a prior count of 1 is called Laplace smoothing. Laplace smoothing can be justified in terms of averaging over the probabilities [see Section 10.4].

Example 10.2.

In Example 10.1, one possible prior for the model that says that heads occurs with probability p is

P(p)=pc1*(1-p)c0

in which case the probability of the model given examples E which consists of a particular sequence of n0 tails and n1 heads is

P(pE)pc1+n1*(1-p)c0+n0.

In this case, the MAP estimate for p, the probability of heads is:

p=c1+n1c0+n0+c1+n1.

One reason for this prior is that you might not think that 0 or 1 is a reasonable estimate for the probability of heads (no matter how much data you have). It also reflects prior knowledge to be used when there are no data, and can be integrated with data. Moreover this prior has the same form as the posterior; both are described in terms of counts (a prior that has the same form as a posterior is called a conjugate prior).

To determine appropriate pseudocounts, consider the question, “How much more should an agent believe yi if it had seen one example with yi true than if it had seen no examples with yi true?” If, with no examples of yi true, the agent believes that yi is impossible, ci should be zero. If not, the ratio chosen in answer to that question should be equal to the ratio (1+ci):ci. If the pseudocount is 1, a value that has been seen once would be twice as likely as one that has been seen no times. If the pseudocount is 10, a value observed once would be 10% more likely than a value observed no times. If the pseudocount is 0.1, a value observed once would be 11 times more likely than a value observed no times. If there is no reason to choose one value in the domain of Y over another, all the values of ci should be equal.

Typically, we do not have data without any prior knowledge. There is often a great deal of knowledge about a domain, either in the meaning of the symbols or in experience with similar examples, that can be used to improve predictions.

Probabilities from Experts

The use of pseudocounts also gives us a way to combine expert opinion and data. Often a single agent does not have good data but may have access to multiple experts who have varying levels of expertise and who give different probabilities.

There are a number of problems with obtaining probabilities from experts:

  • experts’ reluctance to give an exact probability value that cannot be refined,

  • representing the uncertainty of a probability estimate,

  • combining the numbers from multiple experts, and

  • combining expert opinion with actual data.

Rather than expecting experts to give probabilities, the experts provide counts. Instead of giving a real number for the probability of A, an expert gives a pair of numbers as n,m that is interpreted as though the expert had observed n occurrences of A out of m trials. Essentially, the experts provide not only a probability but also an estimate of the size of the data set on which their opinion is based. Note that you should not necessarily believe an expert’s sample size, as people are often overconfident in their abilities.

The counts from different experts can be combined together by adding the components to give the pseudocounts for the system. Whereas the ratio reflects the probability, different levels of confidence are reflected in the absolute values. Consider different ways to represent the probability 2/3. The pair 2,3 reflects extremely low confidence that would quickly be dominated by data or other experts’ estimates. The pair 20,30 reflects more confidence – a few examples would not change it much, but tens of examples would. Even hundreds of examples would have little effect on the prior counts of the pair 2000,3000. However, with millions of data points, even these prior counts would have little impact on the resulting probability estimate.