A brief introduction to Bayes' Rule

by Kevin Murphy.

Intuition

Here is a simple introduction to Bayes' rule from an article in the Economist (9/30/00).

"The essence of the Bayesian approach is to provide a mathematical rule explaining how you should change your existing beliefs in the light of new evidence. In other words, it allows scientists to combine new data with their existing knowledge or expertise. The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child's degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise."

In symbols

Mathematically, Bayes' rule states
                 likelihood * prior
posterior = ------------------------------
                  marginal likelihood
or, in symbols,
             P(e | R=r) P(R=r)
P(R=r | e) = -----------------
                   P(e)
where P(R=r|e) denotes the probability that random variable R has value r given evidence e. The denominator is just a normalizing constant that ensures the posterior adds up to 1; it can be computed by summing up the numerator over all possible values of R, i.e.,
P(e) = P(R=0, e) + P(R=1, e) + ... = sum_r P(e | R=r) P(R=r)
This is called the marginal likelihood (since we marginalize out R), and gives the prior probability of the evidence.
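As a concrete illustration, here is a minimal Python sketch of this computation for a discrete variable R (the function and argument names are mine, not from any library):

    def posterior(prior, likelihood):
        """Bayes' rule for a discrete random variable R.

        prior:      dict mapping each value r to P(R=r)
        likelihood: dict mapping each value r to P(e | R=r)
        Returns a dict mapping each value r to P(R=r | e).
        """
        # Numerator of Bayes' rule, P(e | R=r) P(R=r), for each value of R.
        joint = {r: likelihood[r] * prior[r] for r in prior}
        # Marginal likelihood P(e): sum the numerator over all values of R.
        p_e = sum(joint.values())
        return {r: p / p_e for r, p in joint.items()}

For example, posterior({'true': 0.01, 'false': 0.99}, {'true': 0.95, 'false': 0.05}) reproduces the disease-test calculation in the next section.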

Example of Bayes' rule

Here is a simple example, based on Mike Shor's Java applet. Suppose that you have tested positive for a disease; what is the probability that you actually have it? It depends on the sensitivity and the false positive rate of the test, and on the background (prior) probability of the disease.

Let P(Test=+ve | Disease=true) = 0.95, so the false negative rate, P(Test=-ve | Disease=true), is 5%. Let P(Test=+ve | Disease=false) = 0.05, so the false positive rate is also 5%. Suppose the disease is rare: P(Disease=true) = 0.01 (1%). Let D denote Disease (playing the role of R in the equation above) and let "T=+ve" denote a positive Test (the evidence e). Then


                                 P(T=+ve|D=true) * P(D=true)
P(D=true|T=+ve) =  ------------------------------------------------------------
                     P(T=+ve|D=true) * P(D=true)+ P(T=+ve|D=false) * P(D=false)
 
                   0.95  * 0.01              0.0095
               =  -------------------    =  -------     = 0.161
                  0.95*0.01 + 0.05*0.99      0.0590
So the probability of having the disease given that you tested positive is only 16%. This seems surprisingly low, but here is an intuitive argument to support it. Of 100 people, we expect only 1 to have the disease, and that person will probably test positive. But we also expect about 5% of the remaining 99 (about 5 people) to test positive by accident. So of the roughly 6 people who test positive, only 1 actually has the disease; and indeed 1/6 is approximately 0.16. (If you still don't believe this result, try reading An Intuitive Explanation of Bayesian Reasoning by Eliezer Yudkowsky.)
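To check the arithmetic, here is a minimal Python sketch of this calculation (the variable names are mine, not from any library):

    p_d = 0.01      # prior: P(D=true)
    sens = 0.95     # sensitivity: P(T=+ve | D=true)
    fpr = 0.05      # false positive rate: P(T=+ve | D=false)

    # Marginal likelihood P(T=+ve): sum over D=true and D=false.
    p_pos = sens * p_d + fpr * (1 - p_d)

    print(sens * p_d / p_pos)   # P(D=true | T=+ve) = 0.1610...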

In other words, the posterior is small because the prior says the disease is rare: the test has made it 16 times more likely that you have the disease (P(D=true|T=+ve)/P(D=true) = 0.16/0.01 = 16), but the disease is still unlikely in absolute terms. If you want to be "objective", you can set the prior to uniform (i.e., effectively ignore the prior), and then get


                                 P(T=+ve|D=true) * P(D=true)
P(D=true|T=+ve) =  ------------------------------------------------------------
                                    P(T=+ve)
 
                   0.95  * 0.5              0.475
               =  -------------------    =  -------     = 0.95
                  0.95*0.5 + 0.05*0.5        0.5
This, of course, is just the true positive rate of the test. However, this conclusion relies on the belief that, before any test is conducted, half the people in the world have the disease, which does not seem reasonable.
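To see how strongly the conclusion depends on the prior, we can sweep it over a range of values; a minimal Python sketch (same hypothetical test as above):

    sens, fpr = 0.95, 0.05      # sensitivity and false positive rate, as before

    for p_d in [0.001, 0.01, 0.1, 0.5]:
        p_pos = sens * p_d + fpr * (1 - p_d)
        print(p_d, round(sens * p_d / p_pos, 3))
    # prints 0.016, 0.161, 0.679, 0.95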

A better approach is to use a plausible prior (e.g., P(D=true) = 0.01), but to conduct multiple independent tests; if they all come up positive, the posterior increases. For example, if we conduct two (conditionally independent) tests T1, T2 with the same reliability, and both are positive, we get


                                 P(T1=+ve|D=true) * P(T2=+ve|D=true) * P(D=true)
P(D=true|T1=+ve,T2=+ve) =  ------------------------------------------------------------
                                           P(T1=+ve, T2=+ve)
 
                    0.95 * 0.95 * 0.01                 0.009025
                =  -----------------------------   =  ----------   = 0.785
                   0.95*0.95*0.01 + 0.05*0.05*0.99      0.0115
The assumption that the pieces of evidence are conditionally independent is called the naive Bayes assumption. This model has been successfully used for classifying email as spam (D=true) or not (D=false) given the presence of various key words (Ti=+ve if word i is in the text, else Ti=-ve). It is clear that the words are not independent, even conditioned on spam/not-spam, but the model works surprisingly well nonetheless.
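Under the naive Bayes assumption the likelihood is just a product of per-test terms, so the update generalizes directly to any number of tests. Here is a minimal Python sketch (the function name and arguments are mine, for illustration):

    def naive_bayes_posterior(p_d, sens, fpr, num_positive_tests):
        """P(D=true | all tests +ve), assuming conditionally independent tests."""
        # Likelihoods multiply across conditionally independent tests.
        like_d = sens ** num_positive_tests
        like_not_d = fpr ** num_positive_tests
        evidence = like_d * p_d + like_not_d * (1 - p_d)
        return like_d * p_d / evidence

    print(naive_bayes_posterior(0.01, 0.95, 0.05, 1))   # 0.161
    print(naive_bayes_posterior(0.01, 0.95, 0.05, 2))   # 0.785
    print(naive_bayes_posterior(0.01, 0.95, 0.05, 3))   # 0.986

Each additional positive test multiplies the odds by the likelihood ratio 0.95/0.05 = 19, so the posterior climbs quickly despite the small prior.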

What is the relationship between graphical models and Bayes' rule?

For complicated probabilistic models, computing the normalizing constant P(e) is computationally intractable, either because there are an exponential number of (discrete) values of R to sum over, or because the integral over R cannot be solved in closed form (e.g., if R is a high-dimensional vector). Graphical models can help because they represent the joint probability distribution as a product of local terms, which can sometimes be exploited computationally (e.g., using dynamic programming or Gibbs sampling). Bayes nets (directed graphical models) are a natural way to represent many hierarchical Bayesian models.

Bayesian vs Frequentist statistics

Bayes' theorem, and in particular its reliance on prior probabilities, has caused considerable controversy. The great statistician Ronald Fisher was very critical of the "subjectivist" aspects of priors. By contrast, a leading proponent, I. J. Good, argued persuasively that "the subjectivist (i.e. Bayesian) states his judgements, whereas the objectivist sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science".

Frequentist statistics suffers from various counterintuitive properties because it relies on the notion of hypothetical repeated trials, rather than conditioning on the evidence actually observed. The following cartoon (from xkcd) summarizes the situation quite well:

History of Bayes' rule

(The following section was written by Alan Yuille.)

Bayes' theorem is commonly ascribed to the Reverend Thomas Bayes (1701-1761), who left one hundred pounds in his will to Richard Price, "now I suppose Preacher at Newington Green." Price discovered two unpublished essays among Bayes's papers, which he forwarded to the Royal Society. This work made little impact, however, until it was independently discovered a few years later by the great French mathematician Laplace. English mathematicians then quickly rediscovered Bayes' work.

Little is known about Bayes, and he is considered an enigmatic figure. One leading historian of statistics, Stephen Stigler, has even suggested that Bayes' theorem was really discovered by Nicholas Saunderson, a blind mathematician who was the fourth Lucasian Professor of Mathematics at Cambridge University. (Saunderson was recommended to this chair by Isaac Newton, the second Lucasian Professor. Recent holders of the chair include the great physicist Paul Dirac and the current holder, Stephen Hawking.)
