### 6.1.5 Information Theory

Probability forms the basis of information theory. In this section, we give a brief overview of information theory.

A bit is a binary digit. Because a bit has two possible values, it can be used to distinguish two items. Often the two values are written as 0 and 1, but they can be any two different values.

Two bits can distinguish four items, each associated with either 00, 01, 10, or 11. Similarly, three bits can distinguish eight items. In general, n bits can distinguish 2n items. Thus, we can distinguish n items with  log 2 n bits. It may be surprising, but we can do better than this by taking probabilities into account.

Example 6.9: Suppose you want to design a code to distinguish the elements of the set {a,b,c,d}, with P(a)=(1)/(2), P(b)=(1)/(4), P(c)=(1)/(8), and P(d)=(1)/(8). Consider the following code:
 a 0 c 110 b 10 d 111

This code sometimes uses 1 bit and sometimes uses 3 bits. On average, it uses

 P(a) ×1 + P(b) ×2 + P(c) ×3 + P(d) ×3 = (1)/(2) + (2)/(4) + (3)/(8) + (3)/(8)= 1(3)/(4) bits.

For example, the string aacabbda with 8 characters has code 00110010101110, which uses 14 bits.

With this code, - log 2 P(a)=1 bit is required to distinguish a from the other symbols. To distinguish b, you must have - log 2 P(b)=2 bits. To distinguish c, you must have - log 2 P(c)=3 bits.

It is possible to build a code that, to identify x, requires - log 2 P(x) bits (or the integer greater than this, if x is the only thing being transmitted). Suppose there is a sequence of symbols we want to transmit or store and we know the probability distribution over the symbols. A symbol x with probability P(x) uses - log 2 P(x) bits. To transmit a sequence, each symbol requires, on average,

x -P(x)× log 2 P(x)

bits to send it. This value just depends on the probability distribution of the symbols. This is called the information content or entropy of the distribution.

Analogously to conditioning in probability, the expected number of bits it takes to describe a distribution given evidence e is

I(e) = ∑x -P(x|e)× log 2 P(x|e).

If a test exists that can distinguish the cases where α is true from the cases where α is false, the information gain from this test is

I(true) - ( P(α) ×I(α) + P(¬α) ×I(¬α)),

where I(true) is the expected number of bits needed before the test and P(α) ×I(α) + P(¬α) ×I(¬α) is the expected number of bits after the test.

In later sections, we use the notion of information for a number of tasks:

• In diagnosis, an agent can choose a test that provides the most information.
• In decision-tree learning, information theory provides a useful criterion for choosing which property to split on: split on the property that provides the greatest information gain. The elements it must distinguish are the different values in the target concept, and the probabilities are obtained from the proportion of each value in the training set remaining at each node.
• In Bayesian learning, information theory provides a basis for deciding which is the best model given some data.