7.8 Learning as Refining the Hypothesis Space

7.8.1 Version-Space Learning

Rather than enumerating all of the hypotheses, the set of elements of consistent with all of the examples can be found more efficiently by imposing some structure on the hypothesis space.

Hypothesis h1 is a more general hypothesis than hypothesis h2 if h2 implies h1. In this case, h2 is a more specific hypothesis than h1. Any hypothesis is both more general than itself and more specific than itself.

Example 7.29.

The hypothesis ¬academicmusic is more specific than music and is also more specific than ¬academic. Thus, music is more general than ¬academicmusic. The most general hypothesis is true. The most specific hypothesis is false.

The “more general than” relation forms a partial ordering over the hypothesis space. The version-space algorithm that follows exploits this partial ordering to search for hypotheses that are consistent with the training examples.

Given hypothesis space and examples E, the version space is the subset of that is consistent with the examples.

The general boundary of a version space, G, is the set of maximally general members of the version space (i.e., those members of the version space such that no other element of the version space is more general). The specific boundary of a version space, S, is the set of maximally specific members of the version space.

These concepts are useful because the general boundary and the specific boundary completely determine the version space, as shown by the following proposition.

Proposition 7.2.

The version space is the set of hH such that h is more general than an element of S and more specific than an element of G.

Candidate Elimination Learner

The candidate elimination learner incrementally builds the version space given a hypothesis space and a set E of examples. The examples are added one by one; each example possibly shrinks the version space by removing the hypotheses that are inconsistent with the example. The candidate elimination algorithm does this by updating the general and/or the specific boundary for each new example. This is described in Figure 7.21.

1:procedure Candidate_elimination_learner(Xs,Y,Es,)
2:      Inputs
3:            Xs: set of input features, Xs={X1,,Xn}
4:            Y: Boolean target feature
5:            Es: set of examples from which to learn
6:            : hypothesis space       
7:      Output
8:            general boundary G
9:            specific boundary S consistent with Es       
10:      Local
11:            G: set of hypotheses in
12:            S: set of hypotheses in       
13:      Let G={true}, S={false};
14:      for each eEs do
15:            if  Y(e)=true then
16:                 Elements of G that classify e as negative are removed from G;
17:                 Each element s of S that classifies e as negative is removed and replaced by the minimal generalizations of s that classify e as positive and are less general than some member of G;
18:                 Non-maximal hypotheses are removed from S;
19:            else
20:                 Elements of S that classify e as positive are removed from S;
21:                 Each element g of G that classifies e as positive is removed and replaced by the minimal specializations of g that classifies e as negative and are more general than some member of S.
22:                 Non-minimal hypotheses are removed from G.                   
Figure 7.21: Candidate elimination algorithm
Example 7.30.

Consider how the candidate elimination algorithm handles Example 7.26, where H is the set of conjunctions of literals.

Before it has seen any examples, G0={true} – the user reads everything – and S0={false} – the user reads nothing. Note that true is the empty conjunction and false is the conjunction of an atom and its negation.

After considering the first example, a1, G1={true} and

S1={crime¬academic¬localmusic}.

Thus, the most general hypothesis is that the user reads everything, and the most specific hypothesis is that the user only reads articles exactly like this one.

After considering the first two examples, G2={true} and

S2={crime¬academic¬local}.

Since a1 and a2 disagree on music, but have the same prediction it can be concluded that music cannot be relevant.

After considering the first three examples, the general boundary becomes

G3={crime,¬academic}

and S3=S2. Now there are two most general hypotheses; the first is that the user reads anything about crime, and the second is that the user reads anything non-academic.

After considering the first four examples,

G4={crime,¬academic¬local}

and S4=S3.

After considering all five examples,

G5={crime},
S5={crime¬local}.

Thus, after five examples, only two hypotheses exist in the version space. They differ only in their prediction on an example that has crimelocal true. If the target concept can be represented as a conjunction, only an example with crimelocal true will change G or S. This version space can make predictions about all other examples.

The Bias Involved in Version-Space Learning

Recall that a bias is necessary for any learning to generalize beyond the training data. There must have been a bias in Example 7.30 because, after observing only five of the 16 possible assignments to the input variables, an agent was able to make predictions about examples it had not seen.

The bias involved in version-space learning is a called a language bias or a restriction bias because the bias is obtained from restricting the allowable hypotheses. For example, a new example with crime false and music true will be classified as false (the user will not read the article), even though no such example has been seen. The restriction that the hypothesis must be a conjunction of literals is enough to predict its value.

This bias should be contrasted with the bias involved in decision tree learning. A decision tree can represent any Boolean function. Decision tree learning involves a preference bias, in that some Boolean functions are preferred over others; those with smaller decision trees are preferred over those with larger decision trees. A decision tree learning algorithm that builds a single decision tree top-down also involves a search bias in that the decision tree returned depends on the search strategy used.

The candidate elimination algorithm is sometimes said to be an unbiased learning algorithm because the learning algorithm does not impose any bias beyond the language bias involved in choosing . It is easy for the version space to collapse to the empty set, for example, if the user reads an article with crime false and music true. This means that the target concept is not in . Version-space learning is not tolerant to noise; just one misclassified example can throw off the whole system.

The bias-free hypothesis space is where is the set of all Boolean functions. In this case, G always contains one concept: the concept which says that all negative examples have been seen and every other example is positive. Similarly, S contains the single concept which says that all unseen examples are negative. The version space is incapable of concluding anything about examples it has not seen; thus, it cannot generalize. Without a language bias or a preference bias, no generalization and, therefore, no learning will occur.