Intelligence 2E

foundations of computational agents

Rather than enumerating all of the hypotheses, the set of elements of $\mathscr{H}$ consistent with all of the examples can be found more efficiently by imposing some structure on the hypothesis space.

Hypothesis ${h}_{1}$ is a more general hypothesis than hypothesis ${h}_{2}$ if ${h}_{2}$ implies ${h}_{1}$. In this case, ${h}_{2}$ is a more specific hypothesis than ${h}_{1}$. Any hypothesis is both more general than itself and more specific than itself.

The hypothesis ${\mathrm{\neg}}{\mathit{}}{a}{\mathit{}}{c}{\mathit{}}{a}{\mathit{}}{d}{\mathit{}}{e}{\mathit{}}{m}{\mathit{}}{i}{\mathit{}}{c}{\mathrm{\wedge}}{m}{\mathit{}}{u}{\mathit{}}{s}{\mathit{}}{i}{\mathit{}}{c}$ is more specific than ${m}{\mathit{}}{u}{\mathit{}}{s}{\mathit{}}{i}{\mathit{}}{c}$ and is also more specific than ${\mathrm{\neg}}{\mathit{}}{a}{\mathit{}}{c}{\mathit{}}{a}{\mathit{}}{d}{\mathit{}}{e}{\mathit{}}{m}{\mathit{}}{i}{\mathit{}}{c}$. Thus, ${m}{\mathit{}}{u}{\mathit{}}{s}{\mathit{}}{i}{\mathit{}}{c}$ is more general than ${\mathrm{\neg}}{\mathit{}}{a}{\mathit{}}{c}{\mathit{}}{a}{\mathit{}}{d}{\mathit{}}{e}{\mathit{}}{m}{\mathit{}}{i}{\mathit{}}{c}{\mathrm{\wedge}}{m}{\mathit{}}{u}{\mathit{}}{s}{\mathit{}}{i}{\mathit{}}{c}$. The most general hypothesis is ${t}{\mathit{}}{r}{\mathit{}}{u}{\mathit{}}{e}$. The most specific hypothesis is ${f}{\mathit{}}{a}{\mathit{}}{l}{\mathit{}}{s}{\mathit{}}{e}$.

The “more general than” relation forms a partial ordering over the hypothesis space. The version-space algorithm that follows exploits this partial ordering to search for hypotheses that are consistent with the training examples.

Given hypothesis space $\mathscr{H}$ and examples $E$, the version space is the subset of $\mathscr{H}$ that is consistent with the examples.

The general boundary of a version space, $G$, is the set of maximally general members of the version space (i.e., those members of the version space such that no other element of the version space is more general). The specific boundary of a version space, $S$, is the set of maximally specific members of the version space.

These concepts are useful because the general boundary and the specific boundary completely determine the version space, as shown by the following proposition.

The version space is the set of $h\mathrm{\in}\mathrm{H}$ such that $h$ is more general than an element of $S$ and more specific than an element of $G$.

The candidate elimination learner incrementally builds the version space given a hypothesis space $\mathscr{H}$ and a set $E$ of examples. The examples are added one by one; each example possibly shrinks the version space by removing the hypotheses that are inconsistent with the example. The candidate elimination algorithm does this by updating the general and/or the specific boundary for each new example. This is described in Figure 7.21.

Consider how the candidate elimination algorithm handles Example 7.26, where ${\mathrm{H}}$ is the set of conjunctions of literals.

Before it has seen any examples, ${{G}}_{{\mathrm{0}}}{\mathrm{=}}{\mathrm{\{}}{t}{\mathit{}}{r}{\mathit{}}{u}{\mathit{}}{e}{\mathrm{\}}}$ – the user reads everything – and ${{S}}_{{\mathrm{0}}}{\mathrm{=}}{\mathrm{\{}}{f}{\mathit{}}{a}{\mathit{}}{l}{\mathit{}}{s}{\mathit{}}{e}{\mathrm{\}}}$ – the user reads nothing. Note that ${t}{\mathit{}}{r}{\mathit{}}{u}{\mathit{}}{e}$ is the empty conjunction and ${f}{\mathit{}}{a}{\mathit{}}{l}{\mathit{}}{s}{\mathit{}}{e}$ is the conjunction of an atom and its negation.

After considering the first example, ${{a}}_{{\mathrm{1}}}$, ${{G}}_{{\mathrm{1}}}{\mathrm{=}}{\mathrm{\{}}{t}{\mathit{}}{r}{\mathit{}}{u}{\mathit{}}{e}{\mathrm{\}}}$ and

$${{S}}_{{1}}{=}{\{}{c}{}{r}{}{i}{}{m}{}{e}{\wedge}{}{\mathrm{\neg}}{}{a}{}{c}{}{a}{}{d}{}{e}{}{m}{}{i}{}{c}{\wedge}{}{\mathrm{\neg}}{}{l}{}{o}{}{c}{}{a}{}{l}{\wedge}{}{m}{}{u}{}{s}{}{i}{}{c}{\}}{.}$$ |

Thus, the most general hypothesis is that the user reads everything, and the most specific hypothesis is that the user only reads articles exactly like this one.

After considering the first two examples, ${{G}}_{{\mathrm{2}}}{\mathrm{=}}{\mathrm{\{}}{t}{\mathit{}}{r}{\mathit{}}{u}{\mathit{}}{e}{\mathrm{\}}}$ and

$${{S}}_{{2}}{=}{\{}{c}{}{r}{}{i}{}{m}{}{e}{\wedge}{}{\mathrm{\neg}}{}{a}{}{c}{}{a}{}{d}{}{e}{}{m}{}{i}{}{c}{\wedge}{}{\mathrm{\neg}}{}{l}{}{o}{}{c}{}{a}{}{l}{\}}{.}$$ |

Since ${{a}}_{{\mathrm{1}}}$ and ${{a}}_{{\mathrm{2}}}$ disagree on music, but have the same prediction it can be concluded that music cannot be relevant.

After considering the first three examples, the general boundary becomes

$${{G}}_{{3}}{=}{\{}{c}{}{r}{}{i}{}{m}{}{e}{,}{\mathrm{\neg}}{}{a}{}{c}{}{a}{}{d}{}{e}{}{m}{}{i}{}{c}{\}}$$ |

and ${{S}}_{{\mathrm{3}}}{\mathrm{=}}{{S}}_{{\mathrm{2}}}$. Now there are two most general hypotheses; the first is that the user reads anything about crime, and the second is that the user reads anything non-academic.

After considering the first four examples,

$${{G}}_{{4}}{=}{\{}{c}{}{r}{}{i}{}{m}{}{e}{,}{\mathrm{\neg}}{}{a}{}{c}{}{a}{}{d}{}{e}{}{m}{}{i}{}{c}{\wedge}{}{\mathrm{\neg}}{}{l}{}{o}{}{c}{}{a}{}{l}{\}}$$ |

and ${{S}}_{{\mathrm{4}}}{\mathrm{=}}{{S}}_{{\mathrm{3}}}$.

After considering all five examples,

${{G}}_{{5}}{=}{\{}{c}{}{r}{}{i}{}{m}{}{e}{\}}{,}$ | ||

${{S}}_{{5}}{=}{\{}{c}{}{r}{}{i}{}{m}{}{e}{\wedge}{}{\mathrm{\neg}}{}{l}{}{o}{}{c}{}{a}{}{l}{\}}{.}$ |

Thus, after five examples, only two hypotheses exist in the version space. They differ only in their prediction on an example that has ${c}{\mathit{}}{r}{\mathit{}}{i}{\mathit{}}{m}{\mathit{}}{e}{\mathrm{\wedge}}{\mathit{}}{l}{\mathit{}}{o}{\mathit{}}{c}{\mathit{}}{a}{\mathit{}}{l}$ true. If the target concept can be represented as a conjunction, only an example with ${c}{\mathit{}}{r}{\mathit{}}{i}{\mathit{}}{m}{\mathit{}}{e}{\mathrm{\wedge}}{\mathit{}}{l}{\mathit{}}{o}{\mathit{}}{c}{\mathit{}}{a}{\mathit{}}{l}$ true will change ${G}$ or ${S}$. This version space can make predictions about all other examples.

Recall that a bias is necessary for any learning to generalize beyond the training data. There must have been a bias in Example 7.30 because, after observing only five of the 16 possible assignments to the input variables, an agent was able to make predictions about examples it had not seen.

The bias involved in version-space learning is a called a language bias or a restriction bias because the bias is obtained from restricting the allowable hypotheses. For example, a new example with crime false and music true will be classified as false (the user will not read the article), even though no such example has been seen. The restriction that the hypothesis must be a conjunction of literals is enough to predict its value.

This bias should be contrasted with the bias involved in decision tree learning. A decision tree can represent any Boolean function. Decision tree learning involves a preference bias, in that some Boolean functions are preferred over others; those with smaller decision trees are preferred over those with larger decision trees. A decision tree learning algorithm that builds a single decision tree top-down also involves a search bias in that the decision tree returned depends on the search strategy used.

The candidate elimination algorithm is sometimes said to be an unbiased learning algorithm because the learning algorithm does not impose any bias beyond the language bias involved in choosing $\mathscr{H}$. It is easy for the version space to collapse to the empty set, for example, if the user reads an article with crime false and music true. This means that the target concept is not in $\mathscr{H}$. Version-space learning is not tolerant to noise; just one misclassified example can throw off the whole system.

The bias-free hypothesis space is where $\mathscr{H}$ is the set of all Boolean functions. In this case, $G$ always contains one concept: the concept which says that all negative examples have been seen and every other example is positive. Similarly, $S$ contains the single concept which says that all unseen examples are negative. The version space is incapable of concluding anything about examples it has not seen; thus, it cannot generalize. Without a language bias or a preference bias, no generalization and, therefore, no learning will occur.