CS 536A -- Lecture 19 on Mar 22, 2001

Lecture by Anne Condon. Notes by Ho Sen Yung
Last Lecture

Class Prediction

Let D={ (xi,li) } be a training set, where xi's are samples and li's are class labels.

A class predictor takes as input D and a query (sample) x and returns a label l for x.


  1. Nearest Neighbour: Labels x with the same label as its nearest neighbour in D.
  2. Clustering: Partitions samples into groups so as to maximize similarity within each group and the separation between groups. In the CAST clustering algorithm, this trade-off is controlled by a parameter t.

    Large Margin Classifiers:

    1. Support Vector Machines (hyperplane and beyond): In the hyperplane case, the samples are partitioned into two classes by a hyperplane chosen to maximize the margin: the sum of the distances to the hyperplane from the closest gene expression vector on each side of it.
    2. Boosters: Constructs a sequence of very simple classifiers f1,f2,..., where fi attempts to improve fi-1. The final classifier is a weighted vote of fi's.
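The nearest-neighbour predictor from item 1 above can be sketched in a few lines. This is an illustrative implementation, not code from the lecture; Euclidean distance is assumed, since the notes do not fix a metric.

```python
import math

def nearest_neighbour(D, x):
    """Label the query x with the label of its nearest neighbour in D.

    D is the training set: a list of (sample, label) pairs, where each
    sample is a numeric vector (e.g. a gene expression vector).
    Euclidean distance is an assumption; any metric would do.
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # The training pair whose sample is closest to x determines the label.
    _, label = min(D, key=lambda pair: dist(pair[0], x))
    return label
```

For example, with two training samples labelled 'ALL' and 'AML', a query is labelled by whichever it lies closer to.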

Clustering algorithm: CAST

While there are unclustered elements do:
    Pick an unclustered element
    Add it to a new cluster C
    Repeat ADD and REMOVE until no change occurs:
        ADD: add an unclustered element v with maximum similarity to C if sim(v,C) > t |C|
             (where sim(v,C) is the sum of correlations between v and the samples in C)
        REMOVE: remove an element u of C with minimum similarity to C if sim(u,C) < t |C|
    Add C to the set of clusters
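The loop above can be sketched directly in Python. This is a minimal sketch of the CAST loop as stated in the notes, not a full implementation (the original algorithm of Ben-Dor et al. has additional details); `sim` is any pairwise similarity function, e.g. correlation.

```python
def cast(samples, sim, t):
    """Sketch of the CAST clustering loop from the notes.

    samples: list of element identifiers
    sim(u, v): pairwise similarity between two elements
    t: affinity threshold controlling the within/between trade-off
    Returns a list of clusters (each a list of elements).
    """
    unclustered = set(samples)
    clusters = []
    while unclustered:
        seed = next(iter(unclustered))   # pick one unclustered element
        C = [seed]                       # add it to a new cluster C
        unclustered.remove(seed)

        def affinity(v):
            # sim(v, C): summed similarity of v to the members of C
            return sum(sim(v, u) for u in C)

        changed = True
        while changed:                   # repeat ADD and REMOVE until stable
            changed = False
            # ADD: highest-affinity unclustered element, if above t|C|
            if unclustered:
                v = max(unclustered, key=affinity)
                if affinity(v) > t * len(C):
                    C.append(v)
                    unclustered.remove(v)
                    changed = True
            # REMOVE: lowest-affinity cluster member, if below t|C|
            if len(C) > 1:
                u = min(C, key=lambda w: sum(sim(w, z) for z in C if z != w))
                if sum(sim(u, z) for z in C if z != u) < t * len(C):
                    C.remove(u)
                    unclustered.add(u)
                    changed = True
        clusters.append(C)               # add C to the set of clusters
    return clusters
```

With a similarity of 1 within a true group and 0 across groups, and t = 0.5, the sketch recovers the groups.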

Compatibility of a clustering is the number of pairs that have the same label and are assigned to the same cluster plus the number of pairs that have different labels and are assigned to different clusters.
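Compatibility can be computed by checking each pair of samples. A minimal sketch, with illustrative argument names:

```python
from itertools import combinations

def compatibility(labels, cluster_of):
    """Count the pairs on which a clustering agrees with the true labels.

    labels[i] is the class label of sample i; cluster_of[i] is the
    cluster it was assigned to. A pair agrees if it has the same label
    and the same cluster, or a different label and a different cluster.
    """
    agree = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = cluster_of[i] == cluster_of[j]
        if same_label == same_cluster:
            agree += 1
    return agree
```

For three samples with labels ALL, ALL, AML and clusters {0, 1} and {2}, all three pairs agree, so the compatibility is 3.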

CAST does a binary search for a good t.

Predictor: CAST is run on D (with labels removed) and on x. The label for x is the majority label in its cluster.
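The final step of this predictor, taking the majority label in x's cluster, is just a vote over the labelled samples that landed in the same cluster as x:

```python
from collections import Counter

def majority_label(cluster_labels):
    """Return the most common label among the labelled training samples
    in x's cluster (x itself contributes no label)."""
    return Counter(cluster_labels).most_common(1)[0][0]
```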

Large Margin Classifier: Boosters

A simple classifier is described by a gene g, a threshold t, and a direction (< or >).

A simple classifier outputs the label
    'ALL' if the expression level of g in sample x is > t
    'AML' if the expression level of g in sample x is < t
(for direction >; the inequalities flip for direction <).

Quality of a simple classifier, relative to a probability distribution on the training samples, is the weighted sum of correct predictions, where the weights are the sample probabilities.

Algorithm                                     % correct
Nearest Neighbour                             80.6    71.4
Support Vector Machine (linear hyperplane)
Boosters (100 iterations)                     72.6    89.3

Class Discovery

Find classes into which unlabelled samples fit. This is clustering.

Golub et al.: Self-organising maps (SOM,1997)

One cluster contained 25 samples, 24 of which were ALL. The other cluster contained mostly AML samples.


Reading Assignment: Clustering algorithms survey paper by Shamir and Sharan
SAGE (Serial Analysis of Gene Expression) 1995

Key ideas:

2 sources of errors: