CS 536A  Lecture 19 on Mar 22, 2001
Lecture by Anne Condon. Notes by Ho Sen Yung
Last Lecture
Class Prediction
 Metric for correlating genes with classes and with each other.
 Selecting informative genes
 Developing a class predictor: weighted voting (Golub et al.)
 Evaluation of class predictors 
Leave-one-out cross validation (LOOCV)
Let D={ (x_{i},l_{i}) } be a training set,
where x_{i}'s are samples and
l_{i}'s are class labels.
A class predictor takes as input D and a query (sample) x
and returns a label l for x.
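The LOOCV scheme can be sketched as follows; `predict` stands for any class predictor of the form just described, and the function names are illustrative, not from the lecture.

```python
# Leave-one-out cross validation: hold out each sample in turn,
# train the predictor on the rest, and check the held-out label.

def loocv_accuracy(D, predict):
    """D is a list of (sample, label) pairs; predict(D', x) returns a label."""
    correct = 0
    for i, (x, label) in enumerate(D):
        held_out = D[:i] + D[i + 1:]       # training set without sample i
        if predict(held_out, x) == label:
            correct += 1
    return correct / len(D)
```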
Algorithms
 Nearest Neighbour:
Labels x with the same label as its nearest neighbour in D.
 Clustering: Partitions samples into groups so as to
maximize similarities within a group and
maximize distances between groups. The tradeoff in the
CAST clustering algorithm is controlled by a parameter t.
Large Margin Classifiers:
 Support Vector Machines (hyperplane and beyond):
In the hyperplane case, it partitions the samples into 2 classes by using
a hyperplane that maximizes the margin: the sum of the distances to the
hyperplane from the closest gene expression vector on one side of
the hyperplane and from the closest gene expression vector on the
other side of the hyperplane.
 Boosters:
Construct a sequence of very simple classifiers
f_{1},f_{2},..., where
f_{i} attempts to improve on f_{i-1}.
The final classifier is a weighted vote of f_{i}'s.
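The nearest-neighbour rule above can be sketched as follows; Euclidean distance is assumed here, though any of the metrics from the last lecture could be substituted.

```python
import math

def nearest_neighbour(D, x):
    """Label x with the label of its nearest neighbour in D (Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    _, label = min(D, key=lambda pair: dist(pair[0], x))
    return label
```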
Clustering algorithm: CAST

While there are unclustered elements do
 Pick one unclustered element
 Add it to a new cluster C
 Repeat ADD and REMOVE until no change occurs:

ADD:
 Add an unclustered element v with maximum similarity to C
if sim(v,C) > t·|C|
(where sim(v,C) is the sum of correlations of v with samples in C,
and |C| is the number of samples in C)

REMOVE:
 Remove an element u with minimum similarity from C
if sim(u,C) < t·|C|

 Add C to the set of clusters
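The loop above can be transcribed directly. The similarity table represented as a dict keyed on element pairs, and the guard against emptying a singleton cluster, are assumptions of this sketch.

```python
def cast(elements, sim, t):
    """Sketch of the CAST loop: sim maps frozenset pairs to similarities."""
    def affinity(v, cluster):
        return sum(sim[frozenset((v, u))] for u in cluster if u != v)

    unclustered = set(elements)
    clusters = []
    while unclustered:
        cluster = {unclustered.pop()}          # seed a new cluster
        changed = True
        while changed:
            changed = False
            # ADD: unclustered element with maximum similarity to the cluster
            if unclustered:
                v = max(unclustered, key=lambda e: affinity(e, cluster))
                if affinity(v, cluster) > t * len(cluster):
                    cluster.add(v)
                    unclustered.discard(v)
                    changed = True
            # REMOVE: cluster element with minimum similarity to the cluster
            u = min(cluster, key=lambda e: affinity(e, cluster))
            if len(cluster) > 1 and affinity(u, cluster) < t * len(cluster):
                cluster.remove(u)
                unclustered.add(u)
                changed = True
        clusters.append(cluster)
    return clusters
```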
Compatibility of a clustering is the number of pairs
that have the same label and are assigned to the same cluster plus the
number of pairs that have different labels and are assigned to different
clusters.
CAST does a binary search for a good t.
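Compatibility, as defined above, can be computed by checking every pair of samples; the `cluster_of` representation (sample to cluster id) is an assumption of this sketch.

```python
from itertools import combinations

def compatibility(labels, cluster_of):
    """Count pairs that agree: same label and same cluster, or
    different labels and different clusters."""
    agree = 0
    for a, b in combinations(labels, 2):
        same_label = labels[a] == labels[b]
        same_cluster = cluster_of[a] == cluster_of[b]
        if same_label == same_cluster:
            agree += 1
    return agree
```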
Predictor: CAST is run on D (with labels removed) and on x.
The label for x is the majority label in its cluster.
Large Margin Classifier: Boosters
A simple classifier is described by a gene g, a threshold t, and
a direction (< or >); the direction determines which side of the
threshold maps to which class.
 With direction >, a classifier outputs label
  'ALL' if the expression level of g in sample x is > t
  'AML' if the expression level of g in sample x is < t
Quality of a simple classifier, relative to a probability distribution on
training samples, is the weighted sum of correct predictions, where
weights are probabilities.
 f_{1} is an optimal classifier for the initial, equal
weights on training samples
 Reweight: give higher weights to samples incorrectly classified by
f_{i}
 Use new weights to find an optimal classifier as f_{i+1}
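The loop above can be sketched as follows. Doubling the weights of misclassified samples is a simplification (AdaBoost derives a specific multiplicative factor), and all helper names here are illustrative, not from Golub et al.

```python
def best_simple_classifier(samples, labels, weights):
    """Exhaustively pick (gene, threshold, direction) maximizing the
    weighted sum of correct predictions."""
    best, best_score = None, -1.0
    for g in range(len(samples[0])):
        for t in sorted(set(s[g] for s in samples)):
            for direction in ('>', '<'):
                clf = (g, t, direction)
                score = sum(w for x, label, w in zip(samples, labels, weights)
                            if predict_simple(clf, x) == label)
                if score > best_score:
                    best, best_score = clf, score
    return best

def predict_simple(clf, x):
    """Apply a one-gene threshold classifier."""
    g, t, direction = clf
    above = x[g] > t
    return 'ALL' if (above if direction == '>' else not above) else 'AML'

def boost(samples, labels, rounds=3):
    """Reweighting loop: start with equal weights, then give higher
    weight to samples the current classifier gets wrong."""
    n = len(samples)
    weights = [1.0 / n] * n
    classifiers = []
    for _ in range(rounds):
        clf = best_simple_classifier(samples, labels, weights)
        classifiers.append(clf)
        weights = [w * (2.0 if predict_simple(clf, x) != label else 1.0)
                   for x, label, w in zip(samples, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return classifiers
```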
Evaluation
Algorithms                                    % correct
                                            Colon   Ovarian
Clustering                                   88.7     42.9
Nearest Neighbour                            80.6     71.4
Support Vector Machine (linear hyperplane)   77.4     67.9
Boosters (100 iterations)                    72.6     89.3
Class Discovery
Find classes into which unlabelled samples fit. This is clustering.
Golub et al.: Self-organising maps (SOM, 1997)
 User specifies how many clusters (2 clusters, based on all 6187 genes)
 Iterates to find a (optimal?) set of centroids around
which the data cluster
One cluster contained 25 samples, 24 of which were ALL.
The other cluster contained mostly AML samples.
Evaluation
 Class prediction was used to evaluate on 34 new samples:
30 accurate predictions, 3 errors, 3 unknown.
 Cross-validation
Reading Assignment: Clustering algorithms survey paper by Shamir and Sharan
SAGE (Serial Analysis of Gene Expression) 1995
Key ideas:
 A short sequence tag (10-14 bases) contains sufficient information to
uniquely identify a transcript (mRNA, cDNA, etc.)
 Sequence tags can be linked together to form long molecules that can be
cloned and sequenced
 Quantitation of the number of times a particular tag is observed provides
the expression level of the corresponding transcript
2 sources of errors:
 Tags may not always uniquely identify transcripts
 One transcript may have more than one tag
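Tag counting can be sketched as below, assuming the concatemer has already been sequenced and that tags have a fixed length; real SAGE extracts tags at anchoring-enzyme sites, so the fixed-length split is a simplification.

```python
from collections import Counter

def count_tags(concatemer, tag_length=10):
    """Split a concatenated tag sequence into fixed-length tags and tally
    each one; counts estimate the expression level of each transcript."""
    tags = [concatemer[i:i + tag_length]
            for i in range(0, len(concatemer) - tag_length + 1, tag_length)]
    return Counter(tags)
```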