# CS 536A -- Lecture 19 on Mar 22, 2001

Lecture by Anne Condon. Notes by Ho Sen Yung.

## Last Lecture: Class Prediction

• Metric for correlating genes with classes and with each other.
• Selecting informative genes
• Developing a class predictor: weighted voting (Golub et al.)
• Evaluation of class predictors -- Leave-one-out cross validation (LOOCV)

Let D = { (x_i, l_i) } be a training set, where the x_i's are samples and the l_i's are class labels.

A class predictor takes as input D and a query (sample) x and returns a label l for x.

## Algorithms

1. Nearest Neighbour: Labels x with the same label as its nearest neighbour in D.
2. Clustering: Partitions samples into groups so as to maximize similarities within a group and maximize distances between groups. The trade-off between these two goals in the CAST clustering algorithm is controlled by a parameter t.
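The nearest-neighbour rule can be sketched as follows. This is a minimal version that assumes Euclidean distance; the notes' correlation-based metric could be substituted, and the function names are illustrative:

```python
import math

def nearest_neighbour_label(D, x):
    """Label query x with the label of its nearest neighbour in D.

    D is a list of (sample, label) pairs; samples are equal-length
    vectors of expression levels. Euclidean distance is assumed here.
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # find the training pair whose sample is closest to x
    _, label = min(D, key=lambda pair: dist(pair[0], x))
    return label
```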

Large Margin Classifiers:

1. Support Vector Machines (hyperplane and beyond): In the hyperplane case, the samples are partitioned into 2 classes by a hyperplane chosen to maximize the margin: the sum of the distances to the hyperplane from the closest gene expression vector on one side and from the closest gene expression vector on the other side.
2. Boosters: Construct a sequence of very simple classifiers f1, f2, ..., where each fi attempts to improve on fi-1. The final classifier is a weighted vote of the fi's.

### Clustering algorithm: CAST

    While there are unclustered elements do
        Pick one unclustered element
        Add it to a new cluster C
        Repeat ADD and REMOVE until no change occurs:
            ADD: add an unclustered element v with maximum similarity to C
                 if sim(v,C) > t|C|, where sim(v,C) is the sum of
                 correlations of v with the samples in C
            REMOVE: remove an element u with minimum similarity to C
                 if sim(u,C) < t|C|
        Add C to the set of clusters
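A minimal sketch of the CAST loop, following the pseudocode literally. The element representation and the similarity-function interface are assumptions, and no optimization is attempted:

```python
def cast(elements, sim, t):
    """Sketch of the CAST clustering loop.

    elements: list of hashable sample ids; sim(u, v): pairwise
    similarity (e.g. correlation); t: affinity threshold.
    Returns a list of clusters (lists of ids).
    """
    unclustered = set(elements)
    clusters = []
    while unclustered:
        C = [unclustered.pop()]          # new cluster seeded arbitrarily
        changed = True
        while changed:
            changed = False
            # ADD: unclustered element with maximum total similarity to C
            if unclustered:
                v = max(unclustered, key=lambda u: sum(sim(u, c) for c in C))
                if sum(sim(v, c) for c in C) > t * len(C):
                    unclustered.remove(v)
                    C.append(v)
                    changed = True
            # REMOVE: member with minimum total similarity to C
            if len(C) > 1:               # guard: keep at least the seed
                u = min(C, key=lambda w: sum(sim(w, c) for c in C))
                if sum(sim(u, c) for c in C) < t * len(C):
                    C.remove(u)
                    unclustered.add(u)
                    changed = True
        clusters.append(C)
    return clusters
```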

The compatibility of a clustering is the number of pairs of samples that have the same label and are assigned to the same cluster, plus the number of pairs that have different labels and are assigned to different clusters.
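The compatibility count can be computed directly over all pairs. A sketch; the dict-based inputs are an assumed representation, not from the notes:

```python
from itertools import combinations

def compatibility(labels, cluster_of):
    """Count pairs that agree with the clustering: same label and same
    cluster, or different label and different cluster.

    labels and cluster_of are dicts keyed by sample id.
    """
    agree = 0
    for u, v in combinations(labels, 2):
        same_label = labels[u] == labels[v]
        same_cluster = cluster_of[u] == cluster_of[v]
        if same_label == same_cluster:   # pair agrees either way
            agree += 1
    return agree
```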

CAST does a binary search for a good t.

Predictor: CAST is run on D (with labels removed) together with the query x. The label assigned to x is the majority label in its cluster.

### Large Margin Classifier: Boosters

A simple classifier is described by a gene g, a threshold t, and a direction (< or >).

Such a classifier outputs label
'ALL' if the expression level of g in sample x is > t
'AML' if the expression level of g in sample x is < t
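A simple classifier of this form can be written as a small factory function. An illustrative sketch; the gene index and direction encoding are assumptions:

```python
def simple_classifier(g, t, direction):
    """Build a simple classifier from a gene index g, a threshold t,
    and a direction ('>' or '<').

    With direction '>', expression of g above t means 'ALL';
    with direction '<', expression of g below t means 'ALL'.
    Returns a function mapping an expression vector x to a label.
    """
    def classify(x):
        if direction == '>':
            return 'ALL' if x[g] > t else 'AML'
        return 'ALL' if x[g] < t else 'AML'
    return classify
```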

The quality of a simple classifier, relative to a probability distribution on the training samples, is the weighted sum of correct predictions, where the weights are the probabilities.

• f1 is an optimal simple classifier for the initial, equal weights on the training samples
• Reweight: give higher weights to samples incorrectly classified by fi
• Use the new weights to find an optimal simple classifier fi+1
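The steps above can be sketched as a boosting loop. The notes do not give the exact reweighting rule, so this sketch uses the standard AdaBoost exponential update with ±1 labels as one concrete choice:

```python
import math

def boost(samples, labels, stumps, rounds):
    """Boosting loop sketch (AdaBoost-style update, an assumption).

    samples: list of expression vectors; labels: +1/-1 class labels;
    stumps: candidate simple classifiers, each a function x -> +1/-1.
    Returns the final weighted-vote classifier.
    """
    n = len(samples)
    w = [1.0 / n] * n                        # initial, equal weights
    ensemble = []                            # (alpha, classifier) pairs

    for _ in range(rounds):
        # pick the stump with the lowest weighted error under w
        def weighted_error(f):
            return sum(wi for wi, x, y in zip(w, samples, labels) if f(x) != y)
        f = min(stumps, key=weighted_error)
        err = max(weighted_error(f), 1e-12)  # avoid log(0)
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f))
        # reweight: misclassified samples get larger weights
        w = [wi * math.exp(-alpha * y * f(x))
             for wi, x, y in zip(w, samples, labels)]
        z = sum(w)
        w = [wi / z for wi in w]             # renormalize to a distribution

    def final(x):                            # weighted vote of the fi's
        s = sum(alpha * f(x) for alpha, f in ensemble)
        return 1 if s >= 0 else -1
    return final
```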
### Evaluation

| Algorithm | % correct (Colon) | % correct (Ovarian) |
| --- | --- | --- |
| Clustering | 88.7 | 42.9 |
| Nearest Neighbour | 80.6 | 71.4 |
| Support Vector Machine (linear hyperplane) | 77.4 | 67.9 |
| Boosters (100 iterations) | 72.6 | 89.3 |

## Class Discovery

Find classes into which unlabelled samples fit. This is clustering.

Golub et al.: Self-organising maps (SOM, 1997)

• The user specifies the number of clusters (here 2 clusters, based on all 6187 genes)
• Iterates to find an (optimal?) set of centroids around which the data cluster
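The centroid-iteration idea can be illustrated with a k-means-style sketch for 2 clusters. This is a simplification, not Golub et al.'s actual SOM, which also applies neighbourhood updates over a grid of centroids:

```python
def two_centroids(samples, iterations=10):
    """K-means-style centroid iteration with k=2 (a simplified stand-in
    for the SOM idea). samples: list of equal-length vectors.
    Returns the final centroids and the final grouping.
    """
    # start centroids at the first and last sample (arbitrary choice)
    c = [list(samples[0]), list(samples[-1])]
    for _ in range(iterations):
        groups = ([], [])
        for x in samples:
            # assign each sample to its nearest centroid (squared distance)
            d = [sum((xi - ci) ** 2 for xi, ci in zip(x, cj)) for cj in c]
            groups[d.index(min(d))].append(x)
        for j, g in enumerate(groups):
            if g:  # move each centroid to the mean of its group
                c[j] = [sum(col) / len(g) for col in zip(*g)]
    return c, groups
```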

One cluster contained 25 samples, 24 of which were ALL. The other cluster contained mostly AML samples.

Evaluation

• Class prediction was evaluated on 34 new samples: 30 accurate predictions, 3 errors, 3 unknown.
• Cross-validation

Reading Assignment: Clustering algorithms survey paper by Shamir and Sharan

## SAGE (Serial Analysis of Gene Expression), 1995

Key ideas:

• A short sequence tag (10-14 bases) contains sufficient information to uniquely identify a transcript (mRNA, cDNA, etc.)
• Sequence tags can be linked together to form long molecules that can be cloned and sequenced
• Quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript

2 sources of errors:

• Tags may not always uniquely identify transcripts
• One transcript may have more than one tag
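The quantitation step can be sketched as counting fixed-length tags in a sequenced concatemer. A toy version that ignores the anchoring-enzyme punctuation between tags used in real SAGE:

```python
from collections import Counter

def tag_counts(concatemer, tag_length=10):
    """Split a sequenced concatemer into fixed-length tags and count
    how often each tag occurs; the counts estimate expression levels
    of the corresponding transcripts.
    """
    tags = [concatemer[i:i + tag_length]
            for i in range(0, len(concatemer) - tag_length + 1, tag_length)]
    return Counter(tags)
```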