CS 536A -- Lecture 19 on Mar 22, 2001

Lecture by Anne Condon. Notes by Ho Sen Yung

Last Lecture

Class Prediction

Metric for correlating genes with classes and with each other.
Selecting informative genes
Developing a class predictor: weighted voting (Golub et al.)
Evaluation of class predictors -- Leave-one-out cross validation (LOOCV)

Let D={ (x_i,l_i) } be a training set, where x_i's are samples and l_i's are class labels.

A class predictor takes as input D and a query (sample) x and returns a label l for x.
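As an illustration of this interface and of the leave-one-out cross validation mentioned above, here is a minimal Python sketch. The predictor used is a nearest-neighbour rule as a stand-in; the Euclidean distance, the function names, and the toy expression vectors are assumptions made for the example and are not taken from the lecture.

```python
from math import dist

def nearest_neighbour(D, x):
    """Return the label of the training sample closest to x (Euclidean distance, an assumption)."""
    _, label = min(D, key=lambda pair: dist(pair[0], x))
    return label

def loocv_accuracy(D, predictor):
    """Fraction of samples labelled correctly when each one is held out in turn."""
    correct = 0
    for i, (x, label) in enumerate(D):
        rest = D[:i] + D[i + 1:]           # training set without sample i
        if predictor(rest, x) == label:
            correct += 1
    return correct / len(D)

# Toy expression vectors, made up for illustration.
D = [((1.0, 0.2), "ALL"), ((0.9, 0.1), "ALL"),
     ((0.1, 1.1), "AML"), ((0.2, 0.9), "AML")]
accuracy = loocv_accuracy(D, nearest_neighbour)
```

Any function with the signature predictor(D, x) can be passed to loocv_accuracy in place of nearest_neighbour.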

Algorithms

Nearest Neighbour: Labels x with the same label as its nearest neighbour in D.
Clustering: Partitions samples into groups so as to maximize similarities within a group and maximize distances between groups. Trade-off in the CAST clustering algorithm is controlled by a parameter t.
Large Margin Classifiers:
1. Support Vector Machines (hyperplane and beyond): In the hyperplane case, the samples are partitioned into 2 classes by a hyperplane chosen to maximize the margin, i.e. the distance from the hyperplane to the closest gene expression vector on one side plus the distance to the closest gene expression vector on the other side.
2. Boosters: Construct a sequence of very simple classifiers f₁,f₂,..., where f_i attempts to improve on f_{i-1}. The final classifier is a weighted vote of the f_i's.
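To make the quantity maximized by the hyperplane classifier concrete, the following Python sketch evaluates it for a given hyperplane w·x + b = 0. It only scores a candidate hyperplane and does not search for the best one; the function names and toy samples are assumptions for illustration, and the hyperplane is assumed to have at least one sample strictly on each side.

```python
from math import sqrt

def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, samples):
    """Distance to the closest sample on one side plus distance to the closest on the other."""
    distances = [signed_distance(w, b, x) for x in samples]
    positive_side = [d for d in distances if d > 0]
    negative_side = [-d for d in distances if d < 0]
    return min(positive_side) + min(negative_side)

# Toy 2-gene expression vectors, made up for illustration.
samples = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (4.0, 1.0)]
vertical = margin((1.0, 0.0), -1.5, samples)   # hyperplane x0 = 1.5
diagonal = margin((1.0, 1.0), -2.0, samples)   # hyperplane x0 + x1 = 2
```

On this toy data the vertical hyperplane has the larger margin, so it would be preferred over the diagonal one.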

Clustering algorithm: CAST

While there are unclustered elements do

    Pick one unclustered element

    Add it to a new cluster C

    Repeat ADD and REMOVE until no change occurs:

        ADD:	Add an unclustered element v with maximum similarity to C if `sim`(v,C) > t |C| (where `sim`(v,C) is the sum of correlations with samples in C)
        REMOVE:	Remove an element u with minimum similarity from C if `sim`(u,C) < t |C|

    Add C to the set of clusters
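The loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the similarity matrix is precomputed and symmetric with 1 on the diagonal, the seed is the lowest-numbered unclustered element, a removed element returns to the unclustered pool, and max_rounds is a safeguard against oscillation that is not part of the notes.

```python
def affinity(sim, v, C):
    """Sum of similarities between element v and every member of cluster C."""
    return sum(sim[v][u] for u in C)

def cast(sim, t, max_rounds=1000):
    """Cluster elements 0..n-1 given a similarity matrix sim and threshold t."""
    unclustered = set(range(len(sim)))
    clusters = []
    while unclustered:
        seed = min(unclustered)                 # assumption: pick the lowest index
        unclustered.remove(seed)
        C = {seed}
        for _ in range(max_rounds):             # safeguard, not in the notes
            changed = False
            if unclustered:                     # ADD step
                v = max(sorted(unclustered), key=lambda e: affinity(sim, e, C))
                if affinity(sim, v, C) > t * len(C):
                    unclustered.remove(v)
                    C.add(v)
                    changed = True
            if len(C) > 1:                      # REMOVE step
                u = min(sorted(C), key=lambda e: affinity(sim, e, C))
                if affinity(sim, u, C) < t * len(C):
                    C.remove(u)
                    unclustered.add(u)
                    changed = True
            if not changed:
                break
        clusters.append(C)
    return clusters

# Toy similarity matrix, made up for illustration: elements 0,1 and 2,3 are alike.
sim = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.8],
       [0.1, 0.1, 0.8, 1.0]]
clusters = cast(sim, 0.5)
```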

Compatibility of a clustering is the number of pairs that have the same label and are assigned to the same cluster plus the number of pairs that have different labels and are assigned to different clusters.
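The compatibility score can be computed directly from this definition. A minimal Python sketch, where labels[i] and clusters[i] are the label and cluster id of sample i (these argument names are assumptions for the example):

```python
from itertools import combinations

def compatibility(labels, clusters):
    """Count pairs that agree: same label and same cluster, or different label and different cluster."""
    score = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_label == same_cluster:
            score += 1
    return score

labels = ["ALL", "ALL", "AML", "AML"]
perfect = compatibility(labels, [0, 0, 1, 1])   # clustering matches the labels
flawed = compatibility(labels, [0, 0, 0, 1])    # one AML sample placed with the ALL samples
```

With 4 samples there are 6 pairs, so a clustering that matches the labels exactly scores 6.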

CAST does a binary search for a good t.

Predictor: CAST is run on D (with labels removed) and on x. The label for x is the majority label in its cluster.

Large Margin Classifier: Boosters

A simple classifier is described by a gene g, a threshold t, and a direction (< , >)

A classifier outputs label: 'ALL' if expression level of g in sample x is > t; 'AML' if expression level of g in sample x is < t

Quality of a simple classifier, relative to a probability distribution on training samples, is the weighted sum of correct predictions, where weights are probabilities.
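A simple classifier and its quality can be sketched in Python as below. The function names and toy data are assumptions for illustration. The notes do not say what happens when the expression level equals t, so the sketch treats that case as "not greater than t", and the direction "<" is taken to swap the two labels.

```python
def simple_classify(g, t, direction, x):
    """Label sample x from the expression level of gene g compared with threshold t."""
    above = x[g] > t                      # assumption: a level equal to t counts as not above
    if direction == ">":
        return "ALL" if above else "AML"
    return "AML" if above else "ALL"      # direction "<": labels swapped

def quality(g, t, direction, D, weights):
    """Sum of the weights of the training samples that are classified correctly."""
    return sum(w for (x, label), w in zip(D, weights)
               if simple_classify(g, t, direction, x) == label)

# Toy single-gene samples with equal weights, made up for illustration.
D = [((5.0,), "ALL"), ((4.0,), "ALL"), ((1.0,), "AML"), ((4.5,), "AML")]
weights = [0.25, 0.25, 0.25, 0.25]
q_greater = quality(0, 3.0, ">", D, weights)
q_less = quality(0, 3.0, "<", D, weights)
```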

f₁ is an optimal classifier for the initial, equal weights on training samples
Reweight: give higher weights to samples incorrectly classified by f_i
Use the new weights to find an optimal classifier f_{i+1}
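The three steps can be sketched in Python as follows. The notes do not give the reweighting formula or the vote weights, so this sketch assumes the standard AdaBoost update; the encoding of the two classes as +1 and -1 (say ALL and AML), the choice of midpoints as candidate thresholds, and all function names are also assumptions. It expects at least one gene with two distinct values.

```python
from math import exp, log

def stump(g, t, sign):
    """Simple classifier: +1 if sign * (x[g] - t) > 0, otherwise -1."""
    return lambda x: 1 if sign * (x[g] - t) > 0 else -1

def best_stump(X, y, w):
    """Exhaustively pick the gene, threshold and direction with the highest weighted accuracy."""
    best, best_q = None, -1.0
    for g in range(len(X[0])):
        values = sorted({x[g] for x in X})
        for lo, hi in zip(values, values[1:]):      # thresholds: midpoints of observed levels
            for sign in (1, -1):
                f = stump(g, (lo + hi) / 2, sign)
                q = sum(wi for x, yi, wi in zip(X, y, w) if f(x) == yi)
                if q > best_q:
                    best, best_q = f, q
    return best, best_q

def boost(X, y, rounds):
    """Return a classifier that is a weighted vote of `rounds` simple classifiers."""
    w = [1.0 / len(X)] * len(X)                     # initial, equal weights
    ensemble = []
    for _ in range(rounds):
        f, q = best_stump(X, y, w)
        err = min(max(1.0 - q, 1e-10), 1 - 1e-10)   # clamp to keep the logarithm finite
        alpha = 0.5 * log((1 - err) / err)          # AdaBoost vote weight (assumption)
        ensemble.append((alpha, f))
        # Misclassified samples (y * f(x) = -1) get larger weights.
        w = [wi * exp(-alpha * yi * f(x)) for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: 1 if sum(a * f(x) for a, f in ensemble) > 0 else -1

# Toy 2-gene samples, made up for illustration; gene 1 separates the classes.
X = [(2.0, 9.0), (7.0, 8.0), (3.0, 1.0), (6.0, 2.0)]
y = [1, 1, -1, -1]
classifier = boost(X, y, rounds=3)
```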

Evaluation

Algorithm	Colon (% correct)	Ovarian (% correct)
Clustering	88.7	42.9
Nearest Neighbour	80.6	71.4
Support Vector Machine (linear hyperplane)	77.4	67.9
Boosters (100 iterations)	72.6	89.3

Class Discovery

Find classes into which unlabelled samples fit. This is clustering.

Golub et al.: Self-organising maps (SOM,1997)

User specifies how many clusters (2 clusters, based on all 6817 genes)
Iterates to find a (optimal?) set of centroids around which the data cluster
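The iteration can be illustrated with a very small self-organising map in Python. This is a sketch under stated assumptions, not the procedure used by Golub et al.: the nodes sit on a 1-D grid, centroids are initialised from evenly spaced samples, samples are visited in a fixed order, and the learning rate and neighbourhood width (lr0, sigma0, decay) are arbitrary illustrative values. It needs at least 2 nodes.

```python
from math import dist, exp

def train_som(samples, n_nodes=2, epochs=20, lr0=0.5, sigma0=0.5, decay=0.9):
    """Train a 1-D self-organising map and return the list of node centroids."""
    step = (len(samples) - 1) / (n_nodes - 1)
    nodes = [list(samples[round(k * step)]) for k in range(n_nodes)]   # deterministic init
    lr, sigma = lr0, sigma0
    for _ in range(epochs):
        for x in samples:
            # Best-matching unit: the node whose centroid is closest to x.
            bmu = min(range(n_nodes), key=lambda k: dist(nodes[k], x))
            for k in range(n_nodes):
                # Grid neighbours of the best-matching unit are pulled along more weakly.
                h = exp(-((k - bmu) ** 2) / (2 * sigma ** 2))
                nodes[k] = [c + lr * h * (xi - c) for c, xi in zip(nodes[k], x)]
        lr *= decay
        sigma *= decay
    return nodes

def assign(nodes, x):
    """Index of the node (cluster) whose centroid is closest to x."""
    return min(range(len(nodes)), key=lambda k: dist(nodes[k], x))

# Toy samples forming two well-separated groups, made up for illustration.
samples = [(0.0, 0.0), (0.2, 0.2), (0.1, 0.1), (5.0, 5.0), (5.2, 5.2), (4.9, 4.9)]
nodes = train_som(samples)
assignments = [assign(nodes, x) for x in samples]
```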

One cluster contained 25 samples, 24 of which were ALL. The other cluster contained mostly AML samples.

Evaluation

Class prediction was used to evaluate on 34 new samples: 30 accurate predictions, 3 errors, 3 unknown.
Cross-validation

Reading Assignment: Clustering algorithms survey paper by Shamir and Sharon

SAGE (Serial Analysis of Gene Expression) 1995

Key ideas:

A short sequence tag (10-14 bases) contains sufficient information to uniquely identify a transcript (mRNA, cDNA, etc.)
Sequence tags can be linked together to form long molecules that can be cloned and sequenced
Quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript
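The counting step in the last key idea can be sketched in Python. The sketch is deliberately simplified: it assumes the sequenced molecule is a plain concatenation of fixed-length tags with no separator sequences between them, which is an assumption for illustration and not a description of the laboratory protocol. The function name and the toy sequence are made up.

```python
from collections import Counter

def count_tags(concatemer, tag_length=10):
    """Split a sequenced concatemer into consecutive fixed-length tags and count each tag."""
    tags = [concatemer[i:i + tag_length]
            for i in range(0, len(concatemer) - tag_length + 1, tag_length)]
    return Counter(tags)

# Toy concatemer built from three 10-base tags, two of them identical.
concatemer = "AAAAACCCCC" + "GGGGGTTTTT" + "AAAAACCCCC"
counts = count_tags(concatemer)
```

Under these assumptions the count of a tag serves as the estimate of the expression level of the transcript it identifies.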

2 sources of errors:

Tags may not always uniquely identify transcripts
One transcript may have more than one tag
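Both sources of error can be detected from a table that maps transcripts to their tags. A minimal Python sketch follows; the function name, the transcript names (geneA, geneB, geneC) and the tags are hypothetical and chosen only to trigger both cases.

```python
from collections import defaultdict

def tag_ambiguity(transcript_tags):
    """Return (tags shared by several transcripts, transcripts that have several tags)."""
    tag_to_transcripts = defaultdict(set)
    for transcript, tags in transcript_tags.items():
        for tag in tags:
            tag_to_transcripts[tag].add(transcript)
    shared_tags = {tag: sorted(names)
                   for tag, names in tag_to_transcripts.items() if len(names) > 1}
    multi_tag_transcripts = sorted(name for name, tags in transcript_tags.items()
                                   if len(set(tags)) > 1)
    return shared_tags, multi_tag_transcripts

# Hypothetical transcript-to-tag table.
transcript_tags = {
    "geneA": ["AAAAACCCCC"],
    "geneB": ["AAAAACCCCC", "GGGGGTTTTT"],
    "geneC": ["TTTTTGGGGG"],
}
shared_tags, multi_tag_transcripts = tag_ambiguity(transcript_tags)
```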