In this paper we present a method for clustering SAGE (Serial
Analysis of Gene Expression) data to detect similarities and
dissimilarities between different types of cancer on the
sub-cellular level. The data, however, is extremely
high-dimensional, and due to the method of measurement, it contains
many errors and missing values, which challenges any clustering
algorithm. Therefore, we introduce special pre-processing
techniques to reduce these errors and to restore missing data. These
techniques are tailored to the process that generates the data,
making only very conservative changes. Furthermore, we present a new
subspace selection technique to identify a relevant subset of
attributes (genes) using the Wilcoxon test. This is a general
technique that can be applied to select subspaces for the purpose of
clustering whenever some high-level categories of interest are known
for the data (such as cancerous and non-cancerous). Finally, we
discuss the results of the application of the clustering algorithm
OPTICS to the SAGE data, before and after our pre-processing steps.
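
To make the idea of test-based subspace selection concrete, the following is a minimal sketch of how genes might be filtered with a Wilcoxon rank-sum test when two high-level categories are known. It is an illustration under stated assumptions, not the procedure used in the paper: the function names (rank_sum_z, two_sided_p, select_genes), the normal approximation without tie correction, and the significance threshold alpha are all hypothetical choices made here.

```python
import math


def rank_sum_z(x, y):
    """Wilcoxon rank-sum z-statistic for samples x and y.

    Assumption: normal approximation, average ranks for ties,
    no tie correction of the variance.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (rank_sum_x - mean) / sd


def two_sided_p(z):
    """Two-sided p-value of a standard normal z-statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def select_genes(expr, labels, alpha=0.05):
    """Keep genes whose expression differs between the two categories.

    expr:   dict mapping gene name -> list of expression values, one per library
    labels: list of booleans, True for one category (e.g. cancerous)
    alpha:  hypothetical significance threshold
    """
    selected = []
    for gene, values in expr.items():
        x = [v for v, lab in zip(values, labels) if lab]
        y = [v for v, lab in zip(values, labels) if not lab]
        if two_sided_p(rank_sum_z(x, y)) < alpha:
            selected.append(gene)
    return selected
```

The genes returned by select_genes would then define the subspace in which a clustering algorithm is run. For small sample sizes an exact test would be more appropriate than the normal approximation used in this sketch.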