The computation of covariance and correlation matrices are critical to
many data mining applications and processes. Unfortunately the
classical covariance and correlation matrices are very sensitive to
outliers. Robust methods, such as QC and the Maronna method, have been
proposed. However, existing algorithms for QC only give acceptable
performance when the dimensionality of the matrix is in the hundreds;
and the Maronna method is rarely used in practice because of its high
computational cost.
In this paper, we develop parallel algorithms for both QC and the
Maronna method. We evaluate these parallel algorithms using a real data
set of the gene expression of over 6,000 genes, giving rise to a matrix
of over 18 million entries. In our experimental evaluation, we explore
scalability in dimensionality and in the number of processors, and the
trade-offs between accuracy and computational efficiency. We also
compare the parallel behaviours of the two methods. From a statistical
standpoint, the Maronna method is more robust than QC. From a
computational standpoint, while QC requires less computation,
interestingly the Maronna method is much more parallelizeable than QC.
After thorough experimentation, we conclude that for many data
mining applications, both QC and Maronna are viable options.
Less robust, but faster, QC is the recommended choice for small
parallel platforms. On the other hand, the Maronna method is the
recommended choice when a high degree of robustness is required, or
when the parallel platform features a high number of processors.