A Bayesian Approach to Cluster Validation
By Hoyt Koepke
In this talk, we propose a novel approach to validating clusterings. We treat a
given clustering as a baseline and define a collection of perturbations of it
that may assign points to clusters differently. If the perturbations are
indexed by a hyperparameter, integrating with respect to a prior on it gives an
averaged assignment matrix. This matrix can be visualized as a heat map,
allowing clusterings and their stability properties to be readily seen. The
difference between an averaged assignment matrix and the baseline gives a
measure of the stability of the baseline. This approach motivates a general and
computationally fast algorithm for evaluating the stability of distance-based
and exponential-model type clusterings, including k-means. The resulting
stability measure can also be used to choose the optimal number of clusters. Our method
compares favorably with data-based perturbation procedures, such as subsampling,
under some conditions, such as small sample sizes, and there is evidence that it
outperforms subsampling methods on some problems.
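
As a rough illustration of the averaged assignment matrix described above, the
following is a minimal sketch and not the algorithm presented in the talk. It
rests on several assumptions that do not come from the abstract: the
perturbation rescales each point-to-centroid distance by a per-cluster factor
lam, the prior on lam is log-normal, the integral over the prior is
approximated by Monte Carlo sampling, and stability is scored as the mean
total-variation distance between the averaged and baseline rows. All function
and parameter names are hypothetical.

```python
import numpy as np

def baseline_assignment(dist):
    """One-hot (n x k) matrix assigning each point to its nearest centroid."""
    n, k = dist.shape
    A = np.zeros((n, k))
    A[np.arange(n), dist.argmin(axis=1)] = 1.0
    return A

def averaged_assignment(dist, n_draws=2000, sigma=0.5, seed=0):
    """Monte Carlo average of perturbed assignments.

    Assumption: a perturbation rescales the distance to cluster j by lam[j],
    with lam drawn from a log-normal prior. This is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    n, k = dist.shape
    total = np.zeros((n, k))
    for _ in range(n_draws):
        lam = rng.lognormal(mean=0.0, sigma=sigma, size=k)
        total += baseline_assignment(dist * lam)  # broadcast over clusters
    return total / n_draws

def instability(A0, A_bar):
    """Mean total-variation distance between averaged and baseline rows.

    0 means no perturbation ever changed an assignment; larger is less stable.
    """
    return 0.5 * np.abs(A_bar - A0).sum(axis=1).mean()

# Toy distances from 3 points to 2 centroids (made-up numbers for illustration):
# points 0 and 1 sit close to one centroid, point 2 is nearly equidistant.
dist = np.array([[0.1, 5.0],
                 [5.0, 0.1],
                 [1.0, 1.1]])

A0 = baseline_assignment(dist)
A_bar = averaged_assignment(dist)
score = instability(A0, A_bar)
```

Each row of A_bar sums to one. Rows with entries near 0 or 1 correspond to
points whose assignment rarely changes under the perturbation, while rows with
intermediate values mark points near a cluster boundary, which is the pattern a
heat map of A_bar would make visible.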