Hierarchical Clustering and Tagging of Mostly Disconnected Data

TR-2012-01

Authors

Stephen Ingram, Tamara Munzner and Jonathan Stray

Publishing date

April 30, 2012

Length

10 pages

Abstract

We define the document set exploration task as the production of an application-specific categorization. Computers can help by producing visualizations of the semantic relationships in the corpus, but the approach of directly visualizing the vector space representation of the document set via multidimensional scaling (MDS) algorithms fails to reveal most of the structure because such datasets are mostly disconnected, that is, the preponderance of inter-item distances are large and roughly equal. Interpreting these large distances as disconnection between items yields a decomposition of the dataset into distinct components with small inter-item distances, that is, clusters. We propose the Disconnected Component Tree (DiscoTree) as a data structure that encapsulates the hierarchical relationship between components as the disconnection threshold changes, and present a sampling-based algorithm for efficiently computing the DiscoTree in O(N) time where N is the number of items. We present the MoDiscoTag application which combines the DiscoTree with an MDS view and a tagging system, and show that it succeeds in resolving structure that is not visible using previous dimensionality reduction methods. We validate our approach with real-world datasets of WikiLeaks Cables and War Logs from the journalism domain.

File(s)

View PDF version of TR-2012-01 (774640 bytes)