Document collections are often stored as sets of sparse, high-dimensional feature vectors. Performing dimensionality reduction (DR) on such high-dimensional datasets for the purposes of visualization presents algorithmic and qualitative challenges for existing DR techniques. We propose the Q-SNE algorithm for dimensionality reduction of document data, combining the scalable probability-based layout approach of BH-SNE with an improved component to calculate approximate nearest neighbors, using the query-based APQ approach that exploits an impact-ordered inverted file. We provide thorough experimental evidence that Q-SNE yields substantial quality improvements for layouts of large document collections with commensurate speed. Our experiments were conducted with six real-world benchmark datasets that range up to millions of documents and terms, and compare against three alternatives for nearest neighbor search and five alternatives for dimensionality reduction.
Dimensionality Reduction for Documents with Nearest Neighbor Queries
Neurocomputing (Special Issue for Workshop on
Visual Analytics using Multidimensional
Projections (VAMP) held at EuroVis 2013),
Volume 150 Part B, p 557-569, 2015.
Click on a Figure to open in a new tab.
Fig. 1 Log-Log plot of NN algorithm runtimes against dataset size, for computing the 30 approximate nearest neighbors. APQ (pink curve) is substantially faster than LSH, VPTree, or the hybrid PCA-VPTree for large datasets.
Figure 2: Precision-recall curves comparing the quality of the approximation for the 30 nearest neighbors across the four algorithms. APQ (pink curve) is
substantially better than the alternatives for all of the datasets, with the exception of the tiny warlogs dataset where VPTree matches its performance. Top left: warlogs
(N=3K). Top middle: metacombine (N=28K). Top right: textfiles (N=42K). Bottom left: cables (N=251K). Bottom middle: nytimes (N=300K). Bottom right: pubmed (N=2M).
Figure 3: Log-Log plot of the time to complete the full DR layout computation as a function of the number of documents in the dataset. The BH-SNE (brown) and Q-SNE
(pink) timings are identical for the datasets that both can handle. PCA (red) is fast, but poor quality.
Figure 4: Precision-recall curves comparing for DR layouts across the six algorithms. Q-SNE (pink) yields substantially better quality than the alternatives, except for
the tiny warlogs dataset where BH-SNE (brown) nearly matches its performance. Curves are absent for algorithms that could not scale to the larger datasets; for these,
BH-SNE yields intermediate quality, substantially better than the low-quality PCA performance (red). Top left: warlogs (N=3K). Top middle: metacombine (N=28K). Top right:
textfiles (N=42K). Bottom left: cables (N=251K). Bottom middle: nytimes (N=300K). Bottom right: pubmed (N=2M).
Figure 5: 2D layouts for the tiny warlogs (N=3K) dataset for all six DR algorithms plus a true baseline tested with cluster labeling from Ward's method (k=8). The
qualitative results match the quantitative analysis with PR curves in Figure 4: the layout quality for PCA, LSP, LAMP, and Glimmmer on the top row is poor, with no
visible cluster structure, as opposed to the higher-quality layouts for BH-SNE, Q-SNE, and True-SNE which correspond well to derived clusters.
Figure 6: Q-SNE layout of the pubmed dataset (N=2M). Left: Overview. Middle: Partial zoom. Right: Full zoom. The layout contains some large-scale cluster structure visible
in the overview, and many smaller clusters that are hard to distinguish at that scale. The cluster structure is more evident after zooming.
Figure 7: Layouts of the textfiles (N=42K) dataset, colored according to manually categorized ground truth. Left column: BH-SNE. Right column: Q-SNE. Visual document
clusters correspond well to the given ground truth for both.
Figure 8: 2D layouts for the cables (N=251K) dataset, with red dots marking the location of samples that had higher precision in Q-SNE than BH-SNE. Left: BH-SNE. Right:
Figure 9: Kernel density estimates comparing precision values for BH-SNE and Q-SNE, for the cables datasets with the k = 15 retrieved documents and r = 30 relevant
Figure 10: Analyzing nearest neighbors (blue) for selected point (red), for cables. Left column: BH-SNE. Right column: Q-SNE. Top: overview, with 99 NN. Middle row:
partial zoom, with closest 20 NN. Bottom row: full zoom.
APQ and QSNE source and documentation. → (GitHub repository)
TFIDF term-vectors. → (BZip2 archive 1MB)
TFIDF term-vectors. → (BZip2 archive 871KB)
TFIDF term-vectors. → (BZip2 archive 149MB)
TFIDF term-vectors. → (BZip2 archive 318MB)
TFIDF term-vectors. → (BZip2 archive 524MB)
TFIDF term-vectors. → (BZip2 archive 4GB)
Last modified: Jul 12, 2014