Visual Exploratory Analysis of Large Data Sets

Doctoral Dissertation

Heidi Lam

PDF | abstract



Large data sets are difficult to analyze. Visualization has been proposed to assist exploratory data analysis (EDA) as our visual systems can process signals in parallel to quickly detect patterns. Nonetheless, designing an effective visual analytic tool remains a challenge.

This challenge is partly due to our incomplete understanding of how common visualization techniques are used by human operators during analyses, either in laboratory settings or in the workplace.

This thesis aims to further understand how visualizations can be used to support EDA. More specifically, we studied techniques that display multiple levels of visual information resolutions (VIRs) for analyses using a range of methods.

The first study is a summary synthesis conducted to obtain a snapshot of knowledge in multiple-VIR use and to identify research questions for the thesis: (1) low-VIR use and creation; (2) spatial arrangements of VIRs. The next two studies are laboratory studies to investigate the visual memory cost of image transformations frequently used to create low-VIR displays and overview use with single-level data displayed in multiple-VIR interfaces.

For a more well-rounded evaluation, we needed to study these techniques in ecologically-valid settings. We therefore selected the application domain of web session log analysis and applied our knowledge from our first three evaluations to build a tool called Session Viewer. Taking the multiple coordinated view and overview + detail approaches, Session Viewer displays multiple levels of web session log data and multiple views of session populations to facilitate data analysis from the high-level statistical to the low-level detailed session analysis approaches.

Our fourth and last study for this thesis is a field evaluation conducted at Google Inc. with seven session analysts using Session Viewer to analyze their own data with their own tasks. Study observations suggested that displaying web session logs at multiple levels using the overview + detail technique helped bridge between high-level statistical and low-level detailed session analyses, and the simultaneous display of multiple session populations at all data levels using multiple views allowed quick comparisons between session populations. We also identified design and deployment considerations to meet the needs of diverse data sources and analysis styles.