For the many journalists who use data and computation to report the news, data wrangling is an integral part of their work. Despite an abundance of literature on data wrangling in the context of enterprise data analysis, little is known about the specific operations, processes, and pain points journalists encounter while performing this tedious, time-consuming task. To better understand the needs of this user group, we conduct a technical observation study of 50 public repositories of data and analysis code authored by 33 professional journalists at 26 news organizations. We develop two detailed and cross-cutting taxonomies of data wrangling in computational journalism, for actions and for processes. We observe the extensive use of multiple tables, a notable gap in previous wrangling analyses. We develop a concise, actionable framework for general multi-table data wrangling that includes wrangling operations documented in our taxonomy that lack clear parallels in other work. This framework, the first to incorporate tables as first-class objects, will support future interactive wrangling tools for both computational journalism and general-purpose use. We assess the generative and descriptive power of our framework through discussion of its relationship to our set of taxonomies.
Table Scraps: An Actionable Framework for Multi-Table Data Wrangling From An Artifact Study of Computational Journalism
Fig. 1. Three-phase process: observation study of technical artifacts conducted through qualitative coding of journalist repos, resulting in two initial bottom-up taxonomies of 165 open and axial codes; literature search to align naming and assess novelty; reflective synthesis to create a concise top-down multi-table wrangling framework with 21 operations
Fig. 2. A sketch of data flow through a notebook authored by journalists at the Los Angeles Times shows a wrangling process using more than two dozen tables before exporting two datasets for analysis and visualization.
Fig. 3. We cross-check the descriptive power of our multi-table framework for data wrangling by comparing against the high-level axial codes in our descriptive action taxonomy. We only include Actions codes that correspond with table operations, excluding codes in the Profile branch.
Fig. 4. Journalists at the Los Angeles Times employ multiple tables to wrangle water usage data into tidy format. With water usage amounts in a separate column, common reshape operations that operate within the context of one table fail on this table.
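As a rough illustration of the kind of multi-table reshape that Fig. 4 describes, the sketch below assumes a hypothetical layout in which each usage category sits in one column and its amount in a separate, paired column. The table contents, the column names (`use_1`, `gallons_1`, and so on), and the helper `select_pair` are all invented for illustration. They are not taken from the Los Angeles Times notebook or from the framework's operation names. The idea shown is only that the source table is first separated into several intermediate tables, which are then combined into one tidy table.

```python
# Hypothetical wide table (invented data): each usage category column
# is paired with a separate column holding its amount.
wide = [
    {"district": "A", "use_1": "residential", "gallons_1": 120,
     "use_2": "commercial", "gallons_2": 45},
    {"district": "B", "use_1": "residential", "gallons_1": 98,
     "use_2": "commercial", "gallons_2": 60},
]


def select_pair(table, index):
    """Build an intermediate table from one (category, amount) column pair,
    renaming the pair to the shared column names 'use' and 'gallons'."""
    return [
        {
            "district": row["district"],
            "use": row[f"use_{index}"],
            "gallons": row[f"gallons_{index}"],
        }
        for row in table
    ]


# Step 1: separate the source table into one intermediate table per pair.
parts = [select_pair(wide, i) for i in (1, 2)]

# Step 2: combine the intermediate tables by stacking their rows,
# giving one observation (district, use, gallons) per row.
tidy = [row for part in parts for row in part]

for row in tidy:
    print(row)
```

Treating the intermediate tables as explicit objects keeps each category aligned with its own amount, which is the alignment that a single reshape over one table would have to preserve in this assumed layout.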