Dirty Data in the Newsroom: Comparing Data Preparation in Journalism and Data Science

Stephen Kasica, Charles Berret, and Tamara Munzner

CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems,
honorable mention

Fig. 1. Process, products, and contributions: An overview of the four phases in this study.

Abstract

The work involved in gathering, wrangling, cleaning, and otherwise preparing data for analysis is often the most time-consuming and tedious aspect of data work. Although many studies describe data preparation within the context of data science workflows, there has been little research on data preparation in data journalism. We address this gap with a hybrid form of thematic analysis that combines deductive codes derived from existing accounts of data science workflows and inductive codes arising from an interview study with 36 professional data journalists. We extend a previous model of data science work to incorporate detailed activities of data preparation. We synthesize 60 dirty data issues from 16 taxonomies on dirty data and our interview data, and we provide a novel taxonomy to characterize these dirty data issues as discrepancies between mental models. We also identify four challenges faced by journalists: diachronic, regional, fragmented, and disparate data sources.

Talk

Slides (PPTX, PDF)
Pre-recorded video talk (CHI'23)

Materials

Figures

Figure 1: Process, products, and contributions: Our hybrid deductive-inductive thematic analysis [77] began by analyzing 16 studies of data science workflows to generate a priori codes pertaining to data preparation (Phase 1). We then conducted an interview study with 36 data journalists on their preparation processes, generating a posteriori codes from those transcripts (Phase 2). The resulting artifacts yielded combined code sets of preparation activities and data quality issues. Our categorization of these activities extended a previous model of data preparation activities. We then analyzed 16 taxonomies of dirty data issues (Phase 3), noting disparate coverage compared to our interview data. We produced a new model-discrepancy taxonomy for classifying dirty data issues to encompass them all. Finally, we reflected upon emergent patterns of data issues and preparation activities within the nightmare stories section of our interviews to identify four challenges for data integration (Phase 4).

Figure 2: Data preparation activities: From our thematic analysis, we identify 23 activities that data scientists and data journalists perform when preparing data; blue and green backgrounds highlight divergences.

Figure 3: (a) Sixty data issues and which source of data they occur in (data science workflows, data journalism interviews, or dirty data taxonomies), the source phase they were identified in (1-3), and the object and quality the issue corresponds to within our model-discrepancy framework. See Supp. Section 4 for a detailed explanation of each data issue. (b) The distribution of issues above in total and in each group of qualitative source data according to our new taxonomy for classifying dirty data.

All hi-res figures (ZIP)

Press

Research in VR, health, haptics and data nets UBC Computer Science multiple accepted papers and projects at CHI'23