Data Cascades in High-Stakes AI

Citation

Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. CHI ’21.

Core Contribution

Defines and provides empirical evidence for Data Cascades: compounding events causing negative downstream effects from data issues, resulting in technical debt over time. Based on interviews with 53 AI practitioners across India, East/West Africa, and the USA working in high-stakes domains.

Key Findings

Prevalence: 92% of AI practitioners experienced at least one data cascade; 45.3% experienced two or more per project.

Characteristics:

Cascades originate early in the ML lifecycle (data definition and collection)
They are opaque and delayed, with no clear indicators or metrics to detect them
They compound into major negative impacts: costly iterations, discarded projects, harm to communities
They are largely avoidable through intentional early practices

Four Cascade Triggers:

Physical world brittleness (noise-free training data vs. noisy production)
Inadequate application-domain expertise
Conflicting reward systems undervaluing data work
Poor cross-organizational documentation

Key Quote

“Paradoxically, data is the most under-valued and de-glamorised aspect of AI.”
