Data Cascades in High-Stakes AI
Citation
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. CHI ’21.
Core Contribution
Defines and provides empirical evidence for Data Cascades: compounding events causing negative downstream effects from data issues, resulting in technical debt over time. Based on interviews with 53 AI practitioners across India, East/West Africa, and the USA working in high-stakes domains.
Key Findings
Prevalence: 92% of the interviewed practitioners reported experiencing at least one data cascade; 45.3% reported two or more in a given project.
Characteristics:
- Cascades originate early in the ML lifecycle (data definition and collection)
- They are opaque and delayed, with no clear indicators or metrics to detect them
- They compound into major negative impacts: costly iterations, discarded projects, harm to communities
- They are largely avoidable through intentional early practices
Four Cascade Triggers:
- Physical world brittleness (noise-free training data vs. noisy production)
- Inadequate application-domain expertise
- Conflicting reward systems undervaluing data work
- Poor cross-organizational documentation
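The first trigger in the list above (clean training data meeting noisy production data) can be illustrated with a simple distribution comparison. The following is a minimal sketch, not a method from the paper: it computes a two-sample Kolmogorov-Smirnov statistic between one training feature and the same feature observed in production. The function names, the 0.2 threshold, and the simulated sensor readings are all hypothetical and chosen only for illustration.

```python
import random
from bisect import bisect_right


def ks_statistic(train, prod):
    """Largest gap between the empirical CDFs of two samples (0.0 to 1.0)."""
    a, b = sorted(train), sorted(prod)
    # Both empirical CDFs are step functions, so the largest gap
    # is reached at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def looks_drifted(train, prod, threshold=0.2):
    """Flag a feature whose production values diverge from training.

    The threshold is an assumed, illustrative cut-off, not a recommended value.
    """
    return ks_statistic(train, prod) > threshold


# Simulated data (assumption): clean lab readings vs. noisier field readings.
rng = random.Random(0)
train_readings = [rng.gauss(20.0, 1.0) for _ in range(500)]
prod_readings = [rng.gauss(20.0, 4.0) for _ in range(500)]

print("KS statistic:", round(ks_statistic(train_readings, prod_readings), 3))
print("Drift flagged:", looks_drifted(train_readings, prod_readings))
```

A check like this only surfaces one narrow symptom. The paper's point is that cascades are opaque and delayed, so a single metric on a single feature should be read as an early warning at best, not as a detector for cascades in general.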
Key Quote
“Paradoxically, data is the most under-valued and de-glamorised aspect of AI.”
Related: 00-source—sculley-2015-ml-technical-debt, 04-molecule—data-cascades, 04-molecule—ooda-data-governance