Data Cascades in High-Stakes AI

Citation

Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. CHI ’21.

Core Contribution

Defines data cascades and provides empirical evidence for them: compounding events that cause negative downstream effects from data issues and accumulate as technical debt over time. Based on interviews with 53 AI practitioners in India, East and West African countries, and the USA, all working in high-stakes domains.

Key Findings

Prevalence: 92% of the interviewed practitioners reported experiencing at least one data cascade; 45.3% reported two or more in a single project.
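
For a sense of scale, the percentages can be converted back to approximate head counts. This is a rough sketch, not a figure taken from the paper: it assumes both percentages are computed over all 53 interviewees, and `implied_count` is a hypothetical helper name.

```python
# Assumption: both prevalence percentages are taken over all 53 interviewees.
PARTICIPANTS = 53


def implied_count(percent: float, total: int = PARTICIPANTS) -> int:
    """Return the whole number of people closest to `percent` of `total`."""
    return round(percent / 100 * total)


at_least_one = implied_count(92.0)  # practitioners reporting one or more cascades
two_or_more = implied_count(45.3)   # practitioners reporting two or more in a project

print(f"At least one cascade: about {at_least_one} of {PARTICIPANTS}")
print(f"Two or more cascades: about {two_or_more} of {PARTICIPANTS}")
```

Under that assumption, roughly 49 interviewees reported at least one cascade and roughly 24 reported two or more.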

Characteristics:

  • Cascades originate early in the ML lifecycle (data definition and collection)
  • They are opaque and delayed, with no clear indicators or metrics to detect them
  • They compound into major negative impacts: costly iterations, discarded projects, harm to communities
  • They are largely avoidable through intentional early practices

Four Cascade Triggers:

  1. Physical world brittleness (models trained on noise-free data meet noisy production conditions)
  2. Inadequate application-domain expertise
  3. Conflicting reward systems undervaluing data work
  4. Poor cross-organizational documentation

Key Quote

“Paradoxically, data is the most under-valued and de-glamorised aspect of AI.”

Related: 00-source—sculley-2015-ml-technical-debt, 04-molecule—data-cascades, 04-molecule—ooda-data-governance