GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Citation

Patwardhan, T., Dias, R., Proehl, E., Kim, G., Wang, M., Watkins, O., et al. (2025). GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. arXiv:2510.04374.

Core Contribution

A benchmark evaluating AI model capabilities on real-world, economically valuable tasks, covering 44 occupations across the 9 sectors that contribute most to U.S. GDP. Tasks are constructed from actual work products of professionals with an average of 14 years of experience.

Key Framing

The paper argues that capability evaluations are leading indicators of AI’s economic impact, while adoption rates and GDP growth are lagging indicators. Historical evidence from electricity, computers, and airplanes shows transitions from invention to economy-wide permeation take years or decades.

Headline Findings

  • Claude Opus 4.1: 47.6% win rate vs. human experts (best performer)
  • GPT-5 high: 38.8% win rate
  • Performance improves roughly linearly over time
  • Average task requires 7 hours of expert work
  • Most common failure modes: instruction-following (14-40%), followed by formatting (5-10%) and accuracy (5-7%)
  • Under-contextualized prompts significantly degrade performance
  • Automated grader achieves 66% agreement with human graders (vs. 71% human inter-rater agreement)
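
The agreement figures above can be illustrated with a minimal sketch. The labels and data below are hypothetical, not the paper's grading records; this just shows simple percent agreement between two graders' pairwise verdicts.

```python
# Minimal sketch: percent agreement between two graders' verdicts.
# Verdict labels ("model", "human", "tie") are illustrative assumptions,
# not GDPval's actual grading schema.

def percent_agreement(grades_a, grades_b):
    """Fraction of tasks on which two graders gave the same verdict."""
    assert len(grades_a) == len(grades_b) and grades_a
    matches = sum(a == b for a, b in zip(grades_a, grades_b))
    return matches / len(grades_a)

auto_grader = ["model", "human", "tie", "model", "human"]
human_grader = ["model", "human", "human", "model", "tie"]
print(percent_agreement(auto_grader, human_grader))  # 0.6 on this toy data
```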

Methodology Notes

  • 1,320 tasks in full set; 220 in open-sourced gold subset
  • Tasks tied to O*NET Work Activities and BLS wage data
  • Expert grading via blinded pairwise comparisons
  • Each task received an average of 5 human reviews (minimum 3)
  • Experts required 4+ years of experience, a video interview, and a background check
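
To make the pairwise-comparison methodology concrete, here is a sketch of turning blinded verdicts into a win rate. Counting ties as half a win is a common convention, assumed here for illustration; it is not necessarily GDPval's exact scoring rule.

```python
from collections import Counter

def win_rate(verdicts, ties_as_half=True):
    """Model win rate from blinded pairwise verdicts.

    verdicts: list of "model", "human", or "tie" per comparison.
    Tie handling is an assumed convention, not GDPval's stated rule.
    """
    counts = Counter(verdicts)
    score = counts["model"] + (0.5 * counts["tie"] if ties_as_half else 0)
    return score / len(verdicts)

print(win_rate(["model", "tie", "human", "model"]))  # 0.625
```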

Limitations Acknowledged

  • Focus on self-contained knowledge work (no physical tasks, no reliance on tacit knowledge, no PII access)
  • Tasks are precisely specified and one-shot, not interactive
  • Cost of expert grading limits scale

Extracted To

Related: [None yet]