GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Citation

Patwardhan, T., Dias, R., Proehl, E., Kim, G., Wang, M., Watkins, O., et al. (2025). GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. arXiv:2510.04374.

Core Contribution

A benchmark evaluating AI model capabilities on real-world, economically valuable tasks, covering 44 occupations across the 9 sectors that contribute most to U.S. GDP. Tasks are constructed from actual work products of professionals with an average of 14 years of experience.

Key Framing

The paper argues that capability evaluations are leading indicators of AI’s economic impact, while adoption rates and GDP growth are lagging indicators. Historical evidence from electricity, computers, and airplanes shows transitions from invention to economy-wide permeation take years or decades.

Headline Findings

  • Claude Opus 4.1: 47.6% win rate vs. human experts (best performer)
  • GPT-5 high: 38.8% win rate
  • Performance improves roughly linearly over time
  • Average task requires 7 hours of expert work
  • Most common failure modes: instruction-following (14-40%), followed by formatting (5-10%) and accuracy (5-7%)
  • Under-contextualized prompts significantly degrade performance
  • Automated grader achieves 66% agreement with human graders (vs. 71% human inter-rater agreement)
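
The agreement figures above can be illustrated with a minimal sketch. The labels and data below are hypothetical, not the paper's grading records; this just shows simple percent agreement between two graders' pairwise verdicts.

```python
# Minimal sketch: percent agreement between two graders' verdicts.
# Verdict labels ("model", "human", "tie") are illustrative assumptions,
# not GDPval's actual grading schema.

def percent_agreement(grades_a, grades_b):
    """Fraction of tasks on which two graders gave the same verdict."""
    assert len(grades_a) == len(grades_b) and grades_a
    matches = sum(a == b for a, b in zip(grades_a, grades_b))
    return matches / len(grades_a)

auto_grader = ["model", "human", "tie", "model", "human"]
human_grader = ["model", "human", "human", "model", "tie"]
print(percent_agreement(auto_grader, human_grader))  # 0.6 on this toy data
```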

Methodology Notes

  • 1,320 tasks in full set; 220 in open-sourced gold subset
  • Tasks tied to O*NET Work Activities and BLS wage data
  • Expert grading via blinded pairwise comparisons
  • Each task received an average of 5 human reviews (minimum 3)
  • Experts required 4+ years of experience, a video interview, and a background check
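
To make the pairwise-comparison methodology concrete, here is a sketch of turning blinded verdicts into a win rate. Counting ties as half a win is a common convention, assumed here for illustration; it is not necessarily GDPval's exact scoring rule.

```python
from collections import Counter

def win_rate(verdicts, ties_as_half=True):
    """Model win rate from blinded pairwise verdicts.

    verdicts: list of "model", "human", or "tie" per comparison.
    Tie handling is an assumed convention, not GDPval's stated rule.
    """
    counts = Counter(verdicts)
    score = counts["model"] + (0.5 * counts["tie"] if ties_as_half else 0)
    return score / len(verdicts)

print(win_rate(["model", "tie", "human", "model"]))  # 0.625
```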

Limitations Acknowledged

  • Focus on self-contained knowledge work (no physical tasks, no reliance on tacit knowledge, no PII access)
  • Tasks are precisely specified and one-shot, not interactive
  • Cost of expert grading limits scale

Extracted To

Related: [None yet]