Tracking AI Progress - AGI When?

Progress Legend:

| Emoji | Meaning |
| --- | --- |
| ✅ | Completed |
| 🚧 | In Progress |
| ⏳ | Awaiting Progress |
| ❌ | Deadline Missed |

Leopold Aschenbrenner - Situational Awareness Timeline

Leopold Aschenbrenner Base Scale Up

Yearly Predictions

2025/2026:

2027/2028:

Test Time Compute

| Number of tokens | Equivalent to me working on something for… | OOMs | Progress |
| --- | --- | --- | --- |
| 100s | A few minutes | ChatGPT (we are here) | ✅ |
| 1,000s | Half an hour | +1 OOMs test-time compute | ✅ (OpenAI's o1-preview thinks for several minutes) |
| 10,000s | Half a workday | +2 OOMs | ⏳ |
| 100,000s | A workweek | +3 OOMs | ⏳ |
| Millions | Multiple months | +4 OOMs | ⏳ |
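
As a rough sketch of the arithmetic behind this table: each row multiplies the token budget by 10, which is one order of magnitude (OOM) over the "100s of tokens" baseline. The function name and the choice of exactly 100 tokens as the baseline are illustrative assumptions, not figures from the source.

```python
import math

# Assumption: the table's "100s" of tokens is taken as exactly 100.
BASELINE_TOKENS = 100


def ooms_over_baseline(tokens: int, baseline: int = BASELINE_TOKENS) -> float:
    """Return how many orders of magnitude `tokens` is above `baseline`."""
    if tokens <= 0 or baseline <= 0:
        raise ValueError("token counts must be positive")
    return math.log10(tokens / baseline)


# One representative token count per table row (assumed round numbers).
for tokens in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> +{ooms_over_baseline(tokens):.0f} OOMs")
```

The loop prints one line per row of the table, so a new row can be checked by adding its token count to the tuple.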

Training Compute

This section tracks the growth of training clusters, measured in orders of magnitude (OOMs) over a GPT-4-scale cluster, to evaluate progress in AI capability.

| Year | OOMs | H100s-equivalent | Cost | Power | Power reference class | Progress |
| --- | --- | --- | --- | --- | --- | --- |
| 2022 | ~GPT-4 cluster | ~10k | ~$500M | ~10 MW | ~10,000 average homes | ✅ |
| ~2024 | +1 OOM | ~100k | $billions | ~100 MW | ~100,000 homes | ✅ (xAI's Memphis datacenter, Colossus, in 2024) |
| ~2026 | +2 OOMs | ~1M | $10s of billions | ~1 GW | The Hoover Dam, or a large nuclear reactor | 🚧 (OpenAI Abilene datacenter, ETA mid-2026) |
| ~2028 | +3 OOMs | ~10M | $100s of billions | ~10 GW | A small/medium US state | 🚧 (OpenAI + Microsoft, ETA 2028) |
| ~2030 | +4 OOMs | ~100M | $1T+ | ~100 GW | >20% of US electricity production | ⏳ |

Source: Situational Awareness
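
The Training Compute rows follow a simple pattern: every two years adds one OOM, multiplying both the H100-equivalent count and the power draw by 10 from the 2022 baseline. The sketch below reproduces that pattern; the `Cluster` type, the `project` function, and the constant names are hypothetical and only illustrate the arithmetic. Cost is left out because the table gives it only as rough ranges.

```python
from dataclasses import dataclass

# Baseline values read from the 2022 row of the table (all approximate).
BASE_YEAR = 2022
BASE_H100S = 10_000   # ~10k H100-equivalents
BASE_POWER_MW = 10    # ~10 MW
YEARS_PER_OOM = 2     # the table adds +1 OOM every two years


@dataclass(frozen=True)
class Cluster:
    year: int
    ooms: int
    h100_equivalents: int
    power_mw: int


def project(ooms: int) -> Cluster:
    """Scale the 2022 baseline by 10**ooms, following the table's pattern."""
    if ooms < 0:
        raise ValueError("ooms must be non-negative")
    scale = 10 ** ooms
    return Cluster(
        year=BASE_YEAR + YEARS_PER_OOM * ooms,
        ooms=ooms,
        h100_equivalents=BASE_H100S * scale,
        power_mw=BASE_POWER_MW * scale,
    )


for n in range(5):
    c = project(n)
    print(
        f"~{c.year}: +{c.ooms} OOMs -> ~{c.h100_equivalents:,} "
        f"H100-equivalents, ~{c.power_mw / 1000:g} GW"
    )
```

This is only a restatement of the table's trend, not a forecast model; actual cluster sizes and dates depend on the build-outs listed in the Progress column.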

OpenAI Levels

OpenAI has a five-level system for benchmarking progress toward AGI.

| Level | Description | Progress |
| --- | --- | --- |
| Chatbots | AI with conversational language | ✅ |
| Reasoners | Human-level problem-solving | ✅ (OpenAI's o1) |
| Agents | Systems that can take actions | 🚧 (OpenAI targeting January 2025) |
| Innovators | AI that can aid in invention | ⏳ |
| Organizations | AI that can do the work of an organization | ⏳ |

Source: Axios

Benchmark Saturation

ARC-AGI (ARC Prize): 87.5% (on 12/20/2024 by OpenAI's o3)

MATH: 94.8% (on 9/12/2024 by OpenAI's o1)

GPQA Diamond: 87.7% (on 12/20/2024 by OpenAI's o3)

MMLU: 92.3% (on 9/12/2024 by OpenAI's o1)

AIME: 96.7% (on 12/20/2024 by OpenAI's o3)

Epoch AI FrontierMath: 25.2% (on 12/20/2024 by OpenAI's o3)

SWE-bench Verified: 71.7% (on 12/20/2024 by OpenAI's o3)

Note: We assume labs are not fabricating scores.