AI's Reality Check: Top Models Flunk Expert Exam
AI Models Struggle With True Expertise
The latest generation of artificial intelligence has hit an unexpected roadblock. When faced with "Humanity's Last Exam" (HLE) - a grueling test developed by nearly 1,000 specialists from 50 countries - even the most advanced models performed shockingly poorly. GPT-4o managed just 2.7 points out of 100, while the best-performing model barely reached 8% accuracy.
Why Traditional Tests Fail
For years, AI developers have relied on standardized benchmarks to measure progress. But these tests suffer from two critical flaws:
- Benchmark saturation: Leading models now score near the ceiling on familiar tests, often by memorizing common question patterns rather than demonstrating genuine understanding
- Answer contamination: Many test solutions exist verbatim online, allowing models to retrieve answers rather than reason them out (a simple overlap check is sketched below)
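To make the contamination problem concrete, here is a minimal sketch of how one might flag a benchmark question whose wording already appears in a training corpus, using word n-gram overlap. This is an illustration, not the method any particular lab uses; the corpus, question text, and the 20% threshold are all hypothetical.

```python
# Minimal contamination-check sketch (hypothetical data and threshold):
# flag test questions whose word n-grams also appear verbatim in the
# training corpus, which would let a model retrieve rather than reason.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(question: str, training_docs: list, n: int = 8) -> float:
    """Fraction of the question's n-grams that occur verbatim in any training document."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return len(q_grams & corpus_grams) / len(q_grams)

# Hypothetical usage: the question's phrasing leaks into the corpus,
# so the overlap exceeds the (arbitrary) 20% threshold and is flagged.
corpus = ["course notes: the integral of x squared from zero to three equals nine"]
question = "What is the integral of x squared from zero to three ?"
if contamination_rate(question, corpus) > 0.2:
    print("Likely contaminated: answer may be retrievable, not reasoned.")
```

Long n-grams (eight words here) make accidental matches unlikely, so a high overlap rate is a reasonably strong signal that a question has leaked into training data.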
The HLE was specifically designed to avoid these pitfalls through original questions requiring deep domain knowledge across mathematics, physics, chemistry and other disciplines.
The Making of a Tough Test
HLE creators went to extraordinary lengths to ensure their exam couldn't be gamed:
- Each question underwent rigorous peer review
- Problems demanded multi-step reasoning rather than factual recall
- Solutions required synthesizing concepts across domains
The chemistry section, for example, included reaction mechanisms too complex for simple pattern matching, and the math problems demanded creative lines of reasoning that memorized training data could not supply.
Performance Breakdown
The results paint a sobering picture:
| Model | Score (%) |
|---|---|
| GPT-4o | 2.7 |
| Best-performing model | ~8 |
These numbers starkly contrast with the high scores these same models achieve on conventional benchmarks.
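For context on how such percentages are typically produced, here is a minimal exact-match scoring sketch. The prediction and answer lists are hypothetical, and the published HLE evaluation uses its own grading protocol, so treat this only as an illustration of the arithmetic behind a reported accuracy.

```python
# Minimal benchmark-scoring sketch (hypothetical data): exact-match
# accuracy, reported as a percentage of correctly answered questions.

def exact_match_accuracy(predictions: list, answers: list) -> float:
    """Percentage of predictions matching the reference answer after light normalization."""
    assert len(predictions) == len(answers)
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)

# Hypothetical usage: one correct answer out of three.
preds = ["42", "benzene", "no idea"]
refs = ["42", "toluene", "7"]
print(f"Score: {exact_match_accuracy(preds, refs):.1f}%")  # -> Score: 33.3%
```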
What This Means for AI Development
The HLE results suggest we may need to rethink how we evaluate artificial intelligence:
- Current benchmarks likely overstate true capabilities
- There's still a vast gap between human and machine reasoning
- Future progress may require fundamentally new approaches
The test doesn't mean AI isn't useful - just that its strengths lie in different areas than we sometimes assume.
Key Points:
- Expert-designed exam reveals limitations unseen in standard tests
- Top models scored under 10% on questions requiring deep reasoning
- Benchmark reform needed to measure true understanding
- Distinction between specialized and general intelligence becomes clearer

