AI Struggles with Physics Puzzles: Top Models Score Below 10%
When AI Meets Advanced Physics: The Reality Check
Imagine handing a stack of cutting-edge physics problems to a bright PhD candidate - then discovering they can't solve even one in ten correctly. That's essentially what happened when researchers tested today's most advanced AI systems against real scientific challenges.
The Tough Test Behind the Numbers
The "CritPt" benchmark wasn't playing nice. Developed by over 50 physicists worldwide, it presented 71 unpublished research problems across quantum physics, astrophysics, and other demanding fields. These weren't textbook exercises but fresh challenges designed to mimic what early-career researchers actually face.
"We wanted to eliminate any advantage from memorization or pattern recognition," explains the team behind CritPt. "Every question tests genuine understanding and problem-solving."
Surprising Shortfalls
When the scores came in, even optimists raised eyebrows:
- Google's Gemini 3 Pro: 9.1% accuracy
- OpenAI's GPT-5: just 4.9%
The numbers got worse under stricter evaluation. When models had to answer correctly in at least four of five attempts (the "continuous resolution rate"), performance plummeted further.
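The article doesn't spell out exactly how the benchmark computes this metric, but the four-of-five rule it describes can be sketched in a few lines. The function name and toy data below are illustrative, not CritPt's actual implementation.

```python
# Sketch of a "continuous resolution rate" style metric: a problem only
# counts as solved when the model answers it correctly in at least k of
# n independent attempts. The 4-of-5 threshold comes from the article;
# everything else (names, sample data) is hypothetical.

def continuous_resolution_rate(attempts, k=4, n=5):
    """attempts: one list of booleans per problem (n attempts each)."""
    solved = sum(1 for runs in attempts if sum(runs[:n]) >= k)
    return solved / len(attempts)

# Toy example: 4 problems, 5 attempts each.
results = [
    [True, True, True, True, False],   # 4/5 correct -> counts as solved
    [True, False, True, False, True],  # 3/5 -> not solved
    [True, True, True, True, True],    # 5/5 -> solved
    [False] * 5,                       # 0/5 -> not solved
]
print(continuous_resolution_rate(results))  # 0.5
```

The key property is that occasional lucky hits no longer count, which is why scores drop sharply under this rule compared with single-attempt accuracy.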
"These systems can produce answers that look convincing at first glance," notes one physicist involved in the testing. "But look closer, and you'll find subtle errors that could derail real research if left unchecked."
Why This Matters Beyond Labs
The implications stretch far beyond theoretical physics:
- Research workflows: Scientists using AI tools must budget extra time for verification
- Public perception: Temper expectations about AI replacing human experts anytime soon
- Development priorities: Highlights where future AI training should focus
A More Realistic Role Emerges
Rather than replacing researchers, leading labs now see AI as sophisticated assistants:
- OpenAI plans a "research intern" system by 2026
- Fully autonomous research isn't expected before 2028
- Current models already help save time on routine tasks
"Think of them like brilliant but error-prone grad students," suggests one team member. "Their ideas can spark breakthroughs, but you'd never let them run unsupervised."
Key Points:
- 🔬 Top AI models scored under 10% on unpublished physics challenges
- 🤯 Performance dropped further when requiring consistent accuracy
- 🛠️ Future role likely as assistive tools rather than independent researchers
- ⏳ Full autonomy in complex sciences remains years away