GPT-5 and Top AI Models Score Zero in New FormulaOne Benchmark

August 15, 2025: A new AI evaluation benchmark called FormulaOne has exposed significant limitations in today's most advanced artificial intelligence systems. Developed by AAI, a research institution focused on superintelligence, the test found that models including GPT-5, Grok 4, and o3-pro failed to solve any of its most challenging problems.

The FormulaOne Challenge

The benchmark consists of 220 novel graph-structured dynamic programming problems, spanning moderate to research-level difficulty. These problems incorporate complex domains such as:

  • Topology
  • Geometry
  • Combinatorics
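To give a sense of what a graph-structured dynamic programming problem looks like, here is a minimal sketch of a textbook example: the maximum-weight independent set on a tree. It is not taken from the benchmark, and the function name and input layout are illustrative assumptions. FormulaOne's problems are described as considerably harder than this.

```python
def max_weight_independent_set(tree, weights, root):
    """Largest total weight of a vertex set with no two adjacent vertices.

    tree:    dict mapping each vertex to a list of its neighbors (an undirected tree).
    weights: dict mapping each vertex to a non-negative number.
    root:    any vertex of the tree, used to orient it.
    """
    # Iterative depth-first search: record each vertex's parent and a visiting
    # order in which every vertex appears after its parent.
    parent = {root: None}
    order = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for u in tree[v]:
            if u != parent[v]:
                parent[u] = v
                stack.append(u)

    # take[v]: best value inside v's subtree when v is selected.
    # skip[v]: best value inside v's subtree when v is not selected.
    take = {v: weights[v] for v in tree}
    skip = {v: 0 for v in tree}

    # Walking the order backwards handles every child before its parent,
    # so a vertex's two values are final by the time they are pushed upward.
    for v in reversed(order):
        p = parent[v]
        if p is not None:
            take[p] += skip[v]                 # parent selected: child must be skipped
            skip[p] += max(take[v], skip[v])   # parent skipped: child is free to choose
    return max(take[root], skip[root])


# Hypothetical example: a star with a heavy center and three lighter leaves.
star = {"c": ["x", "y", "z"], "x": ["c"], "y": ["c"], "z": ["c"]}
star_weights = {"c": 10, "x": 3, "y": 4, "z": 5}
best = max_weight_independent_set(star, star_weights, "c")
```

Each vertex keeps only two numbers, which is enough because a tree separates a subtree from the rest of the graph at a single vertex. On less tree-like graphs, the state that must be tracked per step grows, which is where the difficulty of such problems comes from.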

The problems are based on Courcelle's algorithmic meta-theorem, which states that any graph property definable in monadic second-order logic can be decided in linear time on graphs of bounded treewidth (that is, tree-like graphs), typically by dynamic programming. Such algorithms work over a tree decomposition, which organizes the graph's vertices into overlapping sets arranged in a tree.
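The notion of a tree decomposition can be made concrete with a short sketch. The code below checks the standard textbook conditions: the sets (often called bags) form a tree, every vertex and every edge is covered by some bag, and the bags containing any given vertex are connected. The function names and input layout are assumptions chosen for illustration and are not part of the benchmark.

```python
from collections import defaultdict, deque


def _reachable(start, allowed, adj):
    """Return the nodes reachable from start while staying inside `allowed`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt in allowed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check whether (bags, tree_edges) is a tree decomposition of the graph.

    bags:       dict mapping a bag id to a set of graph vertices.
    tree_edges: list of (bag_id, bag_id) pairs connecting the bags.
    """
    if not bags:
        return False
    adj = defaultdict(set)
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)

    # The bags must form a tree: n - 1 edges and all bags connected.
    bag_ids = set(bags)
    if len(tree_edges) != len(bags) - 1:
        return False
    if _reachable(next(iter(bags)), bag_ids, adj) != bag_ids:
        return False

    # Condition 1: the bags together contain exactly the graph's vertices.
    if set(vertices) != set().union(*bags.values()):
        return False

    # Condition 2: both endpoints of every edge share at least one bag.
    for u, v in edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False

    # Condition 3: the bags holding any given vertex form a connected subtree.
    for v in vertices:
        holding = {b for b, bag in bags.items() if v in bag}
        if _reachable(next(iter(holding)), holding, adj) != holding:
            return False
    return True


def width(bags):
    """Width of a decomposition: size of the largest bag, minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# Hypothetical example: the 4-cycle a-b-c-d-a, split into two overlapping bags.
cycle_vertices = ["a", "b", "c", "d"]
cycle_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
cycle_bags = {0: {"a", "b", "c"}, 1: {"a", "c", "d"}}
valid = is_tree_decomposition(cycle_vertices, cycle_edges, cycle_bags, [(0, 1)])
```

The treewidth of a graph is the smallest width over all of its valid decompositions. A dynamic program in the spirit of Courcelle's theorem processes the bags from the leaves toward the root, keeping a table of partial solutions per bag, so small bags keep those tables manageable.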

Performance Breakdown

While current AI models demonstrated moderate success on simpler problems (50%-70% accuracy), their performance plummeted with increased complexity:

[Table: Model | Shallow-Level Success | Deep-Level Success | Doctoral-Level Success]

Academic Reactions

The results have sparked debate about whether AI can truly achieve doctoral-level reasoning. Some researchers propose including human PhD students in future evaluations for comparison.

"This benchmark highlights a critical gap in AI's ability to handle deeply abstract problems," noted an AAI spokesperson. "While models excel at pattern recognition, structured logical deduction remains a challenge."

The full leaderboard is available at: FormulaOne-Leaderboard

Key Points:

  • All top AI models scored zero on FormulaOne's most difficult problems.
  • The benchmark tests 220 high-difficulty dynamic programming questions.
  • Performance declines sharply with problem complexity, revealing AI's reasoning limitations.