Grok4 Outperforms GPT-5 in Reasoning, But at Higher Cost

AI Model Showdown: Performance vs. Cost in Latest Benchmarks

New testing data from the ARC Prize reveals stark differences in performance and operational cost between leading language models. The evaluation compared xAI's Grok4 against OpenAI's GPT-5 across multiple benchmarks measuring general reasoning capabilities.

Benchmark Breakdown: Reasoning Capabilities Tested

In the demanding ARC-AGI-2 assessment, which evaluates complex reasoning:

  • Grok4 (Thinking) achieved 16% accuracy at $2-$4 per task
  • GPT-5 (Advanced) scored 9.9% at just $0.73 per task

Performance and cost comparison of leading language models on the ARC-AGI benchmark. | Image: ARC-AGI

The less intensive ARC-AGI-1 test showed:

  • Grok4 reached 68% accuracy ($1 per task)
  • GPT-5 achieved 65.7% ($0.51 per task)

"While Grok4 demonstrates superior reasoning capabilities, its cost structure makes GPT-5 more economically viable for many applications," noted an ARC Prize spokesperson.
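One way to read the trade-off the spokesperson describes is accuracy points per dollar. The sketch below uses only the figures reported above; Grok4's ARC-AGI-2 cost is quoted as a $2-$4 range, so the $3 midpoint used here is an assumption, not a reported number.

```python
# Cost-effectiveness comparison from the reported benchmark figures.
# Grok4's ARC-AGI-2 cost is the midpoint of the quoted $2-$4 range (assumption).
results = {
    "ARC-AGI-2": {
        "Grok4 (Thinking)": {"accuracy": 16.0, "cost_per_task": 3.00},
        "GPT-5 (Advanced)": {"accuracy": 9.9, "cost_per_task": 0.73},
    },
    "ARC-AGI-1": {
        "Grok4": {"accuracy": 68.0, "cost_per_task": 1.00},
        "GPT-5": {"accuracy": 65.7, "cost_per_task": 0.51},
    },
}

for benchmark, models in results.items():
    print(benchmark)
    for name, r in models.items():
        # Higher is better: how much accuracy each dollar of compute buys.
        points_per_dollar = r["accuracy"] / r["cost_per_task"]
        print(f"  {name}: {points_per_dollar:.1f} accuracy points per dollar")
```

On these numbers GPT-5 delivers roughly twice the accuracy per dollar of Grok4 on both benchmarks, which is the economic argument the quote is making even though Grok4's absolute scores are higher.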

Lightweight Contenders Emerge

The study also evaluated smaller model variants, comparing each model's AGI-1 and AGI-2 scores and per-task costs; the full figures appear in the image below.

Test results for Grok4, GPT-5, and smaller model variants on the ARC-AGI-1. | Image: ARC Prize

Surprise Performer and Future Tests

The discontinued o3-preview model from December 2024 surprisingly outperformed all current models on AGI-1 with nearly 80% accuracy, though at premium pricing. Meanwhile, development continues on ARC-AGI-3, which will test AI agents in interactive game environments, a challenge where most models still struggle compared to humans.

Key Points:

  1. Performance lead: Grok4 outperforms GPT-5 in reasoning tasks by significant margins (16% vs 9.9% on AGI-2)
  2. Cost efficiency: GPT-5 maintains better value proposition across all tests ($0.51 vs $1 on AGI-1)
  3. Lightweight options: Smaller GPT variants show promise for cost-sensitive applications
  4. Future benchmarks: New interactive testing environments may reshape performance rankings

