
Google's FACTS Benchmark Reveals AI Models Struggle with Accuracy


In a move that could reshape how we measure AI capabilities, Google's FACTS team has partnered with data science platform Kaggle to launch a comprehensive benchmark suite. This new tool aims to address a critical gap in AI evaluation: standardized testing for factual accuracy.

[Image note: AI-generated, provided by the AI image generation service Midjourney]

What FACTS Measures

The FACTS benchmark breaks factuality down into two practical scenarios:

  • Contextual factuality: How well models generate accurate responses from provided data
  • World-knowledge factuality: Their ability to retrieve correct information from memory or web searches

The results so far? Even the most advanced models—including Gemini 3 Pro, GPT-5, and Claude 4.5 Opus—haven't cracked the 70% accuracy barrier.

Beyond Simple Q&A

Unlike traditional benchmarks, FACTS simulates real-world challenges developers face through four distinct tests:

  1. Parameter benchmark (internal knowledge)
  2. Search benchmark (tool usage)
  3. Multimodal benchmark (visual understanding)
  4. Context benchmark (grounding in provided documents)

Google has made 3,513 test examples publicly available while keeping some data private on Kaggle to prevent artificial score inflation.
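The public/private split matters because a model (or its training data) can be tuned to ace published questions. A minimal sketch of the idea, with entirely made-up data and a toy grader, nothing here reflects the actual FACTS data or scoring:

```python
# Toy illustration of why benchmark maintainers hold back a private split:
# accuracy on public examples can be gamed, so the headline score should
# come from held-out data. All questions and answers below are invented.

def accuracy(examples, predict):
    """Fraction of (question, answer) pairs the predictor gets right."""
    correct = sum(predict(q) == a for q, a in examples)
    return correct / len(examples)

public = [("2+2", "4"), ("capital of France", "Paris")]
private = [("3+3", "6")]

# A "model" that simply memorized the public set answers it perfectly...
memorized = dict(public)
predict = lambda q: memorized.get(q, "unknown")

print(accuracy(public, predict))   # 1.0
print(accuracy(private, predict))  # 0.0, the private split exposes the inflation
```

The same logic is why leaderboard scores are typically reported on the hidden portion only.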

Surprising Performance Gaps

The preliminary rankings reveal interesting patterns:

  • Gemini 3 Pro leads with 68.8% overall accuracy
  • Followed by Gemini 2.5 Pro (62.1%) and GPT-5 (61.8%)

The standout? Gemini 3 Pro scored an impressive 83.8% on search tasks, but only 76.4% when relying on its internal parameters alone.

The takeaway? Companies building knowledge retrieval systems should consider combining models with search tools or vector databases for better results.
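The pattern behind that takeaway can be sketched in a few lines: retrieve relevant passages first, then ask the model to answer from them rather than from memory. The corpus, the word-overlap ranking, and the prompt format below are illustrative stand-ins, not part of FACTS or any specific vendor API:

```python
# Hypothetical sketch of grounding a model in retrieved text instead of
# parametric memory. A real system would use a vector database and an
# actual LLM call; here retrieval is naive word overlap over a tiny corpus.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query; return the top k."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Instruct the model to answer only from the retrieved context."""
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {query}")

corpus = [
    "The FACTS benchmark was released by Google together with Kaggle.",
    "Matrix-Game 3.0 generates 720p video at 40 frames per second.",
]
passages = retrieve("who released the FACTS benchmark", corpus)
prompt = build_prompt("Who released the FACTS benchmark?", passages)
print("FACTS" in prompt)  # True: the relevant passage made it into the prompt
```

Gemini 3 Pro's 83.8% search score versus 76.4% parametric score is exactly the gap this kind of grounding step tries to exploit.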

The most concerning finding involves multimodal tasks, where even the best performer managed only 46.9% accuracy. "These numbers suggest we're still years away from reliable unsupervised data extraction," says one industry analyst who reviewed the findings. Companies using these models for product development should proceed with caution.

Key Points:

  • 🔍 Accuracy ceiling: No model surpassed 70% overall accuracy
  • 🏆 Top performer: Gemini 3 Pro leads but shows significant variation across test types
  • ⚠️ Multimodal warning: Current visual understanding capabilities remain unreliable

