Gemini Leads Global AI Vision Race While Chinese Models Gain Ground
The Battle for AI Vision Supremacy Heats Up
The latest SuperCLUE-VLM12 benchmark offers a clear snapshot of today's multimodal AI landscape. Google's Gemini-3-pro leads the field with an overall score of 83.64 points, more than eight points ahead of the runner-up.

Domestic Challengers Rise
What makes this competition particularly intriguing is the strong showing from Chinese models. SenseTime's SenseNova V6.5Pro claimed second place (75.35 points), demonstrating particular strength in visual reasoning tasks. Meanwhile, ByteDance's Douyin visual version edged into third (73.15 points), even outperforming several international rivals in basic cognition tests.
"These results confirm China's growing capability in computer vision technologies," notes Dr. Li Wei, an AI researcher at Tsinghua University. "Three years ago, we wouldn't have seen domestic models competing at this level."
Surprises and Breakthroughs
The benchmark delivered several notable developments:
- Open-source milestone: Alibaba's Qwen3-vl became the first open-source model to crack the 70-point barrier (70.89 points), offering powerful visual analysis capabilities to the broader developer community.
- Established players stumble: Anthropic's Claude-opus-4-5 managed just 71.44 points, while OpenAI's GPT-5.2 (high) fell short at 69.16 points, well below industry expectations.
- Baidu holds steady: ERNIE-5.0-Preview maintained China's strong representation by securing fifth place overall.
What This Means for AI Development
The results suggest we're entering a new phase where:
1) Visual understanding capabilities are becoming crucial differentiators between models
2) The gap between proprietary and open-source solutions is narrowing
3) Traditional power rankings in AI don't necessarily translate to vision capabilities
"We're seeing specialization emerge," explains MIT Professor Alan Chen. "Some models optimized for text struggle with visual tasks, while others like Gemini clearly prioritized multimodal training."
Key Points:
- Global leader: Gemini-3-pro dominates with top scores across basic cognition (84.2), visual reasoning (83.1), and application (83.6)
- Chinese advances: Two domestic models now rank among global top three in vision benchmarks
- Open-source progress: Qwen3-vl breaks new ground for community-developed vision models
- Shifting landscape: Established leaders like GPT show unexpected weaknesses in visual tasks
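
For readers who want to sanity-check the headline number, here is a minimal Python sketch. It assumes the overall score is an unweighted mean of the three category scores; the article does not state how SuperCLUE-VLM12 actually weights its categories, so treat that as a guess. The variable names are illustrative only.

```python
from statistics import mean

# Gemini-3-pro category scores as reported in this article.
category_scores = {
    "basic cognition": 84.2,
    "visual reasoning": 83.1,
    "application": 83.6,
}
reported_overall = 83.64

# Assumption (not confirmed by the article): the overall score is the
# unweighted mean of the three category scores.
estimated_overall = mean(category_scores.values())
difference = abs(estimated_overall - reported_overall)

print(f"estimated overall: {estimated_overall:.2f}")
print(f"reported overall:  {reported_overall:.2f}")
print(f"difference:        {difference:.2f}")
```

Under that assumption the estimate comes to roughly 83.63, within about 0.01 of the reported 83.64. A gap that small is what you would expect if the category scores were rounded to one decimal place, though it does not prove the benchmark uses a simple average.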


