Aliyun's WebAgent Outperforms Claude4-Sonnet in GAIA Benchmark

Alibaba Cloud Open-Sources Groundbreaking WebAgent Project

Alibaba Cloud's Tongyi Lab has officially released its WebAgent project as open-source software, with its core components WebShaper and WebSailor setting new standards in AI-powered web interaction. The system demonstrates end-to-end autonomous information retrieval with multi-step reasoning capabilities that rival human performance.

WebAgent simulates human perception and decision-making cycles in digital environments. Its architecture enables efficient handling of complex, ambiguous online tasks through:

Autonomous search functionality
Advanced information filtering
Structured report generation

The system has shown particular strength in academic database searches, news analysis, and professional forum monitoring. On the BrowseComp benchmark, WebSailor-72B outperformed commercial models like DeepSeek R1 and Grok-3, trailing only OpenAI's DeepResearch among all evaluated systems.

Formalization-Driven Data Synthesis

The WebShaper component introduces a novel "formalization-driven" approach to data synthesis, using mathematical set theory to model information search tasks. This framework:

Abstracts complex searches into entity set operations
Generates training data for multi-step reasoning scenarios
Covers diverse domains including sports (21% of dataset) and academia (17%)
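
The article does not spell out WebShaper's actual formalism, so the following is only a minimal sketch of the general idea of treating a search question as operations on entity sets. The toy knowledge base `KB`, the relation names, and the helpers `entities` and `answer` are all hypothetical and are not taken from the project.

```python
# Hypothetical toy knowledge base: relation -> value -> set of matching entities.
KB = {
    "plays_for": {"Team A": {"alice", "bob", "carol"}, "Team B": {"dave"}},
    "born_in_decade": {"1990s": {"alice", "dave"}, "1980s": {"bob", "carol"}},
}


def entities(relation, value):
    """Return the set of entities satisfying one (relation, value) constraint."""
    return set(KB.get(relation, {}).get(value, set()))


def answer(constraints):
    """Intersect the entity sets selected by each constraint in the question."""
    sets = [entities(relation, value) for relation, value in constraints]
    return set.intersection(*sets) if sets else set()


# "Who plays for Team A and was born in the 1990s?" as an intersection of two sets.
result = answer([("plays_for", "Team A"), ("born_in_decade", "1990s")])
print(result)
```

Under this assumed framing, a multi-step question is a chain of such set operations, and adding constraints is one way to make a synthesized question harder.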

Experiments show models trained with WebShaper data outperform those using traditional datasets like WebWalkerQA by significant margins.

Complex Task Handling Architecture

At the system's core lies WebSailor, a large language model that:

Interprets user intent, using 72B parameters in the WebSailor-72B variant
Formulates browsing strategies dynamically
Enables 10-minute deployment via Alibaba Cloud's FunctionAI

The model's training incorporated the innovative SailorFog-QA dataset, which simulates real-world knowledge graphs through subgraph sampling techniques.
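
To make "subgraph sampling" concrete, here is a minimal sketch that grows a small connected subgraph from a seed node in a toy graph. It is not the SailorFog-QA procedure, which the article does not detail. The graph `GRAPH` and the function `sample_subgraph` are hypothetical, and random frontier expansion is just one simple sampling strategy chosen for illustration.

```python
import random

# Hypothetical toy knowledge graph as an adjacency list (undirected).
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def sample_subgraph(graph, start, max_nodes, rng):
    """Grow a connected subgraph by repeatedly adding a random frontier edge."""
    nodes = {start}
    edges = set()
    while len(nodes) < max_nodes:
        # Candidate edges lead from a sampled node to a node not yet sampled.
        frontier = [(u, v) for u in sorted(nodes) for v in graph[u] if v not in nodes]
        if not frontier:
            break  # the reachable component is exhausted
        u, v = rng.choice(frontier)
        nodes.add(v)
        edges.add((u, v))
    return nodes, edges


nodes, edges = sample_subgraph(GRAPH, "A", 3, random.Random(0))
print(sorted(nodes), sorted(edges))
```

A question could then be written over the entities and relations in the sampled subgraph, so that answering it requires following several of its edges.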

Complete Ecosystem Approach

The project includes two additional critical components:

WebDancer: A four-stage training framework (data construction → trajectory sampling → supervised fine-tuning → reinforcement learning)
WebWalker: Benchmark testing for complex web traversal evaluation

The system's hybrid reasoning mode uses a "thought budget mechanism" to optimize resource allocation between simple queries and complex tasks.
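
The article does not describe how the thought budget is computed. As a rough illustration only, the sketch below assumes a difficulty estimate between 0 and 1 and maps it linearly to a cap on reasoning tokens. The function `thought_budget`, its parameters, and the token limits are hypothetical and are not part of the project's API.

```python
def thought_budget(difficulty, min_tokens=256, max_tokens=8192):
    """Map a difficulty estimate in [0, 1] to a cap on reasoning tokens.

    Assumption: linear interpolation between a small budget for simple
    queries and a large budget for complex tasks.
    """
    clamped = min(max(difficulty, 0.0), 1.0)  # guard against out-of-range estimates
    return int(min_tokens + clamped * (max_tokens - min_tokens))


simple_budget = thought_budget(0.0)   # simple lookup gets the minimum budget
complex_budget = thought_budget(1.0)  # hard multi-step task gets the maximum
print(simple_budget, complex_budget)
```

The point of such a cap is that simple queries return quickly while complex tasks are still allowed longer reasoning.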

Industry Impact and Availability

The open-source release has already gained significant traction:

4,000+ GitHub stars within weeks of release
Top trending position on GitHub and Huggingface platforms

The project is available through multiple channels including GitHub and Huggingface.

Key Points:

WebAgent achieves a score of 60.19 on the GAIA benchmark, surpassing Claude4-Sonnet
WebShaper's formalization approach solves high-uncertainty reasoning challenges
Complete ecosystem supports both development and evaluation workflows
Open-source release lowers barriers for enterprise AI adoption
Demonstrated applications range from academic research to business intelligence
