Aliyun's WebAgent Outperforms Claude4-Sonnet on the GAIA Benchmark
Alibaba Cloud Open-Sources Groundbreaking WebAgent Project
Alibaba Cloud's Tongyi Lab has officially released its WebAgent project as open-source software, with its core components WebShaper and WebSailor setting new standards in AI-powered web interaction. The system demonstrates end-to-end autonomous information retrieval with multi-step reasoning capabilities that rival human performance.
Human-Like Web Navigation Capabilities
WebAgent simulates human perception and decision-making cycles in digital environments. Its architecture enables efficient handling of complex, ambiguous online tasks through:
- Autonomous search functionality
- Advanced information filtering
- Structured report generation
The system has shown particular strength in academic database searches, news analysis, and professional forum monitoring. On the BrowseComp benchmark, WebSailor-72B outperformed commercial models like DeepSeek R1 and Grok-3, trailing only OpenAI's DeepResearch among all evaluated systems.
Formalization-Driven Data Synthesis
The WebShaper component introduces a novel "formalization-driven" approach to data synthesis, using mathematical set theory to model information search tasks. This framework:
- Abstracts complex searches into entity set operations
- Generates training data for multi-step reasoning scenarios
- Covers diverse domains including sports (21% of dataset) and academia (17%)
Experiments show models trained with WebShaper data outperform those using traditional datasets like WebWalkerQA by significant margins.
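To make the entity-set idea concrete, here is a minimal Python sketch of how a multi-constraint search question can be expressed as an intersection of entity sets. It is an illustration under stated assumptions, not WebShaper's actual formalism: the toy knowledge base `KB`, the relation names, and the helpers `entities_with` and `answer` are all hypothetical.

```python
# Hypothetical toy knowledge base: relation -> {entity: set of related values}.
# None of these names come from WebShaper; they exist only for illustration.
KB = {
    "played_for": {
        "alice": {"team_a"},
        "bob": {"team_a", "team_b"},
        "carol": {"team_b"},
    },
    "born_in_decade": {
        "alice": {"1990s"},
        "bob": {"1990s"},
        "carol": {"1980s"},
    },
}


def entities_with(relation, value):
    """Return the set of entities whose `relation` includes `value`."""
    return {e for e, values in KB[relation].items() if value in values}


def answer(question):
    """Treat a question as an intersection of entity sets, one per constraint."""
    result = None
    for relation, value in question:
        matches = entities_with(relation, value)
        result = matches if result is None else result & matches
    return result or set()


# "Who played for team_b and was born in the 1990s?"
question = [("played_for", "team_b"), ("born_in_decade", "1990s")]
print(answer(question))  # {'bob'}
```

Each added constraint narrows the candidate set, which is one way to see why chaining several constraints produces questions that require multi-step reasoning.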
Complex Task Handling Architecture
At the system's core lies WebSailor, a 72B-parameter large language model that:
- Interprets user intent
- Formulates browsing strategies dynamically
- Can be deployed in about 10 minutes via Alibaba Cloud's FunctionAI
The model's training incorporated the SailorFog-QA dataset, which simulates real-world knowledge graphs through subgraph sampling techniques.
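The article does not describe how SailorFog-QA samples its subgraphs, so the following is only a generic sketch of the idea: grow a connected subgraph from a seed entity by repeatedly picking a random outgoing edge. The toy graph `GRAPH`, its entity and relation names, and the function `sample_subgraph` are hypothetical and not taken from the project.

```python
import random

# Hypothetical toy knowledge graph: node -> list of (relation, neighbor) edges.
GRAPH = {
    "paper_x": [("written_by", "author_y"), ("published_in", "venue_z")],
    "author_y": [("affiliated_with", "lab_q"), ("wrote", "paper_x")],
    "venue_z": [("held_in", "city_c")],
    "lab_q": [("located_in", "city_c")],
    "city_c": [],
}


def sample_subgraph(graph, start, max_edges, rng):
    """Grow a connected subgraph from `start`, one random frontier edge at a time."""
    nodes = {start}
    edges = []
    while len(edges) < max_edges:
        # Frontier: unused edges leaving any node already in the subgraph.
        frontier = [
            (src, rel, dst)
            for src in sorted(nodes)
            for rel, dst in graph[src]
            if (src, rel, dst) not in edges
        ]
        if not frontier:
            break
        edge = rng.choice(frontier)
        edges.append(edge)
        nodes.add(edge[2])
    return nodes, edges


nodes, edges = sample_subgraph(GRAPH, "paper_x", max_edges=3, rng=random.Random(0))
for src, rel, dst in edges:
    print(f"{src} --{rel}--> {dst}")
```

A sampled subgraph like this could then be turned into a question whose answer requires following every sampled edge, though how the dataset actually builds its questions is not covered here.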
Complete Ecosystem Approach
The project includes two additional critical components:
- WebDancer: A four-stage training framework (data construction → trajectory sampling → supervised fine-tuning → reinforcement learning)
- WebWalker: A benchmark for evaluating complex web traversal
The system's hybrid reasoning mode uses a "thought budget mechanism" to allocate resources between simple queries and complex tasks.
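The article gives no detail on how the thought budget is computed. As a rough illustration of the general idea only, the sketch below assigns a larger reasoning-token budget to queries that look more complex. The function name `thought_budget`, the scoring heuristic, and the token limits are all assumptions made for this example.

```python
def thought_budget(query, min_tokens=128, max_tokens=4096):
    """Hypothetical heuristic: longer, multi-clause queries get more reasoning tokens."""
    text = f" {query.lower()} "
    # Count connective words as a crude proxy for the number of constraints.
    clauses = 1 + sum(text.count(w) for w in (" and ", " who ", " which ", " that "))
    words = len(query.split())
    # Blend both signals into a score capped at 1.0.
    score = min(1.0, 0.5 * (words / 40) + 0.5 * (clauses / 5))
    return int(min_tokens + score * (max_tokens - min_tokens))


simple = thought_budget("Capital of France?")
hard = thought_budget(
    "Which author who published at a venue that was held in 2019 "
    "and who later joined a lab that works on retrieval won an award?"
)
print(simple, hard)
```

The design point is simply that the budget scales with estimated difficulty, so simple lookups do not pay for long reasoning chains.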
Industry Impact and Availability
The open-source release has already gained significant traction:
- 4,000+ GitHub stars within weeks of release
- Top trending position on GitHub and Hugging Face
The project is available on both GitHub and Hugging Face.
Key Points:
- WebAgent achieves a score of 60.19 on the GAIA benchmark, surpassing Claude4-Sonnet
- WebShaper's formalization approach solves high-uncertainty reasoning challenges
- Complete ecosystem supports both development and evaluation workflows
- Open-source release lowers barriers for enterprise AI adoption
- Demonstrated applications range from academic research to business intelligence