Aliyun's WebAgent Outperforms Claude4-Sonnet in GAIA Benchmark

Alibaba Cloud Open-Sources Groundbreaking WebAgent Project

Alibaba Cloud's Tongyi Lab has officially released its WebAgent project as open-source software, with its core components WebShaper and WebSailor setting new standards in AI-powered web interaction. The system demonstrates end-to-end autonomous information retrieval with multi-step reasoning capabilities that rival human performance.

Human-Like Web Navigation Capabilities

WebAgent simulates human perception and decision-making cycles in digital environments. Its architecture enables efficient handling of complex, ambiguous online tasks through:

  • Autonomous search functionality
  • Advanced information filtering
  • Structured report generation

The system has shown particular strength in academic database searches, news analysis, and professional forum monitoring. On the BrowseComp benchmark, WebSailor-72B outperformed commercial models like DeepSeek R1 and Grok-3, trailing only OpenAI's DeepResearch among all evaluated systems.

Formalization-Driven Data Synthesis

The WebShaper component introduces a novel "formalization-driven" approach to data synthesis, using mathematical set theory to model information search tasks. This framework:

  • Abstracts complex searches into entity set operations
  • Generates training data for multi-step reasoning scenarios
  • Covers diverse domains including sports (21% of dataset) and academia (17%)

Experiments show models trained with WebShaper data outperform those using traditional datasets like WebWalkerQA by significant margins.
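
As a rough illustration of what "abstracting a search into entity set operations" can look like, the sketch below models a two-condition question as the intersection of two entity sets. The toy knowledge base, the relation names, and the `subjects` helper are all hypothetical and are not taken from WebShaper itself.

```python
# Hypothetical toy knowledge base: relation name -> set of (subject, object) pairs.
# None of these names or facts come from WebShaper; they only illustrate the idea.
KB = {
    "played_for": {("alice", "team_a"), ("bob", "team_a"), ("carol", "team_b")},
    "born_in": {("alice", "1990"), ("bob", "1985"), ("carol", "1990")},
}


def subjects(relation: str, obj: str) -> set:
    """Return the set of entities linked to `obj` through `relation`."""
    return {s for (s, o) in KB[relation] if o == obj}


# "Which player played for team_a and was born in 1990?"
# becomes the intersection of two entity sets.
answer = subjects("played_for", "team_a") & subjects("born_in", "1990")

# "Who played for team_a or was born in 1990?" becomes a union.
either = subjects("played_for", "team_a") | subjects("born_in", "1990")

print(answer)
print(sorted(either))
```

Chaining more intersections and unions of this kind yields questions that require several lookups to answer, which is the sort of multi-step reasoning task the article describes.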

Complex Task Handling Architecture

At the system's core lies WebSailor, a large language model with 72B parameters that:

  • Interprets user intent
  • Formulates browsing strategies dynamically
  • Can be deployed in about 10 minutes via Alibaba Cloud's FunctionAI

The model's training incorporated the SailorFog-QA dataset, which simulates real-world knowledge graphs through subgraph sampling techniques.
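
To make "subgraph sampling" concrete, here is a minimal sketch that draws a small connected subgraph from a toy graph using a random walk. The random-walk strategy, the `GRAPH` data, and the `sample_subgraph` function are assumptions chosen for illustration; the article does not say which sampling procedure SailorFog-QA actually uses.

```python
import random

# Hypothetical toy knowledge graph as an adjacency list (undirected, no self-loops).
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def sample_subgraph(graph, start, max_nodes, rng):
    """Random-walk sampling: walk from `start`, collecting nodes and traversed edges.

    Stops once `max_nodes` distinct nodes are collected or a step cap is hit.
    """
    visited = {start}
    edges = set()
    current = start
    steps = 0
    while len(visited) < max_nodes and steps < 10 * max_nodes:
        nxt = rng.choice(graph[current])
        # Store each undirected edge once, with endpoints in sorted order.
        edges.add(tuple(sorted((current, nxt))))
        visited.add(nxt)
        current = nxt
        steps += 1
    return visited, edges


nodes, edges = sample_subgraph(GRAPH, "A", 3, random.Random(0))
print(sorted(nodes), sorted(edges))
```

A question generator could then turn the sampled nodes and edges into a question whose answer requires following several of those edges, though that step is outside this sketch.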

Complete Ecosystem Approach

The project includes two additional critical components:

  1. WebDancer: A four-stage training framework (data construction → trajectory sampling → supervised fine-tuning → reinforcement learning)
  2. WebWalker: Benchmark testing for complex web traversal evaluation

The system's hybrid reasoning mode uses a "thought budget mechanism" to optimize resource allocation between simple queries and complex tasks.
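
The article does not describe how the thought budget is computed. The sketch below shows one possible reading, in which a query receives a reasoning-token allowance that grows with a crude complexity estimate. The heuristic, the function names, and every numeric default here are hypothetical.

```python
def estimate_complexity(query: str) -> int:
    """Made-up heuristic: count words that hint at multi-step reasoning."""
    markers = (" and ", " before ", " after ", " who ", " which ", " compare ")
    padded = f" {query.lower()} "  # pad so markers match at the string edges
    return sum(padded.count(m) for m in markers)


def thought_budget(query: str, base: int = 256, per_step: int = 512, cap: int = 4096) -> int:
    """Return a hypothetical reasoning-token allowance for `query`.

    Simple queries get the base allowance; each complexity marker adds
    `per_step` tokens, up to a hard cap.
    """
    return min(cap, base + per_step * estimate_complexity(query))


simple = thought_budget("capital of France")
hard = thought_budget("Which author won the prize before 2000 and which city hosted it")
print(simple, hard)
```

The point of the design is that cheap lookups are not charged for long reasoning chains, while multi-hop questions are given more room up to a fixed ceiling.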

Industry Impact and Availability

The open-source release has already gained significant traction:

  • 4,000+ GitHub stars within weeks of release
  • Top trending position on GitHub and Hugging Face

The project is available through multiple channels, including GitHub and Hugging Face.

Key Points:

  • WebAgent achieves a score of 60.19 on the GAIA benchmark, surpassing Claude4-Sonnet
  • WebShaper's formalization approach solves high-uncertainty reasoning challenges
  • Complete ecosystem supports both development and evaluation workflows
  • Open-source release lowers barriers for enterprise AI adoption
  • Demonstrated applications range from academic research to business intelligence