Shanghai AI Lab Launches First Video-to-Web Benchmark

The Shanghai Artificial Intelligence Laboratory has launched IWR-Bench, the world's first evaluation framework designed to assess how well large language models can transform video demonstrations into functional web code. This innovative benchmark addresses a critical gap in assessing multimodal AI systems' capabilities for dynamic web reconstruction.

Breaking New Ground in AI Evaluation

Unlike traditional image-to-code tasks, IWR-Bench presents models with videos capturing complete user interactions alongside all necessary static webpage resources. The system then evaluates how accurately models can recreate the observed dynamic behaviors across various complexity levels - from basic web browsing to sophisticated applications like the 2048 game and flight booking systems.

Surprising Performance Gaps Revealed

Initial testing of 28 leading AI models yielded sobering results:

  • GPT-5 was the top performer, with an overall score of just 36.35/100
  • Its Interaction Function Score (IFS) was 24.39%
  • Its Visual Fidelity Score (VFS) was 64.25%

The significant disparity between visual restoration (64.25%) and functional accuracy (24.39%) highlights fundamental challenges in translating observed behaviors into working code logic.
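
As a quick arithmetic check, the three reported figures are consistent with a weighted average that puts 70% of the weight on IFS and 30% on VFS. The article does not state how the overall score is combined, so the weights below are an assumption inferred from the numbers, and `overall_score` is a hypothetical helper name.

```python
# Assumed weights: inferred from the reported numbers, not stated in the article.
IFS_WEIGHT = 0.7
VFS_WEIGHT = 0.3


def overall_score(ifs: float, vfs: float) -> float:
    """Combine the two sub-scores (both on a 0-100 scale) into one weighted score."""
    return IFS_WEIGHT * ifs + VFS_WEIGHT * vfs


gpt5_score = overall_score(ifs=24.39, vfs=64.25)
print(f"{gpt5_score:.2f}")  # prints 36.35
```

Under this assumed weighting, functional accuracy dominates the total, which would explain why a 64.25% visual score still leaves the overall result below 40.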

Innovative Evaluation Methodology

The benchmark employs several novel assessment techniques:

  1. Proxy-based automated testing verifies interactive functionality
  2. Complete but anonymized static resources force visual matching rather than semantic shortcuts
  3. Temporal understanding tests track state changes across video frames
  4. Multi-dimensional scoring evaluates both appearance and functionality
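
One way to picture the anonymization in item 2: if assets keep names like `logo.png` or `submit-button.svg`, a model can guess their role from the name alone instead of from what it sees in the video. The sketch below renames files to opaque identifiers while keeping the extension. It is an illustration only; the article does not describe how IWR-Bench anonymizes resources, and `anonymize_name` and the sample file names are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def anonymize_name(path: str) -> str:
    """Replace a descriptive file name with an opaque one, keeping the extension."""
    suffix = PurePosixPath(path).suffix  # e.g. ".png"
    # A short hash of the original path gives a stable name with no semantic hint.
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()[:12]
    return f"asset_{digest}{suffix}"


# Hypothetical asset names, chosen for illustration.
for original in ["img/logo.png", "icons/submit-button.svg"]:
    print(original, "->", anonymize_name(original))
```

With names like these, the only way to place an asset correctly is to compare its contents with what appears on screen in the recording.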

Technical Challenges Identified

The research uncovered four major hurdles for current AI systems:

  1. Temporal understanding: Extracting key events from continuous video frames
  2. Logical abstraction: Converting behaviors into programming concepts like event listeners
  3. Resource matching: Correctly associating anonymized files with visual elements
  4. Code generation: Producing structurally sound HTML/CSS/JavaScript

The findings suggest that even advanced multimodal models struggle with causal reasoning and state management required for dynamic web reconstruction.
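
To make the state-management point concrete, consider the 2048 game mentioned earlier. A video only shows the tiles before and after a swipe; the model has to infer a rule like the one below and attach it to a key or swipe handler. This is a minimal sketch of the standard 2048 left-slide rule for a single row, written in Python for brevity rather than the JavaScript a model would have to emit. `slide_left` is a hypothetical name, and none of this code comes from the benchmark.

```python
def slide_left(row: list[int]) -> list[int]:
    """Slide one 2048 row to the left, merging equal neighbours once per move."""
    tiles = [value for value in row if value != 0]  # drop empty cells
    merged: list[int] = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # merge the pair into one tile
            i += 2                       # a merged tile cannot merge again this move
        else:
            merged.append(tiles[i])
            i += 1
    # Pad with empty cells so the row keeps its original width.
    return merged + [0] * (len(row) - len(merged))


print(slide_left([2, 2, 4, 0]))  # prints [4, 4, 0, 0]
print(slide_left([2, 2, 2, 2]))  # prints [4, 4, 0, 0]
```

Note that the two 4s in the first result stay separate: that "merge once per move" detail is invisible in any single frame and has to be deduced from how the board changes over time.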

Industry Implications

The benchmark's creators emphasize its dual significance:

  1. Research value: Provides new metrics for evaluating dynamic understanding capabilities
  2. Practical potential: Could eventually lower barriers to front-end development if the technology matures

However, researchers caution that high benchmark scores would not immediately translate into production-ready tools, noting critical gaps in handling performance optimization, security, and edge cases.

Key Points:

  • First specialized benchmark for video-to-webpage conversion unveiled
  • GPT-5 leads but scores just 36.35/100 overall
  • Models show strong visual restoration (64%) but weak interaction logic (24%)
  • Reveals fundamental gaps in temporal reasoning and state management
  • Could shape future "what you see is what you get" development tools
