
VLM2Vec-V2: A Unified Framework for Multimodal Retrieval

Breakthrough in Multimodal Learning: VLM2Vec-V2 Bridges Visual Data Types

A collaborative research team from Salesforce Research, University of California, Santa Barbara, University of Waterloo, and Tsinghua University has unveiled VLM2Vec-V2, a multimodal embedding framework designed to unify retrieval tasks across images, videos, and visual documents.

Addressing Current Limitations

Existing multimodal embedding models have focused primarily on natural images from datasets like MSCOCO, Flickr, and ImageNet. These models struggle with broader visual formats, including documents, PDFs, websites, videos, and slides, which creates performance gaps in practical applications like article search and video retrieval.


Expanded Capabilities

The VLM2Vec-V2 framework introduces several key advancements:

  • Expanded MMEB dataset with five new task types
  • Support for visual document retrieval
  • Enhanced video retrieval capabilities (a minimal retrieval sketch follows this list)
  • Temporal localization functionality
  • Integrated video classification and question answering
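
Across these task types, a unified embedding model reduces inference to the same primitive: encode the query and every candidate into a shared vector space, then rank candidates by similarity. The following is a minimal sketch in plain PyTorch, with random tensors standing in for the model's actual embeddings:

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_emb: torch.Tensor, candidate_embs: torch.Tensor) -> torch.Tensor:
    """Rank a retrieval pool by cosine similarity to a query.

    query_emb: (d,) embedding of the query (text, image, video, or document).
    candidate_embs: (n, d) embeddings of the candidate pool.
    Returns candidate indices sorted from best to worst match.
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(candidate_embs, dim=-1)
    scores = c @ q                            # (n,) cosine similarities
    return scores.argsort(descending=True)

# Toy usage: a 4-candidate pool in an 8-dim space.
pool = torch.randn(4, 8)
query = pool[2] + 0.1 * torch.randn(8)        # query close to candidate 2
print(rank_candidates(query, pool))           # candidate 2 should rank first
```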

Technical Innovations

The model builds on the Qwen2-VL architecture, incorporating:

  1. Naive dynamic resolution, which handles inputs at their native resolutions
  2. Multimodal Rotary Position Embedding (M-RoPE)
  3. Unified image and video processing via 2D/3D convolutions
  4. A flexible data sampling pipeline for stable contrastive learning (a minimal loss sketch follows this list)
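
The article does not detail the training recipe; as a hedged sketch, a sampling pipeline like this typically feeds a standard InfoNCE contrastive objective with in-batch negatives, shown below (the temperature value and batch shapes are illustrative assumptions, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_embs: torch.Tensor,
                  target_embs: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    query_embs, target_embs: (batch, d). Row i of each forms a positive
    pair; every other row in the batch serves as a negative.
    """
    q = F.normalize(query_embs, dim=-1)
    t = F.normalize(target_embs, dim=-1)
    logits = q @ t.T / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy check on a random batch of 16 pairs of 32-dim embeddings.
print(info_nce_loss(torch.randn(16, 32), torch.randn(16, 32)).item())
```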

Performance Benchmarks

In comprehensive testing across 78 datasets, VLM2Vec-V2 achieved:

  • Highest average score of 58.0
  • Superior performance in both image and video tasks
  • Competitive results against specialized models like ColPali in document retrieval

The framework is now available on GitHub and Hugging Face.
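
The release's own encoding utilities are not reproduced here. As a rough sketch of how a Qwen2-VL-based embedder can be queried through Hugging Face transformers, assuming a placeholder checkpoint ID and last-token pooling (both are assumptions; consult the repository for the actual API):

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Placeholder checkpoint ID; see the project's Hugging Face page for the real one.
MODEL_ID = "your-org/vlm2vec-v2-checkpoint"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(MODEL_ID)

# Text-only query; the processor also accepts images and videos.
inputs = processor(text=["Find the video where a dog catches a frisbee"],
                   return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Assumption: use the last layer's hidden state at the final token as the embedding.
embedding = out.hidden_states[-1][:, -1]      # shape: (1, hidden_dim)
print(embedding.shape)
```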

Key Points:

  • 🚀 Unified framework for images, videos, and documents
  • 📊 Expanded evaluation dataset with diverse task types
  • ⚡ Outperforms existing baselines in comprehensive benchmark testing
  • 🔍 Open-source availability accelerates research adoption

