Meituan's Open-Source Multimodal AI Model Sets New Benchmark

In a significant move for the AI industry, Meituan has unveiled its LongCat-Flash-Omni multimodal large model as an open-source project. The model has already surpassed several closed-source competitors in benchmark tests, achieving a rare "open source as SOTA" (State-of-the-Art) breakthrough.

Technical Breakthroughs

The LongCat-Flash-Omni model stands out for its ability to handle complex cross-modal tasks with precision. For instance, when presented with questions that combine physical logic and spatial reasoning, such as describing the motion trajectory of a ball inside a hexagonal space, it can accurately represent the scenario and explain the dynamics in natural language.

In addition, the model excels in speech recognition, even in high-noise environments, and can extract key information from blurry images or short video clips to generate structured answers.

Innovative Architecture

The model's success stems from its end-to-end unified architecture. Unlike traditional multimodal models that process each modality separately, LongCat integrates text, audio, and visual data into a single representation space. This design allows for seamless alignment and reasoning across modalities.
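
Meituan has not published LongCat's internals, but the idea of a single representation space can be sketched briefly: each modality gets its own adapter that projects into a shared width, and one backbone attends over the combined token sequence. The PyTorch sketch below is purely illustrative; all class names, feature dimensions, and layer counts are assumptions, not details of LongCat-Flash-Omni.

```python
import torch
import torch.nn as nn

class UnifiedMultimodalEncoder(nn.Module):
    """Toy shared-representation encoder (illustrative, not LongCat's code)."""

    def __init__(self, d_model=512):
        super().__init__()
        # Modality-specific adapters map raw features to the shared width d_model.
        self.text_proj = nn.Linear(768, d_model)     # e.g. token embeddings
        self.audio_proj = nn.Linear(128, d_model)    # e.g. mel-spectrogram frames
        self.vision_proj = nn.Linear(1024, d_model)  # e.g. image patch features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text, audio, vision):
        # Project every modality into the same space, then reason over them jointly.
        tokens = torch.cat([
            self.text_proj(text),
            self.audio_proj(audio),
            self.vision_proj(vision),
        ], dim=1)
        return self.backbone(tokens)

# Toy usage with random features: one sample, arbitrary sequence lengths per modality.
model = UnifiedMultimodalEncoder()
out = model(torch.randn(1, 16, 768), torch.randn(1, 32, 128), torch.randn(1, 49, 1024))
print(out.shape)  # torch.Size([1, 97, 512])
```

Because everything lands in one token sequence, alignment and cross-modal reasoning happen inside ordinary self-attention rather than in a separate fusion stage.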

During training, Meituan's team employed a progressive multimodal injection strategy: first solidifying the language foundation, then gradually introducing image, speech, and video data. This approach ensures the model maintains strong language capabilities while improving cross-modal generalization.
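
The article describes the strategy only at a high level; purely as an illustration, the snippet below walks through hypothetical training stages with a growing modality mix. Stage names, step counts, and sampling ratios are invented for the example and are not Meituan's actual recipe.

```python
# Hedged sketch of a progressive multimodal injection schedule (illustrative values only).
STAGES = [
    {"name": "language foundation", "steps": 10_000, "mix": {"text": 1.0}},
    {"name": "image injection",     "steps": 5_000,  "mix": {"text": 0.7, "image": 0.3}},
    {"name": "speech injection",    "steps": 5_000,  "mix": {"text": 0.5, "image": 0.3, "speech": 0.2}},
    {"name": "video injection",     "steps": 5_000,  "mix": {"text": 0.4, "image": 0.25, "speech": 0.2, "video": 0.15}},
]

def run_schedule(train_step, sample_batch):
    """Drive training stage by stage, sampling batches according to each stage's mix."""
    for stage in STAGES:
        for _ in range(stage["steps"]):
            batch = sample_batch(stage["mix"])  # caller decides how modalities are mixed
            train_step(batch)
```

Keeping text in every stage's mix is what preserves the language foundation while later modalities are layered in.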

Real-Time Performance

One of the most impressive features of LongCat-Flash-Omni is its near-zero latency interaction. Thanks to its Flash inference engine and lightweight design, the model delivers smooth conversations on consumer-grade GPUs. Users interacting with it via Meituan's app or web version experience minimal delay, a natural "what you ask is what you get" exchange.
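
The article does not detail how the Flash inference engine achieves this, but a common ingredient of low perceived latency is streaming decoding: tokens are shown as soon as they are produced instead of after the full answer is ready. The sketch below uses a hypothetical generate_tokens stand-in to show the pattern and how time-to-first-token, the delay users actually notice, is measured.

```python
import time

def generate_tokens(prompt):
    # Hypothetical stand-in for a streaming decoder; not the real model's API.
    for word in ("The", "ball", "bounces", "off", "each", "wall", "in", "turn."):
        time.sleep(0.05)  # stands in for per-token decode time
        yield word

def stream_reply(prompt):
    start = time.perf_counter()
    for i, token in enumerate(generate_tokens(prompt)):
        if i == 0:
            # Time to first token is what users perceive as "latency".
            print(f"[first token after {time.perf_counter() - start:.2f}s]")
        print(token, end=" ", flush=True)
    print()

stream_reply("Describe the ball's trajectory in the hexagon.")
```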

Availability and Impact

The model is now freely available on Meituan's platforms. Developers can access the weights through Hugging Face, while ordinary users can test it directly within the application. This move underscores Meituan's confidence in its AI infrastructure and signals its commitment to advancing China's multimodal AI ecosystem.
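
For developers, pulling open weights from Hugging Face typically looks like the snippet below. The repository id is a placeholder, since the article does not name the exact repo; substitute the official LongCat-Flash-Omni listing.

```python
from huggingface_hub import snapshot_download

# Placeholder repository id; replace with the official LongCat-Flash-Omni repo on Hugging Face.
local_dir = snapshot_download(repo_id="your-org/LongCat-Flash-Omni")
print(f"Weights downloaded to {local_dir}")
```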

As AI competition shifts from single-modal accuracy to multimodal collaboration, LongCat-Flash-Omni represents both a technical milestone and a redefinition of application scenarios. Its emergence suggests that China's AI journey is entering a new phase of innovation.

Key Points:

  • Open-source SOTA: LongCat-Flash-Omni outperforms closed-source models in benchmarks.
  • Unified architecture: Integrates text, audio, and visual data into a single representation space.
  • Real-time interaction: Delivers near-zero latency responses on consumer-grade hardware.
  • Progressive training: Combines language foundations with gradual multimodal injection.
  • Ecosystem boost: Freely available to developers and users, fostering broader adoption.
