LongCat-Flash-Omni Launches with Multimodal Breakthroughs

Meituan Unveils LongCat-Flash-Omni with Revolutionary Multimodal Capabilities

November 3, 2025 - Following the successful launch of its LongCat-Flash series in September, Meituan has now introduced LongCat-Flash-Omni, a groundbreaking multimodal AI model that sets new standards for real-time interaction across text, image, video, and speech modalities.

Technical Innovations

The model builds on the efficient architecture of the LongCat-Flash series, adding several key advancements:

  • Shortcut-Connected MoE (ScMoE) Technology: Keeps inference efficient despite the model's 560 billion total parameters, of which roughly 27 billion are activated per token (a toy sketch follows this list)
  • Integrated Multimodal Modules: Combines perception and speech reconstruction in an end-to-end design
  • Progressive Fusion Training: Addresses the uneven data distribution across modalities by introducing them in stages (see the curriculum sketch below)
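
To make the shortcut idea concrete, here is a minimal single-device PyTorch sketch of a shortcut-connected MoE block. It is illustrative only: the class names, layer sizes, and top-k routing are assumptions rather than Meituan's implementation. The point is that the MoE path consumes activations from an earlier layer, so on a real multi-device deployment its expert-dispatch communication can overlap with the current block's dense computation.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Plain feed-forward block, used both as the dense path and as an expert."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class ShortcutMoE(nn.Module):
    """Toy top-k mixture-of-experts that reads an *earlier* layer's
    activations (the shortcut) instead of the current ones."""
    def __init__(self, d_model, d_ff, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(FFN(d_model, d_ff) for _ in range(n_experts))
        self.k = k

    def forward(self, shortcut_x):  # shortcut_x: (tokens, d_model)
        # Route each token to its k highest-scoring experts.
        weights, idx = self.router(shortcut_x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(shortcut_x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(shortcut_x[mask])
        return out

class ScMoEBlock(nn.Module):
    """Dense FFN on the current activations plus MoE on the shortcut input;
    on real hardware the dense compute can hide the MoE dispatch latency."""
    def __init__(self, d_model=64):
        super().__init__()
        self.dense = FFN(d_model, 4 * d_model)
        self.moe = ShortcutMoE(d_model, 4 * d_model)

    def forward(self, x, shortcut_x):
        return x + self.dense(x) + self.moe(shortcut_x)

block = ScMoEBlock()
x, earlier = torch.randn(10, 64), torch.randn(10, 64)
print(block(x, earlier).shape)  # torch.Size([10, 64])
```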
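
The progressive fusion idea can likewise be sketched as a staged curriculum: modalities are folded into training one stage at a time so the scarcer image, audio, and video data never swamps the text backbone. The stage names, ordering, and step counts below are hypothetical placeholders, not the published training recipe.

```python
import itertools
import random

# Hypothetical curriculum; stage order and step counts are illustrative only.
STAGES = [
    ("text-only", ["text"], 1000),
    ("+image",    ["text", "image"], 500),
    ("+audio",    ["text", "image", "audio"], 500),
    ("+video",    ["text", "image", "audio", "video"], 500),
]

def progressive_fusion(loaders, train_step):
    """loaders: modality -> infinite batch iterator; train_step(batch, modality)."""
    for name, modalities, steps in STAGES:
        print(f"stage {name}: sampling from {modalities}")
        for _ in range(steps):
            modality = random.choice(modalities)  # mix everything active so far
            train_step(next(loaders[modality]), modality)

# Toy run with dummy batches and a no-op training step.
loaders = {m: itertools.cycle([{"modality": m}])
           for m in ("text", "image", "audio", "video")}
progressive_fusion(loaders, lambda batch, modality: None)
```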

Performance Benchmarks

Evaluations reported at launch indicate that LongCat-Flash-Omni achieves:

  • State-of-the-art (SOTA) results among open-source models on multimodal benchmarks
  • No performance degradation when switching between modalities ("no intelligence reduction")
  • Superior real-time audio-video interaction, with latency below typical industry levels
  • Exceptional scores in:
    • Text understanding (+15% over previous models)
    • Image recognition (98.7% accuracy)
    • Speech naturalness (4.8/5 human evaluation)

Developer Applications

The release includes multiple access channels:

  • Official app with voice-call functionality (video calls coming soon)
  • Web interface supporting file uploads and multimodal queries
  • Open-source availability on Hugging Face and GitHub (a download sketch follows this list)
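
For developers starting from the open weights, fetching them might look like the sketch below. The repository id is an assumption based on the announcement; check the Hugging Face release page for the actual id, license, and hardware requirements before running (the full checkpoint is very large).

```python
# Minimal sketch; "meituan-longcat/LongCat-Flash-Omni" is an assumed repo id,
# so verify it on Hugging Face before downloading.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="meituan-longcat/LongCat-Flash-Omni")
print("model files in:", local_dir)
```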

Key Points

  • First open-source model to combine offline multimodal understanding with real-time audio-video interaction
  • A lightweight audio decoder enables natural speech reconstruction (illustrated after this list)
  • Early fusion during training mitigates interference between modalities
  • Currently supports Chinese/English with more languages planned for Q1 2026
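
As a rough picture of what a "lightweight audio decoder" can mean, here is a hypothetical PyTorch module that upsamples latent speech frames back to a waveform with a few transposed convolutions. The layer counts, channel widths, and 64x upsampling factor are assumptions for illustration, not LongCat-Flash-Omni's actual decoder.

```python
import torch
import torch.nn as nn

class LightAudioDecoder(nn.Module):
    """Hypothetical lightweight decoder: maps latent frames (batch, latent_dim,
    frames) to a waveform via a small stack of transposed 1-D convolutions."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 64, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
            nn.Tanh(),  # waveform samples constrained to [-1, 1]
        )

    def forward(self, latents):
        return self.net(latents).squeeze(1)  # (batch, samples), 64x upsampled

wave = LightAudioDecoder()(torch.randn(1, 128, 50))
print(wave.shape)  # torch.Size([1, 3200])
```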
