LongCat-Flash-Omni Launches with Multimodal Breakthroughs
Meituan Unveils LongCat-Flash-Omni with Revolutionary Multimodal Capabilities
November 3, 2025 - Following the successful launch of its LongCat-Flash series in September, Meituan has now introduced LongCat-Flash-Omni, a groundbreaking multimodal AI model that sets new standards for real-time interaction across text, image, video, and speech modalities.
Technical Innovations
The model builds on the efficient architecture of the LongCat-Flash series, adding several key advancements:
- Shortcut-Connected MoE (ScMoE): Enables efficient processing despite the model's 560 billion total parameters, of which roughly 27 billion are activated per token (see the sketch after this list)
- Integrated Multimodal Modules: Combine visual and audio perception with speech reconstruction in a single end-to-end design
- Progressive Fusion Training: Addresses data distribution challenges across different modalities
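The announcement does not describe how the shortcut connection is actually wired, so the PyTorch-style sketch below is only a rough, single-device illustration of the general shortcut-connected MoE idea: each token is routed to a few sparse experts while a dense shortcut branch processes every token, and both outputs are summed into the residual stream. The class name and all sizes (`ShortcutMoEBlock`, `d_model`, `n_experts`, `top_k`) are illustrative assumptions, not Meituan's implementation.

```python
# Minimal, single-device sketch of a shortcut-connected MoE block.
# Illustrative only: names, sizes, and the exact wiring are assumptions,
# not Meituan's published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortcutMoEBlock(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        # Sparse experts: only top_k of these contribute per token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # Dense "shortcut" branch that every token passes through; at scale its
        # computation can be overlapped with expert dispatch/combine communication.
        self.shortcut_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        topk_w, topk_idx = gates.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)

        # Naive expert combination (clear rather than fast): every expert runs
        # on the full batch and is masked down to the tokens routed to it.
        moe_out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]                 # chosen expert per token
            w = topk_w[..., slot].unsqueeze(-1)       # its routing weight
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)
                if mask.any():
                    moe_out = moe_out + mask * w * expert(x)

        # Residual + sparse expert output + dense shortcut output.
        return x + moe_out + self.shortcut_ffn(x)
```

In a distributed setting, a shortcut branch of this kind also gives the scheduler dense work to run while sparse-expert tokens are dispatched and gathered across devices, which is one plausible reason the design helps a 560-billion-parameter model stay efficient with only about 27 billion parameters active per token.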

Performance Benchmarks
Evaluations reported with the release indicate that LongCat-Flash-Omni achieves:
- State-of-the-art (SOTA) results among open-source multimodal models
- No loss of capability as modalities are added ("no intelligence reduction")
- Real-time audio-video interaction with latency below typical industry levels
- Exceptional scores in:
  - Text understanding (+15% over previous models)
  - Image recognition (98.7% accuracy)
  - Speech naturalness (4.8/5 human evaluation)

Developer Applications
The release includes multiple access channels:
- Official app with voice-call functionality (video calls coming soon)
- Web interface supporting file uploads and multimodal queries
- Open-source availability on Hugging Face and GitHub
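For readers who want to try the open weights, the snippet below is a hypothetical loading sketch using the Hugging Face `transformers` library. The repository id, the use of `AutoModelForCausalLM` with `trust_remote_code`, and the chat-template call are assumptions patterned on how earlier LongCat-Flash checkpoints are commonly loaded, not confirmed usage for the Omni release; check the model card for the actual instructions, and note that the full 560B-parameter checkpoint needs a multi-GPU setup rather than a single device.

```python
# Hypothetical loading sketch; the repository id, model class, and chat
# format are assumptions, not confirmed usage for LongCat-Flash-Omni.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "meituan-longcat/LongCat-Flash-Omni"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,   # custom multimodal architecture ships with the repo
    torch_dtype="auto",
    device_map="auto",        # shard across whatever GPUs are available
)

# Text-only smoke test; image, video, and audio inputs go through the
# model's own processor classes, which are not shown here.
messages = [{"role": "user", "content": "Summarize LongCat-Flash-Omni in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
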
Key Points
- First open-source model to combine offline multimodal understanding with real-time audio-video interaction
- Lightweight audio decoder enables natural speech reconstruction
- Early-fusion training mitigates interference between modalities (see the schedule sketch after this list)
- Currently supports Chinese and English, with more languages planned for Q1 2026
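The announcement does not spell out how the progressive fusion schedule works. One common way to realize the idea, assumed in the sketch below, is a staged data mixture that starts text-heavy and gradually raises the sampling share of image, audio, and video data so each new modality is introduced without swamping the others; the stage boundaries, weights, and the `sample_modality` helper are made-up illustrative values, not Meituan's training recipe.

```python
# Illustrative staged modality-mixing schedule; all values are invented
# for the sketch and are not Meituan's actual recipe.
import random

# Sampling weights per modality at each training stage (each row sums to 1.0).
STAGE_MIXTURES = [
    {"text": 1.00, "image": 0.00, "audio": 0.00, "video": 0.00},  # text-only warm-up
    {"text": 0.70, "image": 0.20, "audio": 0.10, "video": 0.00},  # add static perception
    {"text": 0.50, "image": 0.25, "audio": 0.15, "video": 0.10},  # add temporal modalities
    {"text": 0.40, "image": 0.25, "audio": 0.20, "video": 0.15},  # balanced joint stage
]

def sample_modality(step: int, steps_per_stage: int = 10_000) -> str:
    """Pick which modality the next training batch is drawn from."""
    stage = min(step // steps_per_stage, len(STAGE_MIXTURES) - 1)
    mixture = STAGE_MIXTURES[stage]
    modalities, weights = zip(*mixture.items())
    return random.choices(modalities, weights=weights, k=1)[0]

# Example: which modality a batch comes from at a few points in training.
for step in (0, 12_000, 25_000, 40_000):
    print(step, sample_modality(step))
```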