Meituan's New AI Can Clone Voices with Stunning Accuracy
Meituan Breaks New Ground in Voice Cloning Technology
In a significant leap for audio generation, Meituan's LongCat team has open-sourced its LongCat-AudioDiT model. The model skips the intermediate representations that conventional text-to-speech systems pass through and works directly with the audio waveform, producing eerily accurate voice clones.

A Radical New Approach
Traditional voice synthesis chains together multiple processing stages, and each one can introduce errors that degrade the final audio. LongCat-AudioDiT takes a bold shortcut with just two core components:
- Wav-VAE: A waveform autoencoder that compresses audio dramatically while preserving quality. It turns a 24 kHz recording into a latent sequence of just 11.7 frames per second without losing clarity.
- Semantic-enhanced DiT: A diffusion transformer (DiT) that blends text understanding with sound generation, capturing subtle pronunciation details that are often lost on the way from text to audio.
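To put the Wav-VAE numbers in perspective, the short calculation below works out how many raw audio samples each latent frame has to summarize. It is only back-of-the-envelope arithmetic on the two figures quoted above (24 kHz and 11.7 frames per second); the function names are illustrative and are not part of the LongCat-AudioDiT code.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# Function names here are illustrative, not taken from LongCat-AudioDiT.
SAMPLE_RATE_HZ = 24_000       # raw audio samples per second
LATENT_FRAME_RATE_HZ = 11.7   # latent frames per second after Wav-VAE


def samples_per_frame(sample_rate_hz: float, frame_rate_hz: float) -> float:
    """Number of raw waveform samples summarized by one latent frame."""
    return sample_rate_hz / frame_rate_hz


def latent_frames(duration_s: float, frame_rate_hz: float = LATENT_FRAME_RATE_HZ) -> int:
    """Approximate number of latent frames for a clip of the given length."""
    return round(duration_s * frame_rate_hz)


ratio = samples_per_frame(SAMPLE_RATE_HZ, LATENT_FRAME_RATE_HZ)
print(f"samples per latent frame: {ratio:.0f}")
print(f"latent frames for a 10 s clip: {latent_frames(10.0)}")
print(f"raw samples for a 10 s clip: {10 * SAMPLE_RATE_HZ}")
```

Each latent frame stands in for a little over 2,000 waveform samples, which appears to be where the roughly 2000x compression figure comes from.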

Solving Persistent Problems
The team tackled two major voice cloning challenges head-on:
- Voice Drift Fix: Ever noticed how some AI voices seem to change character mid-sentence? The new dual-constraint mechanism is designed to put a stop to that instability.
- Natural Sound Boost: The team's adaptive projection guidance acts like an intelligent filter, keeping the parts of the signal that improve the audio while discarding the parts that make speech sound robotic.
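The article does not spell out how adaptive projection guidance works internally. One plausible reading, and it is only an assumption, is that it resembles projection-based variants of classifier-free guidance: the difference between the text-conditioned and unconditioned predictions is split into a part parallel to the conditioned prediction and a part orthogonal to it, and the parallel part is down-weighted. The sketch below illustrates that general idea on plain Python lists; `projected_guidance` and `parallel_weight` are hypothetical names, not LongCat-AudioDiT's API.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def projected_guidance(cond: Vector, uncond: Vector, scale: float,
                       parallel_weight: float = 0.0) -> Vector:
    """Illustrative projection-based guidance (an assumption, not the
    released implementation).

    Standard classifier-free guidance computes
        cond + (scale - 1) * (cond - uncond).
    Here the update (cond - uncond) is split into a component parallel
    to `cond` and a component orthogonal to it, and the parallel part
    is scaled by `parallel_weight`.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    norm_sq = dot(cond, cond)
    # Projection coefficient of diff onto cond (zero if cond is all zeros).
    coef = dot(diff, cond) / norm_sq if norm_sq > 0.0 else 0.0
    parallel = [coef * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [
        c + (scale - 1.0) * (o + parallel_weight * p)
        for c, o, p in zip(cond, orthogonal, parallel)
    ]


cond = [1.0, 0.0]
uncond = [0.5, -1.0]
print(projected_guidance(cond, uncond, scale=3.0, parallel_weight=0.0))
print(projected_guidance(cond, uncond, scale=3.0, parallel_weight=1.0))
```

With `parallel_weight=1.0` the function reduces to ordinary classifier-free guidance; with `parallel_weight=0.0` only the orthogonal part of the update is kept. In the "intelligent filter" picture, that is one way to hold on to the useful part of the guidance signal while dropping the part that mostly inflates it.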

Performance That Speaks for Itself
Independent tests show LongCat-AudioDiT setting new standards:
- Achieves high voice similarity scores (0.818 for Chinese, 0.797 for challenging sentences)
- Keeps the word error rate in English at just 1.5%
- Outperforms established models like Seed-TTS and CosyVoice3.5
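For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed. Word error rate is the word-level edit distance between a reference transcript and what a speech recognizer hears in the generated audio, divided by the number of reference words. Similarity is assumed here to be the cosine similarity between speaker embedding vectors, which is the usual convention but is not confirmed by the article; the example embeddings are made-up values, and none of this is the evaluation code behind the reported numbers.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.3f}")

# Made-up three-dimensional "speaker embeddings" for illustration only.
sim = cosine_similarity([0.9, 0.1, 0.4], [0.8, 0.2, 0.5])
print(f"similarity: {sim:.3f}")
```

Under these definitions, a 1.5% word error rate means roughly one or two misrecognized words per hundred, and a similarity of 1.0 would mean the cloned and original voices map to embeddings pointing in exactly the same direction.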
The real kicker? It does all this using simpler training methods than competitors, proving that sometimes less really is more.
The technology is now available to developers worldwide through GitHub and HuggingFace.

Key Points:
- Direct waveform modeling eliminates quality loss from intermediate steps
- Roughly 2000x compression (24 kHz audio down to 11.7 latent frames per second) while maintaining audio fidelity
- Top-tier performance in both Chinese and English voice cloning
- Open-source availability encourages community development and innovation

