Li Mu's Team Unveils Higgs Audio v2, Revolutionizing Speech Synthesis

Renowned AI entrepreneur Li Mu and his team at Boson.ai have released Higgs Audio v2, an open-source text-to-speech (TTS) model. The release is a notable step forward for speech synthesis, with capabilities that include multilingual dialogue generation, automatic prosody (rhythm) adjustment, and voice cloning.

Multimodal Capabilities

Higgs Audio v2 stands out for its multimodal functionality. Unlike traditional TTS systems, it takes the context of the text into account when generating speech. For example, according to the team, it can compose a song, sing it in a specific voice, and add background music, which goes well beyond what conventional TTS systems attempt.

Performance Benchmarks

The model was trained on 10 million hours of speech data. On the EmergentTTS-Eval benchmark, Higgs Audio v2 achieved win rates of 75.7% and 55.7% against GPT-4o-mini-tts in the "emotion" and "question" categories, respectively. The team also reports state-of-the-art results on traditional TTS benchmarks.
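
To make the benchmark percentages concrete, the sketch below shows how a win rate over head-to-head comparisons can be computed. It assumes each comparison ends in a win, a tie, or a loss for the model under test, and that a tie counts as half a win; that scoring rule and the function name `win_rate` are assumptions for illustration, not details taken from the article. The counts in the usage example are invented solely to show the arithmetic.

```python
from collections import Counter


def win_rate(outcomes):
    """Return the win rate for a list of pairwise outcomes.

    Each outcome is "win", "tie", or "loss" from the point of view of the
    model under test. Ties are scored as half a win (an assumed convention).
    """
    if not outcomes:
        raise ValueError("at least one comparison is required")
    counts = Counter(outcomes)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    return (counts["win"] + 0.5 * counts["tie"]) / len(outcomes)


if __name__ == "__main__":
    # Invented counts, chosen only to illustrate the calculation.
    sample = ["win"] * 700 + ["tie"] * 114 + ["loss"] * 186
    print(f"{win_rate(sample):.1%}")
```

Under this convention, a win rate above 50% means the model was preferred more often than its opponent, and exactly 50% means the two were judged equal overall.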

Technical Innovations

Higgs Audio v2 employs advanced data processing techniques. A discrete audio tokenizer converts audio into sequences of tokens at 25 frames per second, capturing both semantic and acoustic features. The model is built on a pre-trained large language model, which strengthens its language comprehension and contextual understanding. It also supports zero-shot voice cloning, reproducing a new voice from a short reference sample without additional training.
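
As a rough illustration of what a 25 frames-per-second tokenizer implies, the sketch below works out how long each frame is and how many frames a clip of a given length produces. Only the frame rate comes from the article; the helper names and the choice to round a partial final frame up are assumptions. The number of tokens carried per frame is not stated in the article, so the sketch counts frames only.

```python
import math

FRAME_RATE = 25  # frames per second, as stated in the article


def frame_duration_ms(frame_rate=FRAME_RATE):
    """Length of audio covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / frame_rate


def frames_for_duration(seconds, frame_rate=FRAME_RATE):
    """Number of frames needed to cover a clip of the given length.

    A partial final frame is rounded up (an assumption about padding).
    """
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    return math.ceil(seconds * frame_rate)


if __name__ == "__main__":
    print(f"one frame covers {frame_duration_ms():.0f} ms")
    for secs in (1, 10, 60):
        print(f"{secs:>3} s of audio -> {frames_for_duration(secs)} frames")
```

At this rate, one minute of audio corresponds to 1,500 frames, which keeps sequences short enough for a language model to handle comfortably.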

Practical Applications

The model excels in real-world scenarios:

  • Real-time voice chat: Suited to virtual streamers and voice assistants thanks to low latency and emotional expressiveness.
  • Audio content creation: Generates natural dialogues for audiobooks, interactive training, and dynamic storytelling.
  • Voice cloning: Replicates specific voices, opening doors for entertainment and creative industries.

The code is open source and available on GitHub and Hugging Face. The model can be installed locally with a GPU-enabled PyTorch setup or run through Docker.

Key Points:

  • Higgs Audio v2 introduces multimodal TTS with voice cloning and rhythm adjustment.
  • Trained on 10 million hours of data, it outperforms competitors in key benchmarks.
  • Advanced tokenization and pre-trained models ensure high accuracy and adaptability.
  • Open-source availability fosters innovation in real-time chat and content creation.