Alibaba Tongyi Unveils Qwen3-ASR-Toolkit for Advanced Transcription
Alibaba's Tongyi Qwen team has released Qwen3-ASR-Toolkit, an open-source Python command-line tool for transcribing audio and video. It works around the three-minute limit of the Qwen3-ASR-Flash API, making it possible to transcribe recordings that run for hours.

Enhanced Capabilities
The toolkit uses Voice Activity Detection (VAD) to keep sentences intact during transcription rather than cutting them mid-way. It automatically resamples audio to 16 kHz mono for processing and supports multi-threaded parallel uploads, which significantly reduces processing time.
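The chunk-and-parallelize idea can be illustrated with a minimal sketch. This is not the toolkit's actual code: a simple energy threshold stands in for a real VAD model, the 180-second cap is assumed to mirror the three-minute API limit, and `transcribe_chunk` is a hypothetical placeholder for whatever function sends one chunk to the API.

```python
from concurrent.futures import ThreadPoolExecutor

SAMPLE_RATE = 16_000      # 16 kHz mono, matching the resampling target
MAX_CHUNK_SECONDS = 180   # assumption: mirrors the three-minute API limit


def find_split_points(samples, frame_len=320, silence_threshold=0.01,
                      max_chunk=MAX_CHUNK_SECONDS * SAMPLE_RATE):
    """Return sample indices at which to cut, preferring silent frames.

    A frame counts as silent when its mean squared amplitude is below
    silence_threshold (a crude stand-in for a real VAD). max_chunk is
    assumed to be a multiple of frame_len.
    """
    cuts = []
    chunk_start = 0
    last_silence = None
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        end = min(start + frame_len, len(samples))
        energy = sum(x * x for x in frame) / len(frame)
        if energy < silence_threshold:
            last_silence = start
        if end - chunk_start >= max_chunk:
            # Cut at the most recent silent frame in this chunk if there
            # is one; otherwise fall back to a hard cut at the limit.
            if last_silence is not None and last_silence > chunk_start:
                cut = last_silence
            else:
                cut = end
            cuts.append(cut)
            chunk_start = cut
            last_silence = None
    return cuts


def split_chunks(samples, cuts):
    """Slice samples at the given cut indices, dropping empty pieces."""
    bounds = [0, *cuts, len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def transcribe_all(chunks, transcribe_chunk, workers=4):
    """Run the hypothetical transcribe_chunk on every chunk in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map() yields results in input order, so the text
        # pieces are joined in the order the chunks were spoken.
        return " ".join(pool.map(transcribe_chunk, chunks))
```

Threads suit this workload because each call mostly waits on the network, so several uploads can be in flight at once.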
Broad Format Support
Built on FFmpeg, the toolkit supports nearly all mainstream audio and video formats, including:
- MP4, MOV, MKV (video)
- MP3, WAV, M4A (audio)
This flexibility eliminates compatibility concerns for users.
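For readers who want to reproduce the normalization step by hand, the sketch below builds a standard FFmpeg invocation that extracts the audio track from any supported file and converts it to 16 kHz mono. The helper names `build_ffmpeg_command` and `convert` are hypothetical, and the toolkit may call FFmpeg differently; only the FFmpeg flags themselves are standard.

```python
import subprocess


def build_ffmpeg_command(src, dst):
    """Build an FFmpeg argument list that writes 16 kHz mono audio."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it exists
        "-i", src,       # input: any container FFmpeg can read
        "-vn",           # ignore the video stream, keep audio only
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz sample rate
        dst,
    ]


def convert(src, dst):
    """Run FFmpeg; requires the ffmpeg binary to be on PATH."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Calling `convert("talk.mp4", "talk.wav")` would produce a WAV file with the audio track in the target format, assuming FFmpeg is installed.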
Powered by Qwen3-ASR-Flash
The underlying Qwen3-ASR-Flash model was trained on:
- Massive multimodal datasets
- Tens of millions of hours of ASR data
This foundation delivers industry-leading speech recognition accuracy.
The toolkit is available on GitHub: Qwen3-ASR-Toolkit
Key Points:
📌 Lifts the previous 3-minute limit, enabling transcription of hours-long content
🎤 Utilizes advanced VAD technology for accurate sentence segmentation
💻 Supports parallel processing for faster turnaround times
🔊 Compatible with virtually all major audio/video formats




