Qwen Lab Unveils OmniAudio: Spatial Sound from 360° Videos

Qwen Lab has introduced OmniAudio, a system that generates spatial audio from 360-degree videos. The goal is to improve virtual reality experiences by producing soundscapes in which each sound comes from the direction of its source in the panoramic visuals.

Breaking New Ground in Spatial Audio

Traditional audio generation techniques have struggled to keep pace with advances in 360° video technology. Most existing solutions produce flat, non-spatial sound that fails to capture the directional richness of panoramic content. OmniAudio instead outputs First-order Ambisonics (FOA), a four-channel format in which W carries the omnidirectional component and X, Y, and Z carry the directional components along three axes, so the direction of each sound is encoded in 3D.
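To make the four channels concrete, here is a minimal sketch of how a single mono sample from a known direction can be encoded into FOA. It is an illustration only, not OmniAudio's code: the function name `encode_foa` is hypothetical, azimuth is assumed to be measured counter-clockwise from straight ahead, and W is given unit gain (some conventions scale W by 1/√2 instead).

```python
import math

def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into (W, X, Y, Z) for a source in a given direction.

    Assumptions (not from the article): azimuth is counter-clockwise from the
    front, elevation is upward from the horizon, and W has unit gain.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                                # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)  # front-back axis
    y = sample * math.sin(az) * math.cos(el)  # left-right axis
    z = sample * math.sin(el)                 # up-down axis
    return (w, x, y, z)

front = encode_foa(1.0, 0.0, 0.0)   # source straight ahead: energy in W and X
left = encode_foa(1.0, 90.0, 0.0)   # source to the left: energy in W and Y
```

A source straight ahead lands almost entirely in W and X, while one to the left lands in W and Y, which is how the format carries direction without storing one channel per loudspeaker.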

The Challenge of Data Scarcity

One major obstacle in developing this technology was the lack of paired datasets containing both 360° video and corresponding spatial audio. Qwen Lab's research team addressed this by creating the Sphere360 dataset, an extensive collection featuring:

Over 103,000 real-world video clips
288 distinct audio event categories
288 hours of total content

The dataset underwent rigorous quality control measures to ensure precise alignment between visual and audio components.

How OmniAudio Works

The system is trained in two stages:

Self-supervised pretraining: Leverages large-scale non-spatial audio resources converted into "pseudo-FOA" format using advanced encoding techniques.
Supervised fine-tuning: Combines dual-branch video representation with masked flow-matching to refine spatial accuracy.

This hybrid approach allows the model to first learn general audio patterns before specializing in precise directional sound reproduction.
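The article does not spell out the training objective, so the following is only a sketch of one common way masked flow matching is formulated: a straight-line path from noise to data, with the loss computed only on masked positions. The helper names (`make_training_example`, `masked_mse`) and the choice to pass clean values at unmasked positions are assumptions for illustration, not the authors' implementation.

```python
import random

def make_training_example(x1, mask, t, rng):
    """Build one flow-matching training example on a straight noise-to-data path.

    x1:   target latent values (list of floats)
    mask: 1 where the model must generate, 0 where clean context is given
    t:    time in [0, 1]; t=0 is pure noise, t=1 is the data
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # noise sample
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the path
    target_v = [b - a for a, b in zip(x0, x1)]             # constant velocity of the path
    # Unmasked positions keep the clean data so the model can use them as context.
    model_input = [xt if m else b for xt, b, m in zip(x_t, x1, mask)]
    return model_input, target_v

def masked_mse(pred_v, target_v, mask):
    """Mean squared error over masked positions only."""
    total = sum(m * (p - q) ** 2 for p, q, m in zip(pred_v, target_v, mask))
    count = sum(mask)
    return total / count if count else 0.0

rng = random.Random(0)
model_input, target_v = make_training_example([1.0, 2.0, 3.0, 4.0], [0, 1, 1, 0], 0.5, rng)
```

In an actual system the predicted velocity would come from a network conditioned on the video features; here only the data preparation and the loss are shown.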

Performance That Speaks Volumes

Testing on benchmark datasets yielded impressive results:

Significantly lower (better) FD, KL, and ΔAngular scores on the YT360-Test set
Superior performance on Sphere360-Bench evaluations
High scores in human assessments for spatial quality and visual-audio alignment
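The article does not define ΔAngular, but an angular metric of this kind needs a direction estimate from the FOA signal. The sketch below shows one standard way to obtain it: average the product of W with X, Y, and Z over time to get a direction vector, then measure the angle between the directions of two clips. The function names are hypothetical, and this is not claimed to be the benchmark's exact procedure.

```python
import math

def estimate_direction(w, x, y, z):
    """Estimate a unit direction vector from time-summed W*X, W*Y, W*Z."""
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    iz = sum(a * b for a, b in zip(w, z))
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    if norm == 0.0:
        raise ValueError("signal has no directional energy")
    return (ix / norm, iy / norm, iz / norm)

def angular_error_deg(d1, d2):
    """Angle in degrees between two unit direction vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside acos's domain
    return math.degrees(math.acos(dot))

signal = [1.0, -1.0, 0.5]
silence = [0.0, 0.0, 0.0]
reference = estimate_direction(signal, signal, silence, silence)  # source in front
generated = estimate_direction(signal, silence, signal, silence)  # source to the left
error = angular_error_deg(reference, generated)
```

A generated clip that places a frontal source to the side is off by a quarter turn, and a lower angular error means the generated sound field points in nearly the same direction as the reference.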

The system particularly excels at maintaining accurate sound positioning during head movements, a crucial factor for VR applications.
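Head tracking is one reason FOA suits VR: when the listener turns, the player can rotate the sound field rather than regenerate the audio. As a minimal sketch (the function name `compensate_head_yaw` is hypothetical, and the axis convention of X forward, Y left, Z up with positive yaw as a left turn is an assumption), a turn about the vertical axis only mixes X and Y:

```python
import math

def compensate_head_yaw(foa, yaw_deg):
    """Rotate one FOA frame (W, X, Y, Z) so sources stay fixed in the world
    while the listener's head turns left by yaw_deg degrees."""
    w, x, y, z = foa
    theta = math.radians(yaw_deg)
    c, s = math.cos(theta), math.sin(theta)
    # Turning the head left by theta moves every source right by theta
    # relative to the listener, which is a rotation of (X, Y) by -theta.
    x_rot = x * c + y * s
    y_rot = -x * s + y * c
    return (w, x_rot, y_rot, z)  # W and Z are unaffected by a yaw turn

# A source straight ahead ends up on the listener's right after a 90° left turn.
turned = compensate_head_yaw((1.0, 1.0, 0.0, 0.0), 90.0)
```

The omnidirectional W channel and the vertical Z channel pass through unchanged, so yaw compensation costs only a few multiplications per sample.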

Resources for Developers

The research team has made its work publicly accessible.

Key Points

OmniAudio generates realistic spatial audio from panoramic videos using FOA format.
The Sphere360 dataset addresses critical data scarcity in this emerging field.
Two-stage training combines self-supervised learning with supervised refinement.
Testing shows significant improvements over existing solutions in both objective metrics and human evaluation.