BAAI Unveils See3D: A Breakthrough in 3D Video Learning

The Beijing Academy of Artificial Intelligence (BAAI) has announced the launch of See3D, an innovative 3D generation model designed to learn from large-scale unlabeled internet videos. This technological advancement aligns with the concept of "See Video, Get 3D" and represents a significant step forward in the field of 3D learning and generation.

Technical Innovations of See3D

See3D distinguishes itself by not relying on traditional camera parameters. Instead, it utilizes visual conditioning techniques to generate camera-direction controllable and geometrically consistent multi-view images based solely on visual cues obtained from videos. This approach eliminates the necessity for costly 3D or camera annotations, streamlining the process of learning 3D priors from abundant internet video data.

The model supports various forms of generation including:

Text-to-3D generation
Single view to 3D
Sparse views to 3D

Additionally, it is capable of performing 3D editing and Gaussian rendering. BAAI has made the model, code, and a demo available as open-source resources, facilitating broader technical reference and experimentation.

Demonstrations of See3D's capabilities include:

Unlocking 3D interactive worlds
3D reconstruction based on sparse images
Open-world 3D generation
3D generation from single views

These features highlight the extensive applicability of See3D in various creative 3D applications, enabling users to engage with 3D environments more dynamically.

Motivation Behind the Development

The impetus for developing See3D arises from the challenges associated with traditional 3D data collection methods, which are often time-consuming and expensive. In contrast, videos provide a wealth of multi-view correlations and camera motion information, making them valuable for revealing intricate 3D structures.

The See3D team has constructed a large dataset for this purpose, named WebVi3D, comprising 16 million video clips and 320 million frames. Rather than relying on camera parameters, the model is conditioned on a purely 2D visual signal, which is produced by adding time-dependent noise to masked video data. This method supports scalable training of multi-view diffusion models and achieves 3D generation without camera conditions.
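The idea of adding time-dependent noise to masked video data can be illustrated with a minimal sketch. This is not the See3D implementation: the function name `make_visual_condition`, the linear noise schedule, the blending rule, and the rectangular mask are all assumptions chosen only to make the concept concrete.

```python
import numpy as np

def make_visual_condition(video, mask, t, num_steps=1000, rng=None):
    """Build a noisy, masked 2D conditioning signal (illustrative only).

    video: float array (frames, height, width, channels), values in [0, 1]
    mask:  array (frames, height, width, 1); 1 keeps a pixel, 0 hides it
    t:     integer diffusion timestep in [0, num_steps)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumed linear schedule: the noise level grows with the timestep.
    sigma = t / (num_steps - 1)
    masked = video * mask                     # zero out the hidden pixels
    noise = rng.standard_normal(video.shape)  # Gaussian noise per pixel
    # Blend masked content with noise; at t=0 this is just the masked video.
    return (1.0 - sigma) * masked + sigma * noise


# Toy clip: 4 frames of 8x8 RGB with random content.
video = np.random.default_rng(0).random((4, 8, 8, 3))
mask = np.ones((4, 8, 8, 1))
mask[1:, 2:6, 2:6, :] = 0.0  # hide a patch in every frame except the first

cond_early = make_visual_condition(video, mask, t=0, rng=np.random.default_rng(1))
cond_late = make_visual_condition(video, mask, t=999, rng=np.random.default_rng(1))
```

In this sketch, the condition at a small timestep is close to the masked video, while at a large timestep it is dominated by noise, so the signal carries less and less of the original pixel content as `t` grows. No camera pose appears anywhere in the inputs, which is the point the article makes about conditioning on visual cues alone.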

Key Advantages of See3D

See3D offers several key advantages:

Data Scalability: Sourced from a vast array of internet videos, the training data significantly enhances the scale of the constructed multi-view dataset.
Camera Controllability: The model supports scene generation under complex camera trajectories, ensuring geometric consistency across frames.
Geometric Consistency: The model maintains geometric integrity when generating multi-view images, which is crucial for realistic 3D representations.

By expanding the scale of available datasets, See3D aims to provide new insights and methodologies for advancing 3D generation technology. The research team hopes this initiative will encourage the 3D research community to pay more attention to large-scale video data that lacks camera annotations, lowering the cost of 3D data collection and narrowing the gap with existing closed-source 3D solutions.

Project Address: See3D Project

Key Points

See3D learns to generate 3D content from unlabeled internet video data.
The model eliminates the need for traditional camera parameters.
It supports multiple forms of 3D generation and editing.
The initiative aims to reduce costs in 3D data collection and promote research in the field.