Kuaishou and Shanghai Jiao Tong University Launch Orthus: A New Era in Multimodal AI

At the recent International Conference on Machine Learning (ICML), Kuaishou and Shanghai Jiao Tong University unveiled Orthus, a multimodal model that handles both generation and understanding. Built on an autoregressive Transformer architecture, Orthus converts between text and images efficiently and is now available as an open-source project.

Unmatched Computational Efficiency

Orthus stands out for its computational efficiency and learning capability. According to the research, even with minimal computational resources, Orthus outperforms existing hybrid models such as Chameleon and Show-o on multiple image understanding metrics. On GenEval, a benchmark for text-to-image generation, it also surpasses SDXL, a diffusion model designed specifically for that task.

Innovative Architectural Design

The architecture uses an autoregressive Transformer as its backbone network, with separate modality-specific generation heads for text and for images. This decouples the modeling of image details from the expression of text features, which leaves the backbone free to focus on capturing the complex relationships between text and images.
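The shared-backbone, two-head layout described above can be illustrated with a small toy sketch. This is not the Orthus implementation: the layer sizes, the random weights, the single `tanh` layer standing in for the Transformer, and the plain linear image head are all assumptions made only to show how one hidden state can feed two different output heads.

```python
import math
import random

random.seed(0)

# Toy sizes, chosen only for illustration (not taken from Orthus).
HIDDEN, VOCAB, IMG_DIM = 8, 5, 4


def rand_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


W_BACKBONE = rand_matrix(HIDDEN, HIDDEN)
W_TEXT_HEAD = rand_matrix(VOCAB, HIDDEN)
W_IMAGE_HEAD = rand_matrix(IMG_DIM, HIDDEN)


def backbone(x):
    """Hypothetical stand-in for the shared autoregressive Transformer."""
    return [math.tanh(h) for h in matvec(W_BACKBONE, x)]


def text_head(hidden):
    """Text head: a probability distribution over discrete tokens."""
    return softmax(matvec(W_TEXT_HEAD, hidden))


def image_head(hidden):
    """Image head: a continuous feature vector (assumed linear here)."""
    return matvec(W_IMAGE_HEAD, hidden)


hidden = backbone([1.0] * HIDDEN)
text_probs = text_head(hidden)      # discrete output path
image_feature = image_head(hidden)  # continuous output path
```

Both heads read the same hidden state, so everything specific to a modality's output format lives in the head rather than in the backbone.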

Core Components

Orthus comprises several key components:

A text tokenizer
A visual autoencoder
Two modality-specific embedding modules

These components map text and image features into a unified representation space, which makes it easier for the backbone network to model dependencies between modalities. During inference, the model autoregressively generates either the next text token or the next image feature, with specific markers indicating which modality comes next, so it can switch flexibly between text and image output.
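A rough sketch of marker-driven decoding follows. The marker names (`<boi>`, `<eoi>`, `<eos>`), the fixed number of features per image, and the scripted stand-ins for the two heads are hypothetical. They only illustrate how a single loop can alternate between emitting discrete text tokens and continuous image features.

```python
import random

# Hypothetical marker names; the real model's special tokens may differ.
BOI, EOI, EOS = "<boi>", "<eoi>", "<eos>"
FEATURES_PER_IMAGE = 3  # toy value, assumed fixed per image
FEATURE_DIM = 2         # toy value


def next_image_feature(rng):
    """Stub image head: returns one continuous feature vector."""
    return [rng.random() for _ in range(FEATURE_DIM)]


def generate(prompt, max_steps=20):
    """Autoregressive loop that switches heads when a marker appears."""
    script = iter(["a", "cat", BOI])  # scripted stub for the text head
    rng = random.Random(0)
    sequence = list(prompt)
    image_mode = False
    pending = 0
    for _ in range(max_steps):
        if image_mode:
            # Image mode: append continuous features, not discrete tokens.
            sequence.append(next_image_feature(rng))
            pending -= 1
            if pending == 0:
                sequence.append(EOI)  # close the image, return to text mode
                image_mode = False
        else:
            token = next(script, EOS)
            sequence.append(token)
            if token == EOS:
                break
            if token == BOI:
                # Begin-of-image marker: route the next steps to the image head.
                image_mode = True
                pending = FEATURES_PER_IMAGE
    return sequence


output = generate(["draw:"])
```

Here `output` holds the prompt, two text tokens, the begin-of-image marker, three feature vectors, the end-of-image marker, and the end-of-sequence token, in that order.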

Practical Applications

Beyond text-image interaction, Orthus shows significant potential in:

Image editing
Web page generation
Other multimodal applications

The model's design avoids the divergence between end-to-end diffusion modeling and autoregressive mechanisms, and it minimizes the information loss caused by image discretization. It can be seen as an extension of Kaiming He's MAR work into the multimodal domain.

Industry Impact

The collaboration between Kuaishou and Shanghai Jiao Tong University opens new possibilities for multimodal generation models, and it is likely to attract attention from both academia and industry.

Key Points:

Orthus is a multimodal AI model developed by Kuaishou and Shanghai Jiao Tong University.
It features an autoregressive Transformer architecture with specialized modality heads.
With minimal computational resources, the model outperforms existing hybrid models such as Chameleon and Show-o on image understanding metrics.
Orthus is now open-sourced, making it accessible to researchers and developers worldwide.
Its applications extend beyond text-image conversion to include image editing and web page generation.
