
Peking University Unveils LLaVA-o1: A New Multimodal AI Model


A research team from Peking University recently announced LLaVA-o1, an open-source multimodal model that the team describes as the first visual language model capable of spontaneous, systematic reasoning in the style of GPT-o1.

The model has demonstrated exceptional performance across six challenging multimodal benchmark tests. Its version with 11 billion parameters has outperformed notable competitors, including Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.


Features and Capabilities

LLaVA-o1 is built on the Llama-3.2-Vision model and employs a unique "slow thinking" reasoning mechanism. This allows it to engage in more intricate reasoning processes autonomously, a significant advancement over traditional chain-of-thought prompting methods.

In multimodal reasoning benchmark evaluations, LLaVA-o1 outperformed its base model by 8.9%. The model's reasoning process is structured into four distinct stages: summarization, visual interpretation, logical reasoning, and conclusion generation. Traditional models often exhibit a relatively simplistic reasoning process, which can lead to incorrect conclusions. In contrast, LLaVA-o1's multi-step reasoning framework enhances the accuracy of its outputs.

For example, when addressing the question, "How many objects are left after removing all the small bright balls and purple objects?", LLaVA-o1 begins by summarizing the question, extracting relevant information from the accompanying image, and then conducting a detailed, step-by-step reasoning process to arrive at the correct answer. This phased approach significantly bolsters the model's systematic reasoning capabilities, increasing its efficiency in tackling complex problems.
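The staged pipeline described above can be sketched as a simple driver loop that queries the model once per stage and feeds each stage's output back as context for the next. This is an illustrative sketch only, not the team's actual implementation; `generate`, `answer_in_stages`, and `toy_generate` are hypothetical names standing in for whatever inference API the model exposes.

```python
# Four reasoning stages, as described for LLaVA-o1.
STAGES = ["summarization", "visual interpretation",
          "logical reasoning", "conclusion"]

def answer_in_stages(generate, question, image):
    """Run the model once per stage, feeding earlier stages back as context."""
    context = f"Question: {question}"
    outputs = {}
    for stage in STAGES:
        prompt = (f"{context}\n\nNow perform the '{stage}' step "
                  f"and output only that step.")
        outputs[stage] = generate(prompt, image)
        context += f"\n[{stage}] {outputs[stage]}"
    # The final stage's output is the answer shown to the user.
    return outputs["conclusion"], outputs

# Toy stand-in model so the sketch runs end to end.
def toy_generate(prompt, image):
    return f"(response to: {prompt.splitlines()[-1][:30]}...)"
```

The key design point is that each stage sees everything produced so far, so the conclusion is conditioned on the summary, the visual interpretation, and the reasoning chain rather than generated in one shot.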

Innovation in Reasoning

Notably, LLaVA-o1 integrates a stage-wise beam search method throughout its reasoning phases. This innovative approach enables the model to generate multiple candidate answers at each stage of reasoning and select the optimal response to progress to the next stage, markedly enhancing the overall quality of its reasoning. Through methodical fine-tuning and the use of appropriate training data, LLaVA-o1 has shown remarkable performance compared to larger or closed-source models.
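A stage-wise beam search of this kind can be sketched as: at each stage, sample several candidate continuations, score them, and carry only the best one forward. This is a minimal hypothetical sketch, not the paper's implementation; `sample` and `score` are placeholders for the model's sampling and (self-)evaluation calls, and the toy stand-ins below exist only to make the sketch runnable.

```python
def stagewise_beam_search(sample, score, stages, context, n_candidates=3):
    """Keep only the highest-scoring candidate at each reasoning stage."""
    for stage in stages:
        # Generate several candidates for this stage...
        candidates = [sample(context, stage, i) for i in range(n_candidates)]
        # ...then keep the best-scoring one and append it to the context.
        best = max(candidates, key=lambda c: score(context, c))
        context += "\n" + best
    return context

# Toy stand-ins: three candidates per stage, scored by length.
def toy_sample(context, stage, i):
    return f"[{stage}] candidate-{i}" + "!" * i

def toy_score(context, candidate):
    return len(candidate)
```

Because pruning happens per stage rather than per token, the search space stays small while each intermediate step is still quality-checked before the model commits to it.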

The research accomplishments of the Peking University team are poised to advance the field of multimodal artificial intelligence. They introduce fresh ideas and methodologies for future visual language understanding models. The team has pledged to fully open-source the code, pre-trained weights, and datasets associated with LLaVA-o1, encouraging further exploration and application by researchers and developers in the AI community.

For more detail, see the team's research paper and the project's source code on GitHub.

Key Points

  1. LLaVA-o1 is a new multimodal reasoning model released by a research team at Peking University, featuring "slow thinking" reasoning capabilities.
  2. This model outperforms its base model by 8.9% in multimodal reasoning benchmark tests.
  3. LLaVA-o1 ensures accuracy through structured multi-step reasoning and will be open-sourced soon.

