Small AI Models Surpass Larger Ones with New Training Method
As the race toward ever-larger AI models makes computing power prohibitively expensive, a technique called "on-policy distillation" is changing the game. Developed at Thinking Machines Lab, the startup led by former OpenAI CTO Mira Murati, the method lets smaller models reach performance levels previously reserved for much larger systems, at a fraction of the cost.
Efficiency Breakthrough: 8B Model Matches 32B Performance
Recent research shows that an 8 billion-parameter model trained with on-policy distillation can reach roughly 70% of the performance of a 32 billion-parameter model, while training cost falls by about 90% and compute efficiency improves by 50 to 100 times. This development could democratize AI development, enabling small and medium-sized enterprises as well as individual developers to train competitive specialized models.

How It Works: Real-Time Feedback Revolutionizes Training
The key innovation lies in a "dense feedback per token" mechanism. Unlike traditional reinforcement learning (RL), which provides sparse rewards at the end of each episode, on-policy distillation allows the teacher model to provide real-time scores for every token generated by the student model. This continuous guidance:
- Accelerates convergence
- Prevents "policy drift" during long sequence training
- Ensures consistent high-quality output from smaller models
In practical tests, the Qwen3-8B model reached 70% accuracy on math reasoning tasks after just 150 training steps, whereas traditional RL methods reportedly required about 17,920 GPU hours to reach similar results.
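The sketch below illustrates the idea of dense per-token feedback, using the reverse KL divergence between the student's and teacher's next-token distributions as the per-token score. It is a minimal PyTorch illustration, not code from the Thinking Machines release; the tensor names and the choice to score the full distribution at every position are assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

def per_token_reverse_kl(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """Reverse KL D_KL(student || teacher), computed separately for each token.

    Unlike a sparse end-of-episode reward, every position in the student's
    generated sequence receives its own score, giving dense feedback.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)   # [seq_len, vocab]
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)   # [seq_len, vocab]
    # Expectation under the student's own distribution at each position.
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)

# Toy example: a 5-token continuation over a 100-word vocabulary.
seq_len, vocab_size = 5, 100
feedback = per_token_reverse_kl(torch.randn(seq_len, vocab_size),
                                torch.randn(seq_len, vocab_size))
print(feedback.shape)  # torch.Size([5]) -- one score per generated token
```

Minimizing this quantity pushes the student's distribution toward the teacher's at every position the student actually visits, which is what gives each token its own training signal.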
Solving Catastrophic Forgetting: Retaining Knowledge While Learning New Skills
One persistent challenge in AI has been "catastrophic forgetting"—where models lose previously learned abilities when acquiring new knowledge. Traditional fine-tuning might see instruction-following ability drop from 85% to 45% when incorporating new documentation.
On-policy distillation addresses this through:
- Real-time trajectory sampling
- Gradual teacher correction
The method retains 41% of new knowledge while quickly restoring original capabilities to 83%, significantly outperforming conventional approaches.
Implementation: Simple Four-Step Process
The method's lightweight architecture requires only four repeating steps:
1. Deploy a teacher model (e.g., 32B) as the supervision source
2. Have the student model generate response trajectories
3. Have the teacher compute a log probability for each generated token
4. Optimize the student's parameters using the reverse Kullback-Leibler divergence
The system works with existing distillation frameworks without complex infrastructure, enabling what researchers call a "cost-effective and accurate" performance leap.
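The four steps can be strung together into a training loop along the following lines. This sketch uses deliberately tiny toy language models so that it runs end to end; the model sizes, module names, and hyperparameters are illustrative assumptions, not the published setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, GEN_LEN = 50, 16

class ToyLM(nn.Module):
    """A stand-in language model: embedding + linear head over a tiny vocabulary."""
    def __init__(self, dim: int):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):                 # tokens: [batch, seq]
        return self.head(self.embed(tokens))   # logits: [batch, seq, vocab]

teacher = ToyLM(dim=64).eval()                 # Step 1: deploy the (frozen) teacher
student = ToyLM(dim=16)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

prompt = torch.randint(0, VOCAB, (1, 4))
for _ in range(10):
    # Step 2: the student generates a response trajectory from the prompt.
    tokens = prompt.clone()
    with torch.no_grad():
        for _ in range(GEN_LEN):
            next_logits = student(tokens)[:, -1, :]
            next_token = torch.multinomial(F.softmax(next_logits, dim=-1), 1)
            tokens = torch.cat([tokens, next_token], dim=1)

    # Step 3: the teacher scores the trajectory with per-token log probabilities.
    with torch.no_grad():
        teacher_logp = F.log_softmax(teacher(tokens), dim=-1)

    # Step 4: minimize the per-token reverse KL (student || teacher) and update.
    student_logp = F.log_softmax(student(tokens), dim=-1)
    loss = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final reverse KL: {loss.item():.4f}")
```

Because the loss is computed on sequences the student itself produced, the feedback stays on-policy: the teacher only ever grades behaviour the student actually exhibits.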
Implications for AI Democratization
Murati's approach represents what industry observers describe as beating scale with smarter training: improving the training method rather than simply adding parameters. This has significant implications:
- Makes high-performance AI accessible on mobile and IoT devices
- Reduces reliance on cloud-based "AI monopolies"
- Enables continuous model evolution without capability loss
The technology is particularly promising for enterprise applications where models need to dynamically learn business rules without sacrificing core functionality like basic conversation and tool calling.
Key Points:
- 90% cost reduction in AI training
- Small (8B) models achieve 70% of the performance of large (32B) models
- Solves catastrophic forgetting while adding new knowledge
- Simple implementation compatible with existing frameworks
- Potential to democratize AI development across industries