Small AI Models Surpass Larger Ones with New Training Method
As the race toward ever-larger AI models makes computing power prohibitively expensive, a technique called "on-policy distillation" is changing the game. Developed at Thinking Machines Lab, the startup led by former OpenAI CTO Mira Murati, the method lets smaller models reach performance levels previously reserved for much larger systems, at a fraction of the cost.
Efficiency Breakthrough: 8B Model Matches 32B Performance
Recent research shows that an 8 billion-parameter model trained with on-policy distillation can reach roughly 70% of the performance of a 32 billion-parameter model, while training cost falls by about 90% and compute efficiency improves by 50 to 100 times. This development could democratize AI development, enabling small and medium-sized enterprises as well as individual developers to train competitive specialized models.

How It Works: Real-Time Feedback Revolutionizes Training
The key innovation lies in a "dense feedback per token" mechanism. Unlike traditional reinforcement learning (RL), which provides sparse rewards at the end of each episode, on-policy distillation allows the teacher model to provide real-time scores for every token generated by the student model. This continuous guidance:
- Accelerates convergence
- Prevents "policy drift" during long sequence training
- Ensures consistent high-quality output from smaller models
In practical tests, the Qwen3-8B model reached 70% accuracy on math reasoning tasks after just 150 training steps, whereas traditional RL methods reportedly required about 17,920 GPU hours to reach similar results.
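The sketch below illustrates the idea of dense per-token feedback, using the reverse KL divergence between the student's and teacher's next-token distributions as the per-token score. It is a minimal PyTorch illustration, not code from the Thinking Machines release; the tensor names and the choice to score the full distribution at every position are assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

def per_token_reverse_kl(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """Reverse KL D_KL(student || teacher), computed separately for each token.

    Unlike a sparse end-of-episode reward, every position in the student's
    generated sequence receives its own score, giving dense feedback.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)   # [seq_len, vocab]
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)   # [seq_len, vocab]
    # Expectation under the student's own distribution at each position.
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)

# Toy example: a 5-token continuation over a 100-word vocabulary.
seq_len, vocab_size = 5, 100
feedback = per_token_reverse_kl(torch.randn(seq_len, vocab_size),
                                torch.randn(seq_len, vocab_size))
print(feedback.shape)  # torch.Size([5]) -- one score per generated token
```

Minimizing this quantity pushes the student's distribution toward the teacher's at every position the student actually visits, which is what gives each token its own training signal.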
Solving Catastrophic Forgetting: Retaining Knowledge While Learning New Skills
One persistent challenge in AI has been "catastrophic forgetting"—where models lose previously learned abilities when acquiring new knowledge. Traditional fine-tuning might see instruction-following ability drop from 85% to 45% when incorporating new documentation.
On-policy distillation addresses this through:
- Real-time trajectory sampling
- Gradual teacher correction
The method retains 41% of new knowledge while quickly restoring original capabilities to 83%, significantly outperforming conventional approaches.
Implementation: Simple Four-Step Process
The method's lightweight architecture requires only four repeating steps:
1. Deploy a teacher model (e.g., 32B) as the supervision source
2. Have the student model generate response trajectories
3. Have the teacher compute a log probability for each generated token
4. Optimize the student's parameters using the reverse Kullback-Leibler divergence
The system works with existing distillation frameworks without complex infrastructure, enabling what researchers call a "cost-effective and accurate" performance leap.
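The four steps can be strung together into a training loop along the following lines. This sketch uses deliberately tiny toy language models so that it runs end to end; the model sizes, module names, and hyperparameters are illustrative assumptions, not the published setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, GEN_LEN = 50, 16

class ToyLM(nn.Module):
    """A stand-in language model: embedding + linear head over a tiny vocabulary."""
    def __init__(self, dim: int):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):                 # tokens: [batch, seq]
        return self.head(self.embed(tokens))   # logits: [batch, seq, vocab]

teacher = ToyLM(dim=64).eval()                 # Step 1: deploy the (frozen) teacher
student = ToyLM(dim=16)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

prompt = torch.randint(0, VOCAB, (1, 4))
for _ in range(10):
    # Step 2: the student generates a response trajectory from the prompt.
    tokens = prompt.clone()
    with torch.no_grad():
        for _ in range(GEN_LEN):
            next_logits = student(tokens)[:, -1, :]
            next_token = torch.multinomial(F.softmax(next_logits, dim=-1), 1)
            tokens = torch.cat([tokens, next_token], dim=1)

    # Step 3: the teacher scores the trajectory with per-token log probabilities.
    with torch.no_grad():
        teacher_logp = F.log_softmax(teacher(tokens), dim=-1)

    # Step 4: minimize the per-token reverse KL (student || teacher) and update.
    student_logp = F.log_softmax(student(tokens), dim=-1)
    loss = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final reverse KL: {loss.item():.4f}")
```

Because the loss is computed on sequences the student itself produced, the feedback stays on-policy: the teacher only ever grades behaviour the student actually exhibits.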
Implications for AI Democratization
Murati's approach represents what industry observers describe as beating scale with smarter training: improving the training method rather than simply adding parameters. This has significant implications:
- Makes high-performance AI accessible on mobile and IoT devices
- Reduces reliance on cloud-based "AI monopolies"
- Enables continuous model evolution without capability loss
The technology is particularly promising for enterprise applications where models need to dynamically learn business rules without sacrificing core functionality like basic conversation and tool calling.
Key Points:
- 90% cost reduction in AI training
- Small (8B) models achieve 70% of the performance of large (32B) models
- Solves catastrophic forgetting while adding new knowledge
- Simple implementation compatible with existing frameworks
- Potential to democratize AI development across industries