AI Models Battle in Cybersecurity Challenge: GPT-5.5 Outsmarts, DeepSeek Shines on Budget
Cybersecurity Puts AI Models to the Test
Imagine giving AI models $10 and two hours to hack into a system - what could possibly go wrong? Security researcher Kasra Rahjerdi did just that, creating an ingenious test that revealed how different large language models handle real-world security challenges.

The Challenge Design
Rahjerdi built a deliberately vulnerable target: an Android e-book review app (APK) with Google Firebase credentials embedded inside, waiting to be discovered. To succeed, a model had to:
- Unpack the app like a digital detective
- Spot the credentials (no easy feat)
- Bypass hardened APIs to access the database
The $1,500 test produced wildly different results that surprised even seasoned experts.
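The credential-spotting step above can be illustrated with a minimal sketch. It assumes the APK has already been decoded into a directory of text resources (for example with a tool such as apktool) and that the embedded values follow the commonly seen shapes of a Google API key and a Firebase Realtime Database URL. The article does not say which tools or patterns the models actually used; the pattern table and the function names `scan_text` and `scan_tree` are hypothetical.

```python
import re
from pathlib import Path

# Heuristic patterns (assumptions, not taken from the article):
# Google API keys commonly look like "AIza" followed by 35 URL-safe characters,
# and Realtime Database URLs commonly end in ".firebaseio.com".
PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "firebase_db_url": re.compile(r"https://[a-z0-9-]+\.firebaseio\.com"),
}


def scan_text(text):
    """Return a sorted list of (kind, value) pairs found in one string."""
    hits = set()
    for kind, pattern in PATTERNS.items():
        for value in pattern.findall(text):
            hits.add((kind, value))
    return sorted(hits)


def scan_tree(root):
    """Scan every file under a decoded APK directory for credential-like strings."""
    findings = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Ignore undecodable bytes so binary files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
            hits = scan_text(text)
            if hits:
                findings[str(path)] = hits
    return findings
```

Calling `scan_tree("decoded_app")` on a decoded directory returns a mapping from file path to the matches found in that file. A scan like this only covers the first two steps of the challenge; it says nothing about whether the backend accepts the recovered values.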
The Standout Performers
GPT-5.5: OpenAI's unreleased model dominated, succeeding in 7 of 10 attempts (70%). It recognized Firebase as the weak point immediately and did not get distracted by red herrings. That performance came at a price, though: it nearly burned through its $10 budget each time, at $9.46 per successful hack.
DeepSeek V4Pro: China's contender stood out on cost. It succeeded only 3 times, but at just $0.62 per attempt, roughly one-fifteenth of GPT-5.5's cost. "For teams needing bulk security audits," Rahjerdi noted, "this cost difference becomes game-changing."
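The cost comparison can be checked with a few lines of arithmetic. The two dollar figures and the 7-of-10 success count come from the article. The figures are quoted in different units (per successful hack for GPT-5.5, per attempt for DeepSeek), so the sketch also shows how one converts to the other. The helper `cost_per_success` is hypothetical, and the assumption that DeepSeek also ran 10 attempts is not stated in the source.

```python
def cost_per_success(cost_per_attempt, successes, attempts):
    """Convert an average per-attempt cost into an average cost per success."""
    if successes == 0:
        return float("inf")  # no successes: cost per success is unbounded
    return cost_per_attempt * attempts / successes


gpt_cost = 9.46       # dollars, as quoted for GPT-5.5
deepseek_cost = 0.62  # dollars, as quoted for DeepSeek V4Pro

# The "1/15th" claim: ratio of the two quoted figures.
ratio = gpt_cost / deepseek_cost
print(f"cost ratio: {ratio:.1f}x")

# If both figures are read as per-attempt costs, the per-success costs are:
gpt_per_success = cost_per_success(gpt_cost, successes=7, attempts=10)
# ASSUMPTION: 10 attempts for DeepSeek (the article only gives 3 successes).
deepseek_per_success = cost_per_success(deepseek_cost, successes=3, attempts=10)
print(f"GPT-5.5: ${gpt_per_success:.2f} per success")
print(f"DeepSeek: ${deepseek_per_success:.2f} per success")
```

The ratio of the quoted figures comes to about 15, which matches the article's "1/15th". Under the per-attempt reading, the gap per successful hack narrows because of DeepSeek's lower success count, though it remains large.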
The Cautionary Tales
Not all models embraced their inner hacker:
- Claude Opus 4.8 showed flashes of brilliance but repeatedly interrupted itself because of its strict ethical safeguards
- Gemini 3.1 Pro Preview refused outright, triggering its safety protocols immediately
"It's fascinating," Rahjerdi observed, "how some models prioritized security over the test requirements, while others went all-in on the challenge."
What This Means for Cybersecurity
This experiment reveals more than just model capabilities - it hints at the future of digital defense. As AI becomes more specialized, we might see:
- Automated security audits conducted by AI armies
- Constant evolution of attack and defense strategies
- New benchmarks for AI security reasoning
Key Points:
- GPT-5.5 led in success rate (70%) but at a premium cost ($9.46 per successful hack)
- DeepSeek V4Pro delivered the best value at just $0.62 per attempt
- Some models prioritized security ethics over test objectives
- Results suggest future cybersecurity may involve competing AI systems
The battle lines are drawn - in the digital security arena, AI models are showing they can be both formidable attackers and cautious defenders.