AI Models Battle in Cybersecurity Challenge: GPT-5.5 Leads, DeepSeek Shines on a Budget

Cybersecurity Puts AI Models to the Test

What happens when you give AI models $10 and two hours to hack into a system? Security researcher Kasra Rahjerdi did just that, building a test that shows how differently large language models handle a realistic security challenge.

The Challenge Design

Rahjerdi built the target himself: an e-book review app, packaged as an APK, with vulnerabilities embedded on purpose. Credentials for Google's Firebase were hidden inside, waiting to be discovered. To succeed, a model had to:

  1. Unpack the APK and inspect its contents
  2. Spot the hidden Firebase credentials (no easy feat)
  3. Bypass hardened APIs to access the database
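The first two steps can be illustrated with a minimal Python sketch, meant for auditing an app you own or are authorized to test. It is not Rahjerdi's harness or any model's actual approach. It assumes only that an APK is a ZIP archive, that Google API keys conventionally start with "AIza" followed by 35 URL-safe characters, and that a Realtime Database host looks like <project>.firebaseio.com. The function name scan_apk and both patterns are illustrative choices.

```python
import re
import zipfile

# Assumption: Google API keys (used in Firebase configs) start with "AIza"
# followed by 35 URL-safe characters.
API_KEY_RE = re.compile(rb"AIza[0-9A-Za-z_\-]{35}")
# Assumption: only the classic <project>.firebaseio.com host form is covered.
DB_URL_RE = re.compile(rb"https://[a-z0-9\-]+\.firebaseio\.com")


def scan_apk(path):
    """Return sorted (entry_name, match) pairs for credential-like strings.

    `path` may be a filesystem path or a binary file-like object.
    Only raw ASCII bytes are searched, so strings stored as UTF-16 or
    obfuscated at build time will be missed.
    """
    findings = set()
    with zipfile.ZipFile(path) as apk:
        for name in apk.namelist():
            data = apk.read(name)
            for pattern in (API_KEY_RE, DB_URL_RE):
                for match in pattern.findall(data):
                    findings.add((name, match.decode("ascii")))
    return sorted(findings)
```

Calling scan_apk("app.apk") on a hypothetical file returns the archive entries that contain key-like or database-URL-like strings. Finding such a string is only step 2; whether it grants any access depends on how the backend's rules are configured.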

The $1,500 test produced widely divergent results from one model to the next.

The Standout Performers

GPT-5.5: The unreleased model from OpenAI dominated with a 70% success rate across 10 attempts. It immediately recognized Firebase as the weak point without getting distracted by red herrings. But this performance came at a price: it nearly burned through its $10 budget each time, at $9.46 per successful hack.

DeepSeek V4 Pro: China's contender shocked observers with its budget-friendly performance. Although it succeeded only 3 times, it did so at just $0.62 per attempt, roughly 1/15th of GPT-5.5's cost. "For teams needing bulk security audits," Rahjerdi noted, "this cost difference becomes game-changing."
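The cost comparison is simple arithmetic, sketched below in Python. The two dollar figures come from the article; the helper names are illustrative. The article quotes one figure "per successful hack" and the other "per attempt" without reconciling them, so the second helper shows how the two metrics would relate under a stated assumption rather than reporting a measured result.

```python
GPT_COST = 9.46       # dollars, as quoted for GPT-5.5
DEEPSEEK_COST = 0.62  # dollars, as quoted for DeepSeek


def cost_ratio(expensive, cheap):
    """How many times more the expensive option costs than the cheap one."""
    return expensive / cheap


def cost_per_success(cost_per_attempt, attempts, successes):
    """Spread total spend over successful runs only."""
    if successes == 0:
        raise ValueError("no successes to amortize over")
    return cost_per_attempt * attempts / successes


ratio = cost_ratio(GPT_COST, DEEPSEEK_COST)
print(f"GPT-5.5 costs about {ratio:.1f}x as much")

# Assumption for illustration only: DeepSeek also ran 10 attempts and the
# $0.62 figure is a per-attempt average. The article does not confirm either.
print(f"Hypothetical cost per success: ${cost_per_success(DEEPSEEK_COST, 10, 3):.2f}")
```

The ratio works out to a little over 15, which matches the "1/15th" claim. The per-success conversion matters when comparing models with very different success rates, since a cheap attempt that rarely succeeds can still be costly per result.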

The Cautionary Tales

Not all models embraced their inner hacker:

  • Claude Opus 4.8 showed flashes of brilliance but kept interrupting itself because of its strict ethical safeguards
  • Gemini 3.1 Pro Preview flat-out refused to play, triggering its security protocols immediately

"It's fascinating," Rahjerdi observed, "how some models prioritized security over the test requirements, while others went all-in on the challenge."

What This Means for Cybersecurity

This experiment reveals more than just model capabilities - it hints at the future of digital defense. As AI becomes more specialized, we might see:

  • Automated security audits conducted by AI armies
  • Constant evolution of attack and defense strategies
  • New benchmarks for AI security reasoning

Key Points:

  • GPT-5.5 led in success rate (70% across 10 attempts) but at a premium cost ($9.46 per successful hack)
  • DeepSeek V4 Pro delivered the best value at just $0.62 per attempt, with 3 successes
  • Some models prioritized security ethics over test objectives
  • Results suggest future cybersecurity may involve competing AI systems

The battle lines are drawn - in the digital security arena, AI models are showing they can be both formidable attackers and cautious defenders.