AI Security Showdown: GPT-5.5 Outsmarts Vulnerabilities While DeepSeek Delivers Budget Wins
AI Models Put to the Security Test
Security researcher Kasra Rahjerdi devised a clever challenge to test how well AI models handle real-world vulnerabilities. By creating a deliberately flawed book review app containing exposed Google service credentials, he gave leading language models a practical security puzzle to solve.

Performance Under Pressure
Given a limit of two hours and $10 per attempt, the models showed strikingly different capabilities. GPT-5.5 emerged as the technical champion, successfully identifying and extracting credentials in 7 out of 10 attempts. The report highlights how GPT-5.5 could cut through interface clutter and quickly spot critical security flaws.
Meanwhile, Gemini 3.1 Pro Preview disappointed, refusing the task almost immediately in each test. While these early refusals kept token costs low, the model never demonstrated meaningful security analysis capabilities.
The Cost Conundrum
While GPT-5.5's performance impressed, its $9.46 average cost per success raised eyebrows. For security teams needing to scale their testing, this price tag quickly becomes prohibitive.
Enter DeepSeek V4 Pro, the dark horse of cost efficiency. Though it succeeded in just 3 of 10 attempts, its average cost of only $0.62 per success changed the economics entirely. At $9.46 versus $0.62, that's roughly fifteen DeepSeek successes for the price of one GPT-5.5 success.
"For organizations running hundreds or thousands of security checks, this cost difference becomes transformative," the report notes. While DeepSeek occasionally stumbled on authentication interfaces, its budget-friendly performance offers practical value for large-scale deployments.
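To make the arithmetic behind that comparison concrete, here is a minimal sketch in Python. The two per-success costs are the figures quoted above; the function names and the 1,000-success scenario are illustrative assumptions, and the projection assumes cost per success stays constant at scale, which the report does not claim.

```python
# Average cost per successful run in USD, as quoted in the article.
GPT_COST_PER_SUCCESS = 9.46
DEEPSEEK_COST_PER_SUCCESS = 0.62


def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many cheap successes cost the same as one expensive success."""
    return expensive / cheap


def projected_spend(cost_per_success: float, successes: int) -> float:
    """Project total spend for a target number of successes.

    Assumption (not from the report): cost per success scales linearly.
    """
    return cost_per_success * successes


ratio = cost_ratio(GPT_COST_PER_SUCCESS, DEEPSEEK_COST_PER_SUCCESS)
print(f"DeepSeek successes per GPT-5.5 success: {ratio:.1f}")

# Hypothetical scenario: a team that needs 1,000 successful checks.
for name, cost in [("GPT-5.5", GPT_COST_PER_SUCCESS),
                   ("DeepSeek V4 Pro", DEEPSEEK_COST_PER_SUCCESS)]:
    print(f"{name}: ${projected_spend(cost, 1000):,.2f}")
```

Under these assumptions the gap grows linearly with volume. The sketch covers cost only and ignores the difference in success rates (7 of 10 versus 3 of 10), which affects how many attempts and how much time each success takes.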
Key Points
- GPT-5.5 leads in raw security problem-solving (70% success rate)
- DeepSeek V4 Pro dominates cost efficiency at roughly 1/15th of GPT-5.5's cost per success ($0.62 vs. $9.46)
- Gemini 3.1 Pro Preview refused the task in every test without performing any analysis
- Real-world security teams face tradeoffs between capability and budget