DeepSeek AI Faces Scrutiny Over Possible Gemini Data Use
DeepSeek's newly updated R1 reasoning model posts impressive scores on math and coding benchmarks, but that success comes with controversy. Because the company has not disclosed its training data sources, researchers suspect the model may have been trained in part on outputs from Google's Gemini family of models.
Suspicious Similarities Emerge
Sam Paech, a Melbourne-based developer, identified striking parallels between DeepSeek's R1-0528 model and Google's Gemini 2.5 Pro. While not conclusive evidence, his observations were reinforced by the pseudonymous founder of the SpeechMap project, who noted that the two models produce near-identical "thought trajectories" during reasoning tasks, a behavioral fingerprint that is hard to explain without some shared training basis.
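Paech has not published his exact methodology, but one simple way to quantify this kind of behavioral fingerprint is to compare the word- and phrase-choice distributions of two models' reasoning traces. Below is a minimal sketch using TF-IDF cosine similarity; the function and trace variables are illustrative assumptions, not his actual tooling.

```python
# Hypothetical sketch: quantify the lexical "fingerprint" overlap between
# two models' reasoning traces with TF-IDF cosine similarity. This is an
# illustration of the general idea, not Paech's actual methodology.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def trace_similarity(traces_a: list[str], traces_b: list[str]) -> float:
    """Mean pairwise cosine similarity between two sets of text traces."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True)
    matrix = vectorizer.fit_transform(traces_a + traces_b)
    a, b = matrix[: len(traces_a)], matrix[len(traces_a):]
    return float(cosine_similarity(a, b).mean())

# Usage (trace lists are hypothetical): collect both models' reasoning
# traces on the same prompts, then compare against a third model as a
# baseline to see whether the overlap is unusually high.
# score = trace_similarity(r1_traces, gemini_traces)
```

A high score between two specific models, relative to the baseline pair, would be suggestive but not proof; shared training data is only one possible explanation for lexical convergence.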
This isn't DeepSeek's first brush with such allegations. Last December, its V3 model frequently misidentified itself as OpenAI's ChatGPT, suggesting it may have been trained on ChatGPT conversation logs. Earlier this year, OpenAI said it had found evidence that DeepSeek employed "data distillation" techniques, extracting knowledge from an existing model's outputs to train a new one, and Bloomberg reported that large amounts of data were exfiltrated through OpenAI developer accounts in late 2024, accounts believed to be linked to DeepSeek.
The Gray Area of Data Distillation
While common in AI development, data distillation walks an ethical and legal tightrope: OpenAI's terms explicitly prohibit using its outputs to build competing products. Attribution is also getting harder. As AI-generated content floods the open web, models trained on that contaminated data increasingly mimic one another's phrasing without any deliberate copying, blurring the line between distillation and accidental convergence.
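For readers unfamiliar with the mechanics, output-based distillation typically means collecting a stronger "teacher" model's responses to a prompt set and using them as supervised fine-tuning data for a "student" model. The sketch below shows only the data-collection half; `query_teacher` is a hypothetical placeholder for whatever API the teacher is served through, and, as noted above, many providers' terms forbid using their outputs this way.

```python
# Minimal sketch of output-based distillation data collection: gather a
# teacher model's responses to a prompt set and save them as supervised
# fine-tuning data for a student model.
import json

def query_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a hosted teacher model's API client."""
    raise NotImplementedError("wire up a real model client here")

def build_distillation_set(prompts: list[str], out_path: str) -> None:
    """Write one JSONL record per (prompt, completion) pair."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            completion = query_teacher(prompt)
            f.write(json.dumps({"prompt": prompt,
                                "completion": completion}) + "\n")
```

The simplicity of this loop is exactly why the practice is hard to police: any party with API access and a budget can run it at scale.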
Nathan Lambert, a researcher at the nonprofit Allen Institute for AI (AI2), notes that DeepSeek certainly has the resources to generate synthetic data from top-tier API models. "With their funding," Lambert explains, "they could easily generate training material from the best available models."
Industry Responds With Security Measures
AI companies are fortifying their defenses against unauthorized data extraction. OpenAI now requires identity verification before granting API access to its most advanced models, while Google has begun summarizing the raw reasoning traces of models served through its AI Studio platform, making those traces harder to distill.
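Google has not published implementation details, but the general shape of such a defense is easy to sketch: return the final answer in full while exposing only a summary of the chain of thought, so the token-level fingerprint never leaves the provider. Everything below is a hypothetical illustration, not Google's actual code.

```python
# Hypothetical illustration of one trace-protection approach: serve the
# final answer unchanged but expose only a summary of the model's raw
# chain of thought. This is not Google's actual implementation.
from typing import Callable

def serve_response(answer: str, raw_trace: str,
                   summarize: Callable[[str], str]) -> dict:
    """Package a model response with a summarized reasoning trace."""
    return {
        "answer": answer,                  # full answer, unchanged
        "thoughts": summarize(raw_trace),  # summary only; raw trace withheld
    }
```

The trade-off is that summarized traces are less useful to legitimate developers who relied on raw reasoning output for debugging their prompts.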
The situation highlights growing tensions in the competitive AI landscape. As models achieve remarkable capabilities, the industry faces difficult questions about intellectual property boundaries and ethical training practices.
Key Points
- DeepSeek's R1-0528 model shows behavioral similarities to Google's Gemini 2.5 Pro, while its training data sources remain undisclosed
- This follows previous incidents suggesting unauthorized use of OpenAI's ChatGPT outputs
- "Data distillation" techniques raise ethical concerns despite being common practice
- Major AI firms are implementing stricter security measures to protect their models
- The incident underscores unresolved questions about intellectual property in AI development