Stanford's Latest Evaluation: DeepSeek R1 Medical AI Model Outperforms Google and OpenAI with High Scores

AIbase · Published in AI News · 3 min read · Jun 4, 2025

Stanford University recently released a comprehensive evaluation of large language models on clinical medical tasks. DeepSeek R1 ranked first among the nine models tested, with a 66% win rate and a macro-average score of 0.75. What sets this evaluation apart is that it goes beyond traditional medical licensing exam questions and also covers the day-to-day work of clinicians, which makes the assessment more practical.
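To make the two headline metrics concrete, here is a minimal sketch of how a win rate and a macro-average score could be computed from per-benchmark results. It assumes that the win rate is the fraction of head-to-head comparisons a model wins across benchmarks and that the macro average is the unweighted mean of per-benchmark scores; the article does not spell out the exact definitions. The model names, benchmark names, and scores are hypothetical and are not taken from the evaluation.

```python
from itertools import combinations
from statistics import mean

# Hypothetical per-benchmark scores in [0, 1]; NOT data from the evaluation.
scores = {
    "model_a": {"bench_1": 0.80, "bench_2": 0.70, "bench_3": 0.60},
    "model_b": {"bench_1": 0.75, "bench_2": 0.72, "bench_3": 0.50},
    "model_c": {"bench_1": 0.60, "bench_2": 0.65, "bench_3": 0.55},
}


def macro_average(model_scores):
    """Unweighted mean over benchmarks (assumed definition)."""
    return mean(model_scores.values())


def win_rates(all_scores):
    """Share of pairwise comparisons each model wins; a tie counts as half a win."""
    wins = {model: 0.0 for model in all_scores}
    comparisons = {model: 0 for model in all_scores}
    benchmarks = next(iter(all_scores.values())).keys()
    for bench in benchmarks:
        for a, b in combinations(all_scores, 2):
            score_a, score_b = all_scores[a][bench], all_scores[b][bench]
            comparisons[a] += 1
            comparisons[b] += 1
            if score_a > score_b:
                wins[a] += 1
            elif score_b > score_a:
                wins[b] += 1
            else:
                wins[a] += 0.5
                wins[b] += 0.5
    return {model: wins[model] / comparisons[model] for model in all_scores}


if __name__ == "__main__":
    rates = win_rates(scores)
    for model in scores:
        print(model, round(rates[model], 2), round(macro_average(scores[model]), 2))
```

Under these definitions a model can lead on win rate without leading on macro average, since the first counts only who wins each comparison while the second reflects the size of the scores.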

The evaluation team developed an integrated evaluation framework called MedHELM, which includes 35 benchmarks covering 22 subcategories of medical tasks. The framework was validated by 29 practicing doctors from 14 medical specialties to ensure that it is sound and practical. In the final results, DeepSeek R1 came first, followed by o3-mini and Claude 3.7 Sonnet.

{{MEDIA_PLACEHOLDER_0}}

DeepSeek R1 was also consistent across benchmarks: the standard deviation of its win rate was only 0.10. o3-mini ranked second, performing notably well in the clinical decision support category, with a 64% win rate and the highest macro-average score of 0.77. Claude 3.5 Sonnet and Claude 3.7 Sonnet followed closely, with win rates of 63% and 64%, respectively.

{{MEDIA_PLACEHOLDER_1}}

The evaluation also used a large language model jury (LLM-jury) method to score results. Its scores were highly consistent with those given by clinicians, which supports the method's validity. In addition, the research team conducted a cost-benefit analysis and found that reasoning models are relatively expensive to use while non-reasoning models are more cost-effective, so the better choice depends on the user's needs.
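The LLM-jury idea described above can be illustrated with a short sketch: several model "jurors" rate each response, the ratings are averaged into one jury score, and the jury scores are then compared with clinician ratings. The article does not say how many jurors were used, what rating scale applied, or how consistency was measured, so the three jurors, the 1 to 5 scale, and the Pearson correlation below are all assumptions, and the ratings are hypothetical.

```python
from statistics import mean

# Hypothetical ratings on an assumed 1-5 scale; one row per response,
# one column per LLM juror. NOT data from the evaluation.
jury_ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]
# Hypothetical clinician rating for each of the same four responses.
clinician_ratings = [4, 2, 5, 3]


def jury_score(ratings):
    """Combine the jurors' ratings for one response by simple averaging."""
    return mean(ratings)


def pearson(xs, ys):
    """Pearson correlation, used here as an assumed agreement measure."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    jury_scores = [jury_score(row) for row in jury_ratings]
    print("jury scores:", [round(s, 2) for s in jury_scores])
    print("agreement with clinicians:", round(pearson(jury_scores, clinician_ratings), 2))
```

Averaging several jurors is one way to dampen the quirks of any single judging model; other aggregation rules and agreement statistics would fit the same outline.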

This evaluation provides valuable data for the development of medical AI and opens up more options and flexibility for future clinical practice.

This article is from AIbase Daily.
