ChatGPT's Scientific Judgment Falls Short, Study Finds
When it comes to complex scientific judgments, ChatGPT might not be as reliable as its confident tone suggests. A recent Washington State University study paints a concerning picture of the AI's limitations in this critical area.
The Flaws Beneath the Surface
Professor Mesut Cicek's team put ChatGPT through rigorous testing, analyzing its responses to 719 research hypotheses from business journals. The results? While initial accuracy appeared decent at around 80%, deeper analysis revealed serious problems:
- Accuracy barely better than guessing: After accounting for random chance, performance was only slightly better than 50/50 odds, a result the researchers likened to a low D grade.
- Particularly poor at spotting falsehoods: The model correctly identified false statements just 16.4% of the time.
- Version upgrades didn't help: Even newer models such as GPT-5 mini showed no significant improvement on these tasks.
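The gap between the headline 80% figure and the near-chance result above comes down to class imbalance: a model that almost always answers "true" looks accurate when most hypotheses are true. One common way to correct for this is balanced accuracy, the mean of the per-class hit rates. The 16.4% rate for spotting false statements is from the study; the true-positive rate used below is a hypothetical illustration, not a reported number.

```python
def balanced_accuracy(tpr: float, tnr: float) -> float:
    """Mean of per-class recall; 0.5 is random-guess level."""
    return (tpr + tnr) / 2

tnr = 0.164  # correctly flagged false statements (reported in the study)
tpr = 0.90   # correctly confirmed true statements (assumed for illustration)

print(balanced_accuracy(tpr, tnr))  # 0.532 - barely above a coin flip
```

Under these assumptions, a model that looks 80%-accurate overall scores just 53% once both classes are weighted equally, which matches the "slightly above 50/50" characterization.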
The Consistency Problem
The study uncovered another troubling pattern - ChatGPT often couldn't stick to its own answers. Researchers submitted each hypothesis multiple times and found:
"In some cases, we'd get completely contradictory responses using identical prompts," noted Professor Cicek. "One query might alternate between 'true' and 'false' answers like flipping a coin."
While the model maintained consistent conclusions about 73% of the time, that still leaves significant room for error in professional settings where reliability matters most.
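The repeated-prompt test described above can be reproduced in miniature: send the same question several times and measure how often the answers agree. The study does not publish its exact metric, so the majority-agreement score below is just one simple, assumed way to quantify the ~73% consistency figure.

```python
from collections import Counter

def self_consistency(answers: list[str]) -> float:
    """Fraction of repeated responses that match the most common answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

# Hypothetical responses to one hypothesis submitted four times:
print(self_consistency(["true", "true", "false", "true"]))  # 0.75
print(self_consistency(["true", "false", "true", "false"])) # 0.5 - the coin-flip case
```

A score of 0.5 on a two-answer question is the "flipping a coin" behavior Professor Cicek describes; only scores near 1.0 indicate the stable judgment professional use would require.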
Why This Matters for Businesses
The research team issued clear warnings for corporate decision-makers:
- Don't mistake fluency for expertise: ChatGPT's polished language can mask its lack of true understanding.
- Always verify outputs: Never treat AI conclusions as final without human review.
- Train staff appropriately: Employees need education about both AI capabilities and limitations.
"These tools don't actually 'know' anything in the human sense," Cicek explained. "They're matching patterns from training data, not reasoning through problems."
Key Points:
- ChatGPT struggles with scientific truth verification, performing only slightly better than random guessing
- Consistency issues plague responses, with answers sometimes flip-flopping completely
- Newer versions show little improvement on these specific tasks
- Business leaders cautioned against over-reliance on AI for complex judgments
- Human verification remains essential despite AI's convincing presentation