ChatGPT can pass standardized medical exams—but lacks clinical reasoning skills.
The chatbot ChatGPT has already proved itself capable of regurgitating the copious amounts of information expected of a medical student, passing the MCAT and the US Medical Licensing Exam with flying colors.
Earlier this year, John Lin ’23 and Oliver Tang ’19 MD’23 pushed it further, comparing the performance of GPT-3.5 (the model ChatGPT runs on) and its newer sibling, GPT-4, on an ophthalmology boards exam with that of residents and practicing physicians.
“We were really surprised,” Lin says. “We’re seeing really rapid advances, since GPT-3.5 didn’t pass the exam a couple months ago.”
When image-based questions, which GPT-4 can’t process, were excluded, ChatGPT outperformed humans. Lin and Tang’s results also showed that both models did worse on higher-order questions, which require multiple steps of reasoning, than on first-order ones.
Over 90 percent of the problems on the neurosurgery written boards exam are first-order questions, the kind on which new large language models (LLMs) consistently match, if not outperform, doctors. Tang suggests that such testing include more “scenarios of clinical equipoise and emphasis on higher-order questions.”
But medical licensing exams may not be the right framework to judge the performance of ChatGPT, says Morgan Cheatham ’17 MD’25, who was part of the first research group to demonstrate that it could pass the USMLE. “The problems on the USMLE are not the questions that we as physicians are grappling with day by day,” he notes.
Cheatham and collaborators at Stanford are evaluating how LLMs perform on real-world questions posed by physicians at the point of care. They created a database of questions that doctors might ask while treating patients, ones that fall “within the gray areas of medicine.” In moments of uncertainty, physicians consult other specialists; Cheatham and his team wanted to know, “What would happen if you consulted AI?”
When reviewed by practicing physicians, over 90 percent of the responses generated by GPT-3.5 and GPT-4 were deemed “safe,” but fewer than 20 percent agreed with the “correct” answer. Nearly a third left physicians divided and unable to assess their validity.
“The takeaway is that these models provided insufficient answers 70 percent of the time,” says Cheatham, who is also a resident editor at the New England Journal of Medicine Artificial Intelligence. “Without specialized training on clinical, medico-legal, ethics, safety, and other data sources, current LLMs aren’t ready for prime time when it comes to clinical decision-making.”
Implicit bias in data remains an unavoidable yet dangerous problem in medicine. Trained on imperfect or biased datasets—such as clinical trials that excluded underserved populations—LLMs could exacerbate existing inequalities, Cheatham says.
In his view, it’s all about capturing data accurately and efficiently. As a vice president at Bessemer Venture Partners, Cheatham has invested in a handful of companies that do just that. “In medicine, capturing and preserving the patient’s narrative is extremely important, especially when considering downstream applications of AI,” he says. “What we write about patients in their medical record stays with them, often, forever.”
As AI’s role in medicine grows, the most crucial job of health care providers is to ensure that LLMs like ChatGPT are safe before they are put into practice.
“No matter how incorporated AI is in patient care, the responsibility of patient care and patient outcomes falls onto a human being, the physician,” Tang says. “The responsibility of rigorously assessing the quality of these models and setting up proper guardrails against bad outcomes also rests on physicians.”