A study published in 2026, involving researchers affiliated with Harvard Medical School and a major Boston academic medical center, quietly did something the medical establishment had been dreading. Researchers handed an AI the same messy, incomplete electronic health records that ER doctors work from every day (no curated test questions, no clean data sets) and let it try to diagnose real patients. The AI won.
An OpenAI reasoning model correctly identified the diagnosis in a majority of real emergency-room triage cases. Two experienced attending physicians, working from the identical records, scored notably lower. The study tested the model against a cohort of real ER patients from the study site at three stages of care: initial triage, first doctor interaction, and hospital admission.
The use of real records is the detail worth sitting with. Every previous AI-versus-doctor study used cleaned-up, structured cases, the kind of tidy problem sets that appear on board exams. This one used the real thing: incomplete notes, ambiguous symptoms, and records typed fast by nurses during shift change. The researchers called this the study’s most important distinction.
And here’s where it gets strange. The AI wasn’t just matching doctors on routine cases. On structured treatment-planning problems, the model, OpenAI’s o1, scored substantially higher than physicians did. Physicians scored less than half as high, even when they had access to external reference resources.
One illustrative case from the study involved a patient with a complex presentation, which the AI resolved by surfacing a detail buried in the chart. The AI flagged a possible lupus history in the patient’s records, a connection that could explain why the patient’s clot wasn’t responding. The clinical team hadn’t connected the dots. Whether that flag led to a changed outcome isn’t stated in the published findings, but the point stands: the model found a thread the doctors missed, buried in the same chart they were all reading.
What the Test Was Actually Measuring

The natural question is: how? A language model isn’t trained to diagnose anything. It’s trained to predict text. So how does predicting the next word in a sentence become something that looks like clinical reasoning?
The short answer is that medical knowledge is encoded in language. Everything a physician learns, from textbooks, case studies, treatment protocols, and published research, exists as text. A model trained on enough of that material develops something that functions like pattern recognition. It reads a chart the way a senior clinician reads a chart: looking for what doesn’t fit, what’s missing, what one symptom combination tends to mean.
One of the study’s co-authors put it plainly: AI models now score close to 100% on multiple-choice medical licensing exams, so high that such benchmarks can no longer meaningfully track progress. “We can’t track progress anymore,” he said.
The ceiling comment is the part that should get attention. Medical AI benchmarks were designed to measure progress. They no longer do, because the thing being measured has outgrown them. The field is now running a different kind of race, against real-world conditions, real patient records, real clinical noise, and this study is the first serious lap time.
What This Doesn’t Mean

None of this is an argument for replacing physicians. A diagnosis is a decision, and decisions require accountability, judgment about what a patient can tolerate, and a conversation about risk. An AI doesn’t sit across from anyone. It doesn’t notice that the patient seems more confused than the chart suggests, or that the family in the hallway looks terrified and needs five minutes.
What the study actually shows is something more specific and more useful: that the information-processing part of diagnosis, the part that requires holding dozens of variables in mind simultaneously and checking them against a vast store of prior cases, is something AI does better than humans. Not sometimes. Reliably.
The question that follows isn’t whether to hand medicine over to a language model. It’s whether a doctor reading an ER chart in 2027 who isn’t using one of these tools is making a decision with less information than they could have. That’s the gap this study opened. It won’t close on its own.
This article was researched, written, and edited by our human editorial team. AI tools were used in a limited research-assistant capacity. All claims were independently verified.