Will artificial intelligence diagnose your next illness?


A major earthquake has occurred in science and medicine, and its aftershocks will be felt around the world for years to come.

AI systems outperform doctors at diagnostic tests, but experts warn that real-world care needs human judgment, accountability and patient interaction. (Shutterstock)

The impacts of artificial intelligence are being discussed in every corner of modern society, yet healthcare has so far remained relatively sheltered from them. Interest is widespread, but adoption has been uneven globally, slowed by regulation, the high risk of error, and the deeply human nature of clinical work.

Dr. Robert Wachter, chair of the Department of Medicine at the University of California, San Francisco, and one of the most vocal observers of clinical AI, argues in his fascinating new book, One Giant Leap, that the place of AI in medicine is best understood along two axes: feasibility and risk. Diagnosis, he writes, carries high risk and low feasibility, and has therefore been difficult to automate.

People agree on little these days, but I would wager most of us share a deep respect for doctors who make quick and accurate diagnoses. “Our veneration is partly because effective medical care begins with the correct diagnosis, and partly because it is the most interesting thing we do,” Dr. Wachter writes, adding that it matters most of all for the patient. Misdiagnosis costs time, money, the chance to give the right treatment, and often, life itself.

Anyone who has watched a skilled diagnostician at work knows that there is something almost magical about how disease is discovered through a Sherlockian process of elimination, Bayesian theory, and a list of differential diagnoses. It’s why we honor fictional investigators like House, and why true diagnostic intelligence still seems like one of medicine’s finest arts.
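
To see what that Bayesian step looks like in miniature, here is a toy calculation. The diagnoses, priors, and likelihoods below are invented for illustration, not clinical data: each candidate diagnosis starts with a prior probability, and each new finding multiplies it by how likely that finding would be under that diagnosis.

```python
# Toy sketch of a Bayesian diagnostic update over a short differential
# list. All numbers are illustrative, not clinical data.

priors = {  # P(disease) before any findings, restricted to this differential
    "pneumonia": 0.05,
    "pulmonary embolism": 0.01,
    "viral bronchitis": 0.20,
}

# P(finding | disease): how likely "pleuritic chest pain" would be
# under each candidate diagnosis (again, invented values).
likelihoods = {
    "pneumonia": 0.30,
    "pulmonary embolism": 0.70,
    "viral bronchitis": 0.05,
}

def update(priors, likelihoods):
    """Bayes' rule: posterior is proportional to prior x likelihood."""
    unnormalized = {d: priors[d] * likelihoods[d] for d in priors}
    total = sum(unnormalized.values())
    return {d: p / total for d, p in unnormalized.items()}

posterior = update(priors, likelihoods)
for disease, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{disease}: {p:.2f}")
```

One finding is enough to reorder the list: the rare but pain-associated pulmonary embolism climbs past diagnoses that started out far more likely. A clinician runs this loop, informally, with every new test result.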

To date, diagnosis has remained one of the most difficult areas of AI in medicine.

In a study just published in the journal Science, Dr. Peter J. Brodeur and his colleagues at Harvard Medical School and Beth Israel Deaconess Medical Center fed OpenAI’s o1-preview reasoning model the same triage notes a nurse might jot down at the front desk of an emergency room: vital signs, basic history, and a first impression of a sick patient.
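
The study’s exact prompts are not reproduced here, but the setup is roughly this: a sketch using OpenAI’s Python client, with a triage note invented for illustration.

```python
# Rough sketch of feeding an ER triage note to a reasoning model.
# The triage note and prompt wording are invented for illustration;
# they are not the Science study's actual materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

triage_note = (
    "62-year-old male. BP 88/60, HR 118, RR 24, SpO2 91% on room air, "
    "T 38.9 C. Three days of productive cough and worsening shortness "
    "of breath. Appears ill and diaphoretic."
)

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{
        "role": "user",
        "content": "Given this emergency department triage note, "
                   "list a ranked differential diagnosis:\n\n" + triage_note,
    }],
)
print(response.choices[0].message.content)
```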

The results were amazing.

In 76 real cases drawn from a major Boston emergency department, the model reached an accurate or very close diagnosis 67 percent of the time. The two physicians tested against it scored 55 percent and 50 percent. When the same doctors were later handed a set of differential diagnoses and asked to guess which had been written by a colleague and which by the AI, they couldn’t tell the difference. One of them got 15 percent of his guesses right; the other got 3 percent.

A brief word about what these systems are. A large language model is trained on vast amounts of text drawn from the Internet, books, and other sources. By processing trillions of words, it learns the statistical patterns of language. ChatGPT, Gemini, and Claude are large language models. When asked a question, they produce an answer one word at a time by predicting what should come next.
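
A toy version of that idea, shrunk from a neural network trained on trillions of words down to a simple word counter, looks like this. The same principle, learn what tends to follow what, then generate one step at a time, is all that is happening, just at an incomparably larger scale.

```python
# Toy next-word predictor: count which word follows which in a tiny
# corpus, then generate text one word at a time. Real LLMs do this
# with neural networks over trillions of tokens; the principle is
# the same in miniature.
from collections import Counter, defaultdict

corpus = (
    "the patient has a fever the patient has a cough "
    "the doctor orders a test the doctor reads the result"
).split()

# For each word, count what follows it.
following = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    following[word][nxt] += 1

def predict_next(word: str) -> str:
    """Return the statistically most likely next word."""
    return following[word].most_common(1)[0][0]

text = ["the"]
for _ in range(6):  # generate one word per step
    text.append(predict_next(text[-1]))
print(" ".join(text))  # -> "the patient has a fever the patient"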

Until recently, these models produced answers in a single breath. A reasoning model, like OpenAI’s o1 series, works differently. It is trained to slow down, work through a problem step by step, and check its work along the way. The shift from single-breath answers to deliberate reasoning has produced one of the biggest leaps in AI performance to date.
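
Schematically, the difference looks something like the sketch below. The `ask_model` function is a hypothetical stand-in for a call to a language model, used here only so the contrast between the two styles is visible; it is not a real API.

```python
# Schematic contrast between a single-pass answer and deliberate,
# self-checking reasoning. `ask_model` is a hypothetical stand-in
# for a model call, so this sketch runs without any API.

def ask_model(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"  # placeholder reply

def answer_in_one_breath(question: str) -> str:
    # Older models: predict the answer directly, token by token.
    return ask_model(question)

def answer_with_deliberation(question: str) -> str:
    # Reasoning models spend tokens on intermediate work first:
    # decompose the problem, solve it step by step, then verify.
    steps = ask_model(f"Break this into steps: {question}")
    work = ask_model(f"Work through each step: {steps}")
    checked = ask_model(f"Find and fix any errors in: {work}")
    return ask_model(f"State the final answer given: {checked}")
```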

The Boston emergency department cases weren’t the only test. The same model was put through five other experiments, including the case puzzles published in the New England Journal of Medicine, long used as demanding tests of diagnostic reasoning, and a separate set of management cases drawn from real patients, where the question was not what the diagnosis was but what to do next. Across the study, the model was compared to hundreds of doctors, previous AI systems, and historical human baselines. In almost every trial, it performed at or above the level of the physicians.

The authors also checked whether the model was simply regurgitating cases it had seen during training. It wasn’t. The model was not doing mere information retrieval; it appears to be genuinely reasoning.

Are the doctors finished? Of course not.

In a companion perspective in Science, Dr. Ashley M. Hopkins and Eric Cornelis of Flinders University in Australia put it plainly: “Passing exams is not the same as being a doctor.”

Reasoning over text is not the same as being a doctor either. The model works from words alone. It did not see the patient. It could not order a test and reconsider when the result came back. It never had to break bad news to a family.

Medicine is not just a set of conclusions. It is a relationship and a calling.

There are also reasons to be concerned about how AI will change the doctor who uses it. Aviation has lived with a version of this problem for decades. When machines take over, human skills atrophy, which is why pilots are still trained for the moments when automation fails. There is no good reason to believe that clinical reasoning is exempt.

A radiologist reading images alongside AI can grow complacent in exactly the cases where the AI makes mistakes. A novice physician trained in this era may never develop the pattern recognition that comes from years of unaided practice. And a doctor will often order the test the AI flags, even when his own judgment says it is unnecessary, because the cost of ignoring the algorithm’s flag – in litigation and in conscience – is heavier than the cost of acting on it.

Ironically, a doctor using AI can sometimes perform worse than one without it. One study in JAMA found that systematically biased AI predictions reduced doctors’ diagnostic accuracy. Having a seemingly reliable second opinion changes how the first opinion is formed.

And this study is already dated. The model it tested was released in September 2024; in AI terms, that is a generation ago. Current reasoning systems are multimodal, processing not only text but also images, audio, and video. The next generation will be more capable still.

But there are important questions that the medical community and society as a whole have yet to answer, questions that cannot be solved by building smarter models alone. We need evaluation frameworks that measure performance under the noisy ambiguity of real care rather than the clean prose of a puzzle case, transparency that lets the patient know that an algorithm has shaped their diagnosis, and clear lines of accountability when something goes wrong.

Anirban Mahapatra is a scientist and author. His latest book is When Medications Don’t Work. The opinions expressed are personal.
