AI has been evolving rapidly since its inception. Forward-thinking physicians are evolving with it.
Hamish Fraser, MBChB, MSc, has a message for physicians who have not yet considered how the latest advances in artificial intelligence might affect their practices.
“You need to start getting a feel for it—because you don’t really have a choice,” says Fraser, an associate professor of medical science and of health services, policy, and practice. The reason, he adds, is simple: Medical AI isn’t on the horizon. It’s already here.
“This thing has happened,” he says. “It’s happening in the hospital. It’s happening to your patients every day.”
“Artificial intelligence” is a catchall term for technologies that aim to mimic human behavior and decision-making—technologies that physicians and biomedical researchers have in some cases been using for decades.
The earliest forms of medical AI were programmed to follow rules formulated by human experts (“If a patient has symptoms X and Y, treat according to protocol Z”). Physicians have long used such rule-based expert systems, which have over time evolved to deal with uncertainty and complexity, to help diagnose and manage disease; and patients now have access to them through online symptom checkers developed by companies like WebMD.
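To make the idea concrete, a rule-based symptom checker can be sketched in a few lines of code; the symptoms, rules, and protocols below are hypothetical placeholders, not an actual clinical rule set.

```python
# Minimal sketch of a rule-based expert system for triage.
# The symptoms, rules, and protocols are hypothetical placeholders.

RULES = [
    # (required symptoms, recommended protocol)
    ({"chest pain", "shortness of breath"}, "Protocol Z: send to emergency care"),
    ({"fever", "stiff neck"}, "Protocol M: urgent evaluation"),
    ({"runny nose", "sore throat"}, "Protocol C: routine care, rest and fluids"),
]

def recommend(symptoms):
    """Return the first protocol whose required symptoms are all present."""
    reported = set(symptoms)
    for required, protocol in RULES:
        if required <= reported:   # "if a patient has symptoms X and Y..."
            return protocol
    return "No rule matched; refer to a clinician."

print(recommend(["fever", "stiff neck", "headache"]))
# -> Protocol M: urgent evaluation
```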
Next came machine learning systems that could figure out how to perform tasks they hadn’t been programmed for, like recognizing tumors or identifying skin lesions. Some of these systems employ artificial neural networks that learn by simulating the way connections are formed between neurons in the human brain. By stacking many layers of these artificial neurons on top of one another, developers can create highly sophisticated deep learning models.
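A minimal sketch of that layering, assuming the PyTorch library, with layer sizes and the lesion-classification task chosen purely for illustration:

```python
# Minimal sketch of a deep (multi-layer) neural network in PyTorch.
# Layer sizes and the classification task are arbitrary, for illustration only.
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(64, 32),   # first layer of artificial "neurons"
    nn.ReLU(),
    nn.Linear(32, 16),   # stacking additional layers makes the model "deep"
    nn.ReLU(),
    nn.Linear(16, 2),    # output: e.g., lesion vs. no lesion
)

scores = model(torch.randn(1, 64))   # random stand-in for image-derived features
print(scores)
```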
Researchers routinely use machine learning for tasks like identifying drug candidates and predicting disease-associated genes. On the clinical side, machine learning is getting better at interpreting diagnostic and imaging data like x-rays and EKGs and at calculating polygenic risk scores for diseases such as Alzheimer’s and type 2 diabetes.
Generative AI programs like OpenAI’s GPT and Anthropic’s Claude represent the latest stage in the evolution of artificial intelligence. These new systems harness the power of neural networks and deep learning to generate new content based on whatever data (text, images, sound) they have been trained on. Generative AI models are already outperforming their predecessors at tasks like diagnosing rare conditions and mimicking human empathy. Perhaps just as importantly, they are incredibly accessible. ChatGPT and Claude, for instance, are based on large language models (LLMs), a species of generative AI that responds to natural language prompts. As a result, anyone—physician or patient, tech-savvy researcher or Luddite—can use them.
“The barrier to entry has never been lower for someone with a good idea to utilize this technology in a meaningful way,” says Rohaid Ali RES’25, MD, a neurosurgery resident at Brown who has collaborated on several AI projects with his wife, dermatology resident Fatima Mirza RES’25, MD, MPH.
But like any technology, AI also presents risks.
In general, AI systems are only as good as the data they train on, and are prone to various forms of bias. They can also be steered to serve the goals of different groups with competing interests: physicians and patients, insurers and pharmaceutical companies.
In addition, neural networks and deep learning systems are “black boxes” in the sense that no one, including their creators, can explain exactly how they generate their results: Their networks of virtual neurons learn to process information according to their own inscrutable logic, analyzing data and generating predictions in ways that remain invisible from the outside. Generative AI models also occasionally hallucinate, producing nonsensical results with the digital equivalent of a straight face.
All of which means that AI must be used with caution—preferably under the care of informed physicians who understand its capabilities and limitations. “The most important part of any AI project in this day and age,” Mirza says, “is making sure that appropriate guardrails are in place.”
Lorin Crawford, PhD, a distinguished senior fellow in biostatistics in the Brown School of Public Health, has spent much of the past decade developing AI tools to conduct clinically significant research in several different areas.
One line of inquiry focuses on how genes interact with each other and the environment to generate the architecture of physical traits like height and BMI or complex diseases such as schizophrenia and type 1 diabetes. In a recent study he co-authored with Sohini Ramachandran, PhD, the Hermon C. Bumpus Professor of Biology and founding director of Brown’s Data Science Institute, Crawford helped develop a neural network capable of teasing out the relationships between various genetic features and levels of high- and low-density lipoprotein.
Crawford also uses custom-tailored algorithms to explore gene-gene and gene-environment interactions at the cellular level to better understand cancer recurrence and drug resistance. In a similar vein, he is constructing virtual ex vivo tumor models based on real tumor tissue that might one day allow doctors to screen anticancer drugs for specific patients. And he is trying to correlate the shapes of tumors as revealed through medical imaging with their molecular properties to reduce the need for biopsies.
Crawford expects that generative AI will drive advances across a wide variety of fields, including genomics, imaging, and drug discovery—in part because of the technology’s ability to learn from its human interlocutors. “You ask GPT something, it says something back to you, you then have a response, and it learns from that interaction,” he says.
Yet these next-generation AI tools still have limitations.
The biomedical data banks that are used to train research models, for instance, tend to be highly skewed toward people of European ancestry. This limits the degree to which the predictions the models generate can be generalized to other populations, and risks exacerbating health disparities among different groups. (Similar problems can apply to data drawn from only one hospital, or from patients who all share the same socioeconomic status.) “The model can only learn based on patterns of things that it’s exposed to,” Crawford says.
Hallucination, meanwhile, means that generative AI is guaranteed to spout rubbish a certain amount of the time.
“You query a model five or six times, and it gives you something insane,” Crawford says. On the one hand, Crawford explains, hallucination is just an extreme manifestation of the technology’s capacity to extrapolate beyond the task it was trained for. “In some ways, that allows the model to make discoveries,” he says.
On the other, it renders the technology potentially hazardous to naïve users who lack the knowledge and expertise to know when it’s making stuff up. The black-box quality of the models further complicates matters.
Like hallucination, this lack of transparency might not matter much in a relatively low-stakes situation where mistakes carry few consequences. But for a doctor in a clinical setting, understanding which features a diagnostic AI tool has used to make a prediction could be vital, if only to establish confidence.
And while Crawford and his colleagues are actively working on ways of making AI models less opaque and therefore capable of winning physicians’ trust—“we’re pretty close,” he says—they aren’t quite there yet.
Trust and confidence lie at the heart of Fraser’s research on AI in health care. Indeed, much of his work represents an attempt to determine just how reliable AI is in clinical settings—and how different kinds of AI perform relative to one another, and to real physicians.
For example, Fraser is currently comparing the performance of expert systems and machine learning programs with that of generative models such as GPT. Unlike most studies of generative AI in health care settings, Fraser’s investigations use real patient data, and he has real physicians assess and critique the diagnoses and triage recommendations made by the programs he evaluates.
So far, the results have been decidedly mixed.
Fraser has shown that while GPT is capable of greater diagnostic accuracy than symptom checkers that rely on expert systems, it is also considerably less reliable. For instance, unlike its rule-based predecessors, the newfangled LLM tends to route patients with serious conditions such as heart attack or head injury to routine rather than emergency care. It is also likely to interpret the same patient data in significantly different ways on different occasions. And the latest version of GPT that Fraser tested actually performed worse than the previous one.
There are reasons for optimism, however. Fraser is evaluating GPT’s ability to diagnose and triage stroke and transient ischemic attacks, conditions for which prompt treatment is vital. Having a home-based diagnostic tool capable of sending high-risk individuals to the ED could significantly improve patient outcomes, and preliminary results indicate that GPT is highly sensitive: “It missed almost nobody,” Fraser says. Unfortunately, it also suffered from low specificity, sending almost every patient in the dataset to the ED—a problem Fraser hopes to address in ongoing research.
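For readers unfamiliar with the terms, sensitivity and specificity come down to a simple calculation; the counts below are invented to mimic a highly sensitive but low-specificity triage tool and are not Fraser’s actual results.

```python
# Illustration of the sensitivity/specificity tradeoff with invented counts;
# these numbers are NOT from Fraser's study.
true_positives  = 49   # stroke/TIA patients correctly sent to the ED
false_negatives = 1    # stroke/TIA patients missed
true_negatives  = 5    # non-stroke patients correctly kept out of the ED
false_positives = 45   # non-stroke patients sent to the ED unnecessarily

sensitivity = true_positives / (true_positives + false_negatives)   # 0.98: misses almost nobody
specificity = true_negatives / (true_negatives + false_positives)   # 0.10: sends almost everyone to the ED

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```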
Fraser is also comparing the ability of machine learning systems and generative AI programs to reduce the number of HIV patients who are lost to follow-up care in Kenya. The technologies are being used to predict which patients are at high risk of missing appointments so that they can be given additional support—a context in which any improvement would be helpful, and any errors would not carry the dire consequences of a faulty diagnosis or triage recommendation in an emergency situation. “There’s a good chance we can make a difference, and there’s limited opportunity to make things worse,” Fraser says.
When it comes to supporting important clinical decisions safely and effectively, however, Fraser contends that generative AI has yet to prove itself. “This is a very powerful and useful technology,” he says. “My biggest concern is that it is being thrown out there way, way, way too early.”
He is especially worried that most of the people using programs like ChatGPT for health care purposes are not physicians but ordinary patients.
“They have no training; they don’t know what the limitations are; they don’t know how inconsistent it’s going to be,” he says. “Sometimes they’re going to get some really helpful answers, and maybe next time it will be really unhelpful.”
Part of the solution, Fraser and his colleagues say, is to have doctors insert themselves in the conversation about generative AI: familiarizing themselves with the technology’s power and pitfalls; sharing that information with the people under their care; and guiding the ways in which the technology is evaluated and implemented so that it serves the interests of patients.
In addition to addressing issues of safety and efficacy, Isaac Kohane ’81, MD, PhD, P’25, chair of the Department of Biomedical Informatics at Harvard Medical School and editor-in-chief of the journal NEJM AI, says that engaging with the technology will prevent corporate and commercial interests from having sole control over how generative AI is used in health care.
As a cautionary tale, Kohane points to electronic health records. Because EHRs were seized upon as a means of producing the documentation necessary to maximize billing, their adoption increased the number of hours that physicians spent stuck behind computer screens, decreasing the amount of time they spent engaging with patients and further fueling burnout.
The advent of AI-driven ambient listening systems that can eavesdrop on clinician-patient conversations and automatically generate clinical notes could help remedy that situation, reducing the hours doctors spend on documentation and improving patient care. But unless their implementation is supervised by physicians themselves, such systems could also be used to pressure clinicians to cram even more patients into their practices. Similar caveats apply to the growing use of generative AI to optimize reimbursements and process claims.
“AI can be used to improve our diagnostic acuity and therapeutic focus, but it could just as easily be used to maximize procedures that are highly remunerative yet not particularly worthwhile or evidence based,” Kohane says. “It could be used to drive further volume without the patient interactions that both patients and doctors crave, and to disintermediate human beings from deciding whether a particular type of care should be paid for or not.”
The antidote to all these problems, Kohane argues, is to have physicians in the loop steering the models and advocating for patients.
In the short term, Kohane thinks that generative AI—with its ability to analyze massive amounts of textual information—will play a valuable role in medical education.
This is already beginning to happen. Jay Khurana ’19 MD’25 and Hossam Zaki ’22 MD’26, for instance, are investigating the use of generative AI to produce digital flashcards and summaries for preclinical lectures at The Warren Alpert Medical School.
“Med students love flashcards; they study everything from flashcards,” Zaki says. While student-generated flashcard decks are already available for some courses, they can fall out of date—whereas an AI-generated deck can always be as fresh as the last class.
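A hypothetical sketch of how such a pipeline might prompt a large language model to turn a lecture transcript into flashcards, assuming the OpenAI Python SDK; the prompt, model name, and workflow are illustrative guesses, not Khurana and Zaki’s actual system.

```python
# Hypothetical sketch of generating flashcards from a lecture transcript with an LLM.
# The prompt, model name, and workflow are assumptions, not the team's actual pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def lecture_to_flashcards(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You create concise question-and-answer study flashcards."},
            {"role": "user",
             "content": "Write flashcards covering the key points of this lecture, "
                        "one 'Q:'/'A:' pair per line:\n\n" + transcript},
        ],
    )
    return response.choices[0].message.content

# Any AI-generated deck would still be vetted by students and faculty before use.
```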
Khurana and Zaki are working with the Office of Medical Education and Brown’s institutional review board to see if these AI-generated study aids, which are vetted by students to ensure that they meet the course learning objectives and aren’t marred by hallucinations, can help improve learning. (Preliminary results indicate that approximately 75 percent of students who used the AI material felt that it saved them time during studying.) They have received regulatory approval to test the same technology in the Department of Neurology at Brown University Health, where they will use it to generate summaries and flashcards of the material presented in grand rounds and didactics.
“Assuming we have a successful pilot there, we’re planning on expanding to many other departments,” says Khurana, who could see such a system being used to summarize a surgeon’s explanation of a procedure for the benefit of residents who cannot easily pay attention and take notes at the same time. Eventually, adaptive AI systems might even be able to tailor the materials they generate to individual learning styles.
Over the longer term, Kohane believes that integrating AI into physician-led teams could help mitigate the double whammy of rising demand for health care services and mounting physician shortages.
With the Association of American Medical Colleges predicting that the United States will face a shortfall of up to 86,000 physicians by 2036, Kohane envisions a future in which primary care physicians will quarterback teams of nurse practitioners and physician assistants with access to a variety of AI tools—including ones that may be able to handle some tasks currently performed by specialists, such as interpreting echocardiograms.
Kohane says that generative AI assistants, despite their inherent limitations, could also provide physicians with instant second opinions and review their work for possible errors, all with an eye toward reducing patient mortality and morbidity.
“Unless we’re willing to posit that doctors are infallible and all-knowing, having an incomplete and fallible but remarkably high-performing artificial assistant as part of the care team seems like a really good safety measure,” he says.
Keeping physicians in the loop—and keeping patients top of mind—is key to Mirza and Ali’s AI work.
The two residents first encountered ChatGPT when they used it to study for their board certification exams in 2022. “We were really impressed by how it was able to simplify complex concepts and make them digestible,” Mirza says.
At around the same time, Mirza encountered a pediatric patient whose family was finding it difficult to understand the standard Brown University Health surgical consent form. She and Ali decided to see if GPT could work its explanatory magic with the document.
Whereas most Americans read at a sixth-grade level, an analysis of consent forms from various institutions revealed that most were written at a college reading level. Mirza and Ali therefore fed GPT the surgical consent form along with the following prompt: “While preserving context and meaning, convert this consent form to the average American reading level.”
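A minimal sketch of that step, assuming the OpenAI Python SDK and adding a readability check with the textstat package; the model name and file name are illustrative, and only the quoted prompt comes from the project itself.

```python
# Sketch of the consent-form simplification step, using the prompt quoted above.
# The model name, file name, and readability check are assumptions added for illustration.
from openai import OpenAI
import textstat

client = OpenAI()

def simplify(consent_form: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "While preserving context and meaning, convert this consent form "
                       "to the average American reading level.\n\n" + consent_form,
        }],
    )
    return response.choices[0].message.content

simplified = simplify(open("surgical_consent_form.txt").read())
print(textstat.flesch_kincaid_grade(simplified))  # target: roughly a sixth-grade level
```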
The two ran the simplified draft past an interdisciplinary group of stakeholders (clinicians, patient advocates, hospital leadership, the legal department) for review and approval, and by 2023 the new form had been disseminated across the entire Brown University Health system, where it is now used to consent approximately 35,000 procedures annually. GPT-simplified consent forms for chemotherapy and radiology, along with a simplified patient intake form, have since been introduced as well.
The initial act of simplification turned out to be the easiest part of the process: a single prompt, and the job was basically done. For Mirza and Ali, the real work involved establishing a framework wherein physicians and others could carefully vet the LLM’s output.
“AI was used as a tool to start the conversation,” Mirza says. “But there were experts at the table running a critical eye over what was produced and ensuring that it was accurate before it was rolled out in the clinical setting.”
A similar concern for the safe and responsible deployment of generative AI guided Mirza and Ali’s efforts to help a patient named Lexi using voice cloning, the same technology that is used to create deepfakes of people saying things they never actually said.
Lexi was left with severe speech deficits following surgery for a vascular brain tumor, and a representative of OpenAI who knew about Mirza and Ali’s consent form work suggested that the company’s Voice Engine model might prove helpful. The model recreated Lexi’s voice after training on a 15-second video she had made of herself in high school, and OpenAI engineers developed a text-to-voice app for her smartphone. Today, Lexi can “speak” in her old voice by tapping on a screen; yet because the app is keyed to her unique vocal signature, the technology cannot be used to mimic other people’s voices for fraudulent purposes.
In the end, the team created a template for implementing a sensitive new technology safely and ethically—a template that Ali likens to the long-standing practice of prescribing controlled substances rather than selling them over the counter. “You don’t have general access to this advanced technology, so there is not widespread abuse,” he says. Lexi’s care team continues to monitor her progress to ensure that her use of the technology has no off-target effects, such as impairing the recovery of her natural voice.
Neither the consent form nor the Voice Engine project would have been as successful had they not been led by physicians who were intent on doing no harm. And they wouldn’t have transpired at all had Mirza and Ali not decided to explore the power of generative AI to do good.
Ultimately, that combination of promise and peril represents the best reason for physicians to get involved now, when they can still shape the future of a technology that will likely reshape health care.
Or as Mirza told the NEJM AI Grand Rounds podcast: “AI is here and it’s here to stay. And it’s either going to have the input of the people who really care about the patients, or it’s not.”