A magazine for friends of the Warren Alpert Medical School of Brown University.

Move Slowly and Build Things

AI could transform medicine—or magnify its flaws. Physicians must work with developers to ensure fair and ethical design.

When Christopher Chute ’77 MD’82, DrPH, began assembling the National COVID Cohort Collaborative (N3C) database in April 2020—a project that amassed over 40 billion pieces of patient data from across the country—he wasn’t thinking about artificial intelligence. His focus was simple: clean, harmonized data that could be trusted to inform critical public health decisions.

“You’re going to get nonsense if the units aren’t aligned,” he says, noting how something as trivial as mismatched weight units—kilograms versus pounds—could undermine an entire analysis.

Yet it’s precisely this foundation of clean data that AI depends on, exposing a fundamental truth: no algorithm, no matter how sophisticated, can rise above the quality of its inputs. Beneath the surface lie deeper issues, from imperfect variables to gaps in representation of minority populations. These types of errors aren’t just technical hiccups; they are ethical fault lines, threatening to skew conclusions and decisions that would affect millions of lives.

AI has the power to transform medicine, promising breakthroughs in diagnostics, treatment planning, and personalized care. Behind the shiny facade of technological advancement, however, lurks an urgent ethical challenge: bias. AI systems, built on flawed or incomplete data, risk perpetuating—and even amplifying—existing inequities in health care. From racial disparities encoded into diagnostic tools to socioeconomic biases lurking in algorithms, these hidden flaws raise critical questions about trust, accountability, and fairness.

Ethical AI in medicine cannot be left to chance; it requires intention. It demands rigorous oversight, transparency, and a commitment to inclusion at every step—from the way data is collected to how algorithms are designed and deployed. As AI takes on a larger role in medicine, these issues become even more pressing. Bias in AI isn’t just a bug—it’s a reflection of human systems and choices, baked into every dataset and algorithm. With intentional design, interdisciplinary collaboration, and a relentless focus on ethics, medicine can chart a different path—one where AI becomes a force for equity, not exclusion.

Machine learning scientists and AI specialists typically approach their work through a structured workflow: data collection and cleaning, algorithm training and testing, interpretation of results, and deployment of the algorithm into real-world applications. At each step, new forms of bias can creep in, raising ethical challenges that must be addressed. Using that same workflow, this article explores these issues and explains how they can be mitigated along the way.

Data Collection + Cleaning

Bias can be introduced into AI at different stages and different levels; it starts before AI even becomes involved. “Garbage in, garbage out” is the golden mantra in machine learning communities, meaning that an analysis can only be as good as the data. Data and information are what fuel these intelligent algorithms and tools, but bias can be introduced even by the systems that collect them.

“Data collection is opportunistic. You have data, and you use it. For virtually all of machine learning and computer science, the data collection part is never thought about,” says Suresh Venkatasubramanian, PhD, director of the Center for Technological Responsibility, Reimagination, and Redesign at Brown’s Data Science Institute. Biases are rooted in which data gets collected, how that information is encoded as numbers, and how it is ultimately represented.

Chute, the Bloomberg Distinguished Professor of Health Informatics at Johns Hopkins University, says that if you train models using data whose units aren’t aligned, the results will be meaningless: patients whose weight is recorded in kilograms, for example, will appear to the model to weigh far less than those whose weight is recorded in pounds. The data must follow a universal standard and convention, a process known as data harmonization—otherwise, he says, “you’re almost certainly biasing, if not fundamentally invalidating whatever conclusions that you might make.”
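
To make the stakes concrete, here is a minimal sketch of the kind of unit-harmonization step Chute describes. The field names and records are invented for illustration; nothing here is drawn from N3C itself.

```python
# Minimal sketch of unit harmonization: express every weight in kilograms
# before any model sees the data. Field names are illustrative only.
LBS_PER_KG = 2.20462

def harmonize_weight(record: dict) -> dict:
    """Return a copy of the record with weight expressed in kilograms."""
    value, unit = record["weight_value"], record["weight_unit"].lower()
    if unit in ("kg", "kilogram", "kilograms"):
        kg = value
    elif unit in ("lb", "lbs", "pound", "pounds"):
        kg = value / LBS_PER_KG
    else:
        raise ValueError(f"Unrecognized weight unit: {record['weight_unit']}")
    return {**record, "weight_value": round(kg, 2), "weight_unit": "kg"}

records = [
    {"patient_id": "A", "weight_value": 70.0, "weight_unit": "kg"},
    {"patient_id": "B", "weight_value": 154.0, "weight_unit": "lbs"},
]
print([harmonize_weight(r) for r in records])
# Both patients now weigh roughly 70 kg; without this step, a model would
# "see" patient B as more than twice as heavy as patient A.
```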

Bringing data together from dozens of sources means there will undoubtedly be differences in conventions, coding systems, and frameworks. Neil Sarkar, PhD, MLIS, is an associate professor of medical science and of health services, policy, and practice at Brown and the CEO of the Rhode Island Quality Institute. He also runs the Rhode Island Health Information Exchange, in which “all [patient] data comes together into one unified medical record that serves the state,” meaning that if a patient gets a flu shot at CVS, seeks care from a primary care physician, and is referred to a hospital, providers at each stage will have access to the electronic health data generated at every site. The goal is to make sure that “the system does what it needs to do for supporting care coordination,” because “health care is very fragmented,” Sarkar says.

Uniting all of this data is challenging and requires compromise. Data coming from various sources could use different standards.

Harmonizing these data types to a universal standard can mean mistranslation or a loss of nuance. Furthermore, medical data comes in all sorts of modalities—imaging, pathology reports, EHRs, claims, long-form clinical notes, wearables—all of which tell the story of a patient and their condition. However, for many types of studies, multimodal data needs to be distilled into two-dimensional data that can be presented in something like a spreadsheet.

“Sometimes you lose the forest for the trees and your simplifications can erase significant data components,” Chute says.
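
A toy example of that kind of distillation, using entirely made-up fields, shows how much can fall away when a rich, multimodal record is squeezed into a single spreadsheet-style row.

```python
# Toy example (all fields invented): flattening a multimodal patient record
# into one row of tabular data for model training.
patient = {
    "labs": {"creatinine": [1.1, 1.4, 1.9]},            # a worsening trend over three visits
    "imaging": "mild cortical thinning on ultrasound",   # free-text radiology impression
    "note": "reports new fatigue; family history of kidney disease",
}

flat_row = {
    "creatinine_last": patient["labs"]["creatinine"][-1],  # keeps only the latest value
    "abnormal_imaging": 1,                                  # a nuanced impression becomes a flag
    # the clinical note, and the context it carries, is dropped entirely
}
print(flat_row)  # {'creatinine_last': 1.9, 'abnormal_imaging': 1}
# The upward creatinine trend, the radiologist's wording, and the note all
# vanish: the simplification Chute warns can "erase significant data components."
```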

While this step is often necessary, its trade-offs can be minimized if it is handled by someone with a deep understanding of the data. And while it might seem that no AI algorithm can ever be trained to be “fair” or “correct,” Chute points out that nothing in medicine is perfect or free of bias. Not even clinical trials, long considered the gold standard and superior to observational data, are immune: studies have shown that who participates in trials depends on demographics, introducing bias at the point of consent.

Algorithm Training + Testing

Bias also arises from the way algorithms are built and optimized for specific goals. “The way you build machine learning is to say, ‘I want a pattern finder that’s good,’” Venkatasubramanian says, where “good” can be defined as whatever the scientists are working toward. The motivations, attitudes, and metrics used to define that “good” shape the pattern finder and can introduce further biases.

“Artificial intelligence” refers to a variety of techniques aimed at different goals and achieved in different ways. It can be broadly split into two large subgroups: classical machine learning and generative AI. Classical machine learning methods, such as neural networks, are optimized to predict a specific outcome. That focus on accuracy often comes at the cost of interpretability, however, leading to the black box phenomenon, in which it becomes difficult to understand how an algorithm arrived at its conclusions.

There are techniques to mitigate the black box effect and retrace steps in classical machine learning, but not for generative AI. ChatGPT, for example, uses statistical patterns to predict the most likely next word in a sentence. However, Chute says, “they have no notion of a fact. They only model language.” For now, large language models are not capable of logical reasoning: a model might tell us that 2+2=4, but we can’t trust that it actually understands that 2+2=4, only that it has seen it enough times to predict it correctly. Unlike classical methods, generative AI models operate as opaque systems whose inner workings cannot be retraced, making the black box effect even more pronounced.
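
For the classical case, retracing a model’s steps can be as simple as reading its learned weights. The sketch below fits a scikit-learn logistic regression to synthetic data (none of it from the researchers or datasets discussed here) purely to show what an inspectable model looks like.

```python
# Illustrative only: a transparent classical model whose reasoning can be retraced.
# Synthetic data; requires numpy and scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two made-up predictors for a made-up binary outcome.
X = rng.normal(size=(200, 2))
y = (0.9 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
for name, coef in zip(["feature_a", "feature_b"], model.coef_[0]):
    print(f"{name}: learned weight = {coef:+.2f}")
# The weights show which input pushed each prediction, and in which direction.
# A large generative model offers no comparably direct, readable trace.
```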

“Biases operate on both sides,” Chute says. For generative AI methods, where the black box is unavoidable, it’s more difficult for clinicians to “follow if it’s not interpretable,” he says. And because all of a neural network’s future predictions are derived entirely from its training data, that only underscores the need for the data to be accurate, representative, and “clean.”

When it comes to evaluating how well these AI models perform and the accuracy of the results, gold-standard clinical scores are often used as the source of “truth.” Yet these sources of truth can themselves be riddled with bias, racism, and error. For example, the eGFR equation for kidney disease, which used to include race-based adjustments, perpetuated health care inequities by overestimating kidney function in Black patients, delaying diagnosis and treatment. The formula included a “correction” factor for race, labeling individuals as “Black” or “non-Black,” based on the flawed presumption that Black individuals, on average, have higher muscle mass.
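
The coefficients below follow the widely published 2009 CKD-EPI version of that equation, which has since been replaced by a race-free formula; the code is an illustrative sketch, not a clinical calculator.

```python
# Sketch of the race "correction" in the 2009 CKD-EPI eGFR equation
# (coefficients as widely published; for illustration, not clinical use).
def egfr_ckd_epi_2009(scr_mg_dl: float, age: int, female: bool, black: bool) -> float:
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    egfr = (141
            * min(scr_mg_dl / kappa, 1) ** alpha
            * max(scr_mg_dl / kappa, 1) ** -1.209
            * 0.993 ** age
            * (1.018 if female else 1.0))
    if black:
        egfr *= 1.159   # the race coefficient: same labs, ~16% higher estimated function
    return egfr

# Identical creatinine, age, and sex; only the recorded race differs.
print(round(egfr_ckd_epi_2009(1.4, 60, female=False, black=False), 1))
print(round(egfr_ckd_epi_2009(1.4, 60, female=False, black=True), 1))
# The higher estimate can keep a Black patient above a referral or treatment
# threshold longer, delaying diagnosis, which is exactly the inequity described above.
```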

“There’s a race component to those equations that is completely and absolutely biased,” Sarkar says. “If you go into many of the clinical equations that are used in practice, you will find that race, especially, is encoded right into the algorithm, and that’s very problematic.”

Sarkar says a knee-jerk reaction would be to simply remove all these equations from practice, but this would potentially create new problems, as physicians would have nothing to refer to as baseline. For example, childhood growth curves are based mostly on those of white children. Pediatricians know this, so when they see a child of another race, they recalibrate and contextualize the standard for growth. They can explain to parents why the growth curve is a little off and reassure them their child is on a positive trajectory.

“Humans adjust for that, but the computer doesn’t,” Sarkar says—and that may be the most telling and dangerous difference between humans and AI. “The biggest challenge we have right now is not whether AI is getting the right answer all the time, it’s that what the AI is producing is actually interpretable and clinically actionable so that the clinician feels that they are comfortable with what they’re about to do to this patient,” he says.

The worst-case scenario is putting physicians in a position to make wrong decisions based on wrong answers from AI. Medicine has long-standing, established measures of success that are consistent and regulated, but these do not yet exist for AI methods, because the methods are constantly evolving. In general, it’s difficult to define metrics for something that is neither consistent nor reproducible.

“We still don’t see a lot of diagnosis being done exclusively by an AI or machine-learning algorithm,” Sarkar says. “We’ve had the ability to implement machine-learning-based clinical decision support for more than 40 years, so it’s not a new thing. But we’ve turned most of those off, and some of that’s because the doctors don’t know what we can trust and what we can’t trust.”

However, he adds, by understanding AI’s shortcomings, medicine can find the jobs best suited to its strengths, like generating summaries of clinical notes or transcribing clinic visits—uses that are “much better for the patient as well as for the clinician in terms of doing what they need to do,” he says.

Deployment

In 2021, President Biden tapped Venkatasubramanian to coauthor the “Blueprint for an AI Bill of Rights.” The initiative aimed to define what people-centered AI should look like, emphasizing the protection of human rights and values over purely AI-driven goals. “What do we need as a people? What should our government be doing to protect us as AI is changing?” were among the questions Venkatasubramanian asked.

He and his collaborators devised five principles: AI must be safe and effective, not discriminate, use human data carefully, be transparent and accountable, and offer an option to opt out. These principles directly address ethical concerns, particularly the risks of bias, misuse of data, and lack of accountability in AI systems. Venkatasubramanian says that upholding these principles requires interdisciplinary collaboration, bridging gaps between scientists, clinicians, and policymakers, to tackle AI’s challenges effectively.

“If I’m a computer scientist, you’re a doctor, and we’re supposed to work together, what are our incentives? Sometimes, they don’t always match up,” he says. “The way they approach the world is also very different. Trying to make sure that these different perspectives can come together, it’s a matter of building a relationship. That takes time.”

Public education is another vital piece of this puzzle, Venkatasubramanian says: “The public has a much better handle on AI as long as you demystify what it’s about, how it works.” By showing live demos and encouraging interactive exploration, he finds that audiences start asking critical questions about ethics and bias—questions that professionals themselves should address. “It’s about acknowledging people’s fears and working with them, not dismissing them,” he says.

For many, the fear might be math. “We have to dispel the illusion that math is complicated,” Venkatasubramanian says. Understanding AI takes a kind of camaraderie between the public and professionals: each approaches the technology from the perspective they know best, then they come together, teach each other, and defend their rights. Otherwise, people are left vulnerable to those who may misuse AI to exploit or manipulate them.

This ethos informs the work Venkatasubramanian does at the Center for Technological Responsibility, Reimagination, and Redesign. “I want to train a generation of researchers to go out in the world that have the capacity to, A, build the tools and have the technical expertise to critically understand and talk about those technologies in the context of society; and B, understand the landscape and the stakeholders who are and should be involved in any of these discussions and not just limit themselves to the tech itself; and C, to be able to communicate all of this in a way that’s accessible to the general public,” he says.

Sarkar adds that medical students must be trained rigorously in fundamental medicine and accrue clinical experience to be able to critically evaluate AI as it integrates into the medical field. He draws an analogy to calculators—useful tools but not substitutes for understanding the basics. When a calculator errs due to faulty input, humans still need the knowledge to recognize mistakes.

“You need to be able to tell when AI is right or when it’s just making something up,” Sarkar says. “AI is very convincing, but it can be wrong. And you need to know when it is.” While AI may serve as a cyber second opinion, it is ultimately the human clinician who bears responsibility, he adds: “You’re the human. You’re still in charge.”

Conclusion

Bias in AI is not just a technical issue; it reflects societal inequities embedded in the data and the systems that create it. Addressing this requires a commitment to transparency, rigorous data cleaning, and inclusive algorithm design. Physicians, with their firsthand knowledge of the clinical landscape and patient needs, are uniquely positioned to recognize when AI fails and to demand accountability. As stewards of patient care, they are essential to ensuring that AI tools are interpretable, actionable, and fair.

Education is equally critical—both for clinicians who will use AI and for the public who will be affected by its decisions. Demystifying AI, explaining its limits, and fostering a collaborative approach between technologists and clinicians are vital steps toward building trust. AI is not a replacement for human judgment but a tool to enhance it. However, this requires a careful balance of innovation and responsibility, with ethics as the guiding principle.

Ultimately, AI’s promise lies in its ability to augment human capabilities, not replace them. By grounding its development in rigorous standards, interdisciplinary collaboration, and a relentless focus on equity, medicine can ensure that AI is a tool for better care, better outcomes, and a more inclusive future for all.
