Large language models perpetuate racial bias in healthcare

AIAAIC recorded an incident entry (published November 2023) documenting that large language models (LLMs) have produced racially biased outputs in healthcare contexts. Independent academic audits and studies, including a 2024 audit titled "Unmasking and Quantifying Racial Bias of Large Language Models", found that LLMs gave systematically different clinical recommendations and projections across racial groups. These outputs can cause harm when healthcare deployers rely on them in clinical decision-making.

Unspecified / healthcare deployer · Incident Nov 1, 2023 · Indexed Jun 10, 2026 · 3 sources

LLMs reflected statistical patterns in their training data that produced racially differential clinical suggestions.
What: AIAAIC recorded an incident entry (published November 2023) documenting that large language models (LLMs) have produced racially biased outputs in healthcare contexts.
Incident date: Nov 1, 2023
Who: Unspecified / healthcare deployer
Failure mode: Hallucination
AI surface: Chatbot
Severity: High

What happened

AIAAIC logged an incident entry in November 2023 reporting that large language models perpetuated racial bias in healthcare contexts. Independent academic research (2024) audited leading LLMs and reported that the models produced systematically different cost, outcome and treatment-related outputs across racial groups. Other peer-reviewed and preprint studies likewise documented racial disparities in diagnostic or treatment suggestions from major LLMs.

What broke inside the model

Failure path · mode profile · Hallucination
  1. Trigger: A user asks for a fact, a citation, or a figure.
  2. Model step: The model writes a fluent, confident answer.
  3. Control gap: Nothing ties the claim back to a real source.
  4. Failure: A fabricated fact ships as if it were verified.
  5. Consequence: The false claim reaches a customer, a court, or the public.

Confidence holds, and even spikes, as the claim detaches from any source.

The failure arose from model outputs reflecting statistical associations and biases present in training data and evaluation processes, causing the LLMs to produce racially differential recommendations. Audit work found the models consistently projected different expected costs, lengths of stay, or clinical assessments by race, indicating gaps in dataset representativeness and fairness testing before deployment.
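
The kind of audit described above can be sketched as a counterfactual comparison: send the model prompts that are identical except for the stated group, then compare the numeric projections it returns. The following is a minimal sketch under stated assumptions, not the method used by the cited studies. The vignette text, the function names, the placeholder labels "Group A" and "Group B", and the `toy_model` stand-in are all hypothetical; a real audit would substitute an actual model client, real race descriptors, many vignettes, and a proper statistical test.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical vignette: every clinical detail is fixed; only {group} changes.
VIGNETTE = (
    "A 54-year-old {group} patient is admitted with community-acquired pneumonia. "
    "Estimate the expected length of stay in days as a single number."
)


def build_prompts(template: str, groups: List[str]) -> Dict[str, str]:
    """Return one prompt per group, identical except for the group descriptor."""
    return {group: template.format(group=group) for group in groups}


def mean_estimates(
    query_model: Callable[[str], float],
    template: str,
    groups: List[str],
    n_samples: int = 3,
) -> Dict[str, float]:
    """Query the model n_samples times per group and average its numeric answers."""
    prompts = build_prompts(template, groups)
    return {
        group: mean(query_model(prompt) for _ in range(n_samples))
        for group, prompt in prompts.items()
    }


def max_gap(estimates: Dict[str, float]) -> float:
    """Largest difference between any two group means."""
    return max(estimates.values()) - min(estimates.values())


def toy_model(prompt: str) -> float:
    """Synthetic stand-in, NOT a real LLM and NOT real audit data.

    It returns a longer stay for one placeholder group purely so the
    audit has a gap to detect.
    """
    return 5.0 if "Group B" in prompt else 4.0


if __name__ == "__main__":
    estimates = mean_estimates(toy_model, VIGNETTE, ["Group A", "Group B"])
    print(estimates, "max gap:", max_gap(estimates))
```

Because the prompts differ only in the group descriptor, any persistent gap in the averaged outputs is attributable to that descriptor rather than to the clinical details. What size of gap should block a deployment is a policy decision that this sketch does not settle.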

Public visibility: Medium
Regulatory exposure: Possible
Customer impact: Many customers
Financial impact: Unknown
Time to disclosure: Months
  1. Primary: Large language models perpetuate healthcare racial bias - AIAAIC (aiaaic.org)
  2. Primary: Unmasking and Quantifying Racial Bias of Large Language Models (pubmed.ncbi.nlm.nih.gov)
  3. Primary: Racial bias in AI-mediated psychiatric diagnosis and treatment - PMC (pmc.ncbi.nlm.nih.gov)
Permalink: https://failureindex.ai/failures/large-language-models-perpetuate-racial-bias
Citation: AI Failure Index. "Large language models perpetuate racial bias in healthcare" (FI-0425). Realm Labs. https://failureindex.ai/failures/large-language-models-perpetuate-racial-bias (indexed Jun 10, 2026).

Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0425. Full dataset at /data.

Note from Realm Labs, the Index steward

How Realm would have caught this

Controls for this failure mode
  • Prism
  • OmniGuard
  • AI Detection & Response (AIDR)

A runtime layer that watches the model's internal state can flag the moment a model commits to a claim it has no support for, and hold or reroute the response before it reaches a user. Realm reads those signals in real time rather than grading the transcript after the fact.