BMJ Open study finds half of leading chatbots give problematic medical advice

A BMJ Open study of five major chatbots found that about half of their medical answers were problematic and 19.6% were highly problematic, with false balance a notable failure; Bloomberg and NBC News covered the findings.

OpenAI; Google; xAI; DeepSeek; Meta AI · Incident Apr 1, 2026 · Indexed Jun 8, 2026 · 4 sources

False balance in AI medical guidance; unproven therapies are presented as evidence-based options.
What: A BMJ Open study of five major chatbots found that about half of their medical answers were problematic and 19.6% were highly problematic, with false balance a notable failure; Bloomberg and NBC News covered the findings.
Incident date: Apr 1, 2026
Who: OpenAI; Google; xAI; DeepSeek; Meta AI
Failure mode: Hallucination
AI surface: Chatbot
Severity: High

What happened

A peer-reviewed BMJ Open study published in April 2026 evaluated five leading chatbots: ChatGPT, Gemini, Grok, DeepSeek, and Meta AI. It found that about half of their medical responses were problematic, with 19.6% rated as highly problematic due to inaccuracies. A notable failure was false balance, where the AI presented unproven therapies as if they were evidence-based options. The study attributes this to training on broad web data and a tendency to align with user beliefs.

What broke inside the model

Failure path · mode profile · Hallucination
  1. Trigger: A user asks for a fact, a citation, or a figure.
  2. Model step: The model writes a fluent, confident answer.
  3. Control gap: Nothing ties the claim back to a real source.
  4. Failure: A fabricated fact ships as if it were verified.
  5. Consequence: The false claim reaches a customer, a court, or the public.

Confidence holds, and even spikes, as the claim detaches from any source.
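
The control gap in the failure path above (nothing ties the claim back to a real source) can be illustrated with a minimal Python sketch. It is not drawn from the study or from any vendor's tooling: the `Claim` type, the `source_ids` field, the `split_by_support` helper, and the sample claims and source IDs are all hypothetical, and the sketch assumes claims have already been extracted from the model's answer and matched to candidate sources.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    # Hypothetical field: IDs of sources said to support this claim.
    source_ids: list[str] = field(default_factory=list)


def split_by_support(
    claims: list[Claim], known_sources: set[str]
) -> tuple[list[Claim], list[Claim]]:
    """Return (supported, unsupported) claims.

    A claim counts as supported only if at least one of its source IDs
    appears in known_sources. A claim with no source IDs is unsupported.
    """
    supported: list[Claim] = []
    unsupported: list[Claim] = []
    for claim in claims:
        if any(sid in known_sources for sid in claim.source_ids):
            supported.append(claim)
        else:
            unsupported.append(claim)
    return supported, unsupported


if __name__ == "__main__":
    # Made-up placeholder data, for illustration only.
    known = {"trial-001"}
    claims = [
        Claim("Therapy A improved outcomes in a randomized trial.", ["trial-001"]),
        Claim("Therapy B is an evidence-based alternative."),
    ]
    _, flagged = split_by_support(claims, known)
    for claim in flagged:
        print("HOLD:", claim.text)
```

A check like this only closes the gap if unsupported claims are held back or rewritten before the answer ships; it does not judge whether a cited source is itself reliable.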

The study notes that chatbots are trained on broad web data including social media and forums, which can bias responses toward balanced yet inaccurate representations of evidence; models may align with user beliefs (sycophancy) rather than strict scientific accuracy.

Public visibility: Medium
Regulatory exposure: None
Customer impact: Few customers
Financial impact: Unknown
Time to disclosure: Days
  1. Primary: BMJ Open study finds AI chatbots provide poor answers to medical questions half the time, study finds (bmjopen.bmj.com)
  2. Press: AI chatbots give misleading medical advice about half the time, study finds (bloomberg.com)
  3. Press: AI chatbots will tell you where to find alternatives to chemotherapy if you ask them, a new study finds (nbcnews.com)
  4. Press: AI chatbots provide poor answers to medical questions half the time, study finds (cidrap.umn.edu)
Permalink: https://failureindex.ai/failures/bmj-open-study-finds-half-leading
Citation: AI Failure Index. "BMJ Open study finds half of leading chatbots give problematic medical advice" (FI-0324). Realm Labs. https://failureindex.ai/failures/bmj-open-study-finds-half-leading (indexed Jun 8, 2026).

Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0324. Full dataset at /data.

Note from Realm Labs, the Index steward

How Realm would have caught this

Controls for this failure mode
  • Prism
  • OmniGuard
  • AI Detection & Response (AIDR)

A runtime layer that watches the model's internal state can flag the moment a model commits to a claim it has no support for, and hold or reroute the response before it reaches a user. Realm reads those signals in real time rather than grading the transcript after the fact.
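
As a rough illustration of the hold-or-reroute decision described above, the Python sketch below maps a support score to an action. It is not Realm's implementation or API: the `Action` enum, the `gate_response` function, the score scale, and both thresholds are hypothetical, and the sketch takes the score as a given input rather than showing how it would be derived from a model's internal state.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"
    REROUTE = "reroute"
    HOLD = "hold"


def gate_response(
    support_score: float,
    hold_below: float = 0.3,     # illustrative threshold, not a tuned value
    reroute_below: float = 0.6,  # illustrative threshold, not a tuned value
) -> Action:
    """Map a support score in [0, 1] to a delivery decision.

    Scores below hold_below are held, scores below reroute_below are
    rerouted (for example to a human reviewer), and the rest are delivered.
    """
    if not 0.0 <= support_score <= 1.0:
        raise ValueError("support_score must be between 0 and 1")
    if support_score < hold_below:
        return Action.HOLD
    if support_score < reroute_below:
        return Action.REROUTE
    return Action.DELIVER
```

For example, `gate_response(0.1)` returns `Action.HOLD` under these default thresholds, while `gate_response(0.9)` returns `Action.DELIVER`.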