Canada Revenue Agency's $18M Charlie chatbot gave wrong tax answers 66% of the time

The Canada Revenue Agency deployed an AI chatbot named Charlie that cost over $18 million to develop and operate since fiscal year 2018-19. An audit by Auditor General Karen Hogan found the chatbot provided correct answers in fewer than half of tested cases, with only 2 out of 6 questions answered accurately. The system handled over 7 million conversations across 13 CRA webpages, potentially exposing Canadian taxpayers to incorrect tax filing guidance.

Canada Revenue Agency · Incident Oct 21, 2025 · Indexed Jun 4, 2026 · 3 sources

A government tax chatbot that answered about two-thirds of audit test questions incorrectly potentially exposed millions of Canadians to bad filing advice from the very agency that could penalize them for errors.
What: The Canada Revenue Agency deployed an AI chatbot named Charlie that cost over $18 million to develop and operate since fiscal year 2018-19.
Incident date: Oct 21, 2025
Who: Canada Revenue Agency
Failure mode: Hallucination
AI surface: Chatbot
Severity: High

What happened

The Canada Revenue Agency deployed Charlie the Chatbot in March 2020 to provide tax filing guidance to Canadians, spending over $18 million on its development and operation. Auditor General Karen Hogan's audit, tabled October 21, 2025, found that Charlie answered only 2 of 6 test questions correctly, an accuracy rate of roughly 33 percent. The chatbot handled over 7 million conversations and 18 million questions from the public while remaining active on 13 CRA webpages. The AG report warned that errors in taxes and missed deadlines can be costly to taxpayers. Finance Minister François-Philippe Champagne subsequently demanded a 100-day service improvement plan from the CRA.
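The arithmetic behind the audit figure can be checked directly. The Python sketch below computes the accuracy implied by 2 correct answers out of 6, and adds a 95 percent Wilson score interval to show how much uncertainty a six-question sample carries. The interval is an illustrative addition and not part of the Auditor General's reported method; the function name wilson_interval is an assumption made for this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width


# Figures from the audit as reported in this record: 2 correct out of 6 questions.
correct, asked = 2, 6
accuracy = correct / asked
low, high = wilson_interval(correct, asked)

print(f"accuracy: {accuracy:.1%}, error rate: {1 - accuracy:.1%}")
print(f"95% Wilson interval for accuracy: {low:.1%} to {high:.1%}")
```

With only six questions the interval spans several tens of percentage points, so the audit result is better read as a warning sign than as a precise measurement of Charlie's accuracy across all conversations.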

What broke inside the model

Failure path · mode profile · Hallucination
  1. Trigger: A user asks for a fact, a citation, or a figure.
  2. Model step: The model writes a fluent, confident answer.
  3. Control gap: Nothing ties the claim back to a real source.
  4. Failure: A fabricated fact ships as if it were verified.
  5. Consequence: The false claim reaches a customer, a court, or the public.

Confidence holds, and even spikes, as the claim detaches from any source.

Charlie the Chatbot generated brief, context-poor responses that were factually wrong in the majority of test cases, which suggests the underlying model did not reliably retrieve and convey accurate tax information. Before the generative AI upgrade, the CRA had set an accuracy threshold of only 70 percent, in effect accepting a 30 percent error rate as a baseline. The accuracy of the upgraded GenAI version could not be precisely determined without a comprehensive review. The system lacked sufficient guardrails or verification mechanisms to prevent incorrect tax guidance from reaching the public at scale.
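One way to picture the missing verification step is a guard that releases a drafted answer only when it can be tied to a source passage, and otherwise escalates to a human agent. The Python sketch below is a minimal illustration under stated assumptions, not a description of how Charlie or any CRA system works. The names Passage, token_overlap, and release_or_escalate, the 0.6 threshold, and the example.invalid URL are all hypothetical, and simple word overlap stands in for whatever support check a production system would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Passage:
    source_url: str  # where the passage came from (placeholder in this sketch)
    text: str


def token_overlap(claim: str, passage: str) -> float:
    """Fraction of the claim's distinct words that also appear in the passage."""
    claim_words = set(claim.lower().split())
    if not claim_words:
        return 0.0
    passage_words = set(passage.lower().split())
    return len(claim_words & passage_words) / len(claim_words)


def release_or_escalate(draft: str, passages: list[Passage], min_overlap: float = 0.6) -> dict:
    """Release the draft only if some passage supports it; otherwise escalate."""
    best: Optional[Passage] = max(
        passages, key=lambda p: token_overlap(draft, p.text), default=None
    )
    if best is None or token_overlap(draft, best.text) < min_overlap:
        # No supporting source: hold the answer instead of shipping it.
        return {"status": "escalate", "answer": None, "source": None}
    return {"status": "release", "answer": draft, "source": best.source_url}


# Hypothetical placeholder content, not real tax guidance.
passages = [Passage("https://example.invalid/guide", "Form X must be filed by the end of the month")]
supported = release_or_escalate("form x must be filed by the end of the month", passages)
unsupported = release_or_escalate("you can skip the form entirely", passages)
print(supported["status"], unsupported["status"])
```

Word overlap is a crude proxy that misses paraphrases and negations, so a real deployment would need a stronger support check. The design point is only that an unsupported answer gets held or rerouted rather than delivered.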

Public visibility: High
Regulatory exposure: Possible
Customer impact: Class-wide
Financial impact: Disclosed
Time to disclosure: Months
  1. Primary: Canada Revenue Agency Contact Centres - Auditor General of Canada Report (canada.ca)
  2. Press: The CRA spent $18 million on Charlie, its new tax information chatbot (nationalpost.com)
  3. Press: In scathing report, AG finds CRA call centres are slow to answer and often inaccurate (cbc.ca)
Permalink: https://failureindex.ai/failures/canada-revenue-agency-18m-charlie-chatbot
Citation: AI Failure Index. "Canada Revenue Agency's $18M Charlie chatbot gave wrong tax answers 66% of the time" (FI-0145). Realm Labs. https://failureindex.ai/failures/canada-revenue-agency-18m-charlie-chatbot (indexed Jun 4, 2026).

Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0145. Full dataset at /data.

Note from Realm Labs, the Index steward

How Realm would have caught this

Controls for this failure mode
  • Prism
  • OmniGuard
  • AI Detection & Response (AIDR)

A runtime layer that watches the model's internal state can flag the moment a model commits to a claim it has no support for, and hold or reroute the response before it reaches a user. Realm reads those signals in real time rather than grading the transcript after the fact.