UK GOV.UK Chat gave citizens incorrect tax, VAT, and immigration advice in its alpha pilot

The UK Government Digital Service's GOV.UK Chat prototype produced inaccurate or misleading responses during a private pilot with approximately 1,000 users; its earliest accuracy benchmark was only 76%. The system gave incorrect advice on tax, VAT registration, the EU Settlement Scheme, and flight refunds before GDS added filters to block certain question categories. The Times later reported that the chatbot gave misleading tax information, drawing criticism from tax professionals.

UK Government Digital Service (GDS) · Incident Jan 18, 2024 · Indexed Jun 4, 2026 · 3 sources

A RAG chatbot built to ground every answer in official government content still hallucinated wrong advice whenever the retrieved context fell short.
What: The UK Government Digital Service's GOV.UK Chat prototype produced inaccurate or misleading responses during a private pilot with approximately 1,000 users; its earliest accuracy benchmark was only 76%.
Incident date: Jan 18, 2024
Who: UK Government Digital Service (GDS)
Failure mode: Hallucination
AI surface: Chatbot
Severity: High

What happened

During a private pilot with approximately 1,000 users in late 2023, the GOV.UK Chat prototype gave citizens inaccurate or misleading responses on topics including tax, VAT registration, the EU Settlement Scheme, and flight refunds. GDS published findings acknowledging that answers were not accurate enough and that the system made outright mistakes; the earliest accuracy benchmark was just 76%. GDS subsequently added filters and rules to stop the chatbot from answering certain question categories before broader pilot deployment. The Times later reported that the chatbot gave misleading tax information, drawing criticism from tax professionals.

What broke inside the model

Failure path · mode profile · Hallucination
  1. Trigger: A user asks for a fact, a citation, or a figure.
  2. Model step: The model writes a fluent, confident answer.
  3. Control gap: Nothing ties the claim back to a real source.
  4. Failure: A fabricated fact ships as if it were verified.
  5. Consequence: The false claim reaches a customer, a court, or the public.

Confidence holds, and even spikes, as the claim detaches from any source.

The retrieval-augmented generation system combined semantic search over GOV.UK content with a large language model, but when retrieved context was incomplete or ambiguous, the LLM overgeneralized and produced responses not strictly grounded in source material. The model generated hallucinated hyperlinks and failed to reliably distinguish between topics where published guidance was sufficient and those where it was too limited to support a confident answer. GDS identified core failure areas including groundedness, factual accuracy, factual completeness, and reputational safety.
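One of the failures described above, hallucinated hyperlinks, lends itself to a simple post-generation check. The following is a minimal sketch, not GDS's actual filter: it assumes the retrieved GOV.UK passages are available as plain strings and that the retriever exposes a similarity score for its best match. The names `ungrounded_links`, `release_or_abstain`, `FALLBACK`, and the `min_score` threshold of 0.75 are all hypothetical, as are the example URLs. It covers links only; checking that factual claims are grounded would need considerably more than this.

```python
import re

# Matches http(s) URLs up to the next whitespace or closing bracket/quote.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

# Hypothetical fallback text; the wording is an assumption for this sketch.
FALLBACK = "I can't answer that reliably from GOV.UK guidance."


def extract_urls(text: str) -> set[str]:
    """Return every http(s) URL in text, minus trailing punctuation."""
    return {url.rstrip(".,;:!?") for url in URL_PATTERN.findall(text)}


def ungrounded_links(answer: str, retrieved_chunks: list[str]) -> set[str]:
    """URLs present in the answer but absent from every retrieved chunk."""
    source_urls: set[str] = set()
    for chunk in retrieved_chunks:
        source_urls |= extract_urls(chunk)
    return extract_urls(answer) - source_urls


def release_or_abstain(answer, retrieved_chunks, top_score, min_score=0.75):
    """Release the answer only if retrieval looked strong and all links are sourced."""
    if top_score < min_score:
        # Retrieved context is too weak to support a confident answer.
        return FALLBACK
    if ungrounded_links(answer, retrieved_chunks):
        # The draft cites at least one link that no retrieved passage contains.
        return FALLBACK
    return answer


# Made-up example strings: one sourced link, one invented by the model.
chunks = ["Guidance is published at https://www.gov.uk/example-guidance for applicants."]
draft = "See https://www.gov.uk/example-guidance and https://www.gov.uk/invented-page."
print(ungrounded_links(draft, chunks))
print(release_or_abstain(draft, chunks, top_score=0.9))
```

Abstaining on a weak top score is a crude stand-in for the harder problem the record names: telling apart topics where published guidance is sufficient from those where it is too thin to answer from.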

Public visibility: High
Regulatory exposure: Possible
Customer impact: Many customers
Financial impact: Unknown
Time to disclosure: Months
  1. Primary: "5 things we learned testing GOV.UK Chat: an AI assistant for government" (insidegovuk.blog.gov.uk)
  2. Press: "New government AI chatbot 'gives misleading tax information'" (thetimes.com)
  3. Press: "GOV.UK AI chatbot achieves 90% accuracy" (civilserviceworld.com)
Permalink: https://failureindex.ai/failures/uk-gov-uk-chat-gave-citizens
Citation: AI Failure Index. "UK GOV.UK Chat gave citizens incorrect tax, VAT, and immigration advice in its alpha pilot" (FI-0108). Realm Labs. https://failureindex.ai/failures/uk-gov-uk-chat-gave-citizens (indexed Jun 4, 2026).

Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0108. Full dataset at /data.

Note from Realm Labs, the Index steward

How Realm would have caught this

Controls for this failure mode
  • Prism
  • OmniGuard
  • AI Detection & Response (AIDR)

A runtime layer that watches the model's internal state can flag the moment a model commits to a claim it has no support for, and hold or reroute the response before it reaches a user. Realm reads those signals in real time rather than grading the transcript after the fact.