UK GOV.UK Chat gave citizens incorrect tax, VAT, and immigration advice in its alpha pilot
The UK Government Digital Service's GOV.UK Chat prototype produced inaccurate or misleading responses during a private pilot with approximately 1,000 users, scoring only 76% accuracy at its earliest benchmark. The system gave incorrect advice on tax, VAT registration, the EU Settlement Scheme, and flight refunds before GDS added filters to block certain question categories. The Times later reported that the chatbot gave misleading tax information, drawing criticism from tax professionals.
A RAG chatbot built to ground every answer in official government content still hallucinated wrong advice whenever the retrieved context fell short.
Key facts
- What: The UK Government Digital Service's GOV.UK Chat prototype produced inaccurate or misleading responses during a private pilot with approximately 1,000 users, scoring only 76% accuracy at its earliest benchmark.
- Incident date: Jan 18, 2024
- Who: UK Government Digital Service (GDS)
- Failure mode: Hallucination
- AI surface: Chatbot
- Severity: High
What happened
During a private pilot with approximately 1,000 users in late 2023, the GOV.UK Chat prototype gave citizens inaccurate or misleading responses on topics including tax, VAT registration, the EU Settlement Scheme, and flight refunds. GDS published findings acknowledging that answers were not accurate enough and that the system made outright mistakes, with the earliest accuracy benchmark at just 76%. GDS subsequently added filters and rules to prevent the chatbot from answering certain question categories before broader pilot deployment. The Times later reported that the chatbot gave misleading tax information, drawing criticism from tax professionals.
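A category filter of the kind described above could look roughly like the following. This is a minimal sketch, not GDS's implementation: the category names, the keyword patterns, and the `route` function are all hypothetical, and a production filter would more likely use a trained classifier than regular expressions.

```python
import re
from typing import Optional

# Hypothetical blocked categories and keyword patterns (illustrative only;
# the source does not say which categories or rules GDS actually used).
BLOCKED_PATTERNS = {
    "tax": re.compile(r"\b(tax|hmrc|self[- ]assessment)\b", re.IGNORECASE),
    "vat": re.compile(r"\bvat\b", re.IGNORECASE),
    "settlement": re.compile(r"\b(eu settlement|settled status)\b", re.IGNORECASE),
    "refund": re.compile(r"\bflight refunds?\b", re.IGNORECASE),
}


def blocked_category(question: str) -> Optional[str]:
    """Return the first blocked category the question matches, or None."""
    for name, pattern in BLOCKED_PATTERNS.items():
        if pattern.search(question):
            return name
    return None


def route(question: str) -> str:
    """Decide whether the chatbot may answer before any model call is made."""
    category = blocked_category(question)
    if category is not None:
        # Refuse up front and point the user to official guidance instead.
        return f"REFUSE:{category}"
    return "ANSWER"
```

Running the filter before retrieval and generation means a blocked question never reaches the model, at the cost of refusing some questions the system could have answered correctly.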
What broke inside the model
- 01 · Trigger: A user asks for a fact, a citation, or a figure.
- 02 · Model step: The model writes a fluent, confident answer.
- 03 · Control gap: Nothing ties the claim back to a real source.
- 04 · Failure: A fabricated fact ships as if it were verified.
- 05 · Consequence: The false claim reaches a customer, a court, or the public.
Confidence holds, and even spikes, as the claim detaches from any source.
The retrieval-augmented generation system combined semantic search over GOV.UK content with a large language model, but when retrieved context was incomplete or ambiguous, the LLM overgeneralized and produced responses not strictly grounded in source material. The model generated hallucinated hyperlinks and failed to reliably distinguish between topics where published guidance was sufficient and those where it was too limited to support a confident answer. GDS identified core failure areas including groundedness, factual accuracy, factual completeness, and reputational safety.
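Two of the gaps described above, answering on thin context and emitting hyperlinks that were never retrieved, can be illustrated with simple post-retrieval checks. The sketch below is an assumption-laden illustration rather than the GOV.UK Chat design: the function names, the similarity-score input, and the `min_score` and `min_hits` thresholds are all hypothetical.

```python
import re
from typing import Iterable, List, Set

# Matches http(s) URLs up to whitespace or a closing bracket or quote.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


def context_is_sufficient(scores: Iterable[float],
                          min_score: float = 0.75,
                          min_hits: int = 2) -> bool:
    """True if enough retrieved chunks clear the similarity threshold.

    Both thresholds are hypothetical and would need tuning on real data.
    """
    strong = [s for s in scores if s >= min_score]
    return len(strong) >= min_hits


def ungrounded_links(answer: str, retrieved_urls: Set[str]) -> List[str]:
    """Return URLs in the answer that do not appear among the retrieved sources."""
    # Strip trailing sentence punctuation that the regex may have captured.
    found = [u.rstrip(".,;:") for u in URL_RE.findall(answer)]
    # Compare without trailing slashes so equivalent URLs match.
    known = {u.rstrip("/") for u in retrieved_urls}
    return [u for u in found if u.rstrip("/") not in known]
```

A caller might decline to answer when `context_is_sufficient` returns `False`, and strip or block a response whenever `ungrounded_links` returns a non-empty list. Neither check verifies that the prose itself is supported by the sources; they only catch the two narrower symptoms named here.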
Sources
- Primary: 5 things we learned testing GOV.UK Chat: an AI assistant for government (insidegovuk.blog.gov.uk)
- Press: New government AI chatbot 'gives misleading tax information' (thetimes.com)
- Press: GOV.UK AI chatbot achieves 90% accuracy (civilserviceworld.com)
Cite this entry
https://failureindex.ai/failures/uk-gov-uk-chat-gave-citizens
AI Failure Index. "UK GOV.UK Chat gave citizens incorrect tax, VAT, and immigration advice in its alpha pilot" (FI-0108). Realm Labs. https://failureindex.ai/failures/uk-gov-uk-chat-gave-citizens (indexed Jun 4, 2026).
Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0108. Full dataset at /data.
Note from Realm Labs, the Index steward
How Realm would have caught this
- Prism
- OmniGuard
- AI Detection & Response (AIDR)
A runtime layer that watches the model's internal state can flag the moment a model commits to a claim it has no support for, and hold or reroute the response before it reaches a user. Realm reads those signals in real time rather than grading the transcript after the fact.