US State Education Departments' automated essay scoring found biased against some groups

Automated essay scoring engines were used in many U.S. state standardized tests, and multiple investigations and research studies found systematic score differences across demographic groups. Reporting and peer-reviewed analysis (including an ETS technical study) showed that some engines gave higher average scores to certain groups and lower scores to others, and that some systems could be fooled by nonsense text.

US State Education Departments · Incident Jun 30, 2018 · Indexed Jun 9, 2026 · 3 sources

The models learned scoring shortcuts in the training data: measurable surface features that correlated with demographics rather than true writing quality.
What: Automated essay scoring engines were used in many U.S. state standardized tests.
Incident date: Jun 30, 2018
Who: US State Education Departments
Failure mode: Policy Violation
AI surface: Algorithmic Decision
Severity: High

What happened

Beginning in at least 2018, many U.S. state education departments adopted automated essay scoring engines to grade written responses on standardized tests. Media investigations and technical reports documented that these systems produced mean score differences by demographic group and sometimes relied on surface features like sentence length and vocabulary rather than deeper meaning. Research published by ETS and reporting by outlets such as NPR and Motherboard/Vice described cases where the e-rater and other scoring engines gave systematically different scores for essays from different demographic groups and where some nonsense or formulaic essays could receive high automated scores.
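
To make "mean score differences by demographic group" concrete, here is a minimal sketch of one way such a gap could be audited: take the machine score minus the human score for each essay, average it within each group, and express it in units of the human-score standard deviation. The function name, record layout, and data are invented for illustration; this is not the method used in the ETS report or by any state.

```python
from collections import defaultdict
from statistics import mean, pstdev


def mean_score_differences(records):
    """Per group, return mean(machine - human) divided by the standard
    deviation of all human scores (a standardized mean difference)."""
    sd = pstdev([r["human"] for r in records])
    if sd == 0:
        raise ValueError("human scores have no spread; cannot standardize")
    diffs_by_group = defaultdict(list)
    for r in records:
        diffs_by_group[r["group"]].append(r["machine"] - r["human"])
    return {group: mean(d) / sd for group, d in diffs_by_group.items()}


# Invented toy data: groups "A" and "B" are placeholders, not real cohorts.
records = [
    {"group": "A", "human": 4, "machine": 5},
    {"group": "A", "human": 3, "machine": 4},
    {"group": "B", "human": 4, "machine": 3},
    {"group": "B", "human": 3, "machine": 3},
]

diffs = mean_score_differences(records)
for group, d in sorted(diffs.items()):
    print(f"group {group}: standardized machine-human difference = {d:+.2f}")
```

A positive value means the engine scored that group above the human raters on average, and a negative value means below. A persistent gap of opposite sign between groups is the kind of pattern the reporting describes.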

What broke inside the model

Failure path · mode profile · Policy Violation
  1. Trigger: A prompt pushes against a deployment boundary.
  2. Model step: The model produces the disallowed output.
  3. Control gap: No enforcement blocks it at generation time.
  4. Failure: The output crosses the policy line.
  5. Consequence: A limit the business set is breached in public.

The output crosses a policy boundary the deployment had defined.

The scoring systems were statistical models trained to predict human-assigned scores, so they learned patterns in the training data that correlated with demographics rather than strictly with writing quality. The engines emphasized measurable surface features such as sentence length, vocabulary complexity, spelling, and grammar, which can disadvantage English-language learners and other groups who write differently. In some deployments, a large share of essays was scored by the machine alone with limited human review, creating a pathway for biased or gamed scores to affect outcomes.
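
The reliance on surface features can be illustrated with a deliberately simplified toy scorer. The sketch below is an assumption-laden illustration, not a reconstruction of e-rater or any deployed engine: the two features, the weights, and the sample texts are all hypothetical. It shows how a score built only from sentence length and word length can rank meaningless, inflated prose above a short, clear argument.

```python
import re


def surface_features(text):
    """Compute two surface features: words per sentence and letters per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"avg_sentence_len": 0.0, "avg_word_len": 0.0}
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
    }


# Hypothetical weights; a real engine would fit these to human-assigned scores.
WEIGHTS = {"avg_sentence_len": 0.1, "avg_word_len": 0.5}


def toy_score(text):
    """Weighted sum of surface features; meaning is never inspected."""
    features = surface_features(text)
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


clear = "Schools should start later. Teens need sleep. Rested students learn more."
nonsense = (
    "Notwithstanding multifarious epistemological paradigms, the "
    "quintessential infrastructure perpetually necessitates "
    "incontrovertible juxtaposition of heterogeneous methodologies."
)

print(f"clear essay:    {toy_score(clear):.2f}")
print(f"nonsense essay: {toy_score(nonsense):.2f}")
```

Because nothing in the score depends on whether the text makes sense, the nonsense sample outscores the clear one. The same mechanism penalizes writers whose style is plainer or shorter for reasons unrelated to the quality of their argument.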

Public visibility: High
Regulatory exposure: Possible
Customer impact: Many customers
Financial impact: Unknown
Time to disclosure: Months
  1. Press: Flawed Algorithms Are Grading Millions of Students’ Essays (vice.com)
  2. Press: More States Opting To 'Robo-Grade' Student Essays By Computer (npr.org)
  3. Primary: Understanding Mean Score Differences Between the e-rater® Automated Scoring Engine and Humans for Demographically Based Groups in the GRE® General Test, ETS research report (onlinelibrary.wiley.com)
Permalink: https://failureindex.ai/failures/state-education-departments-automated-essay-scoring
Citation: AI Failure Index. "US State Education Departments' automated essay scoring found biased against some groups" (FI-0347). Realm Labs. https://failureindex.ai/failures/state-education-departments-automated-essay-scoring (indexed Jun 9, 2026).

Data fields CC-BY 4.0, prose citation permitted. Incident ID FI-0347. Full dataset at /data.

Note from Realm Labs, the Index steward

How Realm fits

Controls for this failure mode
  • Prism
  • OmniGuard

This entry sits in the index's predictive wing: a system that scores, ranks, perceives, or steers rather than generates. Realm's runtime layer is built for the generative and agentic systems now moving into these same decision seats, where it watches a model's internal state and holds an unsupported claim or an unchecked action before it commits. The control gap on this record, an automated decision that reached people with no runtime check in front of it, is the same gap. The index keeps predictive failures on the record because the pattern carries straight into the systems shipping today.