Key Takeaways
- We tested EssayHero against 119 official HKEAA exemplar essays spanning Levels 1-5 (2020-2025)
- The primary metric is level-to-level comparison: sum AI criterion scores, map to a level, compare against the official HKEAA level
- Per-criterion comparisons are secondary and come with important caveats
- We report Quadratic Weighted Kappa with 95% bootstrap confidence intervals
- Limitations are disclosed in full below
Why Validation Matters
If you're going to trust an AI to score your essays, you need evidence it works. Not marketing claims. Data.
Over my years of teaching, I've seen students rely on tools that promise accurate feedback but never explain how they measured accuracy. That's not good enough. If I'm asking students to use EssayHero as a practice tool, I need to hold it to the same standard I'd expect of a human second marker.
This post explains how we test our scoring, presents the results with honest interpretation, and is transparent about what the numbers don't tell you. If you're a teacher evaluating whether to recommend EssayHero, or a student deciding whether to trust the feedback, this is the evidence.
The Ground Truth Corpus
Our validation corpus consists of 119 Part B essays from the HKEAA's official Samples of Candidates' Performance booklets, published annually for teacher moderation and marker training. These are not random student essays. They are exemplars selected by the chief examiner's panel to represent each performance level definitively.
| Level | Essays | Description |
|---|---|---|
| Level 1 | 23 | Below threshold performance |
| Level 2 | 24 | Limited performance |
| Level 3 | 24 | Adequate performance |
| Level 4 | 24 | Good performance |
| Level 5 | 24 | Excellent performance |
The corpus covers six exam years (2020-2025). Both the pre-2024 format (Questions 2-9) and the revised 2024 format (Questions 2-5) are represented.
Why Part B only? Part A is compulsory guided writing that includes visual prompts (images, charts, maps). Since EssayHero analyses text-only submissions, Part B essays are the appropriate benchmark.
Why HKEAA exemplars? These are the gold standard for HKDSE marking. They are used to train and calibrate human markers across Hong Kong. When two examiners disagree on a borderline case, they turn to these booklets. If you want to know what a Level 4 essay looks like, these are the definitive examples.
One important detail about how the HKEAA grades these essays: each exemplar receives a single holistic level (1-5). The HKEAA does not publish separate scores for Content, Language & Style, and Organisation. This has significant implications for how we measure accuracy, which I'll explain in the methodology section.
Methodology
Blind Evaluation
Each essay was submitted to EssayHero's production analysis pipeline — the exact same system that scores student submissions — with no knowledge of the human grade. The AI received only the essay text and the official question prompt. No hints, no level information, no special treatment.
For each essay, the AI produced scores on three criteria (each on a 1-7 scale):
- Content — Task completion, idea development, relevance
- Language & Style — Vocabulary range, grammar accuracy, register
- Organisation — Structure, coherence, paragraphing
Primary Metric: Level-to-Level Comparison
Because HKEAA assigns a single holistic level per essay, our primary accuracy measure is a level-to-level comparison.
The process:
- The AI scores three criteria (Content, Language & Style, Organisation), each 1-7
- We sum the three scores to get a total (range: 3-21)
- We map the total to a level using the official HKDSE grading scale:
| Total Score | Level |
|---|---|
| 20-21 | 7 |
| 17-19 | 6 |
| 14-16 | 5 |
| 11-13 | 4 |
| 8-10 | 3 |
| 5-7 | 2 |
| 3-4 | 1 |
- Since our HKEAA exemplars only go up to Level 5, we clamp AI-predicted levels to 1-5 for fair comparison. An AI prediction of Level 6 or 7 is treated as Level 5.
- We compare the clamped AI level against the HKEAA level.
This is the fairest comparison we can make. The AI and the HKEAA are answering the same question: "What level is this essay?"
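As a minimal sketch of the sum-and-map step (the function names are my own, not EssayHero's internal API), the procedure looks like this:

```python
def total_to_level(total: int) -> int:
    """Map a summed criterion score (3-21) onto the 1-7 level scale."""
    bands = [(20, 7), (17, 6), (14, 5), (11, 4), (8, 3), (5, 2), (3, 1)]
    for floor, level in bands:
        if total >= floor:
            return level
    raise ValueError(f"total {total} is outside the valid range 3-21")

def predicted_level(content: int, language_style: int, organisation: int) -> int:
    """Sum the three 1-7 criterion scores, map to a level, clamp to 1-5."""
    return min(total_to_level(content + language_style + organisation), 5)
```

For example, criterion scores of 4, 4 and 5 sum to 13, which maps to Level 4; a perfect 7/7/7 maps to Level 7 but is clamped to Level 5 for comparison against the exemplar corpus.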
Secondary Metric: Per-Criterion Comparison
EssayHero gives separate scores for Content, Language & Style, and Organisation. But the HKEAA only gives one holistic level. How do you compare criterion-level accuracy when the ground truth doesn't break down that way?
Our approach: we use the HKEAA's holistic level as an approximate ground truth for each criterion. If an essay is rated Level 3 overall, we treat that as a "3" for Content, a "3" for Language & Style, and a "3" for Organisation.
This is an imperfect proxy. Real essays have uneven criterion strengths. A student might have strong content ideas but weak grammar, or excellent organisation but limited vocabulary. By assigning the same level to all three criteria, we're smoothing over exactly the kind of variation that makes essays interesting.
The per-criterion comparison is still useful as a directional signal. If the AI systematically scores Language & Style lower than the holistic level, that tells us something about the AI's tendencies. But the absolute numbers should be interpreted with caution. A large discrepancy between AI and "ground truth" on a single criterion might reflect the AI correctly identifying uneven strengths that the holistic level masks.
We report per-criterion metrics as secondary analysis, clearly labelled as such.
Metrics
We use standard metrics from Automated Essay Scoring (AES) research:
Quadratic Weighted Kappa (QWK) is the primary statistic. QWK measures agreement between two raters, corrected for chance. Unlike simple percentage agreement, QWK accounts for the ordinal nature of scores — a 2-level disagreement is penalised more heavily than a 1-level disagreement. This is the same metric used in the Kaggle Automated Student Assessment Prize (ASAP) competition and in peer-reviewed AES literature.
Mean Absolute Error (MAE) is the average absolute difference between AI and human grades. An MAE of 0.5 means the AI is, on average, half a level off.
Exact Match Rate is the percentage of cases where the AI level equals the HKEAA level exactly.
Within-One Rate is the percentage of cases where the AI level is within one level of the HKEAA grade. This is the most practically relevant metric for students: if the AI says Level 3, the real level is probably 2, 3, or 4.
Bias indicates whether the AI systematically grades higher (positive) or lower (negative) than human examiners.
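All five metrics are straightforward to compute from paired integer levels. A self-contained sketch (illustrative only, not EssayHero's production code):

```python
def quadratic_weighted_kappa(ai, human, k=5):
    """QWK for two equal-length lists of integer levels in 1..k."""
    n = len(ai)
    observed = [[0] * k for _ in range(k)]
    for a, h in zip(ai, human):
        observed[a - 1][h - 1] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic distance penalty
            num += weight * observed[i][j]
            den += weight * row[i] * col[j] / n   # chance-expected counts
    return 1.0 - num / den

def agreement_metrics(ai, human):
    """Exact match, within-one, MAE and bias for paired levels."""
    diffs = [a - h for a, h in zip(ai, human)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        "mae": sum(abs(d) for d in diffs) / n,
        "bias": sum(diffs) / n,  # positive = AI grades higher than the HKEAA
    }
```

Note how the quadratic weight makes QWK behave as described: identical ratings contribute nothing to the disagreement numerator, two raters agreeing only by chance score near 0, and a 2-level disagreement costs four times as much as a 1-level one.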
Statistical Rigour
All QWK values are reported with 95% bootstrap confidence intervals (1,000 iterations, percentile method, seeded PRNG for reproducibility). Confidence intervals tell you the range within which the true agreement likely falls, given sampling variability. With 119 essays, these intervals are informative but not narrow — a larger corpus would tighten them.
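The percentile bootstrap described above can be sketched in a few lines (a minimal illustration with assumed parameter names, not our actual pipeline code):

```python
import random

def bootstrap_ci(ai, human, statistic, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for a paired statistic.

    `statistic` is any function of (ai, human), e.g. QWK or exact-match
    rate. Seeding the PRNG makes the resampling reproducible.
    """
    rng = random.Random(seed)
    n = len(ai)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        values.append(statistic([ai[i] for i in idx], [human[i] for i in idx]))
    values.sort()
    lower = values[int(n_boot * alpha / 2)]
    upper = values[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper
```

With `alpha=0.05` and 1,000 iterations, the interval spans the 2.5th to 97.5th percentiles of the resampled statistic, which is why a small corpus produces wide but honest intervals.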
Interpreting QWK
Before presenting results, it's worth understanding what QWK values actually mean. Landis and Koch (1977) proposed the following benchmark scale, which remains widely cited:
| QWK Range | Interpretation |
|---|---|
| 0.81 - 1.00 | Almost perfect agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.21 - 0.40 | Fair agreement |
| 0.00 - 0.20 | Slight agreement |
For context: when two trained human markers score the same essay independently, they typically achieve QWK in the range of 0.60-0.80 (Shermis & Hamner, 2012). Human markers disagree with each other regularly — by one level, sometimes two. Perfect agreement between any two raters, human or AI, is not a realistic expectation.
The Kaggle ASAP competition (2012), one of the largest public AES benchmarks, saw winning systems achieve QWK values between 0.70 and 0.81 depending on the essay prompt. These systems were trained on thousands of human-graded essays for each prompt. Our setup is different — we use a general-purpose language model with prompt engineering rather than a purpose-trained scoring model — so direct comparisons should be made cautiously.
Results
Results below are from our production evaluation run in February 2026 against 119 HKEAA exemplars using Google Gemini 3 Flash Preview (gemini-3-flash-preview) with ThinkingLevel.LOW, text-type-aware positive framing prompt, rewritten band descriptors (IP-safe original language), and baseline strictness. For the story of how we developed text-type-aware scoring, see How We Caught Our AI Being Too Harsh — and Fixed It.
Level-to-Level Comparison (Primary)
| Metric | Result | Interpretation |
|---|---|---|
| QWK (Level Agreement) | 0.833 | Almost perfect agreement |
| Within-One Rate | 98.3% | Nearly every essay |
| Exact Match Rate | 55.5% | Correct level prediction |
| Mean Absolute Error | 0.46 | Average levels off |
| Bias | -0.08 | Nearly zero systematic error |
| Sample Size | 119 | HKEAA exemplar essays |
What This Means
A QWK of 0.833 places EssayHero in the "almost perfect agreement" range on the Landis and Koch scale. This is comparable to agreement levels between trained human examiners.
Accuracy by Level
| HKEAA Level | N | Exact Match | Within-1 | MAE | Bias |
|---|---|---|---|---|---|
| Level 1 | 23 | 21.7% | 100% | 0.78 | +0.78 |
| Level 2 | 24 | 91.7% | 100% | 0.08 | +0.08 |
| Level 3 | 24 | 83.3% | 100% | 0.17 | -0.17 |
| Level 4 | 24 | 62.5% | 95.8% | 0.38 | -0.38 |
| Level 5 | 24 | 29.2% | 95.8% | 0.71 | -0.71 |
Thinking Level Comparison
We tested four thinking levels on the same 119-essay corpus. The default (LOW) achieves the best overall QWK, but MEDIUM thinking is more accurate for Level 4-5 essays. Teachers can opt into "Thorough scoring" in batch marking for classes with predominantly strong writers.
| Level | LOW (default) Exact | MEDIUM (thorough) Exact |
|---|---|---|
| Level 2 | 91.7% | 62.5% |
| Level 3 | 83.3% | 41.7% |
| Level 4 | 62.5% | 87.5% |
| Level 5 | 29.2% | 66.7% |
Per-Criterion Comparison (Secondary)
The following uses the HKEAA holistic level as an approximate per-criterion ground truth. As discussed in the methodology section, this is an imperfect proxy. Our calibration primarily measures level-to-level agreement; per-criterion analysis is directional.
Important Caveat
Per-criterion ground truth is approximated from holistic levels. Discrepancies may reflect the AI correctly identifying uneven criterion strengths rather than scoring errors.
What These Results Mean for You
For students
Nearly every AI prediction (98.3%) lands within one level of the official HKEAA grade. If the AI predicts Level 3, the HKEAA would almost certainly assign Level 2, 3, or 4.
The AI is strongest at Level 2-3 essays, where exact match rates exceed 83%. At Level 5, the AI tends to underscore slightly (within-one is still 95.8%, but exact match is 29.2% — it often predicts Level 4 instead of Level 5). If you are a strong writer and the AI gives you one level below what you expected, the qualitative feedback is still highly relevant.
Use the paragraph-by-paragraph feedback to understand the criteria and identify patterns in your writing. The qualitative feedback is often more useful than the level number itself.
For teachers
Our level-to-level QWK of 0.833 places EssayHero in the "almost perfect agreement" range on the Landis and Koch scale. For context, trained human markers typically achieve QWK of 0.60-0.80 when scoring independently, so EssayHero's agreement with the HKEAA chief examiner panel exceeds typical human inter-rater reliability.
The bias is nearly zero (-0.08), meaning the AI is equally likely to score slightly above or below the official grade. The remaining inaccuracy is concentrated at the extremes: Level 1 essays (small score range makes exact match difficult) and Level 5 essays (slight tendency to underscore by one level).
For classes with predominantly Level 4-5 students, teachers can enable "Thorough scoring" in batch marking. This uses a deeper thinking mode that achieves 87.5% exact match on Level 4 and 66.7% on Level 5, at the cost of lower accuracy on Level 2-3 essays.
We achieve this level of agreement with prompt engineering alone, using a general-purpose language model (Google Gemini 3 Flash Preview). No fine-tuning, no task-specific training data.
What the numbers don't tell you
- Your actual DSE score. Real exam marking involves moderation, standardisation, and human judgement that no AI can replicate.
- Performance on unusual writing. Our corpus consists of official exemplars. Highly unconventional essays may behave differently.
- Per-criterion ground truth. We don't have it. The per-criterion comparisons use the holistic level as an approximation.
Limitations
I believe in disclosing limitations upfront, not burying them in footnotes.
Single exam type. This validation covers HKDSE Paper 2 Part B only. IELTS validation is planned but not yet completed. Results should not be generalised to other exams.
Holistic ground truth. The HKEAA provides one level per essay, not separate criterion scores. Our primary metric (level-to-level) works within this constraint. Our secondary metric (per-criterion) uses the holistic level as an approximation, which introduces measurement error. We cannot know the true per-criterion accuracy without per-criterion human grades.
AI vs. examiner panel, not AI vs. individual marker. The HKEAA levels represent the definitive judgement of the chief examiner's panel, not an individual marker's opinion. Individual markers have their own variance. We're comparing the AI against the best available ground truth, but this is a higher bar than comparing against a single human rater.
Integer score granularity. Both AI and HKEAA scores are whole numbers. A student who is "borderline Level 3/4" will be forced into one or the other, making exact match harder at boundaries. QWK handles this better than exact match does, which is one reason we use it as the primary statistic.
Sample size. 119 essays across 5 levels (23-24 per level) is adequate for aggregate metrics but limits the reliability of per-level analysis. We plan to expand the corpus as HKEAA publishes new exemplar booklets.
Levels 1-5 only. The HKEAA does not publish exemplars at Level 5* or 5**. Our corpus therefore cannot test the AI's ability to distinguish between Level 5 and the starred levels. Since the AI uses a 1-7 criterion scale that maps up to Level 7, there is an untested range at the top end.
Temporal scope. The corpus spans 2020-2025. The HKDSE exam format was revised in 2024 (from Questions 2-9 to Questions 2-5). Both formats are represented, but we have not yet tested whether the AI performs differently on pre- and post-reform essays.
Continuous Improvement
This is not a one-off validation. The calibration pipeline is automated and runs whenever we update the scoring prompt:
- Submit all 119 essays to the updated prompt
- Compute level-to-level and per-criterion metrics automatically
- Compare against the previous prompt version
- Deploy only if accuracy is maintained or improved
This creates a tight feedback loop. If a prompt change causes the AI to underscore Level 4 essays, we catch it before students see the change. If the per-criterion bias shifts, we can trace it to the specific prompt edit that caused it.
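The deploy decision at the end of that loop amounts to a simple regression gate. A sketch under assumed names (the metric keys and the rule itself are illustrative, not EssayHero's actual configuration):

```python
def should_deploy(new_metrics: dict, old_metrics: dict) -> bool:
    """Accept a prompt change only if no tracked metric regresses.

    Higher is better for QWK and within-one rate; lower is better
    for mean absolute error.
    """
    return (
        new_metrics["qwk"] >= old_metrics["qwk"]
        and new_metrics["within_one"] >= old_metrics["within_one"]
        and new_metrics["mae"] <= old_metrics["mae"]
    )
```

A gate like this is deliberately conservative: a prompt edit that trades Level 4 accuracy for Level 2 accuracy can leave aggregate QWK unchanged, which is why the per-level table is also inspected before deploying.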
Gemini Migration (February 2026)
EssayHero originally used DeepSeek Chat for essay analysis, achieving a level QWK of 0.558 (moderate agreement). In February 2026, we migrated to Google Gemini 3 Flash Preview, which improved QWK to 0.833 (almost perfect agreement) — a substantial gain across all metrics.
The migration also resolved the persistent Level 4-5 underscoring problem that was our biggest accuracy limitation with DeepSeek. With Gemini, Level 4 exact match improved from 8.3% to 62.5%, and Level 5 from 8.3% to 29.2%. The historical analysis of the underscoring problem is in How We Caught Our AI Being Too Harsh — and Fixed It.
Thinking Level Experiments
We tested four thinking levels (OFF, LOW, MEDIUM, HIGH) on the full corpus. LOW achieved the best overall QWK (0.833), while MEDIUM achieved the best Level 4-5 accuracy. This led to a practical design decision: the default mode uses LOW for best overall accuracy, with an optional "Thorough scoring" mode using MEDIUM for teachers who know their class contains predominantly strong writers.
The results on this page will be updated as we refine our prompts and expand the corpus.
References
- Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213-220.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
- Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays. In Handbook of Automated Essay Evaluation.
- HKEAA (2020-2025). HKDSE English Language Paper 2: Samples of Candidates' Performance. Hong Kong Examinations and Assessment Authority.
EssayHero is free, has no commercial aims, and is built by a Hong Kong teacher for Hong Kong students. Questions about our methodology? Email hello@essayhero.app.
Related Articles
How EssayHero Marks HKDSE Paper 2 Essays (And Why You Should Know)
A transparent explanation for teachers and tutors of how EssayHero assesses HKDSE English Paper 2 writing, how scoring works, and where AI falls short.
How We Caught Our AI Being Too Harsh — and Fixed It
We added text-type awareness to EssayHero's scoring and accidentally made it harsher. Here's how we discovered the problem, why it happened, and the experiment that fixed it.