Key Takeaways
- We tested EssayHero against 120 official HKEAA exemplar essays spanning Levels 1-5 (2020-2025)
- The primary metric is level-to-level comparison: sum AI criterion scores, map to a level, compare against the official HKEAA level
- Per-criterion comparisons are secondary and come with important caveats
- We report Quadratic Weighted Kappa with 95% bootstrap confidence intervals
- Limitations are disclosed in full below
Why Validation Matters
If you're going to trust an AI to score your essays, you need evidence it works. Not marketing claims. Data.
Over my years of teaching, I've seen students rely on tools that promise accurate feedback but never explain how they measured accuracy. That's not good enough.
If I'm asking students to use EssayHero as a practice tool, I need to hold it to the same standard I'd expect of a human second marker.
What This Post Covers
This post explains how we test our scoring, presents the results with honest interpretation, and is transparent about what the numbers don't tell you.
If you're a teacher evaluating whether to recommend EssayHero, or a student deciding whether to trust the feedback, this is the evidence.
The Ground Truth Corpus
Our validation corpus consists of 120 Part B essays from the HKEAA's official Samples of Candidates' Performance booklets, published annually for teacher moderation and marker training.
These are not random student essays. They are exemplars selected by the chief examiner's panel to represent each performance level definitively.
| Level | Essays | Description |
|---|---|---|
| Level 1 | 24 | Below threshold performance |
| Level 2 | 24 | Limited performance |
| Level 3 | 24 | Adequate performance |
| Level 4 | 24 | Good performance |
| Level 5 | 24 | Excellent performance |
Corpus Scope
The corpus covers six exam years (2020-2025). Both the pre-2024 format (Questions 2-9) and the revised 2024 format (Questions 2-5) are represented.
Why Part B Only?
Part A is compulsory guided writing that includes visual prompts (images, charts, maps). Since EssayHero analyses text-only submissions, Part B essays are the appropriate benchmark.
Why HKEAA Exemplars?
These are the gold standard for HKDSE marking. They are used to train and calibrate human markers across Hong Kong.
When two examiners disagree on a borderline case, they turn to these booklets. If you want to know what a Level 4 essay looks like, these are the definitive examples.
Important Caveat About Holistic Scoring
Each exemplar receives a single holistic level (1-5). The HKEAA does not publish separate scores for Content, Language & Style, and Organisation.
This has significant implications for how we measure accuracy, which I'll explain in the methodology section.
Methodology
Blind Evaluation
Each essay was submitted to EssayHero's production analysis pipeline — the exact same system that scores student submissions — with no knowledge of the human grade.
The AI received only the essay text and the official question prompt. No hints, no level information, no special treatment.
AI Output Format
For each essay, the AI produced scores on three criteria (each on a 1-7 scale):
- Content — Task completion, idea development, relevance
- Language & Style — Vocabulary range, grammar accuracy, register
- Organisation — Structure, coherence, paragraphing
Primary Metric: Level-to-Level Comparison
Because HKEAA assigns a single holistic level per essay, our primary accuracy measure is a level-to-level comparison.
Conversion Process
The process works in five steps:
- Score three criteria — The AI scores Content, Language & Style, and Organisation (each 1-7)
- Sum to total — We add the three scores to get a total (range: 3-21)
- Map to level — We convert the total to a level using the official HKDSE grading scale:
| Total Score | Level |
|---|---|
| 20-21 | 7 |
| 17-19 | 6 |
| 14-16 | 5 |
| 11-13 | 4 |
| 8-10 | 3 |
| 5-7 | 2 |
| 3-4 | 1 |
- Clamp to corpus range — Since our HKEAA exemplars only go up to Level 5, we clamp AI-predicted levels to 1-5 for fair comparison. An AI prediction of Level 6 or 7 is treated as Level 5.
- Compare — We compare the clamped AI level against the HKEAA level.
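The sum-map-clamp steps above can be sketched in a few lines of Python. This is an illustrative sketch, not EssayHero's actual code; the function names are hypothetical.

```python
def total_to_level(total: int) -> int:
    """Map a summed criterion total (3-21) to a level via the official scale."""
    bands = [(20, 7), (17, 6), (14, 5), (11, 4), (8, 3), (5, 2), (3, 1)]
    for threshold, level in bands:
        if total >= threshold:
            return level
    raise ValueError(f"total {total} outside valid range 3-21")

def predict_level(content: int, language: int, organisation: int,
                  max_level: int = 5) -> int:
    """Sum three 1-7 criterion scores, map to a level, clamp to corpus range."""
    level = total_to_level(content + language + organisation)
    return min(level, max_level)  # clamp: a predicted Level 6 or 7 becomes Level 5

# Example: scores of 5, 4, 5 sum to 14, which falls in the 14-16 band (Level 5)
print(predict_level(5, 4, 5))  # 5
```

Clamping only ever lowers a prediction, so it cannot inflate the agreement figures for the 1-5 corpus.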
Why This Is Fair
This is the fairest comparison we can make. The AI and the HKEAA are answering the same question: "What level is this essay?"
Secondary Metric: Per-Criterion Comparison
EssayHero gives separate scores for Content, Language & Style, and Organisation. But the HKEAA only gives one holistic level.
How do you compare criterion-level accuracy when the ground truth doesn't break down that way?
Our Approach
We use the HKEAA's holistic level as an approximate ground truth for each criterion. If an essay is rated Level 3 overall, we treat that as a "3" for Content, a "3" for Language & Style, and a "3" for Organisation.
Important Limitation
This is an imperfect proxy. Real essays have uneven criterion strengths.
A student might have strong content ideas but weak grammar, or excellent organisation but limited vocabulary. By assigning the same level to all three criteria, we're smoothing over exactly the kind of variation that makes essays interesting.
What Per-Criterion Metrics Tell Us
The per-criterion comparison is still useful as a directional signal:
- If the AI systematically scores Language & Style lower than the holistic level, that tells us something about the AI's tendencies
- But the absolute numbers should be interpreted with caution
- A large discrepancy on a single criterion might reflect the AI correctly identifying uneven strengths that the holistic level masks
We report per-criterion metrics as secondary analysis, clearly labelled as such.
Metrics
We use standard metrics from Automated Essay Scoring (AES) research:
Core Metrics Explained
- Quadratic Weighted Kappa (QWK) — Primary statistic measuring agreement between two raters, corrected for chance. Unlike simple percentage agreement, QWK accounts for the ordinal nature of scores (a 2-level disagreement is penalised more heavily than a 1-level disagreement). This is the same metric used in the Kaggle Automated Student Assessment Prize (ASAP) competition and in peer-reviewed AES literature.
- Mean Absolute Error (MAE) — Average absolute difference between AI and human grades. An MAE of 0.5 means the AI is, on average, half a level off.
- Exact Match Rate — Percentage of cases where the AI level equals the HKEAA level exactly.
- Within-One Rate — Percentage of cases where the AI level is within one level of the HKEAA grade. This is the most practically relevant metric for students: if the AI says Level 3, the real level is probably 2, 3, or 4.
- Bias — Indicates whether the AI systematically grades higher (positive) or lower (negative) than human examiners.
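To make the metrics concrete, here is a hand-rolled sketch that computes all five from paired level lists. It is for illustration only (the production pipeline is not shown in this post); the QWK formula is the standard one, with the constant weight denominator cancelling in the ratio.

```python
from collections import Counter

def quadratic_weighted_kappa(human, ai, min_level=1, max_level=5):
    """QWK: agreement corrected for chance, squared penalty for larger gaps."""
    levels = range(min_level, max_level + 1)
    n = len(human)
    observed = Counter(zip(human, ai))   # joint counts of (human, ai) pairs
    h_marg = Counter(human)              # marginal counts for each rater
    a_marg = Counter(ai)
    num = den = 0.0
    for i in levels:
        for j in levels:
            w = (i - j) ** 2             # quadratic disagreement weight
            num += w * observed[(i, j)]
            den += w * h_marg[i] * a_marg[j] / n  # chance-expected counts
    return 1.0 - num / den

def summary_metrics(human, ai):
    """MAE, exact match, within-one rate, and signed bias from paired levels."""
    diffs = [a - h for h, a in zip(human, ai)]
    n = len(diffs)
    return {
        "mae": sum(abs(d) for d in diffs) / n,
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        "bias": sum(diffs) / n,  # positive = AI overscores, negative = underscores
    }
```

With identical ratings QWK is 1.0, and with statistically independent ratings it falls to 0, which is what "corrected for chance" means in practice.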
Statistical Rigour
All QWK values are reported with 95% bootstrap confidence intervals (1,000 iterations, percentile method, seeded PRNG for reproducibility).
Confidence intervals tell you the range within which the true agreement likely falls, given sampling variability. With 120 essays, these intervals are informative but not narrow — a larger corpus would tighten them.
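A percentile bootstrap with a seeded PRNG can be sketched as follows. This is a generic stdlib-only illustration (`bootstrap_ci` and the demo statistic are hypothetical names, not our pipeline's API); any paired agreement statistic can be plugged in.

```python
import random

def bootstrap_ci(human, ai, stat, iters=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for a paired agreement statistic."""
    rng = random.Random(seed)  # seeded PRNG so the interval is reproducible
    n = len(human)
    samples = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample essays with replacement
        samples.append(stat([human[i] for i in idx], [ai[i] for i in idx]))
    samples.sort()
    # percentile method: take the alpha/2 and 1 - alpha/2 quantiles
    return samples[int(alpha / 2 * iters)], samples[int((1 - alpha / 2) * iters) - 1]

# Demo with MAE as the statistic; any metric with the same signature works
mae = lambda h, a: sum(abs(x - y) for x, y in zip(h, a)) / len(h)
lo, hi = bootstrap_ci([3, 4, 5, 3, 2] * 24, [3, 3, 4, 3, 2] * 24, stat=mae)
```

Because the generator is seeded, re-running the evaluation yields the same interval, which is what makes the reported CIs reproducible.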
Interpreting QWK
Before presenting results, it's worth understanding what QWK values actually mean.
Benchmark Scale
Landis and Koch (1977) proposed the following benchmark scale, which remains widely cited:
| QWK Range | Interpretation |
|---|---|
| 0.81 - 1.00 | Almost perfect agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.21 - 0.40 | Fair agreement |
| Below 0.20 | Slight agreement |
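The benchmark scale is straightforward to encode. A small helper (hypothetical, for illustration only) makes the band thresholds explicit:

```python
def interpret_qwk(qwk: float) -> str:
    """Verbal benchmarks for kappa values, after Landis & Koch (1977)."""
    if qwk > 0.80:
        return "Almost perfect agreement"
    if qwk > 0.60:
        return "Substantial agreement"
    if qwk > 0.40:
        return "Moderate agreement"
    if qwk > 0.20:
        return "Fair agreement"
    return "Slight agreement"

print(interpret_qwk(0.558))  # Moderate agreement — this post's primary result
```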
Human Baseline
When two trained human markers score the same essay independently, they typically achieve QWK in the range of 0.60-0.80 (Shermis & Hamner, 2012).
Human markers disagree with each other regularly — by one level, sometimes two. Perfect agreement between any two raters, human or AI, is not a realistic expectation.
Industry Benchmarks
The Kaggle ASAP competition (2012), one of the largest public AES benchmarks, saw winning systems achieve QWK values between 0.70 and 0.81 depending on the essay prompt.
These systems were trained on thousands of human-graded essays for each prompt. Our setup is different — we use a general-purpose language model with prompt engineering rather than a purpose-trained scoring model — so direct comparisons should be made cautiously.
Results
Production System Results
Results below are from our production evaluation run on 10 February 2026 against 120 HKEAA exemplars, using DeepSeek Chat (deepseek-chat) with a text-type-aware, positively framed prompt, rewritten band descriptors (IP-safe original language), and baseline strictness.
For the story of how we developed text-type-aware scoring, see How We Caught Our AI Being Too Harsh — and Fixed It.
Level-to-Level Comparison (Primary)
| Metric | Value |
|---|---|
| QWK (Level Agreement) | 0.558 (Moderate agreement) |
| 95% Bootstrap CI | [0.461, 0.646] |
| Mean Absolute Error | 0.83 |
| Exact Match Rate | 39.2% |
| Within-One Rate | 80.0% |
| Bias | -0.56 (AI slightly underscores) |
| Sample Size | 120 essays |
Accuracy by Level
| HKEAA Level | N | Exact Match | Within-1 | MAE | Bias |
|---|---|---|---|---|---|
| Level 1 | 24 | 33.3% | 100% | 0.67 | +0.67 |
| Level 2 | 24 | 100% | 100% | 0.0 | 0.0 |
| Level 3 | 24 | 45.8% | 100% | 0.54 | -0.54 |
| Level 4 | 24 | 8.3% | 66.7% | 1.25 | -1.25 |
| Level 5 | 24 | 8.3% | 33.3% | 1.67 | -1.67 |
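A per-level breakdown like the table above can be produced by grouping signed errors by the human level. This is a stdlib-only sketch with illustrative names, not the evaluation pipeline itself:

```python
from collections import defaultdict

def per_level_breakdown(human, ai):
    """Per-level exact match, within-one rate, MAE, and bias."""
    groups = defaultdict(list)
    for h, a in zip(human, ai):
        groups[h].append(a - h)  # signed error for this essay
    report = {}
    for level, diffs in sorted(groups.items()):
        n = len(diffs)
        report[level] = {
            "n": n,
            "exact": sum(d == 0 for d in diffs) / n,
            "within_one": sum(abs(d) <= 1 for d in diffs) / n,
            "mae": sum(abs(d) for d in diffs) / n,
            "bias": sum(diffs) / n,
        }
    return report
```

Slicing by level is what exposes the top-end underscoring pattern that an aggregate MAE of 0.83 would otherwise hide.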
Per-Criterion Comparison (Secondary)
Interpretation Caveat
The following table uses the HKEAA holistic level as an approximate per-criterion ground truth. As discussed in the methodology section, this is an imperfect proxy.
These numbers provide directional signal about where the AI may systematically over- or underscore, but should not be treated as precise accuracy figures.
| Criterion | QWK | Bias | Interpretation |
|---|---|---|---|
| Content | 0.746 | -0.27 | Substantial agreement; slight underscoring |
| Language & Style | 0.522 | -0.68 | Moderate agreement; notable underscoring |
| Organisation | 0.558 | -0.56 | Moderate agreement; moderate underscoring |
| Overall (criterion average) | 0.607 | -0.50 | Moderate agreement overall |
Note: Per-criterion ground truth is approximated from holistic levels. Discrepancies may reflect the AI correctly identifying uneven criterion strengths rather than scoring errors.
What These Results Mean for You
For Students
Four out of five AI predictions land within one level of the official HKEAA grade. If the AI predicts Level 3, the HKEAA would very likely assign Level 2, 3, or 4.
Accuracy Varies by Level
The AI performs differently across the level spectrum:
- Strongest at Levels 1-3 — Within-one accuracy is 100%
- Weakest at Level 5 — Within-one drops to 33.3%
- Tendency to underscore — Especially for high-quality work
If you're a strong writer and the AI gives you a Level 3, your real level may well be higher. This is an active area of improvement.
How to Use the Feedback
Use the paragraph-by-paragraph feedback to understand the criteria and identify patterns in your writing.
The qualitative feedback is often more useful than the level number itself. Don't treat the level as a guarantee.
For Teachers
Our level-to-level QWK of 0.558 places EssayHero in the moderate agreement range on the Landis and Koch scale.
For context, trained human markers typically achieve QWK of 0.60-0.80 when scoring independently. We're below that threshold, which means the AI's level assignments should be treated as rough estimates rather than reliable second-marker judgements.
Two Key Patterns
Underscoring bias: The AI underscores by an average of 0.56 levels. This means students are more likely to receive a level below their true performance than above it.
Top-end struggles: The underscoring is concentrated at the top. Level 4-5 essays are frequently predicted as Level 2-3. The AI handles the lower end well but struggles to distinguish good from excellent writing.
Per-Criterion Insights
The per-criterion bias figures reinforce these patterns:
- Language & Style — Largest underscoring (bias of -0.68), suggesting the AI is overly critical of grammar and vocabulary
- Content — Smallest bias (-0.27) and highest criterion-level QWK (0.746, in the substantial agreement range)
We use these signals to refine our prompts. Language & Style calibration is a priority.
Technical Approach
It's worth emphasising that we achieve this level of agreement with prompt engineering alone, using a general-purpose language model.
No fine-tuning, no task-specific training data. We expect accuracy to improve as we refine prompts and potentially move to fine-tuned models.
What the Numbers Don't Tell You
These metrics have important blind spots:
- Your actual DSE score — Real exam marking involves moderation, standardisation, and human judgement that no AI can replicate
- Performance on unusual writing — Our corpus consists of official exemplars. Highly unconventional essays may behave differently
- Per-criterion ground truth — We don't have it. The per-criterion comparisons use the holistic level as an approximation
- How Level 4-5 students should interpret scores — If you're aiming for Levels 4-5, the AI is more likely to underscore your work. Focus on the qualitative feedback rather than the level number
Limitations
I believe in disclosing limitations upfront, not burying them in footnotes.
Scope Limitations
Single exam type — This validation covers HKDSE Paper 2 Part B only. IELTS validation is planned but not yet completed. Results should not be generalised to other exams.
Temporal scope — The corpus spans 2020-2025. The HKDSE exam format was revised in 2024 (from Questions 2-9 to Questions 2-5). Both formats are represented, but we have not yet tested whether the AI performs differently on pre- and post-reform essays.
Levels 1-5 only — The HKEAA does not publish exemplars at Level 5* or 5**. Our corpus therefore cannot test the AI's ability to distinguish between Level 5 and the starred levels. Since the AI uses a 1-7 criterion scale that maps up to Level 7, there is an untested range at the top end.
Measurement Limitations
Holistic ground truth — The HKEAA provides one level per essay, not separate criterion scores. Our primary metric (level-to-level) works within this constraint. Our secondary metric (per-criterion) uses the holistic level as an approximation, which introduces measurement error. We cannot know the true per-criterion accuracy without per-criterion human grades.
AI vs. examiner panel, not AI vs. individual marker — The HKEAA levels represent the definitive judgement of the chief examiner's panel, not an individual marker's opinion. Individual markers have their own variance. We're comparing the AI against the best available ground truth, but this is a higher bar than comparing against a single human rater.
Integer score granularity — Both AI and HKEAA scores are whole numbers. A student who is "borderline Level 3/4" will be forced into one or the other, making exact match harder at boundaries. QWK handles this better than exact match does, which is one reason we use it as the primary statistic.
Statistical Limitations
Sample size — 120 essays across 5 levels (24 per level) is adequate for aggregate metrics but limits the reliability of per-level analysis. The confidence intervals reflect this. We plan to expand the corpus as HKEAA publishes new exemplar booklets.
Continuous Improvement
This is not a one-off validation. The calibration pipeline is automated and runs whenever we update the scoring prompt.
Automated Pipeline
The workflow ensures quality control:
- Submit — All 120 essays to the updated prompt
- Compute — Level-to-level and per-criterion metrics automatically
- Compare — Against the previous prompt version
- Deploy — Only if accuracy is maintained or improved
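The deploy gate in the workflow above can be sketched as a simple comparison of metric dictionaries. The post does not specify the exact gating thresholds, so this is a hypothetical rule under stated assumptions (QWK must not drop, MAE must not grow):

```python
def should_deploy(new: dict, old: dict, tolerance: float = 0.0) -> bool:
    """Gate a prompt update: deploy only if key metrics hold or improve.

    `new`/`old` are metric dicts like {"qwk": 0.558, "mae": 0.83}.
    Thresholds are illustrative; the real pipeline may use different rules.
    """
    qwk_ok = new["qwk"] >= old["qwk"] - tolerance  # agreement must not drop
    mae_ok = new["mae"] <= old["mae"] + tolerance  # average error must not grow
    return qwk_ok and mae_ok

# Example: a change that raises QWK and lowers MAE passes the gate
print(should_deploy({"qwk": 0.558, "mae": 0.83}, {"qwk": 0.539, "mae": 0.85}))  # True
```

A gate like this is what turns the calibration run from a report into a regression test: a prompt edit that degrades accuracy never ships.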
Tight Feedback Loop
This process catches problems before they reach students:
- If a prompt change causes the AI to underscore Level 4 essays, we catch it before deployment
- If the per-criterion bias shifts, we can trace it to the specific prompt edit that caused it
Recent Improvements: Text-Type-Aware Scoring
We recently added text-type conventions to the scoring prompt so the AI can give format-specific feedback (e.g., recognising blog conventions vs formal letter conventions).
Framing Experiments
During calibration, we discovered that the way we framed these conventions significantly affected scoring accuracy:
- Initial approach — Degraded QWK from 0.532 to 0.480
- After testing — Three alternative framings were tested against the full 120-essay corpus
- Final result — The positive-framing approach achieved QWK 0.539, a slight improvement over the baseline
The full story, including experiment methodology and results, is in How We Caught Our AI Being Too Harsh — and Fixed It.
IP Safety Review
After the text-type framing experiments, we also rewrote all exam board band descriptors in original language (replacing verbatim copyrighted text) as part of our intellectual property safety review.
A full recalibration confirmed that this change maintained or slightly improved accuracy — QWK rose from 0.539 to 0.558 — indicating that the scoring system relies on the substance of the descriptors rather than their exact wording.
Living Document
The results on this page will be updated as we refine our prompts and expand the corpus.
References
- Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213-220.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
- Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays. In Handbook of Automated Essay Evaluation.
- HKEAA (2020-2025). HKDSE English Language Paper 2: Samples of Candidates' Performance. Hong Kong Examinations and Assessment Authority.
EssayHero is free, has no commercial aims, and is built by a Hong Kong teacher for Hong Kong students. Questions about our methodology? Email hello@essayhero.app.
Related Articles
How EssayHero Marks HKDSE Paper 2 Essays (And Why You Should Know)
A transparent explanation for teachers and tutors of how EssayHero assesses HKDSE English Paper 2 writing, how scoring works, and where AI falls short.
How We Caught Our AI Being Too Harsh — and Fixed It
We added text-type awareness to EssayHero's scoring and accidentally made it harsher. Here's how we discovered the problem, why it happened, and the experiment that fixed it.