Key Takeaways
- We tested EssayHero against 119 official HKEAA exemplar essays spanning Levels 1-5 (2020-2025)
- The primary metric is level-to-level comparison: sum AI criterion scores, map to a level, compare against the official HKEAA level
- Per-criterion comparisons are secondary and come with important caveats
- We report Quadratic Weighted Kappa with 95% bootstrap confidence intervals
- Limitations are disclosed in full below
Why Validation Matters
If you're going to trust an AI to score your essays, you need evidence it works. Not marketing claims. Data.
Over my years of teaching, I've seen students rely on tools that promise accurate feedback but never explain how they measured accuracy. That's not good enough. If I'm asking students to use EssayHero as a practice tool, I need to hold it to the same standard I'd expect of a human second marker.
This post explains how we test our scoring, presents the results with honest interpretation, and is transparent about what the numbers don't tell you. If you're a teacher evaluating whether to recommend EssayHero, or a student deciding whether to trust the feedback, this is the evidence.
The Ground Truth Corpus
Our validation corpus consists of 119 Part B essays from the HKEAA's official Samples of Candidates' Performance booklets, published annually for teacher moderation and marker training. These are not random student essays. They are exemplars selected by the chief examiner's panel to represent each performance level definitively.
| Level | Essays | Description |
|---|---|---|
| Level 1 | 23 | Below threshold performance |
| Level 2 | 24 | Limited performance |
| Level 3 | 24 | Adequate performance |
| Level 4 | 24 | Good performance |
| Level 5 | 24 | Excellent performance |
The corpus covers six exam years (2020-2025). Both the pre-2024 format (Questions 2-9) and the revised 2024 format (Questions 2-5) are represented.
Why Part B only? Part A is compulsory guided writing that includes visual prompts (images, charts, maps). Since EssayHero analyses text-only submissions, Part B essays are the appropriate benchmark.
Why HKEAA exemplars? These are the gold standard for HKDSE marking. They are used to train and calibrate human markers across Hong Kong. When two examiners disagree on a borderline case, they turn to these booklets. If you want to know what a Level 4 essay looks like, these are the definitive examples.
One important detail about how the HKEAA grades these essays: each exemplar receives a single holistic level (1-5). The HKEAA does not publish separate scores for Content, Language & Style, and Organisation. This has significant implications for how we measure accuracy, which I'll explain in the methodology section.
Methodology
Blind Evaluation
Each essay was submitted to EssayHero's production analysis pipeline — the exact same system that scores student submissions — with no knowledge of the human grade. The AI received only the essay text and the official question prompt. No hints, no level information, no special treatment.
For each essay, the AI produced scores on three criteria (each on a 1-7 scale):
- Content — Task completion, idea development, relevance
- Language & Style — Vocabulary range, grammar accuracy, register
- Organisation — Structure, coherence, paragraphing
Primary Metric: Level-to-Level Comparison
Because HKEAA assigns a single holistic level per essay, our primary accuracy measure is a level-to-level comparison.
The process:
- The AI scores three criteria (Content, Language & Style, Organisation), each 1-7
- We sum the three scores to get a total (range: 3-21)
- We map the total to a level using the official HKDSE grading scale:
| Total Score | Level |
|---|---|
| 20-21 | 7 |
| 17-19 | 6 |
| 14-16 | 5 |
| 11-13 | 4 |
| 8-10 | 3 |
| 5-7 | 2 |
| 3-4 | 1 |
- Since our HKEAA exemplars only go up to Level 5, we clamp AI-predicted levels to 1-5 for fair comparison. An AI prediction of Level 6 or 7 is treated as Level 5.
- We compare the clamped AI level against the HKEAA level.
This is the fairest comparison we can make. The AI and the HKEAA are answering the same question: "What level is this essay?"
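As a minimal sketch of the sum-and-map step (the function names are my own, not EssayHero's internal API), the procedure looks like this:

```python
def total_to_level(total: int) -> int:
    """Map a summed criterion score (3-21) onto the 1-7 level scale."""
    bands = [(20, 7), (17, 6), (14, 5), (11, 4), (8, 3), (5, 2), (3, 1)]
    for floor, level in bands:
        if total >= floor:
            return level
    raise ValueError(f"total {total} is outside the valid range 3-21")

def predicted_level(content: int, language_style: int, organisation: int) -> int:
    """Sum the three 1-7 criterion scores, map to a level, clamp to 1-5."""
    return min(total_to_level(content + language_style + organisation), 5)
```

For example, criterion scores of 4, 4 and 5 sum to 13, which maps to Level 4; a perfect 7/7/7 maps to Level 7 but is clamped to Level 5 for comparison against the exemplar corpus.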
Secondary Metric: Per-Criterion Comparison
EssayHero gives separate scores for Content, Language & Style, and Organisation. But the HKEAA only gives one holistic level. How do you compare criterion-level accuracy when the ground truth doesn't break down that way?
Our approach: we use the HKEAA's holistic level as an approximate ground truth for each criterion. If an essay is rated Level 3 overall, we treat that as a "3" for Content, a "3" for Language & Style, and a "3" for Organisation.
This is an imperfect proxy. Real essays have uneven criterion strengths. A student might have strong content ideas but weak grammar, or excellent organisation but limited vocabulary. By assigning the same level to all three criteria, we're smoothing over exactly the kind of variation that makes essays interesting.
The per-criterion comparison is still useful as a directional signal. If the AI systematically scores Language & Style lower than the holistic level, that tells us something about the AI's tendencies. But the absolute numbers should be interpreted with caution. A large discrepancy between AI and "ground truth" on a single criterion might reflect the AI correctly identifying uneven strengths that the holistic level masks.
We report per-criterion metrics as secondary analysis, clearly labelled as such.
Metrics
We use standard metrics from Automated Essay Scoring (AES) research:
Quadratic Weighted Kappa (QWK) is the primary statistic. QWK measures agreement between two raters, corrected for chance. Unlike simple percentage agreement, QWK accounts for the ordinal nature of scores — a 2-level disagreement is penalised more heavily than a 1-level disagreement. This is the same metric used in the Kaggle Automated Student Assessment Prize (ASAP) competition and in peer-reviewed AES literature.
Mean Absolute Error (MAE) is the average absolute difference between AI and human grades. An MAE of 0.5 means the AI is, on average, half a level off.
Exact Match Rate is the percentage of cases where the AI level equals the HKEAA level exactly.
Within-One Rate is the percentage of cases where the AI level is within one level of the HKEAA grade. This is the most practically relevant metric for students: if the AI says Level 3, the real level is probably 2, 3, or 4.
Bias indicates whether the AI systematically grades higher (positive) or lower (negative) than human examiners.
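All five metrics are straightforward to compute from paired integer levels. A self-contained sketch (illustrative only, not EssayHero's production code):

```python
def quadratic_weighted_kappa(ai, human, k=5):
    """QWK for two equal-length lists of integer levels in 1..k."""
    n = len(ai)
    observed = [[0] * k for _ in range(k)]
    for a, h in zip(ai, human):
        observed[a - 1][h - 1] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic distance penalty
            num += weight * observed[i][j]
            den += weight * row[i] * col[j] / n   # chance-expected counts
    return 1.0 - num / den

def agreement_metrics(ai, human):
    """Exact match, within-one, MAE and bias for paired levels."""
    diffs = [a - h for a, h in zip(ai, human)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        "mae": sum(abs(d) for d in diffs) / n,
        "bias": sum(diffs) / n,  # positive = AI grades higher than the HKEAA
    }
```

Note how the quadratic weight makes QWK behave as described: identical ratings contribute nothing to the disagreement numerator, two raters agreeing only by chance score near 0, and a 2-level disagreement costs four times as much as a 1-level one.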
Statistical Rigour
All QWK values are reported with 95% bootstrap confidence intervals (1,000 iterations, percentile method, seeded PRNG for reproducibility). Confidence intervals tell you the range within which the true agreement likely falls, given sampling variability. With 119 essays, these intervals are informative but not narrow — a larger corpus would tighten them.
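The percentile bootstrap described above can be sketched in a few lines (a minimal illustration with assumed parameter names, not our actual pipeline code):

```python
import random

def bootstrap_ci(ai, human, statistic, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for a paired statistic.

    `statistic` is any function of (ai, human), e.g. QWK or exact-match
    rate. Seeding the PRNG makes the resampling reproducible.
    """
    rng = random.Random(seed)
    n = len(ai)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        values.append(statistic([ai[i] for i in idx], [human[i] for i in idx]))
    values.sort()
    lower = values[int(n_boot * alpha / 2)]
    upper = values[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper
```

With `alpha=0.05` and 1,000 iterations, the interval spans the 2.5th to 97.5th percentiles of the resampled statistic, which is why a small corpus produces wide but honest intervals.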
Interpreting QWK
Before presenting results, it's worth understanding what QWK values actually mean. Landis and Koch (1977) proposed the following benchmark scale, which remains widely cited:
| QWK Range | Interpretation |
|---|---|
| 0.81 - 1.00 | Almost perfect agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.21 - 0.40 | Fair agreement |
| 0.00 - 0.20 | Slight agreement |
For context: when two trained human markers score the same essay independently, they typically achieve QWK in the range of 0.60-0.80 (Shermis & Hamner, 2012). Human markers disagree with each other regularly — by one level, sometimes two. Perfect agreement between any two raters, human or AI, is not a realistic expectation.
The Kaggle ASAP competition (2012), one of the largest public AES benchmarks, saw winning systems achieve QWK values between 0.70 and 0.81 depending on the essay prompt. These systems were trained on thousands of human-graded essays for each prompt. Our setup is different — we use a general-purpose language model with prompt engineering rather than a purpose-trained scoring model — so direct comparisons should be made cautiously.
Results
Results below are from our production evaluation run in February 2026 against 119 HKEAA exemplars using Google Gemini 3 Flash Preview (gemini-3-flash-preview) with ThinkingLevel.LOW, text-type-aware positive framing prompt, rewritten band descriptors (IP-safe original language), and baseline strictness. For the story of how we developed text-type-aware scoring, see How We Caught Our AI Being Too Harsh — and Fixed It.
Level-to-Level Comparison (Primary)
| Metric | Result | Interpretation |
|---|---|---|
| QWK (Level Agreement) | 0.833 | Almost perfect agreement |
| Within-One Rate | 98.3% | Nearly every essay |
| Exact Match Rate | 55.5% | Correct level prediction |
| Mean Absolute Error | 0.46 | Average levels off |
| Bias | -0.08 | Nearly zero systematic error |
| Sample Size | 119 | HKEAA exemplar essays |
What This Means
A QWK of 0.833 places EssayHero in the "almost perfect agreement" range on the Landis and Koch scale. This is comparable to agreement levels between trained human examiners.
Accuracy by Level
| HKEAA Level | N | Exact Match | Within-1 | MAE | Bias |
|---|---|---|---|---|---|
| Level 1 | 23 | 21.7% | 100% | 0.78 | +0.78 |
| Level 2 | 24 | 91.7% | 100% | 0.08 | +0.08 |
| Level 3 | 24 | 83.3% | 100% | 0.17 | -0.17 |
| Level 4 | 24 | 62.5% | 95.8% | 0.38 | -0.38 |
| Level 5 | 24 | 29.2% | 95.8% | 0.71 | -0.71 |
Thinking Level Comparison
We tested four thinking levels on the same 119-essay corpus. The default (LOW) achieves the best overall QWK, but MEDIUM thinking is more accurate for Level 4-5 essays. Teachers can opt into "Thorough scoring" in batch marking for classes with predominantly strong writers.
| Level | LOW (default) Exact | MEDIUM (thorough) Exact |
|---|---|---|
| Level 2 | 91.7% | 62.5% |
| Level 3 | 83.3% | 41.7% |
| Level 4 | 62.5% | 87.5% |
| Level 5 | 29.2% | 66.7% |
Per-Criterion Comparison (Secondary)
The following uses the HKEAA holistic level as an approximate per-criterion ground truth. As discussed in the methodology section, this is an imperfect proxy. Our calibration primarily measures level-to-level agreement; per-criterion analysis is directional.
Important Caveat
Per-criterion ground truth is approximated from holistic levels. Discrepancies may reflect the AI correctly identifying uneven criterion strengths rather than scoring errors.
What These Results Mean for You
For students
Nearly every AI prediction (98.3%) lands within one level of the official HKEAA grade. If the AI predicts Level 3, the HKEAA would almost certainly assign Level 2, 3, or 4.
The AI is strongest at Level 2-3 essays, where exact match rates exceed 83%. At Level 5, the AI tends to underscore slightly (within-one is still 95.8%, but exact match is 29.2% — it often predicts Level 4 instead of Level 5). If you are a strong writer and the AI gives you one level below what you expected, the qualitative feedback is still highly relevant.
Use the paragraph-by-paragraph feedback to understand the criteria and identify patterns in your writing. The qualitative feedback is often more useful than the level number itself.
For teachers
Our level-to-level QWK of 0.833 places EssayHero in the "almost perfect agreement" range on the Landis and Koch scale. For context, trained human markers typically achieve QWK of 0.60-0.80 when scoring independently, so EssayHero's agreement with the HKEAA chief examiner panel exceeds typical human inter-rater reliability.
The bias is nearly zero (-0.08), meaning the AI is equally likely to score slightly above or below the official grade. The remaining inaccuracy is concentrated at the extremes: Level 1 essays (small score range makes exact match difficult) and Level 5 essays (slight tendency to underscore by one level).
For classes with predominantly Level 4-5 students, teachers can enable "Thorough scoring" in batch marking. This uses a deeper thinking mode that achieves 87.5% exact match on Level 4 and 66.7% on Level 5, at the cost of lower accuracy on Level 2-3 essays.
We achieve this level of agreement with prompt engineering alone, using a general-purpose language model (Google Gemini 3 Flash Preview). No fine-tuning, no task-specific training data.
What the numbers don't tell you
- Your actual DSE score. Real exam marking involves moderation, standardisation, and human judgement that no AI can replicate.
- Performance on unusual writing. Our corpus consists of official exemplars. Highly unconventional essays may behave differently.
- Per-criterion ground truth. We don't have it. The per-criterion comparisons use the holistic level as an approximation.
Limitations
I believe in disclosing limitations upfront, not burying them in footnotes.
Single exam type. This validation covers HKDSE Paper 2 Part B only. IELTS validation is planned but not yet completed. Results should not be generalised to other exams.
Holistic ground truth. The HKEAA provides one level per essay, not separate criterion scores. Our primary metric (level-to-level) works within this constraint. Our secondary metric (per-criterion) uses the holistic level as an approximation, which introduces measurement error. We cannot know the true per-criterion accuracy without per-criterion human grades.
AI vs. examiner panel, not AI vs. individual marker. The HKEAA levels represent the definitive judgement of the chief examiner's panel, not an individual marker's opinion. Individual markers have their own variance. We're comparing the AI against the best available ground truth, but this is a higher bar than comparing against a single human rater.
Integer score granularity. Both AI and HKEAA scores are whole numbers. A student who is "borderline Level 3/4" will be forced into one or the other, making exact match harder at boundaries. QWK handles this better than exact match does, which is one reason we use it as the primary statistic.
Sample size. 119 essays across 5 levels (23-24 per level) is adequate for aggregate metrics but limits the reliability of per-level analysis. We plan to expand the corpus as HKEAA publishes new exemplar booklets.
Levels 1-5 only. The HKEAA does not publish exemplars at Level 5* or 5**. Our corpus therefore cannot test the AI's ability to distinguish between Level 5 and the starred levels. Since the AI uses a 1-7 criterion scale that maps up to Level 7, there is an untested range at the top end.
Temporal scope. The corpus spans 2020-2025. The HKDSE exam format was revised in 2024 (from Questions 2-9 to Questions 2-5). Both formats are represented, but we have not yet tested whether the AI performs differently on pre- and post-reform essays.
Continuous Improvement
This is not a one-off validation. The calibration pipeline is automated and runs whenever we update the scoring prompt:
- Submit all 119 essays to the updated prompt
- Compute level-to-level and per-criterion metrics automatically
- Compare against the previous prompt version
- Deploy only if accuracy is maintained or improved
This creates a tight feedback loop. If a prompt change causes the AI to underscore Level 4 essays, we catch it before students see the change. If the per-criterion bias shifts, we can trace it to the specific prompt edit that caused it.
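The deploy decision at the end of that loop amounts to a simple regression gate. A sketch under assumed names (the metric keys and the rule itself are illustrative, not EssayHero's actual configuration):

```python
def should_deploy(new_metrics: dict, old_metrics: dict) -> bool:
    """Accept a prompt change only if no tracked metric regresses.

    Higher is better for QWK and within-one rate; lower is better
    for mean absolute error.
    """
    return (
        new_metrics["qwk"] >= old_metrics["qwk"]
        and new_metrics["within_one"] >= old_metrics["within_one"]
        and new_metrics["mae"] <= old_metrics["mae"]
    )
```

A gate like this is deliberately conservative: a prompt edit that trades Level 4 accuracy for Level 2 accuracy can leave aggregate QWK unchanged, which is why the per-level table is also inspected before deploying.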
Gemini Migration (February 2026)
EssayHero originally used DeepSeek Chat for essay analysis, achieving a level QWK of 0.558 (moderate agreement). In February 2026, we migrated to Google Gemini 3 Flash Preview, which improved QWK to 0.833 (almost perfect agreement) — a substantial gain across all metrics.
The migration also resolved the persistent Level 4-5 underscoring problem that was our biggest accuracy limitation with DeepSeek. With Gemini, Level 4 exact match improved from 8.3% to 62.5%, and Level 5 from 8.3% to 29.2%. The historical analysis of the underscoring problem is in How We Caught Our AI Being Too Harsh — and Fixed It.
Thinking Level Experiments
We tested four thinking levels (OFF, LOW, MEDIUM, HIGH) on the full corpus. LOW achieved the best overall QWK (0.833), while MEDIUM achieved the best Level 4-5 accuracy. This led to a practical design decision: the default mode uses LOW for best overall accuracy, with an optional "Thorough scoring" mode using MEDIUM for teachers who know their class contains predominantly strong writers.
The results on this page will be updated as we refine our prompts and expand the corpus.
References
- Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213-220.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
- Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays. In Handbook of Automated Essay Evaluation.
- HKEAA (2020-2025). HKDSE English Language Paper 2: Samples of Candidates' Performance. Hong Kong Examinations and Assessment Authority.
EssayHero is free, has no commercial aims, and is built by a Hong Kong teacher for Hong Kong students. Questions about our methodology? Email hello@essayhero.app.
Related Articles
How EssayHero Marks HKDSE Paper 2 Essays (And Why You Should Know)
A transparent explanation for teachers and tutors of how EssayHero assesses HKDSE English Paper 2 writing, how scoring works, and where AI falls short.
How We Caught Our AI Being Too Harsh — and Fixed It
We added text-type awareness to EssayHero's scoring and accidentally made it harsher. Here's how we discovered the problem, why it happened, and the experiment that fixed it.