Key Takeaways
- We tested EssayHero against 120 official HKEAA exemplar essays spanning Levels 1-5 (2020-2025)
- The primary metric is level-to-level comparison: sum AI criterion scores, map to a level, compare against the official HKEAA level
- Per-criterion comparisons are secondary and come with important caveats
- We report Quadratic Weighted Kappa with 95% bootstrap confidence intervals
- Limitations are disclosed in full below
Why Validation Matters
If you're going to trust an AI to score your essays, you need evidence it works. Not marketing claims. Data.
Over my years of teaching, I've seen students rely on tools that promise accurate feedback but never explain how they measured accuracy. That's not good enough.
If I'm asking students to use EssayHero as a practice tool, I need to hold it to the same standard I'd expect of a human second marker.
What This Post Covers
This post explains how we test our scoring, presents the results with honest interpretation, and is transparent about what the numbers don't tell you.
If you're a teacher evaluating whether to recommend EssayHero, or a student deciding whether to trust the feedback, this is the evidence.
The Ground Truth Corpus
Our validation corpus consists of 120 Part B essays from the HKEAA's official Samples of Candidates' Performance booklets, published annually for teacher moderation and marker training.
These are not random student essays. They are exemplars selected by the chief examiner's panel to represent each performance level definitively.
| Level | Essays | Description |
|---|---|---|
| Level 1 | 24 | Below threshold performance |
| Level 2 | 24 | Limited performance |
| Level 3 | 24 | Adequate performance |
| Level 4 | 24 | Good performance |
| Level 5 | 24 | Excellent performance |
Corpus Scope
The corpus covers six exam years (2020-2025). Both the pre-2024 format (Questions 2-9) and the revised 2024 format (Questions 2-5) are represented.
Why Part B Only?
Part A is compulsory guided writing that includes visual prompts (images, charts, maps). Since EssayHero analyses text-only submissions, Part B essays are the appropriate benchmark.
Why HKEAA Exemplars?
These are the gold standard for HKDSE marking. They are used to train and calibrate human markers across Hong Kong.
When two examiners disagree on a borderline case, they turn to these booklets. If you want to know what a Level 4 essay looks like, these are the definitive examples.
Important Caveat About Holistic Scoring
Each exemplar receives a single holistic level (1-5). The HKEAA does not publish separate scores for Content, Language & Style, and Organisation.
This has significant implications for how we measure accuracy, which I'll explain in the methodology section.
Methodology
Blind Evaluation
Each essay was submitted to EssayHero's production analysis pipeline — the exact same system that scores student submissions — with no knowledge of the human grade.
The AI received only the essay text and the official question prompt. No hints, no level information, no special treatment.
AI Output Format
For each essay, the AI produced scores on three criteria (each on a 1-7 scale):
- Content — Task completion, idea development, relevance
- Language & Style — Vocabulary range, grammar accuracy, register
- Organisation — Structure, coherence, paragraphing
Primary Metric: Level-to-Level Comparison
Because HKEAA assigns a single holistic level per essay, our primary accuracy measure is a level-to-level comparison.
Conversion Process
The process works in five steps:
- Score three criteria — The AI scores Content, Language & Style, and Organisation (each 1-7)
- Sum to total — We add the three scores to get a total (range: 3-21)
- Map to level — We convert the total to a level using the official HKDSE grading scale:
| Total Score | Level |
|---|---|
| 20-21 | 7 |
| 17-19 | 6 |
| 14-16 | 5 |
| 11-13 | 4 |
| 8-10 | 3 |
| 5-7 | 2 |
| 3-4 | 1 |
- Clamp to corpus range — Since our HKEAA exemplars only go up to Level 5, we clamp AI-predicted levels to 1-5 for fair comparison. An AI prediction of Level 6 or 7 is treated as Level 5.
- Compare — We compare the clamped AI level against the HKEAA level.
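The sum-map-clamp steps above can be sketched in a few lines of Python. This is an illustrative sketch, not EssayHero's actual code; the function names are hypothetical.

```python
def total_to_level(total: int) -> int:
    """Map a summed criterion total (3-21) to a level via the official scale."""
    bands = [(20, 7), (17, 6), (14, 5), (11, 4), (8, 3), (5, 2), (3, 1)]
    for threshold, level in bands:
        if total >= threshold:
            return level
    raise ValueError(f"total {total} outside valid range 3-21")

def predict_level(content: int, language: int, organisation: int,
                  max_level: int = 5) -> int:
    """Sum three 1-7 criterion scores, map to a level, clamp to corpus range."""
    level = total_to_level(content + language + organisation)
    return min(level, max_level)  # clamp: a predicted Level 6 or 7 becomes Level 5

# Example: scores of 5, 4, 5 sum to 14, which falls in the 14-16 band (Level 5)
print(predict_level(5, 4, 5))  # 5
```

Clamping only ever lowers a prediction, so it cannot inflate the agreement figures for the 1-5 corpus.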
Why This Is Fair
This is the fairest comparison we can make. The AI and the HKEAA are answering the same question: "What level is this essay?"
Secondary Metric: Per-Criterion Comparison
EssayHero gives separate scores for Content, Language & Style, and Organisation. But the HKEAA only gives one holistic level.
How do you compare criterion-level accuracy when the ground truth doesn't break down that way?
Our Approach
We use the HKEAA's holistic level as an approximate ground truth for each criterion. If an essay is rated Level 3 overall, we treat that as a "3" for Content, a "3" for Language & Style, and a "3" for Organisation.
Important Limitation
This is an imperfect proxy. Real essays have uneven criterion strengths.
A student might have strong content ideas but weak grammar, or excellent organisation but limited vocabulary. By assigning the same level to all three criteria, we're smoothing over exactly the kind of variation that makes essays interesting.
What Per-Criterion Metrics Tell Us
The per-criterion comparison is still useful as a directional signal:
- If the AI systematically scores Language & Style lower than the holistic level, that tells us something about the AI's tendencies
- But the absolute numbers should be interpreted with caution
- A large discrepancy on a single criterion might reflect the AI correctly identifying uneven strengths that the holistic level masks
We report per-criterion metrics as secondary analysis, clearly labelled as such.
Metrics
We use standard metrics from Automated Essay Scoring (AES) research:
Core Metrics Explained
- Quadratic Weighted Kappa (QWK) — Primary statistic measuring agreement between two raters, corrected for chance. Unlike simple percentage agreement, QWK accounts for the ordinal nature of scores (a 2-level disagreement is penalised more heavily than a 1-level disagreement). This is the same metric used in the Kaggle Automated Student Assessment Prize (ASAP) competition and in peer-reviewed AES literature.
- Mean Absolute Error (MAE) — Average absolute difference between AI and human grades. An MAE of 0.5 means the AI is, on average, half a level off.
- Exact Match Rate — Percentage of cases where the AI level equals the HKEAA level exactly.
- Within-One Rate — Percentage of cases where the AI level is within one level of the HKEAA grade. This is the most practically relevant metric for students: if the AI says Level 3, the real level is probably 2, 3, or 4.
- Bias — Indicates whether the AI systematically grades higher (positive) or lower (negative) than human examiners.
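To make the metrics concrete, here is a hand-rolled sketch that computes all five from paired level lists. It is for illustration only (the production pipeline is not shown in this post); the QWK formula is the standard one, with the constant weight denominator cancelling in the ratio.

```python
from collections import Counter

def quadratic_weighted_kappa(human, ai, min_level=1, max_level=5):
    """QWK: agreement corrected for chance, squared penalty for larger gaps."""
    levels = range(min_level, max_level + 1)
    n = len(human)
    observed = Counter(zip(human, ai))   # joint counts of (human, ai) pairs
    h_marg = Counter(human)              # marginal counts for each rater
    a_marg = Counter(ai)
    num = den = 0.0
    for i in levels:
        for j in levels:
            w = (i - j) ** 2             # quadratic disagreement weight
            num += w * observed[(i, j)]
            den += w * h_marg[i] * a_marg[j] / n  # chance-expected counts
    return 1.0 - num / den

def summary_metrics(human, ai):
    """MAE, exact match, within-one rate, and signed bias from paired levels."""
    diffs = [a - h for h, a in zip(human, ai)]
    n = len(diffs)
    return {
        "mae": sum(abs(d) for d in diffs) / n,
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        "bias": sum(diffs) / n,  # positive = AI overscores, negative = underscores
    }
```

With identical ratings QWK is 1.0, and with statistically independent ratings it falls to 0, which is what "corrected for chance" means in practice.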
Statistical Rigour
All QWK values are reported with 95% bootstrap confidence intervals (1,000 iterations, percentile method, seeded PRNG for reproducibility).
Confidence intervals tell you the range within which the true agreement likely falls, given sampling variability. With 120 essays, these intervals are informative but not narrow — a larger corpus would tighten them.
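A percentile bootstrap with a seeded PRNG can be sketched as follows. This is a generic stdlib-only illustration (`bootstrap_ci` and the demo statistic are hypothetical names, not our pipeline's API); any paired agreement statistic can be plugged in.

```python
import random

def bootstrap_ci(human, ai, stat, iters=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for a paired agreement statistic."""
    rng = random.Random(seed)  # seeded PRNG so the interval is reproducible
    n = len(human)
    samples = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample essays with replacement
        samples.append(stat([human[i] for i in idx], [ai[i] for i in idx]))
    samples.sort()
    # percentile method: take the alpha/2 and 1 - alpha/2 quantiles
    return samples[int(alpha / 2 * iters)], samples[int((1 - alpha / 2) * iters) - 1]

# Demo with MAE as the statistic; any metric with the same signature works
mae = lambda h, a: sum(abs(x - y) for x, y in zip(h, a)) / len(h)
lo, hi = bootstrap_ci([3, 4, 5, 3, 2] * 24, [3, 3, 4, 3, 2] * 24, stat=mae)
```

Because the generator is seeded, re-running the evaluation yields the same interval, which is what makes the reported CIs reproducible.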
Interpreting QWK
Before presenting results, it's worth understanding what QWK values actually mean.
Benchmark Scale
Landis and Koch (1977) proposed the following benchmark scale, which remains widely cited:
| QWK Range | Interpretation |
|---|---|
| 0.81 - 1.00 | Almost perfect agreement |
| 0.61 - 0.80 | Substantial agreement |
| 0.41 - 0.60 | Moderate agreement |
| 0.21 - 0.40 | Fair agreement |
| Below 0.20 | Slight agreement |
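The benchmark scale is straightforward to encode. A small helper (hypothetical, for illustration only) makes the band thresholds explicit:

```python
def interpret_qwk(qwk: float) -> str:
    """Verbal benchmarks for kappa values, after Landis & Koch (1977)."""
    if qwk > 0.80:
        return "Almost perfect agreement"
    if qwk > 0.60:
        return "Substantial agreement"
    if qwk > 0.40:
        return "Moderate agreement"
    if qwk > 0.20:
        return "Fair agreement"
    return "Slight agreement"

print(interpret_qwk(0.558))  # Moderate agreement — this post's primary result
```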
Human Baseline
When two trained human markers score the same essay independently, they typically achieve QWK in the range of 0.60-0.80 (Shermis & Hamner, 2012).
Human markers disagree with each other regularly — by one level, sometimes two. Perfect agreement between any two raters, human or AI, is not a realistic expectation.
Industry Benchmarks
The Kaggle ASAP competition (2012), one of the largest public AES benchmarks, saw winning systems achieve QWK values between 0.70 and 0.81 depending on the essay prompt.
These systems were trained on thousands of human-graded essays for each prompt. Our setup is different — we use a general-purpose language model with prompt engineering rather than a purpose-trained scoring model — so direct comparisons should be made cautiously.
Results
Production System Results
Results below are from our production evaluation run on 10 February 2026 against 120 HKEAA exemplars, using DeepSeek Chat (deepseek-chat) with a text-type-aware, positively framed prompt, rewritten band descriptors (IP-safe original language), and baseline strictness.
For the story of how we developed text-type-aware scoring, see How We Caught Our AI Being Too Harsh — and Fixed It.
Level-to-Level Comparison (Primary)
| Metric | Value |
|---|---|
| QWK (Level Agreement) | 0.558 (Moderate agreement) |
| 95% Bootstrap CI | [0.461, 0.646] |
| Mean Absolute Error | 0.83 |
| Exact Match Rate | 39.2% |
| Within-One Rate | 80.0% |
| Bias | -0.56 (AI slightly underscores) |
| Sample Size | 120 essays |
Accuracy by Level
| HKEAA Level | N | Exact Match | Within-1 | MAE | Bias |
|---|---|---|---|---|---|
| Level 1 | 24 | 33.3% | 100% | 0.67 | +0.67 |
| Level 2 | 24 | 100% | 100% | 0.0 | 0.0 |
| Level 3 | 24 | 45.8% | 100% | 0.54 | -0.54 |
| Level 4 | 24 | 8.3% | 66.7% | 1.25 | -1.25 |
| Level 5 | 24 | 8.3% | 33.3% | 1.67 | -1.67 |
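A per-level breakdown like the table above can be produced by grouping signed errors by the human level. This is a stdlib-only sketch with illustrative names, not the evaluation pipeline itself:

```python
from collections import defaultdict

def per_level_breakdown(human, ai):
    """Per-level exact match, within-one rate, MAE, and bias."""
    groups = defaultdict(list)
    for h, a in zip(human, ai):
        groups[h].append(a - h)  # signed error for this essay
    report = {}
    for level, diffs in sorted(groups.items()):
        n = len(diffs)
        report[level] = {
            "n": n,
            "exact": sum(d == 0 for d in diffs) / n,
            "within_one": sum(abs(d) <= 1 for d in diffs) / n,
            "mae": sum(abs(d) for d in diffs) / n,
            "bias": sum(diffs) / n,
        }
    return report
```

Slicing by level is what exposes the top-end underscoring pattern that an aggregate MAE of 0.83 would otherwise hide.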
Per-Criterion Comparison (Secondary)
Interpretation Caveat
The following table uses the HKEAA holistic level as an approximate per-criterion ground truth. As discussed in the methodology section, this is an imperfect proxy.
These numbers provide directional signal about where the AI may systematically over- or underscore, but should not be treated as precise accuracy figures.
| Criterion | QWK | Bias | Interpretation |
|---|---|---|---|
| Content | 0.746 | -0.27 | Substantial agreement; slight underscoring |
| Language & Style | 0.522 | -0.68 | Moderate agreement; notable underscoring |
| Organisation | 0.558 | -0.56 | Moderate agreement; moderate underscoring |
| Overall (criterion average) | 0.607 | -0.50 | Moderate agreement overall |
Note: Per-criterion ground truth is approximated from holistic levels. Discrepancies may reflect the AI correctly identifying uneven criterion strengths rather than scoring errors.
What These Results Mean for You
For Students
Four out of five AI predictions land within one level of the official HKEAA grade. If the AI predicts Level 3, the HKEAA would very likely assign Level 2, 3, or 4.
Accuracy Varies by Level
The AI performs differently across the level spectrum:
- Strongest at Levels 1-3 — Within-one accuracy is 100%
- Weakest at Level 5 — Within-one drops to 33.3%
- Tendency to underscore — Especially for high-quality work
If you're a strong writer and the AI gives you a Level 3, your real level may well be higher. This is an active area of improvement.
How to Use the Feedback
Use the paragraph-by-paragraph feedback to understand the criteria and identify patterns in your writing.
The qualitative feedback is often more useful than the level number itself. Don't treat the level as a guarantee.
For Teachers
Our level-to-level QWK of 0.558 places EssayHero in the moderate agreement range on the Landis and Koch scale.
For context, trained human markers typically achieve QWK of 0.60-0.80 when scoring independently. We're below that threshold, which means the AI's level assignments should be treated as rough estimates rather than reliable second-marker judgements.
Two Key Patterns
Underscoring bias: The AI underscores by an average of 0.56 levels. This means students are more likely to receive a level below their true performance than above it.
Top-end struggles: The underscoring is concentrated at the top. Level 4-5 essays are frequently predicted as Level 2-3. The AI handles the lower end well but struggles to distinguish good from excellent writing.
Per-Criterion Insights
The per-criterion bias figures reinforce these patterns:
- Language & Style — Largest underscoring (bias of -0.68), suggesting the AI is overly critical of grammar and vocabulary
- Content — Smallest bias (-0.27) and highest criterion-level QWK (0.746, in the substantial agreement range)
We use these signals to refine our prompts. Language & Style calibration is a priority.
Technical Approach
It's worth emphasising that we achieve this level of agreement with prompt engineering alone, using a general-purpose language model.
No fine-tuning, no task-specific training data. We expect accuracy to improve as we refine prompts and potentially move to fine-tuned models.
What the Numbers Don't Tell You
These metrics have important blind spots:
- Your actual DSE score — Real exam marking involves moderation, standardisation, and human judgement that no AI can replicate
- Performance on unusual writing — Our corpus consists of official exemplars. Highly unconventional essays may behave differently
- Per-criterion ground truth — We don't have it. The per-criterion comparisons use the holistic level as an approximation
- How Level 4-5 students should interpret scores — If you're aiming for Levels 4-5, the AI is more likely to underscore your work. Focus on the qualitative feedback rather than the level number
Limitations
I believe in disclosing limitations upfront, not burying them in footnotes.
Scope Limitations
Single exam type — This validation covers HKDSE Paper 2 Part B only. IELTS validation is planned but not yet completed. Results should not be generalised to other exams.
Temporal scope — The corpus spans 2020-2025. The HKDSE exam format was revised in 2024 (from Questions 2-9 to Questions 2-5). Both formats are represented, but we have not yet tested whether the AI performs differently on pre- and post-reform essays.
Levels 1-5 only — The HKEAA does not publish exemplars at Level 5* or 5**. Our corpus therefore cannot test the AI's ability to distinguish between Level 5 and the starred levels. Since the AI uses a 1-7 criterion scale that maps up to Level 7, there is an untested range at the top end.
Measurement Limitations
Holistic ground truth — The HKEAA provides one level per essay, not separate criterion scores. Our primary metric (level-to-level) works within this constraint. Our secondary metric (per-criterion) uses the holistic level as an approximation, which introduces measurement error. We cannot know the true per-criterion accuracy without per-criterion human grades.
AI vs. examiner panel, not AI vs. individual marker — The HKEAA levels represent the definitive judgement of the chief examiner's panel, not an individual marker's opinion. Individual markers have their own variance. We're comparing the AI against the best available ground truth, but this is a higher bar than comparing against a single human rater.
Integer score granularity — Both AI and HKEAA scores are whole numbers. A student who is "borderline Level 3/4" will be forced into one or the other, making exact match harder at boundaries. QWK handles this better than exact match does, which is one reason we use it as the primary statistic.
Statistical Limitations
Sample size — 120 essays across 5 levels (24 per level) is adequate for aggregate metrics but limits the reliability of per-level analysis. The confidence intervals reflect this. We plan to expand the corpus as HKEAA publishes new exemplar booklets.
Continuous Improvement
This is not a one-off validation. The calibration pipeline is automated and runs whenever we update the scoring prompt.
Automated Pipeline
The workflow ensures quality control:
- Submit — All 120 essays to the updated prompt
- Compute — Level-to-level and per-criterion metrics automatically
- Compare — Against the previous prompt version
- Deploy — Only if accuracy is maintained or improved
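The deploy gate in the workflow above can be sketched as a simple comparison of metric dictionaries. The post does not specify the exact gating thresholds, so this is a hypothetical rule under stated assumptions (QWK must not drop, MAE must not grow):

```python
def should_deploy(new: dict, old: dict, tolerance: float = 0.0) -> bool:
    """Gate a prompt update: deploy only if key metrics hold or improve.

    `new`/`old` are metric dicts like {"qwk": 0.558, "mae": 0.83}.
    Thresholds are illustrative; the real pipeline may use different rules.
    """
    qwk_ok = new["qwk"] >= old["qwk"] - tolerance  # agreement must not drop
    mae_ok = new["mae"] <= old["mae"] + tolerance  # average error must not grow
    return qwk_ok and mae_ok

# Example: a change that raises QWK and lowers MAE passes the gate
print(should_deploy({"qwk": 0.558, "mae": 0.83}, {"qwk": 0.539, "mae": 0.85}))  # True
```

A gate like this is what turns the calibration run from a report into a regression test: a prompt edit that degrades accuracy never ships.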
Tight Feedback Loop
This process catches problems before they reach students:
- If a prompt change causes the AI to underscore Level 4 essays, we catch it before deployment
- If the per-criterion bias shifts, we can trace it to the specific prompt edit that caused it
Recent Improvements: Text-Type-Aware Scoring
We recently added text-type conventions to the scoring prompt so the AI can give format-specific feedback (e.g., recognising blog conventions vs formal letter conventions).
Framing Experiments
During calibration, we discovered that the way we framed these conventions significantly affected scoring accuracy:
- Initial approach — Degraded QWK from 0.532 to 0.480
- After testing — Three alternative framings were tested against the full 120-essay corpus
- Final result — The positive-framing approach achieved QWK 0.539, a slight improvement over the baseline
The full story, including experiment methodology and results, is in How We Caught Our AI Being Too Harsh — and Fixed It.
IP Safety Review
After the text-type framing experiments, we also rewrote all exam board band descriptors in original language (replacing verbatim copyrighted text) as part of our intellectual property safety review.
A full recalibration confirmed that this change maintained or slightly improved accuracy — QWK rose from 0.539 to 0.558 — indicating that the scoring system relies on the substance of the descriptors rather than their exact wording.
Living Document
The results on this page will be updated as we refine our prompts and expand the corpus.
References
- Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213-220.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174.
- Shermis, M. D., & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays. In Handbook of Automated Essay Evaluation.
- HKEAA (2020-2025). HKDSE English Language Paper 2: Samples of Candidates' Performance. Hong Kong Examinations and Assessment Authority.
EssayHero is free, has no commercial aims, and is built by a Hong Kong teacher for Hong Kong students. Questions about our methodology? Email hello@essayhero.app.
Related Articles
How EssayHero Marks HKDSE Paper 2 Essays (And Why You Should Know)
A transparent explanation for teachers and tutors of how EssayHero assesses HKDSE English Paper 2 writing, how scoring works, and where AI falls short.
How We Caught Our AI Being Too Harsh — and Fixed It
We added text-type awareness to EssayHero's scoring and accidentally made it harsher. Here's how we discovered the problem, why it happened, and the experiment that fixed it.