Editor's note: This historical analysis used DeepSeek Chat (Level QWK 0.558). EssayHero now uses Google Gemini 3 Flash Preview, achieving Level QWK 0.833 — almost perfect agreement with the HKEAA chief examiner panel. The Level 4-5 underscoring described below has been substantially resolved. See current results.
Key Takeaways
- We added text-type conventions (blog, speech, formal letter, etc.) to make EssayHero's feedback more format-specific
- Calibration showed this accidentally made scoring harsher — accuracy dropped
- The culprit: we framed conventions as faults to find, not qualities to recognise
- We designed three experiments, tested each against 120 official HKEAA essays, and found that positive framing not only fixed the problem but slightly improved accuracy
- The lesson applies beyond AI: how you frame assessment criteria changes the assessment itself
The Hypothesis
HKDSE Paper 2 asks students to write in specific text types — blog entries, formal letters, speeches, proposals, short stories, reports, and more. Each text type has distinct conventions:
- Blogs should have a personal voice and direct reader address
- Formal letters need the correct salutation-closing pair
- Speeches should use rhetorical devices and oral signposts
These aren't minor stylistic preferences. They're what markers look for. A well-argued essay written in the wrong register for a blog will score lower on Organisation than the same quality of argument formatted correctly.
Building the Text-Type System
We built a comprehensive text-type conventions system covering 10 HKDSE text types and 2 IELTS formats. For each, we documented the expected register, key features, paragraph guidance, and common student mistakes.
The idea was simple: if we tell the AI what markers expect for a specific text type, the AI should be able to give more accurate, format-specific feedback.
We expected this to make scoring better. It did the opposite.
The Test
We maintain a validation corpus of 120 official HKEAA exemplar essays — 24 per level, spanning Levels 1 to 5, covering exam years 2020-2025. These are the same essays the HKEAA publishes to train and calibrate human markers across Hong Kong. They are the gold standard.
(For full details on our validation methodology, corpus, and metrics, see How We Validate Our Scores.)
Baseline Performance
Before adding text-type conventions, our baseline accuracy looked like this:
| Metric | Baseline (no text-type injection) |
|---|---|
| Level QWK | 0.532 (Moderate agreement) |
| Within-One Rate | 80.8% |
| Bias | -0.46 (AI slightly underscores) |
All 120 essays were scored using deepseek-chat with our production prompt (v1.1.2) at baseline strictness.
What is Level QWK?
Level QWK (Quadratic Weighted Kappa) is our primary metric — it measures agreement between AI and human raters, corrected for chance, with larger disagreements penalised more heavily than small ones.
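The metric can be sketched in pure Python. This is a minimal illustration, not EssayHero's actual implementation; the scale bounds are assumed to be HKDSE Levels 1-5.

```python
def quadratic_weighted_kappa(human, ai, min_level=1, max_level=5):
    """Quadratic Weighted Kappa between two raters on an ordinal scale.

    Disagreements are penalised by the squared distance between levels,
    and the score is corrected for chance agreement via each rater's
    marginal distribution. 1.0 is perfect agreement; 0 is chance level.
    """
    n = max_level - min_level + 1
    total = len(human)

    # Observed joint distribution of (human level, AI level) pairs
    obs = [[0.0] * n for _ in range(n)]
    for h, a in zip(human, ai):
        obs[h - min_level][a - min_level] += 1

    # Marginal distributions of each rater (used for chance agreement)
    row = [sum(obs[i]) / total for i in range(n)]
    col = [sum(obs[i][j] for i in range(n)) / total for j in range(n)]

    num = den = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty: bigger level gaps cost quadratically more
            w = (i - j) ** 2 / (n - 1) ** 2
            num += w * obs[i][j] / total   # observed disagreement
            den += w * row[i] * col[j]     # disagreement expected by chance
    return 1 - num / den
```

Two raters who always agree score 1.0; systematic large disagreements push the value toward (and below) zero.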
Running the Comparison
After adding text-type conventions, we ran calibration again. Same 120 essays, same model, same everything except the prompt now included format-specific guidance.
The Surprise
| Metric | Before (baseline) | After (text-type conventions) | Change |
|---|---|---|---|
| Level QWK | 0.532 | 0.48 | -0.052 (worse) |
| Within-One Rate | 80.8% | 78.3% | -2.5pp |
| Level Bias | -0.46 | -0.52 | More underscoring |
The AI was now less accurate. Not by a trivial amount — QWK dropped from 0.532 to 0.48, and the underscoring bias increased.
Where the Damage Was Concentrated
When we looked at the per-level breakdown, the damage was concentrated at the top end. Level 4 and Level 5 essays were being scored as Level 2 and Level 3.
The AI was treating strong writing as mediocre.
Something in the text-type conventions was priming the AI to be harsher. But what?
The Culprit: Deficit Framing
We re-read the text we were injecting into the prompt. Here's what it said (emphasis added):
When evaluating this blog entry, pay special attention to these format-specific expectations...
Common student mistakes to watch for:
- Writing in a formal essay style instead of a personal, conversational tone
- Failing to express personal opinions or experiences...
...note where the writing meets or falls short of blog entry expectations.
The Problem With Fault-Finding Language
The language was oriented around detecting problems:
- "Mistakes to watch for"
- "Falls short of expectations"
- "Pay special attention" — which in context meant "look harder for faults"
This is a well-understood dynamic in education. Assessment researchers distinguish between deficit-based and strengths-based approaches to evaluation:
- Deficit-based assessment asks: "What is the student missing? Where did they go wrong?"
- Strengths-based assessment asks: "What has the student demonstrated? What are they doing well?"
Both approaches can identify the same issues, but they produce different evaluative behaviour. A marker primed to find faults will find more faults. A marker primed to recognise quality will be more balanced in their judgement.
The Inadvertent Effect
We had, without realising it, built a deficit-based assessment frame into the prompt. The AI dutifully found faults — in essays where the HKEAA's chief examiner panel had found excellence.
The Experiment
Rather than simply reverting the change, we wanted to understand the mechanism. We designed three alternative framings of the same text-type conventions and tested each against the full 120-essay calibration set.
Three Variants
Experiment A — Positive framing
Reframed conventions as qualities to recognise and reward. Removed the mistakes list entirely. Instead of "watch for these problems," the prompt said "when you see these features, give the student credit."
Experiment B — Minimal injection
Stripped the conventions down to two lines: one naming the text type, one giving the expected register and tone. No features list, no paragraph guidance, no mistakes. We wanted to test whether a gentle nudge was better than detailed instruction.
Experiment C — Full conventions with a scoring guard
Kept the detailed convention information but added an explicit instruction: "These conventions guide your FEEDBACK COMMENTS to be format-specific and helpful. They should NOT cause you to lower scores." We wanted to test whether the AI could be told to use conventions for feedback without letting them affect scoring.
Experimental Setup
Each experiment used the same setup:
- All 120 HKEAA exemplars
- deepseek-chat model
- Production prompt
- Baseline strictness
The only variable was how the text-type conventions were framed.
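The run itself can be sketched as a small harness. Here `score_fn` is a hypothetical stand-in for the production scoring call under whichever prompt variant is being tested; the real pipeline computes more metrics than shown.

```python
def run_calibration(corpus, score_fn):
    """Score every exemplar and compare against the published HKEAA level.

    `corpus` is a list of (essay_text, human_level) pairs; `score_fn`
    maps an essay to an AI-assigned level. The corpus and model stay
    fixed, so the only variable between runs is the prompt framing
    baked into `score_fn`.
    """
    human = [level for _, level in corpus]
    ai = [score_fn(text) for text, _ in corpus]
    diffs = [a - h for h, a in zip(human, ai)]
    n = len(diffs)
    return {
        "bias": sum(diffs) / n,                        # mean signed error
        "mae": sum(abs(d) for d in diffs) / n,         # mean absolute error
        "exact_match": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
    }
```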
The Results
Overall Performance Comparison
| Variant | Level QWK | Exact Match | Within-One | Level MAE | Level Bias |
|---|---|---|---|---|---|
| Experiment A (positive framing) | 0.539 | 37.5% | 80.8% | 0.83 | -0.48 |
| Baseline (no injection) | 0.532 | 38.3% | 80.8% | 0.83 | -0.46 |
| Experiment B (minimal) | 0.52 | 36.7% | 80.8% | 0.84 | -0.46 |
| Original (deficit framing) | 0.48 | 33.3% | 78.3% | 0.91 | -0.52 |
| Experiment C (guard clause) | 0.477 | 33.3% | 80.8% | 0.89 | -0.47 |
Per-Criterion Bias
| Criterion | Exp A (positive) | Exp B (minimal) | Exp C (guard) |
|---|---|---|---|
| Content | -0.15 | -0.14 | -0.17 |
| Language & Style | -0.66 | -0.67 | -0.71 |
| Organisation | -0.49 | -0.47 | -0.47 |
Level 4-5 Underscoring (Key Diagnostic)
The most revealing comparison is how each variant handles strong essays — Level 4 and Level 5 — where underscoring is the dominant error.
| Criterion Bias (L4-5 essays only) | Exp A (positive) | Exp B (minimal) | Exp C (guard) |
|---|---|---|---|
| Content | -0.96 | -0.94 | -1.04 |
| Language & Style | -1.63 | -1.60 | -1.67 |
| Organisation | -1.46 | -1.50 | -1.48 |
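The band-restricted bias in the table above is simply the mean signed error computed over the subset of essays the HKEAA placed in that band. A minimal sketch, assuming scores are stored as (human, ai) pairs:

```python
def bias_for_levels(rows, levels):
    """Mean signed error restricted to essays whose human level is in `levels`.

    `rows` is a list of (human_level, ai_level) pairs. A strongly
    negative value means the AI systematically underscores that band.
    """
    subset = [(h, a) for h, a in rows if h in levels]
    return sum(a - h for h, a in subset) / len(subset)
```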
What We Learned
Three findings stand out.
1. Positive Framing Is the Only Variant That Improved Accuracy
Experiment A achieved QWK 0.539 — slightly above the 0.532 baseline with no text-type injection at all. This means we can have text-type-aware feedback (which gives students genuinely useful format-specific guidance) without sacrificing scoring accuracy.
In fact, we gain a small amount.
The improvement is modest. With 120 essays and bootstrap confidence intervals, the difference between 0.532 and 0.539 is within noise. But the direction is consistent, and crucially, it's the only variant that moved accuracy in the right direction.
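A paired bootstrap of this kind can be sketched as follows. It is generic over the metric; the resample count and the 95% interval are illustrative choices, not necessarily our exact procedure.

```python
import random

def bootstrap_diff(human, ai_a, ai_b, metric, n_boot=2000, seed=0):
    """Paired bootstrap 95% interval for metric(A) - metric(B).

    Resamples essays with replacement, keeping each essay's human score
    and both AI scores together. If the resulting interval straddles
    zero, the observed gap is within noise.
    """
    rng = random.Random(seed)
    n = len(human)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [ai_a[i] for i in idx]
        b = [ai_b[i] for i in idx]
        diffs.append(metric(h, a) - metric(h, b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

Pairing matters: resampling essays (rather than bootstrapping each variant's scores independently) preserves the per-essay correlation between variants, which tightens the interval on the difference.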
We adopted it as our production prompt.
2. More Detail Makes Scoring Worse, Not Better
The ranking tells a clear story:
- Positive framing (QWK 0.539) — key features only, framed as qualities to recognise
- No injection (QWK 0.532) — the model uses its own knowledge
- Minimal injection (QWK 0.52) — just register and tone
- Deficit framing (QWK 0.48) — features + mistakes + "falls short" language
- Guard clause (QWK 0.477) — full details + "don't lower scores"
The more convention detail we included, the worse the scoring became. This held true even when we explicitly told the model not to let conventions affect scores (Experiment C).
The Checklist Effect
The detailed convention list functions as a checklist. The model evaluates each item, and any miss becomes a penalty — regardless of what the surrounding instructions say.
3. Guard Clauses Do Not Work
Experiment C was our most instructive failure. We told the model:
"These conventions guide your FEEDBACK COMMENTS to be format-specific and helpful. They should NOT cause you to lower scores. A Level 5 essay that slightly deviates from format expectations is still a Level 5 essay."
The model scored QWK 0.477 — worse than the deficit-framed version (0.48) and significantly worse than baseline (0.532). The guard clause was not just ineffective; it may have made things slightly worse by drawing even more attention to the conventions as evaluative criteria.
The lesson: Language models do not compartmentalise instructions the way humans might. You cannot say "here is a detailed checklist" and then "but don't use it for scoring." The checklist's presence in the prompt shapes the model's evaluative behaviour regardless of surrounding instructions.
The Pedagogical Parallel
There is a direct parallel here to educational assessment practice.
How Human Markers Are Affected by Framing
When markers are trained with deficit-oriented rubrics — "deduct marks for X, watch for error Y, note the absence of Z" — they tend to score more harshly than markers trained with the same criteria framed positively — "award marks when you see X, recognise achievement in Y, credit the presence of Z."
The content of the rubric is identical. The criteria are the same. What changes is the evaluative stance: are you looking for what's missing, or recognising what's present?
The Same Dynamic in AI Scoring
We found the same dynamic in our AI scoring. The model we use (deepseek-chat) is, at its core, following instructions:
- When those instructions orient it toward finding faults, it finds faults — even in essays that Hong Kong's chief examiner panel rated as excellent
- When those instructions orient it toward recognising quality, it produces scores that align more closely with expert human judgement
A Broader Lesson
This isn't a quirk of one model or one prompt. It reflects something fundamental about how evaluative framing shapes evaluative outcomes. It's worth being aware of for anyone using AI in assessment contexts.
What's Still Hard
We want to be transparent about the limits of this improvement.
The Persistent Level 4-5 Underscoring Problem
The Level 4-5 underscoring problem persists across all variants. Even with positive framing, the AI underscores Language & Style by an average of 1.63 points and Organisation by 1.46 points for the strongest essays.
The AI handles the lower end of the scale well (Levels 1-3 are consistently within one level of the HKEAA grade) but struggles to distinguish good from excellent writing.
Beyond Prompt Engineering
This is not a prompt framing problem. It's a deeper limitation of the model's calibration. The AI appears to have an internal ceiling for how highly it's willing to score, and that ceiling sits below where the HKEAA places Level 4-5 work.
Addressing this likely requires techniques beyond prompt engineering — an area we're actively exploring.
If You're Aiming for Levels 4-5
If you're a strong writer aiming for Levels 4-5, be aware that EssayHero's level predictions may underestimate your performance. The qualitative feedback (paragraph-by-paragraph analysis, text-type-specific suggestions) is more reliable than the level number at the top end of the scale.
How We Work
This episode illustrates our approach to accuracy. We don't treat the scoring system as "done." Every change to the prompt — no matter how sensible it seems in theory — gets tested against the same 120 HKEAA exemplars before it reaches students.
Our Testing Process
- Hypothesise — We thought text-type conventions would improve accuracy
- Test — We ran calibration and found it made things worse
- Investigate — We identified deficit framing as the likely cause
- Experiment — We designed three alternatives and tested each against the full corpus
- Adopt what works — Positive framing won. We deployed it
- Disclose — You're reading this
If a change improves accuracy, we ship it. If it doesn't, we revert it — or, as in this case, find a better way to achieve the original goal.
The calibration pipeline catches problems before students see them.
Why Transparency Matters
We think this kind of methodological transparency is important. If you're trusting an AI to give you feedback on your writing, you should know how seriously the people behind it take accuracy.
Not as a marketing claim. As a practice.
Where We Are Now (February 2026 Update)
Since this analysis was conducted, EssayHero migrated from DeepSeek Chat to Google Gemini 3 Flash Preview. The results speak for themselves:
- Level QWK: 0.558 (DeepSeek) → 0.833 (Gemini) — from moderate to almost perfect agreement
- Level 4 exact match: 8.3% → 62.5%
- Level 5 exact match: 8.3% → 29.2%
- Within-one rate: 80% → 98.3%
- Bias: -0.56 → -0.08 (nearly zero systematic error)
The persistent Level 4-5 underscoring problem described in this post — where the AI scored strong essays as mediocre — has been substantially resolved. The lessons about framing effects remain valid and continue to inform our prompt design.
For teachers with predominantly strong classes, a "Thorough scoring" mode uses deeper model reasoning that achieves 87.5% exact match on Level 4 and 66.7% on Level 5.
For current results and methodology, see How We Validate Our Scores.
EssayHero is free, has no commercial aims, and is built by a Hong Kong teacher for Hong Kong students. Questions about our methodology? Email hello@essayhero.app.
Related Articles
How We Validate Our Scores
Our methodology for testing AI scoring accuracy against 120 official HKEAA exemplar essays, with full results and limitations.
How EssayHero Marks HKDSE Paper 2 Essays (And Why You Should Know)
A transparent explanation for teachers and tutors of how EssayHero assesses HKDSE English Paper 2 writing, how scoring works, and where AI falls short.