Editor's note: This historical analysis used DeepSeek Chat (Level QWK 0.558). EssayHero now uses Google Gemini 3 Flash Preview, achieving Level QWK 0.833 — almost perfect agreement with the HKEAA chief examiner panel. The Level 4-5 underscoring described below has been substantially resolved. See current results.
Key Takeaways
- We added text-type conventions (blog, speech, formal letter, etc.) to make EssayHero's feedback more format-specific
- Calibration showed this accidentally made scoring harsher — accuracy dropped
- The culprit: we framed conventions as faults to find, not qualities to recognise
- We designed three experiments, tested each against 120 official HKEAA essays, and found that positive framing not only fixed the problem but slightly improved accuracy
- The lesson applies beyond AI: how you frame assessment criteria changes the assessment itself
The Hypothesis
HKDSE Paper 2 asks students to write in specific text types — blog entries, formal letters, speeches, proposals, short stories, reports, and more. Each text type has distinct conventions:
- Blogs should have a personal voice and direct reader address
- Formal letters need the correct salutation-closing pair
- Speeches should use rhetorical devices and oral signposts
These aren't minor stylistic preferences. They're what markers look for. A well-argued essay written in the wrong register for a blog will score lower on Organisation than the same quality of argument formatted correctly.
Building the Text-Type System
We built a comprehensive text-type conventions system covering 10 HKDSE text types and 2 IELTS formats. For each, we documented the expected register, key features, paragraph guidance, and common student mistakes.
The idea was simple: if we tell the AI what markers expect for a specific text type, the AI should be able to give more accurate, format-specific feedback.
We expected this to make scoring better. It did the opposite.
The Test
We maintain a validation corpus of 120 official HKEAA exemplar essays — 24 per level, spanning Levels 1 to 5, covering exam years 2020-2025. These are the same essays the HKEAA publishes to train and calibrate human markers across Hong Kong. They are the gold standard.
(For full details on our validation methodology, corpus, and metrics, see How We Validate Our Scores.)
Baseline Performance
Before adding text-type conventions, our baseline accuracy looked like this:
| Metric | Baseline (no text-type injection) |
|---|---|
| Level QWK | 0.532 (Moderate agreement) |
| Within-One Rate | 80.8% |
| Bias | -0.46 (AI slightly underscores) |
All 120 essays were scored using deepseek-chat with our production prompt (v1.1.2) at baseline strictness.
What is Level QWK?
Level QWK (Quadratic Weighted Kappa) is our primary metric — it measures agreement between AI and human raters, corrected for chance, with larger disagreements penalised more heavily than small ones.
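The metric can be sketched in pure Python. This is a minimal illustration, not EssayHero's actual implementation; the scale bounds are assumed to be HKDSE Levels 1-5.

```python
def quadratic_weighted_kappa(human, ai, min_level=1, max_level=5):
    """Quadratic Weighted Kappa between two raters on an ordinal scale.

    Disagreements are penalised by the squared distance between levels,
    and the score is corrected for chance agreement via each rater's
    marginal distribution. 1.0 is perfect agreement; 0 is chance level.
    """
    n = max_level - min_level + 1
    total = len(human)

    # Observed joint distribution of (human level, AI level) pairs
    obs = [[0.0] * n for _ in range(n)]
    for h, a in zip(human, ai):
        obs[h - min_level][a - min_level] += 1

    # Marginal distributions of each rater (used for chance agreement)
    row = [sum(obs[i]) / total for i in range(n)]
    col = [sum(obs[i][j] for i in range(n)) / total for j in range(n)]

    num = den = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty: bigger level gaps cost quadratically more
            w = (i - j) ** 2 / (n - 1) ** 2
            num += w * obs[i][j] / total   # observed disagreement
            den += w * row[i] * col[j]     # disagreement expected by chance
    return 1 - num / den
```

Two raters who always agree score 1.0; systematic large disagreements push the value toward (and below) zero.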
Running the Comparison
After adding text-type conventions, we ran calibration again. Same 120 essays, same model, same everything except the prompt now included format-specific guidance.
The Surprise
| Metric | Before (baseline) | After (text-type conventions) | Change |
|---|---|---|---|
| Level QWK | 0.532 | 0.48 | -0.052 (worse) |
| Within-One Rate | 80.8% | 78.3% | -2.5pp |
| Level Bias | -0.46 | -0.52 | More underscoring |
The AI was now less accurate. Not by a trivial amount — QWK dropped from 0.532 to 0.48, and the underscoring bias increased.
Where the Damage Was Concentrated
When we looked at the per-level breakdown, the damage was concentrated at the top end. Level 4 and Level 5 essays were being scored as Level 2 and Level 3.
The AI was treating strong writing as mediocre.
Something in the text-type conventions was priming the AI to be harsher. But what?
The Culprit: Deficit Framing
We re-read the text we were injecting into the prompt. Here's what it said (emphasis added):
When evaluating this blog entry, pay special attention to these format-specific expectations...
Common student mistakes to watch for:
- Writing in a formal essay style instead of a personal, conversational tone
- Failing to express personal opinions or experiences...
...note where the writing meets or falls short of blog entry expectations.
The Problem With Fault-Finding Language
The language was oriented around detecting problems:
- "Mistakes to watch for"
- "Falls short of expectations"
- "Pay special attention" — which in context meant "look harder for faults"
This is a well-understood dynamic in education. Assessment researchers distinguish between deficit-based and strengths-based approaches to evaluation:
- Deficit-based assessment asks: "What is the student missing? Where did they go wrong?"
- Strengths-based assessment asks: "What has the student demonstrated? What are they doing well?"
Both approaches can identify the same issues, but they produce different evaluative behaviour. A marker primed to find faults will find more faults. A marker primed to recognise quality will be more balanced in their judgement.
The Inadvertent Effect
We had, without realising it, built a deficit-based assessment frame into the prompt. The AI dutifully found faults — in essays where the HKEAA's chief examiner panel had found excellence.
The Experiment
Rather than simply reverting the change, we wanted to understand the mechanism. We designed three alternative framings of the same text-type conventions and tested each against the full 120-essay calibration set.
Three Variants
Experiment A — Positive framing
Reframed conventions as qualities to recognise and reward. Removed the mistakes list entirely. Instead of "watch for these problems," the prompt said "when you see these features, give the student credit."
Experiment B — Minimal injection
Stripped the conventions down to two lines: one naming the text type, one giving the expected register and tone. No features list, no paragraph guidance, no mistakes. We wanted to test whether a gentle nudge was better than detailed instruction.
Experiment C — Full conventions with a scoring guard
Kept the detailed convention information but added an explicit instruction: "These conventions guide your FEEDBACK COMMENTS to be format-specific and helpful. They should NOT cause you to lower scores." We wanted to test whether the AI could be told to use conventions for feedback without letting them affect scoring.
Experimental Setup
Each experiment used the same setup:
- All 120 HKEAA exemplars
- deepseek-chat model
- Production prompt
- Baseline strictness
The only variable was how the text-type conventions were framed.
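The run itself can be sketched as a small harness. Here `score_fn` is a hypothetical stand-in for the production scoring call under whichever prompt variant is being tested; the real pipeline computes more metrics than shown.

```python
def run_calibration(corpus, score_fn):
    """Score every exemplar and compare against the published HKEAA level.

    `corpus` is a list of (essay_text, human_level) pairs; `score_fn`
    maps an essay to an AI-assigned level. The corpus and model stay
    fixed, so the only variable between runs is the prompt framing
    baked into `score_fn`.
    """
    human = [level for _, level in corpus]
    ai = [score_fn(text) for text, _ in corpus]
    diffs = [a - h for h, a in zip(human, ai)]
    n = len(diffs)
    return {
        "bias": sum(diffs) / n,                        # mean signed error
        "mae": sum(abs(d) for d in diffs) / n,         # mean absolute error
        "exact_match": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
    }
```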
The Results
Overall Performance Comparison
| Variant | Level QWK | Exact Match | Within-One | Level MAE | Level Bias |
|---|---|---|---|---|---|
| Experiment A (positive framing) | 0.539 | 37.5% | 80.8% | 0.83 | -0.48 |
| Baseline (no injection) | 0.532 | 38.3% | 80.8% | 0.83 | -0.46 |
| Experiment B (minimal) | 0.52 | 36.7% | 80.8% | 0.84 | -0.46 |
| Original (deficit framing) | 0.48 | 33.3% | 78.3% | 0.91 | -0.52 |
| Experiment C (guard clause) | 0.477 | 33.3% | 80.8% | 0.89 | -0.47 |
Per-Criterion Bias
| Criterion | Exp A (positive) | Exp B (minimal) | Exp C (guard) |
|---|---|---|---|
| Content | -0.15 | -0.14 | -0.17 |
| Language & Style | -0.66 | -0.67 | -0.71 |
| Organisation | -0.49 | -0.47 | -0.47 |
Level 4-5 Underscoring (Key Diagnostic)
The most revealing comparison is how each variant handles strong essays — Level 4 and Level 5 — where underscoring is the dominant error.
| Criterion Bias (L4-5 essays only) | Exp A (positive) | Exp B (minimal) | Exp C (guard) |
|---|---|---|---|
| Content | -0.96 | -0.94 | -1.04 |
| Language & Style | -1.63 | -1.60 | -1.67 |
| Organisation | -1.46 | -1.50 | -1.48 |
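The band-restricted bias in the table above is simply the mean signed error computed over the subset of essays the HKEAA placed in that band. A minimal sketch, assuming scores are stored as (human, ai) pairs:

```python
def bias_for_levels(rows, levels):
    """Mean signed error restricted to essays whose human level is in `levels`.

    `rows` is a list of (human_level, ai_level) pairs. A strongly
    negative value means the AI systematically underscores that band.
    """
    subset = [(h, a) for h, a in rows if h in levels]
    return sum(a - h for h, a in subset) / len(subset)
```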
What We Learned
Three findings stand out.
1. Positive Framing Is the Only Variant That Improved Accuracy
Experiment A achieved QWK 0.539 — slightly above the 0.532 baseline with no text-type injection at all. This means we can have text-type-aware feedback (which gives students genuinely useful format-specific guidance) without sacrificing scoring accuracy.
In fact, we gain a small amount.
The improvement is modest. With 120 essays and bootstrap confidence intervals, the difference between 0.532 and 0.539 is within noise. But the direction is consistent, and crucially, it's the only variant that moved accuracy in the right direction.
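A paired bootstrap of this kind can be sketched as follows. It is generic over the metric; the resample count and the 95% interval are illustrative choices, not necessarily our exact procedure.

```python
import random

def bootstrap_diff(human, ai_a, ai_b, metric, n_boot=2000, seed=0):
    """Paired bootstrap 95% interval for metric(A) - metric(B).

    Resamples essays with replacement, keeping each essay's human score
    and both AI scores together. If the resulting interval straddles
    zero, the observed gap is within noise.
    """
    rng = random.Random(seed)
    n = len(human)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [ai_a[i] for i in idx]
        b = [ai_b[i] for i in idx]
        diffs.append(metric(h, a) - metric(h, b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

Pairing matters: resampling essays (rather than bootstrapping each variant's scores independently) preserves the per-essay correlation between variants, which tightens the interval on the difference.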
We adopted it as our production prompt.
2. More Detail Makes Scoring Worse, Not Better
The ranking tells a clear story:
- Positive framing (QWK 0.539) — key features only, framed as qualities to recognise
- No injection (QWK 0.532) — the model uses its own knowledge
- Minimal injection (QWK 0.52) — just register and tone
- Deficit framing (QWK 0.48) — features + mistakes + "falls short" language
- Guard clause (QWK 0.477) — full details + "don't lower scores"
The more convention detail we included, the worse the scoring became. This held true even when we explicitly told the model not to let conventions affect scores (Experiment C).
The Checklist Effect
The detailed convention list functions as a checklist. The model evaluates each item, and any miss becomes a penalty — regardless of what the surrounding instructions say.
3. Guard Clauses Do Not Work
Experiment C was our most instructive failure. We told the model:
"These conventions guide your FEEDBACK COMMENTS to be format-specific and helpful. They should NOT cause you to lower scores. A Level 5 essay that slightly deviates from format expectations is still a Level 5 essay."
The model scored QWK 0.477 — worse than the deficit-framed version (0.48) and significantly worse than baseline (0.532). The guard clause was not just ineffective; it may have made things slightly worse by drawing even more attention to the conventions as evaluative criteria.
The lesson: Language models do not compartmentalise instructions the way humans might. You cannot say "here is a detailed checklist" and then "but don't use it for scoring." The checklist's presence in the prompt shapes the model's evaluative behaviour regardless of surrounding instructions.
The Pedagogical Parallel
There is a direct parallel here to educational assessment practice.
How Human Markers Are Affected by Framing
When markers are trained with deficit-oriented rubrics — "deduct marks for X, watch for error Y, note the absence of Z" — they tend to score more harshly than markers trained with the same criteria framed positively — "award marks when you see X, recognise achievement in Y, credit the presence of Z."
The content of the rubric is identical. The criteria are the same. What changes is the evaluative stance: are you looking for what's missing, or recognising what's present?
The Same Dynamic in AI Scoring
We found the same dynamic in our AI scoring. The model we use (deepseek-chat) is, at its core, following instructions:
- When those instructions orient it toward finding faults, it finds faults — even in essays that Hong Kong's chief examiner panel rated as excellent
- When those instructions orient it toward recognising quality, it produces scores that align more closely with expert human judgement
A Broader Lesson
This isn't a quirk of one model or one prompt. It reflects something fundamental about how evaluative framing shapes evaluative outcomes. It's worth being aware of for anyone using AI in assessment contexts.
What's Still Hard
We want to be transparent about the limits of this improvement.
The Persistent Level 4-5 Underscoring Problem
The Level 4-5 underscoring problem persists across all variants. Even with positive framing, the AI underscores Language & Style by an average of 1.63 points and Organisation by 1.46 points for the strongest essays.
The AI handles the lower end of the scale well (Levels 1-3 are consistently within one level of the HKEAA grade) but struggles to distinguish good from excellent writing.
Beyond Prompt Engineering
This is not a prompt framing problem. It's a deeper limitation of the model's calibration. The AI appears to have an internal ceiling for how highly it's willing to score, and that ceiling sits below where the HKEAA places Level 4-5 work.
Addressing this likely requires techniques beyond prompt engineering — an area we're actively exploring.
If You're Aiming for Levels 4-5
If you're a strong writer aiming for Levels 4-5, be aware that EssayHero's level predictions may underestimate your performance. The qualitative feedback (paragraph-by-paragraph analysis, text-type-specific suggestions) is more reliable than the level number at the top end of the scale.
How We Work
This episode illustrates our approach to accuracy. We don't treat the scoring system as "done." Every change to the prompt — no matter how sensible it seems in theory — gets tested against the same 120 HKEAA exemplars before it reaches students.
Our Testing Process
- Hypothesise — We thought text-type conventions would improve accuracy
- Test — We ran calibration and found it made things worse
- Investigate — We identified deficit framing as the likely cause
- Experiment — We designed three alternatives and tested each against the full corpus
- Adopt what works — Positive framing won. We deployed it
- Disclose — You're reading this
If a change improves accuracy, we ship it. If it doesn't, we revert it — or, as in this case, find a better way to achieve the original goal.
The calibration pipeline catches problems before students see them.
Why Transparency Matters
We think this kind of methodological transparency is important. If you're trusting an AI to give you feedback on your writing, you should know how seriously the people behind it take accuracy.
Not as a marketing claim. As a practice.
Where We Are Now (February 2026 Update)
Since this analysis was conducted, EssayHero migrated from DeepSeek Chat to Google Gemini 3 Flash Preview. The results speak for themselves:
- Level QWK: 0.558 (DeepSeek) → 0.833 (Gemini) — from moderate to almost perfect agreement
- Level 4 exact match: 8.3% → 62.5%
- Level 5 exact match: 8.3% → 29.2%
- Within-one rate: 80% → 98.3%
- Bias: -0.56 → -0.08 (nearly zero systematic error)
The persistent Level 4-5 underscoring problem described in this post — where the AI scored strong essays as mediocre — has been substantially resolved. The lessons about framing effects remain valid and continue to inform our prompt design.
For teachers with predominantly strong classes, a "Thorough scoring" mode uses deeper model reasoning that achieves 87.5% exact match on Level 4 and 66.7% on Level 5.
For current results and methodology, see How We Validate Our Scores.
EssayHero is free, has no commercial aims, and is built by a Hong Kong teacher for Hong Kong students. Questions about our methodology? Email hello@essayhero.app.
Related Articles
How We Validate Our Scores
Our methodology for testing AI scoring accuracy against 120 official HKEAA exemplar essays, with full results and limitations.
How EssayHero Marks HKDSE Paper 2 Essays (And Why You Should Know)
A transparent explanation for teachers and tutors of how EssayHero assesses HKDSE English Paper 2 writing, how scoring works, and where AI falls short.