Key Takeaways
EssayHero now supports university-level psychology essay assessment with four discipline-specific criteria scored 0-25 each (total 0-100).
The assessment is built around psychological conventions (critical evaluation of research, theoretical application, empirical evidence use) rather than generic "good writing" metrics.
The AI is explicitly instructed not to comment on whether cited studies exist or whether reported statistical findings are accurate, because it has no access to journal databases and cannot check.
This is a formative feedback tool, not a substitute for lecturer marking.
You're Probably Sceptical. Good.
If you're a psychology lecturer reading this, you've likely had the same reaction most academics have when someone says "AI essay feedback": a mix of wariness and mild irritation. You've spent years developing the expertise to assess critical evaluation, methodological reasoning, and theoretical sophistication.
You know the difference between a student who understands Baddeley's working memory model and one who can merely name it. The idea that software could do what you do is, at best, implausible.
I'm not going to argue with that. EssayHero can't do what you do.
What This Post Is
But it might be able to do something useful alongside what you do. And if you're going to consider recommending it to your students, you deserve to know exactly how it works, what it looks for, and where it falls short.
This post is that explanation.
Who Built This
I'm Joseph Lin. I've been marking essays for over twenty years, from primary school through to PhD dissertations.
I built EssayHero originally for HKDSE students in Hong Kong who weren't getting enough feedback between assignments. It's free, it has no commercial aims, and it's expanded to university level because lecturers asked for it.
A psychology professor colleague posed two direct questions: "Can it actually tell if students are engaging critically with the literature?" and "Does it know what APA-style argumentation looks like?"
This post answers both.
How University Assessment Differs from Exam-Board Marking
Standardised Exams Are Easier
EssayHero's original configurations were built for standardised exams: HKDSE, IELTS, Cambridge IGCSE. These exams have published marking criteria, official band descriptors, and examiner-graded exemplar essays.
The AI's job is to apply those criteria consistently. The criteria are fixed, the mark schemes are public, and calibration is straightforward because you can validate against official scores.
University Psychology Is Different
University psychology essays are a different problem. There is no single published rubric that every psychology department uses.
Expectations vary by institution, by module, and by year of study. A first-year introductory essay on memory and a third-year critical review of neuroimaging methodology require fundamentally different skills.
The "right answer" isn't a band on a scale but a demonstration of psychological reasoning grounded in empirical evidence.
Our Approach
So building a university psychology configuration meant starting from different assumptions. Instead of replicating an exam board's mark scheme, we built assessment criteria around what experienced psychology lecturers consistently look for across institutions:
- Critical analysis — evaluating research, not just describing it
- Evidence integration — using empirical sources to build arguments
- Theoretical application — applying psychological frameworks as analytical tools
- Scientific writing — precision, clarity, and appropriate register
These aren't generic. They're informed by how psychology is actually taught and assessed as an empirical discipline.
The Scoring Scale
The scoring scale is different too. Instead of HKDSE's 1-7 per criterion or IELTS's 0-9 bands, psychology essays are scored 0-25 on each of four criteria, totalling 0-100.
This mirrors the percentage-based marking that most university psychology departments use.
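To make the arithmetic concrete, here is a minimal sketch in TypeScript. The names (`CriterionScores`, `totalScore`) are mine for this post, not EssayHero's actual code:

```typescript
// Hypothetical illustration of the psychology scoring scale.
// Names and types are assumptions for this post, not EssayHero's real code.
type Criterion =
  | "Critical Analysis"
  | "Research and Evidence"
  | "Theoretical Application"
  | "Academic Writing";

// Each criterion is scored 0-25; the four together give 0-100,
// which maps directly onto percentage-based university marking.
type CriterionScores = Record<Criterion, number>;

function totalScore(scores: CriterionScores): number {
  const values = Object.values(scores);
  if (values.some((s) => s < 0 || s > 25)) {
    throw new RangeError("Each criterion is scored 0-25");
  }
  return values.reduce((sum, s) => sum + s, 0);
}

// Example: 18 + 17 + 15 + 20 = 70, i.e. a 70% essay.
console.log(
  totalScore({
    "Critical Analysis": 18,
    "Research and Evidence": 17,
    "Theoretical Application": 15,
    "Academic Writing": 20,
  })
);
```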
What the AI Actually Looks For
The four criteria are Critical Analysis, Research and Evidence, Theoretical Application, and Academic Writing. Each has five bands with detailed descriptors.
Here is what they measure.
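First, though, the shape of the rubric. Each criterion can be pictured structurally like this sketch; the type names are invented, and the real descriptors in the published source are far more detailed:

```typescript
// Illustrative shape only; the actual band descriptors are published in
// EssayHero's source code and are far more detailed than this sketch.
interface Band {
  range: [min: number, max: number]; // inclusive sub-range of 0-25
  descriptor: string;                // what work in this band looks like
}

interface CriterionRubric {
  name: string;
  bands: [Band, Band, Band, Band, Band]; // five bands per criterion
}
```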
Critical Analysis (0-25 Points)
This criterion assesses whether the student can evaluate research studies rather than merely describe them.
What the AI looks for:
- Methodological critique — sample characteristics, ecological validity, demand characteristics, operationalisation of variables
- Alternative explanations — considering rival hypotheses for reported findings
- Evidence weighting — judging strength before drawing conclusions
Top-band performance (21-25): Students who synthesise conflicting findings into a coherent narrative, show awareness of the replication crisis, and recognise limitations of particular research paradigms (e.g., over-reliance on WEIRD samples).
Bottom-band performance (0-5): Purely descriptive essays where studies are summarised but never evaluated.
Research and Evidence (0-25 Points)
This criterion evaluates how the student uses empirical sources.
What counts:
- Breadth — seminal studies alongside recent research
- Currency — acknowledging subsequent developments rather than citing only the classics
- Integration — sources woven into the argument, not listed in sequence
The AI recognises the evidence hierarchy:
- Meta-analyses and systematic reviews carry more weight than individual studies for general claims
- Well-controlled experiments carry more weight than case studies for causal claims
Top-band performance (21-25): Students who draw on multiple paradigms and sub-disciplines, cite key landmark studies alongside contemporary replications, and demonstrate genuine breadth.
Low-band performance (6-10): Thin evidence base, sources "name-dropped" without engagement, key studies absent.
Theoretical Application (0-25 Points)
This criterion examines whether the student uses psychological theories as analytical tools rather than as decoration.
Top-band performance (21-25): Theories are central to the essay's structure. The student explains core assumptions and mechanisms, applies them to interpret evidence, and compares competing frameworks systematically.
Satisfactory performance (11-15): Theories are described accurately but not effectively used to drive the analysis. They are included because they are expected, not because they illuminate the question.
Key distinction: The AI can detect the difference between a student who genuinely understands social identity theory and one who simply states "Tajfel and Turner proposed social identity theory" before moving on.
Academic Writing (0-25 Points)
This criterion covers the mechanics of scientific prose:
- Appropriate register
- Logical structure
- Paragraph coherence
- Clarity of expression
- Accurate use of technical terminology
What it rewards: Concise expression of complex ideas.
What it penalises: Unnecessary verbosity.
Citation Formatting Ignored
The AI accepts both British and American English conventions and does not comment on citation formatting. Whether a student uses APA 7th edition or another referencing style is irrelevant to the substantive assessment.
The Feedback Tone
One deliberate choice worth mentioning: the AI provides feedback in the voice of a collegial peer reviewer, not an authoritative examiner.
Examples:
- "Consider strengthening this section" rather than "You should have included"
- "The argument could be extended by" rather than "You failed to"
This was intentional. Students respond better to constructive suggestion than to top-down correction, and the feedback is more useful when it points toward specific improvements rather than merely cataloguing faults.
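For a flavour of how that choice is encoded, here is an illustrative paraphrase in code form. It sketches the kind of instruction the prompt carries, not the verbatim prompt text:

```typescript
// Illustrative paraphrase of a tone guideline; not EssayHero's actual prompt.
const TONE_GUIDELINE = `
Write feedback as a collegial peer reviewer, not an authoritative examiner.
Prefer "Consider strengthening..." over "You should have included...".
Prefer "The argument could be extended by..." over "You failed to...".
Pair every criticism with a specific, actionable improvement.
`;
```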
What We Can't Do
This is the section that matters most, so I'll be direct.
We Can't Verify Cited Studies
The AI has no access to PsycINFO, PubMed, or any journal database.
If a student cites "Smith et al. (2019)" and claims the study found a significant effect of cognitive load on decision-making, the AI cannot confirm that this study exists or that the findings were as described.
What it can do: Assess whether the citation is used effectively in the argument.
What it can't do: Verify whether the citation is accurate.
No Database Access
We instruct the AI to stay silent on citation accuracy rather than guess, because a wrong guess in either direction is worse than no comment at all.
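In prompt terms, the rule looks something like the following. Again, an illustrative paraphrase rather than the exact wording:

```typescript
// Illustrative paraphrase of the "stay silent" rule; not the actual prompt.
const CITATION_POLICY = `
You have no access to journal databases (PsycINFO, PubMed, etc.).
Do NOT claim a cited study exists, does not exist, or was misreported.
Assess only how each citation is used within the argument.
If citation accuracy is in doubt, say nothing rather than guess.
`;
```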
We Can't Verify Statistical Claims
If a student reports that a study found a large effect size (d = 0.8) or that results were significant at p < .001, the AI has no way to check this against the original paper.
What it can assess: Whether the student demonstrates awareness of statistical concepts and uses them appropriately in the argument.
What it can't do: Fact-check specific numbers.
This matters in a discipline where the precise strength of evidence is central to the argument.
We Can't Assess Research Design Quality
The AI can evaluate whether a student demonstrates methodological awareness — whether they discuss internal validity, confounding variables, or sampling limitations.
But it cannot independently judge whether the methodology of a cited study was actually sound. If a student claims a study was well-controlled, the AI takes that at face value.
Only someone who has read the original paper can verify this.
We Can't Evaluate Clinical Recommendations
For essays discussing interventions, therapeutic approaches, or policy recommendations, the AI assesses the quality of the argument:
- Are claims supported by evidence?
- Are limitations acknowledged?
- Are alternatives considered?
But it cannot verify whether a recommended intervention is genuinely evidence-based or clinically appropriate. That requires expertise that the AI does not possess.
We Can't Replace Summative Marking
If you give an essay 58 and EssayHero gives it 72, you are right.
The AI is applying generalised criteria without knowing:
- Your module's specific expectations
- Your institution's marking conventions
- The particular learning outcomes you've set
The scores are useful as a rough benchmark for students working between drafts, not as a predictor of the mark they'll receive.
This Is Formative, Not Summative
The tool is designed for the gap between drafts, not for final assessment.
It's useful in the same way that a study group is useful: it gives you another perspective, it catches structural weaknesses, and it forces you to articulate your argument clearly.
But it's not a marker. It's a practice partner.
What It Is Good For
After that list of limitations, you might reasonably ask: so what's the point?
Faster Iteration Cycles
The point is faster iteration. A student working on a critical review of attachment theory at midnight can:
- Submit a draft
- Get paragraph-by-paragraph feedback on critical evaluation and evidence integration
- Identify that their theoretical comparison is superficial
- Revise it
- Bring a better draft to your office hours
That revision cycle is where the learning happens, and most students don't get enough of it because feedback is scarce and slow.
Consistent Criteria
The criteria don't change based on workload or mood. If a student submits the same essay on a Monday and a Friday, the feedback will be consistent.
That's useful for building a student's understanding of what the criteria actually mean, even if the scores themselves are rough estimates.
Common issues it catches:
- Critical analysis is descriptive rather than evaluative
- Evidence base is too narrow
- Theoretical application is surface-level rather than analytical
These are all things a student can fix before submission.
Calibrated Expectations
The strictness modes let students calibrate their expectations:
- Lenient — benefit of the doubt, focuses on strengths
- Baseline — standard marking criteria
- Harsh — rigorous standards where 21-25 scores are reserved for genuinely excellent work
A student who scores well on harsh mode has reason to feel confident. One who struggles on lenient mode knows there's significant work to do.
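In configuration terms, the three modes amount to different marking dispositions. A sketch, with invented names:

```typescript
// Hypothetical sketch of the strictness modes; the names are assumptions
// for illustration, not EssayHero's actual configuration.
type Strictness = "lenient" | "baseline" | "harsh";

const STRICTNESS_GUIDANCE: Record<Strictness, string> = {
  lenient:
    "Give the benefit of the doubt on borderline bands; lead with strengths.",
  baseline:
    "Apply the band descriptors as written, leaning neither way.",
  harsh:
    "Reserve 21-25 in any criterion for genuinely excellent work; " +
    "place borderline cases in the lower band.",
};
```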
Not a Replacement
None of this replaces your feedback. But it might mean that when students do come to you, they've already caught the structural and argumentative weaknesses they could have found themselves.
Full Transparency
Open Source Criteria
The complete assessment criteria, the detailed band descriptors, and the full instructions that the AI receives are published in the source code.
EssayHero is open source under AGPL-3.0. You can read every line of the prompt configuration and decide for yourself whether the standards align with what you'd expect.
Privacy Commitment
Essays are processed and discarded. They are:
- Not stored by default
- Not used for model training
- Not accessible to anyone after the feedback is generated
If a student is logged in, they can choose to save their analysis to their own account, but that's opt-in.
Privacy matters, and in an academic context it matters more than usual.
Try It Yourself
EssayHero is free. No account required.
How to test it:
- Go to essayhero.app/?exam=uni-psychology
- Paste a sample essay
- Read the output
- Decide whether it's something worth sharing with your students
Feedback Welcome
If you think it could help your students iterate faster between drafts, share it.
If you think the criteria don't align with your expectations, or the feedback isn't useful, I'd genuinely like to hear why. Email hello@essayhero.app.
I built this to help students write better. If it can do that for your students, I'm glad. If not, I understand.
EssayHero is free, has no commercial aims, and is built by a Hong Kong teacher for students worldwide. Questions? Email hello@essayhero.app.
Related Articles
How EssayHero Marks Business Essays (And What It Can't Do)
A transparent look at how EssayHero assesses university business and management essays, what the criteria actually measure, and where AI falls short.
How EssayHero Marks Law Essays (And What It Can't Do)
A transparent look at how EssayHero assesses university law essays, what the criteria actually measure, and where AI falls short.
How EssayHero Marks HKDSE Paper 2 Essays (And Why You Should Know)
A transparent explanation for teachers and tutors of how EssayHero assesses HKDSE English Paper 2 writing, how scoring works, and where AI falls short.