We’re about to launch automarking powered by AI - here’s how we got here

There’s a lot of talk right now about AI as a tool to mark students’ work. We’ve been working on this for almost two years, and following in the footsteps of the ever-admirable No More Marking, we think we have a responsibility to be transparent about what we’re doing and why. This blog is therefore the first instalment in a series shedding light on how we’ve developed an approach to automarking that saves teachers time AND improves the insights they can derive from an assessment.

We started experimenting with Smartgrade for automarking in September 2023, working with our sister company Carousel Learning. The conclusion from our early research was: “this will be possible someday, but not yet”. So rather than rush something to market, we set out the principles we would have to satisfy before rolling out automarking to our customers.

The Smartgrade AI principles are:

  1. Automarking accuracy has to be better than teacher marking accuracy.
  2. No personally identifiable data should be shared with external Large Language Models (LLMs).
  3. Teachers must be able to moderate AI marking.
  4. The student experience must remain the same (or be improved).
  5. Anything that’s lost from the use of AI for marking (e.g. learning about pupil performance while marking) must be considered and, where possible, replaced by something better (e.g. feedback reports).  
  6. The solution must be cost-effective, given that schools' budgets are tight.

Our Journey

Our early research focused on accuracy. Although it wasn’t all smooth sailing, a major breakthrough came in May 2024 when Carousel participated in writing an academic paper called Can Large Language Models Make the Grade? (Henkel, Boxer, Hills and Roberts). To quote the abstract, “We found that GPT-4, with basic few-shot prompting performed well (Kappa, 0.70) and, importantly, very close to human-level performance (0.75).”
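For readers unfamiliar with the metric: Kappa (Cohen’s kappa) measures how far two markers agree beyond what chance alone would produce, where 1.0 is perfect agreement and 0.0 is agreement no better than chance. As a purely illustrative sketch - the marks below are invented for demonstration, not data from the paper - it can be computed like this:

```python
# Illustrative sketch: agreement between a teacher and a model on ten answers.
# The marks below are invented for demonstration, not data from the paper.
from sklearn.metrics import cohen_kappa_score

teacher_marks = [1, 0, 2, 1, 1, 0, 2, 2, 1, 0]  # marks awarded by a teacher
model_marks   = [1, 0, 2, 1, 0, 0, 2, 1, 1, 0]  # marks awarded by the model

# 1.0 = perfect agreement; 0.0 = agreement no better than chance.
print(f"Cohen's kappa: {cohen_kappa_score(teacher_marks, model_marks):.2f}")
```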

That was encouraging, but the performance wasn’t yet sufficient to pass the “better than teachers” test. So we kept researching, and by autumn 2024 we had a permanent team working on automarking and a steering group including the industry expert Owen Henkel.

Over the following months, we created four test datasets spanning multiple subjects across both primary and secondary phases, and built a research infrastructure that made it quick and easy to run those datasets through different models to test their performance and cost. Automarking was set up as a separate service (we call it Powermark), which Smartgrade accesses via a secure API without sharing any personally identifiable data with Powermark.
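To make that separation concrete, here is a minimal sketch of what an anonymised marking request might look like. The endpoint, field names and pseudonymisation helper below are our own illustrative assumptions for this example, not Powermark’s actual API:

```python
# A minimal sketch of calling a separate marking service without sharing PII.
# The URL, payload fields and pseudonymisation approach are illustrative
# assumptions, not Powermark's real API.
import hashlib
import requests

def pseudonym(student_id: str, salt: str) -> str:
    """Replace a real student identifier with an irreversible token."""
    return hashlib.sha256((salt + student_id).encode()).hexdigest()[:16]

def request_marking(student_id: str, question: str, mark_scheme: str, answer: str) -> dict:
    payload = {
        # The marking service sees only a pseudonymous reference, never a name or real ID.
        "student_ref": pseudonym(student_id, salt="per-school-secret"),
        "question": question,
        "mark_scheme": mark_scheme,
        "answer": answer,
    }
    response = requests.post("https://powermark.example/api/mark", json=payload, timeout=30)
    response.raise_for_status()
    return response.json()  # e.g. {"student_ref": "...", "mark": 2}
```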

The next big breakthrough came in March 2025. The performance of commercially available models had been improving steadily over the preceding months, but this was the point at which every one of our test datasets was marked more accurately by AI than by teachers. In summary, we found that teacher accuracy was anywhere between 75% and 94%, depending on the quality of the mark scheme, the complexity of the question and the stakes involved in the test (in our experience, teacher marking accuracy rises with the importance of the test being taken). Our best automarking models at this point were achieving accuracy rates of between 85% and 95%. Further testing and refinement of our process meant we began consistently achieving 94%+ accuracy where we have well-written questions and marking guidance. We were well on the way to building automarking into the product!

That said, we still had to address points 3, 4 and 5 in our list of principles. Point 3 (marking moderation) was easy enough: Smartgrade has long allowed teachers to mark open-ended questions from online tests, so the solution was simply to ensure that automarking results appear on our moderation screens for teachers to check and change if needed. Point 4 (student experience) also didn’t detain us for long, as we made an early decision to keep the student experience the same. We know other products give students prompts when they get answers wrong, but we don’t think this is for us - Smartgrade is typically used for more summative assessments, where it’s important to test what students know without nudging them towards the right answer.

Point 5 (replace anything that’s lost by not marking manually) took quite a lot of research and validation, but we’ve arrived at an approach we’re really proud of: detailed feedback reports for teachers and leaders at school and MAT level. These replace the unstructured insights you accumulate while marking with succinctly structured information for teachers on where the class or cohort struggled, allowing you to plan interventions accordingly.

So now you know how we’ve spent the last year getting ready for automarking powered by AI. In the next two blogs I’ll be talking about how schools and MATs can automark with Smartgrade, including a super-exciting product release!

If you can’t wait to hear more, get in touch with us to book a chat about our AI automarking.
