“Since the rise of large language models like ChatGPT there have been lots of anecdotal reports about students submitting AI-generated work as their exam assignments and getting good grades. So, we stress-tested our university’s examination system against AI cheating in a controlled experiment,” says Peter Scarfe, a researcher at the School of Psychology and Clinical Language Sciences at the University of Reading.
His team created over 30 fake psychology student accounts and used them to submit ChatGPT-4-produced answers to examination questions. The anecdotal reports were true—the AI use went largely undetected, and, on average, ChatGPT scored better than human students.
Rules of engagement
Scarfe’s team submitted AI-generated work in five undergraduate modules, covering classes needed during all three years of study for a bachelor’s degree in psychology. The assignments were either 200-word answers to short questions or more elaborate essays, roughly 1,500 words long. “The markers of the exams didn’t know about the experiment. In a way, participants in the study didn’t know they were participating in the study, but we’ve got necessary permissions to go ahead with that,” says Scarfe.
Shorter submissions were prepared simply by copy-pasting the examination questions into ChatGPT-4 along with a prompt to keep the answer under 160 words. The essays were solicited the same way, but the requested word count was increased to 2,000. The targets were offset from the actual requirements (200 and roughly 1,500 words) because ChatGPT-4 doesn’t follow word limits precisely; set this way, the limits got it to produce content close enough to the required lengths. “The idea was to submit those answers without any editing at all, apart from the essays, where we applied minimal formatting,” says Scarfe.
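The team worked through ChatGPT’s regular interface, but the same procedure is straightforward to reproduce programmatically. Below is a minimal sketch using OpenAI’s Python client; the model name, prompt wording, and example questions are illustrative assumptions, not the study’s verbatim materials.

    # Sketch of the study's prompting procedure via OpenAI's Python client.
    # The study used the ChatGPT interface directly; the model name, prompt
    # wording, and example questions here are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable

    def generate_answer(question: str, word_limit: int) -> str:
        """Ask the model for an exam answer kept under a word limit."""
        prompt = (
            f"Answer the following exam question in no more than "
            f"{word_limit} words:\n\n{question}"
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Short-answer questions: ask for under 160 words to land near 200.
    short_answer = generate_answer("Describe the Stroop effect.", 160)

    # Essays: ask for 2,000 words to land near the required 1,500.
    essay = generate_answer("Discuss the multistore model of memory.", 2000)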
Overall, Scarfe and his colleagues slipped 63 AI-generated submissions into the examination system. Even with no editing or other efforts to hide the AI usage, 94 percent of them went undetected, and nearly 84 percent earned better grades (roughly half a grade better) than a randomly selected group of real students who took the same exams.
“We did a series of debriefing meetings with people marking those exams, and they were quite surprised,” says Scarfe. Part of the reason was that most of the AI submissions that were detected weren’t flagged for being repetitive or robotic; they were flagged for being too good.
Which raises a question: What do we do about it?
AI-hunting software
“During this study we did a lot of research into techniques of detecting AI-generated content,” Scarfe says. One such tool is GPTZero; others include AI writing detection systems like the one made by Turnitin, a company specializing in tools for detecting plagiarism.
“The issue with such tools is that they usually perform well in a lab, but their performance drops significantly in the real world,” Scarfe explains. OpenAI’s own, since-retired AI text classifier, for instance, flagged AI-generated text as “likely AI-written” only 26 percent of the time, with a rather worrisome 9 percent false positive rate. Turnitin’s system, on the other hand, was advertised as detecting 97 percent of ChatGPT- and GPT-3-authored writing in the lab, with only one false positive in a hundred attempts. But according to Scarfe’s team, the released beta version of that system performed significantly worse.
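Numbers like these matter more than they might seem, because false positives land on students who did nothing wrong. A back-of-the-envelope calculation shows why; the 10 percent share of AI-generated submissions below is an assumption for illustration, not a figure from the study.

    # Precision of a detector with OpenAI's published rates: a 26 percent
    # detection rate and a 9 percent false positive rate. The 10 percent
    # share of AI-generated submissions is an illustrative assumption.
    total = 1000                        # submissions in a hypothetical cohort
    ai = total * 0.10                   # 100 AI-generated submissions (assumed)
    human = total - ai                  # 900 human-written submissions

    true_positives = 0.26 * ai          # 26 AI submissions correctly flagged
    false_positives = 0.09 * human      # 81 human submissions wrongly flagged

    precision = true_positives / (true_positives + false_positives)
    print(f"{precision:.0%} of flagged submissions are actually AI")  # ~24%

At rates like these, roughly three out of every four flagged submissions would belong to students who wrote their own work, which is why the false positive rate, not the detection rate, is the figure to watch.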