Australian universities’ hits and misses detecting ChatGPT in assignments

Australian students accused of cheating in assignments say the technology universities are using to detect AI-generated content is turning out false positives.

The sector has adopted solutions from two vendors pitching technology to identify the use of AI tools like ChatGPT: Turnitin and Cadmus Analytics, though Turnitin is the more widely used.

Students at the University of Melbourne, the University of Southern Queensland and the University of Adelaide have been investigated for academic misconduct after Turnitin detected AI-generated text in their assignments.

At least one student said that they had submitted authentic work; others could not be located or reached for comment.

Turnitin’s AI detection feature has also been adopted by UNSW Sydney, the University of Tasmania, the University of Queensland and Western Sydney University since it launched in April.

Turnitin’s executives say industry best practice is not to use its software as the sole evidentiary basis for an allegation.

“We’ve been stressing that we don’t believe that the AI report should be taken as a singular piece of evidence,” Turnitin’s APAC vice president James Thorley said.

Universities that have signed on to use Turnitin’s software to detect alleged instances of AI-generated text in assignments also say they follow this guidance.

But it’s still early days for AI detection, and some students say they’re bearing the brunt when that best practice isn’t followed.

‘My anxiety was through the roof’

A Master’s student in the University of Melbourne’s faculty of social sciences told iTnews that in late April she received an email from her subject coordinator saying Turnitin had detected ChatGPT in one of her assignments.

Speaking under a pseudonym, Rachel told iTnews that “my anxiety levels were through the roof.”

“I’d never been accused of misconduct before. I was told AI had been detected and I’d hear from an academic integrity committee later on about when my hearing would be.

“I told my subject coordinator that I hadn’t used AI and sent her articles I read about how there had been false positives, but she just said I’d get to talk about it at the hearing.

“When I got an email saying when the hearing would be, I was upset that it was more than a month away.

“I didn’t want it all hanging over my head.” 

The University of Melbourne told iTnews that it “requires that staff consider additional evidence before making an allegation of academic misconduct and do not rely solely on the results of the tool.”

Rachel said that when her subject coordinator agreed to speak to her two days before the hearing, no evidence was cited that she had used generative AI besides Turnitin’s detection. 

“I got the opportunity to provide [the coordinator] with the evidence I’d put together for my submission to the hearing,” Rachel said.

“I showed her screenshots of my browser history to show I’d done research and drafts of the assignment; then the matter was dropped before the hearing.” 

Turnitin’s Thorley said the case showed the weakness of using a detection as the sole basis for an allegation.

“The assessment design can allow for more sharing and showing students’ workings and how they got to that end product; that was kind of done retrospectively in that case.”

The work of a student at the University of Southern Queensland was similarly flagged by Turnitin’s AI detection feature, but exonerating evidence showed they had only used an AI tool to rephrase their assignment with better grammar.

University of Adelaide deputy vice-chancellor and vice-president (academic) Professor Jennie Shaw told iTnews that the university uses “Turnitin’s AI detection tool … to identify inappropriate use of generative AI.”

“The university has had an increase in reported suspected academic integrity breaches, including allegations of the inappropriate use of generative AI,” she said.

“These cases are currently going through our academic integrity process so it is not possible to put a figure on the number.”

While noting that some detections may turn out to be false positives, Shaw, like representatives of other universities, said Turnitin is “not used in isolation”.

“The AI score alone should not be grounds for an academic integrity report,” Shaw said.

“For instance, the lecturer or tutor may have noticed incorrect references appearing in a student’s work, which is a frequent error of generative AI.” 

The next probable word is…

Most Australian universities already use Turnitin to catch plagiarism by automatically cross-referencing students’ assignments with online materials and Turnitin’s repositories.

Rating the similarity of submitted assignments to existing content is uncontroversial because the universities can cite the source documents.

There is no similar consensus on whether machine learning has developed to the point where it can distinguish generative AI output from human writing with high confidence.

OpenAI, for example, recently shut down its own AI-generated text detection tool after concluding it produced too many false positives.

Turnitin uses classifiers trained on AI-generated content and authentic academic writing.

Turnitin’s Thorley said that OpenAI’s discontinued detection tool could not be compared to Turnitin’s model because the latter had been trained to specifically differentiate AI-generated text from academic writing.  

“In terms of questions around ‘how we can be so confident when OpenAI has shut down their detector,’ we’re very much focused on student writing and we believe that it’s possible to focus on student writing and to look at it in that context,” he said.

“If you’re trying to detect every type of generative AI, in any kind of format – which is potentially what OpenAI were trying to do – that’s a lot harder, and it’s increased in complexity incredibly.”

Turnitin’s methodology for evaluating whether a sentence was AI-generated notes that Large Language Models (LLMs) generate sequences of words in a “consistent and highly probable fashion.” 

This is because LLMs like OpenAI’s GPT-3, Google’s Bard and Meta’s LLaMA are trained on publicly available online content, “essentially taking that large amount of text and generating sequences of words based on picking the next highly probable words.”

In contrast, “human writing … tends to be inconsistent and idiosyncratic, resulting in a low probability of picking the next word the human will use in the sequence.”

Trained on academic writing produced by both humans and LLMs, Turnitin’s classifiers flag sentences whose word sequences an LLM would have a high probability of generating.

Thorley said, “the way that the AI detection works is to say that it’s possible to say that a machine writes ‘this way’ and a human writes ‘this way’.”
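
To illustrate the principle – and this is a generic sketch, not Turnitin’s proprietary classifier – the “next probable word” signal can be approximated by measuring how predictable a passage is to an off-the-shelf language model such as GPT-2, here via the Hugging Face transformers library:

    # Generic illustration of the "next probable word" signal.
    # GPT-2 is a stand-in model; Turnitin's actual classifier is
    # proprietary and trained specifically on academic writing.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def perplexity(text: str) -> float:
        """Average per-token perplexity of text under the model."""
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # Passing the inputs as labels makes the model report
            # cross-entropy loss: how surprising each token was.
            loss = model(enc.input_ids, labels=enc.input_ids).loss
        return float(torch.exp(loss))

    # Machine-generated text tends to score low (highly predictable);
    # idiosyncratic human writing tends to score higher.
    print(perplexity("The cat sat on the mat."))

A low score is a probabilistic hint, not proof; fluent human writing can also be highly predictable.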

Turnitin’s error rate

According to Turnitin, its false positive rate for an individual sentence is four percent and its false positive rate for an entire document is one percent, provided that at least 20 percent of the document is AI-generated.

Turnitin assesses every sentence of a document with a binary score of whether it was AI-generated; it then aggregates the sentences’ scores to a percentage value of AI-generated content in the document.  

As a guardrail against false positives, if the model scores less than 20 percent of a document’s sentences as AI-generated, it reserves judgement because its confidence is too low.

“Part of the focus that we’ve put on minimising false positives is to say that if we’re in a grey area we’re going to err well on the side of caution, and to only make firm statements where we’re very, very confident,” Thorley said.
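
As a rough sketch of that aggregation logic: the per-sentence classifier below is a placeholder passed in as a function, and everything except the 20 percent guardrail is an assumption for illustration.

    # Toy aggregation: per-sentence binary flags rolled up into a
    # document-level percentage, with judgement reserved below the
    # 20 percent guardrail Turnitin describes. The classifier is a
    # placeholder; it is not Turnitin's model.
    from typing import Callable, Optional

    def document_ai_score(
        sentences: list[str],
        classify: Callable[[str], bool],
    ) -> Optional[float]:
        """Percentage of sentences flagged as AI-generated, or None
        (judgement reserved) when the share falls below 20 percent."""
        flags = [classify(s) for s in sentences]
        share = 100 * sum(flags) / len(flags)
        return share if share >= 20 else None

    # Demo with a dummy classifier that flags nothing: the function
    # returns None, i.e. the detector declines to make a call.
    print(document_ai_score(["First sentence.", "Second one."],
                            classify=lambda s: False))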

In response to a question about the mathematics behind how Turnitin’s testing arrived at these false positive rates – four percent for sentences and one percent for entire documents – Thorley said: “We are going to be releasing a white paper soon going into a lot more depth on a technical level.”

After Washington Post journalist Geoffrey A. Fowler interviewed students in the US who accused Turnitin of falsely flagging their work, he ran 16 samples of real, AI-fabricated and mixed-source essays through the model and found “it got over half of them at least partly wrong.”

Professor Toby Walsh, chief scientist at UNSW’s AI Institute, told iTnews that Turnitin’s claims about the model’s false positive rate “are problematic.”

“Turnitin works by comparing the word probabilities with the probabilities generated by LLMs,” Walsh said.

“This only gives you a probability. It does not give you certainty.” 

While Turnitin has said its model is trained on “authentic academic writing across geographies,” Walsh said there is no evidence it has been trained on a wide enough variety of data sources to be applied at universities around the world without producing biased outcomes.

“We have evidence that tools work in biased ways. We don’t know if our university’s data is drawn from a similar or different distribution.”

Walsh said that Turnitin’s model is also likely to have been trained disproportionately on assignments written by students whose first language is English.

“The false positive rates are likely to be much higher on students who are not writing in their first language.”

Support from Cadmus-tracked workings

While students normally write assignments in Word, Google Docs or another preferred word processor, some subject coordinators compel them to use Cadmus, which monitors how they produce their assignments in real time.

Cadmus provides assessors with real-time data about how students complete their assignments while using its platform. 

Although traditionally used to catch contract cheating, Cadmus said in January that it can help “educators to detect the use of AI in assessments” by spotting “techniques that are uniquely attributed to students using ChatGPT.” 

The Cadmus Workspace can, among its features, flag habits that are associated with “using ChatGPT”, such as “students who paste their assessment content into the Cadmus workspace and make slight edits in an attempt to create an authentic submission.”

The hours spent on an assignment, the origin of copy-pasted text, keyboard patterns and student location data, such as the address of a device’s internet connection, are among the data Cadmus collects.
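
Cadmus has not published its detection rules, but a toy version of the “paste then lightly edit” signal it describes might look like the following; the event fields and thresholds here are hypothetical assumptions, not Cadmus’s actual schema.

    # Hypothetical illustration of a "paste and slightly edit" flag.
    # Field names and thresholds are invented for this sketch; they
    # are not Cadmus's real schema or rules.
    from dataclasses import dataclass

    @dataclass
    class EditEvent:
        kind: str    # "paste" or "keystroke"
        chars: int   # characters added by the event

    def looks_like_paste_and_tweak(events: list[EditEvent]) -> bool:
        """True when pasted text dominates the document and typing
        afterwards amounts to only slight edits."""
        pasted = sum(e.chars for e in events if e.kind == "paste")
        typed = sum(e.chars for e in events if e.kind == "keystroke")
        total = pasted + typed
        return total > 0 and pasted / total > 0.8 and typed < 200

    events = [EditEvent("paste", 4000), EditEvent("keystroke", 50)]
    print(looks_like_paste_and_tweak(events))  # True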

The University of Southern Queensland used Cadmus to catch a nursing student using paraphrasing software earlier this year. 

Senior lecturer Dr Liz Ryan said in an online presentation about Cadmus that “student paste logs revealed that all their assessment work had been pasted in from Quillbot, which is an AI [paraphrasing] tool that can only be used with explicit approval from the course coordinator.”

“This student was then referred to the academic integrity team.”  

The University of Southern Queensland did not reply to a request for comment from iTnews about whether the student was found to have breached its academic integrity policies.

Other universities signed up to Cadmus don’t use it for AI detection at all.

A spokesperson for the University of Adelaide said it “does not expressly use Cadmus to identify AI-generated text. 

“Instead, staff are encouraged to use this platform in order to see students’ progressive drafts and development of ideas.”

A University of Melbourne spokesperson also said that “some of the University’s subjects are using Cadmus for their assessment tasks,” but not explicitly to detect AI. 

“In Cadmus, the assignment is digitally ‘observed’ as the entire assignment writing process is conducted within Cadmus which teaching staff can ‘observe’ and review as needed.”

Walsh told iTnews he also did not think Cadmus was the best solution for universities to protect academic integrity against generative AI.

“The Cadmus methodology will make it harder but not impossible to cheat.”

“There’s a very simple (and cheap) way to ensure people are not cheating: put them in exam conditions; a closed room with no access to technology.”

Cadmus did not reply to iTnews’ request for comment.


