Thursday, November 02, 2006

Plagiarism detection with Turnitin - countering the woo

Dean Dad recently posted about hunting for plagiarism with Turnitin. Plagiarism detection software (like Turnitin) scans electronically submitted copies of student papers against databases of websites to check for material that has been copied verbatim. Turnitin also keeps a record of every paper ever submitted to it, and scans newly submitted papers against these archived papers.

I regularly check my students' papers for plagiarism, and electronic plagiarism detection programs (including Turnitin and similar websites) are an essential tool in my fight against plagiarism. While no tool is perfect (I hand-check all results, and don't rely on plagiarism checking programs exclusively), plagiarism detection programs allow me to quickly check for instances of word-for-word plagiarism in all of my student papers.

Many of Dean Dad's commenters came forward to question the ethics and legality of using sites like Turnitin; one commenter even linked to "Guilty Until Proven Innocent: The Well-Known Secret about Turnitin," an article that purports to reveal the evils of Turnitin. The article is filled with so much misinformation that I just had to share it here.

Here's the beginning:
Millions of honest, hard-working students attend the world's public and private schools. Every day, students write countless essays, reports, and term papers in perfect compliance with their schools' code of ethics and standard guidelines for proper citation. Regardless, these honest—yet inexperienced and naive—students are intimidated and coerced by professors to submit their papers, or intellectual property (IP), to third-party, for-profit ventures (e.g., Turnitin) without their willing consent.

We can understand the monetary motives behind the questionable tactics of a for-profit corporation, but what we do not understand is how or why professors have forced so many innocent students to relinquish their rights.
There's published data available (see this post) showing that nearly half of all college students report that they've plagiarized papers. So, while there are millions of students who don't plagiarize, there are also millions of students who do plagiarize, and thus arguing that only a tiny fraction of students plagiarize is simply not valid.

The paper's introduction also alludes to another common element of anti-plagiarism-scanning arguments: instructors are somehow prevented by copyright law from using plagiarism detecting tools unless students consent to having their papers checked for plagiarism. This argument is the dream of many a plagiarizing student, but is completely without merit. Searching for academic dishonesty is a perfectly acceptable practice that instructors are allowed (in fact, required) to engage in as a part of grading any given assignment. Turnitin has (obviously) thought about the copyright issues involved in student papers being submitted to their service, and has a document (PDF) detailing the legal implications of their company's procedures. Turnitin's legal document specifically deals with the question of whether instructors can submit student papers to be checked for plagiarism:
If copyright is present in a particular student’s work, the submission of the work to a teacher as part of the student’s coursework necessarily carries with it the expectation that the teacher will use the work in certain ways, consistent with the goal of evaluating and grading the student’s work. Specifically, by submitting the work, the student implicitly agrees that the teacher may comment on, criticize and otherwise evaluate the academic quality of the work, an evaluation that should include consideration of both the work’s content and integrity.


The question of whether the scope of such collateral rights [of evaluation licenses specified by universities] extends to electronic submission of a written work to a computer database for purposes of review, “fingerprinting”, and/or archiving has not been tested in the courts, nor is it addressed explicitly by statute. However, legal precedent in other contexts strongly suggests that student submission of a work for grading provides the teacher with the right to utilize available technologies and tools to accomplish the grading task. Such a right necessarily encompasses the ability to transfer the work to other media (e.g., by scanning the work), where such transfer is required for the teacher’s personal use of a particular grading tool. See, e.g., Foad Consulting Group, supra at 828-831 (copying, distribution and modification of a work to make it usable for the intended purpose necessarily a part of the implied license to use the work); as well as Recording Industries Ass’n of Am. V. Diamond Multimedia Sys., Inc., 180 F.3d 1072, 1079 (9th Cir. 1999)(transfer of a work, such as music, into another media, such as an MP3 file, for personal use of the person making the transfer is a fair use), and Sony Corp.v. Universal City Studios, Inc., 464 U.S. 417, 449-50 (1984)(copying of broadcast productions onto videotape for the later viewing using a VCR is a fair use); compare, A&M Records, Inc., et al. v. Napster, Inc., 239 F.3d 1004, 1019 (9th Cir. 2001)(copying of a work for personal use not fair when coupled to simultaneous distribution of the entire work to the general public). Hence, by itself, teacher submission of a student work to Turnitin is within the scope of the evaluation license provided by the student to the teacher on submission of the work for grading. The implied license may not extend to other aspects of the TURNITIN system, such as archiving, however, such aspects are allowable as “fair uses” of the copyrighted material.
But, of course, reasoning like that doesn't sit well with this (unnamed) author, and thus the author goes off on another tangent:
At no time have the overwhelming majority of students given their professors any reason to believe that they are untrustworthy, corrupt, immoral cheaters, but do thousands of professors treat honest students like "guilty until proven innocent" criminals nonetheless? You bet!
How the author goes from the observation that most students don't plagiarize to the conclusion that professors who check for plagiarism are acting as though students are "'guilty until proven innocent' criminals" is beyond me. Let's make this clear: searching for academic dishonesty in a paper is not the same as accusing students of being academically dishonest; it's just a part of the regular grading process. Professors who look for cheating as they proctor tests aren't implying that everyone is cheating, and police officers who use radar to check the speed of your car aren't accusing you of speeding (at least not until they see that you're going 20 over the limit).

But, of course, we eventually find out from the author that it all boils down to money.
In April of 2004, Wired News revealed Turnitin's 2003 revenue to be $10,000,000, "a figure that Barrie does not dispute, but regrets having put on record." Turnitin's 2004–2006 revenue is surely much higher, considering Turnitin's expansion in the last three years.


Despite raking in what probably amounts to at least $50,000,000 since 1998, we are unaware of Turnitin paying a single penny in royalties to any one of the countless, unwilling students around the world whose intellectual property Turnitin has copied, stored, and used to create for-profit, derivative works-based service.
Turnitin is a private company that is making money; tell me something I didn't already know (I've used their product, and have investigated getting a site license for our campus). However, the larger issue that the author brings up here is a complex one - Turnitin is not only scanning student papers when they are submitted, but is also keeping a record of student papers in their databases (something that, I should note, not all plagiarism checking programs do). Even so, the issue is less clear-cut than the author makes it sound: while Turnitin uses information derived from student papers on a regular basis, they never publish or share that information with anyone, and don't directly make money off of any individual paper. In fact, when they find a similar paper that was submitted previously, all they report is that a paper written by a student in professor X's class at Y university contains a certain amount of content that is identical in wording to that of the current paper. If the instructor wants to get more information, they must contact the professor listed.

Turnitin specifically deals with the issue of archiving student works by first acknowledging that they're a for-profit venture (and thus can't qualify as a non-profit educational fair use), and then saying this (PDF):
Commercial use of a work may still be “fair use” under U.S. Copyright Law (17 U.S.C. §107), especially when less than the entire work is being used, and/or the use does not “materially impair the marketability of the work which is copied.” Harper & Row Publishers, Inc. v. Nation Enters., 471 U.S. 539, 566-67 (1985). Here, the actual work is used by the TURNITIN system only as a reference, for purposes of creating a separate work, the digital “fingerprint”. If there is a match between a submitted work and fingerprinted portions of an archived student work, only that matching text is highlighted in the originality report.

The identification of a textual match between documents relays a fact, which is not protected from disclosure by the Copyright laws. 17 U.S.C. § 102(b). Where there is no way to express the fact in question except by copying of the underlying material, the fact and the portion of the material representing it are said to have “merged”, excluding the material itself from the ambit of copyright protection. Feist Publications, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340, 349 (1991); Harper & Row Publishers, supra at 556; Veeck v. Southern Building Code Congress Int’l, Inc., 293 F.3d 791 (9th Cir., June 7, 2002). Because one cannot identify a passage as having been copied without matching it to the material that was putatively copied from, display of the matching material is not prohibited by copyright.

No other portions of the archived work are displayed, used, published, distributed or further copied without prior author consent. Compare, A&M Records, et al. v. Napster, at 1015 and 1019 (distribution of a copied work to the public without the copyright holder’s consent implies that the copyright in the copied material may have been infringed). As such, the archival does not publish the work as a whole, or otherwise impinge on the author’s ability to exploit the work commercially. Because the “primary objective of copyright is not to reward the labor of authors but ‘[to] promote the Progress of Science and the useful Arts’” (Veeck, supra as reported at 2002 U.S. App.LEXIS 10963, *25), the minimal use of a student’s work to ferret out plagiarism in others works, without making the work itself available to the public, is a fair use that does not infringe any copyright which may be present in the archived work.
Turnitin also discusses the FERPA (privacy) issues related to their receipt and storage of student work:
Since the paper is not part of the education record and no personally identifiable information is released when a paper is archived, FERPA does not restrict this aspect. If there is a match to another submitted student paper, only the e-mail address and name of the instructor whose student submitted the first paper will be given to the instructor of the matching paper, along with a paper ID #. Only the instructor of the originally submitted paper would be able to use this ID# to determine the student’s personal information, and they already have access to such information.
So, while Turnitin's analysis is likely somewhat biased (e.g., this FERPA letter suggests that term papers actually are a part of the "education record"), suffice it to say that the author of the website is taking an overly simplistic view of the copyright and privacy issues involved. It is almost certainly legally permissible for faculty to use software to check student work for academic dishonesty, and (at least Turnitin believes that) it's also probably legal for Turnitin to keep archived copies of papers submitted to them[1]. At the very least, it's clear that one can't go around screaming that Turnitin is obviously violating copyright and privacy laws.

But, of course, privacy rights aren't the only thing the author complains about:
In addition to unfairly violating students' intellectual property rights and costing schools a fortune, Turnitin has become extremely ineffective. The small percentage of students who cheat tend to do so in very intelligent ways. They know about Turnitin, and it doesn't "scare" or dissuade them any longer. What professors may not understand is that Turnitin tends to catch only the most blatantly obvious, word-for-word plagiarism.
The entire point of Turnitin is to catch word-for-word plagiarism; it's hardly fair to accuse it of being useless because it doesn't do more than it was designed for. We've also returned here to the "only a small percentage of students cheat" argument, which is clearly contradicted by data (see above).

I can also report, from personal experience, that students regularly turn in papers with copious amounts of word-for-word plagiarism, even when they know ahead of time that their papers will be scanned electronically. I see it every semester. So no, students are not all cheating in "very intelligent" ways; there are (and probably always will be) a number of not-so-intelligent cheaters.

Oh, and lest we've forgotten that Turnitin doesn't catch some types of plagiarism, the author makes the point again:
The program is practically useless if a student uses a thesaurus to change every other word in a paper to a new word of equivalent meaning. Turnitin is also completely impotent in detecting that a student paid a ghostwriter to compose a paper from scratch.
And a radar gun only catches speeders, not red light runners. So?

Has this author ever actually tried to use a thesaurus to change every other word in a sentence? It's amazingly difficult (e.g., "Additionally, a speed trap only apprehends speeders, anti red-beacon runners. And?" or "Additionally, a radio detection and ranging gun solely catches zoomers, not traffic signal runners. So?").
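
To make the thesaurus point concrete, here's a rough sketch of how word-for-word matching might work. To be clear, this is only an illustration and not Turnitin's actual algorithm (I don't know its internals); the function names `shingles` and `overlap` and the three-word window are my own assumptions.

```python
import re


def shingles(text, n=3):
    """Return the set of n-word sequences ("shingles") found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(submitted, source, n=3):
    """Fraction of the submitted text's shingles that also occur in source."""
    sub = shingles(submitted, n)
    if not sub:
        return 0.0
    return len(sub & shingles(source, n)) / len(sub)


source = "And a radar gun only catches speeders, not red light runners."
copied = "And a radar gun only catches speeders, not red light runners."
reworded = ("Additionally, a speed trap only apprehends speeders, "
            "anti red-beacon runners.")

print(overlap(copied, source))    # 1.0
print(overlap(reworded, source))  # 0.0
```

A matcher like this would indeed miss the thesaurus version, but look at what the student had to turn in to get there.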

The article then lists a whole bunch of negatives about Turnitin:
* emails word-for-word copies of students' papers to third parties, upon request
* professors intimidating, extorting, and coercing students to cede their rights
* renders students "guilty until proven innocent"
* professors becoming too dependent on machines to do their jobs
* invades students' privacy
* teaches students that rules do not apply to big corporations
* violates students' intellectual property rights
* creates an atmosphere of distrust
* fosters a negative learning environment
* teaches students that it is acceptable to take advantage of "the little guy"
* makes schools vulnerable to lawsuits
Oh boy. Let's take these one at a time.
  • Turnitin never automatically shares a full student paper with anyone else (unless the entire paper has been copied by someone, in which case Turnitin reports that fact). Instructors are allowed to send the full text of a student's paper to other instructors, but then it's the instructor who might be violating privacy and copyright law, not Turnitin.
  • No student rights are being lost by having their papers scanned for plagiarism (see above).
  • No, students are not "guilty until proven innocent" when instructors scan papers for plagiarism. See above.
  • Oh, yeah. Now that I scan for plagiarism I never read any of my students' papers anymore; heck, Turnitin even spits out a grade for me too. Great service. Worth every penny. (hint: that was sarcasm)
  • How does Turnitin invade student privacy? The company never reveals the name (or other personally identifying information) of the student who wrote the paper to anyone other than the original instructor. Additionally, remember that the student voluntarily submitted their paper to the instructor to be graded; if the student didn't want the content of that paper read by anyone else (i.e., their instructor), they never should have submitted it as a course assignment.
  • Somehow I don't see how a company helping instructors catch plagiarizing students will be the "evil corporation" story of the 21st century.
  • I think we covered this already.
  • Not any more than watching students taking a test to ensure that they're not cheating creates a negative environment. Actually, if anything, letting your students know you're actively hunting for plagiarism will build a more positive class environment, as students know that cheaters won't be able to get a free ride in your class.
  • See above.
  • Didn't we cover this already?
  • Thousands of schools use Turnitin; I guarantee you that every one of those has a team of lawyers specially trained in how to avoid lawsuits. I suspect that Turnitin wouldn't have been able to stay in business for the last 5 years if there were a major risk of schools getting sued[1].
But then we get to the kicker - the article reports that there's really no need for Turnitin, as there's an easy way for instructors to check for plagiarism:
Another way for schools to avoid Turnitin's huge price tag is to force professors to become familiar with each student's writing style (foreign concept, we know). It seems that many apathetic professors have decided to lighten their workloads by allowing machines to "teach" and police our children. Perhaps this is one of the reasons why the American educational system is in serious trouble, and students in other industrialized nations grossly outperform ours? (Sorry, frustration often leads me to digress.)

Professors should make students write an in-class essay before assigning any take-home writing assignments. That will enable professors to become familiar with each student's writing capabilities and style. It will also provide a sure-fire template against which professors may compare all subsequent works completed outside of the classroom.
OK, I teach biology. My job is to train students in scientific thinking and help them learn the basics of organismal biology. I am not going to be able to take the time to have each student write an in-class essay to demonstrate their writing style, and then compare their work to said essay each time they submit it. Sorry, not going to happen.

And even if I were able to find the time to get writing samples for every student, I would never get good, solid proof of plagiarism from those writing samples. Based on a writing sample all I could say is "Johnny, I don't think you've written this," to which Johnny would reply, "Yes, I did write that," and then I'd be out of evidence. It's rather difficult to justify giving Johnny an F based on that exchange. However, after using Turnitin I can show Johnny the printouts of five different websites and say "Look, Johnny, your paper contains content copied from these five websites. I've even gone to the trouble of underlining the copied sentences; funny, but I've underlined every sentence in your paper." Johnny doesn't have much of a reply, and I've got the hard evidence I need to justify giving Johnny an F.

The article goes on to talk about how hypocritical it is to use the service, and gives mediocre explanations of the fair use doctrine. But it gets interesting again when it gives a few scenarios that demonstrate how much harm Turnitin can do; I'll just include the first for your reading pleasure.
Scenario #1

A college student named Mary decides to apply for an internship at a major newspaper. She takes the best research paper that she has ever written, and presents it to the executive editor as a writing sample. The editor is so impressed with her writing that he stops just short of hiring her on the spot! Mary is ecstatic. The editor tells her that he just needs to take care of a few formalities, and he'll call her tomorrow. The next day, Mary enthusiastically answers the phone, only to have her heart torn from her chest. The editor informs Mary that "her" research paper has proven to be plagiarized by Turnitin, chastising, "Plagiarizers don't have much of a future in journalism." He hangs up.
It was a nice story up until the ending, which the author gets wrong. Actually, what happened is that Mary's potential new boss (who probably never scanned Mary's paper in the first place, but we'll ignore that) sees that Mary's paper was reported as sharing 100% of its content with a paper written at Mary's old university; on doing a little research (or asking Mary about this), the employer discovers that it's Mary's own paper. Mary is happily employed the next day.

The article continues on (and on, and on), hitting the points of fair use, FERPA, how to stop Turnitin from indexing your website (Turnitin steals lots of your bandwidth, didn't you know?), and the dangers of hackers and lawsuits. Go read it all, if you want. It ends with this bit of legal advice:
If you are a student who is concerned with Turnitin violating your intellectual property rights, you can place the following copyright notice at the bottom of your paper to prevent your school from submitting your writing/ideas to Turnitin. If your school ignores your copyright notice and does submit your property to Turnitin or any other service/program/database, you can sue the service and/or your school for up to $150,000 per incident, as allowed by the Digital Millennium Copyright Act (Cornell Law School).

Copyright 2006 [STUDENT NAME]. All Rights Reserved. Aside from my professor's sole, personal review as part of his/her private, single-human, software-free grading process, neither my professor nor my academic institution may otherwise transfer, distribute, reproduce, publicly/privately perform, publicly/privately claim, publicly/privately display, or create derivative works (including "digital fingerprints") of my copyrighted document, or intellectual property. The same restrictions apply to Turnitin and all similar services. Neither my professor nor my academic institution may submit my copyrighted document, in whole or in part, to be transformed, manipulated, altered, or otherwise used by or stored in a physical or electronic database or retrieval system without my personal, explicit, voluntary, uncoerced, written permission. Regardless of supposed intent (e.g., "to create a digital fingerprint"), no part of my copyrighted document may be temporarily or permanently transferred, by any party, to Turnitin or any other service, program, database, or system for analysis, comparison, storage, or any other purpose whatsoever. Violators will be monetarily punished to the fullest extent allowed by the DMCA (Digital Millennium Copyright Act) and/or international law.

Students can Set a Trap for Violators

The first step in preventing your school from submitting your intellectual property to Turnitin is to place the aforementioned copyright notice at the bottom of your paper. The second step is to make sure that if your professor ignores your copyright notice, you have the necessary evidence to make your entire school district legally regret summarily dismissing your rights.

At least 24 hours prior to submitting the paper to your professor (with copyright notice included), send a copy of the paper through the postal system, addressed to your mother and/or father, at their address. Seal the envelope extremely well. Tell your parents to expect the envelope, but make sure that they do NOT open the envelope when it arrives! Store the envelope somewhere safe.

If you later find out that your professor submitted the paper to TurnItIn, your postmarked (dated) envelope—containing an exact copy of the copyrighted document that you submitted to your professor—will serve as evidence that you clearly warned your professor/school in advance that they may not transfer or grant third-party license to your work. They will have no defense, and you will almost certainly be awarded monetary compensation if you file a civil suit.
Can we say "wishful thinking?" Good.

Even ignoring that the legal advice is ridiculous[1], the first thing I'd do with a paper that contained such a notice would be to go over every sentence with a fine-tooth comb hunting for plagiarism.

[1] Note that I am not a lawyer, and I'm not trying to over-simplify the discussion of the legal principles involved; they're clearly complex. See this post (and its comments) for a lawyer's look at the situation.
