24 January 2008

Medline copycats found out

Nature News (published online 23 January 2008 | doi:10.1038/news.2008.520) has revealed that as many as 200,000 of the 17 million articles in the Medline database might be duplicates, either plagiarized or republished by the same author in different journals. The full article is available here (Errami M, Garner H. A tale of two citations. Nature 2008 Jan 24;451:397-399.)

Analysis with text-matching software produced estimates that 0.04% of a random sample of 62,000 articles might be plagiarized, and 1.35% might be duplicates with the same author. Employing a clever shortcut, the researchers examined more than 7 million Medline abstracts with listed related articles, running their algorithm against just the original abstract and its "most related" abstract. This method revealed 70,000 potential duplicates, which have been loaded onto a publicly accessible database called Déjà vu. It is likely that tools such as Déjà vu and text-comparison software will act as future deterrents to plagiarism.
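The shortcut described above, scoring each abstract only against its "most related" abstract, can be sketched in a few lines. This is a minimal illustration, not the researchers' actual algorithm: the word-shingle Jaccard measure, the function names, and the 0.5 threshold are all assumptions chosen for the example.

```python
import re

def shingles(text, n=3):
    """Return the set of n-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity of the shingle sets of two texts, between 0.0 and 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def flag_candidates(pairs, threshold=0.5):
    """Score each (id, abstract, most_related_abstract) triple once.

    Comparing only against the single most related abstract keeps the work
    linear in the number of abstracts instead of comparing every pair.
    The threshold is a hypothetical cutoff, not a published value.
    """
    flagged = []
    for pair_id, abstract, related in pairs:
        score = jaccard(abstract, related)
        if score >= threshold:
            flagged.append((pair_id, score))
    return flagged
```

Anything flagged this way is only a candidate duplicate; as with the entries in Déjà vu, a person would still need to inspect each pair.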

Publishers are already taking part in tests of anti-plagiarism tools. One of these, CrossCheck, compares new manuscripts against already published materials in its database. CrossCheck searches for similar or identical passages, and when it detects questionable text, it highlights those sections for an editor to scrutinize.
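To give a feel for what highlighting overlapping sections involves, here is a small sketch using Python's standard difflib module. It is not how CrossCheck works internally (the post does not say); the function name and the five-word minimum are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def shared_passages(new_text, published_text, min_words=5):
    """Return runs of at least min_words consecutive words found in both texts.

    Words are compared exactly (case and punctuation included), so this is
    a deliberately crude matcher meant only to illustrate the idea.
    """
    a = new_text.split()
    b = published_text.split()
    # autojunk=False stops difflib from discarding very frequent words.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            passages.append(" ".join(a[block.a:block.a + block.size]))
    return passages
```

An editor-facing tool would then mark the returned passages in the manuscript, leaving the judgment about whether the overlap is legitimate to a human reader.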

iParadigms, in Oakland, California, is working with IEEE and other publishers to develop CrossCheck. This is the same company that developed Turnitin.com, an online resource that helps university educators detect plagiarism in student papers. Turnitin has been very successful as a deterrent, although not without stirring up controversy. CrossCheck is expected to be its equivalent for scholarly publishers.

As Oliver Obst (from whom this post was plagiarized) comments in his blog medinfo, this kind of text mining would be generally easier and more useful if all articles were Open Access. It would then be possible to compare more than abstracts, which are brief and not as textually significant; plagiarism could also be determined at the syntax level of an article's full text, truly an alarming scenario for the cribber.