Web Intelligence and Agent Systems: An International Journal 0 (2009) 1–0
IOS Press
SimPaD: A Word-Similarity, Sentence-Based Plagiarism Detection Tool on Web Documents
Maria Soledad Pera and Yiu-Kai Ng*
3361 TMCB, Computer Science Department
Brigham Young University
Provo, Utah, U.S.A.
E-mail: {ng, mpera}@cs.byu.edu
* Corresponding author. E-mail: [email protected].
Abstract. Plagiarism, the unauthorized use of copyrighted documents and materials, is an unethical practice that deprives legal owners of the economic incentives their work should earn. Unfortunately, the problem is getting worse, since the increasing number of on-line publications and their easy access on the Web make it simple to locate and paraphrase information. To address this problem, we propose a novel plagiarism-detection method, called SimPaD, which (i) establishes the degree of resemblance between any two documents D1 and D2 based on their sentence-to-sentence similarity, computed using pre-defined word-correlation factors, and (ii) generates a graphical view of sentences that are similar (or the same) in D1 and D2. Experimental results verify that SimPaD is highly accurate in detecting (non-)plagiarized documents and outperforms existing plagiarism-detection approaches.
Keywords: Plagiarism, word-correlation factor, sentence similarity, word manipulation, graphical view
1. Introduction
Plagiarism, the act of presenting others' words or ideas as one's own, is a pervasive problem, especially in the academic world, and it is getting worse as the volume of on-line publications has grown over the past decades. Common plagiarism strategies either (i) simply duplicate material from a (non-)electronic source, or (ii) copy material from a given source and intentionally modify its wording or sentence structure without altering its content [17]. The latter is more difficult to identify than the former due to its complexity.
In order to identify plagiarized documents, a number of new approaches have recently been proposed that (i) integrate information retrieval techniques into previous plagiarism-detection methods and (ii) analyze potential plagiarized documents on the Web, which are easier to process than hard copies. Popular plagiarism-detection approaches (i) compute the overlap among n-grams in any two documents [23], (ii) analyze the writing styles, i.e., the syntactical and grammatical styles, of the authors of various documents [33], (iii) identify words substituted by their synonyms and split/merged sentences [34], and (iv) detect plagiarized documents based on their fingerprints [20]. Besides using synonyms, hypernyms, and hyponyms, the majority of these approaches rely on exact word/phrase matching to find the portion of a document that is plagiarized. Relying on exact word/phrase matching for plagiarism detection, however, is insufficient and inaccurate, since it is a common practice to paraphrase words, i.e., replace them with similar ones, when plagiarizing a source document. To overcome the shortcomings of existing plagiarism-detection methods, we propose a novel Similarity-based Plagiarism Detection approach, denoted SimPaD, which considers not only (similar-)word substitution, addition, and deletion, but also sentence splitting and merging, based on word-similarity measures.
Given any two documents D1 and D2, SimPaD conducts sentence-to-sentence comparisons to determine the degree of resemblance between D1 and D2 using the pre-computed word-correlation factors defined in [15], which can be applied to detect semantically-related, as well as exact, words in different sentences. Since SimPaD is based on the word-correlation factors, determining the degree of resemblance between any two given (words/sentences in) documents is a simple and computationally efficient process. SimPaD detects plagiarized documents by identifying in a plagiarized document (i) sentences that are split/merged from sentences in a source document and (ii) sentences in which words have been substituted in, added to, or deleted from sentences in a source document while retaining similar content. Unlike existing plagiarism-detection approaches, SimPaD is unique, since it (i) allows partial similarity matching, as opposed to strict exact matching, and (ii) uses a graphical view to display the plagiarized sentences in a plagiarized document matched with the corresponding sentences in a source document at various degrees of similarity. SimPaD is primarily designed to detect plagiarized Web documents. Documents evaluated by SimPaD, which include (i) a (non-)plagiarized document D and (ii) a collection of archived documents that includes the (potential) sources of D, are text documents available online. Experimental results generated using existing plagiarized-document corpora show that SimPaD is highly accurate in detecting (non-)plagiarized documents.
This paper is organized as follows. In Section 2, we present our similarity-based plagiarism-detection approach. In Section 3, we introduce the different plagiarism corpora used for evaluating the performance of SimPaD and assessing its effectiveness in detecting plagiarism. In Section 4, we discuss a number of existing methods for detecting text plagiarism. In Section 5, we present our conclusions and directions for future work.
2. Our Plagiarism-Detection Approach
Plagiarism can be detected by establishing the "content similarity" among documents [33]. SimPaD identifies DP as a document plagiarized from a source document DS if DP contains (words in) sentences with high degrees of similarity to (words in) sentences in DS. In reality, plagiarism detection is not as simple as matching sentences with sentences, since sentences in DS may not be copied entirely into DP, i.e., "cut and paste" plagiarism; instead, they could be reordered, split, or merged, and/or have words added to, deleted from, or replaced in DP. Indeed, establishing which sentences of DS have been plagiarized is quite broad in scope. For this reason, SimPaD considers a number of plagiarism strategies on words and sentences, which are discussed in the following subsections. Note that these plagiarism strategies are not mutually exclusive; e.g., a sentence S that has been split to yield two sentences S1 and S2 may have words in S reordered in S1 and S2, and some words in S deleted from, and others added to, S1 and/or S2. SimPaD is designed to detect sentences created by these strategies in tandem, rather than independently, since the strategies complement each other in identifying plagiarized sentences/documents.
2.1. Document Representation without Stopwords and Short Sentences

Prior to analyzing potential plagiarized documents, we first remove all the stopwords1 in a document D and reduce all the non-stopwords to their grammatical roots, i.e., stems, using Porter's Stemmer algorithm [26]. In addition, as part of the pre-processing step, short sentences (which are semantically uninformative with regard to plagiarism detection) are removed from D, due to the high probability that independent authors can create (semantically the) same short sentences, as opposed to long, similar ones. The longer two sentences are, the less likely they are to be similar by chance.
Example 1 Consider the sentences in the two small documents, D1 and D2, given below.

D1: "Many people believe that lemmings are prone to frequently jumping off a cliff in mass suicide. This is not true."

D2: "One may assume that this chemical reaction is unfeasible due to the steric hindrance. This is not true."
1 Stopwords are words that carry little meaning, such as articles, conjunctions, and prepositions, which can be removed from a document without significant information loss. We have considered the stopword list posted under http://students.cs.byu.edu/~rmq3/StopwordList.txt
Clearly, D1 and D2 differ in content, and neither the author of D1 nor that of D2 plagiarizes from the other. Nonetheless, the same sentence "This is not true" appears in both documents, which is attributable to the tendency of some words and sentences to naturally appear frequently regardless of authorship. □
We exclude from documents to be evaluated by SimPaD sentences that are sufficiently short. Gildea [9] estimates that the average English sentence contains between 15 and 20 words, LaRocque [16] considers every sentence with fewer than 12 words (including stopwords) as short, and Liddy [19] claims that 40% of the words in a sentence are stopwords. Hence, we remove (short) sentences in a document with fewer than 7 non-stop, stemmed words and consider only the remaining semantically informative sentences during the process of plagiarism detection.
2.2. Manipulation of Words

Words in a source sentence may have been reordered, substituted, deleted, or added to yield a plagiarized sentence. We consider different plagiarism strategies for manipulating words in sentences and apply a word-similarity approach to compute the similarity values of words in sentences for detecting plagiarized sentences and documents.
2.2.1. Word Reordering
It is quite common that a plagiarized sentence is created from a source sentence S by reordering the words in S. In the simplest case of word reordering, the same keywords2 in S are present, but placed in a different order, possibly along with additional words, in a plagiarized sentence P.
Example 2 Consider the following source sentence S and plagiarized sentence P:

S: "Over 45% of all current high school students are involved in intramural sports of some kind."

P: "Of all the current high school students, over 45% are involved in some kind of intramural sports."
2 From now on, unless stated otherwise, (key)words refer to non-stop, stemmed words.
In both sentences the same set of keywords is used, i.e., current, high, school, students, involved, kind, intramural, and sports, which yields the same content, even though the keywords in S and P are placed in a different order. □
As shown in Example 2, the order of words does not affect the content of P and S3. Thus, SimPaD discards the order of words when comparing any (sentences in) documents and instead considers word substitution, addition, and deletion, as well as sentence splitting/merging.
2.2.2. Word Substitution
Word substitution can be viewed as deleting a word in a source sentence S followed by adding a (similar) word to S.
Example 3 Consider the sentences S and P below.

S: "Many dairy farmers today use machines for operations from milking to culturing cheese."

P: "Today many cow farmers perform different tasks from milking to making cheese using automated devices." □
As stated in [33], the problem of word substitution is a complex one to address in plagiarism detection, partially due to the lack of plagiarism-detection schemes that measure the degrees of similarity among words. In developing such a scheme for determining the content similarity of (words in) two sentences, we first consider how a human may compare the words in them. Consider the sentences in Example 3. A person may initially notice several identical words, such as "farmers", "today", "milking", and "cheese", in both S and P, but the presence of these words alone may not provide sufficient evidence to suggest a common origin of S and P. A person who further evaluates the content of each sentence should notice that "making cheese" ("automated devices" and "tasks", respectively) is quite similar to "culturing cheese" ("machines" and "operations", respectively). Each word in P that has a high degree of similarity to a word in S suggests common word content. A significant number of (non-)identical words with similar/same meanings in two sentences provides solid evidence that the sentences come from the same origin.
3 Word-reordering has been widely used in modern plagiarism-detection approaches. See [33] for details.
2.2.3. Word Addition/Deletion
A word deleted from (added to, respectively) a sentence without adding (deleting, respectively) another word can be considered a special case of word substitution. We recognize that the similarity of sentences P and S is higher when the words added to P are closely related to the words in S, whereas adding non-related words (in terms of similarity with the words in S) to P yields a lower sentence-to-sentence similarity of P with respect to S.
2.2.4. Word-Correlation Factors
In establishing the degrees of similarity among non-identical keywords for plagiarism detection, we adopt the word-correlation factors in a pre-computed word-correlation matrix defined in [15]. The word-correlation factor between any two words i and j, denoted Sim(i, j), was pre-computed using 880,000 documents in the Wikipedia collection (downloaded from http://www.wikipedia.org/)4 based on their (i) frequency of co-occurrence and (ii) relative distance in each Wikipedia document, as defined below.
$$\mathrm{Sim}(i, j) = \frac{\sum_{w_i \in V(i)} \sum_{w_j \in V(j)} \frac{1}{d(w_i, w_j) + 1}}{|V(i)| \times |V(j)|} \qquad (1)$$

where d(wi, wj) is the distance between any two words wi and wj in any Wikipedia document D, V(i) (V(j), respectively) is the set of stem variations of i (j, respectively) in D, and |V(i)| × |V(j)| is the normalization factor.
The Wikipedia collection is an ideal and unbiased choice for establishing word similarity, since (i) documents within the collection were written by close to 90,000 authors with different writing styles and word usage, (ii) the Wikipedia documents cover an extensive range of topics, and (iii) the words within the documents appear in a number of on-line dictionaries, such as 12dicts-4.0 (prdownloads.sourceforge.net/wordlist/12dicts-4.0.zip), Ispell (www.cs.ucla.edu/geoff/ispell.html), and BigDict (packetstormsecurity.nl/Crackers/bigdict.gz). Compared with the word-correlation factors, WordNet (wordnet.princeton.edu/) provides synonyms, hypernyms, holonyms, antonyms, etc. for a given word, but assigns no partial degree of similarity, i.e., no weight, to any pair of words. For this reason, the word-correlation factors are more sophisticated in measuring word similarity than the word pairs in WordNet, which come with no measure of their degree of "closeness".

4 Words within the Wikipedia documents were stemmed (i.e., reduced to their root forms) and stopwords were removed.
2.2.5. Scaled Word-Correlation Factors
The correlation factor of words w1 and w2 is assigned a value between 0 and 1, where '1' denotes an exact match and '0' denotes total dissimilarity between w1 and w2. Note that even for highly similar, non-identical words, the correlation factors in [15] are on the order of 5 × 10^-4 or less. For example, the degree of similarity between "tire" and "wheel" is 3.1 × 10^-6, which can be treated as 0.00031% similar and 99.99% dissimilar. When the degree of similarity of two sentences is computed using the correlation factors of their words, we prefer to ascertain, on a scale of 0% to 100%, how likely the words are to share the same semantic meaning. Hence, we assign a similarity value of 1, or rather 100%, to exact-word matches and comparable values (e.g., 0.7-0.9, or 70%-90%) to highly-similar, non-identical words. We accomplish this task by scaling the word-correlation factors in [15]. Since the correlation factors of non-identical word pairs are less than 5 × 10^-4, and word pairs with correlation factors below 1 × 10^-7 do not carry much weight in the similarity measure, we use a logarithmic scale, i.e., ScaledSim, which assigns words w1 and w2 a similarity value of 1.0 if they are identical, 0 if Sim(w1, w2) < 1 × 10^-7, and a value between 0 and 1 if 1 × 10^-7 ≤ Sim(w1, w2) ≤ 5 × 10^-4, as formally defined below.
$$\mathrm{ScaledSim}(w_1, w_2) = \begin{cases} 1, & \text{if } w_1 = w_2 \\[6pt] \max\!\left(0,\; 1 - \dfrac{\ln\!\left(\frac{5 \times 10^{-4}}{\mathrm{Sim}(w_1, w_2)}\right)}{\ln\!\left(\frac{5 \times 10^{-4}}{1 \times 10^{-7}}\right)}\right), & \text{otherwise} \end{cases} \qquad (2)$$

where Sim(w1, w2) is the word-correlation factor of w1 and w2 as defined in the word-correlation matrix using Equation 1.
We formulate Equation 2 so that the proportions of the high/low word-correlation factors in the word-correlation matrix are adequately retained by their scaled values. Furthermore, by using the natural logarithm, i.e., ln, in Equation 2, we can represent a large range of data values, i.e., the word-correlation factors, which are on the order of 10^-4 or less, in a more manageable, intuitive manner, i.e., as values between 0% and 100%.
2.2.6. N-gram Correlation Factors
As mentioned in Section 2.2.1, SimPaD does not consider the order of words in sentences. However, in some cases, disregarding the order of the words in a sentence might yield a higher degree of sentence similarity than is warranted, which could falsely classify a legitimate document as plagiarized, generating a false positive during the plagiarism-detection process. In order to reduce the number of false positives, we could, if needed, consider n-gram, i.e., phrase-, correlation factors (2 ≤ n ≤ 3), which are computed by combining the correlation factors of the corresponding words in each n-gram of sentences, as defined in [25]. Since experimental results (presented in Section 3) show that the Sim values on words are adequate for accurately detecting (non-)plagiarized documents, n-gram, phrase-correlation factors are not further considered for plagiarism detection in this paper.
2.3. Sentence Similarity

SimPaD computes the degree of similarity of any two sentences P and S using

$$\mathrm{LimSim}(P, S) = \frac{\sum_{i=1}^{m} \min\!\left(1, \sum_{j=1}^{n} \mathrm{ScaledSim}(i, j)\right)}{m} \qquad (3)$$

where m (n, respectively) denotes the number of keywords in a (potential) plagiarized (source, respectively) sentence P (S, respectively), i (j, respectively) is a word in P (S, respectively), and ScaledSim(i, j) is the scaled word-correlation factor of i and j as defined in Equation 2. Note that LimSim(P, S) ≠ LimSim(S, P), unless P = S.
Using the LimSim function, instead of simply adding the Sim or ScaledSim value of each word in P with respect to each word in S, we restrict the contribution of each word in P to at most 1, and hence the highest possible sentence-similarity value between P and S to 1, which is the value for exact matches. By imposing this constraint, we ensure that if P contains a word k that is (i) an exact match of a word in S, and (ii) similar to (some of) the other words in S, then the degree of similarity of P with respect to S cannot be disproportionately inflated by k, which ensures a balanced similarity measure of P with respect to S.
2.3.1. Merged/Split Sentences
Besides considering word addition, deletion, and substitution in detecting plagiarism, we identify sentences in a (plagiarized) document DP created by splitting and/or merging sentences in a source document DS. Identifying these split/merged sentences in DP not only measures the document similarity of DP with respect to DS more accurately, but this information is also useful to SimPaD users who are interested in knowing which sentences in DS have been split/merged to yield the corresponding sentences in DP.

Some plagiarism-detection methods, such as [34], consider sentence rearrangement, i.e., sentence merging and splitting, by setting a threshold value V so that each pair of sentences with a number of words in common higher than V is further evaluated. Relying on the proportion of common words among sentences for detecting split/merged sentences, however, is a limitation, since, as previously mentioned, words in a given source sentence S may have been replaced by other similar ones to yield a plagiarized sentence P, and hence the number of common words between S and P is lower than it should be.
We claim that a split sentence P is "subsumed" by its original sentence S if the majority of the words in P are (semantically) the same as (some of) the words in S. By adopting the threshold value of 0.93, which was established and verified using text documents in [25], SimPaD treats P as a split (subsumed) sentence from (of) S if LimSim(P, S) ≥ 0.93. The same strategy can be applied to detect merged sentences, i.e., source sentences S1, ..., Sn (n ≥ 2) are merged to yield a plagiarized sentence P if LimSim(Si, P) ≥ 0.93, ∀ 1 ≤ i ≤ n.
2.3.2. Sentence-to-Document Similarity
Having determined the LimSim value of each sentence P in a (potential) plagiarized document DP with respect to each sentence in a source document DS, SimPaD identifies the highest degree of similarity of P with the sentences in DS, which yields the probability of P having the same content as one of the sentences in DS, as
$$\mathrm{SenSim}(P, D_S) = \begin{cases} \max_{S_j \in D_S} \mathrm{LimSim}(P, S_j), & \text{if at most one } S_j \in D_S \text{ satisfies } \mathrm{LimSim}(S_j, P) \ge 0.93 \\[6pt] \min\!\left(1, \sum_{j=1}^{n} \mathrm{LimSim}(P, S_j)\right), & \text{otherwise, summing over the } n \text{ sentences } S_j \text{ with } \mathrm{LimSim}(S_j, P) \ge 0.93 \end{cases} \qquad (4)$$
where n is the number of sentences in DS that are subsumed by P. SenSim(P, DS) returns the highest LimSim value of P with respect to the sentences in DS if P is not created by merging two or more sentences in DS; otherwise, the combined similarity of the sentences in DS that are merged to yield P is assigned as the sentence-to-document value of P with respect to DS. Using the Min value in Equation 4, we impose the same restriction as in Equation 3, i.e., in combining the LimSim values of P with respect to the sentences in DS merged to create P, we limit the combined LimSim values to 1, an exact match.
2.3.3. Dotplot Views of Similar Sentences
Using the SenSim (LimSim, respectively) values computed by Equation 4 (Equation 3, respectively), we can (i) determine, for each sentence in a (potential) plagiarized document DP, its most highly-related sentence in a source document DS, in addition to the sentences in DP that are split/merged from sentences in DS, and (ii) graphically display these related sentences.

The Dotplot view [11] was designed for visualizing patterns of string matches in different kinds of texts, e.g., news articles, programming code, etc. We adapt the Dotplot view to provide an intuitive, conceptual diagram that visually shows similar sentences in a source and a plagiarized document. We modified the Dotplot view, however, using the scatter graph in Microsoft Office Excel, and call the modified graph the Plagiarism View (or PlaView). In each PlaView, the x- (y-, respectively) axis represents the sentences (numbered in chronological order of their appearance) in a plagiarized document DP (source document DS, respectively), whereas each dot, denoted "•", in PlaView represents the sentence S in DS that is the most highly similar to the sentence P in DP, as dictated by the SenSim(P, DS) value. Furthermore, PlaView graphically displays the sentences P1, ..., Pn in DP that are the split version of a sentence S in DS, if LimSim(Pi, S) ≥ 0.93, 1 ≤ i ≤ n, and the "dots" of (P1, S), ..., (Pn, S) are horizontally aligned in PlaView. In addition, a cross symbol, i.e., "x", in PlaView denotes a merged sentence P in DP that combines several sentences S1, ..., Sm (m ≥ 2) in DS, such that LimSim(Si, P) ≥ 0.93, 1 ≤ i ≤ m, and the "crosses" of (Si, P) are vertically aligned. Furthermore, the larger the size of a dot (cross, respectively), the higher the content similarity of the corresponding sentences, which captures the SenSim value for each sentence S in a source document DS and sentence P in a (potential) plagiarized document DP. Figure 1 shows the PlaView of the document "Student File Management under Primos" and its plagiarized version in Webis-PC, which is one of the datasets used in our experiments and is discussed in Section 3; v1 and v2 of <v1, v2> in Figure 1 denote the SenSim and LimSim values, respectively, of sentences S and P.
2.4. Document Similarity

Having identified the sentences in a source document DS that are most closely related to the sentences in a (potential) plagiarized document DP, we determine the overall percentage of plagiarism of DP with respect to DS as

$$\mathrm{Resem}(D_P, D_S) = \frac{\sum_{i=1}^{|D_P|} \mathrm{SenSim}(P_i, D_S)}{|D_P|} \qquad (5)$$

where |DP| is the number of sentences in DP, and Resem(DP, DS) ≠ Resem(DS, DP), unless DP = DS. Resem(DP, DS) is the degree of resemblance of DP with respect to DS.

By averaging the computed SenSim values of the sentences in DP, SimPaD determines the proportion of the (segments of) sentences in DP that are related to the sentences in DS. Using Equation 5 and a threshold value defined in Section 3, SimPaD can classify (non-)plagiarized documents.
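Putting the pieces together, a minimal sketch of Equation 5 and the resulting classification step, reusing the sen_sim sketch above:

```python
RESEM_TH = 0.27  # decision-tree threshold established in Section 3.3

def resem(dp_sentences, ds_sentences, sen_sim):
    """Equation 5: the degree of resemblance of D_P with respect to D_S,
    i.e., the average SenSim of the sentences in D_P."""
    return (sum(sen_sim(p, ds_sentences) for p in dp_sentences)
            / len(dp_sentences))

def is_plagiarized(dp_sentences, ds_sentences, sen_sim):
    # D_P is labeled as plagiarized from D_S when its resemblance
    # reaches the threshold.
    return resem(dp_sentences, ds_sentences, sen_sim) >= RESEM_TH
```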
3. Experimental Results
In this section, we introduce the datasets used for conducting various empirical studies on SimPaD and present several evaluation measures for analyzing the performance of SimPaD, in terms of accuracy and F-measure, in detecting (non-)plagiarized documents.
3.1. Datasets
In assessing the performance of SimPaD, we used three plagiarism corpora. The first one, denoted Webis-PC, is the Bauhaus University Plagiarism Corpus Webis-PC-08 [36], which consists of 101 original English documents downloaded from the ACM Digital Library (http://portal.acm.org/dl.cfm). There is a plagiarized version of each original document D in Webis-PC, which was generated by (i) including exact paragraphs in D, (ii) excluding some sentences from D, and/or (iii) adding sentences with words similar to the ones in D.
Fig. 1. PlaView of sentences in the source and plagiarized versions of the document "Student File Management under Primos", where X and Y in <X, Y> denote the SenSim and LimSim values, respectively, of the corresponding sentences

The second corpus, the Meter Corpus [8], was constructed as part of the Measuring Text Reuse Project at the University of Sheffield in the U.K. Meter consists of 265 unique stories provided by the British Press Association (PA)5, collected from July 1999 to June 2000 and clustered into two subject areas: entertainment and law/court reporting. For each of the 265 stories, Meter provides one or more (non-6)derived newspaper articles, which translates into 944 pairs of news articles published in a variety of newspapers, such as The Sun, Daily Mirror, Daily Mail, etc. Each news article pair in Meter is classified as wholly-derived (i.e., the PA story is copied/paraphrased entirely), partially-derived (i.e., PA is the major source used for writing the news article), or non-derived (i.e., PA is not the original source). In rewriting news articles based on PA stories, Gaizauskas et al. [8] observe that common rewriting strategies include (i) using the exact content from a source sentence, (ii) paraphrasing text from the source story to report the same information, and (iii) including new text, i.e., reporting PA stories using a different context. As a result, wholly- (partially-, respectively) derived articles include sentences "cut and pasted" exactly from a PA story (sentences from the PA source story in which words have been replaced, reordered, added, and/or deleted, respectively). In evaluating the performance of SimPaD, article pairs labeled as wholly- and partially-derived are treated as plagiarized articles, the same assumption made by [32] in detecting plagiarism using Meter for performance evaluation.

5 According to [8], PA is the most prestigious press agency in the U.K., providing news to 86 different national newspapers, as well as 470 radio and television broadcasts.

6 Non-derived news articles are published articles that report (but do not plagiarize) the 265 stories provided by PA.
The third test dataset, denoted PAN09 [27], is the development corpus created for the First Plagiarism-Detection Competition of the 3rd PAN Workshop on Plagiarism Detection (http://www.webis.de/pan-09). The PAN09 corpus consists of 7,214 suspicious, i.e., potential plagiarized, documents and 7,214 source documents. Documents in PAN09 are plain text files, and each text file T includes an XML file with meta-information about T, i.e., the plagiarized fragments in T along with their corresponding sources. The suspicious documents in PAN09 were generated by applying random text operations, i.e., adding, deleting, or relocating a word, as well as including a word from an external source and/or replacing a word with a synonym, antonym, hypernym, or hyponym.

The PAN09 dataset is a large-scale corpus of artificial plagiarism. Since the competition is still in progress, the "competition" portion of the corpus is not annotated. As a result, we have conducted our empirical study on the "development" portion of the PAN09 dataset instead. Furthermore, since the results on PAN09 generated by the various plagiarism-detection systems involved in the competition are not available, we cannot compare the performance of SimPaD with other approaches using PAN09, beyond conducting the measures on our own.

To the best of our knowledge, besides Webis-PC, Meter, and PAN09, no other benchmark datasets are available for evaluating the performance of a plagiarism-detection approach based on adding, deleting, or replacing words, and/or sentence merging/splitting. Hence, Webis-PC, Meter, and PAN09 are the most suitable choices for evaluating the performance of SimPaD.
3.2. Resemblance Values
As discussed in Section 2, in comparing any two documents, SimPaD computes their Resem values. Figure 2(a) (2(b), respectively) shows the Resem value of each of the 101 plagiarized documents (944 pairs of articles, respectively) and its corresponding source in Webis-PC (Meter, respectively). As shown in Figure 2(b), partially-derived news article pairs have a lower degree of resemblance than the wholly-derived news article pairs, but a higher degree of resemblance than the non-derived pairs in Meter. Based on the Resem values shown in Figure 2, we observe that SimPaD adequately detects the proportion of content shared by documents, i.e., the percentage of plagiarism found in the (potential) plagiarized documents.
3.3. A Threshold for Plagiarism Detection
Prior to determining the accuracy of SimPaD in detecting (non-)plagiarized documents, we set an appropriate threshold value, ResemTH, for automatically labeling a (potential) plagiarized document DP of a source document DS as (non-)plagiarized using the Resem(DP, DS) value7.

In defining ResemTH, we used the ID3 implementation of the decision tree [28], since ID3 is commonly used for inductive inference based on a given training set of instances. We randomly selected 40 (60, respectively) documents from Webis-PC (Meter, respectively) and their corresponding plagiarized ((non-)plagiarized, respectively) versions, which yield 100 training instances (i.e., document pairs) for constructing the decision tree. Each training instance includes a Resem value and the class value of the instance previously set (i.e., "plagiarized" for Webis-PC and "(non-)plagiarized" as defined in Meter). Using the constructed decision tree, a document DP is classified as a plagiarized version of DS if Resem(DP, DS) ≥ 0.27, which is the ResemTH value. Using ResemTH and the Resem values of the document pairs computed by SimPaD, as shown in Figure 2, we observe that SimPaD correctly classifies all the plagiarized documents in Webis-PC, all the selected wholly-derived documents in Meter, and almost all the selected partially-derived documents in Meter.

7 SimPaD is fully automated, since it does not require any user interaction or involvement during the process of determining whether an input document is plagiarized.
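The threshold-selection step can be approximated as follows; a minimal sketch, with hypothetical training data, in which scikit-learn's entropy-based CART tree stands in for the ID3 implementation used in the paper:

```python
from sklearn.tree import DecisionTreeClassifier

# Resem value of each training document pair and its known class
# (1 = plagiarized, 0 = non-plagiarized); values here are made up.
resem_values = [[0.85], [0.31], [0.12], [0.54], [0.08], [0.22]]
labels = [1, 1, 0, 1, 0, 0]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=1)
tree.fit(resem_values, labels)

# With a single numeric feature, the depth-1 split point plays the role
# of ResemTH (0.27 in the paper, learned from 100 training pairs).
print("learned threshold:", tree.tree_.threshold[0])
```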
3.4. Performance Evaluation
Using the established ResemTH value and the computed Resem values of the document pairs in Webis-PC, Meter, and PAN09, we evaluated the

$$\mathrm{Accuracy} = \frac{\mathrm{Number\_of\_Correctly\_Classified\_Documents}}{|corpus|} \qquad (6)$$

of SimPaD in correctly detecting (non-)plagiarized documents, where |corpus| is the total number of document pairs in a corpus, and the

$$\mathrm{Error\_Rate} = 1 - \mathrm{Accuracy} \qquad (7)$$

for misclassification. As shown in Figure 3, in detecting the plagiarized documents in Webis-PC and PAN09, SimPaD yields 100% and 90% accuracy8, respectively, and classifies the (non-)plagiarized news article pairs in Meter with a 96% accuracy rate. Note that none of the wholly-derived and non-derived news article pairs in Meter were misclassified. Of the thirty-six misclassified news article pairs (4% of the total number of 944 classified pairs) in the partially-derived category (which contains a total of 438 news article pairs), each plagiarized copy yields a Resem value lower than ResemTH due to the small size of its corresponding news article, which includes only 2 to 4 sentences, of which only half are (partially) derived from the corresponding PA source article. Even though SimPaD misclassified 4% of the articles in Meter as non-plagiarized, which are false negatives, SimPaD did not generate any false positives, i.e., all of the non-plagiarized articles were correctly identified.

Fig. 2. Resem values of documents and their (non-)plagiarized versions computed by SimPaD: (a) degrees of similarity of documents in Webis-PC; (b) degrees of similarity of documents in Meter

8 We randomly selected a subset of 204 documents in PAN09, each plagiarized from a unique source document, to compute the accuracy value of SimPaD on the selected subset of documents.
Furthermore, as shown in Figure 3, even though SimPaD achieves an average 95% accuracy across all three datasets, SimPaD achieves a lower accuracy ratio on the selected set of documents in PAN09 than on Webis-PC or Meter. We have observed that the 10% of misclassified documents in PAN09 are not entirely plagiarized: in some cases only a small portion of (a few sentences in) a document D is plagiarized, even though D is classified as plagiarized in PAN09.
3.5. Comparing SimPaD's Performance

In order to further assess the effectiveness of SimPaD in detecting (non-)plagiarized documents, we compare its performance, in terms of accuracy, with other existing plagiarism-detection approaches whose performance evaluations are based on Meter. (None of the performance evaluations of existing plagiarism-detection methods are based on Webis-PC, which is relatively new.)

Fig. 3. Accuracy and Error Rates generated by SimPaD on Webis-PC, Meter, and PAN09

Fig. 4. Accuracy and Error Rates generated by SimPaD and the methods in [6] and [32] on the Meter corpus

The plagiarism-detection method proposed by [32] uses a binary (i.e., similar and non-similar) classifier, which is based on style features, such as frequent words in a document, and vocabulary features, i.e., tf × idf (term frequency-inverse document frequency)-weighted vectors of unigrams, in a given document to identify copyright infringement. The combined approach yields an accuracy of 70.5% in detecting (non-)plagiarized news articles among the 88 selected pairs in Meter. In addition, Clough [6] proposes using the overlap between n-grams in any two documents to determine the proportion of shared content. Experimental results on wholly-derived and non-derived law/court news article pairs in Meter report an overall 74% accuracy rate [6]. As shown in Figure 4, SimPaD outperforms (by a huge margin) these two approaches, which are the only ones that we could find that are based on Meter and use the accuracy measure for performance evaluation.
Besides using accuracy as an evaluation metric, we have further verified the effectiveness of SimPaD in detecting (non-)plagiarized documents using Precision, Recall, and F-measure, which are commonly-used metrics, to provide additional evidence of the efficiency of SimPaD in plagiarism detection. The F-measure combines and weights the precision (a measure of fidelity) and the recall (a measure of completeness) evenly, whereas accuracy9 is sensitive only to the number of errors [21]. Using the three additional metrics on the Meter corpus, we compare the performance of SimPaD with the plagiarism-detection methods in [2] and [3], denoted PD1 and PD2, respectively. PD1 is discussed in detail in Section 4, whereas PD2, like PD1, determines the overlap between (the n-grams in) a sentence S in a potential plagiarized document P and (the n-grams in) a source document, which dictates whether S should be treated as plagiarized based on a previously-set threshold value. As opposed to PD1, PD2 does not apply search-space reduction to detect the possible sources of S.

In order to tailor the evaluation measures computed by SimPaD so that they are comparable with the ones in [2] and [3], we treat each sentence S in a potential plagiarized document as an individual document itself and adopt the ResemTH threshold defined in Section 3.3 to determine whether S should be considered plagiarized when compared with potential source documents. Using ResemTH, we examine and verify whether the proposed document-based threshold value can be adapted as a sentence-based threshold value. Since the performance evaluations presented in [2] and [3] consider various n-grams (1 ≤ n ≤ 5), our comparisons are based on the highest Precision, Recall, and F-Measure values reported for PD1 and PD2, which were achieved using bigrams, i.e., when n = 2.

As shown in Figure 5, SimPaD outperforms PD1 and PD2 by at least 10%, 5%, and 8% in terms of Precision, Recall, and F-Measure, respectively.
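For reference, the three metrics can be computed from sentence-level decisions as follows; a minimal sketch, with the sentence identifiers being hypothetical:

```python
def precision_recall_f1(predicted, actual):
    """Compute Precision, Recall, and the evenly-weighted F-measure,
    where `predicted` is the set of sentences SimPaD labels plagiarized
    and `actual` is the set labeled plagiarized in the gold annotation."""
    tp = len(predicted & actual)  # correctly flagged sentences
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy usage:
print(precision_recall_f1({"s1", "s2", "s3"}, {"s2", "s3", "s4"}))
```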
9 Accuracy is also called precision at 1 [7], i.e., P@1, which means that when only one document is retrieved, the document is either relevant or non-relevant, and thus its accuracy value is either 1 or 0, respectively.
Fig. 5. Performance evaluation of SimPaD, PD1 in [2], and PD2 in [3] in terms of Precision, Recall, and F-Measure using the Meter corpus
3.6. Retaining/Removing Short Sentences and/or Stopwords

To verify the correctness of the assumption that removing stopwords and short sentences (as discussed in Section 2.1) from each document does not affect the effectiveness of SimPaD in detecting (non-)plagiarized documents, we used a randomly-selected subset of 100 documents from the Meter corpus, denoted the StSe-dataset10, and computed the accuracy of correctly identifying the (non-)plagiarized documents using (i) SimPaD without removing short sentences, and (ii) SimPaD without removing stopwords.

As shown in Figure 6, applying (i) SimPaD to documents with short sentences and stopwords removed and (ii) SimPaD to documents without removing short sentences both achieve the same accuracy, i.e., 89%, in identifying (non-)plagiarized documents in the StSe-dataset, which verifies that removing short sentences does not affect the performance of SimPaD, as expected. Furthermore, the accuracy of applying SimPaD to documents without removing stopwords is 11% lower than the accuracy of applying SimPaD to documents with stopwords and short sentences removed. The lower accuracy ratio occurs because non-plagiarized documents share a fair number of stopwords with other sources, which translates into a higher degree of resemblance than expected between two such documents; as a result, a number of non-plagiarized documents are mistakenly classified as plagiarized when stopwords are retained during the plagiarism-detection process using SimPaD. Based on the conducted empirical study, we conclude that removing short sentences and stopwords from (potential) plagiarized documents and their corresponding sources yields the highest accuracy in detecting plagiarism.

Fig. 6. Accuracy of SimPaD computed with/without stopwords and/or short sentences in the StSe-dataset documents

10 The StSe-dataset consists of 50 non-plagiarized documents and 50 plagiarized documents, each randomly selected from its respective (non-)plagiarized document subset in the Meter corpus.
4. Related Work

Many attempts have been made in the past to detect plagiarized documents. In this section, we discuss a number of recently proposed plagiarism-detection methods, along with their strengths and shortcomings.
Lukashenko et al. [20] compare two documents and determine their degree of similarity using different metrics, such as the Euclidean distance, Cosine similarity, the percentage of shared n-grams, and the resemblance among estimated language models, whereas Hoad and Zobel [12] recommend two different methods for identifying co-derivatives, i.e., documents that originate from the same source, which can detect potential plagiarized documents. The first method relies on an identity measure that determines the degree of resemblance between two documents, assuming that similar documents contain similar words. The second method is based on fingerprinting, which identifies the content of a document by hashing substrings within the document. By comparing fingerprints among documents, one can detect those that are highly likely derivatives of another, i.e., plagiarized copies.
Kienreich et al. [14] introduce a plagiarism-detection approach in which a set of news articles is (i) represented by their shingles, i.e., sequences of non-stop words within the news articles, and (ii) clustered according to their relative similarity values. The similarity values are calculated using a hash function on the shingles of any two news articles to compute the colliding values among them, and articles are clustered together if they share a significant number of textual terms. Hereafter, within each subset of similar news articles, another evaluation is performed to detect plagiarized documents. This evaluation involves performing sentence-to-sentence comparisons between any two news articles, A1 and A2, in which transformation operations, such as insertion, deletion, and update, are applied to a sentence in A1 to convert it into a sentence in A2. The fewer operations required, the more alike the two sentences are. Although Kienreich et al. [14] show promising experimental results, the number of undetected plagiarized documents is still significant.
Monostori et al. [23] present a plagiarism-detection system, denoted MatchDetectReveal (MDR), that uses suffix trees. MDR, which is capable of detecting overlap in (potential) plagiarized documents, uses a string-matching algorithm to identify suspicious documents, from which suffix trees are constructed using a modified Ukkonen's algorithm. The overlap between the constructed suffix trees of the suspicious documents is evaluated to determine which documents, if any, are plagiarized.
Niezgoda and Way [24] design an algorithm, denoted SNITCH, for automatically detecting plagiarized documents. SNITCH analyzes a set of windows, each of which is a portion of a document D containing a certain number of words (as defined by the user), and, according to the number of characters in each word in a window W, SNITCH assigns an average weight to W. Hereafter, the highest-weighted windows (each of which is treated as a query) are searched for on the Web, and based on the results, a report is generated that includes the percentage of plagiarism in D, along with the Web sites from which the original documents were extracted. Unfortunately, SNITCH is accurate only when the identified "cut-and-paste" passages are exact, i.e., the original information is copied exactly, which is not always the case.
Zini et al. [35] focus on detecting not only "cut-and-paste" plagiarized documents, but also plagiarized documents that are modified in a subtler way, by inserting, deleting, or substituting a word, a period, or a paragraph in the original document to generate plagiarized ones. Their approach extracts the document structure at different levels (i.e., the word, period, and paragraph levels) and compares the overall similarity among the structures of any two documents using the Levenshtein distance [18].
Khmelev and Teahan [13] use the R-measure to recognize plagiarized documents. The R-measure adds the lengths of the substrings in a given document that are included in another document in a collection. By considering the normalized R-measure value, it is possible to establish the "repeatedness" of a document with respect to others, which establishes the degree of plagiarism in the corresponding documents.
Tashiro et al. [31] introduce EPCI, a tool for finding copyright-infringing texts. Given a potential plagiarized document D, EPCI extracts several sequences of words, i.e., seed texts, and generates queries that retrieve a set of Web documents W that could be the source of the content of D. Hereafter, EPCI computes the similarity between D and the documents in W. The higher the similarity value between D and any document in W, the more likely it is that infringement has occurred.
Metzler et al. [22] establish several levels of similarity among documents to identify those that are exact copies of a given document D, as well as the ones that are modified versions of D. In accomplishing this task, Metzler et al. [22] first determine the similarity between sentences within any two documents, and based on the sentence-to-sentence similarity scores, the overall similarity value of the documents is determined.
The plagiarism-detection approach introduced in [10] determines the authors of different documents, rather than comparing their contents, to detect plagiarism. Gruner and Naven [10] develop a tool based on stylometry, a statistical approach used for establishing the authorship of documents. Using stylometry, Gruner and Naven [10] recognize the similarity of writing style between the authors of any two given documents and consider two different types of plagiarism: (i) documents that should have been written by the same author but in fact were not, and (ii) documents that should have been written by different authors but in fact were not.
In [1], the authors present a cross-lingual approach for detecting plagiarized documents written in different languages. The similarity between a potential plagiarized document P and a source document S (written in another language) is measured using (i) a statistical model that establishes the probability that (a fragment of) P is related to (a fragment of) S, regardless of the order in which the words appear in P and S, and (ii) a statistical bilingual dictionary created from a parallel corpus of text in English and Spanish. Although effective, this approach requires the construction of the cross-lingual corpus, which is a difficult task in itself, as mentioned in [1].
Barrón-Cedeño et al. [2] propose a Kullback-Leibler (KL) distance-based method that drastically reduces the search space for plagiarism detection and facilitates the task of locating the source document S of a potential plagiarized one P within a large corpus C. The KL distance is used to determine the similarity between the probability distributions11 of P and S and to extract from C the subset of documents D, including S, that have the lowest KL distance with respect to P, which are the documents most likely to be the source of P. Hereafter, by (i) computing the overlap between the n-grams in each sentence Pi in P and the n-grams in each document in D and (ii) using a previously defined threshold value, the source of Pi, if it exists, is determined.
Sediyono et al. [29], who view plagiarism detection as the "problem of finding the longest common parts of strings", treat each source document S in a document collection C as a list of consecutive words. Given an observed, i.e., potential plagiarized, document P, Sediyono et al. determine the common consecutive words in P and S (to establish that S is the source of sections of text in P), as well as their relative locations in both P and S (to aid users in visualizing the common portions of text in P and S), as opposed to SimPaD and PlaView, which are sentence-based.

11 The probability distribution of P (S, respectively) is composed of a set of features, such as term frequency, that characterize P (S, respectively) [2].
SVDPlag is a Singular Value Decomposition (SVD)-based plagiarism-detection method [4]. Given a particular document collection C, SVDPlag (i) removes stopwords, (ii) stems the remaining words, and (iii) extracts n-grams of a pre-defined length. After eliminating n-grams that belong to at most one document in C or appear in all the documents in C, SVDPlag employs a Latent Semantic Analysis (LSA) framework to infer latent semantic associations between pairs of distinct n-grams in every pair of text documents in C. Hereafter, SVDPlag computes, for each pair of documents in C, their level of similarity by calculating the number of n-grams the documents have in common and determines which ones should be treated as plagiarized using a pre-defined threshold value.
Leung and Chan [17] propose using a natural language processing method to facilitate the detection of plagiarized documents, not only among the ones created by "cut and paste", but also among documents in which both the text and the structure of the original sentences are altered while the content of the documents remains intact. The approach in [17] considers (i) word replacement, by using WordNet and assigning weights to different semantic word relations, such as hypernyms, hyponyms, or synonyms, when performing sentence-to-sentence comparisons to establish the degree of resemblance among sentences, and (ii) syntactic (semantic, respectively) processing to analyze the syntactic structure (meaning, respectively) of the sentences. As stated in [17], their detection approach is slow and should be reserved for detecting plagiarism only when a document that is very likely the source of a plagiarized document is available.
MLPlag, the multilingual plagiarism-detection method presented in [5], relies on (i) a multilingual database of words and their relations, denoted the EWN thesaurus, and (ii) the analysis of word positions in documents to detect documents plagiarized from a document written in another language. MLPlag, like SimPaD, considers plagiarism strategies such as word replacement and repositioning. MLPlag, however, depends exclusively on the EWN thesaurus, which has yet to be fully developed [5].
Su et al. [30] propose a hybrid plagiarism-detection method based on the Levenshtein distance (LD) and a simplified version of the Smith-Waterman algorithm. LD is a metric that measures the edit distance between any two strings, whereas the Smith-Waterman algorithm is commonly used for detecting similarities, i.e., local alignments, between biological sequences; in [30] it is adapted to detect local similarity between any two documents and identify plagiarized ones. As stated by the authors, the main disadvantage of the proposed detection method is that the Smith-Waterman algorithm does not scale well [30].
5. Conclusions
We have proposed a similarity-based plagiarism-detection tool, SimPaD, which relies on pre-computed word-correlation factors for determining sentence-to-sentence similarity values that yield the degree of resemblance of any two documents to detect the plagiarized one, if it exists. SimPaD is designed to detect (non-)plagiarized text documents, which are digitized and posted online, using a collection of Web documents that includes the original documents of their plagiarized versions. SimPaD, which can handle various plagiarism methods, such as the substitution, addition, and deletion of words in sentences, as well as sentence splitting and merging, provides users with a visual representation of the sentences in a source document that are paraphrased in its plagiarized version.
We have conducted an empirical study which shows that SimPaD achieves high accuracy, i.e., 95.3% on average, in detecting (non-)plagiarized documents using three different benchmark datasets: Webis-PC [36], Meter [8], and PAN09 [27]. Furthermore, we have compared the performance of SimPaD with the existing plagiarism-detection approaches introduced in [6] and [32] using the Meter corpus, which shows that SimPaD outperforms these document-based plagiarism-detection approaches in accuracy by a huge margin, i.e., 22%.
We have also evaluated the performance of SimPaD in identifying individual plagiarized sentences and validated its effectiveness based on Precision, Recall, and F-measure. The experimental results show that SimPaD achieves an 83% F-measure on the Meter corpus, outperforming the alternative sentence-based plagiarism-detection approaches introduced in [2] and [3] by at least 8%.
Although highly effective, the empirical study conducted on SimPaD shows that the proposed plagiarism-detection approach could be improved to reduce the number of documents that are misclassified as (non-)plagiarized. A direction for future work is to apply natural language processing techniques to further enhance SimPaD by considering (i) the semantic meaning of (a sequence of) individual words to perform word sense disambiguation, and (ii) the (syntactic) structure of words in sentences, as opposed to relying on just the (similarity of) individual words in sentences, which is the strategy SimPaD currently applies, in detecting plagiarized documents.
References
[1] A. Barrón-Cedeño, P. Rosso, D. Pinto, and A. Juan, On Cross-Lingual Plagiarism Analysis Using a Statistical Model, in: Proceedings of the European Conference on Artificial Intelligence Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse (ECAI), 2008, pp. 9–13.
[2] A. Barrón-Cedeño, P. Rosso, and J. Bened, Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance, in: Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), 2009, pp. 523–534.
[3] A. Barrón-Cedeño and P. Rosso, On Automatic Plagiarism Detection Based on n-Grams Comparison, in: Proceedings of the European Conference on IR Research on Advances in Information Retrieval (ECIR), 2009, pp. 696–700.
[4] Z. Ceska, Plagiarism Detection Based on Singular Value Decomposition, in: Proceedings of the International Conference on Advances in Natural Language Processing, 2008, pp. 108–119.
[5] Z. Ceska, M. Toman, and K. Jezek, Multilingual Plagiarism Detection, in: Proceedings of the International Conference on Artificial Intelligence: Methodology, Systems, and Applications, LNAI 5253, 2008, pp. 83–92.
[6] P. Clough, Measuring Text Reuse in a Journalistic Domain, in: Proceedings of the Computational Linguistics UK (CLUK) Colloquium, 2001, pp. 53–63.
[7] W. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Addison Wesley, 2010.
[8] R. Gaizauskas, J. Foster, Y. Wilks, J. Arundel, P. Clough, and S. Piao, The Meter Corpus: a Corpus for Analyzing Journalistic Text Reuse, in: Proceedings of Corpus Linguistics, 2001.
[9] D. Gildea, Loosely Tree-based Alignment for Machine Translation, in: Proceedings of the Association for Computational Linguistics (ACL), 2003, pp. 80–87.
[10] S. Gruner and S. Naven, Tool Support for Plagiarism Detection in Text Documents, in: Proceedings of the ACM Symposium on Applied Computing, 2005, pp. 776–781.
[11] J. Helfman, Dotplot Patterns: A Literal Look at Pattern Languages, Theory & Practice of Object Systems 2(1) (1996), pp. 31–41.
[12] T. Hoad and J. Zobel, Methods for Identifying Versioned and Plagiarized Documents, Journal of the American Society for Information Science and Technology (JASIST) 54(3) (2003), pp. 203–215.
[13] D. Khmelev and W. Teahan, A Repetition Based Measure for Verification of Text Collections and for Text Categorization, in: Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), 2003, pp. 104–110.
[14] W. Kienreich, M. Granitzer, V. Sabol, and W. Klieber, Plagiarism Detection in Large Sets of Press Agency News Articles, in: Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA), 2006, pp. 181–188.
[15] J. Koberstein and Y.-K. Ng, Using Word Clusters to Detect Similar Web Documents, in: Proceedings of the First International Conference on Knowledge Science, Engineering and Management (KSEM'06), LNAI 4092, 2006, pp. 215–228.
[16] P. LaRocque, The Book on Writing: the Ultimate Guide to Writing Well, Marion Street Press, 2003.
[17] C. Leung and Y. Chan, A Natural Language Processing Approach to Automatic Plagiarism Detection, in: Proceedings of the ACM Conference on Information Technology Education (SIGITE), 2007, pp. 213–218.
[18] V. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Physics Doklady 10(8) (1966), pp. 707–710.
[19] E. Liddy, How Search Engines Work, Searcher (The Magazine for Database Professionals) 9(5) (2001), pp. 38–45.
[20] R. Lukashenko, V. Graudina, and J. Grundspenkis, Computer-based Plagiarism Detection Methods and Tools: an Overview, in: Proceedings of the International Conference on Computer Systems and Technologies (CompSysTech), 2007, pp. 1–6.
[21] C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[22] D. Metzler, Y. Bernstein, W. Croft, A. Moffat, and J. Zobel, Similarity Measures for Tracking Information Flow, in: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), 2005, pp. 517–524.
[23] K. Monostori, A. Zaslavsky, and H. Schmidt, Document Overlap Detection System for Distributed Digital Libraries, in: Proceedings of the 5th ACM Conference on Digital Libraries, 2000, pp. 226–227.
[24] S. Niezgoda and T. Way, SNITCH: A Software Tool for Detecting Cut and Paste Plagiarism, in: Proceedings of the SIGCSE Technical Symposium on Computer Science Education (SIGCSE), 2006, pp. 51–55.
[25] M.S. Pera and Y.-K. Ng, Utilizing Phrase-Similarity Measures for Detecting and Clustering Informative RSS News Articles, Journal of Integrated Computer-Aided Engineering (ICAE) 15(4) (2008), pp. 331–350, IOS Press.
[26] M.F. Porter, An Algorithm for Suffix Stripping, Program 14(3) (1980), pp. 130–137.
[27] M. Potthast, A. Eiselt, B. Stein, A. Barron, and P. Rosso, Plagiarism Corpus PAN-PC-09, Webis at Bauhaus-Universitat Weimar and NLEL at Universidad Polytecnica de Valencia, 2009, available at www.webis.de/research/corpora.
[28] J. Quinlan, Induction of Decision Trees, Machine Learning 1(1) (1986), pp. 81–106.
[29] A. Sediyono and K. Ku-Mahamud, Algorithm of the Longest Commonly Consecutive Word for Plagiarism Detection in Text Based Document, in: Proceedings of the International Conference on Digital Information Management (ICDIM), 2008, pp. 253–259.
[30] Z. Su, B. Ahn, K. Eom, M. Kang, J. Kim, and M. Kim, Plagiarism Detection Using the Levenshtein Distance and Smith-Waterman Algorithm, in: Proceedings of the International Conference on Innovative Computing, Information, and Control (ICICIC), 2008, pp. 569–572.
[31] T. Tashiro, T. Ueda, T. Hori, Y. Hirate, and H. Yamana, EPCI: Extracting Potentially Copyright Infringement Texts from the Web, in: Proceedings of the World Wide Web (WWW) Conference, 2007, pp. 1151–1152.
[32] O. Uzuner, R. Davis, and B. Katz, Using Empirical Methods for Evaluating Expression and Content Similarity, in: Proceedings of the 37th Hawaii International Conference on System Sciences (HICSS), 2004.
[33] O. Uzuner, B. Katz, and T. Nahnsen, Using Syntactic Information to Identify Plagiarism, in: Proceedings of the ACL Workshop on Educational Applications, 2005, pp. 37–44.
[34] D. White and M. Joy, Sentence-based Natural Language Plagiarism Detection, ACM Journal on Educational Resources in Computing (JERIC) 4(4) (2004), pp. 1–20.
[35] M. Zini, M. Fabbri, M. Moneglia, and A. Panunzi, Plagiarism Detection through Multilevel Text Comparison, in: Proceedings of the International Conference on Automated Solutions for Cross Media Content and Multi-Channel Distribution (AXMEDIS), 2006, pp. 181–185.
[36] S. zu Eissen, B. Stein, and M. Kulig, Plagiarism Corpus Webis-PC-08, Web Technology and Information Systems Group, Bauhaus University Weimar, 2008.