Saugata Bose 12204 M.Sc-II Natural Language Processing: Plagiarism Detection
Jul 17, 2015
Scope, Objectives, Significance
Objectives
Propose a framework.
Investigate the role of machine learning in the proposed framework.
Significance of the Project
Degradation of Education Quality
Scope
External Plagiarism
Plagiarism
Natural Language Processing
“The action or practice of taking someone else's work, idea, etc., and passing it off as one's own; literary theft.”
computer science
+ artificial intelligence
+ linguistics
Clough, Gaizauskas, Piao, and Wilks on “METER: Measuring TExt Reuse” in 2000
Shallow
Deep
Direct copy or paraphrase of n grams
Can be of various length
Information Retrieval
Word Segmentation
Sentence Breaking
Word Sense Disambiguation
Works That Influenced Us… …
SCAM (Shivakumar and Garcia-Molina, 1995, 1996)
The more complex the metrics are, the more processing power is required (Lancaster and Culwin, 2003).
PRAISE (Culwin and Lancaster, 2001)
N gram overlap method (Roman Tesar, Massimo Poesio, Vaclav Strnad, and Karel Jezek)
Use of cosine similarity and tf-idf (Thade Nahnsen, Ozlem Uzuner, and Boris Katz)
Plagiarism Pattern Checker (Nam Oh Kang, Alexander Gelbukh, and Sang Yong Han)
Use of VSM (Benno Stein, Sven Meyer zu Eissen, and Martin Potthast)
Limitations!!!!
Our Initiatives… …
Frequency Comparison Approach
N gram Similarity Measure along with Jaccard Index
Shallow NLP
Experimental Setup
Corpus of Plagiarised Short Answers
-------Clough & Stevenson (2009)
Original source documents : 5
Plagiarised documents : 57
----------Near copy : 19
----------Light revision : 19
----------Heavy revision : 19
Non-plagiarised documents : 38
Experimental Setup(cont…)
[Figure: experimental pipeline. Labels: Corpus; Original Documents; Suspicious Documents; Text Pre-processing & NLP Techniques; Comparison Methodologies; Feature Selection; Machine Learning; Construction of a Train Model; Test Model; Accuracy Score; Plagiarism Detection]
Experimental Setup(cont…)
Text pre-processing & NLP techniques:
Lower Case
Stop Word / Without Stop Word
Punctuation / No Punctuation
Stemming / No Stemming
Lemmatizing / No Lemmatizing
Sentence Segmentation
[To be or not to be – that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them. To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – 'tis a consummation devoutly to be wished.]
→ [To be or not to be – that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them.]
→ [To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – 'tis a consummation devoutly to be wished.]
Tokenization
“To be or not to be – that is the question:”
→ [To] [be] [or] [not] [to] [be] [–] [that] [is] [the] [question] [:]
Lower Case
“To be or not to be – that is the question:”
→ to be or not to be – that is the question
Without Stop Word
“To be or not to be – that is the question:”
→ be or not be - question:
No Punctuation
“Hello Dear, how are You?”
→ Hello Dear how are you
Lemmatizing
Produced → Produce
Stemming
Produced / Product / Produce → Produc
Computational → Comput
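The pre-processing steps illustrated above (sentence segmentation, tokenization, lower-casing, stop-word removal, punctuation removal) can be sketched in a few lines of plain Python. This is a minimal sketch, not the project's actual code: the regular expressions, the function names, and the four-word stop list are all assumptions, chosen only so that the slide examples are reproduced.

```python
import re

# Assumption: a tiny illustrative stop list, just enough for the slide example.
# A real run would use a fuller list from an NLP toolkit.
STOP_WORDS = {"to", "that", "is", "the"}

def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Return word tokens and single punctuation marks as separate tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

def lower_case(tokens):
    """Lower-case every token."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

def remove_punctuation(tokens):
    """Drop tokens that contain no letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

sentence = "To be or not to be – that is the question:"
tokens = tokenize(sentence)
print(tokens)
print(" ".join(lower_case(tokens)))
print(" ".join(remove_stop_words(tokens)))
print(" ".join(remove_punctuation(tokenize("Hello Dear, how are You?"))))
```

Stemming and lemmatizing are left out of the sketch because they need a dedicated stemmer or lemmatizer (for example a Porter stemmer) rather than a few lines of string handling.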
Experimental Setup(cont…)
Comparison Methodologies:
N gram Frequency based similarity measure
N gram Similarity measure using Jaccard Index
Machine learning algorithm:
J48 Classifier, Naïve Bayes Classifier
N gram Similarity Measure
1 gram similarity measure (Pre-processing + NLP + Comparison)
Original Document: The girl is standing outside of PUCSD and talking with her friend
Suspicious Document: The boy is talking with his friend outside of Symbiosis
1 gram representation
[[The] [girl] [is] [standing] [outside] [of] [PUCSD] [and] [talking] [with] [her] [friend]]
[[The] [boy] [is] [talking] [with] [his] [friend] [outside] [of] [Symbiosis]]
7/10 = 70%
3/10 = 30%
[[The] [girl] [is] [stand] [outsid] [of] [PUCSD] [and] [talk] [with] [her] [friend]]
[[The] [boy] [is] [talk] [with] [his] [friend] [outsid] [of] [Symbiosis]]
7/10 = 70%
3/10 = 30%
No SP and P
With SP and P
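The 7/10 figure in the 1 gram example can be reproduced by counting how many of the suspicious document's unigrams also occur in the original. The sketch below takes that reading of the measure (a containment score relative to the suspicious document), which is an assumption; the function names `ngrams` and `containment` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def containment(original_tokens, suspicious_tokens, n=1):
    """Share of the suspicious document's n-grams that also occur in the original.

    Assumption: matches are counted with frequencies (multiset intersection)
    and normalised by the suspicious document's n-gram count.
    """
    orig = Counter(ngrams(original_tokens, n))
    susp = Counter(ngrams(suspicious_tokens, n))
    total = sum(susp.values())
    if total == 0:
        return 0.0
    shared = sum((orig & susp).values())  # min count per shared n-gram
    return shared / total

original = "The girl is standing outside of PUCSD and talking with her friend".split()
suspicious = "The boy is talking with his friend outside of Symbiosis".split()

score = containment(original, suspicious, n=1)
print(f"matching: {score:.0%}, non-matching: {1 - score:.0%}")
```

The seven shared unigrams are The, is, talking, with, friend, outside and of; the three unmatched ones are boy, his and Symbiosis.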
N gram Similarity measure using Jaccard Index
2 gram similarity measure (Pre-processing + NLP + Comparison)
Original Document: The girl is standing outside of PUCSD and talking with her friend
Suspicious Document: The boy is talking with his friend outside of Symbiosis
2 gram representation
[[The girl],[girl is],[is standing],[standing outside],[outside of],[of PUCSD],[PUCSD and],[and talking],[talking with],[with her],[her friend],[friend ]]
[[The boy],[boy is],[is talking],[talking with],[with his],[his friend],[friend outside],[outside of],[of Symbiosis],[Symbiosis ]]
Similarity Index
2/20 = 10%
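The Jaccard index is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to the two bigram sets and arrives at 2/20. It assumes that the trailing one-word entries on the slide ([friend ] and [Symbiosis ]) are right-padded bigrams, which is what makes the union 20 rather than 18; the `PAD` marker and both function names are hypothetical.

```python
PAD = None  # hypothetical padding marker for the final, incomplete n-gram

def padded_ngrams(tokens, n):
    """Set of n-grams, right-padded so that every token starts one n-gram."""
    padded = list(tokens) + [PAD] * (n - 1)
    return {tuple(padded[i:i + n]) for i in range(len(tokens))}

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

original = "The girl is standing outside of PUCSD and talking with her friend".split()
suspicious = "The boy is talking with his friend outside of Symbiosis".split()

orig_bigrams = padded_ngrams(original, 2)
susp_bigrams = padded_ngrams(suspicious, 2)

print(sorted(" ".join(g) for g in orig_bigrams & susp_bigrams))
print(f"{len(orig_bigrams & susp_bigrams)}/{len(orig_bigrams | susp_bigrams)}"
      f" = {jaccard(orig_bigrams, susp_bigrams):.0%}")
```

The only shared bigrams are "outside of" and "talking with". Without the padding assumption the union would hold 18 bigrams and the index would be 2/18.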
Experiment and Findings-1
Generating Decision Tree
95 instances 121 attributes
Selecting Features
Build train model
95 instances 27 attributes
Accuracy: 94.6809 % on J48
Accuracy: 65.9574 % on Naïve Bayes
Accuracy: 71.2766 % on Naïve Bayes
Accuracy: 93.617 % on J48
Accuracy: 89.0052 % on J48
Accuracy: 86.3874 % on Naïve Bayes
Experiment and Findings-2
Generating Decision Tree
95 instances 121 attributes
Use Filter Metrics
Build train model
95 instances 26 attributes
Accuracy: 94.6809 % on J48
Accuracy: 65.9574 % on Naïve Bayes
Accuracy: 71.2766 % on Naïve Bayes
Accuracy: 93.617 % on J48
Accuracy: 89.0052 % on J48
Accuracy: 86.3874 % on Naïve Bayes