Saugata Bose 12204 M.Sc-II Natural Language Processing: Plagiarism Detection
Page 1: Natural language processing

Saugata Bose

12204

M.Sc-II

Natural Language Processing:

Plagiarism Detection

Page 2: Natural language processing

Scope, Objectives, Significance

Objectives: Propose a framework; investigate the role of machine learning in the proposed framework.

Significance of the Project: Degradation of education quality

Scope: External plagiarism

Page 3: Natural language processing

Plagiarism

Natural Language Processing

“The action or practice of taking someone else's work, idea, etc., and passing it off as one's own; literary theft.”

Computer science + artificial intelligence + linguistics

Clough, Gaizauskas, Piao, and Wilks, “METER: Measuring TExt Reuse” (2000)

Shallow

Deep

Direct copy or paraphrase of n grams

Can be of various length

Information Retrieval

Word Segmentation

Sentence Breaking

Word Sense Disambiguation

Page 4: Natural language processing

Works That Influenced Us…

SCAM (Shivakumar and Garcia-Molina, 1995, 1996)

The more complex the metrics are, the more processing power is required (Lancaster and Culwin, 2003).

PRAISE (Culwin and Lancaster, 2001)

N gram overlap method (Roman Tesar, Massimo Poesio, Vaclav Strnad, and Karel Jezek)

Use of cosine similarity and tf-idf (Thade Nahnsen, Ozlem Uzuner, and Boris Katz)

Plagiarism Pattern Checker (Nam Oh Kang, Alexander Gelbukh, and Sang Yong Han)

Use of VSM (Benno Stein, Sven Meyer zu Eissen, and Martin Potthast)

Limitations!!!!

Page 5: Natural language processing

Our Initiatives… …

Frequency Comparison Approach

N gram Similarity Measure along with Jaccard Index

Shallow NLP

Page 6: Natural language processing

Experimental Setup

Corpus of Plagiarised Short Answers (Clough & Stevenson, 2009)

Original source documents: 5

Plagiarised documents: 57

---------- Near copy: 19

---------- Light revision: 19

---------- Heavy revision: 19

Non-plagiarised documents: 38

Page 7: Natural language processing

Experimental Setup(cont…)

[Figure: flow diagram of the framework. Box labels: Corpus; Text Pre-processing & NLP Techniques; Comparison Methodologies; Feature Selection; Machine Learning; Construction of a Train Model; Accuracy Score; Original Documents; Suspicious Documents; Test Model; Plagiarism Detection; Accuracy.]

Page 8: Natural language processing

Experimental Setup(cont…)

Text pre-processing & NLP techniques:

[Figure: tree of pre-processing combinations. Root: Lower Case. Branches at each level: Stop Word / Without Stop Word; Punctuation / No Punctuation; Stemming / No Stemming; Lemmatizing / No Lemmatizing.]

Sentence Segmentation

Input: [ “To be or not to be– that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them. To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – ‘tis a consummation devoutly to be wished.]

Output:

[ “To be or not to be– that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them.”]

[ To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – ‘tis a consummation devoutly to be wished.]

Tokenization

“To be or not to be– that is the question:” → [To] [be] [or] [not] [to] [be] [–] [that] [is] [the] [question] [:]

Lower Case

“To be or not to be– that is the question:” → to be or not to be– that is the question

Without Stop Word

“To be or not to be– that is the question:” → be or not be - question:

No Punctuation

“Hello Dear, how are You?” → Hello Dear how are you

Lemmatizing

Produced → Produce

Stemming

Produced / Product / Produce → Produc

Computational → Comput
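The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project: the function names, the regular expressions, and the very small `STOP_WORDS` set are all assumptions made for the example. Stemming and lemmatizing are left out because they normally come from a library.

```python
import re

# Assumed, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"to", "that", "is", "the", "a", "of", "and", "in"}


def split_sentences(text):
    """Naive sentence segmentation: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def preprocess(sentence, lower=True, drop_stop_words=False, drop_punctuation=False):
    """Apply the chosen combination of pre-processing switches to one sentence."""
    tokens = tokenize(sentence)
    if lower:
        tokens = [t.lower() for t in tokens]
    if drop_punctuation:
        # Keep only tokens that contain at least one letter or digit.
        tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    if drop_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


print(tokenize("Hello Dear, how are You?"))
print(preprocess("Hello Dear, how are You?", drop_punctuation=True))
print(preprocess("To be or not to be", drop_stop_words=True))
```

Each keyword switch corresponds to one branch of the pre-processing tree, so every combination can be produced by calling `preprocess` with different flags.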

Page 9: Natural language processing

Experimental Setup(cont…)

Comparison Methodologies:

N gram Frequency based similarity measure

N gram Similarity measure using Jaccard Index

Machine learning algorithm:

J48 Classifier, Naïve Bayes Classifier
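The classification step can be illustrated with a small, self-contained sketch. The slides name the classifiers but give no implementation details, so the Gaussian Naive Bayes below is a stand-in written from scratch for illustration. The two features (a 1-gram overlap score and a 2-gram Jaccard score per document pair) and the toy training rows are hypothetical and are not taken from the corpus.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # The small constant keeps variances positive for constant features.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one feature row."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)


# Hypothetical toy data: [1-gram overlap, 2-gram Jaccard] per document pair.
train_rows = [
    [0.90, 0.70], [0.80, 0.55], [0.75, 0.60],
    [0.30, 0.05], [0.25, 0.10], [0.35, 0.02],
]
train_labels = ["plagiarised"] * 3 + ["clean"] * 3

model = fit_gaussian_nb(train_rows, train_labels)
print(predict(model, [0.85, 0.60]))
print(predict(model, [0.20, 0.04]))
```

In the framework, the feature rows would instead come from the similarity scores computed under each pre-processing combination.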

Page 10: Natural language processing

N gram Similarity Measure

1 gram similarity measure (Pre-processing + NLP + Comparison)

Original Document: The girl is standing outside of PUCSD and talking with her friend

Suspicious Document: The boy is talking with his friend outside of Symbiosis

1 gram representation (No Stemming)

Original: [[The] [girl] [is] [standing] [outside] [of] [PUCSD] [and] [talking] [with] [her] [friend]]

Suspicious: [[The] [boy] [is] [talking] [with] [his] [friend] [outside] [of] [Symbiosis]]

With SP and P: 7/10 = 70%

No SP and P: 3/10 = 30%

1 gram representation (Stemming)

Original: [[The] [girl] [is] [stand] [outsid] [of] [PUCSD] [and] [talk] [with] [her] [friend]]

Suspicious: [[The] [boy] [is] [talk] [with] [his] [friend] [outsid] [of] [Symbiosis]]

With SP and P: 7/10 = 70%

No SP and P: 3/10 = 30%
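The 7/10 figure for the two example sentences can be reproduced with a short sketch. It assumes the score is the number of tokens of the suspicious document that also occur in the original, divided by the length of the suspicious document. The slide does not state the formula, so this reading and the function name `unigram_overlap` are assumptions.

```python
def unigram_overlap(original, suspicious):
    """Count suspicious-document tokens that also appear in the original.

    Returns (shared, total) so the ratio can be shown as on the slide.
    """
    original_vocab = {token.lower() for token in original.split()}
    suspicious_tokens = [token.lower() for token in suspicious.split()]
    shared = sum(1 for token in suspicious_tokens if token in original_vocab)
    return shared, len(suspicious_tokens)


original = "The girl is standing outside of PUCSD and talking with her friend"
suspicious = "The boy is talking with his friend outside of Symbiosis"

shared, total = unigram_overlap(original, suspicious)
print(f"{shared}/{total} = {shared / total:.0%}")  # prints "7/10 = 70%"
```

The shared tokens are "the", "is", "talking", "with", "friend", "outside" and "of"; "boy", "his" and "Symbiosis" have no match in the original.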

Page 11: Natural language processing

N gram Similarity measure using Jaccard Index

2 gram similarity measure (Pre-processing + NLP + Comparison)

Original Document: The girl is standing outside of PUCSD and talking with her friend

Suspicious Document: The boy is talking with his friend outside of Symbiosis

2 gram representation

Original: [[The girl],[girl is],[is standing],[standing outside],[outside of],[of PUCSD],[PUCSD and],[and talking],[talking with],[with her],[her friend],[friend ]]

Suspicious: [[The boy],[boy is],[is talking],[talking with],[with his],[his friend],[friend outside],[outside of],[of Symbiosis],[Symbiosis ]]

Similarity Index: 2/20 = 10%
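The Jaccard index is the size of the intersection of the two 2 gram sets divided by the size of their union. The sketch below reproduces the 2/20 figure. The `pad_end` option is an assumption that mirrors the trailing entries `[friend ]` and `[Symbiosis ]` on the slide; without that padding the union has 18 members rather than 20.

```python
def bigrams(text, pad_end=True):
    """Return the set of 2 grams of a whitespace-tokenized text.

    With pad_end=True the last word is paired with an empty end marker,
    matching the trailing entries in the slide's 2 gram lists (assumed).
    """
    tokens = text.split()
    if pad_end:
        tokens = tokens + [""]
    return set(zip(tokens, tokens[1:]))


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


original = "The girl is standing outside of PUCSD and talking with her friend"
suspicious = "The boy is talking with his friend outside of Symbiosis"

a, b = bigrams(original), bigrams(suspicious)
print(sorted(a & b))                  # the two shared 2 grams
print(f"{len(a & b)}/{len(a | b)}")   # prints "2/20"
print(f"{jaccard(a, b):.0%}")         # prints "10%"
```

The only shared 2 grams are "talking with" and "outside of", which is why reordering and word substitution lower this score much more than the 1 gram measure.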

Page 12: Natural language processing

Experiment and Findings-1

Generating Decision Tree

95 instances 121 attributes

Selecting Features

Build train model

95 instances 27 attributes

Accuracy: 94.6809 % on J48

Accuracy: 65.9574 % on Naïve Bayes

Accuracy: 71.2766 % on Naïve Bayes

Accuracy: 93.617 % on J48

Accuracy: 89.0052 % on J48

Accuracy: 86.3874 % on Naïve Bayes

Page 13: Natural language processing

Experiment and Findings-2

Generating Decision Tree

95 instances 121 attributes

Use Filter Metrics

Build train model

95 instances 26 attributes

Accuracy: 94.6809 % on J48

Accuracy: 65.9574 % on Naïve Bayes

Accuracy: 71.2766 % on Naïve Bayes

Accuracy: 93.617 % on J48

Accuracy: 89.0052 % on J48

Accuracy: 86.3874 % on Naïve Bayes

Page 14: Natural language processing

Future Improvements

Integrate WordNet with the current framework

Address Paraphrasing

Address multi-lingual plagiarism detection