Using Natural Language Processing for Automatic Plagiarism Detection. Miranda Chong*, Lucia Specia, Ruslan Mitkov. Research Group in Computational Linguistics, University of Wolverhampton, UK. *[email protected]. 23rd June 2010, 4th International Plagiarism Conference, Northumbria University, Newcastle upon Tyne, UK.


Using Natural Language Processing for Automatic Plagiarism Detection

Miranda Chong*, Lucia Specia, Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton, UK.
*[email protected]

23rd June 2010
4th International Plagiarism Conference
Northumbria University, Newcastle upon Tyne, UK.


Overview

• Introduction
• Challenges
• Aims
• NLP Explained
• Experimental Setup
• Findings
• Discussion
• Further Developments
• Summary


Introduction

• What is plagiarism?
• What is plagiarism detection?
• For humans, it is easy to judge “similar” passages.
• But can computers perform this judgement?


Challenges

• Existing methodologies: Limitations
– Lexical changes: synonymy, related concepts
– Structural changes: active/passive voice, word order, joining/splitting sentences
– Textual entailment: sentence paraphrase & other semantic variations
– Multi-source plagiarism
– Multi-lingual plagiarism


The vector space model (or term vector model) is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing and relevancy ranking. Its first use was in the SMART Information Retrieval System.

The vector space model has the following limitations:
1. Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).
2. Search keywords must precisely match document terms; word substrings might result in a "false positive match".
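As a rough illustration of the model described above, the sketch below turns two texts into term-frequency vectors and compares them with cosine similarity. The whitespace tokenisation and the function names `term_vector` and `cosine_similarity` are assumptions made for this example; they do not come from the slides.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to a sparse term-frequency vector (term -> count)."""
    # Assumption: lowercase + whitespace split stands in for real tokenisation.
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    # Counter returns 0 for terms missing from vec_b.
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = term_vector("to be or not to be")
doc_b = term_vector("to be or not to be")
doc_c = term_vector("that is the question")
same = cosine_similarity(doc_a, doc_b)       # identical texts
disjoint = cosine_similarity(doc_a, doc_c)   # no shared terms
```

Because each distinct term is its own dimension, "plagiarised" and "plagiarism" share nothing in this representation, which is the exact-match limitation listed above.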


Aims

• Current research focus: external plagiarism, monolingual (English), free text, document level
• Proposed framework: Existing approaches + NLP = Improve accuracy
• Scope of research: Tackle genuine plagiarism cases


NLP Explained

[ Natural Language Processing ]
Computer system to analyse written/spoken human language

Contributing fields: Linguistics, Computer Science, Mathematics, Artificial Intelligence
NLP applications: Machine Translation, Information Extraction, Document Summarisation, Question Answering, …etc


Experimental Setup (1)

• Corpus of Plagiarised Short Answers
– Clough & Stevenson (2009)
• Original source documents (wiki articles): 5
• Plagiarised documents: 57
– Near copy: 19
– Light revision: 19
– Heavy revision: 19
• Non-plagiarised documents: 38


Experimental Setup (1 cont.)

• 4 levels of suspicious plagiarised documents
– Near copy (copy & paste without changes)
– Light revision (minor alteration)
– Heavy revision (rewriting and restructuring)
– Non-plagiarised (original text not given)
• Alternatively, 2 levels of classification
– Plagiarised (Near copy + Light revision + Heavy revision)
– Non-plagiarised

Note: The 2-level classification was not used in the paper. Please see poster presentation for a comparison.


Experimental Setup (2)

• System architecture pipeline
1. Corpus (raw text): suspicious documents and original documents
2. Text pre-processing & NLP techniques → processed text
3. Comparison methodologies → feature sets
4. Machine learning algorithm → classifier
5. Accuracy score


Experimental Setup (3)

• Text pre-processing & NLP techniques:

• Syntactic processing techniques:

Baseline

Sentence segmentation

Tokenisation

Lowercase

Part-of-speech tagging

Stop-word removal

Punctuation removal

Number replacement

Lemmatisation

Stemming

Dependency parsing

Chunking


Experimental Setup (4)

• Comparison methodologies:
– Trigram similarity measures
– Language model probability measure
– Longest common subsequence
– Dependency relations matching
• Comparative baseline: Ferret Plagiarism Detector (Lyon et al., 2000)
• Machine learning algorithm:
– Naïve Bayes Classifier


Sentence segmentation

• Determine sentence boundaries
• Split text in document into sentences
• Allow sentence level matching

[ “To be or not to be– that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them.” ]
[ “To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – ‘tis a consummation devoutly to be wished.” ]

- Quote from William Shakespeare's Hamlet
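A minimal sketch of sentence segmentation, assuming a naive rule (split after ".", "!" or "?" when whitespace follows). The slides do not say which segmenter was used, so `split_sentences` is a hypothetical helper, and the input is a shortened version of the quote above.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    # Assumption: this naive rule ignores abbreviations such as "Dr." or "e.g.".
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = ("To be or not to be, that is the question. "
        "To die, to sleep no more.")
sentences = split_sentences(text)
```

Each element of `sentences` can then be compared against sentences from another document, which is what sentence-level matching relies on.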


Tokenisation

• Determine the boundaries of words and punctuation symbols
• Isolate punctuation from words

“To be or not to be– that is the question:”
↓
[To] [be] [or] [not] [to] [be] [–] [that] [is] [the] [question] [:]
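The tokenisation above can be approximated with one regular expression. This is a sketch under the assumption that a token is either a run of word characters or a single punctuation mark; the tokeniser actually used is not named in the slides.

```python
import re


def tokenise(text):
    """Return words and individual punctuation marks as separate tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenise("To be or not to be– that is the question:")
```

With this rule an apostrophe also becomes its own token, so a form such as 'tis would be split in two; a production tokeniser would need extra rules for such cases.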


Lowercase

• Substitute uppercase characters with lowercase

• Generalise word matching

“To be or not to be– that is the question:”
↓
to be or not to be– that is the question:


Part-of-speech tagging

• Assign grammatical tags to each word
• Analyse sequence of tags on syntactic level

“To be or not to be– that is the question:”
↓
TO VB CC RB TO VB : WDT VBZ DT NN :


Stop-word removal

• Remove irrelevant words
• Keep content words (verbs, adverbs, nouns, adjectives)

“To be or not to be– that is the question:”
↓
be or not be - question:


Punctuation removal

• Remove punctuation

“To be or not to be– that is the question:”
↓
To be or not to be that is the question


Number replacement

• Replace numbers with dummy symbol
• Generalise words

“63.75 percent of all statistics are made up, including this one.”
↓
[NUM] percent of all statistics are made up, including this one.
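Number replacement is a one-line substitution. The sketch below assumes that a "number" is an integer or a simple decimal; the `[NUM]` symbol comes from the slide, while `replace_numbers` is a hypothetical name.

```python
import re

NUM = "[NUM]"


def replace_numbers(text):
    """Replace each integer or decimal number with the dummy symbol."""
    # Assumption: forms such as "1,000" or "1e6" are out of scope for this sketch.
    return re.sub(r"\d+(?:\.\d+)?", NUM, text)


result = replace_numbers(
    "63.75 percent of all statistics are made up, including this one."
)
```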


Lemmatisation

• Transform words into their dictionary base forms

• Allow matching of similar words

Produced → Produce


Stemming

• Reduce words to their stems (truncated base forms, not necessarily dictionary words)

Produced / Product / Produce → Produc

Computational → Comput


Dependency parsing

• Syntactic analysis of sentences
• Stanford parser
• Allow matching for related pairs of words at constituent level

“To be or not to be– that is the question:”
aux(be-2, To-1)
cc(be-2, or-3)
neg(be-6, not-4)
aux(be-6, to-5)
conj(be-2, be-6)
nsubj(question-11, that-8)
cop(question-11, is-9)
det(question-11, the-10)
parataxis(be-2, question-11)


Chunking

• Shallow parsing generates a parse tree
• Keep only the identifiers and structure

“To be or not to be”
(S
  (VP (TO To) (VP (VB be)))
  (CC or)
  (PP (RB not) (IN to)
    (VP (VB be))))
↓
VP VP CC PP VP


Trigram similarity measures

“To be or not to be”
{“To”, “be”, “or”} {“be”, “or”, “not”} {“or”, “not”, “to”} {“not”, “to”, “be”}

• Jaccard similarity coefficient (Ferret Plagiarism Detector):
Matching trigrams in suspicious & original documents ÷ All trigrams in suspicious & original documents

• Containment measure (Clough & Stevenson):
Matching trigrams in suspicious & original documents ÷ All trigrams in suspicious documents
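The two measures differ only in the denominator, which a short sketch makes concrete. It assumes lowercased, whitespace-split words and treats each document as a set of distinct trigrams; the function names are illustrative and are not taken from Ferret or from Clough & Stevenson.

```python
def word_trigrams(text):
    """Return the set of distinct word trigrams in a text."""
    # Assumption: lowercase + whitespace split stands in for the real pre-processing.
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def jaccard(suspicious, original):
    """Matching trigrams / all distinct trigrams across both documents."""
    union = suspicious | original
    return len(suspicious & original) / len(union) if union else 0.0


def containment(suspicious, original):
    """Matching trigrams / all distinct trigrams in the suspicious document."""
    return len(suspicious & original) / len(suspicious) if suspicious else 0.0


suspicious = word_trigrams("to be or not to be")
original = word_trigrams("to be or not to be that is the question")
jaccard_score = jaccard(suspicious, original)          # 4 shared / 8 in total
containment_score = containment(suspicious, original)  # 4 shared / 4 suspicious
```

Here the suspicious text is copied entirely from the original, so containment is 1.0 while Jaccard is only 0.5: containment does not penalise the original for being longer.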


Longest common subsequence

• Calculates the longest sequence of matching words between sentences

Sentence 1: to be or not to be– that is the question.
Sentence 2: should we trust our new PM? that is the question for many voters.

LCS = “that”, “is”, “the”, “question” = 4
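The count above can be reproduced with the standard dynamic-programming recurrence for the longest common subsequence. In this sketch the two sentences are lowercased and stripped of punctuation by hand, which is an assumption about the pre-processing.

```python
def lcs_length(words_a, words_b):
    """Length of the longest common subsequence of two word lists."""
    # table[i][j] holds the LCS length of words_a[:i] and words_b[:j].
    rows, cols = len(words_a) + 1, len(words_b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if words_a[i - 1] == words_b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


sentence_1 = "to be or not to be that is the question".split()
sentence_2 = ("should we trust our new pm that is the question "
              "for many voters").split()
length = lcs_length(sentence_1, sentence_2)  # "that is the question"
```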


Language model probability measure

• N-grams statistical model
• SRILM – language modelling toolkit (Stolcke, 2002)
• Calculates level of similarity between document pairs
• Combining probabilities of n-gram overlaps
– Unigrams, Bigrams, Trigrams (tokenised corpus)
– 4-grams & 5-grams (chunked corpus)
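To show how a language model can score similarity, the toy sketch below trains a bigram model on one text and measures the perplexity of another; lower perplexity means the second text looks more like the first. It does not use SRILM: add-one smoothing and the helper names are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def train_bigram_model(tokens):
    """Count unigrams and bigrams in the training tokens."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def perplexity(tokens, unigrams, bigrams):
    """Bigram perplexity of tokens under an add-one smoothed model."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen words
    log_prob = 0.0
    for prev, word in pairs:
        # Counter returns 0 for unseen unigrams and bigrams.
        prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / len(pairs))


original = "to be or not to be that is the question".split()
unigrams, bigrams = train_bigram_model(original)

copied = "to be or not to be".split()
unrelated = "should we trust our new pm".split()
copied_ppl = perplexity(copied, unigrams, bigrams)
unrelated_ppl = perplexity(unrelated, unigrams, bigrams)
```

The copied passage receives a lower perplexity than the unrelated one, which is the signal a perplexity-based feature exploits.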


Dependency relations matching

• Count the number of matching dependency relations between documents

Suspicious doc: nsubj(question, that) cop(question, is) det(question, the) parataxis(be, question)

Original doc: aux(be, to) cc(be, or) neg(be, not) aux(be, to) conj(be, be) nsubj(question, that) cop(question, is)

• Dependency = Overlapping relations ÷ Number of relations in suspicious doc = 2 ÷ 4 = 0.5
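The worked example on this slide can be checked with a few lines of code. The sketch assumes each relation is stored as a (label, head, dependent) triple with the word positions dropped, as on the slide; `dependency_overlap` is a hypothetical name.

```python
def dependency_overlap(suspicious, original):
    """Share of the suspicious document's relations also found in the original."""
    if not suspicious:
        return 0.0
    original_set = set(original)
    matches = sum(1 for relation in suspicious if relation in original_set)
    return matches / len(suspicious)


# Relations copied from the slide as (label, head, dependent) triples.
suspicious_doc = [
    ("nsubj", "question", "that"),
    ("cop", "question", "is"),
    ("det", "question", "the"),
    ("parataxis", "be", "question"),
]
original_doc = [
    ("aux", "be", "to"),
    ("cc", "be", "or"),
    ("neg", "be", "not"),
    ("aux", "be", "to"),
    ("conj", "be", "be"),
    ("nsubj", "question", "that"),
    ("cop", "question", "is"),
]
score = dependency_overlap(suspicious_doc, original_doc)  # 2 of 4 relations match
```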


Machine learning algorithm

• WEKA – machine learning toolkit (Hall et al., 2009)
• Use feature scores for training
• Naïve Bayes classifier to learn a model
• The model classifies documents according to their level of plagiarism

What does a classifier do? It assigns each document to one of four classes: Near-Copy, Light Revision, Heavy Revision, Non-Plagiarism
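To make the training step concrete, here is a small from-scratch Gaussian Naïve Bayes sketch. It is not WEKA and not the authors' setup: the Gaussian likelihood, the variance floor and the function names are assumptions. The nine training rows use two feature scores (trigram containment and dependency relations) rounded from the sample rows shown at the end of these slides.

```python
import math
from collections import defaultdict


def train_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        means = [sum(col) / len(col) for col in columns]
        # Assumption: a variance floor of 1e-3 keeps single-example classes usable.
        variances = [
            max(sum((x - m) ** 2 for x in col) / len(col), 1e-3)
            for col, m in zip(columns, means)
        ]
        model[label] = (len(members) / len(rows), means, variances)
    return model


def classify(model, row):
    """Return the class with the highest log posterior for one feature row."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)


# (trigram containment, dependency relations) pairs, rounded from the sample rows.
rows = [
    (0.008, 0.026), (0.699, 0.759), (0.560, 0.665), (0.123, 0.202),
    (0.019, 0.032), (0.006, 0.008), (0.024, 0.060), (0.191, 0.204),
    (0.012, 0.016),
]
labels = ["non-plag", "copy", "light", "heavy", "non-plag",
          "non-plag", "non-plag", "copy", "non-plag"]
model = train_gaussian_nb(rows, labels)
low_overlap = classify(model, (0.010, 0.020))   # hypothetical unseen document
high_overlap = classify(model, (0.700, 0.750))  # hypothetical unseen document
```

Nine rows are far too few for a reliable model; the point is only to show how feature scores become a per-class model that then labels new documents.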


Findings (1)

Comparison results of feature sets

Pre-processing Techniques + Comparison Methodology → Feature Sets

1. Trigram containment measure: baseline dataset
2. Ferret: baseline dataset
3. Ferret: baseline + lemmatisation
4. Ferret: baseline + stop-word removal + punctuation removal + number replacement
5. Language model: Bigram perplexity
6. Language model: Trigram perplexity
7. Longest common subsequence
8. Dependency relations matching


Correlation coefficient scores of best features

Feature set                                                                        Correlation
Trigram Containment Measure                                                        0.66
Ferret: Baseline                                                                   0.57
Ferret: Baseline + Lemmatisation                                                   0.57
Ferret: Baseline + Stop-words removal + Punctuation removal + Number replacement   0.55
Language Model - Bigram Perplexity                                                 0.60
Language Model - Trigram Perplexity                                                0.60
Longest Common Subsequence                                                         0.26
Parse Tree Dependency Relations                                                    0.67


Findings (2)

• Naïve Bayes classifier, 10-fold cross-validation

Best features set: 70% accurate
• Trigram Containment Measure: Baseline
• Ferret: Baseline + Lemmatisation
• Ferret: Baseline + Stop-words removal + Punctuation removal + Number replacement
• Language model: Bigram perplexity
• Language model: Trigram perplexity
• Longest Common Subsequence
• Dependency Relations Matching

Ferret baseline: 66% accurate

All features (41 features in total): 60% accurate


Discussion (1)

• NLP enhances existing approaches
• Effective: distinguish between plagiarised & non-plagiarised documents
• Deep NLP Techniques (Parsing) + Machine Learning = Promising Framework

Accuracy of best features set on two levels (plag/non-plag): 94.74%


Discussion (2)

• Final human judgement needed to establish cases
• Potential educational purposes
– Identify suspicious cases for further investigation
– Pre-emptive tool to detect incorrectly referenced materials


Further Developments

• Identify paraphrased texts
– WordNet: correlation 0.72 (parse tree dependency relations: 0.67)
• Future plans:
– Passage level
– Integrate WordNet with current framework
– Perform experiments on other corpora (METER, PAN)
– Address multi-lingual plagiarism detection


Summary

• Plagiarism detection methodologies can be improved using NLP

• These tools can identify possible cases of plagiarism

• Human intervention will always be required to judge plagiarised cases


THE END

References

• Clough, P., & Stevenson, M. (2009). Developing a corpus of plagiarised short answers. Language Resources and Evaluation, LRE 2010.
• Ferret (2009). University of Hertfordshire. [Accessed: 21/3/2010] Available at: <http://homepages.feis.herts.ac.uk/~pdgroup/>
• Gumm, H. P. (2010). Plagiarism or “naturally given”? Decide for yourself …. Philipps-Universität Marburg. [Accessed: 17/5/2010] Available at: <http://www.mathematik.uni-marburg.de/~gumm/Plagiarism/index.htm>
• Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. (2009). The WEKA Data Mining Software: An Update. In ACM Special Interest Group on Knowledge Discovery and Data Mining, SIGKDD Explorations, (11)1. (pp. 10-18).
• iParadigms (2010). Turnitin. [Accessed: 11/5/2010] Available at: <http://turnitin.com/>
• Lyon, C., Barrett, R., & Malcolm, J. (2001). Experiments in Electronic Plagiarism Detection. [Accessed: 21/3/2010] Available at: <homepages.feis.herts.ac.uk.>
• Stolcke, A. (2002). SRILM- An extensible language modelling toolkit. In Proceedings of the Seventh International Conference on Spoken Language Processing, 3, (pp. 901-904).
• ZEIT Online. (2010). Abrechnung im Netz. [Accessed: 17/5/2010] Available at: <http://www.zeit.de/studium/hochschule/2010-05/mathematik-plagiate>

Columns:
1. Trigram Containment Measure
2. Ferret: Baseline
3. Ferret: Baseline + Lemma
4. Ferret: Baseline + Stopword + Punctuations + Number
5. Language Model – Bigram
6. Language Model – Trigram
7. Longest Common Subsequence
8. Parse Tree Dependency Relations

1            2         3         4         5         6         7         8         Plagiarism Level
0.008163265  0.005894  0.005917  0.003378  0.860969  0.822155  0.045104  0.02551   non-plag
0.698689956  0.381503  0.381503  0.377926  0.066008  0.067699  0.883869  0.759259  copy
0.56         0.429487  0.428115  0.325243  0.166063  0.130039  0.390746  0.664894  light
0.123152709  0.065463  0.065611  0.045283  0.797519  0.735923  0.128136  0.202454  heavy
0.019323671  0.008837  0.00885   0.002457  0.458233  0.439833  0.685169  0.031646  non-plag
0.006134969  0.007326  0.007299  0.003067  0.624461  0.595563  0.166293  0.008368  non-plag
0.024305556  0.015131  0.01511   0         0.993904  0.946112  0.145788  0.060185  non-plag
0.191111111  0.163435  0.172702  0.108108  0.197605  0.135869  0.303056  0.203593  copy
0.012195122  0         0         0         0.845157  0.800283  0         0.016304  non-plag