Using Natural Language Processing for Automatic Plagiarism Detection
Miranda Chong*, Lucia Specia, Ruslan Mitkov
Research Group in Computational Linguistics, University of Wolverhampton, UK
*[email protected]
23rd June 2010, 4th International Plagiarism Conference, Northumbria University, Newcastle upon Tyne, UK
• What is plagiarism?
• What is plagiarism detection?
• As humans, it is easy to judge “similar” passages.
• But can computers perform this judgement?
Challenges
• Limitations of existing methodologies
Lexical changes: synonymy, related concepts
Structural changes: active/passive voice, word order, joining/splitting sentences
Textual Entailment: sentence paraphrase & other semantic variations
Multi-source Plagiarism
Multi-lingual Plagiarism
The vector space model (or term vector model) is an algebraic model that represents text documents (and objects in general) as vectors of identifiers such as index terms. It is used in information filtering, information retrieval, indexing and relevancy ranking. It was first used in the SMART Information Retrieval System.
The vector space model has the following limitations:
1. Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).
2. Search keywords must precisely match document terms; word substrings might result in a “false positive match”.
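As an illustration, the vector space model reduces to representing each document as a term-frequency vector and comparing vectors by cosine similarity. The sketch below is a minimal pure-Python version (the function name is ours, not from the paper):

```python
from collections import Counter
import math

def cosine_similarity(doc_a, doc_b):
    """Compare two documents as term-frequency (bag-of-words) vectors."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    # Dot product over the shared vocabulary only
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Identical wording scores 1.0; documents with no shared terms score 0.0
print(cosine_similarity("the cat sat", "the cat sat"))  # 1.0
print(cosine_similarity("a b", "c d"))                  # 0.0
```

Note how this exposes the second limitation above: “produce” and “produced” are distinct terms, so a paraphrase with substituted word forms scores low unless the text is normalised first.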
• Scope of research: tackle genuine plagiarism cases
• External plagiarism
• Monolingual (English)
• Free text
• Document level
NLP Explained
[ Natural Language Processing ]: computer systems that analyse written and spoken human language
[Diagram: NLP sits at the intersection of Linguistics, Computer Science, Mathematics and Artificial Intelligence; applications include Machine Translation, Information Extraction, Document Summarisation, Question Answering, etc.]
Experimental Setup (1)
• Corpus of Plagiarised Short Answers – Clough & Stevenson (2009)
• Original source documents (Wikipedia articles): 5
• Plagiarised documents: 57
– Near copy: 19
– Light revision: 19
– Heavy revision: 19
• Non-plagiarised documents: 38
Experimental Setup (1 cont.)
• 4 levels of suspicious plagiarised documents
– Near copy (copy & paste without changes)
– Light revision (minor alteration)
– Heavy revision (rewriting and restructuring)
– Non-plagiarised (original text not given)
• Alternatively, 2 levels of classification
– Plagiarised (Near copy + Light revision + Heavy revision)
– Non-plagiarised
Note: The 2-level classification was not used in the paper; see the poster presentation for a comparison.
Experimental Setup (2)
• System architecture pipeline:
[Diagram: corpus (suspicious + original documents) → raw text → text pre-processing & NLP techniques → processed text → comparison methodologies → feature sets → machine learning algorithm (classifier) → accuracy score]
Experimental Setup (3)
• Text pre-processing & NLP techniques:
– Baseline
– Sentence segmentation
– Tokenisation
– Lowercasing
– Part-of-speech tagging
– Stop-word removal
– Punctuation removal
– Number replacement
– Lemmatisation
– Stemming
• Syntactic processing techniques:
– Dependency parsing
– Chunking
Experimental Setup (4)
• Comparison methodologies:
– Trigram similarity measures
– Language model probability measure
– Longest common subsequence
– Dependency relations matching
• Comparative baseline: Ferret Plagiarism Detector (Lyon et al., 2001)
• Machine learning algorithm: Naïve Bayes classifier
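Two of these comparison methodologies are simple enough to sketch directly. Below is a minimal pure-Python illustration of word-trigram containment (the Ferret-style measure: what fraction of the suspicious document's trigrams also occur in the source) and the classic dynamic-programming longest common subsequence; the function names are ours:

```python
def word_trigrams(text):
    """All consecutive word triples in a text, as a set."""
    toks = text.lower().split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def trigram_containment(suspicious, source):
    """Fraction of the suspicious document's trigrams also found in the source."""
    s, o = word_trigrams(suspicious), word_trigrams(source)
    return len(s & o) / len(s) if s else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (DP table)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

print(trigram_containment("to be or not to be", "to be or not to be"))  # 1.0
print(lcs_length("abcde", "ace"))  # 3
```

In the actual system, scores like these become features handed to the Naïve Bayes classifier, rather than being thresholded directly.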
Sentence segmentation
• Determine sentence boundaries
• Split the text in a document into sentences
• Allow sentence-level matching
[ “To be or not to be– that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them.”] [“To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – ‘tis a consummation devoutly to be wished.” ]
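A naive splitter can be sketched in a few lines; this regex version is only illustrative, as a real segmenter (a trained tokeniser) must also handle abbreviations, ellipses and quotations:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(split_sentences("To die, to sleep no more. 'Tis a consummation devoutly to be wished."))
# ["To die, to sleep no more.", "'Tis a consummation devoutly to be wished."]
```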
Part-of-speech tagging
• Assign grammatical tags to each word
• Analyse the sequence of tags at the syntactic level
“To be or not to be – that is the question:”
↓
TO VB CC RB TO VB : WDT VBZ DT NN :
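The idea can be sketched with a toy lookup tagger over a hand-built lexicon; this is purely illustrative (a real system uses a trained statistical tagger), though the tags follow the Penn Treebank convention used on the slide:

```python
# Toy lexicon covering only the example sentence; anything unknown defaults to NN.
LEXICON = {"to": "TO", "be": "VB", "or": "CC", "not": "RB",
           "that": "WDT", "is": "VBZ", "the": "DT", "question": "NN"}

def tag(tokens):
    """Attach a Penn Treebank-style tag to each token via dictionary lookup."""
    return [(t, LEXICON.get(t.lower(), "NN")) for t in tokens]

print(tag("to be or not to be".split()))
# [('to', 'TO'), ('be', 'VB'), ('or', 'CC'), ('not', 'RB'), ('to', 'TO'), ('be', 'VB')]
```

Comparing the resulting tag sequences (rather than the words) lets the system match passages whose vocabulary was changed but whose grammatical structure was kept.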
Stop-word removal
• Remove irrelevant words
• Keep content words (verbs, adverbs, nouns, adjectives)
“To be or not to be – that is the question:”
↓
be or not be - question:
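Filtering against a stop-word list is a one-liner; the tiny list below is illustrative only (real systems use larger standard lists):

```python
# Small illustrative stop-word list; production systems use standard lists.
STOP_WORDS = {"to", "that", "is", "the", "a", "an", "of", "and", "in"}

def remove_stop_words(tokens):
    """Keep only tokens that are not on the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("to be or not to be that is the question".split()))
# ['be', 'or', 'not', 'be', 'question']
```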
Punctuation removal
• Remove punctuation
“To be or not to be – that is the question:”
↓
To be or not to be that is the question
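This step is a straightforward character filter; the sketch below strips ASCII punctuation (typographic characters such as en-dashes would need to be added to the deletion set):

```python
import string

def remove_punctuation(text):
    """Delete ASCII punctuation and normalise the remaining whitespace."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

print(remove_punctuation("To be or not to be - that is the question:"))
# To be or not to be that is the question
```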
Number replacement
• Replace numbers with dummy symbol• Generalise words
“63.75 percent of all statistics are made up, including this one.”
↓
“[NUM] percent of all statistics are made up, including this one.”
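A regular expression covers this generalisation step; the `[NUM]` symbol matches the slide, while the function name is ours:

```python
import re

def replace_numbers(text, symbol="[NUM]"):
    """Replace integers and decimals with a dummy symbol to generalise tokens."""
    return re.sub(r'\d+(\.\d+)?', symbol, text)

print(replace_numbers("63.75 percent of all statistics are made up"))
# [NUM] percent of all statistics are made up
```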
Lemmatisation
• Transform words into their dictionary base forms
• Allow matching of similar words
Produced → Produce
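A lemmatiser maps inflected forms to dictionary entries; real systems consult a lexical resource such as WordNet, so the exception table and suffix rules below are only a toy sketch:

```python
# Toy lemmatiser: a small exception table plus crude suffix rules.
LEMMA_EXCEPTIONS = {"produced": "produce", "was": "be", "better": "good"}

def lemmatise(word):
    """Map a word to its dictionary base form (illustrative rules only)."""
    w = word.lower()
    if w in LEMMA_EXCEPTIONS:
        return LEMMA_EXCEPTIONS[w]
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]          # plural nouns / 3rd-person verbs
    return w

print(lemmatise("Produced"))  # produce
print(lemmatise("cats"))      # cat
```

Matching on lemmas lets “produced” in a suspicious document align with “produce” in a source even though the surface strings differ.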
Stemming
• Truncate words to their root forms (stems)
Produced / Product / Produce → Produc
Computational → Comput
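Unlike lemmatisation, stemming simply strips suffixes and may leave a non-word. The crude longest-suffix stripper below is illustrative only; a real system would use the Porter algorithm:

```python
def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    w = word.lower()
    for suffix in ["ational", "ation", "ing", "ed", "es", "e", "s"]:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

print(stem("Produced"))       # produc
print(stem("Produce"))        # produc
print(stem("Computational"))  # comput
```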
Dependency parsing
• Syntactic analysis of sentences
• Stanford parser
• Allow matching of related word pairs at the constituent level
“To be or not to be – that is the question:”
↓
aux(be-2, To-1) cc(be-2, or-3) neg(be-6, not-4) aux(be-6, to-5) conj(be-2, be-6) nsubj(question-11, that-8) cop(question-11, is-9) det(question-11, the-10) parataxis(be-2, question-11)
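Once each sentence is reduced to a set of typed dependency triples like those above (the paper obtains them from the Stanford parser), the matching step itself is just set overlap. A minimal sketch, with our own function name and Jaccard overlap as an assumed scoring choice:

```python
def relation_overlap(rels_a, rels_b):
    """Jaccard overlap between two sets of (relation, head, dependent) triples."""
    a, b = set(rels_a), set(rels_b)
    return len(a & b) / len(a | b) if a | b else 0.0

suspicious = {("aux", "be", "to"), ("cc", "be", "or"), ("neg", "be", "not")}
source     = {("aux", "be", "to"), ("cc", "be", "or"), ("det", "question", "the")}
print(relation_overlap(suspicious, source))  # 0.5
```

Because the triples abstract away word order, a passive-voice rewrite of an active sentence can still share most of its head-dependent pairs with the source.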
Chunking
• Shallow parsing generates a parse tree
• Keep only the identifiers and structure
• Future plans:
– Passage level
– Integrate WordNet with the current framework
– Perform experiments on other corpora (METER, PAN)
– Address multi-lingual plagiarism detection
[Results excerpt: parse tree dependency relations – 0.67]
Summary
• Plagiarism detection methodologies can be improved using NLP
• These tools can identify possible cases of plagiarism
• Human intervention will always be required to judge plagiarised cases
THE END

References
• Clough, P., & Stevenson, M. (2009). Developing a corpus of plagiarised short answers. Language Resources and Evaluation, LRE 2010.
• Ferret (2009). University of Hertfordshire. [Accessed: 21/3/2010] Available at: <http://homepages.feis.herts.ac.uk/~pdgroup/>
• Gumm, H. P. (2010). Plagiarism or “naturally given”? Decide for yourself…. Philipps-Universität Marburg. [Accessed: 17/5/2010] Available at: <http://www.mathematik.uni-marburg.de/~gumm/Plagiarism/index.htm>
• Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1), pp. 10-18.
• iParadigms (2010). Turnitin. [Accessed: 11/5/2010] Available at: <http://turnitin.com/>
• Lyon, C., Barrett, R., & Malcolm, J. (2001). Experiments in Electronic Plagiarism Detection. [Accessed: 21/3/2010] Available at: <homepages.feis.herts.ac.uk>
• Stolcke, A. (2002). SRILM – An extensible language modelling toolkit. In Proceedings of the Seventh International Conference on Spoken Language Processing, 3, pp. 901-904.
• ZEIT Online. (2010). Abrechnung im Netz. [Accessed: 17/5/2010] Available at: