Mike Joy
25 February 2010
New Approaches for Detecting Similarities in Program Code
Overview of Talk
1) What is the Problem?
2) Historical Overview
3) New Approaches
4) Where Next?
Part 1 – What is the Problem?
Document similarity
– What do we mean?
– Why is software an issue?
– Why is this interesting?
Four stages
Collection
Detection
Confirmation
Investigation
From Culwin and Lancaster (2002).
Stage 1: Collection
• Get all documents together online
– so they can be processed
– formats?
– security?
• BOSS (Warwick)
• Coursemaster (Nottingham)
• Managed Learning Environment
Stage 2: Detection
• Compare with other submissions
• Compare with external documents
– essay-based assignments
• We’ll come back to this later
– it’s the interesting bit!
Stage 3: Confirmation
• Software tool says “A and B similar”
• Are they?
• Never rely on a computer program!
• Requires expert human judgement
• Evidence must be compelling
• Might go to court
Stage 4: Investigation
• A from B, or B from A, or joint work?
• If A from B, did B know?
– open networked file
– printer output
• Did the culprit/s understand?
• University processes must be followed
Why is this Interesting?
How do you compare two programs?
– This is an algorithm question
– Stages 2 and 3: detection and confirmation
How do you use the results (of a comparison) to educate students?
– This is a pedagogic question
– Stage 4, and before stage 1!
Digression: Essays
Plagiarism in essays is easier to detect
Lots of “tricks” a lecturer can use!
– Google search on phrases
– Abnormal style
– ... etc.
Software tools
– Let's have a look ...
Pedagogy
Can be used by academics to
– detect plagiarism
– provide evidence
Can be used by students to
– check their own work
Part 2 – Historical Overview
How has similar code been detected in the past?
How well do the approaches work?
Why not use Turnitin?
• It won’t work!
• String matching algorithm inappropriate
• Database does not contain code
• Commercial involvement
– E.g. Black Duck Software
/* Program 1 */
public class Hello {
    public static void main(String[] argv) {
        System.out.println("Hello World");
    }
}
/* Program 2 */
public class HelloWorld {
    public static void main(String[] x) {
        System.out.println("hello world!");
    }
}
Is This Plagiarism?
• Is Program 2 derived from Program 1 in a manner which is “plagiarism”?
• Probably No
– It's too simple
– Too many copies in books / on the web
– Most of it is generic syntax
Program 3
(Source code for MS Windows 7)
Program 4
(code 98% identical to the source code for MS Windows 7)
Is This Plagiarism?
• Is Program 4 derived from Program 3 in a manner which is “plagiarism”?
• Definitely Yes
– It's too complicated to happen by chance
  • Millions of lines of code
– The source is “closed”
  • Microsoft guard it very well!
/* Program 5 */
public class Sun {
    static final double latitude=52.4;
    static final double longitude=-1.5;
    static final double tpi = 2.0*pi;
    /* ... */
    public static void main(String[] args) { calculate(); }
    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long)(b));
        if (a < 0) a = tpi + a;
        return a;
    };
    public static void calculate() { /* ... */ }
    /* ... */
/* Program 6 */
public class SunsetCalculator {
    static float latitude=52.4;
    static float longitude=-1.5;
    /* ... */
    public static void main(String[] args) { findSunsetTime(); }
    public static double rangeCalc(float arg) {
        float x = arg / tpi;
        float y = 2*3.14159 * (x - (int)(x));
        if (y < 0) y = 2*3.14159 + y;
        return y;
    };
    public static void findSunsetTime() { /* ... */ }
    /* ... */
Is This Plagiarism?
• Is Program 6 derived from Program 5 in a manner which is “plagiarism”?
• Maybe
– Structure is similar, with cosmetic changes
– But the algorithm is public domain
– Maybe 6 derived from 5, maybe the other way round
History ...
• First known plagiarism detection system was an attribute counting program developed by Ottenstein (1976)
• More recent systems compare the structure of source-code programs
• Structure-based systems include: YAP3, MOSS, JPlag, Plague, and Sherlock.
Detection Tools (1)
Attribute counting systems (Halstead, 1972):
• Numbers of unique operators
• Numbers of unique operands
• Total numbers of operator occurrences
• Total numbers of operand occurrences
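The four counts above can be sketched in a few lines. This is a minimal illustration, not Halstead's or Ottenstein's tooling: the class name `HalsteadCounts`, the short keyword list and the token regex are all assumptions, and a real system would use a proper lexer for the language being marked.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HalsteadCounts {
    // Assumption: a tiny keyword list stands in for a real Java lexer.
    private static final Set<String> KEYWORDS = new HashSet<>(
            Arrays.asList("int", "for", "if", "else", "return", "while"));

    // Identifiers, numbers, a few two-character operators, then any single symbol.
    private static final Pattern TOKEN = Pattern.compile(
            "[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?|\\+\\+|--|[-+*/<>=!]=|\\S");

    /** Returns {unique operators, unique operands, total operators, total operands}. */
    static int[] count(String source) {
        Set<String> operators = new HashSet<>();
        Set<String> operands = new HashSet<>();
        int totalOperators = 0;
        int totalOperands = 0;
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            char c = t.charAt(0);
            boolean wordLike = Character.isLetterOrDigit(c) || c == '_';
            if (wordLike && !KEYWORDS.contains(t)) {
                // Identifiers and literals are treated as operands.
                operands.add(t);
                totalOperands++;
            } else {
                // Keywords and punctuation are treated as operators.
                operators.add(t);
                totalOperators++;
            }
        }
        return new int[] {operators.size(), operands.size(), totalOperators, totalOperands};
    }

    public static void main(String[] args) {
        int[] c = count("int ans = 0; ans *= j; return ans;");
        System.out.println(Arrays.toString(c));
    }
}
```

Two programs with near-identical count vectors are flagged as candidates. Because the counts ignore ordering, unrelated programs can collide, which is one reason later tools moved to structure.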
Detection Tools (2)
Structure-based systems:
– Each program is converted into token strings (or something similar)
– Token streams are compared for determining similar source-code fragments
– Tools: JPlag, MOSS, and Sherlock
Example (code 1)
int calculate(String arg) {
    int ans=0;
    for (int j=1; j<=100; j++) {
        ans *= j;
    }
    return ans;
}
Example (code 2)
Integer doit(String v) {
    float result=0.0;
    for (float f=100.0; f > 0.0; f--)
        result *= f;
    return result;
}
Example (tokenised)
type name(type name) start
type name=number
loop (type name=number name compare number operation name) start
name operation name
end
return name
end
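A tokeniser of this kind might be sketched as follows. The token labels (type, name, number, loop, compare, operation, start, end) follow the slide; the class name `TokenStream`, the list of type names and the regex are assumptions, and tokens are emitted strictly in source order, so the output can differ slightly in ordering from the hand-written stream above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenStream {
    // Assumption: a fixed list of type names; a real tool would use the language grammar.
    private static final Set<String> TYPES = new HashSet<>(
            Arrays.asList("int", "float", "double", "long", "String", "Integer"));
    private static final Set<String> COMPARE = new HashSet<>(
            Arrays.asList("<", ">", "<=", ">=", "==", "!="));
    private static final Pattern TOKEN = Pattern.compile(
            "[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?|\\+\\+|--|[-+*/<>=!]=|\\S");

    /** Replaces identifiers, literals and operators with abstract token labels. */
    static String tokenise(String source) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            char c = t.charAt(0);
            if (t.equals(";")) {
                continue;                       // statement terminators carry no structure
            }
            if (Character.isLetter(c) || c == '_') {
                if (TYPES.contains(t)) out.add("type");
                else if (t.equals("for") || t.equals("while")) out.add("loop");
                else if (t.equals("return")) out.add("return");
                else out.add("name");           // every identifier collapses to "name"
            } else if (Character.isDigit(c)) {
                out.add("number");              // every numeric literal collapses to "number"
            } else if (t.equals("{")) {
                out.add("start");
            } else if (t.equals("}")) {
                out.add("end");
            } else if (t.equals("(") || t.equals(")") || t.equals("=")) {
                out.add(t);                     // kept literally, as on the slide
            } else if (COMPARE.contains(t)) {
                out.add("compare");
            } else {
                out.add("operation");
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(tokenise("int ans=0; ans *= j;"));
        System.out.println(tokenise("float result=0.0; result *= f;"));
    }
}
```

Renaming variables or changing `int` to `float` leaves the stream unchanged, which is why cosmetic edits do not hide copied structure from token-based tools.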
Detectors
MOSS (Berkeley/Stanford, USA)
JPlag (Karlsruhe, Germany)
– Java only
– Programs must compile?
Sherlock (Warwick, UK)
MOSS and JPlag are Internet resources
– Data Protection?
MOSS
Developed by Alex Aiken in 1994
MOSS (for a Measure Of Software Similarity) determines the similarity of C, C++, Java, Pascal, Ada, ML, Lisp, or Scheme programs.
MOSS is free, but you must create an account
MOSS home page: http://theory.stanford.edu/~aiken/moss/
MOSS – Algorithm
“Winnowing” (Schleimer et al., 2003)
– Local document fingerprinting algorithm
– Efficiency proven (within 33% of the lower bound)
– Guarantees detection of matches longer than a certain threshold
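The fingerprint selection step can be sketched as below. This is a simplified illustration and not the MOSS implementation: the class name `Winnowing` is hypothetical, `String.hashCode` on each k-gram stands in for a rolling hash, and only hash values are kept (the published algorithm also records positions). The parameters `k` (k-gram length) and `w` (window size) are assumed to be chosen by the user.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class Winnowing {
    /** Selects the minimum k-gram hash from every window of w consecutive k-gram hashes. */
    static Set<Integer> fingerprints(String text, int k, int w) {
        Set<Integer> selected = new LinkedHashSet<>();
        int n = text.length() - k + 1;              // number of k-grams in the text
        if (n <= 0) {
            return selected;                        // text shorter than one k-gram
        }
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        int windows = Math.max(1, n - w + 1);
        for (int start = 0; start < windows; start++) {
            int end = Math.min(n, start + w);
            int min = hashes[start];
            for (int i = start + 1; i < end; i++) {
                min = Math.min(min, hashes[i]);
            }
            selected.add(min);                      // one fingerprint per window
        }
        return selected;
    }

    /** Number of fingerprints the two texts have in common. */
    static int shared(String a, String b, int k, int w) {
        Set<Integer> fa = fingerprints(a, k, w);
        fa.retainAll(fingerprints(b, k, w));
        return fa.size();
    }

    public static void main(String[] args) {
        System.out.println(shared("xxxxxabcdefghijyyyy", "qqqabcdefghijzz", 3, 4));
    }
}
```

Any substring of length at least w + k - 1 that appears in both texts contains a complete window, so its minimum hash is selected in both; that is the threshold guarantee the slide refers to.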
Using MOSS
• MOSS is provided as an Internet service
• User must download MOSS Perl script for submitting files to the MOSS server
• The script uses a direct network connection
• The MOSS server produces HTML pages listing pairs of programs with similar code
• MOSS highlights similar code-fragments within programs that appear the same
• Data Protection? – US service
• Maintenance?
JPlag
• Developed by Guido Malpohl in 1996
• JPlag currently supports Java, C#, C, C++, Scheme, and natural language text
• Use of JPlag is free, but user must create an account
• JPlag can be used to compare student assignments but does not compare with code on the Internet
• JPlag home page: www.ipd.uni-karlsruhe.de/jplag
JPlag – Algorithm
1) Parse (or scan) programs
2) Convert programs to tokens
3) Pairwise compare
• “Greedy String Tiling”
– maximises percentage of common token strings
– worst case Θ(n³), average case linear
Prechelt et al. (2002)
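The pairwise comparison step can be illustrated with a plain version of Greedy String Tiling. This sketch omits the hashing optimisations that give JPlag its near-linear average case, so it shows only the cubic worst-case form; the class name `GreedyStringTiling`, the `minMatch` parameter and the similarity formula (twice the covered tokens over the combined length) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class GreedyStringTiling {
    /** Total number of tokens covered by non-overlapping tiles common to a and b. */
    static int coveredTokens(String[] a, String[] b, int minMatch) {
        boolean[] markedA = new boolean[a.length];
        boolean[] markedB = new boolean[b.length];
        int covered = 0;
        int maxMatch;
        do {
            maxMatch = minMatch;
            List<int[]> matches = new ArrayList<>();    // entries: {startA, startB, length}
            for (int i = 0; i < a.length; i++) {
                for (int j = 0; j < b.length; j++) {
                    int len = 0;
                    while (i + len < a.length && j + len < b.length
                            && !markedA[i + len] && !markedB[j + len]
                            && a[i + len].equals(b[j + len])) {
                        len++;
                    }
                    if (len > maxMatch) {               // longer match found: restart the list
                        matches.clear();
                        maxMatch = len;
                    }
                    if (len == maxMatch) {
                        matches.add(new int[] {i, j, len});
                    }
                }
            }
            for (int[] m : matches) {
                // Skip matches overlapped by a tile placed earlier in this pass.
                if (isFree(markedA, m[0], m[2]) && isFree(markedB, m[1], m[2])) {
                    for (int t = 0; t < m[2]; t++) {
                        markedA[m[0] + t] = true;
                        markedB[m[1] + t] = true;
                    }
                    covered += m[2];
                }
            }
        } while (maxMatch > minMatch);
        return covered;
    }

    private static boolean isFree(boolean[] marked, int start, int length) {
        for (int t = 0; t < length; t++) {
            if (marked[start + t]) return false;
        }
        return true;
    }

    /** Assumed similarity measure: share of all tokens that lie inside a tile. */
    static double similarity(String[] a, String[] b, int minMatch) {
        if (a.length + b.length == 0) return 0.0;
        return 2.0 * coveredTokens(a, b, minMatch) / (a.length + b.length);
    }

    public static void main(String[] args) {
        String[] a = "type name = number name operation name".split(" ");
        String[] b = "name operation name type name = number".split(" ");
        System.out.println(similarity(a, b, 2));
    }
}
```

Because tiles may be matched in any order, moving blocks of code around does not reduce the score, which a simple longest-common-subsequence comparison would not achieve.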
JPlag File Processing
JPlag - Results
• Results in HTML Format
• Histogram of similarity values found for all pairs of programs
• Similar pairs and their similarity values displayed
• Select file pairs to view
JPlag - Matches
• Similar lines matched with the same colour
• Code fragment similarity values based on similar tokens found
Sherlock
• Developed at the University of Warwick Department of Computer Science
• Sherlock was fully integrated with the BOSS online submission software in 2002 and Open-Sourced
• Sherlock detects plagiarism on source-code and natural language assignments
• BOSS home page: www.boss.org.uk
Sherlock - Preprocessing
Whitespace
Comments
Normalisation
Tokenisation
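The first three preprocessing passes might look roughly like this. It is a sketch under stated assumptions, not Sherlock's code: the class name `Preprocess` and the regular expressions are illustrative, "normalisation" is assumed here to mean lower-casing, and the comment patterns do not handle comment markers that appear inside string literals.

```java
import java.util.Locale;

final class Preprocess {
    /** Removes block and line comments (naive: ignores string literals). */
    static String stripComments(String src) {
        return src.replaceAll("(?s)/\\*.*?\\*/", " ")
                  .replaceAll("//[^\\n]*", " ");
    }

    /** Collapses every run of whitespace to a single space. */
    static String normaliseWhitespace(String src) {
        return src.replaceAll("\\s+", " ").trim();
    }

    /** Applies all passes; lower-casing is an assumed form of normalisation. */
    static String normalise(String src) {
        return normaliseWhitespace(stripComments(src)).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalise("/* Program 1 */\npublic   class Hello { // greet\n}"));
    }
}
```

Running each pass separately and comparing the resulting versions of the files helps show which kind of disguise, if any, was applied to a copied submission.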
Sherlock – Results
• Results displayed
• Similarity values of suspicious files
• Similarity values depend on the length of similar lines found as a percentage of the whole file size
• Select suspicious matches to examine
• Mark suspicious files
Sherlock – Matches
Suspected sections marked with
**begin suspicious section**
and
**end suspicious section**
Sherlock – Document Set
• User can view graph
• Each node represents one submission
• An edge means two submissions are similar
• Options to select threshold
• Click on lines to view or to mark suspicious matches
CodeMatch
Commercial product
Free academic use for small data sets
Exact algorithm not published
• patent pending?
Example of Identical “Instruction Sequences”
/* File 1 */
for (int i=1; i<10; i++) {
    if (a==10) print("done");
    else a++;
}
/* File 2 */
for (int x=100; x > 0; x--) {
    if (z99 > -10) print("ans is " + z99);
    else { abc += 65; }
}
CodeMatch – Algorithm
1) Remove comments, whitespace and lines containing only keywords/syntax; compare sequences of instructions
2) Extract comments, and compare
3) Extract identifiers, and count similar; x, xxx, xx12345 are “similar”
4) Combine (1), (2) and (3) to give correlation score
Heuristics
Comments
– Spelling mistakes
– Unusual English (Thai, German, …)
Use of Search Engines
Unusual style
Code errors
Tool Efficiency
• MOSS, JPlag and Sherlock are effective
• Results returned are similar
• Results returned are not identical
• User interface issues may be important
Part 3 – New Approaches
Eschew the “syntax driven” approach
Lateral thinking?
Case study: Latent Semantic Analysis
Digression: Similarity
What do we actually mean by “similar”?
This is where the problems start ...
(1) Staff Survey
We carried out a survey (Cosma and Joy, 2008) in order to:
– gather the perceptions of academics on what constitutes source-code plagiarism, and
– create a structured description of what constitutes source-code plagiarism from a UK academic perspective
Data Source
• Online questionnaire distributed to 120 academics
– Questions were in the form of small scenarios
– Mostly multiple-choice responses
– Comments box below each question
– Anonymous, with an option for providing details
• Received 59 responses, from more than 34 different institutions
• Responses were analysed and collated to create a universally acceptable source-code plagiarism description.
Results
Grey areas include:
– O-O templates
– Inappropriate collaboration
– Translating between (programming) languages
– Re-use of work already submitted
Other Issues
Various issues on source-code plagiarism including:
– Source-code reuse
– Source-code self-plagiarism
– Copying without adaptation
– Copying with adaptation: minimal, moderate, extreme
– Converting source to another language
– Using code-generator software
– Collusion
– Obtaining source-code written by other authors
– False and “pretend” references
(2) Student Survey
We carried out a survey (Joy et al., 2008) in order to:
– gather the perceptions of students on what (source code) plagiarism means,
– identify types of plagiarism which are poorly understood, and
– identify categories of student who perceive the issue differently to others
Data Source
• Online questionnaire answered by 770 students from computing departments across the UK
• Anonymised, but brief demographic information included
• Used 15 “scenarios”, each of which may describe a plagiaristic activity
Results (1)
No significant difference in perspectives in terms of
– university
– degree programme
– level of study (BS, MS, PhD)
Results (2)
Issues which students misunderstood:
– Open Source code
– Translating between languages
– Re-use of code from previous assignments
– Placing references within technical documentation
Latent Semantic Analysis
Documents as “bags of words”
• Known technique in IR
• Handles synonymy and polysemy
• Maths is nasty
Results reported in (Cosma and Joy, 2010)
Document Corpus
• m × n “term by document” matrix A
• Rows = unique words
• Columns = documents
• Entries = no. of occurrences
Term Weighting
Algorithm to weight data in A
• Local and global weights
• Importance of terms in matrix A
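Building the term-by-document matrix and weighting it can be sketched as follows. The talk does not say which weighting scheme was used, so the choice here (local weight log(1 + count), global weight log(n / document frequency)) is only one common combination picked for illustration; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class TermDocumentMatrix {
    /** Sorted list of the unique terms across all documents (the matrix rows). */
    static List<String> vocabulary(String[][] docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (String[] doc : docs) {
            terms.addAll(Arrays.asList(doc));
        }
        return new ArrayList<>(terms);
    }

    /** a[i][j] = number of occurrences of term i in document j. */
    static double[][] counts(String[][] docs, List<String> vocab) {
        double[][] a = new double[vocab.size()][docs.length];
        for (int j = 0; j < docs.length; j++) {
            for (String term : docs[j]) {
                a[vocab.indexOf(term)][j] += 1.0;
            }
        }
        return a;
    }

    /** Assumed scheme: local weight log(1 + tf) times global weight log(n / df). */
    static double[][] weight(double[][] a) {
        int n = a.length == 0 ? 0 : a[0].length;
        double[][] w = new double[a.length][n];
        for (int i = 0; i < a.length; i++) {
            int df = 0;                              // documents containing term i
            for (int j = 0; j < n; j++) {
                if (a[i][j] > 0) df++;
            }
            double global = df == 0 ? 0.0 : Math.log((double) n / df);
            for (int j = 0; j < n; j++) {
                w[i][j] = Math.log(1.0 + a[i][j]) * global;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        String[][] docs = {{"ans", "j", "ans"}, {"result", "j"}};
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        System.out.println(Arrays.deepToString(weight(counts(docs, vocab))));
    }
}
```

Under this scheme a term that occurs in every file receives weight zero, so boilerplate shared by all submissions contributes nothing to later comparisons.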
Singular Value Decomposition (SVD)
Decompose the m × n matrix A: $A = U \Sigma V^{T}$
U is an m × r “term by dimension” matrix
V is an n × r “file by dimension” matrix
Σ is an r × r “singular values” matrix
Truncate matrices to k dimensions, where k ≤ r
SVD (2)
$A_k = U_k \Sigma_k V_k^{T}$
Reduces “noise”
Highlights important relations between terms and documents
Size of k determined experimentally
SVD (3)
Given a “query” q (set of weighted keywords), can map to k-space:
$Q_k = q^{T} U_k \Sigma_k^{-1}$
Think of Q_k as a k-vector; can compare to vectors representing files using e.g. “cosine similarity” (normalised dot product)
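The query mapping and the comparison can be sketched directly from the formula. The sketch assumes the truncated factors are already available as plain arrays (`uk` is the m × k matrix U_k and `sigma` holds the k singular values on the diagonal of Σ_k), since computing the SVD itself needs a numerical library; the class name `LsaQuery` is hypothetical.

```java
final class LsaQuery {
    /** Computes q^T * U_k * Sigma_k^{-1}, giving the query as a k-vector. */
    static double[] project(double[] q, double[][] uk, double[] sigma) {
        int k = sigma.length;
        double[] result = new double[k];
        for (int d = 0; d < k; d++) {
            double sum = 0.0;
            for (int i = 0; i < q.length; i++) {
                sum += q[i] * uk[i][d];          // q^T times column d of U_k
            }
            result[d] = sum / sigma[d];          // multiply by the inverse singular value
        }
        return result;
    }

    /** Cosine similarity: dot product divided by the product of the vector lengths. */
    static double cosine(double[] x, double[] y) {
        double dot = 0.0;
        double nx = 0.0;
        double ny = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        if (nx == 0.0 || ny == 0.0) {
            return 0.0;                          // a zero vector is similar to nothing
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    public static void main(String[] args) {
        double[][] uk = {{1.0, 0.0}, {0.0, 1.0}};
        double[] qk = project(new double[] {2.0, 4.0}, uk, new double[] {2.0, 4.0});
        System.out.println(cosine(qk, new double[] {2.0, 2.0}));
    }
}
```

Treating one file's term vector as the query and ranking every other file by cosine value gives the list of candidates to inspect by hand.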
Uses of LSA
Essay grading
Essay feedback
Indexing
Language independent processing
Cross-language information retrieval
Source-code clustering
Plagiarism detection (natural language)
Summary
LSA can help detect plagiarism instances missed by other tools
• Improved recall but poorer precision
• Integration with structure-based tools is effective
Visualisation of relative file similarities
Predictability of LSA results is problematic
Where Next?
• Algorithms to include Internet-located code
• “Blended” algorithms
• Cross-language detection
• Further exploration of LSA
References (1)
F. Culwin and T. Lancaster, “Plagiarism, prevention, deterrence and detection”, [online] available from: www.heacademy.ac.uk/assets/York/documents/resources/resourcedatabase/id426_plagiarism_prevention_deterrence_detection.pdf (2002)
G. Cosma and M.S. Joy, “An Approach to Source-Code Plagiarism Detection and Investigation using Latent Semantic Analysis” IEEE Transactions on Computers, to appear (2010)
G. Cosma and M.S. Joy, “Towards a Definition of Source-Code Plagiarism”, IEEE Transactions on Education 51(2) pp. 195-200 (2008)
References (2)
G. Cosma and M.S. Joy, “Source-code Plagiarism: a UK Academic Perspective”, Proceedings of the 7th Annual Conference of the HEA Network for Information and Computer Sciences (2006)
M. Halstead, “Natural Laws Controlling Algorithm Structure”, ACM SIGPLAN Notices 7(2) pp. 19-26 (1972)
M.S. Joy, G. Cosma, J.Y-K. Yau and J.E. Sinclair (2008), “Source Code Plagiarism – a Student Perspective” (under review)
M.S. Joy and M. Luck, “Plagiarism in Programming Assignments”, IEEE Transactions on Education 42(2), pp. 129-133 (1999)
References (3)
K. Ottenstein, “An Algorithmic Approach to the Detection and Prevention of Plagiarism”, ACM SIGCSE Bulletin 8(4) pp. 30-41 (1976)
L. Prechelt, G. Malpohl and M. Philippsen, “Finding Plagiarisms among a Set of Programs with JPlag”, Journal of Universal Computer Science 8(11) pp. 1016-1038 (2002)
S. Schleimer, D.S. Wilkerson and A. Aiken, “Winnowing: Local Algorithms for Document Fingerprinting”, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76-85 (2003)