Mike Joy 25 February 2010 New Approaches for Detecting Similarities in Program Code.

Overview of Talk

1) What is the Problem?
2) Historical Overview
3) New Approaches
4) Where Next?

Part 1 – What is the Problem?

Document similarity
– What do we mean?
– Why is software an issue?
– Why is this interesting?

Four stages

Collection

Detection

Confirmation

Investigation

From Culwin and Lancaster (2002).

Stage 1: Collection

• Get all documents together online
  – so they can be processed
  – formats?
  – security?

• BOSS (Warwick)

• Coursemaster (Nottingham)

• Managed Learning Environment

Stage 2: Detection

• Compare with other submissions

• Compare with external documents
  – essay-based assignments

• We’ll come back to this later
  – it’s the interesting bit!

Stage 3: Confirmation

• Software tool says “A and B similar”

• Are they?

• Never rely on a computer program!

• Requires expert human judgement

• Evidence must be compelling

• Might go to court

Stage 4: Investigation

• A from B, or B from A, or joint work?

• If A from B, did B know?
  – open networked file
  – printer output

• Did the culprit/s understand?

• University processes must be followed

Why is this Interesting?

How do you compare two programs?
– This is an algorithm question
– Stages 2 and 3: detection and confirmation

How do you use the results (of a comparison) to educate students?
– This is a pedagogic question
– Stage 4, and before stage 1!

Digression: Essays

Plagiarism in essays is easier to detect

Lots of “tricks” a lecturer can use!
– Google search on phrases
– Abnormal style
– ... etc.

Software tools
– Let's have a look ...

Pedagogy

Can be used by academics to
– detect plagiarism
– provide evidence

Can be used by students to
– check their own work

Part 2 – Historical Overview

How has similar code been detected in the past?

How well do the approaches work?

Why not use Turnitin?

• It won’t work!
• String matching algorithm inappropriate
• Database does not contain code

• Commercial involvement
  – E.g. Black Duck Software

/* Program 1 */
public class Hello {
    public static void main(String[] argv) {
        System.out.println("Hello World");
    }
}

/* Program 2 */
public class HelloWorld {
    public static void main(String[] x) {
        System.out.println("hello world!");
    }
}

Is This Plagiarism?

• Is Program 2 derived from Program 1 in a manner which is “plagiarism”?

• Probably No
  – It's too simple
  – Too many copies in books / on the web
  – Most of it is generic syntax

Program 3

(Source code for MS Windows 7)

Program 4

(code 98% identical to the source code for MS Windows 7)

Is This Plagiarism?


• Definitely Yes
  – It's too complicated to happen by chance
    • Millions of lines of code
  – The source is “closed”
    • Microsoft guard it very well!

/* Program 5 */
public class Sun {
    static final double latitude = 52.4;
    static final double longitude = -1.5;
    static final double tpi = 2.0 * Math.PI;
    /* ... */
    public static void main(String[] args) { calculate(); }

    /* Reduce an angle in radians to the range 0 to 2*pi */
    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long)(b));
        if (a < 0) a = tpi + a;
        return a;
    }
    public static void calculate() { /* ... */ }
    /* ... */

/* Program 6 */
public class SunsetCalculator {
    static float latitude = 52.4f;
    static float longitude = -1.5f;
    /* ... */
    public static void main(String[] args) { findSunsetTime(); }

    public static double rangeCalc(float arg) {
        float x = arg / (2*3.14159f);
        float y = 2*3.14159f * (x - (int)(x));
        if (y < 0) y = 2*3.14159f + y;
        return y;
    }
    public static void findSunsetTime() { /* ... */ }
    /* ... */

Is This Plagiarism?


• Maybe
  – Structure is similar – cosmetic changes
  – But the algorithm is public domain
  – Maybe 6 derived from 5, maybe the other way round

History ...

• First known plagiarism detection system was an attribute counting program developed by Ottenstein (1976)

• More recent systems compare the structure of source-code programs

• Structure-based systems include: YAP3, MOSS, JPlag, Plague, and Sherlock.

Detection Tools (1)

Attribute counting systems (Halstead, 1972):

• Numbers of unique operators
• Numbers of unique operands
• Total numbers of operator occurrences
• Total numbers of operand occurrences
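The four counts above can be illustrated with a minimal sketch. This is not the code of any tool named in the talk: the class name AttributeCounts, the token pattern and the keyword list are all assumptions made for illustration, and a real counter would use a proper lexer for the language.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical attribute counter; names and token rules are illustrative assumptions. */
final class AttributeCounts {
    // Assumption: identifiers and numbers are operands; keywords and symbols are operators.
    private static final Set<String> KEYWORDS =
        Set.of("int", "float", "double", "for", "while", "if", "else", "return");
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?|\\+\\+|--|[-+*/<>=!]=|[-+*/=<>!(){};,]");

    final Map<String, Integer> operators = new HashMap<>();
    final Map<String, Integer> operands = new HashMap<>();

    /** Scans the source text and tallies each operator and operand. */
    static AttributeCounts of(String source) {
        AttributeCounts c = new AttributeCounts();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            boolean operand = (Character.isLetterOrDigit(first) || first == '_')
                    && !KEYWORDS.contains(t);
            (operand ? c.operands : c.operators).merge(t, 1, Integer::sum);
        }
        return c;
    }

    /** Total number of occurrences across all entries of a tally. */
    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int n : counts.values()) sum += n;
        return sum;
    }

    public static void main(String[] args) {
        AttributeCounts c = of("for (int j=1; j<=100; j++) { ans *= j; }");
        System.out.println("unique operators: " + c.operators.size());
        System.out.println("unique operands:  " + c.operands.size());
        System.out.println("total operators:  " + total(c.operators));
        System.out.println("total operands:   " + total(c.operands));
    }
}
```

Two programs whose four counts are close would be flagged as candidates. Because the counts ignore ordering, this kind of measure says nothing about program structure.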

Detection Tools (2)

Structure-based systems:

– Each program is converted into token strings (or something similar)

– Token streams are compared for determining similar source-code fragments

– Tools: JPlag, MOSS, and Sherlock

Example (code 1)

int calculate(String arg) {
    int ans=0;
    for (int j=1; j<=100; j++) {
        ans *= j;
    }
    return ans;
}

Example (code 2)

Integer doit(String v) {
    float result=0.0;
    for (float f=100.0; f > 0.0; f--)
        result *= f;
    return result;
}

Example (tokenised)

type name(type name) start
    type name=number
    loop (type name=number name compare number operation name) start
        name operation name
    end
    return name
end
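A minimal sketch of how such a token string might be produced follows. The class SimpleTokeniser, its type list and its regular expression are assumptions for illustration only, not the tokeniser of JPlag, MOSS or Sherlock. In the demonstration, braces have been added around the loop body of code 2 so that both inputs carry the same start/end tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical tokeniser mapping Java-like source onto abstract tokens. */
final class SimpleTokeniser {
    // Assumption: a fixed list of type names stands in for real type resolution.
    private static final Set<String> TYPES =
        Set.of("int", "float", "double", "Integer", "String");
    private static final Set<String> LOOPS = Set.of("for", "while");
    private static final Set<String> COMPARISONS =
        Set.of("<", ">", "<=", ">=", "==", "!=");
    // Semicolons and commas are not matched, so they are dropped.
    private static final Pattern LEXEME = Pattern.compile(
        "[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?|\\+\\+|--|[<>=!]=|[-+*/]=|[-+*/<>=(){}]");

    static List<String> tokenise(String source) {
        List<String> out = new ArrayList<>();
        Matcher m = LEXEME.matcher(source);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            if (TYPES.contains(t)) out.add("type");
            else if (LOOPS.contains(t)) out.add("loop");
            else if (t.equals("return")) out.add("return");
            else if (Character.isLetter(first) || first == '_') out.add("name");
            else if (Character.isDigit(first)) out.add("number");
            else if (t.equals("{")) out.add("start");
            else if (t.equals("}")) out.add("end");
            else if (t.equals("(") || t.equals(")") || t.equals("=")) out.add(t);
            else if (COMPARISONS.contains(t)) out.add("compare");
            else out.add("operation");
        }
        return out;
    }

    public static void main(String[] args) {
        String code1 = "int calculate(String arg) { int ans=0; "
            + "for (int j=1; j<=100; j++) { ans *= j;} return ans;}";
        String code2 = "Integer doit(String v) { float result=0.0; "
            + "for (float f=100.0; f > 0.0; f--) { result *= f; } return result;}";
        System.out.println(String.join(" ", tokenise(code1)));
        System.out.println(tokenise(code1).equals(tokenise(code2)));
    }
}
```

Renaming identifiers, swapping int for float and changing constants leave the token string untouched, which is why structure-based tools survive cosmetic edits.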

Detectors

MOSS (Berkeley/Stanford, USA)

JPlag (Karlsruhe, Germany)
– Java only
– Programs must compile?

Sherlock (Warwick, UK)

MOSS and JPlag are Internet resources
– Data Protection?

MOSS

Developed by Alex Aiken in 1994

MOSS (for a Measure Of Software Similarity) determines the similarity of C, C++, Java, Pascal, Ada, ML, Lisp, or Scheme programs.

MOSS is free, but you must create an account

MOSS home page: http://theory.stanford.edu/~aiken/moss/

MOSS – Algorithm

“Winnowing” (Schleimer et al., 2003)
– Local document fingerprinting algorithm
– Efficiency proven (within 33% of the lower bound)
– Guarantees detection of matches longer than a certain threshold
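The idea behind winnowing can be sketched as follows: hash every k-gram of the (already normalised) text, slide a window of w consecutive hashes along the sequence, and keep the minimum hash of each window as a fingerprint. Any shared substring of at least w + k - 1 characters then yields at least one common fingerprint. The class name Winnow and the use of String.hashCode as the hash function are assumptions for illustration; this is not the MOSS implementation.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical winnowing sketch: position of each selected k-gram mapped to its hash. */
final class Winnow {
    static SortedMap<Integer, Integer> fingerprints(String text, int k, int w) {
        SortedMap<Integer, Integer> selected = new TreeMap<>();
        int n = text.length() - k + 1;            // number of k-grams
        if (n <= 0) return selected;
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            // Assumption: String.hashCode stands in for a rolling hash.
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        for (int start = 0; start + w <= n; start++) {
            int minPos = start;
            for (int j = start + 1; j < start + w; j++) {
                if (hashes[j] <= hashes[minPos]) minPos = j;   // rightmost minimum
            }
            selected.put(minPos, hashes[minPos]);  // same position is recorded only once
        }
        return selected;
    }

    public static void main(String[] args) {
        String shared = "calculatethesunsettime";
        System.out.println(fingerprints("zzz" + shared + "qq", 5, 4));
        System.out.println(fingerprints("mmmmmmm" + shared, 5, 4));
    }
}
```

Comparing two documents then reduces to intersecting their sets of fingerprint hashes, which is far cheaper than comparing every k-gram.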

Using MOSS

• MOSS is provided as an Internet service
• User must download the MOSS Perl script for submitting files to the MOSS server
• The script uses a direct network connection
• The MOSS server produces HTML pages listing pairs of programs with similar code
• MOSS highlights similar code fragments within programs that appear the same
• Data Protection? – US service
• Maintenance?

JPlag

• Developed by Guido Malpohl in 1996
• JPlag currently supports Java, C#, C, C++, Scheme, and natural language text
• Use of JPlag is free, but user must create an account
• JPlag can be used to compare student assignments but does not compare with code on the Internet
• JPlag home page: www.ipd.uni-karlsruhe.de/jplag

JPlag – Algorithm

1) Parse (or scan) programs

2) Convert programs to tokens

3) Pairwise compare
  • “Greedy String Tiling”
    - maximises percentage of common token strings
    - worst case $\Theta(n^3)$, average case linear

Prechelt et al. (2002)
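The pairwise comparison step can be sketched as a plain Greedy String Tiling loop: repeatedly find the longest common run of unmarked tokens, mark it as a tile, and stop when no run reaches the minimum match length. This is a simplified illustration without the hashing optimisations described by Prechelt et al.; the class name GreedyTiling and the parameter minMatch (assumed to be at least 1) are hypothetical. The similarity value follows the form 2 × covered tokens / (|A| + |B|).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, unoptimised Greedy String Tiling over two token arrays. */
final class GreedyTiling {
    /** Number of tokens of a covered by tiles of at least minMatch tokens. */
    static int coveredTokens(String[] a, String[] b, int minMatch) {
        boolean[] markA = new boolean[a.length];
        boolean[] markB = new boolean[b.length];
        int covered = 0;
        int maxMatch;
        do {
            maxMatch = minMatch;
            List<int[]> matches = new ArrayList<>();   // {startA, startB, length}
            for (int i = 0; i < a.length; i++) {
                for (int j = 0; j < b.length; j++) {
                    int len = 0;
                    while (i + len < a.length && j + len < b.length
                            && !markA[i + len] && !markB[j + len]
                            && a[i + len].equals(b[j + len])) {
                        len++;
                    }
                    if (len > maxMatch) { matches.clear(); maxMatch = len; }
                    if (len == maxMatch) matches.add(new int[] {i, j, len});
                }
            }
            for (int[] m : matches) {
                boolean free = true;               // skip matches overlapping a new tile
                for (int t = 0; t < m[2]; t++) {
                    if (markA[m[0] + t] || markB[m[1] + t]) { free = false; break; }
                }
                if (!free) continue;
                for (int t = 0; t < m[2]; t++) {
                    markA[m[0] + t] = true;
                    markB[m[1] + t] = true;
                }
                covered += m[2];
            }
        } while (maxMatch > minMatch);
        return covered;
    }

    static double similarity(String[] a, String[] b, int minMatch) {
        if (a.length + b.length == 0) return 0.0;
        return 2.0 * coveredTokens(a, b, minMatch) / (a.length + b.length);
    }

    public static void main(String[] args) {
        String[] a = {"type", "name", "=", "number", "loop", "name"};
        String[] b = {"loop", "name", "type", "name", "=", "number"};
        System.out.println(similarity(a, b, 2));
    }
}
```

In the usage example the second program is the first with two fragments swapped; tiling still covers every token, which is the property that makes the method robust to reordered code.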

JPlag File Processing

JPlag - Results

• Results in HTML Format

• Histogram of similarity values found for all pairs of programs

• Similar pairs and their similarity values displayed

• Select file pairs to view

JPlag - Matches

• Similar lines matched with the same colour

• Code fragment similarity values based on similar tokens found

Sherlock

• Developed at the University of Warwick Department of Computer Science

• Sherlock was fully integrated with the BOSS online submission software in 2002 and released as open source

• Sherlock detects plagiarism on source-code and natural language assignments

• BOSS home page: www.boss.org.uk

Sherlock - Preprocessing

• Whitespace
• Comments
• Normalisation
• Tokenisation

Sherlock – Results

• Results displayed
• Similarity values of suspicious files
• Similarity values depend on the length of similar lines found as a percentage of the whole file size

• Select suspicious matches to examine

• Mark suspicious files

Sherlock – Matches

Suspected sections marked with

**begin suspicious section**

and

**end suspicious section**

Sherlock – Document Set

• User can view graph

• Each node represents one submission

• An edge means two submissions are similar

• Options to select threshold

• Click on lines to view or to mark suspicious matches

CodeMatch

• Commercial product
• Free academic use for small data sets
• Exact algorithm not published
  – patent pending?

Example of Identical “Instruction Sequences”

/* File 1 */
for (int i=1; i<10; i++) { if (a==10) print("done"); else a++; }

/* File 2 */
for (int x=100; x > 0; x--) { if (z99 > -10) print("ans is " + z99); else { abc += 65; } }

CodeMatch – Algorithm

1) Remove comments, whitespace and lines containing only keywords/syntax; compare sequences of instructions
2) Extract comments, and compare
3) Extract identifiers, and count similar; x, xxx, xx12345 are “similar”
4) Combine (1), (2) and (3) to give correlation score

Heuristics

Comments
– Spelling mistakes
– Unusual English (Thai, German, …)

Use of Search Engines
Unusual style
Code errors

Tool Efficiency

• MOSS, JPlag and Sherlock are effective
• Results returned are similar
• Results returned are not identical
• User interface issues may be important

Part 3 – New Approaches

Eschew the “syntax driven” approach

Lateral thinking?

Case study: Latent Semantic Analysis

Digression: Similarity

What do we actually mean by “similar”?

This is where the problems start ...

(1) Staff Survey

We carried out a survey in order to:
– gather the perceptions of academics on what constitutes source-code plagiarism, and
– create a structured description of what constitutes source-code plagiarism from a UK academic perspective

Cosma and Joy (2008)

Data Source

• On-line questionnaire distributed to 120 academics
  – Questions were in the form of small scenarios
  – Mostly multiple-choice responses
  – Comments box below each question
  – Anonymous – option for providing details

• Received 59 responses, from more than 34 different institutions

• Responses were analysed and collated to create a universally acceptable source-code plagiarism description.

Results

Grey areas include:

– O-O templates
– Inappropriate collaboration
– Translating between (programming) languages
– Re-use of work already submitted

Other Issues

Various issues on source-code plagiarism including:
– Source-code reuse
– Source-code self-plagiarism
– Copying without adaptation
– Copying with adaptation: minimal, moderate, extreme
– Converting source to another language
– Using code-generator software
– Collusion
– Obtaining source-code written by other authors
– False and “pretend” references

(2) Student Survey

We carried out a survey (Joy et al., 2008) in order to:
– gather the perceptions of students on what (source code) plagiarism means,
– identify types of plagiarism which are poorly understood, and
– identify categories of student who perceive the issue differently to others

Data Source

• Online questionnaire answered by 770 students from computing departments across the UK

• Anonymised, but brief demographic information included

• Used 15 “scenarios”, each of which may describe a plagiaristic activity

Results (1)

No significant difference in perspectives in terms of

– university
– degree programme
– level of study (BS, MS, PhD)

Results (2)

Issues which students misunderstood:

– Open Source code
– Translating between languages
– Re-use of code from previous assignments
– Placing references within technical documentation

Latent Semantic Analysis

Documents as “bags of words”

• Known technique in IR

• Handles synonymy and polysemy

• Maths is nasty

Results reported in (Cosma and Joy, 2010)

Document Corpus

• m x n “term by document” matrix A
• Rows = unique words
• Columns = documents
• Entries = no. of occurrences
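A minimal sketch of building such a matrix is given below. The class TermDocumentMatrix and its splitting rule (lower-case, split on non-word characters) are assumptions for illustration; the preprocessing actually used in the reported experiments is not described on these slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical term-by-document count matrix. */
final class TermDocumentMatrix {
    final List<String> terms;   // row labels, in sorted order
    final int[][] counts;       // counts[i][j] = occurrences of term i in document j

    TermDocumentMatrix(List<String> documents) {
        TreeSet<String> vocabulary = new TreeSet<>();
        List<String[]> split = new ArrayList<>();
        for (String d : documents) {
            // Assumption: words are maximal runs of letters, digits and underscores.
            String[] words = d.toLowerCase().split("\\W+");
            split.add(words);
            for (String w : words) {
                if (!w.isEmpty()) vocabulary.add(w);
            }
        }
        terms = new ArrayList<>(vocabulary);
        counts = new int[terms.size()][documents.size()];
        for (int j = 0; j < split.size(); j++) {
            for (String w : split.get(j)) {
                if (!w.isEmpty()) counts[terms.indexOf(w)][j]++;
            }
        }
    }

    public static void main(String[] args) {
        TermDocumentMatrix m = new TermDocumentMatrix(
            List.of("int ans = 0; ans *= j;", "float result = 0; result *= f;"));
        for (int i = 0; i < m.terms.size(); i++) {
            System.out.println(m.terms.get(i) + " " + java.util.Arrays.toString(m.counts[i]));
        }
    }
}
```

Each column is the “bag of words” for one file: word order is discarded and only the counts remain.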

Term Weighting

Algorithm to weight data in A
• Local and global weights
• Importance of terms in matrix A
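The slides do not say which weighting scheme was used, so the sketch below picks one common combination purely as an assumption: a logarithmic local weight, log(1 + count), multiplied by an inverse-document-frequency global weight, log(n / df). The class name TermWeighting is hypothetical.

```java
/** Hypothetical local-times-global weighting of a term-by-document count matrix. */
final class TermWeighting {
    /**
     * weights[i][j] = log(1 + counts[i][j]) * log(n / df(i)), where n is the number of
     * documents and df(i) the number of documents containing term i.
     * Assumption: this particular scheme is illustrative, not the one from the talk.
     */
    static double[][] weigh(int[][] counts) {
        int m = counts.length;
        int n = (m == 0) ? 0 : counts[0].length;
        double[][] weights = new double[m][n];
        for (int i = 0; i < m; i++) {
            int df = 0;
            for (int j = 0; j < n; j++) {
                if (counts[i][j] > 0) df++;
            }
            double global = (df == 0) ? 0.0 : Math.log((double) n / df);
            for (int j = 0; j < n; j++) {
                double local = Math.log(1.0 + counts[i][j]);
                weights[i][j] = local * global;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        int[][] counts = { {2, 1}, {0, 1} };
        System.out.println(java.util.Arrays.deepToString(weigh(counts)));
    }
}
```

The global weight pushes terms that occur in every file (for example common keywords) towards zero, so that rarer terms dominate the comparison.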

Singular Value Decomposition (SVD)

Decompose m x n matrix A: $A = U \Sigma V^T$

• $U$ is an m x r “term by dimension” matrix
• $V$ is an n x r “file by dimension” matrix
• $\Sigma$ is an r x r “singular values” matrix

Truncate matrices to k dimensions, where k ≤ r

SVD (2)

$A_k = U_k \Sigma_k V_k^T$

• Reduces “noise”
• Highlights important relations between terms and documents

Size of k determined experimentally

SVD (3)

Given a “query” q (set of weighted keywords), can map to k-space:

$Q_k = q^T U_k \Sigma_k^{-1}$

Think of $Q_k$ as a k-vector; can compare to vectors representing files using e.g. “cosine similarity” (normalised dot product)
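The comparison step can be sketched as below. The class name Cosine is hypothetical; the vectors would be the k-dimensional query and file representations, however they were obtained.

```java
/** Hypothetical helper: cosine of the angle between two equal-length vectors. */
final class Cosine {
    static double similarity(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        if (normX == 0.0 || normY == 0.0) return 0.0;   // convention for zero vectors
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    public static void main(String[] args) {
        double[] query = {1.0, 2.0};
        double[] file = {2.0, 4.0};
        System.out.println(similarity(query, file));
    }
}
```

Because the value depends only on direction, a short file and a long file with the same mix of terms score as highly similar.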

Uses of LSA

• Essay grading
• Essay feedback
• Indexing
• Language independent processing
• Cross-language information retrieval
• Source-code clustering
• Plagiarism detection (natural language)

Summary

LSA can help detect plagiarism instances missed by other tools
• Improved recall but poorer precision
• Integration with structure-based tools is effective

Visualisation of relative file similarities
Predictability of LSA results is problematic

Where Next?

• Algorithms to include Internet-located code
• “Blended” algorithms
• Cross-language detection
• Further exploration of LSA

References (1)

F. Culwin and T. Lancaster, “Plagiarism, prevention, deterrence and detection”, [online] available from: www.heacademy.ac.uk/assets/York/documents/resources/resourcedatabase/id426_plagiarism_prevention_deterrence_detection.pdf (2002)

G. Cosma and M.S. Joy, “An Approach to Source-Code Plagiarism Detection and Investigation using Latent Semantic Analysis”, IEEE Transactions on Computers, to appear (2010)

G. Cosma and M.S. Joy, “Towards a Definition on Source-Code Plagiarism”, IEEE Transactions on Education 51(2) pp. 195-200 (2008)

References (2)

G. Cosma and M.S. Joy, “Source-code Plagiarism: a UK Academic Perspective”, Proceedings of the 7th Annual Conference of the HEA Network for Information and Computer Sciences (2006)

M. Halstead, “Natural Laws Controlling Algorithm Structure”, ACM SIGPLAN Notices 7(2) pp. 19-26 (1972)

M.S. Joy, G. Cosma, J.Y-K. Yau and J.E. Sinclair (2008), “Source Code Plagiarism – a Student Perspective” (under review)

M.S. Joy and M. Luck, “Plagiarism in Programming Assignments”, IEEE Transactions on Education 42(2), pp. 129-133 (1999)

References (3)

K. Ottenstein, “An Algorithmic Approach to the Detection and Prevention of Plagiarism”, ACM SIGCSE Bulletin 8(4) pp. 30-41 (1976)

L. Prechelt, G. Malpohl and M. Philippsen, “Finding Plagiarisms among a Set of Programs with JPlag”, Journal of Universal Computer Science 8(11) pp. 1016-1038 (2002)

S. Schleimer, D.S. Wilkerson and A. Aiken, “Winnowing: Local Algorithms for Document Fingerprinting”, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76-85 (2003)