Duplicate code detection using anti-unification. Peter Bulychev (Moscow State University), Marius Minea (Institute eAustria, Timisoara)


Dec 14, 2015


Duplicate code detection using anti-unification

Peter Bulychev, Moscow State University

Marius Minea, Institute eAustria, Timisoara


Outline

- Code duplication problem
- Our anti-unification based algorithm
- Comparison with existing methods
- Clone Digger, the tool for finding software clones


What is a software clone?

Two fragments of code form a clone if they are similar enough (according to a given measure of similarity)

for(int i=0; i<5; i++)
    for(j=0; j<=i; j++)
        cout << i+j;

for(int k=0; k<6; k++)
    for(m=0; m<=k; m++)
        cout << k+m;


Why is it important to detect code clones?

- 5% to 20% of the code in software systems consists of clones [1]
- Why do programmers produce clones? [2]
  - Development strategy
  - Maintenance benefits
  - Overcoming underlying limitations
  - Cloning by accident
- Why is the presence of code clones bad?
  - Errors in the original must be fixed in every clone

1. I.D. Baxter et al. Clone Detection Using Abstract Syntax Trees, 1998.
2. C.K. Roy and J.R. Cordy. A Survey on Software Clone Detection Research, 2007.


Our clone definition

- Different clone definitions can be classified according to the level of granularity:
  - List of strings
  - Sequence of tokens
  - Abstract syntax trees (AST)
  - Semantic information
- We work on the AST level
- We consider two sequences of statements a clone if one of them can be obtained from the other by replacing some subtrees


Example

x = a;
y = f(x,i);
cout << y;

x = a + b;
y = f(x,j);
cout << y;

[Figure: ASTs of the two fragments. The trees differ only in two subtrees: a versus a + b, and i versus j.]


The sketch of the algorithm

Partition similar statements into clusters

Find pairs of identical cluster sequences

Refine by examining identified code sequences for structural similarity

[Figure: example statements i=0, i++, f(i), k=0, k++, f(k) illustrating the steps.]


Main problems

- How to compute similarity between two trees?
  - Use edit distance
- How to compute similarity between a new tree and an existing tree cluster?
  - Comparing with each tree in the cluster is expensive
  - Compare the new tree with an average value stored for the cluster


Anti-unification

- The anti-unifier of two trees is the most specific generalization that matches both

[Figure: the trees f(x + y, x * 2) and f(x + z, x / 2), and their anti-unifier f(x + ?, ?).]


Anti-unification features

The anti-unifier of a set of trees keeps their common features: tree structure and common labels

Anti-unification can be used to compute the edit distance between two trees E1 and E2:

θ1 and θ2 are substitutions such that E0 θ1 = E1 and E0 θ2 = E2, where E0 is the anti-unifier of E1 and E2

distance = |θ1| + |θ2|


The first phase: building clusters of statements

We use a simple one-pass clustering algorithm:

    for tree in statement_trees:
        # the best cluster is the one that is cheapest to extend (lowest add_cost)
        bestcluster = min(clusters, key=lambda c: c.add_cost(tree), default=None)
        if bestcluster is not None and bestcluster.add_cost(tree) < threshold:
            bestcluster.append(tree)
        else:
            clusters.append(Cluster(tree))


Finding the best cluster

- What add_cost function should we use?
- The cost value should be high in these cases:
  - The cluster is large and joining the new tree changes the cluster's average value significantly
  - The average value of the new cluster is far away from the tree

add_cost = n * (|au| - |au'|) + (|tree| - |au'|)

  n: the old size of the cluster
  au: the old anti-unifier of the cluster
  au': the new anti-unifier of the cluster
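The cost formula and the one-pass clustering loop can be put together in a small sketch. Several details are assumptions, not facts from the slides: trees are nested tuples, |t| is the node count with placeholders counting as zero, the threshold value 3 is arbitrary, and the names `Cluster` and `build_clusters` are hypothetical.

```python
HOLE = "?"  # placeholder used in anti-unifiers (assumed representation)


def size(tree):
    """Node count; a placeholder contributes 0 (assumption)."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 0 if tree == HOLE else 1


def anti_unify(a, b):
    """Most specific common pattern of two trees (substitutions are not tracked)."""
    if a == b:
        return a
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        return (a[0],) + tuple(anti_unify(x, y) for x, y in zip(a[1:], b[1:]))
    return HOLE


class Cluster:
    def __init__(self, tree):
        self.trees = [tree]
        self.au = tree  # anti-unifier of all trees in the cluster

    def add_cost(self, tree):
        new_au = anti_unify(self.au, tree)
        n = len(self.trees)
        # n * (|au| - |au'|) + (|tree| - |au'|)
        return n * (size(self.au) - size(new_au)) + (size(tree) - size(new_au))

    def append(self, tree):
        self.au = anti_unify(self.au, tree)
        self.trees.append(tree)


def build_clusters(statement_trees, threshold):
    clusters = []
    for tree in statement_trees:
        best = min(clusters, key=lambda c: c.add_cost(tree), default=None)
        if best is not None and best.add_cost(tree) < threshold:
            best.append(tree)
        else:
            clusters.append(Cluster(tree))
    return clusters


statements = [
    ("=", "i", 0), ("++", "i"), ("call", "f", "i"),   # i=0; i++; f(i)
    ("=", "k", 0), ("++", "k"), ("call", "f", "k"),   # k=0; k++; f(k)
]
clusters = build_clusters(statements, threshold=3)
for c in clusters:
    print(c.au, len(c.trees))
```

With these inputs the six statements end up in three clusters of two statements each, one per statement shape.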


Improving efficiency

- To avoid comparing each AST with every other AST, we use hashing
- The upper parts of the trees are hashed

[Figure: ASTs of x[0] = a + b and x[0] = a + (b + c), which share the same upper part.]
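One way to read "the upper parts of the trees are hashed" is to cut every tree at a fixed depth and use the remaining top as a dictionary key, so that only trees in the same bucket are compared. The cut depth of 2, the tuple encoding and the names `upper_part` and `bucket_by_upper_part` are assumptions for this sketch.

```python
from collections import defaultdict

CUT = "*"  # marks the position where a subtree was cut off


def upper_part(tree, depth):
    """Keep the top `depth` levels of a tree; deeper subtrees become CUT."""
    if depth == 0:
        return CUT
    if isinstance(tree, tuple):
        return (tree[0],) + tuple(upper_part(c, depth - 1) for c in tree[1:])
    return tree


def bucket_by_upper_part(trees, depth=2):
    """Group trees whose upper parts coincide (tuples are hashable dict keys)."""
    buckets = defaultdict(list)
    for tree in trees:
        buckets[upper_part(tree, depth)].append(tree)
    return buckets


t1 = ("=", ("[]", "x", 0), ("+", "a", "b"))              # x[0] = a + b
t2 = ("=", ("[]", "x", 0), ("+", "a", ("+", "b", "c")))  # x[0] = a + (b + c)
t3 = ("call", "f", "x")                                  # f(x)

buckets = bucket_by_upper_part([t1, t2, t3])
for key, members in buckets.items():
    print(key, len(members))
```

Here `t1` and `t2` land in the same bucket because they agree on the top two levels, while `t3` is never compared with them.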


Why is this not enough?

- By considering pairs from the same cluster only individually, we miss sequences of statements
- We should find all pairs of identical cluster sequences and then check them for similarity

void f() {            // cluster №1
    cin >> i;         // cluster №2
    int j = i * 100;  // cluster №3
    cout << i << j;   // cluster №4
}

void f(int j) {       // cluster №5
    cin >> i;         // cluster №2
    int j = i * 100;  // cluster №3
    cout << j;        // cluster №6
}


The second phase: finding all common subsequences

- After the first phase each statement node is marked with the ID of its cluster
- We want to find all pairs of similar sequences of cluster IDs
- We do it using suffix trees
- Only long common subsequences are considered
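The slides use suffix trees for this phase. The sketch below deliberately swaps in a simpler technique, indexing fixed-length windows of cluster IDs in a dictionary and extending each hit, only to show what the phase computes. The function name `common_runs` and the `min_len` parameter are hypothetical.

```python
from collections import defaultdict


def common_runs(seq_a, seq_b, min_len):
    """Return maximal runs (start_a, start_b, length) of identical cluster IDs."""
    # index every window of min_len IDs from seq_b by its content
    index = defaultdict(list)
    for j in range(len(seq_b) - min_len + 1):
        index[tuple(seq_b[j:j + min_len])].append(j)

    runs = []
    for i in range(len(seq_a) - min_len + 1):
        for j in index.get(tuple(seq_a[i:i + min_len]), []):
            # skip windows that merely continue a run starting one step earlier
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            length = min_len
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1
            runs.append((i, j, length))
    return runs


# Cluster IDs of the two functions from the previous slide
first = [1, 2, 3, 4]
second = [5, 2, 3, 6]
runs = common_runs(first, second, min_len=2)
print(runs)  # [(1, 1, 2)]: statements 2 and 3 of both functions
```

Each reported run is only a candidate; the third phase still has to check the underlying statement sequences for similarity.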


The third phase: finding similar sequences of statements

[Figure: the statement sequences "i=0; k=3; f(i,k)" and "k=0; n=3; f(k,n)" compared as whole sequences.]


Comparison with existing AST methods

- W. Yang, 1991: edit distance between two trees
- I. Baxter et al., 1998: hash functions on subtrees, some kind of edit distance
- V. Wahler, 2004: feature vectors comparison
- S. Evans et al., 2007: subtree patterns (similar to anti-unification), hash functions on subtrees


Clone Digger

- The tool is written in Python
- Supported languages:
  - Python (ASTs are built using the standard package "compiler")
  - Java 1.5 (parser generator ANTLR)
- The information on found clones is written to HTML with highlighting of differences
- Its application to the open-source projects NLTK and BioPython showed that 12% of their code consists of clones


Clone Digger

Clone Digger is provided under the GPL license and can be downloaded from

http://clonedigger.sourceforge.net


Thank you!