CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code

Saima Sultana Tithi
03/15/2017

Source: courses.cs.vt.edu/cs6704/spring17/slides_by_students/CCFinder-Tithi.pdf


Outline

• Overview
• Duplicated code detection process
• Advantages
• Results
• Discussion


About the paper

• Developed an algorithm to detect duplicated code in a system and implemented a tool named CCFinder (Code Clone Finder)
• Total citations: 1306
• Published in: IEEE Transactions on Software Engineering, Volume 28, Issue 7
• Publication date: July 2002
• Authors:
  § Toshihiro Kamiya, Osaka University, Japan
  § Shinji Kusumoto, Osaka University, Japan
  § Katsuro Inoue, Osaka University, Japan


Clone detecting process

Clone Detection:
Source files (input) → Lexical Analysis → Token Sequence → Transformation → Transformed Token Sequence → Match Detection → Formatting → Clone-pairs (output)


Step 1: Lexical Analysis

• Each line of the source files is divided into tokens corresponding to a lexical rule of the programming language

sum=3+2;

tokenize/parsing

Token   Token Category
sum     Identifier
=       Assignment operator
3       Integer literal
+       Addition operator
2       Integer literal
;       End of statement
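The tokenizing step shown in the table can be sketched in a few lines of C++. This is only a toy lexer written for illustration, not CCFinder's lexer: the `Token` struct, the `tokenize` function, and the category strings are assumptions made here, and only identifiers, integer literals, and single-character punctuation are handled.

```cpp
#include <cctype>
#include <string>
#include <vector>

// One token plus the source line it came from (the line is useful later,
// when matches are reported as line ranges). Hypothetical type.
struct Token {
    std::string text;
    std::string category;
    int line;
};

// Toy lexer: identifiers, integer literals, single-character punctuation.
// A real lexer follows the full lexical rules of the target language.
std::vector<Token> tokenize(const std::string& source) {
    std::vector<Token> tokens;
    int line = 1;
    std::size_t i = 0;
    while (i < source.size()) {
        unsigned char ch = static_cast<unsigned char>(source[i]);
        if (ch == '\n') { ++line; ++i; continue; }
        if (std::isspace(ch)) { ++i; continue; }
        std::size_t start = i;
        std::string category;
        if (std::isalpha(ch) || ch == '_') {
            while (i < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[i])) ||
                    source[i] == '_')) ++i;
            category = "Identifier";
        } else if (std::isdigit(ch)) {
            while (i < source.size() &&
                   std::isdigit(static_cast<unsigned char>(source[i]))) ++i;
            category = "Integer literal";
        } else {
            ++i;  // single-character token
            if (ch == '=') category = "Assignment operator";
            else if (ch == '+') category = "Addition operator";
            else if (ch == ';') category = "End of statement";
            else category = "Punctuation";
        }
        tokens.push_back({source.substr(start, i - start), category, line});
    }
    return tokens;
}
```

Calling `tokenize("sum=3+2;")` yields the six tokens of the table, each tagged with line 1.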


Step 2: Transformation

• Transformation has 2 steps
  § Transformation by Transformation Rules: The token sequence is transformed based on the transformation rules
  § Parameter replacement: After transformation by rules, each identifier related to types, variables, and constants is replaced with a special token


Example of Transformation Rules

Rule                             Before                    After
Remove namespace attributions    std::ios_base::hex        hex
Remove template parameters       vector<int>               vector
Remove accessibility keywords    protected void foo()      void foo()
Convert to compound block        if (a == 1) b = 2;        if (a == 1) { b = 2; }
…

• The authors developed transformation rules for all programming languages supported by CCFinder, which were C, C++, Java, and COBOL
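As a rough illustration of how the first three rules in the table could operate on a token sequence, here is a minimal C++ sketch. The function names and the `Tokens` alias are made up for this example, and the template-bracket detection is a deliberately simple heuristic; CCFinder's actual rule implementation is not shown in these slides.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Rule: remove template parameters, e.g. vector < int > becomes vector.
// Toy heuristic: "<" counts as a template bracket only if a matching ">"
// appears before any ";", "(", ")", "{" or "}".
Tokens removeTemplateParams(const Tokens& in) {
    static const std::set<std::string> stop = {";", "(", ")", "{", "}"};
    Tokens out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == "<") {
            int depth = 0;
            std::size_t j = i;
            bool closed = false;
            for (; j < in.size() && stop.count(in[j]) == 0; ++j) {
                if (in[j] == "<") ++depth;
                else if (in[j] == ">" && --depth == 0) { closed = true; break; }
            }
            if (closed) { i = j + 1; continue; }  // drop "<" ... ">"
        }
        out.push_back(in[i]);
        ++i;
    }
    return out;
}

// Rule: remove namespace attributions, i.e. drop every "name ::" prefix,
// so std :: ios_base :: hex becomes hex.
Tokens removeNamespaces(const Tokens& in) {
    Tokens out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i + 1 < in.size() && in[i + 1] == "::") { ++i; continue; }
        out.push_back(in[i]);
    }
    return out;
}

// Rule: remove accessibility keywords.
Tokens removeAccessKeywords(const Tokens& in) {
    static const std::set<std::string> keywords = {"public", "protected", "private"};
    Tokens out;
    for (const std::string& t : in)
        if (keywords.count(t) == 0) out.push_back(t);
    return out;
}

// Apply the rules in sequence; templates first, so that set<int>::x
// reduces to set::x and then to x.
Tokens applyRules(const Tokens& in) {
    return removeAccessKeywords(removeNamespaces(removeTemplateParams(in)));
}
```

The order matters in this sketch: removing template parameters before namespace attributions lets `set<int>::const_iterator` collapse to `const_iterator`.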


Step 2: Transformation

Sample code:

 1.  void print_numbers (const set<int>& s) {
 2.    int c = 0;
 3.    set<int>::const_iterator i = s.begin();
 4.    for (; i != s.end(); ++i) {
 5.      cout << c << ", "
 6.           << *i << endl;
 7.      ++c;
 8.    }
 9.  }
10.  void print_lines (const vector<string>& v) {
11.    int c = 0;
12.    vector<string>::const_iterator i = v.begin();
13.    for (; i != v.end(); ++i) {
14.      cout << c << ", "
15.           << *i << endl;
16.      ++c;
17.    }
18.  }

Transformed code by transformation rules:

 1.  void print_numbers (const set & s) {
 2.    int c = 0;
 3.    const_iterator i = s.begin();
 4.    for (; i != s.end(); ++i) {
 5.      cout << c << ", "
 6.           << *i << endl;
 7.      ++c;
 8.    }
 9.  }
10.  void print_lines (const vector & v) {
11.    int c = 0;
12.    const_iterator i = v.begin();
13.    for (; i != v.end(); ++i) {
14.      cout << c << ", "
15.           << *i << endl;
16.      ++c;
17.    }
18.  }


Step 2: Transformation

Transformed code by transformation rules:

 1.  void print_numbers (const set & s) {
 2.    int c = 0;
 3.    const_iterator i = s.begin();
 4.    for (; i != s.end(); ++i) {
 5.      cout << c << ", "
 6.           << *i << endl;
 7.      ++c;
 8.    }
 9.  }
10.  void print_lines (const vector & v) {
11.    int c = 0;
12.    const_iterator i = v.begin();
13.    for (; i != v.end(); ++i) {
14.      cout << c << ", "
15.           << *i << endl;
16.      ++c;
17.    }
18.  }

The code after parameter replacement:

 1.  $p $p ($p $p & $p) {
 2.    $p $p = $p;
 3.    $p $p = $p.$p();
 4.    for (; $p != $p.$p(); ++$p) {
 5.      $p << $p << $p
 6.         << *$p << $p;
 7.      ++$p;
 8.    }
 9.  }
10.  $p $p ($p $p & $p) {
11.    $p $p = $p;
12.    $p $p = $p.$p();
13.    for (; $p != $p.$p(); ++$p) {
14.      $p << $p << $p
15.         << *$p << $p;
16.      ++$p;
17.    }
18.  }
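Parameter replacement as shown in the listing can be approximated with a short C++ sketch. It assumes the code is already tokenized; the function name `replaceParameters` and the small set of control keywords that are left untouched are choices made for this illustration, guided only by the fact that `for` survives in the listing while names and literals become `$p`.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Replace every identifier-like or literal token with "$p". Punctuation
// and an assumed set of control keywords are kept unchanged.
Tokens replaceParameters(const Tokens& in) {
    static const std::set<std::string> kept =
        {"for", "if", "else", "while", "do", "return"};
    Tokens out;
    for (const std::string& t : in) {
        unsigned char first =
            t.empty() ? '\0' : static_cast<unsigned char>(t[0]);
        // Names start with a letter or underscore; literals start with a
        // digit or a double quote.
        bool isParameter = std::isalnum(first) || first == '_' || first == '"';
        out.push_back(isParameter && kept.count(t) == 0 ? "$p" : t);
    }
    return out;
}
```

For the tokens of `int c = 0;` this produces `$p $p = $p ;`, matching line 2 of the listing.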


Step 3: Match Detection

• Detect similar code segments based on a suffix-tree matching algorithm

Transformed Token Sequence:
$p $p ($p $p & $p) { $p $p = $p; $p $p = $p.$p(); for (; $p != $p.$p(); ++$p) { $p << $p << $p << *$p << $p; ++$p; } } $p $p ($p $p & $p) { $p $p = $p; $p $p = $p.$p(); for (; $p != $p.$p(); ++$p) { $p << $p << $p << *$p << $p; ++$p; } }

Create suffix tree from input sequence

Longest common subsequence (a contiguous run of tokens that occurs twice in the sequence):
• $p $p ($p $p & $p) { $p $p = $p; $p $p = $p.$p(); for (; $p != $p.$p(); ++$p) { $p << $p << $p << *$p << $p; ++$p; } }
• $p $p ($p $p & $p) { $p $p = $p; $p $p = $p.$p(); for (; $p != $p.$p(); ++$p) { $p << $p << $p << *$p << $p; ++$p; } }
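To make the goal of this step concrete, the following C++ sketch finds the longest run of tokens that occurs twice without overlapping. It deliberately swaps in a simple quadratic dynamic program in place of the suffix tree named on the slide, so it illustrates what is being searched for rather than how CCFinder searches; `Match` and `longestRepeatedRun` are names invented here.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Tokens = std::vector<std::string>;

// Two equal, non-overlapping token runs: [first, first + length) and
// [second, second + length).
struct Match {
    std::size_t first;
    std::size_t second;
    std::size_t length;
};

// Longest repeated, non-overlapping run of tokens, found with an O(n^2)
// dynamic program (NOT the suffix tree used by CCFinder).
Match longestRepeatedRun(const Tokens& t) {
    const std::size_t n = t.size();
    // prev[j] / curr[j]: length of the common run ending at tokens
    // i-1 and j-1, capped so the two runs do not overlap.
    std::vector<std::size_t> prev(n + 1, 0), curr(n + 1, 0);
    Match best{0, 0, 0};
    for (std::size_t i = 1; i <= n; ++i) {
        std::fill(curr.begin(), curr.end(), 0);
        for (std::size_t j = i + 1; j <= n; ++j) {
            if (t[i - 1] == t[j - 1]) {
                curr[j] = std::min(prev[j - 1] + 1, j - i);
                if (curr[j] > best.length)
                    best = {i - curr[j], j - curr[j], curr[j]};
            }
        }
        std::swap(prev, curr);
    }
    return best;
}
```

On a sequence made of the same function body repeated twice, the result covers the first copy and the second copy in full, which is the situation shown on the slide. The quadratic cost is the reason a linear-time structure such as a suffix tree is preferred for large inputs.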


Step 4: Formatting

• From the output of the suffix-tree matching algorithm, all clones are converted to line numbers of the original code
• Here, lines 1-9 and lines 10-18 are a clone pair

 1.  void print_numbers (const set<int>& s) {
 2.    int c = 0;
 3.    set<int>::const_iterator i = s.begin();
 4.    for (; i != s.end(); ++i) {
 5.      cout << c << ", "
 6.           << *i << endl;
 7.      ++c;
 8.    }
 9.  }
10.  void print_lines (const vector<string>& v) {
11.    int c = 0;
12.    vector<string>::const_iterator i = v.begin();
13.    for (; i != v.end(); ++i) {
14.      cout << c << ", "
15.           << *i << endl;
16.      ++c;
17.    }
18.  }
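One way to picture the formatting step is a lookup from token positions back to source lines. The C++ sketch below assumes that each token kept its 1-based line number through lexical analysis and transformation (the `lineOfToken` vector); that bookkeeping, and the function names, are assumptions of this example rather than details taken from the paper.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Line range covered by `length` tokens starting at token index `start`.
// lineOfToken[i] is the 1-based source line of token i. Requires
// length >= 1 and start + length <= lineOfToken.size().
std::string lineRange(const std::vector<int>& lineOfToken,
                      std::size_t start, std::size_t length) {
    int firstLine = lineOfToken[start];
    int lastLine = lineOfToken[start + length - 1];
    return std::to_string(firstLine) + "-" + std::to_string(lastLine);
}

// Report a clone pair in terms of original line numbers.
std::string formatClonePair(const std::vector<int>& lineOfToken,
                            std::size_t first, std::size_t second,
                            std::size_t length) {
    return "lines " + lineRange(lineOfToken, first, length) +
           " and lines " + lineRange(lineOfToken, second, length);
}
```

With a match whose first run spans tokens on lines 1 to 9 and whose second run spans tokens on lines 10 to 18, `formatClonePair` returns the clone pair in the same form as the bullet above.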


Advantages of using transformation step

public class MultiButtonUI extends ButtonUI {
    public static ComponentUI createUI(JComponent a) {
        ComponentUI mui = new MultiButtonUI();
        return MultiLookAndFeel.createUIs(mui,
                ((MultiButtonUI)mui).uis, a);
    }

public class MultiColorChooserUI extends ColorChooserUI {
    public static ComponentUI createUI(JComponent a) {
        ComponentUI mui = new MultiColorChooserUI();
        return MultiLookAndFeel.createUIs(mui,
                ((MultiColorChooserUI)mui).uis, a);
    }

The two excerpts differ only in their class names, so after parameter replacement they yield the same token sequence and can be detected as a clone pair.


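Advantages of using transformation step

for (int i = 0; i < n; i++) {
    if (a == 1) {
        b = 2; }
}

for (int i = 0; i < n; i++) {
    if (a == 1)
        b = 2;
}

The two loops differ only in the braces around the if body, so the "convert to compound block" rule maps them to the same token sequence.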
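A toy version of the "convert to compound block" rule shows why the braced and unbraced loops end up identical. This C++ sketch works on a token sequence, handles only a single statement after `if (...)`, and assumes balanced parentheses; the name `toCompoundBlock` is hypothetical and the real rule set is certainly more thorough.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Wrap the single statement that follows if (...) in braces. Assumes
// the statement ends at the next ";" (no nested if/for bodies).
Tokens toCompoundBlock(const Tokens& in) {
    Tokens out;
    std::size_t i = 0;
    while (i < in.size()) {
        out.push_back(in[i]);
        if (in[i] == "if" && i + 1 < in.size() && in[i + 1] == "(") {
            // Copy the parenthesized condition.
            int depth = 0;
            std::size_t j = i + 1;
            for (; j < in.size(); ++j) {
                out.push_back(in[j]);
                if (in[j] == "(") ++depth;
                else if (in[j] == ")" && --depth == 0) break;
            }
            i = j + 1;  // first token after the condition
            if (i < in.size() && in[i] != "{") {
                out.push_back("{");
                while (i < in.size() && in[i] != ";") out.push_back(in[i++]);
                if (i < in.size()) out.push_back(in[i++]);  // the ";"
                out.push_back("}");
            }
            continue;
        }
        ++i;
    }
    return out;
}
```

Applied to the tokens of `if (a == 1) b = 2;`, the function produces the tokens of `if (a == 1) { b = 2; }`; applied to the already braced form, it leaves the tokens unchanged, so both inputs compare equal afterwards.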


Implementation

• Implemented in C++
• Supports 4 programming languages: C, C++, Java, COBOL
• Time and space complexity is O(n), where n is the total length of the source files


Results

• Applied CCFinder on FreeBSD 4.0 (2.2M lines), Linux 2.4 (2.4M lines), NetBSD 1.5 (2.6M lines)
• Time: 108 minutes

                    Clone classes   Coverage (% LOC)               Coverage (% file)
FreeBSD & Linux     1,091           0.8% FreeBSD, 0.9% Linux       3.1% FreeBSD, 4.6% Linux
FreeBSD & NetBSD    25,621          18.6% FreeBSD, 15.2% NetBSD    40.1% FreeBSD, 36.1% NetBSD
Linux & NetBSD      1,000           0.6% Linux, 0.6% NetBSD        3.3% Linux, 2.1% NetBSD


Figure: Scatter plot of clone pairs having at least 30 same tokens (about 13 lines)


Later Works

• Based on CCFinder, the authors developed another tool, AIST-CCFinderX, in 2005
• CCFinderX is freely available from: http://www.ccfinder.net/ccfinderxos.html, https://github.com/petersenna/ccfinderx-core
• Some other tools from the authors of CCFinder:
  • D-CCFinder (distributed CCFinder)
  • Gemini (adds a GUI to view the output of CCFinder)
  • Aries (refactors code based on clone detection)
  • Agec (clone detection from Java bytecode)


Discussion

• Strengths of this paper
  • Clear explanation of the method
  • Applies the tool to different codebases and shows all the results in terms of time profile, memory profile, number of clone pairs, and percentage of clones


Discussion

• Weaknesses of this paper:
  • Did not compare CCFinder with other existing tools with respect to running time or memory consumption
  • Did not apply CCFinder to any benchmark dataset and calculate the accuracy of the result


Discussion

• How to improve CCFinder?
  • To compute token sequence matching, CCFinder uses a suffix-tree based matching algorithm, but a suffix tree is not space efficient for large codebases. According to the authors of SourcererCC, CCFinder runs out of memory for large codebases. Because a suffix-array based matching algorithm is more space efficient, it could be used instead of the suffix-tree based one.
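The suffix-array idea suggested above can be sketched as follows. This is a minimal illustration under stated assumptions, not a drop-in replacement for CCFinder's matcher: the array is built with a plain comparison sort, which is simple but slow, and the function names are invented here. The point of the sketch is the data layout, one integer per token, compared with the node and edge structures a suffix tree needs.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Suffix array: start indices of all suffixes of `t`, in lexicographic
// order. Naive construction by comparison sort; a production tool would
// use a dedicated O(n log n) or linear-time construction algorithm.
std::vector<std::size_t> buildSuffixArray(const Tokens& t) {
    std::vector<std::size_t> sa(t.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&t](std::size_t a, std::size_t b) {
        return std::lexicographical_compare(t.begin() + a, t.end(),
                                            t.begin() + b, t.end());
    });
    return sa;
}

// Length of the longest run of tokens that occurs at least twice
// (occurrences may overlap): the longest common prefix of two suffixes
// that are adjacent in the suffix array.
std::size_t longestRepeat(const Tokens& t) {
    std::vector<std::size_t> sa = buildSuffixArray(t);
    std::size_t best = 0;
    for (std::size_t k = 1; k < sa.size(); ++k) {
        std::size_t a = sa[k - 1], b = sa[k], len = 0;
        while (a + len < t.size() && b + len < t.size() &&
               t[a + len] == t[b + len]) ++len;
        best = std::max(best, len);
    }
    return best;
}
```

Repeated runs always end up next to each other in the sorted order, so scanning adjacent entries is enough to find them. A clone detector built this way would still need a minimum-length threshold and the mapping back to line numbers.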

Thank You!