CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code
Saima Sultana Tithi
03/15/2017
About the paper
• Developed an algorithm to detect duplicated code in a system and implemented a tool named CCFinder (Code Clone Finder)
• Total citations: 1306
• Published in: IEEE Transactions on Software Engineering, Volume 28, Issue 7
• Publication date: July 2002
• Authors:
§ Toshihiro Kamiya, Osaka University, Japan
§ Shinji Kusumoto, Osaka University, Japan
§ Katsuro Inoue, Osaka University, Japan
Clone detecting process
Clone Detection:
Source files (input)
→ Lexical Analysis
→ Token Sequence
→ Transformation
→ Transformed Token Sequence
→ Match Detection
→ Formatting
→ Clone-pairs (output)
Step 1: Lexical Analysis
• Each line of source files is divided into tokens corresponding to a lexical rule of the programming language

Example: tokenizing/parsing the statement sum=3+2;

Token   Token Category
sum     Identifier
=       Assignment operator
3       Integer literal
+       Addition operator
2       Integer literal
;       End of statement
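The tokenization shown in the table can be sketched in a few lines of Python. This is only an illustration, not CCFinder's lexer: the category table `TOKEN_SPEC` and the function name `tokenize` are hypothetical and cover just the tokens of the example statement, whereas the real tool follows the full lexical rules of each supported language.

```python
import re

# Hypothetical token categories for a tiny C-like subset (illustration only).
TOKEN_SPEC = [
    ("Integer literal", r"\d+"),
    ("Identifier", r"[A-Za-z_]\w*"),
    ("Assignment operator", r"="),
    ("Addition operator", r"\+"),
    ("End of statement", r";"),
    ("Whitespace", r"\s+"),
]
# One alternation with a named group per category: g0, g1, ...
MASTER = re.compile(
    "|".join(f"(?P<g{i}>{pat})" for i, (_, pat) in enumerate(TOKEN_SPEC))
)


def tokenize(line):
    """Split one source line into (token, category) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(line):
        m = MASTER.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character {line[pos]!r} at column {pos}")
        category = TOKEN_SPEC[int(m.lastgroup[1:])][0]
        if category != "Whitespace":
            tokens.append((m.group(), category))
        pos = m.end()
    return tokens


for token, category in tokenize("sum=3+2;"):
    print(f"{token:5} {category}")
```

Because whitespace is discarded, `sum=3+2;` and `sum = 3 + 2;` yield the same token sequence, which is what makes token-based comparison insensitive to layout.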
Step 2: Transformation
• Transformation has 2 steps
§ Transformation by Transformation Rules: The token sequence is transformed based on the transformation rules
§ Parameter replacement: After transformation by rules, each identifier related to types, variables, and constants is replaced with a special token
Example of Transformation Rules

Rule                            Before                  After
Remove namespace attributions   std::ios_base::hex      hex
Remove template parameters      vector<int>             vector
Remove accessibility keywords   protected void foo()    void foo()
Convert to compound block       if (a == 1) b = 2;      if (a == 1) { b = 2; }
…

• The authors developed transformation rules for all programming languages supported by CCFinder, which were C, C++, Java, and COBOL
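A rough Python sketch of the first three rules in the table, applied to a token list. All function names here are hypothetical, and the template rule is a heuristic chosen for this illustration (it treats `<...>` as a template argument list only when it follows a name and contains nothing but names, commas and `::`); the actual CCFinder rules are language-specific and more careful.

```python
import re

TOKEN_RE = re.compile(r"::|[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")
ACCESS_KEYWORDS = {"public", "protected", "private"}


def remove_template_parameters(tokens):
    """Drop a balanced <...> list that follows a name and holds only names."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "<" and out and IDENT_RE.fullmatch(out[-1]):
            depth, j = 0, i
            while j < len(tokens):
                tok = tokens[j]
                if tok == "<":
                    depth += 1
                elif tok == ">":
                    depth -= 1
                    if depth == 0:
                        break
                elif not (IDENT_RE.fullmatch(tok) or tok in {",", "::"}):
                    j = len(tokens)  # e.g. `i < n;` is a comparison, keep it
                    break
                j += 1
            if j < len(tokens):
                i = j + 1  # skip the whole <...> list
                continue
        out.append(tokens[i])
        i += 1
    return out


def remove_namespace_attributions(tokens):
    """Drop every `qualifier ::` prefix so only the final name survives."""
    out = []
    for tok in tokens:
        if tok == "::":
            if out:
                out.pop()  # discard the qualifier that preceded "::"
        else:
            out.append(tok)
    return out


def remove_accessibility_keywords(tokens):
    return [tok for tok in tokens if tok not in ACCESS_KEYWORDS]


def transform(code):
    """Apply the three rules; templates go first so `set<int>::x` reduces to `x`."""
    tokens = TOKEN_RE.findall(code)
    tokens = remove_template_parameters(tokens)
    tokens = remove_namespace_attributions(tokens)
    return remove_accessibility_keywords(tokens)


print(" ".join(transform("std::ios_base::hex")))
print(" ".join(transform("protected void foo()")))
```

The order of the rules matters in this sketch: removing template parameters before namespace attributions is what lets `set<int>::const_iterator` collapse to `const_iterator`.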
Step 2: Transformation

Sample code:
1. void print_numbers (const set<int>& s) {
2.     int c = 0;
3.     set<int>::const_iterator i = s.begin();
4.     for (; i != s.end(); ++i) {
5.         cout << c << ", "
6.              << *i << endl;
7.         ++c;
8.     }
9. }
10. void print_lines (const vector<string>& v) {
11.     int c = 0;
12.     vector<string>::const_iterator i = v.begin();
13.     for (; i != v.end(); ++i) {
14.         cout << c << ", "
15.              << *i << endl;
16.         ++c;
17.     }
18. }

Code transformed by the transformation rules:
1. void print_numbers (const set & s) {
2.     int c = 0;
3.     const_iterator i = s.begin();
4.     for (; i != s.end(); ++i) {
5.         cout << c << ", "
6.              << *i << endl;
7.         ++c;
8.     }
9. }
10. void print_lines (const vector & v) {
11.     int c = 0;
12.     const_iterator i = v.begin();
13.     for (; i != v.end(); ++i) {
14.         cout << c << ", "
15.              << *i << endl;
16.         ++c;
17.     }
18. }
Step 2: Transformation

The code after parameter replacement (applied to the code transformed by the transformation rules):
1. $p $p ($p $p & $p) {
2.     $p $p = $p;
3.     $p $p = $p.$p();
4.     for (; $p != $p.$p(); ++$p) {
5.         $p << $p << $p
6.            << *$p << $p;
7.         ++$p;
8.     }
9. }
10. $p $p ($p $p & $p) {
11.     $p $p = $p;
12.     $p $p = $p.$p();
13.     for (; $p != $p.$p(); ++$p) {
14.         $p << $p << $p
15.            << *$p << $p;
16.         ++$p;
17.     }
18. }
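Parameter replacement can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: the set `KEPT_KEYWORDS` is a guess based on the slide (where `for` survives but `void`, `int`, `cout` and the literals all become `$p`), and `replace_parameters` is a hypothetical name.

```python
import re

# String literals, a few multi-character operators, names, numbers, then any
# other single non-space character.
TOKEN_RE = re.compile(r'"[^"\n]*"|<<|\+\+|!=|[A-Za-z_]\w*|\d+|\S')
NAME_RE = re.compile(r"[A-Za-z_]\w*")

# Assumption: control-flow keywords are kept, everything name-like is replaced.
KEPT_KEYWORDS = {"for", "if", "else", "while", "do", "return", "switch", "case"}


def replace_parameters(code):
    """Replace names and literals with the special token `$p`."""
    out = []
    for tok in TOKEN_RE.findall(code):
        is_name = NAME_RE.fullmatch(tok) is not None
        is_literal = tok[0].isdigit() or tok[0] == '"'
        if (is_name and tok not in KEPT_KEYWORDS) or is_literal:
            out.append("$p")
        else:
            out.append(tok)
    return out


a = replace_parameters("void print_numbers (const set & s) {")
b = replace_parameters("void print_lines (const vector & v) {")
print(" ".join(a))
print(a == b)
```

After this step the two function headers are token-for-token identical, which is why the later matching step can report them as clones even though every name differs.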
Step 3: Match Detection
• Detect similar code segments based on a suffix-tree matching algorithm

Transformed Token Sequence:
$p $p ($p $p & $p) { $p $p = $p; $p $p = $p.$p(); for (; $p != $p.$p(); ++$p) { $p << $p << $p << *$p << $p; ++$p; } }
$p $p ($p $p & $p) { $p $p = $p; $p $p = $p.$p(); for (; $p != $p.$p(); ++$p) { $p << $p << $p << *$p << $p; ++$p; } }

Create suffix tree from input sequence

Longest common subsequence (a contiguous run of tokens that occurs at two positions of the sequence):
• $p $p ($p $p & $p) { $p $p = $p; $p $p = $p.$p(); for (; $p != $p.$p(); ++$p) { $p << $p << $p << *$p << $p; ++$p; } }
• $p $p ($p $p & $p) { $p $p = $p; $p $p = $p.$p(); for (; $p != $p.$p(); ++$p) { $p << $p << $p << *$p << $p; ++$p; } }
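To make the idea concrete, here is a small Python sketch that finds the longest repeated token run with an uncompressed suffix trie. This is a deliberately naive stand-in for a real suffix tree: it uses quadratic time and memory, does not filter overlapping copies, and the names `build_suffix_trie` and `longest_repeated_run` are hypothetical.

```python
def build_suffix_trie(tokens):
    """Insert every suffix; each node records the suffix starts passing through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:]:
            node = node["children"].setdefault(tok, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def longest_repeated_run(tokens):
    """Return (length, start positions) of the longest run occurring at least twice."""
    best_len, best_starts = 0, []
    stack = [(build_suffix_trie(tokens), 0)]
    while stack:
        node, depth = stack.pop()
        if len(node["starts"]) >= 2 and depth > best_len:
            best_len, best_starts = depth, node["starts"]
        for child in node["children"].values():
            # Only paths shared by two or more suffixes can be repeats.
            if len(child["starts"]) >= 2:
                stack.append((child, depth + 1))
    return best_len, best_starts


tokens = "$p $p = $p ; return $p ; $p $p = $p ;".split()
length, starts = longest_repeated_run(tokens)
print(length, starts)
```

A node at depth d that is reached by two or more suffixes corresponds to a run of d tokens that appears at each of those start positions, so the deepest such node is the longest repeated run.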
Step 4: Formatting
• From the output of the suffix-tree matching algorithm, all clones are converted to line numbers of the original code
• Here, lines 1-9 and lines 10-18 form a clone pair

1. void print_numbers (const set<int>& s) {
2.     int c = 0;
3.     set<int>::const_iterator i = s.begin();
4.     for (; i != s.end(); ++i) {
5.         cout << c << ", "
6.              << *i << endl;
7.         ++c;
8.     }
9. }
10. void print_lines (const vector<string>& v) {
11.     int c = 0;
12.     vector<string>::const_iterator i = v.begin();
13.     for (; i != v.end(); ++i) {
14.         cout << c << ", "
15.              << *i << endl;
16.         ++c;
17.     }
18. }
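One way to picture the formatting step is to remember, for every token, the line it came from, and then translate a matched token range back into a line range. The sketch below assumes this bookkeeping; `tokens_with_lines` and `to_line_range` are hypothetical names and the tokenizer is a simplified stand-in.

```python
import re

TOKEN_RE = re.compile(r'"[^"\n]*"|[A-Za-z_]\w*|\d+|\S')


def tokens_with_lines(source):
    """Tokenize the source, pairing each token with its 1-based line number."""
    result = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for tok in TOKEN_RE.findall(line):
            result.append((tok, line_no))
    return result


def to_line_range(tokens, start, length):
    """Map a token range (start index, token count) to (first line, last line)."""
    first = tokens[start][1]
    last = tokens[start + length - 1][1]
    return first, last


source = "int a = 0;\n++a;\nint b = 0;\n++b;"
tokens = tokens_with_lines(source)
# Suppose matching reported two copies of 9 tokens, starting at tokens 0 and 9.
print(to_line_range(tokens, 0, 9), to_line_range(tokens, 9, 9))
```

In this toy input each pair of lines contributes 9 tokens, so the two token ranges map back to lines 1-2 and lines 3-4, in the same way the slide reports lines 1-9 and 10-18.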
Advantages of using the transformation step
• The two classes below differ only in identifier names, so after parameter replacement they become the same token sequence and are detected as clones

public class MultiButtonUI extends ButtonUI {
    public static ComponentUI createUI(JComponent a) {
        ComponentUI mui = new MultiButtonUI();
        return MultiLookAndFeel.createUIs(mui,
            ((MultiButtonUI)mui).uis, a);
    }

public class MultiColorChooserUI extends ColorChooserUI {
    public static ComponentUI createUI(JComponent a) {
        ComponentUI mui = new MultiColorChooserUI();
        return MultiLookAndFeel.createUIs(mui,
            ((MultiColorChooserUI)mui).uis, a);
    }
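Advantages of using the transformation step
• The two loops below differ only in the braces around the if body, so after the "convert to compound block" rule they become the same token sequence

for (int i = 0; i < n; i++) {
    if (a == 1) {
        b = 2; }
}

for (int i = 0; i < n; i++) {
    if (a == 1)
        b = 2;
}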
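The effect on these two loops can be checked with a small Python sketch of the "convert to compound block" rule. It is a simplification written for this example only: it handles a single brace-less simple statement after `if (...)`, not nested brace-less statements, and `to_compound_blocks` is a hypothetical name.

```python
import re

TOKEN_RE = re.compile(r"==|\+\+|[A-Za-z_]\w*|\d+|\S")


def lex(code):
    return TOKEN_RE.findall(code)


def to_compound_blocks(tokens):
    """Wrap a brace-less single statement after `if (...)` in `{` and `}`."""
    out, i, n = [], 0, len(tokens)
    while i < n:
        out.append(tokens[i])
        if tokens[i] == "if" and i + 1 < n and tokens[i + 1] == "(":
            # Copy the parenthesised condition up to its matching ")".
            depth, j = 0, i + 1
            while j < n:
                out.append(tokens[j])
                if tokens[j] == "(":
                    depth += 1
                elif tokens[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            i = j + 1
            if i < n and tokens[i] != "{":
                out.append("{")
                # Copy the statement through its terminating ";".
                while i < n:
                    out.append(tokens[i])
                    i += 1
                    if tokens[i - 1] == ";":
                        break
                out.append("}")
            continue
        i += 1
    return out


braced = lex("for (int i = 0; i < n; i++) { if (a == 1) { b = 2; } }")
bare = lex("for (int i = 0; i < n; i++) { if (a == 1) b = 2; }")
print(braced == bare)
print(to_compound_blocks(braced) == to_compound_blocks(bare))
```

Before the rule the two token sequences differ by the extra brace pair; after it they are equal, so a token-based matcher treats the two loops as the same code.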
Implementation
• Implemented in C++
• Supports 4 programming languages: C, C++, Java, COBOL
• Time and space complexity is O(n), where n is the total length of the source file
Results
• Applied CCFinder on FreeBSD 4.0 (2.2M lines), Linux 2.4 (2.4M lines), NetBSD 1.5 (2.6M lines)
• Time: 108 minutes

                   Clone classes   Coverage (% LOC)              Coverage (% file)
FreeBSD & Linux    1,091           0.8% FreeBSD, 0.9% Linux      3.1% FreeBSD, 4.6% Linux
FreeBSD & NetBSD   25,621          18.6% FreeBSD, 15.2% NetBSD   40.1% FreeBSD, 36.1% NetBSD
Linux & NetBSD     1,000           0.6% Linux, 0.6% NetBSD       3.3% Linux, 2.1% NetBSD
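The slides do not define the coverage columns. Assuming Coverage (% LOC) means the percentage of lines that contain any part of a clone, it could be computed from reported clone line ranges roughly as below; `loc_coverage` is a hypothetical helper and the inputs are made-up toy values, not data from the paper.

```python
def loc_coverage(total_lines, clone_ranges):
    """Percentage of lines covered by at least one clone.

    clone_ranges holds inclusive (first_line, last_line) pairs; overlapping
    ranges are counted once.
    """
    covered = set()
    for first, last in clone_ranges:
        covered.update(range(first, last + 1))
    return 100.0 * len(covered) / total_lines


# Toy input: a 20-line file with two overlapping clone ranges.
print(loc_coverage(20, [(1, 5), (4, 8)]))
```

Counting distinct lines rather than summing range lengths keeps overlapping clones from pushing the figure above 100%.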
Later Works
• Based on CCFinder, the authors developed another tool, AIST-CCFinderX, in 2005
• CCFinderX is freely available from: http://www.ccfinder.net/ccfinderxos.html, https://github.com/petersenna/ccfinderx-core
• Some other tools from the authors of CCFinder:
§ D-CCFinder (distributed CCFinder)
§ Gemini (add GUI to view the output of CCFinder)
§ Aries (refactor code based on clone detection)
§ Agec (clone detection from Java bytecode)
Discussion
• Strengths of this paper
§ Clear explanation of the method
§ Applies the tool on different code bases and shows all the results in terms of time profile, memory profile, number of clone pairs, and percentage of clones
• Weaknesses of this paper:
§ Did not compare CCFinder with other existing tools with respect to running time or memory consumption
§ Did not apply CCFinder on any benchmark dataset to calculate the accuracy of the results
• How to improve CCFinder?
§ To compute token sequence matching, CCFinder uses a suffix-tree based matching algorithm, but a suffix tree is not space efficient for large code bases. According to the authors of SourcererCC, CCFinder runs out of memory for large code bases. Because a suffix-array based matching algorithm is more space efficient, it could be used instead of the suffix-tree based one.
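As a rough illustration of the suffix-array idea, the Python sketch below finds the longest repeated token run by sorting suffix start positions and comparing neighbours. It is not taken from CCFinder or SourcererCC, the function names are hypothetical, and the construction is naive: sorting by `tokens[i:]` builds temporary slices, which a production suffix-array algorithm would avoid. The finished structure itself is just one integer per token.

```python
def suffix_array(tokens):
    """Start positions of all suffixes, sorted by the suffix they denote (naive)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def common_prefix_length(tokens, a, b):
    """Number of equal tokens at the start of the suffixes beginning at a and b."""
    n, k = len(tokens), 0
    while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
        k += 1
    return k


def longest_repeated_run(tokens):
    """Return (length, (start_a, start_b)); copies may overlap in this sketch."""
    sa = suffix_array(tokens)
    best_len, best_pair = 0, None
    # The longest repeat is always shared by two neighbours in sorted order.
    for a, b in zip(sa, sa[1:]):
        k = common_prefix_length(tokens, a, b)
        if k > best_len:
            best_len, best_pair = k, (min(a, b), max(a, b))
    return best_len, best_pair


print(longest_repeated_run("a b c x a b c".split()))
```

Only adjacent entries need to be compared because any two suffixes sharing a prefix sit in one contiguous block of the sorted order, and the neighbours inside that block share at least that prefix.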