Analysis and Preliminary Thoughts in Model Clone Detection Wenjun Luo, Xiaochi Ma, Jinglei Xu College of Computer Science and Software Engineering Shenzhen University Dec. 18, 2011
Mar 09, 2016
Abstract
In this work, we present the idea of a fingerprint-based algorithm for model clone detection in
graph-based dataflow models. Exact clone detection means enumerating all maximal, isomorphic,
disjoint and connected sub graphs. Sub graphs that are not isomorphic but have the same or a
similar structure are what we call approximate clones. Our algorithm is intended to handle both
exact and approximate clone detection. Because clone detection in graphs is known to be
NP-complete, no polynomial-time solution for it is known.
Since our algorithm is not yet fully implemented, this work mostly presents its ideas and
compares them with the existing algorithms in this field.
Keywords: model clone, clone detection, fingerprint-based, LSH
Contents
1. Introduction
    1.1 Occurrence of clone and clone detection
    1.2 Clones in models
    1.3 Advantages and disadvantages of cloning
    1.4 Differences between model cloning and code cloning
2. Existing Algorithm Analysis
    2.1 Index-based Model Clone Detection
    2.2 Automotive Model Clone Detection
    2.3 E-Scan and A-Scan
    2.4 Analysis
3. Preliminary Thought
    3.1 Locality-Sensitive Hashing
    3.2 Fingerprint
    3.3 Preliminary Approach
    3.4 Analysis
4. Conclusion
5. References
1. Introduction
The concept of a clone was first introduced by biologists doing research on genes. Later,
phenomena and operations that share the characteristics of gene cloning also came to be called
clones. The clones discussed here are those of the IT industry, not the biological ones.
1.1 Occurrence of clone and clone detection
When you analyze your code after writing it, you may find that many code fragments are reused in
many places in your program. Such a fragment may be turned into a function and put into a package
for later use. The package formed this way is called the mother-body, and a clone is a piece of
code in your program that is the same as the mother-body. The procedure of finding these clones
is called clone detection.
1.2 Clones in models
Some large industrial enterprises, such as BMW, a well-known German car manufacturer, find it
difficult to improve and update the core systems of their cars, because a small change can force
a rebuild of the whole system, which costs a lot of money and time. BMW designs its cars from
many different models, which are the components of a car. Each model exists in several versions,
and different classes of car may use different versions of a model. So if one place in a model
changes, the change may have to be repeated in every version and every car class that contains a
copy. This is much more complicated than clones in code.
1.3 Advantages and disadvantages of cloning
There is little doubt that reuse through cloning can make our code clearer and easier to fix, and
enterprises reduce their costs by cloning their products. However, it also creates the problem
that people may use code without authorization, which we call plagiarism. The problem becomes
much worse when updating a product line in a factory, because it then lies outside the code and
people have to find the clones one by one by themselves.
1.4 Differences between model cloning and code cloning
In fact, model cloning and code cloning share the same property: replication. The main
difference between them is the level at which cloning happens. Code clones mostly occur at a low
level, while model clones occur at a higher one. Since model-driven software engineering and OOP
receive so much attention nowadays, cloning UML models or structures from one project to another
is common in the IT industry.
2. Existing Algorithm Analysis
So far, only a few algorithms exist in this field. The major algorithms for model clone
detection come from the U.S. and Germany. They are the index-based model clone detection
algorithm [1], the automotive model clone detection algorithm [2], and eScan and aScan [3]. We
briefly introduce how each of them works and then analyze them.
2.1 Index-based Model Clone Detection
In [1], an algorithm called index-based model clone detection is presented. It performs the
detection in the following steps. First, a graph is extracted from a MATLAB/Simulink model and
normalized into a directed, labeled multigraph. Blocks and lines in the original MATLAB/Simulink
model correspond to nodes and edges in the normalized graph. Then this graph is processed into a
list of sub graphs of a specific size k. After that, each sub graph in the list is inserted into
a hash table. Finally, the last step retrieves the maximal clone groups as the final result. The
whole procedure is shown in Figure 1.
Figure 1: Index-based Model Clone Detection [1]
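The indexing idea described above can be illustrated with a minimal sketch. It is our own simplification, not the implementation of [1]: the toy model, the function names and the signature scheme are assumptions made for illustration, the enumeration is brute force (exponential in general), and the final step of merging hits into maximal clone groups is omitted.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy model: node id -> block label, plus directed edges.
labels = {1: "Gain", 2: "Sum", 3: "Gain", 4: "Sum", 5: "Delay"}
edges = {(1, 2), (3, 4), (2, 5)}

def is_connected(nodes, edges):
    """Check weak connectivity of the sub graph induced by `nodes`."""
    nodes = set(nodes)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        for a, b in edges:
            if a == n and b in nodes:
                stack.append(b)
            elif b == n and a in nodes:
                stack.append(a)
    return seen == nodes

def signature(nodes, labels, edges):
    """Order-independent key built from node labels and labeled edges.
    Equal signatures are necessary, but not sufficient, for isomorphism."""
    node_part = tuple(sorted(labels[n] for n in nodes))
    edge_part = tuple(sorted((labels[a], labels[b])
                             for a, b in edges if a in nodes and b in nodes))
    return node_part, edge_part

def build_index(labels, edges, k):
    """Hash table from signature to all connected sub graphs of size k."""
    index = defaultdict(list)
    for nodes in combinations(sorted(labels), k):
        if is_connected(nodes, edges):
            index[signature(nodes, labels, edges)].append(nodes)
    return index

index = build_index(labels, edges, k=2)
# Buckets holding more than one sub graph are clone candidates.
clone_groups = [group for group in index.values() if len(group) > 1]
print(clone_groups)
```

In this toy model the two Gain-to-Sum fragments fall into the same bucket, while the Sum-to-Delay fragment stays alone. A real detector would still have to verify isomorphism inside each bucket.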
2.2 Automotive Model Clone Detection
In [2], automotive model clone detection is presented, on which the index-based model clone
detection is based. It first preprocesses and normalizes the input models. Then it extracts the
clone pairs and clusters these pairs, so that it also finds substructures that are used more
than twice in the models. To obtain an algorithm with polynomial time complexity for enumerating
maximal clone pairs in large cases, a heuristic approach, shown in Figure 2, was built.
Figure 2: Heuristic for detecting clone pairs [2]
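To give an impression of how a polynomial-time heuristic can grow a clone pair, the following is a minimal sketch of a greedy expansion from two equally labeled seed nodes. It is an assumption made for illustration and not the heuristic of [2]: the function name, the toy model and the "first matching neighbor" rule are hypothetical, and edge directions and edges between already mapped nodes are ignored.

```python
from collections import deque

# Hypothetical toy model: two structurally identical chains.
labels = {1: "Gain", 2: "Sum", 3: "Delay", 4: "Gain", 5: "Sum", 6: "Delay"}
neighbors = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4, 6}, 6: {5}}

def grow_clone_pair(seed_a, seed_b, labels, neighbors):
    """Greedily grow one clone pair from two equally labeled seed nodes."""
    mapping = {seed_a: seed_b}   # node of clone A -> node of clone B
    used = {seed_a, seed_b}      # keeps the two clones disjoint
    queue = deque([(seed_a, seed_b)])
    while queue:
        a, b = queue.popleft()
        for na in sorted(neighbors[a]):
            if na in used:
                continue
            # Take the first unused, equally labeled neighbor of b.
            match = next((nb for nb in sorted(neighbors[b])
                          if nb not in used and nb != na
                          and labels[nb] == labels[na]), None)
            if match is not None:
                mapping[na] = match
                used.update((na, match))
                queue.append((na, match))
    return mapping

pair = grow_clone_pair(1, 4, labels, neighbors)
print(pair)
```

Because each node enters the mapping at most once, the expansion is polynomial in the model size. The price is that a greedy choice may miss a larger clone pair that a different neighbor match would have found.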
2.3 E-Scan and A-Scan
In [3], two algorithms are presented: eScan for exact clone detection and aScan for approximate
clone detection. They are the core algorithms of ModelCD, an open source model clone detection
tool integrated in CONQAT. The preprocessing and normalization of the two algorithms are
similar, so we do not describe them in detail. The steps of eScan are shown in Figure 3, and the
steps of aScan in Figure 4.
Figure 3: eScan performing exact clone detection [3]
Figure 4: aScan performing approximate clone detection [3]
2.4 Analysis
The approaches presented above are the main approaches in the field of model clone detection.
Before we present our preliminary thoughts on how to improve the existing algorithms, we compare
them. Since clone detection in models has been proven to be an NP-complete problem, the feature
we care about most is the time complexity of the detection algorithm.
Table 1: Time complexity of the core algorithms of index-based detection, automotive detection,
eScan and aScan
Approach                 Time complexity (core algorithm)
Index-based Detection    ( ( ) ( )), ( )
Automotive Detection     ( )
eScan                    ( )
aScan                    ( )
We have analyzed all these algorithms (the results are shown in Table 1). The time complexity of
index-based detection consists of two main parts. The first is the complexity of enumerating
the sub graphs of a given size k, while the second is the complexity of finding maximal clone
groups. The complexity given for each algorithm covers only one or two parts of its core. From
the analysis we can see that if we focus on the correctness of the detection, we may lose time
efficiency.
We also carried out a feature analysis of these algorithms; the results are listed in Table 2.
Table 2: Feature analysis of index-based detection, automotive detection, eScan and aScan
Features                            Index-based   Automotive   eScan     aScan
                                    Detection     Detection
Exact Clone Detect                  O             O            O         X
Approximate Clone Detect            X             O            X         O
Minimal Clone Detect Size Support   O             O            O         O
Maximal Clone Detect Size Support   O             O            O         O
Speed                               GENERAL       FAST         GENERAL   GENERAL
Incremental Detect Support          O             X            O         O
Detect Correctness                  GENERAL       GOOD         BEST      NOT GOOD
Completeness                        GENERAL       NOT GOOD     BEST      GOOD
Stability                           GENERAL       BEST         GOOD      GOOD
In the table, “O” stands for yes and “X” stands for no. “GENERAL” is the lowest level in this analysis result.
As far as we can see, although each of these algorithms can handle some large cases within its
own limitations, no solution is yet satisfactory in every aspect. We are therefore considering a
better approach to clone detection, both exact and approximate.
3. Preliminary Thought
Here we present a preliminary idea for model clone detection that builds on the algorithms above
and adds new features by drawing on Locality-Sensitive Hashing (LSH) and fingerprints. In this
section, we first take a brief look at LSH and fingerprints (see 3.1 and 3.2). Then we explain
how the existing algorithms can be improved (see 3.3). Finally, we briefly analyze our idea (see
3.4).
3.1 Locality-Sensitive Hashing
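Locality-Sensitive Hashing (LSH) is an algorithm for solving the (approximate/exact) near
neighbor search problem in high-dimensional spaces [4]. Its key idea is to hash points so that
nearby points collide with higher probability than distant ones. The main algorithm of LSH is
shown in Figure 5.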
Figure 5: Algorithms for initializing a hash function h from the LSH hash family, and for
computing h(p) for a point p ∈ R^d [4]
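As an illustration of what initializing h and computing h(p) can look like, the following minimal sketch uses the widely known random-projection family h(p) = floor((a · p + b) / w). This family is our assumption for illustration; the scheme shown in Figure 5 of [4] may differ. The function names and the bucket width W are hypothetical.

```python
import math
import random

W = 4.0  # bucket width (hypothetical value)

def init_hash(d, w, rng):
    """Draw one hash function: a Gaussian direction a and an offset b in [0, w]."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    return a, b

def compute_hash(h, p, w):
    """h(p) = floor((a . p + b) / w); close points tend to share a bucket."""
    a, b = h
    dot = sum(ai * pi for ai, pi in zip(a, p))
    return math.floor((dot + b) / w)

rng = random.Random(0)          # fixed seed so that runs are repeatable
h = init_hash(d=3, w=W, rng=rng)
p = [1.0, 2.0, 3.0]
q = [1.0, 2.0, 3.0]             # identical point: always the same bucket as p
print(compute_hash(h, p, W), compute_hash(h, q, W))
```

In practice several such functions are concatenated into one key and several tables are used, which trades storage for a higher chance that true neighbors collide in at least one table.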
3.2 Fingerprint
Fingerprints were first used in biology, because a fingerprint uniquely identifies an
individual. In the field of code clone detection, a fingerprint stands for the recognizable
features of each fragment that is created [6]. Fingerprints may take the form of a dynamic array
list, and they are used to build clone clusters. They can be stored in the file system, in a
database, or just temporarily in RAM.
3.3 Preliminary Approach
Our approach brings in LSH and fingerprints. Building on the existing algorithms, it may achieve
a lower time complexity and fully support incremental detection without sacrificing detection
correctness. An overview of our system is shown in Figure 6.
Figure 6: Overview of our system based on our preliminary thoughts
In this system we first, as usual, parse the model into a graph, called the Original Graph,
whose nodes and labeled edges store the information in the model. Then we enumerate and
normalize to obtain sub graphs of size k. After that, we use LSH to create a fingerprint for
each sub graph. These fingerprints are stored in a database, which enables the system to support
incremental detection. The grouping process starts after all fingerprints have been stored in
the database. We also need a filter to ignore useless clones. Finally, when no new fingerprint
is added to the database, which means that all clone pairs have been grouped maximally, we
obtain the final result.
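The storage and grouping steps can be sketched as follows. This is only a sketch under stated assumptions, since the system is not implemented yet: the label vocabulary, the table layout and all function names are hypothetical, an in-memory SQLite database stands in for the fingerprint database, and a simple quantization of label counts stands in for the LSH-based fingerprint.

```python
import sqlite3
from collections import Counter

LABELS = ["Delay", "Gain", "Sum"]   # hypothetical block vocabulary

def feature_vector(subgraph_labels):
    """Count how often each block label occurs in a sub graph."""
    counts = Counter(subgraph_labels)
    return [counts[label] for label in LABELS]

def fingerprint(vec, bucket=1):
    """Stand-in for LSH: quantize each coordinate and join into a string.
    With bucket=1 only identical vectors collide; a larger bucket also
    groups vectors that differ slightly (approximate clones)."""
    return ",".join(str(v // bucket) for v in vec)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE fingerprints (subgraph TEXT PRIMARY KEY, fingerprint TEXT)")

def add_subgraphs(db, subgraphs, bucket=1):
    """Store fingerprints; sub graphs already in the table are skipped,
    so later runs only add what is new (incremental detection)."""
    for name, labels in subgraphs.items():
        db.execute("INSERT OR IGNORE INTO fingerprints VALUES (?, ?)",
                   (name, fingerprint(feature_vector(labels), bucket)))
    db.commit()

def clone_groups(db, min_size=2):
    """Group sub graphs by fingerprint and filter out groups that are too small."""
    rows = db.execute("SELECT fingerprint, subgraph FROM fingerprints "
                      "ORDER BY fingerprint, subgraph").fetchall()
    groups = {}
    for fp, name in rows:
        groups.setdefault(fp, []).append(name)
    return [g for g in groups.values() if len(g) >= min_size]

add_subgraphs(db, {"s1": ["Gain", "Sum"], "s2": ["Sum", "Gain"]})
first = clone_groups(db)
add_subgraphs(db, {"s3": ["Gain", "Sum"], "s4": ["Delay"]})  # incremental run
second = clone_groups(db)
print(first, second)
```

The second call only inserts the two new sub graphs and then regroups, which is the behavior the fingerprint database is meant to enable. Equal fingerprints only mark candidates here; a structural comparison would still be needed to confirm a clone.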
3.4 Analysis
This detection system is only a conceptual model based on our preliminary thoughts. We do not
yet have a complete, runnable implementation of this approach. Theoretically, however, the
system may offer some new features compared to the other approaches, as shown in Table 3.
Table 3: Feature Comparison between Our Work and the Existing Approaches
Features                            Index-based   Automotive   eScan      aScan      Our Work
                                    Detection     Detection
Exact Clone Detect                  O             O            O          X          O
Approximate Clone Detect            X             O            X          O          O
Minimal Clone Detect Size Support   O             O            O          O          O
Maximal Clone Detect Size Support   O             O            O          O          O
Speed                               GENERAL       FAST         GENERAL    GENERAL    NOT FAST
Incremental Detect Support          O             X            O          O          O
Detect Correctness                  GENERAL       GOOD         BEST       NOT GOOD   GOOD
Completeness                        GENERAL       NOT GOOD     BEST       GOOD       GOOD
Stability                           GENERAL       BEST         GOOD       GOOD       N/A
Extra Storage                       NOT NEED      NOT NEED     NOT NEED   NOT NEED   NEED
By bringing fingerprints into clone detection, we expect to obtain a clearer structure for the
later grouping and clustering. Also, because the fingerprints can be stored in a database, we
get a record of the process status, which gives us incremental detection support. Besides, by
changing the fingerprint vector, we can run detection at different depths.
In the preprocessing and normalization of the input models, we apply a data mining process so
that we can build a proper and correct structure for the model. By using LSH, we expect a lower
time complexity in cases with many sub-systems and tens of thousands of blocks. LSH may also
help us build the fingerprints.
4. Conclusion
The approach we presented is still being implemented and tested. We need to find out whether,
with all these algorithms combined, preprocessing and normalization will cost more time than in
other approaches. Theoretically, this approach will achieve good correctness and completeness if
we construct a proper vector for comparing clone pairs. Though it does not have polynomial time
complexity, we believe it is an improvement over the existing approaches.
5. References
[1] Daniela Steidl. Index-based Model Clone Detection. Technische Universität München, 2010.
[2] F. Deissenboeck, B. Hummel, E. Juergens, B. Schätz, S. Wagner, J. F. Girard, and S. Teuchert.
Clone detection in automotive model-based development. In Proc. of ICSE '08, pages 603-612, 2008.
[3] Nam H. Pham, Hoan Anh Nguyen, Tung Thanh Nguyen, Jafar M. Al-Kofahi, Tien N. Nguyen.
Complete and accurate clone detection in graph-based models. In Proc. of ICSE '09, pages 276-286,
2009.
[4] Alexandr Andoni, Piotr Indyk. Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions. In Proc. of the 47th Annual IEEE Symposium on Foundations of
Computer Science (FOCS '06), 2006.
[5] F. V. Rysselberghe, S. Demeyer. Evaluating clone detection techniques from a refactoring
perspective. In Proc. of the 19th International Conference on Automated Software Engineering
(ASE '04), 2004.
[6] M. Chilowicz, E. Duris, G. Roussel. Syntax tree fingerprinting: a foundation for source code
similarity detection. In Proc. of the 17th IEEE International Conference on Program
Comprehension (ICPC '09), 2009.