Pitfalls in Aspect Mining

Prof. Kim Mens, Université catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium, [email protected]
Dr. Jens Krinke, King's College London, United Kingdom, [email protected]
Dr. Andy Kellens, Vrije Universiteit Brussel, Belgium, [email protected]

WCRE 2008, 15th Working Conference on Reverse Engineering, October 15th – 18th, 2008, Antwerp, Belgium
Presentation of paper on "pitfalls in aspect mining" at the Working Conference on Reverse Engineering (WCRE), Antwerp, Belgium, 2008.
Poor precision or recall occurs at different levels of granularity
Example
In order to perform this evaluation, we use grok to process the sets of clone classes of both clone detectors separately. For each of the concerns we consider, we try to find an ordered selection of clone classes that does a good job at 'covering' the region of code defined by the concern in question. A source code line of a concern is covered by a clone class if it is included in one of the clones (code fragments) of the clone class.

For each concern, we then proceed as follows: for all of the clone classes in the set, we calculate which concern lines are covered by each clone class. The clone class that covers the most lines of the concern is selected, and the concern lines that are covered will no longer be considered during the remainder of the algorithm. Subsequently, the algorithm will select the clone class that covers the most of the remaining concern lines, and so on, until no more concern lines are covered by any clone class. If multiple clone classes cover an equal number of concern lines, we select the clone class that contains the smallest number of non-concern lines. As with lines belonging to a concern, non-concern lines are also considered at most once.
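The greedy selection procedure above can be sketched as follows. This is a hypothetical Python sketch, not the paper's implementation (which used grok); all names are illustrative assumptions.

```python
def select_clone_classes(clone_classes, concern_lines):
    """Greedily select clone classes that cover the most remaining concern lines.

    clone_classes: dict mapping a clone-class id to the set of source lines
                   covered by its clones.
    concern_lines: set of source lines annotated as belonging to the concern.
    """
    concern = set(concern_lines)
    remaining = set(concern)      # concern lines not yet covered
    non_concern_seen = set()      # non-concern lines are counted at most once
    selection = []
    while True:
        best = None
        for cid, lines in clone_classes.items():
            if cid in selection:
                continue
            gain = len(lines & remaining)
            # Tie-break: prefer the class adding fewer new non-concern lines.
            noise = len((lines - concern) - non_concern_seen)
            if gain > 0 and (best is None or (gain, -noise) > best[:2]):
                best = (gain, -noise, cid)
        if best is None:          # no class covers any remaining concern line
            break
        cid = best[2]
        selection.append(cid)
        remaining -= clone_classes[cid]
        non_concern_seen |= clone_classes[cid] - concern
    return selection
```

The loop terminates as soon as no unselected clone class covers a remaining concern line, matching the stopping condition described above.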
6. Obtained Results
Our primary goal is finding the code belonging to a certain concern. Therefore, in our algorithm to select the clone classes (see Section 5), we favor coverage and sacrifice precision (defined below). Arguably, other goals require different criteria to rank the clone classes. For example, in order to identify opportunities for (automatic) refactoring, precision would be the primary issue. We plan to explore these possibilities in the future.

In order to evaluate to what extent the clone detectors meet our goal, we investigate the level of concern coverage met by the clone classes. Concern coverage is the fraction of a concern's source code lines that are covered by the first n selected clone classes. Using the selection algorithm described in Section 5, we obtain the results displayed in Figure 2(a) and Figure 2(b) for Bauhaus' ccdiml and CCFinder, respectively.

Additionally, we evaluate the precision obtained by the first n selected clone classes. Precision is defined as follows:

    precision(n) = concernLines(n) / totalLines(n),

where n indicates the first n selected clone classes, concernLines equals the number of concern code lines covered by the first n selected clone classes, and likewise totalLines equals the total number of lines covered by the first n selected clone classes. Figure 2(c) and Figure 2(d) show the precision obtained by the first n selected clone classes for Bauhaus' ccdiml and CCFinder, respectively.
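The two metrics can be computed directly from the selection order. A minimal sketch, assuming the same illustrative data structures as above (function and variable names are not from the paper):

```python
def coverage_and_precision(selected, clone_classes, concern_lines, n):
    """Concern coverage and precision of the first n selected clone classes.

    selected:      ordered list of clone-class ids (output of the selection).
    clone_classes: dict mapping a clone-class id to its set of covered lines.
    concern_lines: set of lines annotated as belonging to the concern.
    """
    covered = set()
    for cid in selected[:n]:
        covered |= clone_classes[cid]          # totalLines(n) = len(covered)
    concern_covered = covered & concern_lines  # concernLines(n)
    coverage = len(concern_covered) / len(concern_lines)
    precision = len(concern_covered) / len(covered) if covered else 1.0
    return coverage, precision
```

As n grows, `covered` only gains lines, so coverage grows monotonically, while precision typically drops as later clone classes add more non-concern lines.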
Observe that as the number of clone classes considered increases, the coverage displays a monotonic growth, whereas the precision tends to decrease. The highest coverage is less than 100% in all cases: the remaining percentage corresponds to concern code that is coded in such a unique way that it does not occur in any clone class. For example, Figure 2(a) and Figure 2(b) show that 5% of the memory error handling code is not part of any clone class.

We are primarily interested in achieving sufficient coverage without losing too much precision. Therefore, we will focus on the number of clone classes needed to cover most of a concern, where we will consider 80% to be a sufficient coverage level.
6.1. Memory Error Handling

Using 9 clone classes is enough to sufficiently cover the memory error handling concern for both Bauhaus' ccdiml and CCFinder, resulting in 69% and 52% precision, respectively.

We observe that CCFinder yields a clone class that already covers 45% of the concern code. This particular clone class contains 96 clones which are 6 lines in length. Figure 3 shows an example clone from this class. While the lines marked with 'M' belong to the memory handling concern, only the lines marked with 'C' are included in the clones. CCFinder allows clones to start and end with little regard to syntactic units. In contrast, Bauhaus' ccdiml does not allow this, due to its AST-based clone detection algorithm.
M C  if (r != OK)
M C  {
M C      ERXA_LOG(r, 0, ("PLXAmem_malloc failure."));
M C
M C      ERXA_LOG(VSXA_MEMORY_ERR, r,
M C               ("%s: failed to allocated %d bytes.",
M                 func_name, toread));
M
M        r = VSXA_MEMORY_ERR;
M    }
Furthermore, this clone class does not cover memory error handling code exclusively. In Figure 2(d), note that the precision obtained for the first clone class is roughly 82%. Through inspection of the code, we found that some of the clones do not cover memory error handling code at all, but code that is similar at the syntactical level, yet semantically different.
6.2. Parameter Checking

Our results show that the parameter checking concern is found very well by both clone detectors: using 7 clone classes of Bauhaus' ccdiml is sufficient to cover 80% of the concern, while for CCFinder we can suffice with 4 clone classes.
• 3 clone detection techniques
• 5 known aspects
• 16 KLOC C code
• Aspects manually annotated by programmer
• Precision and recall compared to manual annotations

Even for this "ideal" case, precision is still relatively poor.
partially funded by the Interuniversity Attraction Poles Programme - Belgian State, Belgian Science Policy.
References
1. Elisa Baniassad and Siobhan Clarke. Theme: An approach for aspect-oriented analysis and design. In Proc. Int'l Conf. Software Engineering (ICSE), pages 158–167, Washington, DC, USA, 2004. IEEE Computer Society Press.
2. Elisa Baniassad, Paul C. Clements, Joao Araujo, Ana Moreira, Awais Rashid, and Bedir Tekinerdogan. Discovering early aspects. IEEE Software, 23(1):61–70, January-February 2006.
3. Len Bass, Mark Klein, and Linda Northrop. Identifying aspects using architectural reasoning. Position paper presented at Early Aspects 2004: Aspect-Oriented Requirements Engineering and Architecture Design, Workshop of the 3rd Int'l Conf. Aspect-Oriented Software Development (AOSD), 2004.
4. Magiel Bruntink, Arie van Deursen, Remco van Engelen, and Tom Tourwe. An evaluation of clone detection techniques for identifying crosscutting concerns. In Proc. Int'l Conf. Software Maintenance (ICSM), pages 200–209. IEEE Computer Society, 2004.
5. Magiel Bruntink, Arie van Deursen, Remco van Engelen, and Tom Tourwe. On the use of clone detection for identifying crosscutting concern code. IEEE Trans. Software Engineering, 31(10):804–818, 2005.
6. M. Ceccato, M. Marin, K. Mens, L. Moonen, P. Tonella, and T. Tourwe. Applying and combining three different aspect mining techniques. Software Quality Journal, 14(3):209–231, September 2006.
7. A. Kellens, K. Mens, and P. Tonella. A survey of automated code-level aspect mining techniques. Trans. AOSD, 2007. To be published.
8. Awais Rashid, Peter Sawyer, Ana M. D. Moreira, and Joao Araujo. Early aspects: A model for aspect-oriented requirements engineering. In Joint Int'l Conf. Requirements Engineering (RE), pages 199–202. IEEE Computer Society Press, 2002.
9. Bedir Tekinerdogan and Mehmet Aksit. Deriving design aspects from canonical models. In S. Demeyer and J. Bosch, editors, Workshop Reader of the 12th European Conf. Object-Oriented Programming (ECOOP), Lecture Notes in Computer Science, pages 410–413. Springer-Verlag, 1998.
10. Charles Zhang and Hans-Arno Jacobsen. Efficiently mining crosscutting concerns through random walks. In AOSD '07: Proc. of the 6th Int'l Conf. Aspect-Oriented Software Development, pages 226–238, New York, NY, USA, 2007. ACM Press.
technique is relatively low. While this low precision is not a problem per se, it does imply that aspect mining techniques tend to return a lot of false positives, which can be detrimental to their scalability and ease-of-use. Especially for techniques that return a large number of results, this lack of precision can be problematic, since it may require an important amount of user involvement to separate the false positives from the relevant aspect candidates.

Note that precision can be considered at several levels of granularity. At the level of crosscutting sorts: if we look for all aspects or concerns of a given kind, how many false positives do we find that do not belong to that kind? At the level of individual aspects or concerns: do we find some things that are not really aspects or concerns? At the level of joinpoints: for a given aspect candidate or seed we detected, are the code fragments we find as belonging to that concern really a part of that aspect?
Example. Bruntink et al. [4, 5] evaluated the suitability of clone detection techniques for automatically identifying crosscutting concern code. They considered 16,406 lines of code belonging to a large industrial software system and five known crosscutting concerns that appeared in that code: memory handling, null pointer checking, range checking, exception handling and tracing. Before applying their clone detection techniques to mine for the code fragments (lines of code) belonging to each of those concerns, they asked the developer of this code to manually mark, for each line of code, to what concern(s) it belonged. Next, they applied three different clone detection techniques to the code: an AST-based, a token-based and a PDG-based one. In order to evaluate how well each of the three techniques succeeded in finding the code that implemented the five crosscutting concerns, the results of each of the clone detection techniques were compared to the manually marked occurrences of the different crosscutting concerns, and precision and recall were calculated against those. Table 1 shows the average precision of the three clone detection techniques for each of the five concerns considered.
As can be seen from the table, the results of this experiment were rather disparate. For the null pointer checking concern, all clone detectors identified the concern code at near-perfect precision. For most of the other concerns, none of the clone detectors achieved satisfying precision.

Table 1. Average precision of each technique for each of the five concerns

Concern                  AST   Token  PDG
Memory handling          .65   .63    .81
Null pointer checking    .99   .97    .80
Range checking           .71   .59    .42
Exception handling       .38   .36    .35
Tracing                  .62   .57    .68
Related problems. Poor precision has a negative impact on scalability (3.5). There is also a subtle trade-off between recall (3.2) and precision: often better precision can be reached at the cost of lower recall, and vice versa.
3.2 Poor recall
Description. Recall is the proportion of relevant aspect candidates that were discovered, out of all aspect candidates present in the source code. In other words, recall gives an idea of how many false negatives remain in the code and thus how well the technique covers the entire code analysed. As for precision, recall can be considered at several levels of granularity. At the level of crosscutting sorts: if we look for all aspects or concerns of a given kind, do we find all concerns of that kind which exist in the code? At the level of individual aspects or concerns: do we find all aspects and concerns that are present in the code? At the level of joinpoints: do we find the full extent of the aspect or concern, or does the technique fail to discover some code fragments pertaining to the aspect?

A problem with calculating recall is that typically, in a program under analysis, it is not known what the relevant aspects and joinpoints are, except in an ideal case like the validation experiment of Bruntink et al. (see above), where the concerns are known in advance and where a programmer took the time to mark each line of code with the concern(s) it belongs to. A second problem is that most techniques will look for certain symptoms of aspects only and thus are bound to miss occurrences of aspects that exhibit different symptoms.
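Given such a manually marked oracle, joinpoint-level recall is a simple ratio. A minimal sketch (the function and argument names are illustrative, not from the paper):

```python
def recall(reported_lines, oracle_lines):
    """Fraction of manually marked concern lines found by a mining technique.

    reported_lines: lines the technique reports as belonging to the concern.
    oracle_lines:   lines the developer manually marked for that concern.
    """
    oracle = set(oracle_lines)
    if not oracle:
        return 1.0  # nothing to find: vacuously perfect recall
    return len(set(reported_lines) & oracle) / len(oracle)
```

Without such an oracle, this ratio cannot be computed, which is exactly the first problem noted above.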
Subjectivity and scalability

• Subjectivity in interpretation of results
  • Filters, threshold values and blacklists configured by users
  • Ambiguity in interpretation of what is a valid aspect candidate
    • "if it is part of the core functionality, it is not an aspect"
    • e.g. "Moving Figures" in JHotDraw
• Scalability can be problematic due to user involvement
  • often many results to be validated / refined by user
  • looking for false positives / completing the aspect seeds
Evaluate, compare and combine

• Empirical validation
  • no common benchmark
  • subjectivity in interpretation
  • results at different levels of detail and granularity
• Comparability
  • how to compare the quality of mining techniques?
• Composability
  • how to combine the results of different mining techniques?
Causes of the problems

• Inappropriate techniques
  • Too general-purpose
  • Too strong assumptions
  • Too optimistic approaches
  • Scattering versus tangling
  • Lack of use of semantic information
• Imprecise definition of what is an aspect
• Inadequate representation of results
Aspect mining problems and causes

[Table relating each problem (poor precision, poor recall, subjectivity, scalability, empirical validation, comparability, composability) to its contributing causes: inappropriate techniques (too general-purpose, too strong assumptions, too optimistic approaches, no attention to tangling, lack of use of semantic information), imprecise definition of what is an aspect, and inadequate representation of results.]
What can we learn from this table?

• Most causes negatively affect either precision, recall, or both
• Poor precision negatively affects scalability: more user involvement
• Only one of the causes seems specific to aspects
• Three of the causes account for most problems
How to improve? (1)

• Provide a more rigorous definition of aspect
• Dedicated mining techniques may be more successful than general-purpose 'one size fits all' aspect mining techniques
• Rely on semantics rather than on code structure
  • need for a stable semantic foundation
• Desired quality depends on purpose of mining
  • what is it that you want to do with the mined information?
  • initial understanding vs. migration towards aspects
How to improve? (2)

• Leave room for variability
• Look for counter-evidence
• Look for symptoms of tangling
• Choose an adequate and uniform way of presenting the results
  • enough detail, but not too much
• Combine results of different techniques
• Provide a common framework to compare and evaluate mining techniques
Conclusion

• Most encountered pitfalls are not specific to "aspect mining"
  • relevant to any discovery / reverse engineering process
  • especially present in aspect mining due to the relative immaturity of the domain
  • potential for cross-fertilisation?
• A word of warning
  • If you want to use aspect mining, don't apply the tools blindly
  • If you want to research aspect mining, there are still many research opportunities, but also a high risk of failure