
Re-engineering Pattern Extraction for Program Understanding

Technical Report CS-014-2004

Onaiza Maqbool, Haroon Babri

Computer Science Department

Lahore University of Management Sciences (LUMS)

Acknowledgment

We would like to thank the Lahore University of Management Sciences for providing funding for this research. Thanks to Rainer Koschke of the Bauhaus group at the University of Stuttgart for providing the RSF for Bash, and to Johannes Martin of the Rigi group, University of Victoria, for providing the RSF for Xfig.


Table of Contents

1. Introduction
2. Related work
3. An overview of association rule mining
4. Re-engineering pattern extraction
5. Experiments and results
5.1 The test systems
5.2 Analysis of results
6. Conclusions
References


List of Tables

Table 1: A set of transactions
Table 2: Statistics of the Bash and Xfig systems
Table 3: Global variables accessed by maximum number of functions
Table 4: List of files containing frequently accessed functions (Xfig)
Table 5: Function calls made in various sub-systems
Table 6: Number of function calls made to functions in each sub-system
Table 7: Five or more function calls made to functions in each sub-system
Table 8: Accesses to user defined types in various Xfig sub-systems


List of Figures

Figure 1: Number of functions accessing a global variable (Bash)
Figure 2: Global variable access breakdown (Bash)
Figure 3: Number of functions accessing a global variable (Xfig)
Figure 4: Global variable access breakdown (Xfig)
Figure 5: Functions called by one function only (Bash)
Figure 6: Functions called by one function only (Xfig)
Figure 7: Functions that are called together (Bash)
Figure 8: Functions that are called together (Xfig)
Figure 9: Global variables that are accessed together with high confidence and support > 2 (Bash)
Figure 10: Association between global variables (Bash)
Figure 11: Global variables that are accessed together with high confidence and support > 2 (Xfig)
Figure 12: Association between global variables (Xfig)
Figure 13: User defined types that are accessed together with high confidence (Bash)
Figure 14: Association between user defined types (Bash)
Figure 15: User defined types that are accessed together with high confidence (Xfig)
Figure 16: Association between types (Xfig)
Figure 17: Number of functions accessing a type (Bash)
Figure 18: Number of functions accessing a type (Xfig)
Figure 19: Confidence 1 association between user defined types (Bash)
Figure 20: Confidence 1 association between user defined types (Xfig)
Figure 21: Confidence 1 association between globals and user defined types (Bash)
Figure 22: Confidence 1 association between globals and user defined types (Xfig)
Figure 23: Number of function calls made to functions (Bash)
Figure 24: Number of function calls made to functions (Xfig)
Figure 25: Number of function calls made to functions (Xfig d_* files sub-system)
Figure 26: Number of function calls made to functions (Xfig e_* files sub-system)
Figure 27: Number of function calls made to functions (Xfig f_* files sub-system)
Figure 28: Number of function calls made to functions (Xfig u_* files sub-system)
Figure 29: Number of function calls made to functions (Xfig w_* files sub-system)
Figure 30: Types accessed by a subsystem (d_* files)
Figure 31: Types accessed by a subsystem (e_* files)
Figure 32: Types accessed by a subsystem (f_* files)
Figure 33: Types accessed by a subsystem (u_* files)
Figure 34: Types accessed by a subsystem (w_* files)


1. Introduction

Legacy systems are old software systems that are crucial to the operation of a business. These systems are expected to have undergone changes in their lifetime due to changes in requirements, business conditions and technology. It is quite likely that such changes were made without proper regard to software engineering principles. The result is often a deteriorated structure, which is unstable but cannot be discarded because it is costly to do so. Moreover, another reason for retaining these legacy systems is that they embed business knowledge which is not documented elsewhere.

Since it is often not feasible to discard a system and develop a new one, techniques must be employed to improve the structure of the existing system. An effective strategy for change must be devised; re-engineering is one such strategy. Re-engineering is a process that re-implements legacy systems to make them more maintainable [1]. According to [2], re-engineering is any activity that improves one’s understanding of software or prepares/improves the software itself, usually for increased maintainability, reusability or evolvability.

Given the fact that software maintenance usually accounts for over 50% of project effort [1],[3],[4], making it the single most expensive software engineering activity [1], and perhaps the most important life cycle phase [5], the need for re-engineering to ease the maintenance effort is justified. The re-engineering option should be chosen when system quality has been degraded by regular change, but change is still required i.e. the system under consideration has low quality but a high business value, and the re-engineering effort is less risky and less costly than system replacement.

The re-engineering effort starts with gaining an understanding of the software system, a process known as reverse engineering. Understanding is critical to many activities including maintenance, enhancement, reuse, design of a similar system and training [6]. Reverse engineering has been heralded as one of the most promising technologies to combat the legacy systems problem [7]. However, gaining system understanding is difficult because documentation for the system is often not available and the source code files are the only source of information about the system. According to [8], system understanding takes up 47% of the software maintenance effort. Hall [9] places the system understanding effort at 47%-62%. Tools and techniques are thus required to make the program understanding task easier. Tools provide automated support for system understanding at the procedural level by extracting the procedural design, or at a higher level by extracting the architectural design. Tools for procedural design extraction are based on techniques including knowledge based systems with inference rule engines [10], graph parsing [11] and program plan recognition [12]. Techniques utilized by architectural design recovery tools include the composition of sub-system hierarchies using (k,2)-partite graphs [13], graph transformations using relational algebra [14] and view extraction and fusion using SQL [15].

In the past, the application of deductive techniques to different aspects of software engineering has been more frequent than the use of inductive techniques [16]. Deductive techniques usually employ some knowledge base as their underlying technology which is used to deduce relations within the software system. Great effort is required to build the knowledge base and continuously maintain it. Moreover, algorithms used for deductions may be computationally demanding [16]. Researchers have thus started exploring the use of inductive or data mining techniques in software engineering. Data mining is considered one of the most promising interdisciplinary developments in the information industry [17].

There has been growing interest in the application of data mining techniques to gain better understanding of software systems. In recent years, researchers have applied data mining techniques in different contexts e.g. for architecture recovery of legacy systems [18] - [21], to discover patterns for re-using library components [22], [23], to support software system maintenance [16], [24] - [26], to discover user interaction patterns [27], to aid program understanding [28], [29] and to facilitate software re-use [30] - [33].


In this report, our focus is on the use of association rule mining for discovering patterns within the source code that are helpful in system understanding and improvement. Patterns were first adopted by the software community as a way of documenting recurring solutions to design problems [34]. However, their use is not restricted to design problems; they have been used as an effective means to communicate best practice in various aspects of software development including the development process, testing and re-engineering. Re-engineering patterns present solutions to re-engineering problems. It is interesting to note that although re-engineering may be carried out due to a number of different reasons, the actual technical problems within legacy systems are often similar and hence some general techniques or patterns can be utilized to aid in the re-engineering task [34]. Re-engineering patterns for object-oriented legacy systems were identified based on experiences during the FAMOOS project [35] carried out to support the evolution of object-oriented software. In this report, we identify patterns for traditional legacy systems developed using the structured approach with functions as basic components. Given the source files of such a software system, we use association rule mining algorithms and tools to gain insight about the software. The understanding gained allows suggestions for making subsequent changes and optimizations to the source code for better maintainability.

The organization of this report is as follows. In section 2 we present related research. Section 3 gives an overview of association rule mining. Section 4 details our approach. Section 5 gives the results of applying association rule mining to two test systems. Finally, we present the conclusions.


2. Related work

The use of data mining for software understanding is gaining popularity as evidenced by recent work in this area. One possible categorization of the work done would be according to the mining technique employed. Popular techniques include association rule mining, concept learning and cluster analysis. Another useful categorization is according to the application or purpose of work. Keeping in view that some researchers employ more than one technique to arrive at results, in this report we chose to categorize research according to its purpose e.g. architecture recovery, re-modularization, maintenance support, facilitating re-use, interaction pattern mining and program comprehension.

The architecture recovery of software systems using data mining techniques is discussed in [18] - [21]. The data mining technique employed in these papers is primarily association rule mining. The Identification of sub-systems based on associations (ISA) was proposed by Oca and Carver [18]. They used association rule mining for extraction of data cohesive sub-systems by grouping together programs that use the same data files. Experiments were performed on a 25 KLOC Cobol system with 28 programs and 36 data files and sub-systems were successfully identified. However, the results obtained were not compared with any existing documentation or decomposition. A very similar approach is that of [19], where the use of a representation model (RM) is discussed to represent the sub-systems identified using the ISA methodology in [18].

Sartipi et al. [20] discuss a technique for recovering the structural view of a legacy system’s architecture based on association rule mining, clustering and matches with architectural plans. Rather than using programs as sub-system entities as in [18], functions are used. Moreover, associations between functions are determined based on the variables and types they access, and the function calls they make. After closely associated function groups have been identified, a branch and bound algorithm is used for matching with architectural plans. In experiments performed with the CLIPS system (40 KLOC, 734 functions, 59 aggregate data types and 163 global variables), a precision level of 90% and a recall level of 60% are maintained between the conceptual and concrete architecture recovered through the defined process.

Tjortjis et al. [21] also employ association rule mining to arrive at decompositions of a system at the function level. Their rule mining approach is similar to the one discussed in [20], where attributes are variables, data types and calls to functions. Groups of functions, i.e. sub-systems, are created by finding common attributes participating in the same association rules. Experiments performed on a Cobol system, with programs of an average size of 1000 lines of code, show that the results are satisfactory when compared with a mental model constructed by a developer of the system.

Software re-modularization using clustering techniques is discussed in [36] - [41]. Tzerpos and Holt [36] present the case for using clustering techniques to re-modularize software, after the techniques have been adapted to fit the peculiarities of the software domain. In [37], Wiggerts provides a framework to apply cluster analysis for re-modularization. The clustering process is described in some detail, along with commonly used similarity measures and clustering algorithms.

Experiments with clustering as a re-modularization method are described in [38] and [39]. Both papers conclude by recommending similarity measures and clustering algorithms which yield good experimental results for software artifacts. A theoretical explanation to some previous experimental results obtained by researchers in the area is provided in [40]. The paper also describes a new clustering algorithm, which gives better results as compared to the currently employed algorithms for clustering software. In [41], the weighted combined algorithm for software clustering is presented. This algorithm shows improvement in clustering results as compared to previously employed clustering algorithms.

Data mining techniques have also been employed to support the maintenance task [16], [24] - [26]. The use of inductive techniques to extract a maintenance relevance relation (MRR) from the source code, maintenance logs and historical maintenance update records is discussed in [16], [24], [25]. An MRR simply indicates that if a software engineer needs to understand file1, he/she probably also needs to understand file2. The problem has been presented as a concept learning problem, and a decision tree classifier is used for classifying file pairs as relevant, not relevant and potentially relevant. Relevance indicates that two files were modified in the same update and potential relevance indicates that both files were looked at in the same update. Experiments performed on the SX2000 system, a large legacy telephony switching system with 1.9 MLOC and 4700 files, show that the 2-class problem, with relevant and non-relevant classes only, yields better results than the 3-class problem. It is also seen that combining text-based features with syntactic features yields better results than using syntactic features alone.

Zimmermann et al. [26] apply association rule mining to ease maintenance by mining version histories. Association rule mining is used to predict likely changes after a change has been made, prevent errors due to incomplete changes and detect coupling between items which is not revealed by program analysis. Mining is carried out through the ROSE tool developed for this purpose. The tool reads a version archive and groups changes into transactions so that rules describing them are formed. ROSE was tested on 8 open source projects and was found to be a helpful tool in suggesting further changes and in warning about missing changes.

Software re-use can be facilitated by data mining techniques [22], [23], [30] - [33]. Software library reuse patterns have been mined using generalized association rules in [22]. The discovered patterns serve as guides to identify typical usage of the software library i.e. the combination of classes and member functions that are typically re-used by applications. The paper extends earlier work by the author [23], in which association rules rather than generalized association rules were used. Re-use patterns for the KDE core libraries version 1.1.2 were mined by analyzing 76 applications.

Reusable components are identified by finding similarities between components using lexical rather than structural techniques in [30]. McCarey et al. [31] utilize collaborative filtering for recommending reusable components by enabling prediction of the utility of an item based on the user’s previous history and the opinions of other like-minded users.

The building of a digital library for source code to facilitate reuse of code segments is discussed in [32]. Garg [33] discusses the sharing of knowledge of re-usable components across multiple projects.

El-Ramly et al. [27] apply sequential data mining to user activities to discover interaction patterns depicting how users interact with systems. Interaction pattern mining is applied to legacy and web-based systems. For legacy systems, these patterns reflect active services. For web-based systems, the patterns can be used for reengineering the website for easier and faster access.

Program comprehension can be aided by source code mining [28], [29]. Balanyi and Ferenc [28] automatically search design patterns in C++ code. A tool called Columbus is used to build an internal representation of the source code which is compared with pattern descriptions written in DPML, a language based on XML. Four publicly available C++ projects were used for experiments. Some problems were faced because of rule violations in implementation, and when the structures of patterns were similar. However, the overall results were satisfactory.

The comprehension of C++ programs by clustering together similar entities based on their attributes is proposed in [29]. Four entities are used: classes, member data, member functions and parameters. Results of applying the process to three open source systems have been presented. Analysis of the results reveals correlations between classes, thus revealing portions of code that have common characteristics and are expected to change together.

A workshop for mining software repositories for assisting in program understanding and studying evolution was held recently [42]. Papers presented in the workshop covered various aspects, including the infrastructure required for extraction of information, use of mining for program understanding, identification of change patterns, defect analysis, software re-use and process and community analysis.


3. An overview of association rule mining

Association rule mining is a data mining technique that finds interesting association or correlation relationships among a large set of data items [17]. Traditionally, association rule mining has been employed as a useful tool to support business decision making by discovering interesting relationships among business transaction records.

To illustrate the concept of association rule mining, consider a set of items $I = \{i_1, i_2, \ldots, i_n\}$. Let $D$ be a set of transactions, with each transaction $T$ being a subset of $I$, i.e. $T \subseteq I$. An association rule is an implication of the form $A \rightarrow B$, where $A \subset I$, $B \subset I$ and $A \cap B = \emptyset$. As an example, consider a set of computer accessories (CDs, memory sticks, microphones, speakers) that are available at a certain store. These accessories form the set of items I of interest to us. Every sale made represents a transaction T. Suppose the sales made are represented in the form of the following set of transactions D:

| Transaction ID | Items sold |
|---|---|
| T1 | CD, memory stick |
| T2 | CD |
| T3 | Microphone, speaker |
| T4 | CD, speaker, microphone |
| T5 | Memory stick, microphone, speaker |

Table 1: A set of transactions

Association rules in the above case represent the items that tend to be sold together, e.g. the association rule CD → speaker states that those who buy CDs also tend to buy speakers.

A large number of such association rules may exist in a given set of transactions, and not all of them may be of interest. A pattern is said to be interesting if it is easily understood, valid, useful, novel or validates a hypothesis that the user sought to confirm [17]. To find interesting rules, support and confidence are commonly used as objective measures of pattern interestingness. For a rule A → B, support is the proportion of transactions in D that contain both A and B, and confidence is the proportion of transactions in D containing A that also contain B. Another measure of interest is coverage: the proportion of transactions in D that contain the items specified on the left hand side of the rule. Mathematically:

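$$\text{Support}(A \rightarrow B) = \frac{|\{T \in D : A \cup B \subseteq T\}|}{|D|}$$

$$\text{Confidence}(A \rightarrow B) = \frac{|\{T \in D : A \cup B \subseteq T\}|}{|\{T \in D : A \subseteq T\}|}$$

$$\text{Coverage}(A \rightarrow B) = \frac{|\{T \in D : A \subseteq T\}|}{|D|}$$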

An association rule is said to be strong if it satisfies both a minimum support threshold and a minimum confidence threshold.

For the association rule CD → speaker, support is 1/5, confidence is 1/3 and coverage is 3/5. An interesting association rule in this case is microphone → speaker, for which support is 3/5, confidence is 1 and coverage is 3/5.
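The figures quoted for Table 1 can be checked mechanically. The following Python sketch recomputes the three measures over the five transactions; the function names and the set-based encoding are illustrative assumptions and are not part of the original study.

```python
from fractions import Fraction

# Transactions of Table 1, encoded as sets of items.
TRANSACTIONS = {
    "T1": {"CD", "memory stick"},
    "T2": {"CD"},
    "T3": {"microphone", "speaker"},
    "T4": {"CD", "speaker", "microphone"},
    "T5": {"memory stick", "microphone", "speaker"},
}


def count_containing(items, transactions):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions.values() if items <= t)


def support(lhs, rhs, transactions):
    """Proportion of all transactions that contain both sides of the rule."""
    return Fraction(count_containing(lhs | rhs, transactions), len(transactions))


def coverage(lhs, transactions):
    """Proportion of all transactions that contain the left hand side."""
    return Fraction(count_containing(lhs, transactions), len(transactions))


def confidence(lhs, rhs, transactions):
    """Proportion of transactions containing lhs that also contain rhs."""
    covered = count_containing(lhs, transactions)
    if covered == 0:
        raise ValueError("confidence is undefined when no transaction contains lhs")
    return Fraction(count_containing(lhs | rhs, transactions), covered)
```

Calling `support({"CD"}, {"speaker"}, TRANSACTIONS)` yields 1/5, and the rule with `{"microphone"}` on the left and `{"speaker"}` on the right yields a confidence of 1. Exact fractions are used so that the results can be compared directly with the values in the text.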


4. Re-engineering pattern extraction

To employ association rule mining for pattern extraction, the first step is to identify a set of items and transactions. The guiding principle is to choose items which facilitate understanding of the code and allow suggestions for re-structuring the code for greater maintainability. Most of the legacy software systems that exist have been developed using the structured approach, with functions or routines forming basic components. Moreover, in legacy software, the use of global variables is often widespread leading to difficulty in understanding the code. In view of these facts, we decided to use functions and global variables as items. Moreover, we also decided to use user defined types. The reason is that user defined types become potential data objects when code is to be re-structured as object-oriented code. Thus, the transaction set that we use consists of functions within a software system, with items being the global variables accessed, user defined types accessed, and function calls made by the function.
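As a rough illustration of this transaction model, the sketch below groups extracted facts by function, so that each function becomes one transaction whose items are the globals, types and calls it uses. The fact triples, relation names and identifiers are hypothetical and chosen only for illustration; the actual extraction format used for the test systems may differ.

```python
from collections import defaultdict

# Hypothetical extracted facts: (relation, function, target).
FACTS = [
    ("accesses_global", "parse_line", "line_number"),
    ("uses_type", "parse_line", "TOKEN"),
    ("calls", "parse_line", "next_token"),
    ("calls", "parse_line", "report_error"),
    ("accesses_global", "report_error", "line_number"),
    ("calls", "main", "parse_line"),
]

# Prefix per relation, so that a global, a type and a function
# sharing the same name remain distinct items.
ITEM_PREFIX = {"accesses_global": "global", "uses_type": "type", "calls": "call"}


def build_transactions(facts):
    """Map each function to the set of items (globals, types, calls) it uses."""
    transactions = defaultdict(set)
    for relation, function, target in facts:
        transactions[function].add(f"{ITEM_PREFIX[relation]}:{target}")
    return dict(transactions)
```

In this sketch a function with no recorded facts does not appear as a transaction; whether such functions should be counted when computing coverage is a choice left to the analyst.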

In the next step, we identify re-engineering patterns that help in identifying problems in legacy code and suggesting appropriate solutions. Whereas design patterns have to do with choosing a particular solution to a design problem, re-engineering patterns have to do with discovering an existing design, determining what problems it has and repairing these problems [34].

Traditionally, high thresholds of support, confidence and coverage have been used for finding interesting rules. We use both low and high thresholds of the three measures. It is apparent from the patterns below that low thresholds can also reveal interesting facts about the software and provide insight into its structure. One can consider the range 0.7 to 1.0 as high for any measure and 0.0 to 0.3 as low. However, it is not meaningful to fix an absolute threshold, since system characteristics such as size and the number of global variables, types and functions can vary widely from system to system. We therefore recommend that high and low thresholds be determined subjectively, depending on the system under consideration.
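A minimal sketch of how a measure might be bucketed is given below. The default cut-offs are the indicative ranges mentioned above and are exposed as parameters because they are meant to be tuned per system; the function name and the "medium" label for values between the two ranges are assumptions made for this example.

```python
def classify(value, low_max=0.3, high_min=0.7):
    """Bucket a measure in [0, 1] as 'zero', 'low', 'medium', 'high' or 'one'."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("measure must lie between 0 and 1")
    if value == 0.0:
        return "zero"
    if value == 1.0:
        return "one"
    if value <= low_max:
        return "low"
    if value >= high_min:
        return "high"
    # Not named in the text: values that are neither low nor high.
    return "medium"
```

For a system in which even 30% of the functions is a large absolute number, the low cut-off can be tightened, for example `classify(0.2, low_max=0.1)`.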

It is relevant to note that if we employ user defined types, functions and global variables as items, and use coverage, support and confidence criteria with values zero, low, high and one, the number of possible association rules is almost 1500. It is obvious that the chosen objective measures are not sufficient to guide the mining process. Useful patterns need to be filtered using subjective interestingness measures. These measures are based on user beliefs in data and find patterns interesting if they are unexpected or offer information on which the user can act [17]. In this report, our emphasis is on selection of unexpected and actionable patterns in the context of program understanding. The associations that we describe are somewhat different from those of interest in conventional applications like market basket analysis. However, they are relevant in the software context and convey meaningful information, thus highlighting the need and benefits of adapting techniques to suit the peculiarities of specific domains. We present a small subset of the total possible association rules in this report. They are a representative sample of interesting association rules and have been selected because they help in understanding the design of legacy software and in identifying problem areas, whereas the related patterns offer suggestions for improvement.

In the tables below, we list the interesting association rules and detail their implication. We also list benefits of employing the patterns, as well as related issues and problems. In some of the association rules identified, one out of the three interestingness measures is used. This indicates that the value of the other two measures does not influence the pattern. Patterns in which a combination of measures gives interesting results are also listed.


Pattern 1: Reduce global variable usage

Association rule: Global → Calling function

Coverage: Low

Implication

Only a small proportion of functions in the system use the global variable on the LHS.

Suggestion

Rather than using the variable as a global variable, pass it as a parameter to the relevant functions. If passing a global variable as a parameter is not convenient, its scope may be restricted to a single file by defining it as a static variable.

Motivation & Benefits

When a large number of variables are to be shared amongst functions, global variables are convenient. Moreover, global variables have a longer lifetime than automatic variables, making it simpler to share information between functions that do not call each other [43]. However, unless the global data is read only, the use of global variables results in undesirable coupling between functions [44], leading to difficulties in program understanding and maintenance. By removing unnecessary global variables and restricting their usage, coupling among components is reduced, making it easier to trace faults and avoid unintentional changes to data.

Issues & Problems

Careful evaluation of each global variable is required to decide whether it should be passed as a parameter, declared to be static or left as it is. If the global variable has low coverage, it implies that only a small proportion of functions in the system use the global variable. However, for a large system, this small proportion can mean a large absolute number of functions. Even in a small system with 100 functions, a 30% coverage implies up to 30 functions accessing the global variable, which is not a small number. Moreover, even if the number of functions is considerably smaller, the designers of the system may have valid reasons for defining a variable as global.
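One way to screen for candidates of this pattern is sketched below. It computes the coverage of each global over all functions and, in view of the issue just described, also applies a cap on the absolute number of accessing functions. The data, function names and default thresholds are hypothetical.

```python
# Hypothetical data: each function (transaction) mapped to the globals it accesses.
GLOBAL_ACCESSES = {
    "read_input": {"debug_level"},
    "parse": {"debug_level", "last_error"},
    "evaluate": {"debug_level"},
    "print_result": {"debug_level"},
    "cleanup": set(),
}


def global_coverage(accesses):
    """Return {global: (accessor_count, coverage)} over all functions."""
    total = len(accesses)
    counts = {}
    for globals_used in accesses.values():
        for name in globals_used:
            counts[name] = counts.get(name, 0) + 1
    return {name: (n, n / total) for name, n in counts.items()}


def low_coverage_globals(accesses, max_coverage=0.3, max_accessors=5):
    """Globals used by a small proportion AND a small absolute number of functions."""
    stats = global_coverage(accesses)
    return sorted(
        name
        for name, (n, cov) in stats.items()
        if cov <= max_coverage and n <= max_accessors
    )
```

Here `last_error` is reported because only one of the five functions accesses it, whereas `debug_level` is not. The output is a list of candidates for review; it does not replace the careful evaluation of each variable.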

Pattern 2: Select appropriate storage classes

Association rule: Global → Calling function

Coverage: High

Implication

Global variable on LHS is used by most of the functions in the system.

Suggestion

The global variable appears to be frequently used. Declare the global variable as a register variable.

Motivation & Benefits

Declaring a variable as a register variable suggests to the compiler that it be placed in a register instead of memory. This provides a certain degree of control over program efficiency, and may result in faster access and speed improvement [45].


Issues & Problems

The storage of program variables in registers may interfere with conventional usage of a register by the compiler, thus slowing down execution of a program [46]. Thus knowledge of the architecture and compiler is required before such declarations can be of use. It is to be kept in mind that the declaration is a suggestion to the compiler only, which the compiler may choose to ignore [45]. Also, not all application languages support the declaration of register variables. In standard C, for instance, the register storage class may be applied only to block-scope variables and function parameters, not to file-scope globals.

Pattern 3: Increase locality of reference

Association rule: Called function → Calling function

Confidence: One

Implication

The function is called by one function only.

Suggestion

Place the called function in the same file as the calling function. If the size of the function is small, it may be appropriate to make the function inline.


Since the function is called by one function only, there appears to be a strong relation between the two. Functions that are related should be grouped together and placed in one file to promote understandability and information hiding [44]. Such grouping promotes modular continuity [3] because changes in requirements are localized instead of resulting in system wide changes. Making a function inline results in performance improvement and is clearly beneficial if the function expansion is shorter than the code for the calling sequence [48].

Issues & Problems

Large inline functions will save a small percentage of run time but will have a higher space penalty. Also functions with loops should almost never be inlined [48] because the run time of a loop is likely to swamp the function call overhead. Large inlined functions may also make it difficult to understand the functionality of the calling function.

Association rule Confidence Support

Called function → Called function

High High

Implication

Whenever one function is called, there is a high probability that the other function is also called.

Suggestion

Place functions in the same file.



The fact that two functions are called together indicates that they perform related tasks. A file is a way to package related functions into a module [44]. Grouping related functions together is a good design practice because it promotes understandability. Moreover, such grouping promotes modular continuity [47] because changes in requirements are localized instead of resulting in system-wide changes. Localized changes ease the maintenance task.

Issues & Problems

It may be the case that a function is associated with more than one function. These functions may be present in different files, making it difficult to decide in which file to place the function under consideration. The support of the association rule can serve as a useful indicator: the function should be placed in the same file as the function with which its association has the highest support. If the support is the same, the decision depends on other factors, e.g. how similar the function is in purpose to the functions already in a certain file.
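The support-based tie-breaking described above can be sketched as follows. This is only an illustrative sketch: the function name `preferred_files`, the rule triples and the file layout are all hypothetical and are not taken from Bash or Xfig.

```python
def preferred_files(function, rules, file_of):
    """Return the file(s) holding the partner(s) with the highest support.

    rules   -- (function_a, function_b, support) triples for functions
               that are called together with high confidence
    file_of -- maps each function name to the file that defines it
    """
    best_support = {}
    for a, b, support in rules:
        if function == a:
            partner = b
        elif function == b:
            partner = a
        else:
            continue
        target = file_of[partner]
        best_support[target] = max(best_support.get(target, 0.0), support)
    if not best_support:
        return []
    top = max(best_support.values())
    # More than one file in the result means a tie on support.
    return sorted(f for f, s in best_support.items() if s == top)


# Hypothetical rules and file layout, for illustration only.
rules = [
    ("parse_line", "read_token", 0.040),
    ("parse_line", "report_error", 0.010),
    ("read_token", "report_error", 0.010),
]
file_of = {
    "parse_line": "parser.c",
    "read_token": "lexer.c",
    "report_error": "error.c",
}

print(preferred_files("parse_line", rules, file_of))
print(preferred_files("report_error", rules, file_of))
```

When the result contains more than one file, support alone does not settle the question and the other factors mentioned above have to be considered.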

Pattern 4: Increase data modularity


Global → Global High High

Implication

Whenever one global variable is accessed, there is a high probability that the other global variable is also accessed.

Suggestion

Examine the global variables to see if they form a coherent entity. If they do, combine them into a structure.


Type → Type High High

Implication

Whenever one type is accessed, there is a high probability that the other type is also accessed.

Suggestion

Examine the types to see if they form a coherent entity. If they do, combine them into a structure.


Combining variables/types into structures leads to code that is easier to understand and change [44]. If at some stage, a shift is to be made to an object-oriented design paradigm, the structures become potential classes.
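As a rough sketch of how Global → Global rules of this kind could be obtained, the following code counts pairs of globals over a set of transactions. It assumes that each transaction is the set of global variables accessed by one function and that "high" confidence means at least 0.7; the helper name `pair_rules` and the sample transactions are hypothetical.

```python
from itertools import permutations


def pair_rules(transactions, min_confidence=0.7):
    """Mine rules lhs -> rhs between single items.

    transactions -- maps a function name to the set of globals it accesses
    Returns sorted (lhs, rhs, support, confidence) tuples, where
    support    = fraction of transactions containing both items, and
    confidence = transactions with both / transactions with lhs.
    """
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for items in transactions.values():
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for lhs, rhs in permutations(sorted(items), 2):
            pair_count[(lhs, rhs)] = pair_count.get((lhs, rhs), 0) + 1
    rules = []
    for (lhs, rhs), both in pair_count.items():
        confidence = both / item_count[lhs]
        if confidence >= min_confidence:
            rules.append((lhs, rhs, both / n, confidence))
    return sorted(rules)


# Hypothetical transactions, for illustration only.
transactions = {
    "move_cursor": {"rl_point", "rl_end"},
    "insert_text": {"rl_point", "rl_end", "rl_line_buffer"},
    "redisplay": {"rl_point"},
    "read_key": {"rl_done"},
}

for lhs, rhs, support, confidence in pair_rules(transactions):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Rules that hold with high confidence in both directions and with high support are the more natural candidates for a common structure; the same counting applies to Type → Type rules if the transactions hold types instead of globals.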

Issues & Problems


It may be the case that one global variable/type is associated with a high degree of confidence with a number of other global variables/types i.e. when the global variable/type is accessed, a number of other global variables/types are accessed. In this case, to avoid a large structure that hinders rather than promotes understandability, the software engineer should study the code and analyze which global variables/types are to be combined into a structure.

Pattern 5: Strengthen encapsulation


Calling function → Type One

Implication

The functions access the type on the RHS.

Suggestion

If we are considering converting a ‘structured’ system to an ‘object-oriented’ system, consider the type as a candidate class and the functions as its member functions.


Type → Type One

Implication

The types are used together by functions within the system.

Suggestion

Examine the types to see if they form a coherent entity. If they do, combine the types into a structure. If we are considering converting a ‘structured’ system to an ‘object-oriented’ system, the structure is a candidate class and the functions are its member functions.


Global → Type One

Implication

It is always the case that functions in the system access the type and global variable together.

Suggestion

If we are considering converting a ‘structured’ system to an ‘object-oriented’ system treat the type as a candidate class, the global as a static data member and the functions as member functions.


A collection of functions and the common data set they access can be packaged to provide information hiding. Large programs that use information hiding have been found easier to modify by a factor of 4 than programs that don’t [44]. Information hiding forms a foundation for both structured and object-oriented design.

Issues & Problems

Functions may access more than one type, in which case a careful study of the code is required to decide the type with which the function should be associated. It may be the case that the types accessed by a function form an entity which can be transformed into a structure (Pattern 4).

Pattern 6: Identify utilities


Called function → Calling function High

Implication

Function on LHS is called by most of the functions.

Suggestion

Treat the function as a utility function. It may be useful to put groups of related utility functions in separate files.


Called function → Calling function High (within a sub-system)

Implication

Function on LHS is called by most of the functions within a sub-system.

Suggestion

Treat the function as a utility function for that sub-system. The utility functions for a particular sub-system may be placed in a separate file.


Utility functions represent re-usable components of a structured system. Re-usable components can lead to measurable benefits in terms of reduction in development cycle time and project cost, and increase in productivity [3].
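A minimal sketch of how utility candidates could be flagged from call facts is given below. It assumes that the coverage of a called function is the number of distinct calling functions divided by the total number of functions, and that the threshold for "most of the functions" is chosen by the analyst; the name `utility_candidates` and the sample calls are hypothetical.

```python
def utility_candidates(calls, total_functions, min_coverage):
    """Flag called functions whose coverage reaches min_coverage.

    calls -- iterable of (caller, callee) pairs; repeated calls from the
             same caller are counted once
    Returns (callee, number_of_callers, coverage) tuples, most called first.
    """
    callers = {}
    for caller, callee in calls:
        callers.setdefault(callee, set()).add(caller)
    result = []
    for callee, who in callers.items():
        coverage = len(who) / total_functions
        if coverage >= min_coverage:
            result.append((callee, len(who), coverage))
    return sorted(result, key=lambda r: (-r[1], r[0]))


# Hypothetical call facts for a system of five functions.
calls = [
    ("a", "xmalloc"), ("b", "xmalloc"), ("c", "xmalloc"),
    ("a", "helper"), ("a", "xmalloc"),
]
print(utility_candidates(calls, total_functions=5, min_coverage=0.5))
```

Restricting `calls` and `total_functions` to one sub-system gives the sub-system variant of the rule.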

Issues & Problems

In order for functions to be re-used effectively, they must be properly catalogued for easy reference, standardized for easy application and validated for easy integration [3]. In case the number of such functions is large, related functions need to be identified and grouped together, otherwise searching for the appropriate function may be time consuming.

Pattern 7: Localize structures



Global → Calling function One

Implication

The global variables are used by one function only.

Suggestion

Place global variables in one local structure.


It is recommended that all variables have the smallest scope possible. If a variable is used by one function only, there is no plausible reason for defining the variable as global. Localizing the variable promotes understandability and maintainability.
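The candidates for localization can be listed by inverting the access facts, as in the sketch below. The input format (a mapping from each function to the globals it accesses), the name `single_user_globals` and the sample data are assumptions made for illustration.

```python
def single_user_globals(accesses):
    """Return {global: function} for globals accessed by exactly one function.

    accesses -- maps a function name to the set of globals it accesses
    """
    users = {}
    for function, globals_used in accesses.items():
        for name in globals_used:
            users.setdefault(name, set()).add(function)
    return {
        name: next(iter(functions))
        for name, functions in sorted(users.items())
        if len(functions) == 1
    }


# Hypothetical access facts, for illustration only.
accesses = {
    "init": {"table", "debug"},
    "lookup": {"table"},
    "report": {"counter", "debug"},
}
print(single_user_globals(accesses))
```

Here only `counter` is reported, because `table` and `debug` are each accessed by two functions.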

Issues & Problems

None

Pattern 8: Beware of side effects


Calling function → Global One

Implication

When sorted by global variable, we get a list of functions that use the same global variable and thus are highly coupled.

Suggestion

Changes in the global variable or a function accessing the global variable should be made carefully keeping in view all related functions.


Using global variables weakens modularity because functions cannot be understood on their own. Understanding the purpose and working of all functions accessing one global variable leads to reduced side effects in case of changes in the functions or global variable.

Issues & Problems

If a large number of functions access a global variable, understanding their working may require time and effort. The best approach is to reduce global variable usage (see Pattern 1). However, if this is not possible, the effort is well justified because inadvertent changes are avoided.

Pattern 9: Generate alternative views


Type → Calling function High (Within a sub-system)

Implication


Type on LHS is used by most of the functions within the sub-system.

Suggestion

The association of types with a certain sub-system should be highlighted.


Sub-systems and their inter-relationships represent the architecture of a software system. It is important to focus on the architecture from multiple perspectives so that large scale structural changes are easier to make [49]. A study of the types associated with a sub-system provides an alternative ‘data’ view which is particularly important if changes are to be made to types or functions within the sub-system.

Issues & Problems

Different sub-systems of the software system may access the same types, making it difficult to associate a type with one sub-system only. In such a case, it may be feasible to use Pattern 5 on the entire software system in order to group together functions with related types and hence arrive at an alternative modularization based on data abstraction.

5. Experiments and results

5.1 The test systems

For conducting experiments, Xfig version 3.2.3 and a subset of Bash version 1.14.4 were used. Xfig is an open source drawing tool that runs under the X Window System. It is written in C and consists of around 75,000 lines of code. The design documentation of Xfig is not available, although the user manual and other useful information are available at the Xfig site [50]. Bash is a Unix shell, and the subset with which we experimented consists of 38K lines of source code [51]. Bash and Xfig have been used for architecture recovery experiments in [40], [41], [52]. Our data mining experiments are helpful in gaining understanding of the test systems by providing alternative views of the software.

The source files for the Xfig and Bash systems have been parsed using the Rigi tool and relevant ‘facts’ have been stored in an exchange format called the ‘Rigi Standard Format’ (RSF) [52], [53]. The transaction set discussed in section 4 was developed from this fact set.

Some useful statistics of the two systems are provided in the following table. It is relevant to note that the Xfig system consists of five major subsystems, whose source code files can be identified by their names. We have experimented with the 95 files in these sub-systems, leaving 4 files which have not been considered. We do not expect that the average figures and percentages obtained for Xfig will change substantially on inclusion of these 4 files.

System       Purpose                 Globals   Functions   Types
Xfig                                 1746      1661        828
  d_*files   Drawing tasks                     94
  e_*files   Editing tasks                     369
  f_*files   File related tasks                139
  u_*files   Utility files                     422
  w_*files   Window related tasks              637
Bash                                 539       892         198

Table 2: Statistics of the Bash and Xfig systems

5.2 Analysis of results

In this section, we present the results of applying association rule mining to Xfig and Bash and analyze the results using the patterns identified in section 4.


Pattern 1: Reduce global variable usage

[Chart: Bash - Global variable usage. x-axis: Global variables; y-axis: Number of functions accessing a global variable.]

Figure 1: Number of functions accessing a global variable (Bash)

Figure 1 shows the usage of global variables by functions in Bash, where 543 functions access global variables. The average number of functions accessing a global variable is 4.12 with a standard deviation of 5.70. The global variable rl_point is accessed by the maximum number of functions (62) with coverage 0.07.

[Pie chart: Bash - Global variable usage. Globals accessed by more than 10 functions: 7%; globals accessed by 2 - 9 functions: 65%; globals accessed by 1 function: 28%.]

Figure 2: Global variable access breakdown (Bash)

As illustrated in Figure 2, 28% of the global variables are accessed by one function only. Similar results are obtained for Xfig (Figure 3, Figure 4), where the average number of functions accessing a global variable is 5.97 with a standard deviation of 15.73, and 22% of the global variables are accessed by one function only. The variable XtStrings^* is accessed by the maximum number of functions (231) with a coverage of 0.139.

[Chart: Xfig - Global variable usage. x-axis: Global variables; y-axis: Number of functions accessing a global variable.]

Figure 3: Number of functions accessing a global variable (Xfig)

[Pie chart: Xfig - Global variable usage. Globals accessed by more than 10 functions: 10%; globals accessed by 2 - 9 functions: 68%; globals accessed by 1 function: 22%.]

Figure 4: Global variable access breakdown (Xfig)

It is evident from Figure 1 - Figure 4 that a substantial proportion of global variables are being accessed by very few functions in Bash and Xfig. It can also be observed that some global variables are accessed by a single function. The fact that the variables have been defined as global shows a poor design or a design that has deteriorated over time. To enhance software quality, it will be useful to reduce such a large number of global variables by localizing them.

Pattern 2: Select appropriate storage classes

Table 3 shows the global variables accessed by the largest number of functions in Bash and Xfig. Only the top three global variables are shown for each system.

System   Number of functions   Global variable   Coverage
Bash     62                    rl_point          0.07
         46                    rl_end            0.052
         39                    __ctype_b         0.044
Xfig     231                   XtStrings^*       0.139
         211                   _ArgCount^*       0.127
         211                   _ArgList^*        0.127

Table 3: Global variables accessed by maximum number of functions
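The coverage values in Table 3 can be reproduced from the function totals in Table 2. The sketch below assumes that coverage is the number of functions accessing a variable divided by the total number of functions in the system; this assumption matches the reported figures. The variable names are written without the footnote marker.

```python
# Totals from Table 2 and access counts from Table 3 of this report.
TOTAL_FUNCTIONS = {"Bash": 892, "Xfig": 1661}

ACCESS_COUNTS = {
    "Bash": {"rl_point": 62, "rl_end": 46, "__ctype_b": 39},
    "Xfig": {"XtStrings": 231, "_ArgCount": 211, "_ArgList": 211},
}


def coverage(accessing_functions, total_functions):
    """Fraction of all functions in the system that access the variable."""
    return accessing_functions / total_functions


for system, counts in ACCESS_COUNTS.items():
    for name, count in counts.items():
        value = coverage(count, TOTAL_FUNCTIONS[system])
        print(f"{system:5s}{name:12s}{count:5d}  {value:.3f}")
```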

If the functions accessing these global variables are called frequently, the global variables are candidate register variables.

Pattern 3: Increase locality of reference

[Pie chart: Bash - Functions called by one function. Functions residing in same file: 74%; functions residing in different files: 26%.]

Figure 5: Functions called by one function only (Bash)

Figure 5 shows that Bash has 298 functions that are called by only one function. An examination of the code shows that 220 (74%) of these functions are present in the same file as the calling function, and 78 (26%) are in different files. It can be seen from Figure 6 that similar results are obtained for Xfig. 368 functions are called by only one function out of which 270 (73%) functions are present in the same file as the calling function, and 98 (27%) are present in a different file.

[Pie chart: Xfig - Functions called by one function. Functions residing in same file: 73%; functions residing in different files: 27%.]

Figure 6: Functions called by one function only (Xfig)

A valid reason for placing a called function in a different file from the calling function is that the called function may be a ‘utility’ function which has been placed together with other utility functions in a separate file. If this is not the case, the functions should be placed in the same file as the calling function to increase efficiency and ease maintenance by increasing locality of reference.

[Pie chart: Bash - Functions called together with high confidence and support > .001. Functions residing in same file: 51%; functions residing in different files: 49%.]

Figure 7: Functions that are called together (Bash)

In Bash, 3644 functions are called together with high confidence (0.7 – 1.0). The association between the functions xrealloc and xmalloc carries the highest support, 0.045, i.e. 31 functions call them together. The two functions are present in the same file. It can be seen from Figure 7 that 51% of the functions called together with support > .001 (1) reside in the same file and 49% reside in different files. If we consider all functions called together with high confidence and ignore support, 43% reside in the same file and 57% reside in different files.

In Xfig, 3584 functions are called together with high confidence. The association between the functions cleanup and set_action_object carries the highest support, 0.045, i.e. 59 functions call them together. The two functions are present in the same file. It can be seen from Figure 8 that 52% of the functions called together with support > .001 (1) reside in the same file and 48% reside in different files. If we consider all functions called together with high confidence and ignore support, the same percentage figures are obtained.

[Pie chart: Xfig - Functions called together with high confidence and support > 0.001. Functions residing in same file: 52%; functions residing in different files: 48%.]

Figure 8: Functions that are called together (Xfig)

Unless there are valid reasons for doing otherwise, functions called together by a large number of functions should be placed in the same file. Reasons for placing them in different files need to be examined with care.
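A small sketch of how the same-file and different-file breakdown could be computed, and the pairs needing examination listed, is shown below. The function name `split_by_file`, the pairs and the file layout are hypothetical.

```python
def split_by_file(pairs, file_of):
    """Split pairs of functions called together by whether they share a file.

    pairs   -- (function_a, function_b) pairs called together
    file_of -- maps each function name to the file that defines it
    Returns (same_file_pairs, different_file_pairs).
    """
    same, different = [], []
    for a, b in pairs:
        if file_of[a] == file_of[b]:
            same.append((a, b))
        else:
            different.append((a, b))
    return same, different


# Hypothetical pairs and file layout, for illustration only.
pairs = [
    ("open_log", "close_log"),
    ("open_log", "format_msg"),
    ("read_cfg", "parse_cfg"),
]
file_of = {
    "open_log": "log.c", "close_log": "log.c", "format_msg": "text.c",
    "read_cfg": "cfg.c", "parse_cfg": "cfg.c",
}

same, different = split_by_file(pairs, file_of)
print(f"same file: {len(same) / len(pairs):.0%}")
print("pairs to examine:", different)
```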

Pattern 4: Increase data modularity

In Bash, 1635 global variables are accessed together with high confidence, with 1549 of these accessed together with a confidence of 1. The highest support is 0.077 (42) for the variables rl_end and rl_point, meaning that 42 functions access these variables together with high confidence.

[Chart: Bash - Support for association between global variables. x-axis: Global variables; y-axis: Support.]

Figure 9: Global variables that are accessed together with high confidence and support > 2 (Bash)

The support figures in Figure 9 help us to decide which global variables should preferably be grouped together. Without these figures the decision may be difficult since a single global variable may be associated with a large number of other variables and it may not be possible to group them all together. Average support is 1.51 with a standard deviation of 1.61. Figure 10 shows the number of variables with which various global variables in Bash are highly associated. On an average, a global variable is associated with 6.15 other global variables with a standard deviation of 9.31.

[Chart: Bash - Global associations. x-axis: Global variables; y-axis: Number of variables associated with high confidence.]

Figure 10: Association between global variables (Bash)

In Xfig, 14983 global variables are associated with each other with high confidence, with 14030 of these associated with a confidence of 1.00. The highest support is 0.159 (211) for the variables _ArgList^* and _ArgCount^*, meaning that 211 functions access these variables together with high confidence. The support figures for Xfig are depicted in Figure 11. The average support is 2.13 with a standard deviation of 5.95, which is higher than the corresponding figure for Bash.

[Chart: Xfig - Support for association between global variables. x-axis: Global variables; y-axis: Support.]

Figure 11: Global variables that are accessed together with high confidence and support > 2 (Xfig)

Figure 12 shows the number of variables with which various global variables in Xfig are highly associated. On an average, a global variable is associated with 12.95 other global variables with a standard deviation of 18.87. The average number of global variables with which a variable is associated is higher for Xfig as compared to Bash. For global variables associated with high confidence, being accessed together by a large number of functions, it is suggested that they be placed in a single structure.

[Chart: Xfig - Global associations. x-axis: Global variables; y-axis: Number of variables associated with high confidence.]

Figure 12: Association between global variables (Xfig)

In Bash, 155 types are accessed together with high confidence, with 148 of these accessed together with a confidence of 1. The highest support is 0.92 (23) for the types KEYMAP_ENTRY_ARRAY and Keymap, meaning that 23 functions access these types together with high confidence.

[Chart: Bash - Support for association between types. x-axis: User defined types; y-axis: Support.]

Figure 13: User defined types that are accessed together with high confidence (Bash)

The support figures in Figure 13 help us to decide which user defined types should preferably be grouped together. Without these figures the decision may be difficult, since a single type may be associated with a large number of other types and it may not be possible to group them all together. The average support is 1.74 with a standard deviation of 2.53, very similar to the average support for global variables.

Figure 14 shows the number of types with which various user defined types in Bash are highly associated. On an average, a type is associated with 4.08 other types with a standard deviation of 5.75. It can be seen that quite a large number of types are associated with one other type only. If the support for such an association is high, the two types can easily be combined to form a structure. For types associated with a number of other types, support figures should be utilized to take an appropriate decision.

[Chart: Bash - Type associations. x-axis: User defined types; y-axis: Number of types associated with high confidence.]

Figure 14: Association between user defined types (Bash)

In Xfig, 422 user defined types are associated with each other with high confidence, with 378 of these associated with a confidence of 1.00. The highest support is 0.156 (213) for the types Arg and WidgetList, meaning that 213 functions access these types together with high confidence. The support figures for Xfig are depicted in Figure 15. The average support is 4.34 with a standard deviation of 13.94, which is once again higher than the corresponding figure for Bash.

[Chart: Xfig - Support for association between types. x-axis: User defined types; y-axis: Support.]

Figure 15: User defined types that are accessed together with high confidence (Xfig)

Figure 16 shows the number of types with which various types in Xfig are highly associated. On an average, a type is associated with 4.44 other types with a standard deviation of 5.12. The average number of types with which a type is associated is almost the same for Xfig and Bash. It can be seen that quite a large number of types are associated with one other type only, similar to the results for Bash.

[Chart: Xfig - Type associations. x-axis: User defined types; y-axis: Number of types associated with high confidence.]

Figure 16: Association between types (Xfig)


Pattern 5: Strengthen encapsulation

Figure 17 shows the number of functions that access a user defined type for Bash. The average number of functions that access a type is 10.84 with a standard deviation of 14.18. It can be seen that 71% of the types are accessed by 10 or fewer functions.

[Chart: Bash - Type accesses. x-axis: Types; y-axis: Number of functions accessing the types.]

Figure 17: Number of functions accessing a type (Bash)

In Xfig, the average is 26.42, with 67% of the types accessed by 10 or fewer functions. In order to promote information hiding, the types should be packaged with the functions accessing them. If the system is to be re-structured as an object-oriented system, the types are candidate classes and the accessing functions are member functions.

[Chart: Xfig - Type accesses. x-axis: Types; y-axis: Number of functions accessing the types.]

Figure 18: Number of functions accessing a type (Xfig)

Figure 19 shows the number of types with which various user defined types in Bash are associated with a confidence of 1. On an average, a type is associated with 4.77 other types with a standard deviation of 6.17. For Xfig, average is 5.18 with a standard deviation of 5.59. Associations are depicted in Figure 20.

[Chart: Bash - Type associations. x-axis: Types; y-axis: Number of associated types.]

Figure 19: Confidence 1 association between user defined types (Bash)

[Chart: Xfig - Type associations. x-axis: Types.]

Figure 20: Confidence 1 association between user defined types (Xfig)

Figure 21 and Figure 22 show the number of types that are associated with each global variable with a confidence of 1 for Bash and Xfig. The number is large, and because a global is normally associated with more than one type, it may be difficult to identify the type for which the global is to become a static data member in case of conversion to an object-oriented design. It may be appropriate to apply Pattern 4 to determine highly associated data types. If the types with which a global is associated are highly associated with each other, they may first be grouped together to form a class.

[Chart: Bash - Global Type associations. x-axis: Globals.]

Figure 21: Confidence 1 association between globals and user defined types (Bash)

[Chart: Xfig - Global Type associations. x-axis: Globals; y-axis: Number of types associated with a global.]

Figure 22: Confidence 1 association between globals and user defined types (Xfig)

Pattern 6: Identify utilities

Figure 23 shows the calls made by functions in Bash, where 694 functions make function calls. The average number of calls made is 3.18 with a standard deviation of 8.46. The highest number of calls (195) are made to the xmalloc function with a coverage of 0.219. An examination of the code shows that 75% of the functions to which 20 or more calls are made reside in the files general.c, error.c or variable.c. This indicates that utility functions are placed in separate files in Bash.

[Chart: Bash - Function calls made. x-axis: Functions; y-axis: Number of functions calling a function.]

Figure 23: Number of function calls made to functions (Bash)

As can be seen from Figure 24, similar results are obtained for Xfig (average number of calls made is 5.00 with standard deviation of 10.17). 979 functions make function calls. The highest number of calls (180) are made to the put_msg function with a coverage of 0.096.

[Chart: Xfig - Function calls made. x-axis: Functions; y-axis: Number of functions calling a function.]

Figure 24: Number of function calls made to functions (Xfig)

An examination of the code shows that there are 36 functions to which 20 or more calls are made and they reside in 19 different files. A listing of these files, along with the number of functions in each file, is given in Table 4 below:

File          Number of functions
Mathinline    2
string2       1
stdlib        2
f_util        2
mode          3
u_bound       1
u_elastic     1
u_markers     2
u_redraw      2
u_search      1
u_undo        2
w_color       1
w_cursor      3
w_drawprim    1
w_indpanel    2
w_modepanel   1
w_mousefun    2
w_msgpanel    3

Table 4: List of files containing frequently accessed functions (Xfig)

It can be seen that in Xfig, as compared to Bash, a larger number of files contain frequently accessed functions. The names of the files indicate that they contain utility functions. However, a more detailed look at the files is required to ascertain that this is indeed the case.

To identify the utility functions within various sub-systems of Xfig, associations between functions across the whole software system, and between functions within each sub-system, were noted. Figure 25 - Figure 29 show the calls made within the various sub-systems. The graphs are similar, indicating that function calls in each sub-system follow a similar trend.

[Chart: Xfig - Function calls made in the d_*files subsystem. x-axis: Functions; y-axis: Number of functions calling a function.]

Figure 25: Number of function calls made to functions (Xfig d_*files sub-system)

[Chart: Xfig - Function calls made in the e_*files subsystem. x-axis: Functions; y-axis: Number of functions calling a function.]

Figure 26: Number of function calls made to functions (Xfig e_*files sub-system)

[Chart: Xfig - Function calls made in the f_*files subsystem. x-axis: Functions; y-axis: Number of functions calling a function.]

Figure 27: Number of function calls made to functions (Xfig f_*files sub-system)

[Chart: Xfig - Function calls made in the u_*files subsystem. x-axis: Functions; y-axis: Number of calls made to a function.]

Figure 28: Number of function calls made to functions (Xfig u_*files sub-system)

[Chart: Xfig - Function calls made in the w_*files subsystem. x-axis: Functions; y-axis: Number of functions calling a function.]

Figure 29: Number of function calls made to functions (Xfig w_*files sub-system)

Table 5 shows some statistics of the function calls made. It can be observed that the average number of calls made in all the sub-systems is very similar.

Sub-system   Average number of calls made   Standard deviation   Highest number of calls made (function)   Coverage
d_*files     3.27                           5.00                 37 (draw_mousefun_canvas)                 0.394
e_*files     3.44                           5.19                 46 (set_cursor)                           0.125
f_*files     2.03                           2.98                 38 (file_msg)                             0.273
u_*files     3.09                           3.70                 33 (set_action_object)                    0.078
w_*files     3.02                           5.13                 69 (put_msg)                              0.108

Table 5: Function calls made in various sub-systems

Table 6 shows the function calls that are made to various functions by functions within a sub-system. It appears that the u_*files sub-system contains utility functions, because the maximum number of functions are called from this sub-system. This was our initial assumption, which has been confirmed through the results obtained. The d_*files sub-system functions make more calls to functions within the u_*files sub-system than to any other sub-system. However, the functions in the other sub-systems make more calls to functions within their own sub-system, with the second highest number of calls going to the u_*files sub-system. Function calls are made to functions in all other sub-systems, but the d_*files sub-system appears to contain the least number of utilities. These figures show that the Xfig system is quite well structured, but its structure can be improved by further localizing the calls to the calling function's own sub-system or to the u_*files sub-system.

Table 7 shows call results for functions to which 5 or more calls are made. Results are similar to those in Table 6, with the maximum number of functions being called from the u_*files sub-system and no calls to the d_*files sub-system functions except from functions within d_*files.

Calls made to functions in:
Sub-system   d_*files   e_*files   f_*files   u_*files   w_*files   misc. files
d_*files     49         1          1          53         13         6
e_*files     6          224        16         177        51         11
f_*files     1          2          113        48         28         6
u_*files     2          10         9          302        43         10
w_*files     2          7          20         39         301        16
Total        60         244        159        619        436        49

Table 6: Number of function calls made to functions in each sub-system
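The observations drawn from Table 6 can be checked with a short calculation. The sketch below copies the rows of Table 6 and computes, for each calling sub-system, the share of its calls that stay within the sub-system and the share that go to the u_*files sub-system; the names `CALLS`, `TARGETS` and `shares` are introduced here for illustration only.

```python
# Columns: calls made to functions in d_*, e_*, f_*, u_*, w_* and misc. files.
TARGETS = ["d", "e", "f", "u", "w", "misc"]

# Rows of Table 6, keyed by the calling sub-system.
CALLS = {
    "d": [49, 1, 1, 53, 13, 6],
    "e": [6, 224, 16, 177, 51, 11],
    "f": [1, 2, 113, 48, 28, 6],
    "u": [2, 10, 9, 302, 43, 10],
    "w": [2, 7, 20, 39, 301, 16],
}


def shares(source):
    """Return (share of calls to own sub-system, share of calls to u_*files)."""
    row = CALLS[source]
    total = sum(row)
    own = row[TARGETS.index(source)] / total
    utility = row[TARGETS.index("u")] / total
    return own, utility


for source in CALLS:
    own, utility = shares(source)
    print(f"{source}_*files: own sub-system {own:.1%}, u_*files {utility:.1%}")
```

Only the d_*files row has a larger share of calls going to u_*files than to its own sub-system, which agrees with the discussion of Table 6.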

Calls made to functions in:
Sub-system   d_*files   e_*files   f_*files   u_*files   w_*files   misc. files
d_*files     1          -          -          9          6          2
e_*files     -          12         1          55         12         7
f_*files     -          -          7          -          2          1
u_*files     -          -          -          43         11         6
w_*files     -          1          4          3          33         6
Total        1          13         12         110        64         22

Table 7: Five or more function calls made to functions in each sub-system

Pattern 7: Localize structures

An application of the association rule shows that in Bash, 115 global variables are used by only one function, whereas in Xfig, there are 310 such global variables. This can also be verified by the statistics presented in Figure 2 and Figure 4. Thus a significant percentage of global variables in both systems is used by just one function. It is recommended that such global variables be localized.

Pattern 8: Beware of side effects

An application of the association rule results in a listing of functions that access the same global variables. Figure 1 shows the number of functions that access common global variables for Bash and Figure 3 shows the number for Xfig. These functions are highly coupled and any changes to the functions involving the global variables should be made carefully.

Pattern 9: Generate alternative views

Table 8 summarizes the statistics for types accessed by functions in various sub-systems of Xfig. Figure 30 - Figure 34 show the types accessed within the sub-systems.

Sub-system   Average number of    Standard     Highest number of     Coverage
             accesses to types    deviation    accesses (Type)
d_*files      6.28                 5.92         22 (Cursor)           0.234
e_*files     17.34                25.89         99 (F_compound)       0.268
f_*files      6.49                 8.21         37 (FILE)             0.266
u_*files     15.10                25.33        127 (F_compound)       0.301
w_*files     16.74                35.40        257 (WidgetList)       0.403

Table 8: Accesses to user defined types in various Xfig sub-systems
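Summary statistics of this kind can be derived from per-type access counts as sketched below. The sketch rests on two assumptions: that the input maps each type to the number of functions accessing it within one sub-system, and that the population form of the standard deviation is used (this section does not state which form applies). The coverage column is not computed because its definition is not restated here. The counts shown are hypothetical.

```python
from statistics import mean, pstdev


def type_access_stats(counts):
    """Summarize how many functions access each type in one sub-system.

    `counts` maps a type name to the number of functions accessing it.
    """
    top_type = max(counts, key=counts.get)
    return {
        "average": mean(counts.values()),
        # Assumption: population standard deviation.
        "std_dev": pstdev(counts.values()),
        "highest": (top_type, counts[top_type]),
    }


# Hypothetical counts for illustration only.
counts = {"Cursor": 8, "Window": 4, "FILE": 2, "size_t": 2}
stats = type_access_stats(counts)
print(stats)
```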

[Bar chart: Xfig - Types accessed within d_*files subsystem. X-axis: Types. Y-axis: Number of functions accessing a type.]

Figure 30: Types accessed by a subsystem (d_*files)

[Bar chart: Xfig - Types accessed within e_*files subsystem. X-axis: Types. Y-axis: Number of functions accessing a type.]

Figure 31: Types accessed by a subsystem (e_*files)


[Bar chart: Xfig - Types accessed within f_*files subsystem. X-axis: Types. Y-axis: Number of functions accessing a type.]

Figure 32: Types accessed by a subsystem (f_*files)

[Bar chart: Xfig - Types accessed within u_*files subsystem. X-axis: Types. Y-axis: Number of functions accessing a type.]

Figure 33: Types accessed by a subsystem (u_*files)

[Bar chart: Xfig - Types accessed by w_*files subsystem. X-axis: Types. Y-axis: Number of functions accessing a type.]

Figure 34: Types accessed by a subsystem (w_*files)

It can be seen that some types are associated with a single sub-system only, whereas others, e.g. F_line, F_spline and F_arc, are accessed by all sub-systems. The reason is that these shapes are drawn in the d_*files sub-system, edited in the e_*files sub-system, and so on. Thus, for an object-oriented view, these types will be associated with functions across sub-systems. On the other hand, types accessed within only one sub-system may be associated with functions within that sub-system.
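The distinction between sub-system-local and shared types can be sketched in code. This is a minimal illustration, assuming the facts are available as (sub-system, type) pairs; `partition_types` is a hypothetical name, and the sample data are an illustrative fragment in which `MenuState` is an invented type name.

```python
from collections import defaultdict


def partition_types(accesses):
    """Split types into those local to one sub-system and those shared.

    `accesses` is an iterable of (sub_system, type_name) pairs.
    Returns (local, shared): `local` maps a type to its only sub-system,
    `shared` maps a type to the sorted list of sub-systems accessing it.
    """
    owners = defaultdict(set)
    for sub_system, type_name in accesses:
        owners[type_name].add(sub_system)

    local = {t: next(iter(s)) for t, s in owners.items() if len(s) == 1}
    shared = {t: sorted(s) for t, s in owners.items() if len(s) > 1}
    return local, shared


# Illustrative fragment only; MenuState is a hypothetical type name.
accesses = [
    ("d", "F_line"),
    ("e", "F_line"),
    ("w", "MenuState"),
]
local, shared = partition_types(accesses)
print(local, shared)
```

In an object-oriented view, the types in `local` are natural candidates for classes whose methods come from one sub-system, while those in `shared` gather functions from several sub-systems.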


6. Conclusions

In this report we applied association rule mining to the problem of understanding a software system given only the source code. We analyzed the structure of two legacy systems, Bash and Xfig, and extracted meaningful association rules and patterns which provide useful insight into the software’s overall structure. As illustrated, these patterns can be used to re-structure the code for maintainability and, if required, to re-modularize the code, e.g. by converting a structured design to an object-oriented design. Carrying out the same tasks by manual inspection would have taken much longer.

Our experiments with Xfig and Bash yield similar average and percentage values for the patterns discussed. This similarity may point to characteristics and trends common to open source legacy systems.

In the future, we intend to mine associations between items other than those explored here, e.g. between the input and output parameters of functions. Furthermore, the patterns should be applied to other software systems in order to validate the results obtained with Xfig and Bash, and perhaps to reveal other interesting properties of legacy systems.


References

[1] I. Sommerville, Software Engineering, Fifth Edition, Addison Wesley, 2000.

[2] R.S. Arnold, Software Reengineering, IEEE Computer Society Press, 1993.

[3] R.S. Pressman, Software Engineering A Practitioner’s Approach, Fifth Edition, Mc Graw Hill, 2001.

[4] S.L. Pfleeger, Software Engineering Theory and Practice, Prentice Hall, 1998.

[5] R.L. Glass, Frequently Forgotten Fundamental Facts about Software Engineering, IEEE Software, May/June 2001.

[6] T.J.Biggerstaff, “Design Recovery for Maintenance and Reuse”. IEEE Computer, 22(7), pages 36-49, July 1989.

[7] H.A. Muller, M. Story, J.H. Jahnke, D.B. Smith, A.R. Tilley, K. Wong, “Reverse Engineering: A Roadmap”, The 22nd International Conference on Software Engineering (ICSE’00), June 2000

[8] G. Parikh, N. Zvegintzov, Tutorial on Software Maintenance, IEEE Computer Society Press, 1983.

[9] R.P. Hall, Seven Ways to Cut Software Maintenance Costs, Datamation, July 1987.

[10] M.T.Harandi, J.Q.Ning, “Knowledge-Based Program Analysis”, IEEE Software 7(1), pages 74-81, January 1990 .

[11] Rich, Wills “Recognizing a program’s design: A graph parsing approach”, IEEE Software, 7(1), pages 82-89, January 1990.

[12] A.Quilici, “A Memory-Based Approach to Recognizing Programming Plans”. Communications of the ACM, 37(5), pages 84-93, May 1994.

[13] H. A. Müller, K. Wong, and S. R. Tilley “Understanding software systems using reverse engineering technology.” The 62nd Congress of L'Association Canadienne Francaise pour l'Avancement des Sciences Proceedings (ACFAS) 1994.

[14] H.M.Fahmy, R.C.Holt, J.R.Cordy , “Wins and Losses of Algebraic Transformations of Software Architectures”. Automated Software Engineering ASE 2001, San Diego, California, November 26-29, 2001.

[15] R.Kazman, S.J.Carrière, "View Extraction and View Fusion in Architectural Understanding". The 5th International Conference on Software Reuse, Victoria, BC, Canada, June 1998.

[16] J.S. Shirabad, T.C. Lethbridge, S. Matwin, Supporting software maintenance by mining software update records, International Conference on Software Maintenance, (ICSM) 2001.

[17] Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000.

[18] C. Montes de Oca, D. L .Carver, Identification of Data Cohesive Subsystems using Data Mining Techniques, International Conference on Software Maintenance, (ICSM) November 1998.

[19] C. Montes de Oca, D. L .Carver, A Visual Representation Model for Software Subsystem Decomposition, Working Conference on Reverse Engineering (WCRE'98), October, 1998.

[20] K. Sartipi, K. Kontogiannis, F. Mavaddat, Architectural Design Recovery Using Data Mining Techniques, Conference on Software Maintenance and Reengineering (CSMR’00), February, 2000.

[21] Tjortjis, C., Sinos, L., Layzell, P., Facilitating Program Comprehension by Mining Association Rules from Source Code, 11th IEEE International Workshop on Program Comprehension (IWPC'03), May, 2003.

34

http://csdl.computer.org/comp/proceedings/iwpc/2003/1883/00/1883toc.htm

http://csdl.computer.org/comp/proceedings/csmr/2000/0546/00/0546toc.htm

http://csdl.computer.org/comp/proceedings/wcre/1998/8967/00/8967toc.htm

http://www.directtextbook.com/publisher/morgan-kaufmann

http://www.directtextbook.com/title/data-mining-concepts-and-techniques

http://portal.acm.org/citation.cfm?id=782045&coll=ACM&dl=ACM&CFID=21568880&CFTOKEN=55796980


[22] A. Michail, Data mining library reuse patterns using generalized association rules, Proceedings of the 22nd international conference on Software engineering, June 2000.

[23] A. Michail, Data Mining Library Reuse Patterns in User-Selected Applications, 14th IEEE International Conference on Automated Software Engineering , October, 1999.

[24] J.S. Shirabad, T.C. Lethbridge, S. Matwin, Mining the maintenance history of a legacy software system, International Conference on Software Maintenance, (ICSM) , 2003.

[25] J.S. Shirabad, T.C. Lethbridge, S. Matwin, Mining the software change repository of a legacy telephony system, Proceedings of the 1st International Workshop on Mining Software Repositories, 2004.

[26] T. Zimmermann, P. Weibgerber, S. Diehl, A. Zeller, Mining version histories to guide software changes, Proceedings of the 26th International Conference on Software Engineering (ICSE) 2004.

[27] M. El-Ramly, E. Stroulia, Mining software usage data, Proceedings of the 1 st International Workshop on Mining Software Repositories, 2004.

[28] Z. Balanyi, R. Ferenc, Mining design patterns from C++ source code, International Conference on Software Maintenance (ICSM) 2003.

[29] Y. Kanellopoulos, C. Tjortjis, Data mining source code to facilitate program comprehension: Experiments on clustering data retrieved from C++ programs, International Workshop on Program Comprehension (IWPC) 2004,

[30] R. Amin, M. Cinneide, T. Veale, LASER: A lexixal approach to analogy in software reuse, Proceedings of the 1st International Workshop on Mining Software Repositories, 2004.

[31] F. McCarey, M. Cinneide, N. Kushmerick, A case study on recommending reusable software components using collaborative filtering, Proceedings of the 1st International Workshop on Mining Software Repositories, 2004.

[32] Y. Yusof, O. F. Rana, Template mining in source-code digital libraries, Proceedings of the 1st

International Workshop on Mining Software Repositories, 2004.

[33] P. K. Garg, T. Gschwind, K. Inoue, Multi-project software engineering: An example, Proceedings of the 1st International Workshop on Mining Software Repositories, 2004.

[34] Demeyer, S., Ducasse, S., Nierstrasz, O., Object-Oriented Reengineering Patterns, Morgan Kaufmann, 2003.

[35] Website http://www.iam.unibe.ch/~scg/Archive/famoos/

[36] V. Tzerpos, R.C. Holt, “Software Botryology: Automatic Clustering of Software Systems”, Ninth International Workshop on Database and Expert Systems Applications (DEXA’98), August 1998.

[37] T.A. Wiggerts, “Using clustering algorithms in legacy systems remodularization,” Fourth Working Conference on Reverse Engineering (WCRE’97), October 1997.

[38] N.Anquetil and T.C.Lethbridge, “Experiments with clustering as a software remodularization method,” The Sixth Working Conference on Reverse Engineering (WCRE’99), 1999.

[39] J.Davey and E.Burd, “Evaluating the Suitability of Data Clustering for Software Remodularization”, The Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia, 2000.

[40] M.Saeed, O.Maqbool, H.A.Babri, S.M. Sarwar, S.Z. Hassan “Software Clustering Techniques and the Use of the Combined Algorithm”, Conference on Software Maintenance and Re-engineering (CSMR’03), March 2003.

35

http://www.iam.unibe.ch/~scg/Archive/famoos/

http://www.directtextbook.com/publisher/morgan-kaufmann

http://csdl.computer.org/comp/proceedings/ase/1999/0415/00/0415toc.htm

http://csdl.computer.org/comp/proceedings/ase/1999/0415/00/0415toc.htm


[41] O.Maqbool, H.A.Babri, “The Weighted Combined Algorithm: A Linkage Algorithm for Software Clustering”, Conference on Software Maintenance and Re-engineering (CSMR’04), March 2004.

[42] Website MSR 2004 http://msr.uwaterloo.ca

[43] B.W. Kernighan, D.M. Ritchie, The C Programming Language, Prentice Hall, 1988.

[44] S. McConnell, Code Complete A Practical Handbook of Software Construction, Microsoft Press, 1993.

[45] M.A. Weiss, Efficient C Programming A Practical Approach, Prentice-Hall, 1995.

[46] Website http://publications.gbdirect.co.uk/c_book/

[47] B. Meyer, Object-Oriented Software Construction, Prentice Hall, 1988.

[48] R.B. Murray, C++ Strategies and Tactics, Addison Wesley, 1993.

[49] K. Wong, S. Tilley, H. Muller, M.A. Storey, Structural Redocumentation: A Case Study, IEEE Software, January, 1995.

[50] Website Xfig http://www.xfig.org

[51] R. Koschke, “Atomic Architectural Component Recovery for Program Understanding and Evolution”, PhD Thesis, University of Stuttgart, 2000.

[52] Website http://www.bauhaus-stuttgart.de/bauhaus

[53] J.Martin, K.Wong, B. Winter and H.A.Müller, “Analyzing xfig using the Rigi Tool Suite”, The Seventh Working Conference on Reverse Engineering (WCRE'00), Brisbane, Australia, 2000.
