
SOFTWARE MEASUREMENT EUROPEAN FORUM

PROCEEDINGS

28 - 29 May 2009

NH LEONARDO da VINCI, ROME (ITALY)

EDITOR Ton Dekkers

Galorath International Ltd.

COVER PHOTO Ton Dekkers


CONFERENCE OFFICERS
Software Measurement European Forum 2009

Conference Manager
Cristina Ferrarotti, Istituto Internasionale di Ricerca Srl., Italy

Conference Chairperson
Roberto Meli, DPO - Data Processing Organization, Italy

Program Committee Chairperson
Ton Dekkers, Galorath International Ltd., The Netherlands

Program Committee
Silvia Mara Abrahão, Universidad Politecnica de Valencia, Spain
Prof. Alain Abran, École de Technologie Supérieure / Université du Québec, Canada
Dr. Klaas van den Berg, University of Twente, The Netherlands
Dr. Luigi Buglione, ETS - Université du Québec / Nexen (Gruppo Engineering), Italy
Manfred Bundschuh, DASMA e.V., Germany
Thomas M. Cagley Jr., David Consulting Group Inc, U.S.A.
Prof. Gerardo Canfora, University of Sannio, Italy
Carol Dekkers, Quality Plus Technologies, Inc, U.S.A.
Prof. Dr. Reiner Dumke, University of Magdeburg, Germany
Dr. Christof Ebert, Vector Consulting, Germany
Cao Ji, China Software Process Union, People's Republic of China
Dr. Thomas Fehlmann, Euro Project Office AG, Switzerland
Pekka Forselius, 4SUM Partners, Finland
Dan Galorath, Galorath Inc, U.S.A.
Harold van Heeringen, Sogeti Nederland B.V., The Netherlands
Rob Kusters, Eindhoven University of Technology / Open University, The Netherlands
Andy Langridge, riskHive Ltd, U.K.
Dr. Nicoletta Lucchetti, SOGEI, Italy
Sandro Morasca, Università dell'Insubria, Italy
Pam Morris, Total Metrics, Australia
Dr. Jürgen Münch, Fraunhofer IESE, Germany
Marie O'Neill, COSMIC, Ireland
Serge Oligny, Bell Canada, Canada
Habib Sedehi, University of Rome, Italy
Charles Symons, United Kingdom
Frank Vogelezang, Sogeti Nederland B.V., The Netherlands
Pradeep Waychal, Patni Computer Systems Ltd, India
Hefei Zhang, Samsung, People's Republic of China


The Software Measurement European Forum 2009 is the sixth edition of the successful event for the software measurement community, held previously in Rome in 2004, 2005, 2006 and 2007. In 2008 SMEF was located in Milan, one of the most important European business cities. Now we're back in Rome.

The two Italian market and competence leaders, Istituto Internasionale di Ricerca (http://www.iir-italy.it) and Data Processing Organization (http://www.dpo.it), which managed the previous successful editions, continued their collaboration with Ton Dekkers (Galorath International) on the Program Management.

SMEF aspires to remain a leading worldwide event for the exchange of experience and the improvement of knowledge on a subject that is critical for ICT governance.

The Software Measurement European Forum will provide an opportunity for the publication of the latest research, industrial experiences, case studies, tutorials and best practices on software measurement and estimation. This includes measurement education, methods, tools, processes, standards and the practical application of measures for management, risk management, contractual and improvement purposes. Practitioners from different countries will share their experience and discuss current problems and solutions related to ICT governance based on software metrics.

The focus chosen for the 2009 event is: "Making software measurement be actually used". Although software measurement is a young discipline, it has quickly evolved and many problems have been adequately solved in conceptual frameworks. Available methods, standards and tools are certainly mature enough to be used effectively to manage the software production processes (development, maintenance and service provision). Unfortunately, only a few organisations, compared to the potentially interested target group, have brought the measurement process from theory to practice. Therefore, contributions related to implementation were requested in order to analyse the causes of, and possible solutions to, the problem of measurement as a "useful but not used" capability. Papers were welcomed that provide guidance on establishing and actually using a roadmap to cover the distance from theory to practice.


With this in mind, the Program Committee asked for papers focusing on the following areas:

• Focus Area: Making software measurement be actually used.
• How to integrate Software Measurement into Software Production Processes.
• Measurement-driven decision making processes.
• Data reporting and dashboards.
• Software measurement and Balanced Scorecard approaches.
• Service measurement methods (SLA and KPI).
• Software measurement in contractual agreements.
• Software measurement and standard process models (ITIL, ASL, ...).
• Software measurement and Software Maturity Models (CMM-I, SPICE, Six Sigma, ...).
• Software measurement in a SOA environment.
• Functional size measurement methods (IFPUG and COSMIC tracks).
• (Software) Measurement to control risks.
• Software Reuse metrics.

It's quite ambitious to cover all of this, but we are confident that the presentations at the conference and the papers in the proceedings will cover most of the issues. The angles chosen vary from theory to practice and from academic to pragmatic. This interesting mix of viewpoints will offer you a number of potential (quick) wins to take back.

Cristina Ferrarotti, Roberto Meli, Ton Dekkers
May 2009


TABLE OF CONTENT

DAY ONE - MAY 28

KEYNOTE: How to bring "software measurement" from research labs to the operational business processes
     Roberto Meli

1    An empirical study in a global organisation on software measurement in Software Maturity Model-based software process improvement
     Rob Kusters, Jos Trienekens, Jana Samalikova

11   Practical viewpoints for improving software measurement utilisation
     Jari Soini

25   IFPUG function points or COSMIC function points?
     Gianfranco Lanza


DAY TWO - MAY 29

37   Personality and analogy-based project estimation
     Martin Shepperd, Carolyn Mair, Miriam Martincova, Mark Stephens

49   Effective project portfolio balance by forecasting project's expected effort and its break down according to competence spread over expected project duration
     Gaetano Lombardi, Vania Toccalini

59   Defect Density Prediction with Six Sigma
     Thomas Fehlmann

67   Using metrics to evaluate user interfaces automatically
     Izzat Alsmadi, Muhammad AlKaabi

77   Estimating Web Application Development Effort Employing COSMIC: A Comparison between the use of a Cross-Company and a Single-Company Dataset
     Filomena Ferrucci, Carmine Gravino, Sergio Di Martino, Luigi Buglione

91   From performance measurement to project estimating using COSMIC functional sizing
     Cigdem Gencel, Charles Symons

105  A 'middle-out' approach to Balanced Scorecard (BSC) design and implementation for Service Management: Case Study in Off-shore IT-Service Delivery
     Srinivasa-Desikan Raghavan, Monika Sethi, Dayal Sunder Singh, Subhash Jogia

117  FP in RAI: the implementation of software evaluation process
     Marina Fiore, Anna Perrone, Monica Persello, Giorgio Poggioli

129  KEYNOTE: Implementing a Metrics Program: MOUSE will help you
     Ton Dekkers

APPENDIX

143  Program / Abstracts

159  Author's affiliations


Results of an empirical study on measurement in Maturity Model-based software process improvement

Rob Kusters, Jos Trienekens, Jana Samalikova

Abstract
This paper reports on a survey amongst software groups in a multinational organisation. The paper presents and discusses the metrics that are being recognised in these software groups. Its goal is to provide useful information to organisations with or without a software measurement program. An organisation without a measurement program may use this paper to determine which metrics can be used on the different CMM levels, and possibly prevent the organisation from using metrics which are not yet effective. An organisation with a measurement program may use this paper to compare its metrics with the metrics identified here and, if appropriate, add or delete metrics accordingly.

1. Introduction
Although both CMM and its successor model CMMI address the measurement of process improvement activities on their different levels, they do not prescribe how software measurement has to be developed, nor what kind of metrics should be used. In accordance with experiences from practice, software measurement requires well-defined metrics, stable data collection, and structured analysis and reporting processes [10], [4]. The use of measurement in improving the management of software development is promoted by many experts and in the literature. E.g. the SEI provides sets of software measures that can be used on the different levels of the Capability Maturity Model, see [1]. In [7] an approach is presented for the development of metrics plans, which also gives an overview of the types of metrics (i.e. project, process, product) that can be used on the different levels of maturity. To investigate the status and quality of software metrics in the different software groups at Philips, its Software Process Improvement (SPI) Steering Committee decided to perform an empirical research project in the global organisation.

The goal of this paper is to identify, on the basis of empirical research data, the actual use of metrics in the software development groups of the Philips organisation. The data have been collected in the various software groups of this organisation. The main research questions are:

• What types of metrics are recommended in literature regarding their usage on the different CMM levels?
• What types of metrics are used in software groups?
• What types of metrics are used in the software groups on the different CMM levels?
• How do the empirical research findings on metrics usage contribute to literature?

In Section 2 of this paper relevant literature on the subject is addressed. Section 3 presents the approach that has been followed to carry out the survey. Some demographics of the survey are presented in Section 4. Section 5 addresses the results of the survey and Section 6 presents the conclusions.


2. Usage of metrics versus the levels of the CMM: a literature overview
As the maturity level of a software development organisation increases, additional development issues should be addressed, e.g. project management, engineering, support, and process management issues, see e.g. [1]. To monitor and control these issues, it would seem reasonable that these additional issues would prompt the usage of additional metrics. Based on this consideration, a literature review on the relationship between the CMM level attained and the type of metrics used has been carried out. Surprisingly, only a few papers were found that deal with this issue. We discuss them in the remainder of this section.

In [8] the straightforward position is taken that any software development organisation requires five core metrics: time, effort, size, reliability and process productivity. The authors do not address in their book a possible connection between the number and type of metrics and the CMM levels; however, they state that these five core metrics should form a sufficient basis for control. From a CMM perspective, 'control' is a level 2 issue, and one could deduce that in essence the five metrics are needed for monitoring and control on the so-called Managed level 2. However, organisations on the Initial level 1 will also at least aspire towards control, and these five metrics will be useful on that level as well. Other authors do differentiate between metric usage and the CMM level attained. We will discuss a number of publications in the following. We would like to keep in mind the logical assumption that when a particular type of metric is advised for a certain level, it is obviously also useful at the higher CMM levels.

The Software Engineering Institute, i.e. [1], provides guidelines for metric usage related to the maturity level attained. For level 1 no guidelines are provided. Towards level 2, project metrics (progress, effort, and cost) and simple project stability metrics are advocated. At level 2, recording of planned and actual results on progress, effort and cost is required. Towards level 3, ranges are added for the project metrics, and at level 4 so-called 'control limits' complete the total project control picture. At level 2, product quality metrics are also introduced, and metrics are added to these at levels 3 and 4, similarly to the foregoing project metrics, to provide the options to go from recording or registration of product quality aspects towards explanation and control. Further, starting from level 2 computer resource metrics are advocated, and starting from level 3 training metrics. Surprisingly, these computer resource and training metrics are not mentioned in any of the other publications that we investigated. At level 5, finally, metrics are added that should indicate the degree of process change on a number of issues, together with the actual and planned costs and benefits of this process change.

In [7] a direct line is drawn between the CMM level attained and the metrics that should be used. Metrics are to be implemented step by step in five levels, corresponding to the maturity level of the development process. For level 1, some elementary project metrics (such as product size and staff effort) are advised, in order to provide a baseline rate of e.g. productivity. These baseline metrics should provide a basis for comparison as improvements are made and maturity increases. Logically deducing metrics requirements from the characteristics of each level, the authors further indicate that level 2 requires project-related metrics (e.g. software size, personnel effort, requirements volatility), level 3 requires product-related metrics (e.g. quality and product complexity), and level 4 so-called process-wide metrics (e.g. amount of reuse, defect identification, configuration management, etc.). No recommendations are provided for level 5.


In [11] some advice is also provided, and in essence the SEI approach is followed by these authors. However, the notion of baselining, advocated by [7], is adopted by them. So they advocate using baseline metrics at level 1, which they refer to as post-shipment metrics (e.g. defects, size, cost, schedule). At level 2 they, similar to [7], aim at project control, but, similar to [1], they also include product quality (e.g. defect) information. On levels 3 and 4 they provide similar advice to [1]. However, at level 4 they add baseline metrics for process change, to support with baseline data the process change activities and associated metrics at level 5, e.g. on the transfer of new technology.

Some differences between these approaches can be identified:
• In [7] the usage of baseline metrics for level 1 is advised; in [1] this is refrained from, but [11] takes up this point.
• In [7] product quality metrics are discussed starting from level 3, while in [1] this is started at level 2.
• In [7] process-wide metrics are introduced at level 4, while in [1] the authors wait till level 5. In [11] the same position as in [1] is taken, but baseline metrics are added to this at level 4.

Based on these papers, we can deduce the following framework of assumptions for the usage of metrics on the different CMM levels of maturity (restated as a simple lookup table in the sketch after this list):
• Level-1 organisations will focus on baseline metrics.
• Level-2 organisations will add to this project-related metrics and, according to [11] and [1], also product quality metrics. However, the usage of the latter type of metrics on this level is not in accordance with [7].
• Level-3 will certainly contain product metrics, both on quality and complexity.
• Level-4 is less clear. According to [7] 'process wide' metrics will be included. In [11] this is agreed on, but only at a baseline level.
• Level-5 will definitely require 'process wide' metrics.
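The sketch below restates this framework of assumptions as a lookup table from CMM level to the metric categories expected by the reviewed literature. The category labels and the treatment of level 4 as uncertain are an editorial paraphrase of the list above, not a definition taken from the cited sources.

from typing import Set

# Framework of assumptions (paraphrased): which metric categories the
# reviewed literature leads us to expect at each CMM maturity level.
EXPECTED_METRIC_CATEGORIES = {
    1: {"baseline"},
    2: {"baseline", "project", "product quality"},  # product quality per [1]/[11], not [7]
    3: {"baseline", "project", "product quality", "product complexity"},
    4: {"baseline", "project", "product quality", "product complexity", "process-wide?"},  # less clear
    5: {"baseline", "project", "product quality", "product complexity", "process-wide"},
}

def expected_categories(cmm_level: int) -> Set[str]:
    """Metric categories the literature framework expects at a given CMM level."""
    return EXPECTED_METRIC_CATEGORIES.get(cmm_level, set())

print(sorted(expected_categories(2)))  # ['baseline', 'product quality', 'project']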

Summarising, it is clear that we can expect the number and diversity of metrics to increase with the maturity level. From literature some survey-based evidence is available. In [5] it is shown that elementary metrics, such as size, cost, effort and lead time, are used at the lower maturity levels, and it is also shown that they precede the usage of more complex metrics, such as on product quality and review effectiveness, at the higher maturity levels. Similar results were found in [2]. In [6] metric usage in high-maturity organisations is looked at. Metrics regarding size, effort, schedule, defects and risk were uniformly found. With the exception of 'risk', these metrics correspond directly to the core metrics as identified in [8] and the baseline metrics as suggested by [11]. Surprisingly, although metrics were used for process analysis and improvement, no specific process improvement oriented metrics are mentioned in [6].

In this paper we will use survey data to compare these with our framework of assumptions.


3. The survey
The survey was carried out in the software groups of Philips. In these software groups, software is developed for a large product portfolio, ranging from shavers to TVs to X-ray equipment. Software size and complexity are increasing rapidly and the total software staff is growing continuously, from less than 2,000 in the early 90s to more than 5,000 now. More and more software projects require more than 100 staff-years, located on multiple development sites. The software groups also have to deal with the fact that the share of software developed by third parties is rapidly increasing.

Software Process Improvement (SPI) in Philips started in 1992. CMM is used for SPI because CMM is well defined and the measurement of the process improvements is done by objective assessors, supported by jointly developed CMM interpretation guidelines [9]. Although Philips has switched to the successor model CMM Integration (CMMI), the data discussed in this paper refer to the original CMM. Within Philips, software capabilities and achievements are measured at corporate level by registering the achieved CMM levels of the company's software groups. As a result, the measurements are comparable across the organisation and the software process capability (CMM) level of each individual software group is known. Philips developed a cross-organisational infrastructure for SPI. From the organisation perspective, SPI is organised top-down as follows: Core Development Council, SPI Steering Committee, SPI Coordinators and Software Groups. From the communication perspective, the following events and techniques can be recognised: a formal Philips SPI Policy, a bi-annual Philips Software Conference, temporary Philips SPI Workshops, a Philips SPI Award, an internal SPI Web Site, and so-called SPI Team Room sessions.

The research group used brainstorming sessions and discussions to develop a number of questions regarding metrics usage. We asked: "Which of the following quality measurements/metrics do you currently use?" For this, a list of 17 predefined metrics was provided (with a yes/no scale) that had been derived from earlier research sponsored by the SPI Steering Committee of the global Philips organisation. To this list, software groups could add their own metrics. A drawback of this approach was that no direct link to previous surveys (e.g. as reported in [2]) could be established. We felt that this drawback was compensated by the usage of terms that were familiar to the participants.

4. Survey demographics
This section briefly gives some survey demographics. The survey was aimed at group management and was completed by them. The number of responding software groups was 49 out of 74, a very satisfactory response rate of about 66%. Table 1 shows the distribution of the responding software groups over the continents. Table 2 presents the CMM levels that have been achieved in the organisation.

Table 1: Distribution of responding software groups over the continents.

Continent   Number of software groups
Europe      32
Asia         8
America      9
Total       49


Table 2: Number of software groups on the different CMM levels.

CMM level      Number of groups   Average # of metrics reported
1              20                  6.2
2              13                  6.3
3              10                 11.5
4               0                 n.a.
5               3                 18.0
Not reported    3
Total          49

Table 2 shows that more than 50% of the software groups that completed the survey succeeded in leaving the lowest level 1 (initial software development) of the CMM. It is also clear that the number of metrics reported increases substantially starting from level 3.

5. Results of the survey
This section introduces and discusses the results of the survey. Apart from reacting to the predefined metrics, most respondents added (often significant numbers of) self-described metrics. In order to be able to manage these results, the data were classified. The classification used is derived from existing classifications such as those presented in [1] and [7]. The two-level categorisation used is presented in Table 3. The results of the survey in these terms are presented in Table 4.

Table 3: Categories of metrics.

Category               Sub categories
Project                Progress, Effort, Cost
Product                Quality, Stability, Size, Complexity
Process                Quality, Stability
Process improvement    Reuse, Improvement
Computer utilisation   n.a.
Training               n.a.

Table 4: Frequencies per metric class.

Category              Freq.   Percent   Subcategory    Freq.   Percent
Project               179      45.4     progress         80     20.3
                                        effort           97     24.6
                                        cost              2      0.5
Product               190      48.2     quality         137     34.8
                                        stability        15      3.8
                                        size             38      9.6
Process                 8       2.0     quality           4      1.0
                                        stability         4      1.0
Process improvement    13       3.3     reuse             8      2.0
                                        improvement       5      1.3
Computer                0       0.0     utilisation       0      0.0
Training                1       0.3     training          1      0.3
Other                   3       0.8     other             3      0.8
Total                 394     100.0     total           394    100.0


Three metrics were reported that fell outside this categorisation. These were 'employee satisfaction', 'patents approved' and 'PMI'. The computer utilisation category, with no metrics reported, and the training category, with only a single metric, score extremely low; these can be discounted from the further analysis. The large majority of the other 391 metrics fall within the classes 'project' and 'product', which together account for 93.6%. Process and process improvement related metrics score low, with only 5.3% of metrics.
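To make the classification step concrete, the sketch below tallies category and subcategory frequencies from a flat list of reported metrics, in the spirit of Table 4. The metric names and the name-to-category mapping are hypothetical illustrations, not the actual Philips questionnaire items or the research team's coding scheme.

from collections import Counter

# Hypothetical mapping from reported metric names to (category, subcategory).
# In the study this classification was done by the researchers, based on [1] and [7].
CLASSIFICATION = {
    "milestone slippage": ("project", "progress"),
    "hours spent per work package": ("project", "effort"),
    "defects found in test": ("product", "quality"),
    "requirements churn": ("product", "stability"),
    "lines of code": ("product", "size"),
    "reused components ratio": ("process improvement", "reuse"),
}

def tally(reported_metrics):
    """Count reported metrics per category and per (category, subcategory)."""
    categories, subcategories = Counter(), Counter()
    for name in reported_metrics:
        category, subcategory = CLASSIFICATION.get(name, ("other", "other"))
        categories[category] += 1
        subcategories[(category, subcategory)] += 1
    return categories, subcategories

# Example: metrics reported by one fictitious software group.
reported = ["milestone slippage", "hours spent per work package",
            "defects found in test", "lines of code", "employee satisfaction"]
categories, subcategories = tally(reported)
total = sum(categories.values())
for category, freq in categories.most_common():
    print(f"{category:20s} {freq:3d}  {100.0 * freq / total:5.1f}%")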

According to [8], all organisations should at least collect the five core metrics: time, effort, reliability, size, and productivity. Given that productivity can be seen as a combination of size and effort, Table 5 shows the results on this requirement. An organisation scored on a category if at least one metric was present in that category.

Table 5: Organisations (number and percentage) collecting data on the five core metrics.

CMM level   Project progress   Project effort   Product quality   Product size   Productivity   All core metrics
1           14 (82%)           14 (82%)         13 (76%)          12 (71%)       11 (65%)        8 (47%)
2           11 (85%)           13 (100%)         9 (69%)           7 (54%)        7 (54%)        5 (38%)
3           10 (100%)          10 (100%)        10 (100%)          9 (90%)        9 (90%)        9 (90%)
5            3 (100%)           3 (100%)         3 (100%)          3 (100%)       3 (100%)       3 (100%)

All level 5 and nearly all level 3 organisations fulfill this requirement. However, less than half of the level 1 and level 2 organisations do so. The main missing metric seems to be product size. However, size can always be measured retroactively and sufficient automated support is available for this. Taking this into account, all level 3 and 5 organisations, most (69%) of the level 2 organisations, and about half (53%) of the level 1 organisations fulfill this requirement. Apparently, starting from level 2, this requirement is largely met, providing these organisations sufficient (at least according to [8]) information for control.
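As a minimal illustration of the scoring rule behind Table 5, the sketch below checks whether a single organisation covers the five core metrics of [8], assuming its reported metrics have already been mapped to the subcategories used above and treating productivity, as in the text, as the combination of product size and project effort. The data structure is hypothetical.

# Subcategories that must each have at least one reported metric.
CORE_SUBCATEGORIES = {"project progress", "project effort", "product quality", "product size"}

def covers_core_metrics(reported_subcategories):
    """True if an organisation covers the five core metrics of [8].

    Productivity is derived from size and effort, so it counts as covered
    whenever both 'product size' and 'project effort' are present.
    """
    present = set(reported_subcategories)
    has_productivity = {"product size", "project effort"} <= present
    return CORE_SUBCATEGORIES <= present and has_productivity

# Example: a fictitious level-2 software group that does not measure size.
group = ["project progress", "project effort", "product quality"]
print(covers_core_metrics(group))  # False: product size (and hence productivity) is missing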

A next interesting step is to compare the types of metrics collected with the requirements as formulated in the literature survey presented above, i.e. [1], [11] and [7]. According to this we would expect (given that we have no data for level 4):

• Level-1 organisations will focus on baseline metrics.
• Level-2 organisations will add to this project-related metrics and, according to [11] and [1], also product quality metrics.
• Level-3 will contain product metrics, both on quality and complexity.
• Level-5 will also require process-wide metrics.

Table 6 contains the numbers of metrics found per sub-category and per CMM level, and can be used to check these expectations.


Table 6: Number of metrics per sub-category per CMM level.

Metric subcategory     CMM 1   CMM 2   CMM 3   CMM 5   Total
project progress          20      24      27       6      77
project effort            28      27      23      13      91
project cost               0       1       1       0       2
product quality           43      17      47      18     125
product stability          5       2       7       0      14
product size              14       8       9       3      34
process quality            0       0       0       4       4
process stability          3       1       0       0       4
process reuse              3       3       2       0       8
process improvement        1       0       1       3       5
Total                    117      83     117      47     364

It is clear that level-1 organisations do not adhere to the expectations as formulated in literature. We see a widespread usage of all types of metrics. If we combine this with the less than stellar adherence to the principle of five core metrics, as formulated in [8], we can conclude that level-1 organisations' usage of metrics is in line with their CMM level: it is 'ad hoc' or even 'chaotic'. The lack of project control at this level apparently spills over into a lack of consideration of which metrics to use.

At level 2 we see a more concentrated usage of metrics. As expected, many metrics focus on project matters. We also see a significant number of product metrics. This fits in with the requirements as provided in [1] and at first glance contradicts those made in [7]. However, the CMM model is a growth model: it is to be expected that organisations at level 2 are involved in a transition process towards level 3. Usage of product metrics can therefore also be seen in this light, supporting the assumptions made in [7]. And given the relatively low number of product metrics encountered (just over one per organisation), this explanation may seem more likely. Some scattered process metrics are found, but in such low numbers (less than 0.25 metrics of this type per organisation) that we feel we can say that, on average, level 2 organisations select metrics in a way that is proper for their position in the CMM model. This is supported by the high level of usage of the five core metrics as advocated in [8].

Looking at level 3 organisations, a similar picture can be seen. We see extensive usage of project metrics. As can be expected for level 3, usage of product metrics is also extensive. This is exactly in line with what would be expected at this level. However, somewhat more surprisingly, almost no process metrics are reported. If we take the line of reasoning mentioned before, that level x organisations will be on the road to level x+1, and therefore report metrics that fit in with this higher level, a somewhat higher number would be expected. This can mean that the participating organisations have just reached level 3 and have not yet started the next step, or that level 3 is considered a sufficient target and no further progress is envisaged. The data show that the average level-3 organisation has had an SPI program in place for 6 years. All process metrics that were encountered were found in those organisations that had their SPI program in place for 8 years on average; the remainder had their program in place for 5 years on average. This would seem to suggest that the first explanation is the most likely one. All level 3 organisations gather the core metrics as suggested in [8]. Altogether this suggests that level 3 organisations, like level 2 organisations, follow a well-considered approach towards metric usage that fits in with their position in the CMM maturity model.

No data are available for level 4 organisations. In [7] usage of reuse metrics at this level is indicated. In the survey, reuse metrics were included as part of the 17 pre-defined metrics. Eight organisations indicated that they used this type of metric. Three of these were at level 1, three at level 2 and two at level 3. None of them were at level 5, as would have been expected according to [7]. Apparently usage of reuse metrics is not a characteristic of level 4 organisations.

Only limited data are available for level 5 organisations. However, these data suggest a similar result as was obtained at level 3. Metrics usage almost doubles for level 5 organisations, in line with the results reported in [6]. And although only a few process metrics are reported, two out of the three organisations do use them. Furthermore, this limited number of organisations accounts for one third (7) of all (21) process-related metrics. On average, each level 5 organisation uses 2.3 process-related metrics, as opposed to 0.3 process-related metrics per organisation for the other organisations. And, as is suggested in [6], considered usage of project and product metrics can provide insight into the development of the process. Finally, all level 5 organisations collect data for the five core metrics. All this again suggests a carefully considered usage of metrics by these organisations.

In absolute numbers, the amount of process-related metrics (21) is small. Only 5.3% of metrics fall into this category. A higher number would have been expected, especially for the organisations at levels 3 and 5. It is true that level 5 organisations do show a higher number: on average each level 5 organisation uses 2.3 process-related metrics, as opposed to 0.3 for the other organisations. This definitely shows that this type of metric is used more at higher levels of maturity, but it is still not a very high usage.

6. Conclusions
The main results of the analysis are as follows:
• On each of the CMM levels, software groups make use of a quite large number of project metrics (i.e. to measure progress, effort and cost).
• Software groups on CMM level 1 (the Initial process level) make, or try to make, use of a large diversity of metrics, also from the categories product and process. This is in contradiction with recommendations from literature, which stress the usage of project metrics (mainly to provide baseline data as a start for further improvement).
• On CMM level 2 (the Managed process level) software groups are definitely restricting themselves to project metrics; the usage of product metrics shows a sharp decrease. Obviously software groups on this level know better what they are doing, or what they should do, i.e. project monitoring and controlling. As a stepping stone towards level 3, a number of product quality metrics can also be found here.
• Software groups on CMM level 3 (the Defined level) clearly show a strong increase in product quality measurement. Obviously, the higher level of maturity reached enables software groups to effectively apply product metrics in order to quantitatively express product quality.
• It is remarkable that the analysis of the metrics showed that software groups on CMM level 3 and above don't show a strong increase in the usage of process-related metrics. The main instruments in software measurement to reach a satisfactory maturity level seem to be the project control and product quality metrics.


The research shows a clear emphasis on project progress and product quality metrics in software groups that are striving for a higher level of maturity. It is also interesting to see that on each of the first three CMM levels software groups make use of about the same, high number of project metrics. So project control is important on all levels. Surprisingly, and in contradiction with literature, process metrics are hardly used (even by software groups on the higher CMM levels).

7. References
[1] Baumert, J.H., McWhinney, M.S., Software Measures and the Capability Maturity Model, Technical Report CMU/SEI-92-TR-25, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, 1992.
[2] Brodman, J.G., Johnson, D.L., Return on investment from software process improvement as measured by US industry, Crosstalk, April 1996.
[3] Ebert, C., Dumke, R., Software Measurement: Establish, Extract, Evaluate, Execute, Springer-Verlag, Berlin Heidelberg, 2007.
[4] El-Emam, K., Goldenson, D., McCurley, J., Herbsleb, J., Modeling the Likelihood of Software Process Improvement: An Exploratory Study, Empirical Software Engineering, Vol. 6, No. 3, pp. 207-229, 2001.
[5] Gopal, A., Krishnan, M.S., Mukhopadhyay, T., Goldenson, D., Measurement programs in software development: determinants of success, IEEE Transactions on Software Engineering, Vol. 28, No. 9, pp. 863-875, 2002.
[6] Jalote, P., Use of metrics in high maturity organisations, Software Quality Professional, Vol. 4, No. 2, pp. 7-13, March 2002.
[7] Pfleeger, S.L., McGowan, C., Software Metrics in the Process Maturity Framework, Elsevier Science Publishing Co., Inc., 1990.
[8] Putnam, L.H., Myers, W., Five Core Metrics: The Intelligence Behind Successful Software Management, Dorset House Publishing, 2003.
[9] SEI: Software Engineering Institute, Carnegie Mellon University, The Capability Maturity Model: Guidelines for Improving the Software Process, Addison-Wesley Professional, Boston, USA, June 1995.
[10] Trienekens, J.J.M., Towards a Model for Managing Success Factors in Software Process Improvement, Proceedings of the 1st International Workshop on Software Audit and Metrics (SAM) during ICEIS 2004, Porto, Portugal, pp. 12-21, 2004.
[11] Walrad, C., Moss, E., Measurement: the key to application development quality, IBM Systems Journal, Vol. 32, No. 3, pp. 445-460, 1993.


Practical viewpoints for improving software measurement utilisation

Jari Soini

Abstract
An increasing need for measurement data to be used in decision-making on all levels of the organisation has been observed when implementing the software process, but, on the other hand, measurement of the software process is utilised in practice to a rather limited extent in software companies. There obviously seems to be a gap between need and use on this issue. In this paper we present the viewpoints of Finnish software companies on measurement issues, together with their experiences. Based on the empirical information obtained through interview sessions, this paper discusses the justification for measurement as well as potential reasons why measurement utilisation in the software process is challenging. The key factors are also discussed that, at least from the users' perspective, should be carefully taken into account in connection with measurement, along with some observed points that should be improved in the companies' current measurement practices. These results can be used to evaluate and focus on the factors that must be noted when trying to determine the reasons why measurement is only utilised to a limited extent in software production processes in practice. This information could be useful when trying to solve the difficulties observed in relation to measurement utilisation and thus advancing measurement usage in the context of software engineering.

1. Introduction
In the literature there are a lot of theories and guides available for establishing and implementing measurement in a software process. However, in practice, difficulties have been observed in implementing measurement in the software development process [1],[2], and there is also evidence that measurement is not a particularly common routine in software engineering [3],[4]. These difficulties are partly due to the nature of software engineering as well as people's attitude to measurement overall, but there may also be a lack of knowledge related to measurement practices, measurement objects, and the metrics themselves. Very often difficulties arise when trying to focus the measurement and also when trying to use and exploit the measurement results. In many cases it is unclear what should be measured, and how, and also by whom the measurement data obtained should be interpreted [5].

This paper presents the viewpoints of software organisations (the advantages and disadvantages) related to setting up or using software measurement in the software engineering process. Based on empirical information obtained through interviews, this paper discusses the potential issues challenging measurement utilisation in the software process. The aim of this study is to provide empirical experiences of which factors are felt to be important in measurement implementation. The empirical research material used in this examination was obtained from a software research project [6] carried out between 2005 and 2007 in Finland. The aim of the two-year research project was to investigate the current status and experiences of measuring the software engineering process in Finnish software organisations and to enhance the empirical knowledge on this theme. The overall goal of the research was to advance and expand understanding and knowledge of measurement usage, utilisation, and aspects of its development in the software process. In this paper, we focus on the interviews performed during the research project and the interpretation of the results from these interviews.


This empirical information gives us some hints as to which issues are important when designing, establishing, and implementing measurement in the context of software engineering.

During the past years, many research projects and empirical studies including experiences of measurement utilisation in software engineering have been carried out. Research closely related to this subject in the software measurement field has been performed in [7],[8],[9],[10],[11],[12],[13]. In these studies the measurement practices and experiences in software organisations were examined empirically. The main outcome of many of the studies has been practical guidelines for managing measurement implementation and usage in software engineering work. However, this issue still seems to require further examination.

The structure of this paper is as follows: Section 2 describes the research context. In Section 3, the scope and method used in the empirical study are described. This is followed by Section 4, in which the interview results are presented, and Section 5, in which the key findings of the study are discussed and evaluated. Finally, in Section 6, a summary of the paper is presented together with a few relevant themes for further research.

2. Research context
In qualitative research there is a need for a theory or theories against which the captured data can be considered. Below is a short overview of the purpose of use and utilisation of measurement in business overall and, in particular, in the context of software engineering. It is well known that measurement can be utilised in various ways for many different purposes in business. In general, measurement data provides numerical information, which is often essential for comparing different alternatives and making a reasoned selection among the alternatives. McGarry et al. [14] listed typical reasons for carrying out measurement in business:

• Without measurement, operations and decisions can be based only on subjective estimation.
• To understand the organisation's own operations better.
• To enable evaluation of the organisation's own operations.
• To be better able to predict what will happen.
• To guide processes.
• To find justification for improving operations.

From the software engineering perspective, a few fundamentals can be pointed out that are considered the key benefits when discussing the utilisation of measurement. In practice, the main purpose of software engineering measurement is to control and monitor the state of the process and the product quality, and also to provide predictability. Therefore, more software engineering-related reasons for measurement are (e.g. [4],[15],[16]):

• To control the processes of software production.
• To provide a visible means for the management to monitor their performance level.
• To help managers and developers to monitor the effects of activities and changes in all aspects of development.
• To allow the definition of a baseline for understanding the nature and impact of proposed changes.
• To indicate the quality of the software product (to highlight quality problems).
• To provide feedback to drive improvement efforts.


There is no doubt that measurement is widely recognised as an essential part of understanding, predicting and evaluating software development and maintenance projects [17],[18],[19]. However, in the context of software engineering, its utilisation is acknowledged to be a challenging issue [1],[2],[4],[19]. Although information relating to the software engineering process is available for measurement during software development work and, in addition, there are plenty of potential measurement objects and metrics related to this process, the difficulties related to measurement utilisation in this context are considerable. Due to the factors described above, it must be assumed that the basic issue with measurement utilisation is not a lack of measurement data, so in all likelihood there must be some other reasons for this phenomenon. The following section presents one example of how this issue has been studied empirically and also the type of answers that were obtained to explain this phenomenon.

3. Empirical study – Software Measurement project
This study is based on the results of a research project that was carried out between 2005 and 2007 in Finland. The starting point of the SoMe (Software Measurement) project [6] was to approach software quality from the software process point of view. The approach selected for examining the recognised issues in relation to software quality was an empirical study, in which the target was to examine measurement practices related to the software process in Finnish software companies. The aim and scope of the project was to examine the current metrics and measurement practices for measuring software processes and product quality in software organisations. The findings presented in this paper are based on information captured with personal interviews and a questionnaire during the research project.

3.1. Scope of the research project
There were several limitations related to the research population and unit when implementing the study. Firstly, the sample was not a random set of Finnish software companies. The SoMe project was initiated by the Finnish Software Measurement Association (FiSMA) [20] and two Finnish universities, the Tampere University of Technology (TUT) [21] and the University of Joensuu (UJ) [22]. In this study the population was limited so that it covers software companies that are members of FiSMA (which consists of nearly 40 Finnish software companies, among them many of the most notable software companies in Finland). Secondly, within the selected population (FiSMA members), too, the sample was not randomly selected. Instead, purposive sampling was used in this study, which means that the choice of case companies was based on the rationale that new knowledge related to the research phenomenon would be obtained (e.g. [23],[24]). This definition was based on the objective of the SoMe research project, which was primarily to generate a measurement knowledge base consisting of a large metrics database including empirical metrics data [6]. In this case we approached companies that had previous experience with software measurement. The most promising companies performing software engineering measurement activities were carefully mapped in close co-operation with FiSMA. All the participant companies volunteered, and the set was limited to ten companies, based on the selected research method (qualitative research), the project schedule and the research resources. The third limitation relates to the research unit. The target group selected within each company was made up of persons who had experience, knowledge, and understanding of measurement and those responsible for measurement operations and the metrics used in practice. From the research point of view, the research unit was precisely these people in the company. These activities generally come under the job description of quality managers, which was the most common position among the interviewees (see [25]).

3.2. Participants of the research project
The companies involved in the study were pre-selected by virtue of their history in software measurement and their voluntary participation. The people interviewed were mainly quality managers, heads of departments and systems analysts, i.e. persons who were responsible for the measurement activities in the organisation. A total of ten companies participated in this study. Three of them operate in the finance industry, two in software engineering, two in ICT services, two in manufacturing and one in automation systems. Six of the companies operate in Finland only (Na), the other four also multi-nationally (Mn). Table 1 below shows some general information related to the participant companies. The common characteristic shared by these companies is that they all carry out software development independently and they mainly supply IT projects.

Table 1: Description of participating companies.

Company   Business of the company   SW Business         SW Application          Na/Mn   Employees total   Employees SE   SPICE level
A         Finance industry, ICT     Customer            Business Systems (BS)   Na      220               220            3
B         Finance industry, ICT     Customer            Business Systems        Na      280               150            2
C         Finance industry, ICT     Customer            Business Systems        Na      450                35            2
D         Finance industry, ICT     Customer            Business Systems        Na      195               195            3
E         SW engineering, ICT       Customer, Product   BS, Package             Mn      15000             5000           2
F         SW engineering, ICT       Customer            Business Systems        Na      200                60            2
G         ICT services              Customer            Business Systems        Na      200               200            3
H         Manufacturing             Customer, Product   Emb Syst, Package       Mn      1200               30            1
I         Manufacturing             Product             Emb Syst, Real-Time     Mn      24000             200            2
J         Automation industry       Product             Real-Time               Mn      3200              120            3

Based on the captured information, measurement practices and targets varied slightly between the companies. Most of them had used measurement with systematic data collection and results utilisation for about 10 years, but there were also a couple of companies that had only recently started to adopt more systematic measurement programs. The contact people inside the companies were mainly quality managers, heads of departments and systems analysts, i.e. persons who were responsible for the measurement activities in the organisation.

3.3. Research method used
The nature of the method selected in the SoMe research project was strongly empirical, and the research approach was a qualitative one based on personal interviews combined with a questionnaire. This method was used in order to collect the experiences of the companies with the individual metrics used, and also their opinions and viewpoints about measurement, based on their own experiences. To be exact, in these interview sessions we used a semi-structured theme interview base as a framework. We used the same templates in all interview sessions: one form to collect general information about the company and its measurement practices, and another spreadsheet-style form to collect all the metrics the company uses or has used (see [25]). The use of the interview method can be justified by the research findings of Brown and Duguid [26], who posit that there is a parallel between knowledge-sharing behaviour and other social interaction situations. Huysman and de Wit [27] and Swan et al. [28] have come to similar conclusions in their research. Therefore, it was decided to use a research method focusing on personal interviews at regular intervals with participants instead of purely a written questionnaire. It is distinctive of qualitative research that space is often given to the viewpoints and experiences of the interviewees and their attitudes, feelings, and motives related to the phenomenon in question [29]. These are difficult, or even impossible, to clarify with quantitative research methods. With the interviews we particularly wanted to examine the individual experiences and feelings related to the research theme. Using this approach, the aim is not to generalise. Instead, with face-to-face interviews, our aim was to get the views of the persons about the focal factors and issues related to measurement implementation. With these interview sessions we tried to identify the users' positive and less positive experiences, noted critical issues, and also improvement and development issues in relation to the current situation regarding measurement.

4. Interview results
The following section highlights the interview results from the stance in which we are particularly interested in this study. This paper presents the respondents' opinions about the four particular themes included in our interview base. The information captured in the interviews was carefully analysed by the research team and the results are presented below in condensed form. The opinions are divided into four theme categories as follows:

Positive experiences of measurement usage:
• Real status determined through measurement.
• Measurement makes matters transparent.
• Scheduling and cost adherence become notably better.
• Resource allocation can be clarified.
• Key issues emphasised using measurement.
• Trends visible in time series.
• Supports decision-making.
• Motivates employees to work in a better and more disciplined way.
• Agreed metrics help when planning company goals and operations.
• Process improvement achieved through measurement.
• Tool for monitoring effects of SPI actions.
• Supports quality system implementation.
• Practical means for integrating a company which has grown through acquisitions and creating a coherent organisation culture.
• Good way of observing trends and benchmarking.
• Can inspire confidence in new customers by presenting measurement data.

Negative experiences of measurement usage:
• Resistance to change at the beginning.
• People in other departments do not see the benefit of measurement.
• Measurement often suffers from a negative image (monitoring).
• Easily viewed as extra work.
• Mainly viewed negatively if the metrics are linked to an employee bonus system.
• Measurement may underline and guide certain operations too much.
• Benefits are quite difficult to see in the short term and also on a personal level.
• Management sets metrics without in-depth knowledge of the trends.
• Concrete utilisation of the metrics is difficult.
• Interpreting the measurement results may be difficult.
• Problems arise when trying to use measurement without the necessary measurement skills.
• Measurement without utilising its results is a waste.


Observed issues for improvement and development:
• Measurement results should be made available to a wider audience (wider publication and transparency).
• Insufficient analysis and utilisation of measurement results.
• Measurement utilisation in operative management.
• Developing and implementing the measurement process.
• Assessing realistic measurement goals is challenging.
• Determining how to ensure a particular metric is utilised.
• Extent of measurement (do we measure enough and the right factors?).
• Finding and defining good metrics.
• Lack of forward-looking, real-time metrics (need for real-time measurement).
• Lack of measurement of remaining work effort.
• Need to improve the measurement related to the area of change and requirement management.
• Evolving the utilisation of limit areas.
• Delivery of top-quality measurement.
• Improving the utilisation of testing data: the exact causes and "originating phase" of defects should be measured.
• Increasing the reliability of the collected measurement data.
• Improving performance measurement.

Critical issues in utilising and implementing measurement:
• Demand for management commitment and discipline of the whole organisation.
• There should be a versatile and balanced set of metrics.
• Enough time must be reserved to analyse measurement results and also to make the necessary rework.
• Feedback on measurements must be given to all the data collectors, to sustain motivation.
• The benefits of the measurement must be indicated in some way; if not, motivation will fade.
• Capturing measurement information must be integrated in normal daily work.
• Measurement must be automated as much as possible, so that data capture is not too time-consuming.
• Trends are important, as it takes time to notice process improvement.
• There should be a large number of metrics, which combine to give an accurate status.
• Avoid placing too much weight and over-emphasis on one single measurement result.
• Establishing reliable metrics and their implementation is time-consuming.

These were the summarised experiences on the themes discussed in this study. From the information captured in the interviews, a few key factors have emerged. It seems that the interviewees rated measurement issues that relate closely to the human aspect as extremely important factors in the measurement context (e.g. utilisation and publishing, purpose of use, measurement benefits, feedback, response, etc.). Other issues that were emphasised relate to the skills and knowledge required for using the metrics and measurement (e.g. interpreting, analysing and utilising the collected measurement data, measurement focus, etc.) and also to difficulties in metrics selection (amount, balance, measurement objects, etc.).


As is well known, measurement itself is a process that requires sufficient resources and know-how of software development work and of the developed software itself [30],[31]. In the following sections, the observations related to the human aspect are examined more closely. These interview results are discussed on the basis of the current measurement practices in the participant organisations and also in relation to previous research results.

5. Discussion
This section deals more extensively with one of the key factors that arose repeatedly during the interview sessions: the human aspect. This observation supports the widely accepted concept that there is a strong emphasis on the human nature of software engineering work [32],[33],[34],[35],[36]. Next, the interview results are compared and discussed on the basis of the information derived from current measurement practices in the participant organisations.

5.1. Prevalent measurement usage from the human perspective

In one part of the SoMe project, detailed information was collected on the current metrics used in the participant companies. The methods of collecting, analysing and classifying the research material as well as the entire research process have been described in previous research papers [6],[37]. Briefly, after the captured measurement data was organised and evaluated, and the necessary eliminations were made, 85 individual metrics were left which fulfilled the definition of the research scope. The following evaluation related to results from the empirical study deals with the measurement users’ perspective. This examination includes the results that enabled us to see who the beneficiaries of the measurement results were in practice. For this evaluation, the metrics were classified according to three user groups: software/project engineers; project managers; and upper management (management above project level including, but not restricted to, quality, business, department and company managers). In each group the metrics were further divided into two segments: metrics that are primarily meant for the members of that group, and those that are secondarily used by that group after some other group has had access to them (according to the order given on the questionnaire form, primary or secondary users of the metric may be from more than one user group). Figure 1 (below) presents the users of the measurement data, based on their position within the organisational hierarchy.

[Bar chart: number of metrics used primarily and secondarily by each user group (software engineers, project managers, upper management).]

Figure 1: Distribution of measurement results and their users in the company organisation.


With the captured metric data, the users of the measurement data and their position within the organisational hierarchy can be recognised (see Figure 1). It is clear that the majority of measurement data produced benefits for upper management. 98% (83/85) of all the data produced by the 85 metrics comes to upper management, 78% (66/85) primarily. Project managers are in the middle with 38 metrics, but 58% (22/38) of them are primary recipients. Considerably fewer metrics are targeted at the engineering/development level. Software engineers benefit from only 26 metrics, and of those only ten are primarily intended for them. Moreover, according to the study results, roughly 75% (66/85) of the data collecting activities were carried out primarily by developers or project managers. As Figure 1 clearly shows, the captured measurement data focuses strongly on use by a certain user group. In this case, there is obviously an imbalance in who receives the results and how the metrics are selected for communication in the organisation. Moreover, this phenomenon is confirmed by previous studies. Metrics designated for upper management are fairly common in software engineering organisations (e.g. [14],[38],[39]). Next, this result and its possible effects are evaluated from the measurement perspective.

5.2. The human factor – critical for measurement success

The next section highlights the focal - human-related - issues of which one should be aware when implementing measurement in the context of software engineering, based on the empirical results obtained. After analysing and evaluating the captured qualitative data, the following key factors seem to be significant enablers from the measurement implementation viewpoint.

Interpreters and analysers of measurement data

Regarding measurement data analysis, it must be recognised that the data collected has no intrinsic value per se, but only as the starting point of an analysis enabling the evaluation of the state of the software process. The final function of the measurement process is precisely to analyse the collected measurement data and draw the necessary conclusions. According to Dybå [8], measurement is meaningless without interpretation and judgment by those who make the decisions and take actions based on them. The human aspect relates closely to this issue of measurement data processing. According to Florac and Carleton [40], whatever level is being measured, problems will occur if the analysers of the measurement data are completely different people from those who collected the measurement data. Van Veenendaal and McMullan [41], as well as Hall & Fenton [42], share this view and state that when collecting or analysing measurement data, it is fundamental that the participants are those who have the best understanding of the measurement focus in question. Being aware of this is helpful when establishing and planning new software process measurement activities, especially from the perspective of analysing the measurement results. Based on the results of the study, we can state that the under-representation of engineer-level beneficiaries and the uneven distribution of measurement information very probably have detrimental effects: if not for the projects or software engineering, then at least for the validity and reliability of the measurements and their results. This result (Figure 1) supports the need that arose during the interviews for improving the analysis of measurement data.

Measurement data feedback

Feedback on how the collected measurement data is utilised is paramount for all organisational levels. As Hall and Fenton [42] identified, the usefulness of measurement data should be obvious to all practitioners.


Therefore, it is important that the measurement results are, where appropriate, open and available to everybody at all levels of the organisation. It should be noted that appropriate measurement supports managerial and engineering processes at all levels of the organisation [43],[44]. Therefore, it is important that measurement and metric systems are designed by software developers for learning, rather than by management for controlling. Measurement provides opportunities not only for managers but also for developers to participate in analysing, interpreting and learning from the results of measurements and to identify concrete areas for improvement. The most effective use of measurement data for organisational learning, and also for process improvement, is to feed the data back in some form to the members of the organisation. According to the results of this study, roughly 75% (66/85) of the data collecting activities were carried out primarily by developers or project managers. Research would indicate that they expect feedback on how the collected measurement data is utilised [13]. In the interviews, the feedback issue did not come up. This may be due to the fact that all those interviewed belonged to upper management, which benefited from 98% of the collected measurement data. However, there is evidence that this feedback issue is problematic. The lack of feedback, and thus of evidence of the usefulness of collected measurement data, has been widely observed (e.g. [13],[42],[45],[46]). The literature contains many cautionary examples of measurement programs failing because of insufficient communication, as the personnel are unable to see the relevance or benefit of collecting data for the metrics [13],[16],[47]. People should be made aware that when collecting measurement data, it is also essential to organise the feedback channel and produce instructions for it, and to give feedback to everyone, or at least to those who gathered or supplied the data. Success in giving feedback is indeed recognised as an important human-related factor in sustaining motivation for measurement [8],[46].

Reaction to the measurement results

There is yet another important human-related issue to highlight concerning measurement utilisation: the reaction to the measurement results. This is one of the most crucial issues from the motivation perspective in relation to measurement data collection work. One of the significant factors for success in measurement is the motivation of the employees to carry out measurement activities. One of the most challenging assignments is to implement measurement in a manner that has the most positive impact on each of the projects within the organisation [14]. If people can see that the measurement data they have collected is first analysed and then, if necessary, leads to corrective actions, the motivation for collecting the measurement data will be sustained. However, as with the previous issues, there is also evidence of a lack of management of this. According to the results of a large study [7] by the Software Engineering Institute at Carnegie Mellon University (software practitioners, 84 countries, 1895 responses), only 40% of all respondents reported that corrective action is taken when a measurement threshold has been exceeded, and close to 20% of respondents reported that corrective action is rarely or never taken when a measurement threshold is exceeded. Furthermore, the studies carried out by Solingen and Berghout [19], Hall et al. [13] and Mendonca and Basili [48] support this observation. In these interviews (see section 4.1), the observations that came up relate closely to this issue. These included the fact that the concrete utilisation of the metrics and also the interpretation of the measurement results are often considered difficult, that measurement without utilising its results is a waste, and that there is a need for wider publication and transparency of the measurement results. These are examples of issues that can be linked to a lack of reaction to measurement results and thus to decreased motivation to carry out measurement.


The factors presented above tend to be considered essential from the human perspective in the context of measurement in software engineering. The issues described - the beneficiaries, analysis and also the appropriate organisation of the collected measurement data - must be taken into consideration in connection with measurement. Previous studies support these observations, derived from interpretation of the data captured during the qualitative study. This confirms the reliability of the study results. In conclusion, it could be stated that paying attention to these issues may enhance measurement usage in software engineering work.

5.3. Evaluation of the study results

In qualitative research, such as the interviews in this case, subjective opinion is allowed and subjectivity is unavoidable [49]. With qualitative material, generalisations cannot be made in the same way as in quantitative studies. Thus, it should be noted that generalisations can only be made on the grounds of interpretation, not directly on the grounds of the research material [29]. Therefore, the most challenging part of this approach from the researcher's point of view is data interpretation and the subsequent selection of the key observations and themes that came up in the research material. Finally, it is very much a question of the researchers' interpretation which of the results they consider to be the most significant. If one looks for possible weaknesses or bias in qualitative research regarding reliability, it could be said to lie in the phase of analysing and interpreting the research material. However, the results of previous studies support the observations presented here.

On the other hand, because the starting point of the study was the SoMe project, which was designed to capture experience-based measurement data, it could be said that the sample does not correspond well to the population from the aspect of software engineering measurement activity. Therefore it must be assumed that the participant companies were more measurement-oriented than the average population. This, together with them being members of FiSMA, may create a bias in results, as these companies are more likely to perform measurement than the Finnish software industry in general. From the internal validity point of view this means that the results cannot be generalised to represent normal practice in the population. However, the results of this study can be applied when examining the current measurement practices and metrics used in a similar population and a similar research focus.

The statistical examination of the study results was omitted because of the qualitative research method used and also due to the limited amount of research data. A statistical generalisation of the results cannot be made with the collected data set (e.g. [50]). As stated by Briand et al. [51], sampling too small a dataset leads to a lack of statistical authority. However, a more detailed validity and reliability examination of the research and its results is given in the doctoral thesis [25], of which the research material presented here is the focal data.

6. Conclusion

This article dealt with the possible reasons for the observation that software process measurement is utilised to a rather limited extent in software companies in practice. The empirical data used in this study was based on a software research project on the Finnish software industry's measurement practices. With the help of empirical information obtained through interview sessions, this paper presented the viewpoints of a set of Finnish software companies on measurement issues and their experiences. Based on the study results, this paper mapped out and discussed the potential reasons for the challenges facing measurement utilisation in the software process.


This paper focused on the human factor, which was strongly emphasised in the interviews. Based on the empirical results obtained, a set of human-related issues arose which one should be aware of and also take into account when implementing measurement in the context of software engineering. The results showed that the users felt these issues to be the most crucial when implementing measurement. The observations made in this study pointed out where measurement needs to be focused and improved. The interview results also exposed some points that should be improved in current measurement practices. These results can be used to evaluate and focus on the factors that must be noted when trying to determine the reasons why measurement is only utilised to a limited extent in software production processes in practice.

In conclusion, the empirical information presented in this paper can help when establishing or improving measurement practices in software organisations. The information provided by the empirical study enables us to recognise where the critical issues lie in measurement implementation. This may help us to take a short step on the road toward extended utilisation of measurement in software engineering. The results also showed some of the potential directions for further research on this theme. One interesting issue for future research would be to examine what really happens after the measurement data is analysed and how the response process is organised in practice in these organisations.

7. References
[1] Bititci, U. S., Martinez, V., Albores, P. and Parung, J., "Creating and managing value in collaborative networks", International Journal of Physical Distribution and Logistics Management, vol. 40, no. 3, 2004, pp. 251-268.
[2] Testa, M. R., "A Model for organisation-based 360 degree leadership assessment", Leadership and Organisation Development Journal, vol. 23, no. 5, 2002, pp. 260-268.
[3] Ebert, C., Dumke, R., Bundschuh, M. and Schmietendorf, A., "Best Practices in Software Measurement: How to Use Metrics to Improve Project and Process Performance", Springer, 2004.
[4] Fenton, N. and Pfleeger, S., "Software Metrics: A Rigorous & Practical Approach - 2nd edition", International Thompson Computer Press, 1997.
[5] Kulik, P. (2000), "Software Metrics State of the Art 2000", KLCI Inc., 2006, http://www.klci.com.
[6] Soini, J., "Approach to Knowledge Transfer in Software Measurement", International Journal of Computing and Informatics: Informatica, vol. 31, no. 4, Dec 2007, pp. 437-446.
[7] Kasunic, M., "The State of Software Measurement Practice: Results of 2006 Survey", Technical Report CMU/SEI-2006-TR-009, ESC-TR-2006-009, Software Engineering Institute (SEI), 2006.
[8] Dybå, T., "An Empirical Investigation of the Key Factors for Success in Software Process Improvement", IEEE Transactions on Software Engineering, vol. 31, no. 5, 2005, pp. 410-424.
[9] Card, D. N. and Jones, C. L., "Status Report: Practical Software Measurement", in the 3rd International Conference on Quality Software (QSIC'03) proceedings, pp. 315-320, 2003.
[10] Baddoo, N. and Hall, T., "De-motivators for software process improvement: an analysis of practitioners' views", The Journal of Systems and Software, vol. 66, no. 1, 2003, pp. 23-33.
[11] Rainer, A. and Hall, T., "A quantitative and qualitative analysis of factors affecting software processes", Journal of Systems and Software, vol. 66, no. 1, 2003, pp. 7-21.
[12] Dybå, T., "An Instrument for Measuring the Key Factors of Success in Software Process Improvement", Empirical Software Engineering, vol. 5, no. 4, 2000, pp. 357-390.
[13] Hall, T., Baddoo, N. and Wilson, D. N., "Measurement in Software Process Improvement Programmes: An Empirical Study", in the 10th International Workshop on Software Measurement (IWSM 2000) proceedings, pp. 73-83, 2000.
[14] McGarry, J., Card, D., Jones, C., Layman, B., Clark, E., Dean, J. and Hall, F., "Practical Software Measurement. Objective Information for Decision Makers", Addison-Wesley, 2002.
[15] Zahran, S., "Software Process Improvement: Practical Guidelines for Business Success", Addison-Wesley, 1997.
[16] Briand, L. C., Differding, C. M. and Rombach, H. D., "Practical Guidelines for Measurement-Based Process Improvement", Software Process - Improvement and Practice, vol. 2, no. 4, 1996, pp. 253-280.


[17] Card, D. N., "Integrating Practical Software Measurement and the Balanced Scorecard", in the 27th Annual International Computer Software and Applications Conference (COMPSAC'03) proceedings, pp. 362-367, 2003.
[18] Dybå, T., "Factors of Software Process Improvement Success in Small and Large Organisations: An Empirical Study in the Scandinavian Context", in the 9th European Software Engineering Conference (ESEC'03) proceedings, pp. 148-157, 2003.
[19] van Solingen, R. and Berghout, E., "The Goal/Question/Metric Method: A Practical Guide for Quality Improvement of Software Development", McGraw-Hill, 1999.
[20] FiSMA, Finnish Software Measurement Association, 2008, http://www.fisma.fi/eng/index.htm.
[21] TUT, Tampere University of Technology, 2008, http://www.tut.fi/.
[22] UJ, University of Joensuu, 2008, http://www.joensuu.fi/englishindex.html.
[23] Kitchenham, B., Pfleeger, S. L., Pickard, L., Jones, P., Hoaglin, D., El Emam, K. and Rosenberg, J., "Preliminary Guidelines for Empirical Research in Software Engineering", IEEE Transactions on Software Engineering, vol. 28, no. 8, 2002, pp. 721-733.
[24] Curtis, S., Gesler, W., Smith, G. and Washburn, S., "Approaches to sampling and case selection in qualitative research: examples in geography of health", Social Science & Medicine, vol. 50, issue 7-8, 2000, pp. 1001-1014.
[25] Soini, J., "Measurement in the software process: How practice meets theory", Doctoral thesis, ISBN 978-952-15-2060-0, Tampereen teknillinen yliopisto, Publication 767, 2008.
[26] Brown, J. S. and Duguid, P., "Balancing Act: How to Capture Knowledge without Killing It", Harvard Business Review, vol. 78, no. 3, 2000, pp. 73-80.
[27] Huysman, M. and de Wit, D., "Knowledge sharing in practice", Kluwer Academic Publishers, 2002.
[28] Swan, J., Newell, S., Scarborough, H. and Hislop, D., "Knowledge management and innovation: networks and networking", Journal of Knowledge Management, vol. 3, no. 4, 1999, pp. 262-275.
[29] Räsänen, P., Anttila, A. H. and Melin, H., "Tutkimusmenetelmien pyörteissä", WS Bookwell Oy, 2005.
[30] Schneidewind, N., "Knowledge requirements for software quality measurement", Empirical Software Engineering, vol. 6, no. 3, 2001, pp. 201-205.
[31] Baumert, J. and McWhinney, M., "Software Measures and the Capability Maturity Model", Software Engineering Institute Technical Report, CMU/SEI-92-TR-25, ESC-TR-92-0, 1992.
[32] Ravichandran, T. and Rai, A., "Structural Analysis of the Impact of Knowledge Creation and Knowledge Embedding on Software Process Capability", IEEE Transactions on Engineering Management, vol. 50, no. 3, 2003, pp. 270-284.
[33] Nowak, M. J. and Grantham, C. E., "The virtual incubator: managing human capital in the software industry", Research Policy, vol. 29, no. 2, 2000, pp. 125-134.
[34] O'Regan, P., O'Donnell, D., Kennedy, T., Bontis, N. and Cleary, P., "Perceptions of intellectual capital: Irish evidence", Journal of Human Resource Costing and Accounting, vol. 6, no. 2, 2001, pp. 29-38.
[35] Drucker, P., "Knowledge-Worker Productivity: The Biggest Challenge", California Management Review, vol. 41, no. 2, 1999, pp. 79-94.
[36] Ould, M. A., "Software Quality Improvement Through Process Assessment - A View from the UK", IEEE Colloquium on Software Quality, pp. 1-8, 1992.
[37] Soini, J., Tenhunen, V. and Mäkinen, T., "Managing and Processing Knowledge Transfer between Software Organisations: A Case Study", in the International Conference on Management of Engineering and Technology (PICMET'07) proceedings, pp. 1108-1113, 2007.
[38] Kald, M. and Nilsson, F., "Performance Measurement at Nordic Companies", European Management Journal, vol. 18, no. 1, 2000, pp. 113-127.
[39] Gibbs, W. W., "Software's Chronic Crisis", Scientific American, vol. 271, no. 3, Sept 1994, pp. 86-95.
[40] Florac, W. A. and Carleton, A. D., "Measuring the Software Process: Statistical Process Control for Software Process Improvement", Addison-Wesley Longman, Inc., 1999.
[41] van Veenendaal, E. and McMullan, J., "Achieving Software Product Quality", UTN Publishers, 1997.
[42] Hall, T. and Fenton, N., "Implementing Effective Software Metrics Programs", IEEE Software, vol. 14, no. 2, 1997, pp. 55-65.
[43] Simmons, D. B., "A Win-Win Metric Based Software Management Approach", IEEE Transactions on Engineering Management, vol. 39, no. 1, 1992, pp. 32-41.
[44] Andersen, O., "The Use of Software Engineering Data in Support of Project Management", Software Engineering Journal, vol. 5, issue 6, 1990, pp. 350-356.


[45] Varkoi, T. and Mäkinen, T., "Software Process Improvement Network in the Satakunta Region - SataSPIN", in the European Software Process Improvement Conference (EuroSPI'99) proceedings, 1999.
[46] Grady, R. B. and Caswell, D. L., "Software Metrics: Establishing a Company-Wide Program", Prentice-Hall, 1987.
[47] Kan, S. H., "Metrics and Models in Software Quality Engineering", 2nd Edition, Addison-Wesley, 2005.
[48] Mendonca, M. G. and Basili, V. R., "Validation of an Approach for Improving Existing Measurement Frameworks", IEEE Transactions on Software Engineering, vol. 26, no. 6, 2000, pp. 484-499.
[49] Eskola, J. and Suoranta, J., "Johdatus laadulliseen tutkimukseen", 7th edition, Vastapaino, Tampere, 2005.
[50] Lukka, K. and Kasanen, E., "The problem of generalizability: anecdotes and evidence in accounting research", Accounting, Auditing and Accountability Journal, vol. 8, no. 5, 1995, pp. 71-90.
[51] Briand, L., El Emam, K. and Morasca, S., "Theoretical and Empirical Validation of Software Product Measures", ISERN Technical Report 95-03, 1995.


IFPUG function points or COSMIC function points?

Gianfranco Lanza

Abstract
The choice of the most suitable metric to obtain the functional measure of a software application is not as easy a task as it could appear. Nowadays a software application is often composed of many modules, each with its own characteristics, its own programming languages and so on.

In some cases one metric could appear more suitable than another to measure one particular piece of software, so in the process of measuring a whole application it's possible to apply different metrics: obviously what we can't do is to add one measure to the other (we can't add inches to centimetres!).

In our organisation we are starting to use COSMIC function points to measure part of the application and to use IFPUG function points to measure other parts.

One of the possible problems is: "when management asks us for only one measure of the functional dimension of the application, what can we do?" It's possible to obtain a single measure using the conversion factors between COSMIC and IFPUG, but it would be better to maintain two distinct measures.

We are starting to compare the two metrics in different environments: in some cases there is a significant difference in the functional measure. Using a metric is mainly a matter of culture. It's important to know what a metric can give us and what it cannot, but above all, what our goal is. If, for example, our goal is to obtain a prediction of effort and costs, we have to know the productivity for each metric in the different environments: so we have to collect data!

This paper illustrates how we can use both metrics in our application.

1. Introduction
Every day we deal with metrics to satisfy our needs. For example, how many times a day do we look at our watch to know what time it is? We choose the right metric to satisfy our goal. If the goal is to know the time, we use the metric "time". If the goal is to weigh some fruit, we use the metric that gives us the weight.

Sometimes metrics give us measures that represent the same concept but in different measurement units (e.g. inches and centimetres for length). Metrics were born to measure, and we have to measure to know. Every metric gives us information; the functional metrics give us information about the dimension of our software products. In every moment of our life we have to use many metrics; in the world of application sizing, is it sufficient to use one single metric? Can one single metric be sufficient to answer all our questions, all our needs? Today, the architecture of our software applications is complex and full of software components, each one with its own characteristics, programming languages and so on. It's absolutely normal that we can use different metrics according to the nature of the software component that we have to measure.

2. User’s point of view

The dimensional measure of a software application is influenced by the choice of the user; the user's point of view is very important in determining our measure. The choice of the user also depends on the goal that we have.


If we have to measure our software application because our client wants to know the functional dimension of the software he'll buy, the user is our client. If we have to measure the software application to have information about the effort to develop it, the user could be different. In a client-server architecture we might want to know the functional dimension of the client component and the functional dimension of the server component, as if they were two different applications; we could have two different boundaries, simply because we could have different productivity data for the client component and for the server component. If our application is built with business services, we might have to measure these business services separately to know the effort of developing them; if some of these business services are already done, we would like to know how much reuse we will have. In an application we can have different user's points of view that give us different measures; the important thing is that we must not mix them!

3. Elementary Process

To determine the functional dimension of an application it is necessary to identify the elementary processes, independently of the metric we use. The elementary process is the smallest unit of activity which satisfies all of the following:

• Is meaningful to the user.
• Constitutes a complete transaction.
• Is self contained.
• Leaves the business of the application being counted in a consistent state [1].

The elementary process depends on the user's point of view and, as the system has to be in a consistent state at the end of the process, it can't be divided into sub-processes. In an application, if we change the user's point of view, we can have different elementary processes.

4. Batch Processes

In some cases a batch process is a stream of processes, each of them linked together to constitute a unique elementary process, whichever user's point of view we take. If something goes wrong during the execution of one process, the whole batch process has to be restored. Figure 1 represents the schema of this type of batch.

Figure 1: Batch process.

4.1. Batch processes using IFPUG function points
If we use IFPUG function points, we have to consider one unique elementary process for each batch process (in fact we can't divide the elementary process into different elementary processes, because the system wouldn't be in a consistent state at the end of each process of the stream; this holds independently of the user's point of view).

Then we have to discover the primary intent of this process: whether it is to maintain some logical files or change the system behaviour, or whether it is to expose data outside the boundary of our application. In the first case we have an EI, otherwise an EO or EQ.



To establish the complexity of our IFPUG functions, we have to identify the FTRs (File Types Referenced) and the DETs (Data Element Types); we could have from 3 to 7 IFPUG function points, according to the complexity and the type of the transactional function.
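To make the 3-to-7 range concrete, the sketch below is our own illustration based on the standard IFPUG 4.2 complexity matrices for transactional functions (it is not code taken from the counting manual): the complexity class and the function point weight follow from the FTR and DET counts.

```python
# Minimal sketch of the IFPUG 4.2 complexity matrices for transactional
# functions (EI, EO, EQ). Illustrative only; refer to the counting manual [1]
# for the normative rules.

def transactional_fp(ftype, ftr, det):
    """Return (complexity, function points) for an EI, EO or EQ."""
    if ftype == "EI":
        det_band = 0 if det <= 4 else (1 if det <= 15 else 2)
        ftr_band = 0 if ftr <= 1 else (1 if ftr == 2 else 2)
    else:  # EO and EQ share the same matrix
        det_band = 0 if det <= 5 else (1 if det <= 19 else 2)
        ftr_band = 0 if ftr <= 1 else (1 if ftr <= 3 else 2)
    complexity = [["Low", "Low", "Average"],
                  ["Low", "Average", "High"],
                  ["Average", "High", "High"]][ftr_band][det_band]
    weights = {"EI": {"Low": 3, "Average": 4, "High": 6},
               "EO": {"Low": 4, "Average": 5, "High": 7},
               "EQ": {"Low": 3, "Average": 4, "High": 6}}
    return complexity, weights[ftype][complexity]

# A batch stream counted as a single high-complexity EO yields 7 FP,
# the "EOH 7" value that recurs in Table 1:
print(transactional_fp("EO", 6, 23))   # ('High', 7)
# The lowest possible weight is a low-complexity EI or EQ (3 FP),
# which is where the 3-to-7 range comes from.
```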

We have to add these function points to the data function (ILF or EIF) function points, but very often these batch processes are inside the boundary of a client-server or web application, and the data functions that they reference are the same ones counted in the client-server or web application; being inside the same boundary, we can't count our data functions twice!

We also have to observe that in a large application we can have a great number (more than 50) of these batch processes. At the end of the counting, in any case, the numbers of function points for these processes are all very similar, probably with a difference of three or four function points. Is this measure useful to us? Is it sufficiently significant to apply, for example, a model to obtain the cost and the effort of developing the batch processes?

Probably the answer is no!

4.2. Batch processes using COSMIC function points
If we use COSMIC function points, we also have to identify the elementary process from the user's point of view. In this case the elementary process is the same as the one identified in the IFPUG function point process. In COSMIC there are no data functions, but there are objects of interest and data movements, and each data movement has one object of interest [2]. There are four types of data movements: Entry, Exit, Read and Write. In a batch process with a stream of processes linked together to constitute a single elementary process there is a multitude of data movements. An example is illustrated in Figure 2. The weight of each data movement is one single function point. There is no limit to the number of data movements in an elementary process (at least one Entry and one of the others), so we can have functional processes with a strong difference in function points, according to the number of data movements. Is this measure useful to us? Is it sufficiently significant to apply, for example, a model to obtain the cost and the effort of developing the batch processes?

Probably the answer is yes!

Figure 2: Data Movements.

4.3. A real case
Table 1 shows the result in IFPUG function points, and Table 2 the result in COSMIC function points, of the functional measure of the batch processes of a business application. The number of IFPUG function points of each batch process is obtained by summing the relative transactional functions with a portion of the data function weight (the entire data function weight is 228 FP / 32 batch processes = 7.125 FP per process).



The number of COSMIC function points is obtained from the number of data movements of each batch process.
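As a simple illustration of this rule (our own sketch, not code from the measurement manual), the COSMIC size of a functional process is just the count of its data movements:

```python
# Minimal sketch: in COSMIC, every Entry, Exit, Read and Write contributes
# exactly one CFP, so the size of a functional process is the sum of its
# data movement counts.

def cosmic_fp(entries, exits, reads, writes):
    return entries + exits + reads + writes

# First batch process of Table 2 (CTTRIS000-002): 1 Entry, 5 Exits,
# 5 Reads, 10 Writes.
print(cosmic_fp(1, 5, 5, 10))   # 21 CFP

# Because the number of data movements is unbounded, the batch sizes in
# Table 2 spread from 8 to 68 CFP, whereas the IFPUG count in Table 1 is
# almost always 7 FP per batch.
```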

Table 1: The IFPUG function points counting result.

Name            Function  FP
Print and Update BATCH
CTTRIS000-002   EOH       7
CTTRIS000-003   EOH       7
CTTRIS000-008   EOH       7
CTTRIS500-001   EOH       7
CTTRIS500-003   EOA       5
CTTRIS600-001   EOH       7
CTTRTP000-002   EOH       7
CTTRTP000-003   EOH       7
CTTRTP000-005   EOH       7
CTTRTP000-008   EOH       7
CTTRTP000-009   EOH       7
CTTRTP000-010   EOH       7
CTTRTP000-011   EOH       7
CTTRTP000-012   EOH       7
CTTRTP000-013   EOH       7
CTTRTP000-014   EOH       7
CTTRTR500-001   EOH       7
CTTRTR600-001   EOH       7
CTTRTR600-002   EOH       7
CTTRTR600-004   EOH       7
CTTRTR800-001   EOH       7
CTTRTR800-002   EOH       7
CTTRTS000-001   EOH       7
CTTRTS700-001   EOH       7
CTTRTS700-002   EOH       7
CTTRTS700-008   EOH       7
CTTRTU000-001   EOH       7
CTTRTU000-002   EOH       7
CTTRTU000-003   EOH       7
CTTRTU000-008   EOH       7
CTTRTS700-001   EOH       7
CTTRTS700-002   EOH       7
Total = 222


Table 2: The COSMIC function points counting result.

Name            Entry  Exit  Read  Write  Total
Print and Update BATCH
CTTRIS000-002   1      5     5     10     21
CTTRIS000-003   1      19    5     10     35
CTTRIS000-008   1      8     6     3      18
CTTRIS500-001   1      8     6     10     25
CTTRIS500-003   1      4     2     1      8
CTTRIS600-001   1      9     11    14     35
CTTRTP000-002   1      27    15    7      50
CTTRTP000-003   1      29    16    8      54
CTTRTP000-005   1      5     4     1      11
CTTRTP000-008   1      43    16    8      68
CTTRTP000-009   1      35    15    7      58
CTTRTP000-010   1      26    15    7      49
CTTRTP000-011   1      22    14    6      43
CTTRTP000-012   1      27    14    6      48
CTTRTP000-013   1      28    14    6      49
CTTRTP000-014   1      27    15    7      50
CTTRTR500-001   1      15    9     9      34
CTTRTR600-001   1      17    9     5      32
CTTRTR600-002   1      11    9     5      26
CTTRTR600-004   1      9     9     5      24
CTTRTR800-001   1      15    12    1      29
CTTRTR800-002   1      22    13    1      37
CTTRTS000-001   1      3     4     2      10
CTTRTS700-001   1      1     3     6      11
CTTRTS700-002   1      1     3     4      9
CTTRTS700-008   1      4     3     1      9
CTTRTU000-001   1      8     12    5      26
CTTRTU000-002   1      8     12    5      26
CTTRTU000-003   1      16    5     3      25
CTTRTU000-008   1      3     3     1      8
CTTRTS700-001   1      1     3     6      11
CTTRTS700-002   1      0     3     4      8
Total = 947

As we can see, there is a strong difference between the two measures; the difference is represented in the graphs in Figure 3 (IFPUG FP) and Figure 4 (COSMIC FP). There is a strong difference in the total sum over the batch processes but there is, above all, a very strong difference between the batch processes themselves.


[Scatter chart: IFPUG FP value (y-axis, 0-8) per Batch Id (x-axis).]

Figure 3: Distribution of IFPUG FP Batch.

[Scatter chart: COSMIC FP value (y-axis, 0-80) per Batch Id (x-axis).]

Figure 4: Distribution of COSMIC FP Batch.

5. SOA Architecture and Reuse

Nowadays, many applications are developed in a SOA architecture, where some pieces of software are available as business services. This type of architecture should allow a certain amount of reuse in application development; the question is: "can we measure this amount of reuse?"

In some cases the SOA approach allows a business service to be a functional process recognisable by the user (e.g. the authentication procedure); in other cases business services are part of a functional process that contains them (e.g. the reading of some external data, maintained by another application, during the flow of an elementary process). It would be useful to calculate a functional measure of the amount of reusable software. When the business service available in a SOA architecture is a functional process and it is recognisable by the user, the measure can be calculated as one elementary process, so it shouldn't be a problem.


When the business service is contained in a functional process, it is more difficult to measure it as an elementary process. In both cases we have some reuse: this means that some operations that we would have to implement are already done by an available piece of software. The problem is only how to measure this piece of software through a metric.

While in the first case we can use both COSMIC and IFPUG function points for the measure of the elementary process, in the second case, without the presence of an elementary process, the use of COSMIC function points through the identification of data movements can help us. The fact is that every data movement has an associated object of interest that has to be meaningful to the user, so which data movements can we count? It can depend on the user's point of view.

Figure 5 shows a representation of the two cases.

Figure 5: The two cases: a business service as a recognisable functional process (e.g. an authentication procedure) and SOA business services contained within an elementary process.

5.1. A Real Case
We illustrate a use case of inserting personal data into a register of births. During the elementary process of inserting, the system has to reference many external interface files, to check whether the person is already present, whether he is present in other registers of births and so on.

The first operation of the elementary process is to check whether the user has the privilege to insert the person (otherwise an error message is displayed); then the system can proceed with the insert function.

As the insert function can be activated by more than one business application, a business service has been built to perform this operation.

The user knows all the checks needed (they are functional requirements), but the creation of a business service is not a specific requirement.

In the following Figure 6 we can see the COSMIC function point and IFPUG function point counting of the elementary process of inserting a person.

In the IFPUG counting the logical data files used in the EI are also included.



COSMIC FP Counting

Name                                        Entry  Exit  Read  Write  FP
PersonaFisica Inserimento                   2      1     2     1      6
  Data movements: Entry: PF; Entry: Residence Data; Read: ID Utente; Read: Privilege; Write: PF (business service inserisciPersonaFisica); Exit: message.
inserisciPersonaFisica (business service)   2      1     7     3      13
  Data movements: Entry: PF; Entry: ID Utente Iride; Read: PF in GMS; Read: NAO; Read: BPR; Read: Topo; Read: SITAD; Read: SAS; Read: Iride; Write: Anagrafica PF; Write: Residenza; Write: Log; Exit: result.

IFPUG FP Counting

Name                        Type  FTR/RET  DET  FP
Persona Fisica Inserimento  EI    8        23   6
GMS                         EIF   1        8    5
NAO                         EIF   1        15   5
BPR                         EIF   1        6    5
Topo                        EIF   1        5    5
SOTAD                       EIF   1        5    5
SAS                         EIF   1        6    5
Anagrafica PF               ILF   2        21   10
Iride                       EIF   1        6    5

Figure 6: COSMIC and IFPUG function point counting of the elementary process of inserting a person.

If we have to implement an application that performs the inserting function using the business service already available, we can notice that in IFPUG function points there is one transactional function, an EI, that performs the inserting operation. It's impossible to weigh the reuse of this function in function points; the only thing that we can do is to apply a percentage to the 6 function points (a corrective factor, e.g. using the NESMA enhancement process) [3].

If we use COSMIC function points, we see that it is possible to identify the data movements that we do not have to implement. In Figure 6 these are the movements of the business service: seven data read movements (Read: PF in GMS; Read: NAO; Read: BPR; Read: Topo; Read: SITAD; Read: SAS; Read: Iride) and two data write movements (Write: Anagrafica PF; Write: Residenza), for a sum of nine COSMIC function points. If we take the nine data movements minus the write data movement for the call of the business service, we can consider a reuse of eight COSMIC function points.

In this manner we can consider a certain amount of function points of reuse for every elementary process.
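A small sketch of this reuse arithmetic (our own illustration; the movement names are those listed in Figure 6):

```python
# Minimal sketch of the reuse reasoning above: the data movements already
# implemented inside the inserisciPersonaFisica business service do not have
# to be developed again; only the Write that calls the service remains.

reused_reads = ["PF in GMS", "NAO", "BPR", "Topo", "SITAD", "SAS", "Iride"]  # 7 Reads
reused_writes = ["Anagrafica PF", "Residenza"]                               # 2 Writes

reused_movements = len(reused_reads) + len(reused_writes)   # 9 CFP
call_to_service = 1   # the Write movement that invokes the business service
reuse_cfp = reused_movements - call_to_service
print(reuse_cfp)      # 8 CFP of reuse for this elementary process
```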


6. Complex Functions

In some cases, in a GUI application, there are complex procedures that are activated through buttons. Each of these procedures is a unique elementary process from the user's point of view. Generally they reference many FTRs, but they do not always have many DETs. If the application is not so big, they can have a strong impact on productivity and, consequently, on the effort of software development. It's not so easy to evaluate their impact.

We illustrate a real case of an application of 190 IFPUG function points that has the very complex function “Esegui Verifica”, activated by a button. The procedure is an EO transactional function.

IFPUG FP

Function                                         Type  FTR  DET  Compl.  FP
...
Procedimenti
Visualizzazione elenco procedimenti per azienda  EQ    1    1    Bassa   3.0
Dettaglio procedimento                           EQ    2    2    Media   4.0
Nuovo procedimento                               EI    2    5    Media   4.0
Modifica                                         EI    2    3    Bassa   3.0
Richiedente                                      EQ    1    1    Media   4.0
Effluenti prodotti                               EO    1    1    Bassa   4.0
Dettaglio effluente                              EO    1    1    Bassa   4.0
Modifica effluente                               EI    1    3    Bassa   3.0
Annulla                                          EI    1    2    Bassa   3.0
Elimina                                          EI    1    2    Bassa   3.0
Stampa                                           EQ    2    2    Alta    6.0
Revoca stampa                                    EI    1    2    Bassa   3.0
Esegui verifica                                  EO    6    23   Alta    7.0
U.P.A.
Elenco UPA                                       EO    1    1    Bassa   4.0
...

Figure 7: Extract from the IFPUG function point counting of the application.

Figure 7 shows an extract from the IFPUG function point counting, while Figure 8 shows the COSMIC function point counting of the same "Esegui Verifica" function.

As we can see, the value of "Esegui Verifica" in IFPUG function points is seven, while the value of the same function in COSMIC function points is 25.


COSMIC FP

Name             Entry  Exit  Read  Write  FP
Esegui Verifica  1      12    12    0      25
  Data movements: Entry: Procedimento; Read: anagrafe aziende, anagrafe tributaria, infocamere, Iter, dati, effluenti prodotti, UPA, tecnica colturale, tipologie di smaltimento, dichiarazioni, allegati, controlli; Exit: anagrafe aziende, anagrafe tributaria, infocamere, Iter, ...

Figure 8: COSMIC function point counting of "Esegui Verifica".

Also in this case the COSMIC function point count seems to reflect the weight of the function in a more significant manner.

7. How do we manage IFPUG and COSMIC FP measures in the same application?

When part of a business application has been measured in IFPUG function points and other parts in COSMIC function points, the problem is: "Which is the real measure of the whole application?"

We obviously can't sum the IFPUG measure with the COSMIC measure! If there is a need to have a single measure of the whole application, we can apply a conversion factor between IFPUG and COSMIC (e.g. Desharnais: CFP = 1.0*IFPUG - 3; van Heeringen: CFP = 1.22*IFPUG - 64).
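As a quick sanity check of these factors (a minimal sketch using only the two formulas quoted above), applying them to the 222 IFPUG FP counted for the batch processes in Table 1 gives:

```python
# Minimal sketch applying the two published conversion formulas quoted above.

def cfp_desharnais(ifpug):
    return 1.0 * ifpug - 3

def cfp_van_heeringen(ifpug):
    return 1.22 * ifpug - 64

ifpug_batch_total = 222                      # Table 1
print(cfp_desharnais(ifpug_batch_total))     # 219.0
print(cfp_van_heeringen(ifpug_batch_total))  # 206.84
# Both values are far from the 947 CFP actually counted in Table 2,
# which illustrates why the converted figure should be treated with caution.
```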

Can we trust these factors? According to the examples seen previously, it's difficult to trust them completely.

It would be better to consider the two measures separately, each of them with its own characteristics.

Considering the two measures separately naturally means that we will have different indicators for each of them (e.g. productivity, defectiveness, ...).

We can use literature data (ISBSG, Capers Jones, ...) for some indicators. In the case of COSMIC the historical data available are not as plentiful as for IFPUG function points, so, perhaps, it can be difficult to trust them at the moment. The best thing would be, in any case, to collect our own data into a repository and to build our own indicators.
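For example, a local repository could keep one productivity indicator per metric; the sketch below uses purely hypothetical project names and effort figures to show the idea.

```python
# Minimal sketch of per-metric indicators built from a local repository.
# Project names and effort figures are hypothetical, for illustration only.

repository = [
    # (project, metric, functional size, effort in hours)
    ("Project A", "IFPUG",  190, 1500),
    ("Project B", "IFPUG",  222, 1700),
    ("Project C", "COSMIC", 250, 1800),
]

def hours_per_fp(metric):
    rows = [(size, effort) for _, m, size, effort in repository if m == metric]
    return sum(e for _, e in rows) / sum(s for s, _ in rows)

print(hours_per_fp("IFPUG"))    # productivity indicator for IFPUG-sized work
print(hours_per_fp("COSMIC"))   # kept separate: never mix the two baselines
```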


8. Conclusions
Using metrics is undoubtedly a matter of culture. Using different metrics in an application gives the project manager more suitable information to control the software process. However, we have to pay attention when using metrics. Someone says that metrics are like spies: under pressure they tell you what you want!

Naturally we have to use the right metric in the right place and at the right moment, without any prejudice; but, above all, we have to collect data in order to understand and know exactly what they can give us and which indicators are useful to us.

This process is a delicate one; it has to be performed by metrics experts who guide the whole metrics process and prevent its misuse.

The most important thing, in any case, is that metrics can help us, but they are not the solution to all our problems.

9. References
[1] IFPUG, "Function Point: Manuale delle Regole di Conteggio", versione 4.2.
[2] COSMIC, "The COSMIC Functional Size Measurement Method", version 3.0.
[3] NESMA, "Function Point Analysis for Software Enhancement", version 1.0.


Personality and analogy-based project estimation

Martin Shepperd, Carolyn Mair, Miriam Martincova, Mark Stephens

Abstract
The aim of this research is to investigate the relationship between personality and expert prediction behaviour when estimating software project effort using analogical reasoning. For some years we have been developing tools and techniques for estimation by analogy (EBA). However, the variability of results from using these tools and techniques can be difficult to interpret. We have conducted a pilot study to integrate knowledge from cognitive psychology and computer science to investigate how to improve estimation when using analogy-based tools. We interviewed and assessed the personality of two experienced project managers to gain an understanding of their background and the problem solving strategies they currently employ. Following these interviews, the project managers were given a typical project effort estimation task. The project managers were asked to complete the task using our analogical reasoning tool and to articulate their processes by means of a 'think aloud' protocol. We found significant differences in prediction approach that may in part be explained by personality differences. One aspect, namely a strong need to acquire personal understanding, may present obstacles to the successful use of some prediction tools.

Keywords: personality, analogy, case-based reasoning, cognitive psychology, software projects, effort prediction.

1. Introduction
Our research aims to better understand the cognitive processes and the impact of personality on estimation by analogy for software professionals. We are investigating how these psychological aspects impinge on project cost estimation, specifically when the estimator is using software tools to facilitate his or her prediction. Estimation by analogy (EBA), as encapsulated by case-based reasoning (CBR) tools, is loosely based on human cognitive processes. However, humans demonstrate a wide range of individual differences and cognitive styles which may have a bearing upon performance. In particular we are interested in personality.

Despite the economic importance of accurate project cost estimation there remain many

difficulties associated with reliably estimating, particularly at an early stage of the project. Many techniques have been proposed and many validation studies conducted. Unfortunately no clear picture emerges. Although Jørgensen has championed research into the use of what is often called expert judgment (i.e. humans making predictions unaided by formal systems), the interplay between expert and EBA has not been the target of empirical study. Hence we believe this research to be novel. For a comprehensive review of cost prediction research see (Jørgensen and Shepperd, 2007).

The remainder of the paper is organised as follows. In the next section we review the state

of play for EBA project prediction research. This is followed by a brief description of the cognitive psychology viewpoint on personality (or individual differences) and then a discussion of how this has been applied to software projects in general. In the third section we describe our empirical investigation of project managers from our partner organisation.


This involved semi-structured interviews, personality tests and a think-aloud protocol whilst solving a real prediction task using our CBR tool. Preliminary results are presented in Section Four, and the paper concludes with a discussion of our findings.

2. Related work

Over the years many different approaches have been proposed for the task of predicting software project costs (usually effort) at an early stage in the process. One approach that has received considerable attention and some success is EBA. The idea here is that costs are best estimated by reference to similar past projects for which total cost or effort is already known. Thus there are essentially three steps. First, to code the new or target problem in terms of known characteristics, for example size in function points. Second, to retrieve completed projects that are similar to the target project. Third, to use the retrieved costs, possibly with adaptation, to predict the new project cost.
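These three steps can be expressed compactly; the sketch below is our own simplified illustration of the general EBA/CBR idea (it is not the ANGEL implementation, and the project data are hypothetical).

```python
# Minimal EBA sketch: characterise projects, retrieve the most similar
# completed cases, and adapt their known efforts into a prediction.
import math

def normalise(projects, features):
    # Rescale each feature to 0..1 so no single attribute dominates the distance.
    bounds = {f: (min(p[f] for p in projects), max(p[f] for p in projects))
              for f in features}
    return lambda p: [(p[f] - lo) / (hi - lo) if hi > lo else 0.0
                      for f, (lo, hi) in bounds.items()]

def estimate_by_analogy(target, history, features, k=2):
    scale = normalise(history + [target], features)   # step 1: characterise
    t = scale(target)
    def distance(p):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(scale(p), t)))
    nearest = sorted(history, key=distance)[:k]       # step 2: retrieve similar cases
    return sum(p["effort"] for p in nearest) / k      # step 3: adapt (here: mean effort)

# Hypothetical completed projects characterised by size (FP) and team size.
history = [{"fp": 120, "team": 4, "effort": 2100},
           {"fp": 450, "team": 8, "effort": 9000},
           {"fp": 500, "team": 9, "effort": 9800}]
target = {"fp": 470, "team": 8}
print(estimate_by_analogy(target, history, ["fp", "team"]))   # 9400.0
```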

Whilst EBA may be conceived as an informal problem solving strategy there has been a

good deal of research formalising this style of prediction into an artificial intelligence technique known as case-based reasoning (CBR). This constitutes the underpinnings of our EBA tool, called ANGEL (Shepperd and Schofield, 1997). Subsequently there have been a number of independent studies that have sought to compare the accuracy of EBA predictions with those obtained by other means, most commonly the benchmark technique of linear regression analysis. In 2005 we identified 20 such studies. We found approximately equal evidence for and against analogy-based methods (Mair and Shepperd, 2005). This naturally poses the question, why should EBA prediction accuracy be so variable?

One avenue we are therefore exploring is the interplay between the expert and estimation

technique as encapsulated by ANGEL. Surprisingly, this is a rather neglected topic despite studies such as (Myrtveit and Stensrud, 1999) who reported that the combination of expert and CBR tool outperformed either individually. In particular there is a rich vein of psychology work that considers the cognitive processes of expert problem-solving (of which prediction is an example of an ill-defined problem). For a more extensive review of this literature and how it might apply to project prediction see (Mair et al. 2009). However in this paper we focus upon the role of personality in particular.

Project managers play a vital role in the success of software project cost estimation. It is

therefore important that psychometric data also be collected. However, research in software engineering has mainly focused on models, algorithms and improvement of tools while overlooking the importance of human factors such as the personality of software professionals. Personality is made up of the characteristic patterns of thoughts, feelings, and behaviours that make a person unique. It is typically measured in terms of type or trait.

Type theories of personality propose that types are qualitatively distinct categories. That is

that people are either introverts or extraverts. However, types do not reflect durable personality patterns; they tend to be a product of a particular place, time, and culture (Carver & Scheier, 2004). On the other hand, personality traits are persistent and exhibited in a wide range of social and personal contexts. Furthermore, trait theorists view personality as the result of internal biologically based characteristics that influence our behaviour. In addition, according to trait theorists, introversion and extraversion are part of a continuous dimension, with many people falling in the middle (Carver & Scheier, 2004).


Myers-Briggs Type Indicator (MBTI) is a popular example of a personality measure. Empirical studies using the MBTI have tended to find that certain personalities are disproportionately represented in software engineering personnel. For example, Capretz et al. (2003) found that most software engineers are introverts. The most common type found was ISTJ (introverted, sensing, thinking, judging). In a systematic literature review (Beecham et al. 2008) engineers were found to be sociable yet introverted, needing stability yet desiring a variety of new tasks and challenges. Gorla and Lam (2004) identified the best personality attributes for individual roles within a software development team. According to their survey results, the optimal personality for a team leader was intuitive and feeling; the optimal personality for a programmer was extrovert, sensing and judging; whereas the most suited type for system analyst was thinking and sensing. Of course, project teams comprise more than one individual and personality heterogeneity within the team can lead to success of the project (Howard 2001). It is therefore important to build up a team where all personality categories are represented. The majority of studies in software engineering have used the MBTI (Myers & McCaulley, 1985) as a measure of personality which demands the individual adopt the personality type he or she would use in a specific situation (e.g. at work). However, we are more concerned with biological bases of personality which are durable and manifest in a range of situations. For this reason, we elected to use the Eysenck Personality Questionnaire EPQ-R Short Scale (Eysenck and Eysenck, 1991), and the Impulsiveness (IVE) questionnaire (Eysenck and Eysenck, 1975).

3. Our study

We believe there is a need for cognitive, qualitative empirical research of professionals making predictions. This is to complement extensive current research into algorithmic and statistical aspects of prediction. We focus on making predictions using analogy. In a systematic review of the relevant empirical studies we found an absence of published work on the interplay of personal differences (personality) and experts carrying out prediction tasks (Mair et al., 2009). So this is the motivation for the empirical study reported in the remainder of this paper.

Working with our collaborator we conducted interviews, personality profiles and then a

think-aloud protocol with a prediction task supported by our CBR tool. The participants were highly experienced project managers employed by the collaborator for many years. To assess personality we used the Eysenck model of personality which is based upon three major traits: Extraversion, Neuroticism (emotionality) and Psychoticism (tough-mindedness). The Eysenck Personality Questionnaire (EPQ) is a development of various personality questionnaires over 40 years and is regarded as a highly reliable and valid measure of personality. Internal consistencies and test-retest reliabilities of all three factors are above 0.7, many above 0.8 (on a 0 to 1 scale) and thus are highly satisfactory. The validity of these scales is supported by much experimental evidence. In fact, it is the best supported of any generally available personality measure.

The organisation we are collaborating with is a major, international software developer

with clients around the globe. They have had an extensive software measurement programme in place since the early 1990s and have amassed a database of over 10,000 projects. This database includes information about duration, team size, methods and language, client details, project size (typically measured in function points (FPs) and lines of code (LOC)) and total effort. Unfortunately there are extensive problems of missing values and these problems are compounded by issues of trustworthiness.


However, guided by a Metrics Specialist from our collaborator, we identified a small subset of 18 comparable, recent UK enhancement projects. These were then used as the basis of a relevant realistic prediction task for our expert participants.

Table 1: Summary Data of Project Size and Effort Information.

Variable                 Min     Median     Mean      Max
Unadjusted FP             90        459      658     1719
Adjusted FP Count         90        525      701     1822
Total Logical LOC       2676      25959    29940    64031
Duration                 192        380      393      544
Effort (person-hours)   6174      14760    18200    50886

The projects range in Effort from 6174 to 50886 person hours, essentially an order of magnitude. There is a slight tendency for the mean to exceed the median implying a positive skew to the data, in other words a few atypically large values (Table 1). A total of 16 variables were selected including methodology, client, full project name, language, maximum staffing, start and end dates and the information from Table 1.

Figure 1a: Relationship between project size in FPs and Effort.


Figure 1b: Relationship between project size in Logical LOC and Effort.

In Figures 1a and 1b we use scatter plots to show the poor (certainly non-linear) relationship between effort and size as measured by either FPs or LOC. The highlighted projects indicate extreme anomalous values. The regression lines are added for purposes of clarity only. This ill-defined relationship between obvious size measures and effort is a naturally occurring example of why cost prediction is difficult. There was also no clear relationship (not illustrated) between LOC and FPs even though all projects were developed in the same language. This indicates that the backfiring method of converting LOC to FPs is unlikely to be useful in this particular environment. More generally, it means that simple ratio-based approaches to estimation are unlikely to be effective (or indeed any form of linear regression modelling). This characteristic of the data caused the participants some difficulties for the prediction task, something we will shortly return to. Finally, we note in passing that even a naïve approach to this data set using our EBA tool ANGEL with a jackknife cross-validation could yield average prediction errors of less than 20% MMRE.
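To make the jackknife (leave-one-out) evaluation of estimation by analogy concrete, the sketch below shows one minimal way it could be computed in Python. It is not the ANGEL tool itself: the single-nearest-neighbour rule, the feature set and the illustrative project data are assumptions for the example only.

```python
import numpy as np

def jackknife_mmre(features: np.ndarray, effort: np.ndarray, k: int = 1) -> float:
    """Leave-one-out estimation by analogy: each project is estimated from the
    mean effort of its k nearest neighbours (Euclidean distance on normalised
    features); the Mean Magnitude of Relative Error (MMRE) is returned."""
    # Normalise each feature to [0, 1] so no single dimension dominates the distance.
    lo, hi = features.min(axis=0), features.max(axis=0)
    norm = (features - lo) / np.where(hi > lo, hi - lo, 1.0)

    mres = []
    for i in range(len(effort)):
        dists = np.linalg.norm(norm - norm[i], axis=1)
        dists[i] = np.inf                   # exclude the held-out project itself
        neighbours = np.argsort(dists)[:k]  # indices of the k most similar projects
        estimate = effort[neighbours].mean()
        mres.append(abs(effort[i] - estimate) / effort[i])
    return float(np.mean(mres))

# Purely illustrative project features (size in FPs, logical LOC) and effort values.
X = np.array([[120, 4000], [450, 22000], [700, 31000], [1500, 60000], [520, 24000]], dtype=float)
y = np.array([6500.0, 14000.0, 19000.0, 48000.0, 16000.0])
print(f"Jackknife MMRE: {jackknife_mmre(X, y):.2%}")
```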

The estimation task was devised by taking one of the 18 completed projects and reducing the actual effort threefold (i.e. 15258 to 5086 person-hours). Each participant was then given the scenario that this estimate had been provided by another, unknown manager and their task was to perform a 'sanity check' using the data set of projects and the EBA tool ANGEL. Since they were unfamiliar with this tool, other than a prior 15-20 minute demonstration with a 'toy' example, one of the investigators would 'drive' the tool on their behalf. As they carried out the estimation task they were encouraged to use a think-aloud protocol and this was recorded and subsequently transcribed. In addition, the analogy tool ANGEL also generated a log file of usage.


4. Results

From the think-aloud transcriptions of the two project managers we produced process maps of how the two participants approached the estimation task (Figures 2 and 3). These are intended to indicate the major steps in the prediction task, which took approximately 60 minutes and 20 minutes respectively for the two participants.

The diagram is divided into three 'swimming lanes' which represent:
• References to the target problem (i.e. the estimate for which the sanity check is required).
• The prediction activities (e.g. computing a ratio).
• Explicit requests for external information (e.g. speak to a colleague), which of course could not be satisfied due to the constrained nature of the investigation.

The thought clouds denote self-reflective processes such as familiarity with a technique or degree of confidence in a result. In addition, time is conveyed by moving top-down through the diagram and lastly the small number boxes represent explicit references to other projects, in other words making use of analogies.

Even a cursory comparison of the two process maps reveals some clear differences. Most significantly, P1 was reluctant to make a decision concerning the prediction task without additional information. By contrast, P2 not only decided that the provided estimate was too low (5086 hours) but suggested an alternative value (12000 hours). Recall, the true value was 15258 hours. Interestingly P1 took approximately three times longer than P2. Given our interest is in analogical reasoning we also note that P2 made 16 explicit analogical references compared with 5 for P1. In general P1 did not wish to use the ANGEL tool to retrieve similar projects but preferred to establish and "validate" theories concerning relationships among the data. In particular, P1 sought to find linear relationships and stable ratios, for example in terms of delivery rates. As we discussed in the previous section such relationships are not to be found within this project data set and this caused P1 some difficulty. Interestingly towards the end of the task P1 speculated that the problem might lie within the quality of the data provided for the task. At one stage P2 also looked for some linear relationship between FPs and effort but when this was not supported by the data, P2 retreated from this viewpoint and investigated a new approach based on finding analogies using two (not one) dimensions. This proved to be successful.


Figure 2: Process map of P1 carrying out the prediction task.


Figure 3: Process map of P2 carrying out the prediction task.


Next we used the Adult EPQ-Revised Short Scale, comprising 48 items, and Eysenck's Impulsiveness (IVE) questionnaire, which consists of 54 items and is designed to measure three personality traits: Impulsiveness, Venturesomeness, and Empathy. The IVE supplements and complements the EPQ-R Short Scale measure. A dichotomous response format is used in both questionnaires, with respondents ticking "Yes" or "No". Hence our participants' profiles comprise six dimensions: Extraversion (E, sociability), Psychoticism (P, tough-mindedness), Neuroticism (N, anxiety), Impulsiveness (I), Venturesomeness (V, tendency to be adventurous), and Empathy (Emp).

Table 2: Personality Profiles of the Study Participants.

Participant  Psychoticism (P)  Extraversion (E)  Neuroticism (N)  Impulsiveness (I)  Venturesomeness (V)  Empathy (Emp)
P1           Below average     Below average     Below average    Below average      Above average        Average
P2           Below average     Above average     Below average    Below average      Average              Average

Across the two questionnaires, the participants scored below average on the Psychoticism, Neuroticism, and Impulsiveness traits. However, P1 scored below and P2 above average on Extraversion. On Venturesomeness, P1 scored above and P2 below average. Both participants' Empathy scores were average (Table 2).

A high score on Psychoticism, or tough-mindedness, is associated with aggressiveness, hostility, anger, non-conformity and inconsideration. Our participants scored below average on this trait, demonstrating tender-mindedness, empathy, unselfishness, altruism, warmth, and placidness. Responses indicated that the participants enjoyed cooperating with others, were conforming, took notice of what other people thought, and cared about good manners.

A high score on Extraversion characterises sociability, outgoingness, and a need for external stimulation and action. In general, an extravert dislikes solitary pursuits, preferring excitement often achieved through taking chances and acting on impulse. In contrast the introvert tends to be quiet, retiring and studious. He or she can be reserved and distant except in intimate friendships, tends to plan ahead, and usually is not impulsive. Introverts prefer order and keep their feelings controlled. Hence, introverts are generally reliable, somewhat pessimistic, even tempered and tend to place great value on ethical standards. Our participants' scores differed on this trait. P1 was below average and P2 above. The effects of this difference on task performance are discussed below.

Scoring high on Neuroticism (emotionality or anxiety) characterises high levels of depression and anxiety, low self-esteem, and feelings of guilt. Both participants scored below average on this trait which suggests emotional stability.

High scores on Impulsiveness are associated with an inclination to act on impulse. Again both participants scored lower than average, suggesting that they are not impulsive and think carefully before making a decision.


Venturesomeness characterises being adventurous and exhibiting risk-taking behaviour. On this trait, P1 scored above average and P2 below average. We interpret this as P1 enjoying taking risks, welcoming new, exciting experiences and sensations. However, these scores contradict the participants’ scores on the Extraversion trait. This might be explained by the fact that each trait comprises a range of dimensions. We intend to investigate this apparent incongruence later. The effects of this difference on task performance are further discussed below.

Empathy is associated with the capacity to share and understand another's state of mind or emotion. The participants' scores were average. This suggests they can easily identify with and understand another's situation, feelings, and motives. They can become engrossed in their friends' problems which may be reflected in their moods.

Thus, to summarise, P1 and P2 differ on two of the six traits, Extraversion and Venturesomeness (Table 2). Now we turn to the question of how these differences might explain, at least in part, differences in the participants' prediction behaviour. Other researchers such as Huitt (1992) have found that when solving problems, introverted individuals tend to take time to think and clarify their ideas before engaging, whilst extraverts tend to want to talk through their ideas in order to clarify them. In addition, introverts are often concerned with their own understanding of important concepts and ideas, whilst extraverts seek feedback from others about the value of their ideas. In essence, P1 as an introvert operated in an "inner world of ideas" and attended to his or her internal consistency whilst P2 operated more in the outside world and attended to the "external reality".

Consequently P1 found the whole basis of the CBR tool, which is an example of a lazy learner, incongruent with the introvert’s preferred style of problem solving. Finding analogies in high-order feature space is unintuitive and fits ill with simple linear modelling. Much of the raison d'être of estimation by analogy is that data are irregular and that by using past history one can avoid the need to build explanatory models. There is no support for inductive reasoning, i.e. to induce general theories or principles from example project data.

It is less easy to explain the impact of differences in the Venturesomeness trait. P1, perhaps surprisingly given the below average score for Extraversion, exhibited an above average score for Venturesomeness. However, it may be that the risk-seeking behaviour of the participants did not play much of a role since the task was artificial, with little likelihood of harm, particularly compared with, say, a multi-million pound or euro project.

5. Discussion

To recap, in this paper we have described a pilot study that we have conducted to empirically investigate the relationship between personality or individual differences and prediction behaviour of experts when using an estimation by analogy method. To do this we conducted semi-structured interviews to establish a context. For this pilot we worked with two experienced participants. This was followed by an estimation task using a think-aloud protocol. The task was based upon actual project data from the collaborating organisation. This was complemented by a personality assessment using the Adult EPQ-Revised Short Scale and Eysenck Impulsiveness (IVE) questionnaires.


The two participants differed in personality in terms of two traits: Extraversion and Venturesomeness. We also observed significant differences in prediction processes. P1 took considerably longer than P2 and was reluctant to commit to an answer without further information whilst P2 was happy to provide a single point value. P1 made little use of the EBA tool ANGEL whilst P2 used it extensively and effectively.

The question therefore arises as to what extent we can explain this in terms of Extraversion and Venturesomeness. Other studies have found significant differences in problem-solving approach between introverts and extraverts and we believe this may be salient to our study. However, the impact in our study of Venturesomeness or risk taking may be constrained by the artificial nature of the task since little rested upon the outcome.

So, what practical significance does this study have? One clear lesson is that designers of both estimation methods and tools need to take care to avoid a "one size fits all" mentality. It may well be that a partial explanation for the mixed accuracy results from EBA stems from individual differences of the estimators. Results such as those from this study certainly should be fed back into the design of the next generation of EBA tools.

Finally there are some limitations to the results. In this pilot we only analyse results from two participants. We intend to extend this analysis to a larger number of project managers. In addition we note that the Extraversion and Venturesomeness trait results seem to be contradictory in that they are negatively correlated. Further investigation is required along with the use of other personality measures.

6. Acknowledgements

This work is funded by EPSRC grants EP/G007683/1 (Southampton Solent University) and EP/G008388/1 (Brunel University). We are also grateful to the collaborating organisation for supporting this research and allowing us access to their project managers and data.

7. References
[1] Beecham, S., Baddoo, N., Hall, T., Robinson, H. & Sharp, H. (2008). "Motivation in software engineering: A systematic literature review". Information and Software Technology, 50(9-10), pp860-878.
[2] Capretz, L.F. (2003). "Personality types in software engineering". International Journal of Human-Computer Studies, 58(2), pp207-214.
[3] Carver, C.S. & Scheier, M.F. (2004). Perspectives on personality (5th ed.). Boston: Allyn and Bacon.
[4] Eysenck, H.J. & Eysenck, S.B.G. (1975). Manual of the Eysenck Personality Questionnaire. Hodder & Stoughton, London.
[5] Eysenck, H.J. & Eysenck, S.B.G. (1991). Manual of the Eysenck Personality Scales (EPS Adult). Hodder & Stoughton, London.
[6] Eysenck, S.B.G., Pearson, P.R., Easting, G. & Allsopp, J.P. (1985). "Age norms for Impulsiveness, Venturesomeness and Empathy in Adults". Personality & Individual Differences, 6(5), pp613-619.
[7] Gorla, N. & Lam, Y.W. (2004). "Who should work with whom? Building effective software project teams". Communications of the ACM, 47(6), pp79-82.
[8] Howard, A. (2001). "Software engineering project management". Communications of the ACM, 44(5), pp23-24.
[9] Huitt, W. (1992). "Problem solving and decision making: Consideration of individual differences using the Myers-Briggs Type Indicator". Journal of Psychological Type, 24, pp33-44.
[10] Jørgensen, M. & Shepperd, M. (2007). "A Systematic Review of Software Development Cost Estimation Studies". IEEE Transactions on Software Engineering, 33(1), pp33-53.
[11] Mair, C. & Shepperd, M. (2005). "The Consistency of Empirical Comparisons of Regression and Analogy-based Software Project Cost Prediction". 4th Intl. Symp. on Empirical Softw. Eng. (ISESE), Noosa Heads, Australia, IEEE Computer Society.


[12] Mair, C., Martincova, M. & Shepperd, M. (2009). "A Literature Review of Expert Problem Solving using Analogy". 13th International Conference on Evaluation & Assessment in Software Engineering (EASE 2009), Durham, UK, BCS.
[13] Myers, I. & McCaulley, M. (1985). Manual: A guide to the development and use of the Myers-Briggs Type Indicator. Palo Alto, CA: Consulting Psychologists Press.
[14] Myrtveit, I. & Stensrud, E. (1999). "A controlled experiment to assess the benefits of estimating with analogy and regression models". IEEE Transactions on Software Engineering, 25(4), pp510-525.
[15] Shepperd, M.J. & Schofield, C. (1997). "Estimating software project effort using analogies". IEEE Transactions on Software Engineering, 23(11), pp736-743.


Effective project portfolio balance by forecasting project's expected effort and its break down according to competence spread over expected project duration

Gaetano Lombardi, Vania Toccalini

Abstract
The proper allocation of an organisation's finite resources is crucial to its long-term prospects. The most successful organisations are those that have a project portfolio planning process in place. Because of the large number of scenarios, taking the right decision requires the use of quantitative techniques. The organisation then needs to plan how much work to take in before taking any decisions to change the current capacity. Lack of early planning can lead to paralysis as a result of firefighting, or to inefficiencies in using the available resources. On the other hand, making an early, reliable project plan starting from rough and incomplete market requirements is very difficult and the error in planning is quite high (usually an underestimate): this leads to wrong plans, causing paralysis and inefficiencies as well.

The paper presents a set of indicators, based on past project data, that can be used to improve early planning capability by providing planning constants for time, effort and the distribution of effort among the competences needed in projects. These are used, according to the Project Portfolio Model, to prepare scenarios and to support effective decisions on the product roadmap.

1. Introduction
As organisations move toward a multi-project approach as their preferred way of developing new products, the difficulty of maintaining aligned business strategies, resource needs and delivery dates increases exponentially. Common symptoms of those difficulties are continual firefighting to solve the latest problems, weekly reprioritisations among projects, waste of resources on non-critical activities, resource over-commitment, and consequently missed deadlines.

As most of the causes are related to poor management practices or are consequences of poor organisational policies, they are difficult to recognise and are more often attributed to ineffective project estimation, project planning and project tracking techniques. In reality, projects launched in organisations handling multiple projects without this capability soon become disasters when the necessary resources are not made available on time or promised deliveries from peer projects are delayed. The solution to chronic schedule delays and budget overruns therefore requires not new planning methods, but avoidance of these failure conditions.

The introduction of a Project Office as the means of coordinating, managing and reporting on projects across the organisation has been identified as the most effective way to achieve the organisation's business objectives. This has implied the need to define a project portfolio model and techniques to handle a complex system such as a multi-project environment.

In this paper, we present a technique piloted and used in the ERICSSON MW Project Office to address scenario planning activities, with the purpose of identifying critical resource needs in the early phases of a project as well as putting constraints on project planning, to avoid over-commitment and thus make effective use of finite resources.


2. Project Portfolio Model

The Project Portfolio Management Model provides a common approach to project portfolio management in Ericsson R&D, which includes aligning the project portfolio to the organisation's strategic plans, balancing it, and continuously monitoring and controlling it.

The project portfolio management model includes processes, decisions and documents needed for efficient and successful portfolio management. In the model, a number of terms and concepts are used.
• A project portfolio is a collection of projects or programs and other work that are grouped together to facilitate effective management of that work to meet strategic business objectives. The projects or programs of the portfolio may not necessarily be interdependent or directly related. A prerequisite for successful portfolio management is that the project portfolio is clearly defined and that the accountability for the projects and programs in it is clarified.
• A project portfolio scenario is an alternative project portfolio used for analysis purposes. Portfolio scenarios consist of different combinations of ongoing, planned and potential projects that are set up in an environment that is not affecting the currently approved project portfolio plan.

The project portfolio management model consists of the following processes:
• Project Portfolio Aligning.
• Strategic Portfolio Balancing.
• Project Portfolio Monitoring and Controlling.

The activities presented in the paper are mainly related to the Project Portfolio Aligning process.

2.1. Project Portfolio Aligning
Project Portfolio Aligning is a process that will ensure that all incoming product assignments and other potential projects are identified, categorised and evaluated before a decision can be made whether to include them in the project portfolio or not.

In the project portfolio aligning process, all new, ongoing and potential projects and programs are evaluated for their strategic fit. Based on the outcome of the portfolio aligning, a decision can be made on which projects to include in (and exclude from) the portfolio. This will ensure that the strategic alignment of the project portfolio is optimised, given the resource and financial constraints of the project portfolio.

The main activities in Portfolio Aligning are:
• Identification and preliminary analysis.
• Portfolio scenario management.

The purpose of the activity Identification and Preliminary Analysis is to register, categorise and analyse all new and potential projects and programs that can be considered for inclusion in the portfolio. Registration and categorisation of all potential projects is a prerequisite for evaluation of their strategic fit and impact on the current project portfolio.


Portfolio Scenario Management is the activity of creating and evaluating project portfolio scenarios. Portfolio scenarios consist of different combinations of ongoing, planned and potential projects that are set up in an environment that is not affecting the currently approved project portfolio plan. In this environment, each scenario is evaluated. It is a critical success factor for this type of analysis that the estimates of high-level project cost, of high-level resource demand (per unit, competence / job profile) and of their spread over time are as precise and accurate as possible; otherwise the results will not be effective.

To make scenario analysis more effective and to avoid over-commitment in the roadmap, we have started working on improving early estimation techniques by using historical data to provide planning constants and effort distributions for each competence type used in projects.

3. High level Cost/Schedule Project Estimates: computation based on historical data
When analysing a new project to prepare scenarios, detailed plans do not exist and a rough estimation of cost, effort and schedule has to be developed by the estimator responsible, based on the proposed project scope. In this case the comparison against historical data from previous similar projects can be useful to refine and calibrate the rough estimates provided for the new project early in the project lifecycle.

In the context of project planning and estimation, the Monte Carlo simulation technique (MCS) has been used to analyse historical project duration/cost data and to provide key resulting information to the estimators responsible for revising their estimates.

The Monte Carlo technique is a method used to estimate a probable outcome using multiple simulations with random input variables.

In our case study the Monte Carlo method has been used to simulate the expected total project cost and total project duration on the basis of project cost/duration data of each project phase derived from closed projects.

In the Ericsson Microwave Project Office the whole project lifecycle consists of three main phases: project analysis phase (pre-study), project planning phase (feasibility) and project execution phase. As input variables to the MCS, the probability distribution functions of duration and cost for each project phase have been obtained through different statistical tools (normality tests, individual distribution identification) applied to the historical database to find the optimal distribution fitting our data. Statistical parameters for each input variable, including mean, variance, standard deviation, location and scale, have also been calculated. The simulations of total project cost and total project duration have been repeated until the probability distributions were sufficiently well represented to achieve the desired level of accuracy (10000 iterations performed).
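A minimal sketch of this simulation step is shown below in Python. The lognormal distributions and their parameters are illustrative assumptions only; the actual study fitted the per-phase distributions to Ericsson's historical data.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000  # number of iterations, as in the study

# Illustrative per-phase duration distributions (weeks); in the real analysis these
# were fitted to historical data for pre-study, feasibility and execution phases.
phases = {
    "pre-study":   lambda n: rng.lognormal(mean=2.0, sigma=0.4, size=n),
    "feasibility": lambda n: rng.lognormal(mean=2.3, sigma=0.5, size=n),
    "execution":   lambda n: rng.lognormal(mean=3.4, sigma=0.4, size=n),
}

# Each iteration's total duration is the sum of the sampled phase durations.
total = sum(sample(N) for sample in phases.values())

for p in (50, 75, 90):
    print(f"p{p}: {np.percentile(total, p):.1f} weeks")
```

The same scheme applies to total project cost by replacing the duration distributions with the fitted per-phase cost distributions.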

Figures 1 and 2 show the outcome of the MCS: the probability distributions found for the total project duration and cost.


Figure 1: Monte Carlo simulation for total project duration.

Figure 2: Monte Carlo simulation for total project cost.

The typical method used for project cost and schedule estimation in the Ericsson MW Project Office is three-point estimation (minimum, maximum and most-likely estimates), from which a probability distribution can be derived as a better alternative to single-number estimates. This approach has the benefits of eliminating potential optimism or pessimism and providing a set of bounds to the estimates.

Minimum, maximum, most likely duration/cost estimates for a new project in early phases can be tuned and readjusted based on the cumulative distribution functions (CDF) of total project cost and duration obtained through MCS on historical data.



A cumulative distribution plot shows on the y-axis the percentage of data samples that have a value less than the value of the x-axis. In the world of uncertainty, it is common to report values at various probability confidences, particularly p50, p75, and p90 to represent 50%, 75%, and 90% confidence respectively.

Comparing the minimum value estimated for cost/schedule to p50, the maximum value to p90 and the most likely value to p75, the estimator responsible can assess the probability that the new project will be completed within a given period of time and within a specific budget. If the proposed three-point estimates are higher or lower than the outcome from historical data, the estimator responsible can adjust them properly. This is helpful to management because it can show the level of certainty of achieving a specific cost/schedule objective.

Figure 3 reports the cumulative distribution plot obtained for the total project duration. It shows that there is a 50% chance of a project being completed within 52,7 weeks or less, a 75% chance of it being completed within 66,9 weeks or less, and a 90% chance of developing the project within 84,5 weeks or less. Figure 4 shows the cumulative distribution plot obtained for total project cost.

Figure 3: CDF for total project duration.

Figure 4: CDF for total project cost.
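The sanity check of a proposed three-point estimate against the simulated CDF could be sketched as follows; the three-point values and the simulated sample are invented for illustration and do not reproduce the Ericsson figures.

```python
import numpy as np

def sanity_check(three_point: dict, simulated_totals: np.ndarray) -> None:
    """Compare a (min, most likely, max) estimate with the p50 / p75 / p90 of the
    simulated distribution, following the mapping described in the text."""
    mapping = {"min": 50, "most_likely": 75, "max": 90}
    for key, pct in mapping.items():
        reference = np.percentile(simulated_totals, pct)
        verdict = "lower than history suggests" if three_point[key] < reference else "in line or higher"
        print(f"{key}: {three_point[key]:.1f} vs p{pct} = {reference:.1f} -> {verdict}")

# Illustrative simulated total durations (weeks) and an illustrative three-point estimate.
total = np.random.default_rng(1).lognormal(mean=4.0, sigma=0.3, size=10_000)
sanity_check({"min": 45.0, "most_likely": 60.0, "max": 80.0}, total)
```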


4. High level Cost Resource Demand
When providing early stage effort estimation for a new project, the main issue is to understand which type and which amount of effort is required in the execution of the project and how the effort shall be distributed over the entire project life cycle.

Finding the optimum way to allocate resources to a project is crucial: if a project ramps up too quickly to maximum load, then resources are wasted before work is completed; if a project ramps up too slowly, then budget overruns occur late in the project as timelines are compressed. If a constant level of manpower is used over all the phases of a project, some phases would be overstaffed and other phases would be understaffed, causing inefficient use of effort and leading to schedule slippage and increased cost.

To provide an initial, indicative effort estimate for a new project, a data-driven approach can be used by analysing patterns of resource expenditure from past projects.

In our approach the historical data trend analysis has been performed in two steps:
• Estimation of the average effort distribution ratio, expressed as percentage of total effort, per competence type.
• Estimation of the effort profile distribution spread over the project duration, per competence type (staffing build-up rate).

4.1. Average effort distribution ratio estimate
For each past project the actual effort data employed for product development have been collected and broken down according to five competence types: Project effort, System effort, HW development, SW development, and Integration and Verification effort.

Exploratory Data Analysis (EDA) using box-plots has been used first to explore the historical data and to estimate and compare the average effort distribution among the different competence types. Unusually large or small observations (outliers) in the historical data have been investigated and removed from the analysis.

In the box-plot representation the top of the box represents the third quartile Q3 (75% of the data are less than or equal to this value), the bottom of the box the first quartile Q1 (25% of the data are less than or equal to this value) and the line in the middle of the box is placed at the second quartile or median (50% of the data are less than or equal to this value). The gray box represents the interquartile range (Q3-Q1) which spans the middle 50% of the data.

The outcome of this analysis is reported in Figure 5. For example, considering the "System" competence type, 50% of the historical observations show an allocation to the System discipline of between 4,8% and 12,4% of the total effort employed during the entire project lifecycle. This representation can be used to provide insights for future projects on where effort is globally being spent, weighted across the different competences.


Figure 5: Box-plot for average effort distribution among competence types.
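The quartiles behind such a box-plot can be computed directly from the historical effort shares. The sketch below assumes a hypothetical table of effort shares (fraction of total project effort per competence type, one entry per closed project); the numbers are not the Ericsson data.

```python
import numpy as np

# Hypothetical effort shares (fraction of total effort) for five closed projects.
shares = {
    "Project": [0.14, 0.16, 0.13, 0.19, 0.15],
    "System":  [0.05, 0.12, 0.09, 0.07, 0.10],
    "SW":      [0.22, 0.25, 0.24, 0.21, 0.26],
    "HW":      [0.35, 0.30, 0.33, 0.36, 0.32],
    "I&V":     [0.24, 0.17, 0.21, 0.17, 0.17],
}

# First quartile, median and third quartile per competence type, as drawn in the box-plot.
for competence, values in shares.items():
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    print(f"{competence:8s} Q1={q1:.1%}  median={median:.1%}  Q3={q3:.1%}")
```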

4.2. Effort patterns distribution over project time frame estimate

The effort build-up rate for each competence type over the project duration has been estimated using multiple regression analysis to understand trends and relationships as they exist in the historical project effort data. In this way descriptive models of the effort distribution patterns have been developed for each competence profile.

As a first step, exploratory analysis of the historical data has been made through statistical distribution analysis of each effort profile in order to find a suitable set of explanatory variables (predictors) depending on the elapsed project time "t" expressed in months. Correlation analysis among the available predictors has been performed and obvious redundancies have been eliminated. The regression models for each competence type have been developed by applying "best subset" regression to screen a suitable subset of explanatory variables before running the multiple regression analysis. Unnecessary terms with poor statistical significance have been removed, and fitted models have been validated by residual plot analysis, evaluation of the adjusted R-squared figure and consideration of the impact of possible outliers.
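A minimal version of this modelling step, using ordinary least squares over candidate functions of the elapsed month t, might look as follows. The candidate terms and the synthetic profile data are assumptions for illustration; a full best-subset search and the complete residual diagnostics would be layered on top of this core fit.

```python
import numpy as np

# Elapsed project time in months and an illustrative observed effort profile
# (fraction of total effort booked in each month) for one competence type.
t = np.arange(1, 10, dtype=float)
profile = np.array([0.07, 0.11, 0.13, 0.13, 0.12, 0.12, 0.11, 0.11, 0.10])

# Candidate explanatory terms depending on t, similar in spirit to those used in the paper.
terms = {
    "1":       np.ones_like(t),
    "t":       t,
    "t^2":     t ** 2,
    "1/t^2":   1.0 / t ** 2,
    "exp(-t)": np.exp(-t),
}
X = np.column_stack(list(terms.values()))

# Ordinary least squares fit, followed by adjusted R-squared as a crude adequacy check.
coef, *_ = np.linalg.lstsq(X, profile, rcond=None)
fitted = X @ coef
resid = profile - fitted
ss_res, ss_tot = np.sum(resid ** 2), np.sum((profile - profile.mean()) ** 2)
n, p = len(t), X.shape[1]
r2_adj = 1 - (ss_res / (n - p)) / (ss_tot / (n - 1))

print(dict(zip(terms, np.round(coef, 4))))
print(f"adjusted R-squared: {r2_adj:.3f}")
```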

As an example, the resulting regression equations for the HW and SW effort profiles are reported in Figures 6 and 7, together with the analysis of the validity of the models based on residuals.

(Figure 5 data: box-plot of the effort share per competence type, Project / System / SW / HW / I&V, on a 0% to 50% scale, annotated with the median and quartile values.)


Figure 6: HW Effort Profile regression analysis.

Figure 7: SW Effort Profile regression analysis.

To summarise, the fitted models found for each effort profile are adequate, as shown by the high adjusted R-squared (R-Sq (adj.)) values obtained. Based on the results of the regression models, we explained between 84% and 96% of the variation of each competence profile build-up over project time. Validation of the regression models has been based on the assumption that the errors, i.e. the residuals, are normally distributed and independent, with constant variance and a mean value equal to 0. The results of our model validation indicate that the fitted models perform acceptably, as the assumptions on the residuals are met for each effort profile.

4.3. Estimates limitation, estimates usage and way forward

Early estimates generally involve the use of incomplete information. Estimation is likely to improve by performing two steps: refining and readjusting the basic preliminary estimates based on comparison with historical project data, and re-estimating as more information becomes available during the project lifecycle.

Equation-based estimation must only be used to produce a rough, initial project effort estimate, as it is not sufficiently accurate to produce an estimate on which a final project commitment can be taken.

(Figures 6 and 7 data: four-in-one residual plots, normal probability plot, residuals versus fits, histogram and residuals versus observation order, for the HW Profile Avg and SW Profile Avg regression models.)


Parametric models built for estimation purposes can be inaccurate if not properly calibrated and validated; moreover, the historical data used for calibration may not be relevant to new projects, so model refinement must be done continuously using newly collected metrics. The developed models are meant to be used in early-phase project planning; for this reason the predictive capability of the models has not been assessed, as it is outside the scope of our case study.

Given a new project, once the expected project duration has been estimated and the total percentage of effort allocated to the different competence types has been deduced from similar past projects, the effort breakdown according to competence types and its spread over the expected project duration can be based on the effort regression equations derived from the regression analysis of historical data. This type of information may be useful as a starting point for the organisation when improving its effort estimation process and assessing organisational resource capability.

Figure 8 shows an application of the regression equations, used to provide a preliminary staffing build-up profile for a new project with an expected project duration of 9 months.

Figure 8: Effort distribution spread over the expected project duration

5. Conclusion

In the paper we have presented how the inputs to Project Portfolio scenario management can be improved by using quantitative techniques. The first applications of those techniques have shown better adherence to the planned roadmap.

To make the analysis more accurate, the next step will be to create different estimation groups for the different categories of projects (e.g. new feature development, maintenance), as so far the analysis has been made only using historical data coming from new product development projects. In this way, routinely collecting new data from new project categories and analysing the relevant historical data will be part of the portfolio management process and will be the basis of the effort distribution model refinement needed to improve estimation accuracy.

Figure 8 data: effort profile regression equations and the resulting effort distribution spread over the expected project duration (Tot Prj Time = 9 months).

Project Profile Avg = -0,390 + 0,0519 t - 0,00160 t^2 - 11,4 exp(-t^2) + 8,93 / t^2 - 11,7 exp(-t)
System Profile Avg = 0,0636 - 0,000113 t^2 + 0,832 / t^2 - 3,42 exp(-t) + 1,31 t exp(-t)
SW Profile Avg = 0,161 - 0,00692 t - 0,231 exp(-t)
HW Profile Avg = 0,0617 - 0,000179 t^2 - 2,61 exp(-t^2) + 1,95 / t^2 - 2,58 exp(-t)
I&V Profile Avg = 0,0567 + 0,631 exp(-log10(t)) / t - 1,71 exp(-t)

% allocation per elapsed month t (each row sums to 100%):
t         1        2        3        4        5        6        7        8        9
Project   10,26%   15,80%   16,99%   14,61%   11,82%    9,32%    7,68%    6,87%    6,66%
System    11,52%   15,89%   17,64%   14,31%   11,11%    8,96%    7,63%    6,77%    6,17%
SW         6,87%   11,55%   12,84%   12,87%   12,45%   11,86%   11,20%   10,52%    9,83%
HW        10,25%   15,18%   14,83%   13,38%   11,81%   10,33%    9,07%    8,02%    7,13%
I&V        8,03%    8,04%   12,37%   13,34%   12,95%   12,24%   11,55%   10,97%   10,51%
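To illustrate how the monthly allocation follows from these equations, the sketch below evaluates the published SW and HW profile equations over t = 1..9 months and normalises them so that each profile sums to 100%; small rounding differences against the published table are to be expected.

```python
import numpy as np

t = np.arange(1, 10, dtype=float)  # elapsed months for a 9-month project

# SW and HW effort profile equations as published with Figure 8
# (decimal commas written here as decimal points).
profiles = {
    "SW": 0.161 - 0.00692 * t - 0.231 * np.exp(-t),
    "HW": 0.0617 - 0.000179 * t**2 - 2.61 * np.exp(-t**2) + 1.95 / t**2 - 2.58 * np.exp(-t),
}

for name, values in profiles.items():
    allocation = values / values.sum()  # normalise so the monthly shares sum to 100%
    print(name, " ".join(f"{x:.2%}" for x in allocation))
```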




Defect Density Prediction with Six Sigma

Thomas Fehlmann

Abstract
Can we predict defect density in advance for software that is going into production? The answer is "yes", if you apply statistical methods to requirements and run measurement programs for functional size and defect cost. The Six Sigma toolbox provides statistical methods for building such prediction models.

This paper explains the basics of these statistical methods without going into the details of Design for Six Sigma (DfSS) and Quality Function Deployment (QFD).

1. Introduction
Testers would love to know how many defects remain undetected when they deliver software to a customer or user. For capacity planning and in service management, knowing in advance how many people will be needed for application support would be welcome.

Intuitively it seems impossible to predict the unknown future and how many defects customers and users are able to detect – and consequently have to remove – when starting to use the new application. However, statistical methods exist for predicting the probability of finding defects by calculating the expected defect density in requirements. It works similarly to weather forecasting, where predicting humidity levels and temperature based on measurable physical entities leads to rain or snowfall forecasts.

We apply statistical methods to requirements and to the semantics of language statements rather than to physical entities. Statistical methods for requirements engineering are well known in Quality Function Deployment (QFD). While the models are mathematically strict and exact, the application scope is not. Few standards exist for measuring requirements; the best we know are the software sizing standards that meet the ISO/IEC 14143 International Standard on Functional Size Measurement. We'll focus on two: ISO/IEC 20926 and ISO/IEC 19761. These international standards are better known under their native names of IFPUG Function Points V4.1, unadjusted, and COSMIC Full Function Points V3.0.

The prediction model uses a multidimensional vector space representing requirements to the software product with a topology that describes defect density during the various stages of software development. In mathematics, this structure is called a “Manifold”. The defect density function changes from project inception stage to requirements definition stage, to design and implementation stage. These stages correspond to different views on the software requirements definition and implementation process.

2. What are defects in Software?
2.1. A mistake, a bug, or a defect?
"Defects, as defined by software developers, are variances from a desired attribute. These attributes include complete and correct requirements and specifications as drawn from the desires of potential customers. Thus, defects cause software to fail to meet requirements and make customers unhappy.
And when a defect gets through during the development process, the earlier it is diagnosed, the easier and cheaper is the rectification of the defect. The end result in prevention or early detection is a product with zero or minimal defects." This statement of belief in the Six Sigma approach is expressed in [1].


The English language provides three different notions depending on where the nonconformity originates:
• Developers find "Mistakes" and eliminate them.
• Testers find "Bugs", and Developers eliminate them.
• Customers find "Defects", Supporters assess them, Developers eliminate them, and Testers repeat testing.

With that many roles and people involved, the cost of rectifying the nonconformity rises quite a bit. Thus it is desirable that developers find all mistakes before they become bugs, or defects at the end.

Although this is a very good idea, reality is that developers typically do not know enough to recognise mistakes before mistakes become defects – even if they had the time to do it.

2.2. Software development is knowledge acquisition about requirements

Software development is unlike civil engineering – you cannot expect to build according to a detailed plan, not even by combining best practices. Most software projects consist of ongoing knowledge acquisition – both in depth and in scope. Developers and users are both involved in knowledge acquisition. Thus, requirements are not a static entity. Requirements change: from business needs to professionally stated requirements, then to technical requirements, architecture requirements and required features.

Unfortunately, requirements also change for other reasons: because they were stated badly, or not understood, or simply not stated at all. These kinds of defects affect software development badly. For coding errors, even for logical mistakes, tools exist to detect them; missed requirements – and missing requirements – are much harder to predict.

2.3. Missed Requirements and Missing Requirements

In Six Sigma for Software, we distinguish two kinds of defects:
• A-Defects: missing or incomplete requirements, or badly stated requirements;
• B-Defects: missed requirements that had been correctly stated but were not understood by developers, or not implemented. [2]

A-Defects can cause a software product to be rejected in an organisation, or not used at all, or to become a commercial failure. A-Defects are difficult to find, but they have a natural priority weighting. Some A-Defects are less important than others.

B-Defects, in turn, are less difficult to identify: you can compare the software product with the correctly specified requirements in order to identify gaps. Thorough testing will do. However, B-Defects often have no "natural" priority assigned that reflects the customer's needs. This may lead to ineffective testing or to unacceptable delays.

Defect prediction works the same for both types of defects, provided B-Defects have a priority weighting similar to that of A-Defects.


2.4. Origins of defects

We concentrate on the following processes (in parentheses: references to CMMI V1.2) [3]:
• Voice of the Customer (VoC, see Quality Function Deployment)
• Requirements Elicitation of Business Objectives (RD, Requirements Development)
• Architecture and Design (TS, Technical Solution)
• Coding and Unit Testing (VER, Verification)
• Integration Testing (PI, Product Integration)
• Application Testing (VAL, Validation)
• Operations (Ops, see the following note on IT Service Management)

A-Defects originate typically from VoC or RD; B-Defects from TS, PI, or from VER and VAL. We assume that processes such as REQM (Requirements Management) or CM (Configuration Management) do not produce significant defects, because they can be fully automated. Also, we exclude impact from SAM (Supplier Agreement Management), RSKM (Risk Management), and from the project management and organisational process areas.

On the other hand, we also look at the Voice of the Customer process that precedes Requirements Elicitation (RD), and the IT Service Management processes that collect issues and identify software problems. This class of processes – which are detailed in ITIL [4] – is summarised as "Ops", for Operations. The only output of these processes that is of interest for this scope is the defects found during operations of the software product.

(Figure 1 data: for each process from Inception, through Planning and Design, Implementation, and Testing and Acceptance, to Customer operations, the diagram shows the customer's needs (VoC), the defects introduced, the defects detected in peer reviews, management reviews, tests and productive use, and the defects removed, with sample defect counts per process of origin.)
Figure 1: Origins of Defects.

Figure 1 shows the nonconformities that leave the process in which they were created. Mistakes and bugs do not show up. In each process, defects are introduced and propagated to the next process. Sample defect numbers are shown on the second line below the process names.


Quality assurance is able to detect some defects, which are consequently removed; other defects are not detected and therefore passed to the next process. The numbers of defects removed – together with their origin – are displayed in the “Defects removed” area.

From the defects introduced during the software development processes and the defects removed, we can calculate defect density as shown in Figure 2.

2.5. The Problem of Measuring Defects

Error counts as shown in Figure 1 and Figure 2 are indicative for selecting a testing strategy and adopting the best tactics for some software development project, but they do not allow defect density prediction unless all defects are equally sized. This assumption is wrong, as shown in section 2.3. For measuring and predicting density, we need sizing, both sizing of software and sizing of defects.

(Figure 2 data: defect density graph, assuming all defects are equally sized, with numbers according to Figure 1; a stacked bar per process, VoC, RD, TS, VER, PI, VAL and Ops, shows the number of defects by process of origin on a scale from 0 to 180.)

Figure 2: Simple Defect Density Calculation.

3. Sizing and Measuring Defects
3.1. Sizing of Business Requirements
The IFPUG Functional Size (ISO/IEC 20926:2003) [10] is a good choice for understanding and sizing user and business requirements. An IFPUG count identifies the input, logical data, and output expected from the software product, and reflects the user's view and business needs.

The sizing unit for Business Requirements is IFPUG Unadjusted Function Points (UFP = Unadjusted FP).

3.2. Sizing of Technical Requirements

It is possible to count business requirements with the ISO/IEC 19761:2003 (COSMIC FFP) approach [9] as well, if the count is executed adopting a user viewpoint; however, this standard needs more in-depth analysis of the Use Cases.

For counting requirements such as the solution architecture or quality criteria according to ISO/IEC 9126 (modularity, extensibility, maintainability, performance, encapsulation, etc.), the IFPUG standard is less helpful.


In some cases, sizing of a software solution only makes sense when adopting a suitable COSMIC FFP viewpoint; the IFPUG method would probably return a size of near to zero when no persistent data exists, such as, for instance, a converter between different data standards or formats.

The COSMIC FFP approach offers big advantages when sizing architectural decisions and the solution architecture since it provides a simple-to-use framework for functional sizing based on a suitable viewpoint. However, you can combine both sizing methods for defect prediction; it matters only that business and technical viewpoints are identified correctly, and that counts reflect the difference.

3.3. Identifying defects

Counting the number of defects that made their way into some “bug list” is not a suitable measurement method. First of all, it is very difficult to distinguish one defect from another. Especially for A-Defects, a missing requirement may show up several times under different premises. For instance, if the database model lacks an entity, or a relation between entities, testers might observe different bugs that are difficult to consolidate.

On the other hand, testers sometimes observe several bugs at once but interpret it as evidence for one bug only – not being familiar with the code. Fixing this bug may reveal other bugs – or hide them. Thus, the bug count is misleading and cannot serve as a base for defect prediction.

3.4. Measuring defects as Learning Opportunities

We can avoid counting defects if we ask for the effort needed for fixing defects rather than for a count. Since we consider both A-Defects and B-Defects, we don't distinguish between Bug Fixes and Change Requests. The latter are as indicative of a failure of our processes to identify appropriate business or technical requirements as the former are of a lack of validation and verification (VER and VAL) process capability.

In order to avoid the unpleasant aftertaste that "Bug Fixing Effort" has for developers, we prefer the term "Learning Opportunities". We don't want to miss learning opportunities through biased reporting, and we have no other means to distinguish effort spent on defects from effort spent on development than by asking the developers what they are doing.

For defect density, we measure effort in terms of Person Days (PD). By dividing the effort spent on Learning Opportunities by the total effort spent on some work package we get the Learning Opportunities Ratio (LeOR) metric, which factors the skill level away.

Under the assumption that work effort depends linearly on functional size, the LeOR is the percentage of the functional size that is affected by bug fixes and change request implementations. Thus, functional sizing provides metrics for measuring defect density. We need neither the issue list nor requirements management (REQM) for such measurement. This makes it possible to effectively predict defect density.
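A direct calculation of the LeOR for a work package, and of the functional size affected by it, might look as follows; the effort figures, the functional size and the helper function name are purely illustrative.

```python
def learning_opportunities_ratio(effort_on_fixes_pd: float, total_effort_pd: float) -> float:
    """LeOR: effort spent on 'learning opportunities' (bug fixes and change
    requests) divided by the total effort spent on the work package."""
    return effort_on_fixes_pd / total_effort_pd

# Illustrative figures: 12 person-days of rework out of 60 person-days in total.
leor = learning_opportunities_ratio(12, 60)
size_ufp = 250  # illustrative functional size of the work package
print(f"LeOR = {leor:.0%}, i.e. roughly {leor * size_ufp:.0f} UFP affected by defects")
```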

3.5. LeOR and the Sigma Scale

If you cannot get time sheet reports of reasonable quality, the Sigma scale is of practical help to measure LeOR. Most often you simply can ask developers for the Sigma level they are experiencing with their work, and the result is good enough for defect prediction.

The Sigma scale translates into the LeOR as follows: e.g., for Sigma = 2.0, the success rate is 69.1%, thus the LeOR is 100% - 69.1% = 30.9%. Experienced developers can easily identify Sigma values between 0.0 and 3.5 for their LeOR.


Sigma 0.0 means total rework; 0.5 partial rework; a Sigma of 1.0 or 1.5 is indicative of refactoring; values between 2.0 and 3.0 stand for typical development and bug-fixing cycles without (major) issues; and above 3.0 means it was right the first time. Values above Sigma 4.0 don't matter much for defect prediction.
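The translation from a reported Sigma level to a LeOR can be sketched with the conventional Six Sigma yield formula (normal CDF with the customary 1.5-sigma shift), which reproduces the 69.1% success rate quoted above for Sigma = 2.0; treat the formula as an assumption about the scale used here rather than a reproduction of the author's exact tables.

```python
from statistics import NormalDist

def leor_from_sigma(sigma: float) -> float:
    """LeOR = 1 - yield, with yield taken as the normal CDF at (sigma - 1.5),
    i.e. the conventional long-term Six Sigma shift."""
    return 1.0 - NormalDist().cdf(sigma - 1.5)

for s in (0.0, 1.0, 2.0, 3.0, 3.5):
    print(f"Sigma {s:>3}: LeOR = {leor_from_sigma(s):.1%}")  # Sigma 2.0 gives 30.9%
```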

4. Transfer Function between Different Views

4.1. A Formal Structure for Defining Knowledge Acquisition

Let’s look again at the CMMI processes listed in section 2.4. For forecasting defects, the problem is that requirements – and consequently A-Defects and B-Defects – depend on the process area to which they belong. You cannot say that a business requirement, belonging to process area RD, is violated just because TS is missing some particular solution requirement. The missing solution requirement may cause a violation of that business requirement; however, whether the particular missing TS requirement actually causes a violation, or many, or none at all, is the question that needs to be addressed when predicting defect density. This distinction is one of the big challenges for software engineering and software project management. Software engineers do not necessarily share business managers’ viewpoints.

This is why the Six Sigma Transfer Function plays a central role for defect prediction. A Transfer Function for Software Development transforms requirements from one process area, e.g., Requirements Development (RD), into requirements in some other process area, for instance Technical Solution (TS) or Product Integration (PI). Six Sigma Transfer Functions for software development are well known from the Quality Function Deployment practice, used in Design for Six Sigma (DfSS), see [2] and [7].

4.2. Defining Transfer Functions

Transfer Functions work on Knowledge Acquisition Spaces, which we explained in [5]. Requirements for the various process areas are expressed as formal knowledge about that process area’s requirements. We measure such formal knowledge as Profiles. Profiles are a normalised valuation of requirements, representing their relative priorities. A Transfer Function transforms a priority profile from business views into technical views. Thus a Transfer Function answers the question which technical requirements affect which business needs.

The logical structure for defining knowledge acquisition can be described using statistical methods. For additional information about using statistical methods for requirements engineering, we refer to [6].


Figure 3: Sample Transfer Function. (The cause/effect matrix relates goals g1–g4 – the goal profile for defect prediction and the effective profile of observed defect effects – to solution requirements x1–x5, the cause’s profile measured by its functional size and LeOR; cell weights mark strong (9), medium (3) or weak (1) relationships.)

4.3. A Sample Transfer Function

Let x be the cause’s profile and y the effects’ profile. Then the transfer function that maps causes into effects is y = T(x). As an example, let y be the desired priority profile for the business needs, and x the requirements profile for the Use Cases. The transfer function T maps the use cases to those business needs that they fulfil. Such mappings can be represented as matrices, suggesting that transfer functions are actually linear mappings between the vector spaces involved. Given the technical requirements profile x = <ξ1,…,ξn>, the response profile to business needs is y = T(x) = <ϕ1(x), …, ϕm(x)>, where the ϕi(x) are the components of the vector profile y = T(x). The technical requirements profile x = <ξ1,…,ξn> is known as the Critical Parameters in the Six Sigma practice.

T⁻¹ is the inverse Transfer Function. T⁻¹ predicts the solution x that yields y = T(x), given the goal profile y. In this case, x = T⁻¹(y) is the prediction. For a matrix representation of T, Tᵀ is the transpose of the transfer function. Tᵀ approximates T⁻¹ if x is an Eigenvector of T.

4.4. Eigenvector of a Transfer Function T

An Eigenvector is a solution of the equation [Tᵀ • T](x) = λx, where λ is a real number, usually set to λ = 1. Note that [Tᵀ • T] need not be the identity function, which means that cause and effect cannot simply be reversed!

We need to know how good the solution x is: || [Tᵀ • T](x) – λx || is called the Convergence Gap. A small Convergence Gap means that Tᵀ approximates T⁻¹ quite well, or in other words: x’ = Tᵀ(y) is a good prediction for x = T⁻¹(y).
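To make the linear-algebra view concrete, the following NumPy sketch (with an invented 3×4 relationship matrix using the 9/3/1 weights of Figure 3) applies a transfer function T to a profile, approximates the inverse by the transpose, and computes the convergence gap with λ = 1 on normalised profiles:

```python
# Sketch: a transfer function as a matrix, with the convergence gap
# || T^T T x - lambda*x || used to judge how well T^T approximates T^-1.
# The matrix below is invented; weights follow the 9/3/1 QFD convention.
import numpy as np

T = np.array([[9, 3, 0, 1],     # goal g1 vs. solution requirements x1..x4
              [0, 9, 3, 0],     # goal g2
              [1, 0, 9, 3]])    # goal g3

def normalise(v):
    return v / np.linalg.norm(v)

x = normalise(np.array([1.0, 0.8, 0.6, 0.3]))   # cause's profile (invented)
y = normalise(T @ x)                             # effective profile y = T(x)
x_pred = normalise(T.T @ y)                      # prediction x' = T^T(y)

convergence_gap = np.linalg.norm(normalise(T.T @ (T @ x)) - x)   # lambda = 1
print(f"predicted profile: {np.round(x_pred, 2)}")
print(f"convergence gap:   {convergence_gap:.3f}")
```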


4.5. The Impact of Linearity

The impact of this “linearity” statement is quite important and has been discussed in many research papers; see for instance Norman Fenton’s well-known Bayesian Networks that he uses for defect prediction [8]. The difference from the Six Sigma approach is that his model does not refer to transfer functions between different requirement spaces; rather, it addresses a software engineering model framework representing the different “modules” of a software product for gathering the data needed for defect prediction. Our approach has the advantage that defect prediction starts with the first early and quick functional sizing; it also has a disadvantage: statistical analysis alone will not do, you need to understand and identify the transfer functions from business to implementation that affect your software development project.

5. The Defects Prediction Model

5.1. Setting up the model

Now we have everything in place to build the Defects Prediction Model. The model is an excerpt from the Deming Process Chain for Software Development presented in [5] and [6], with the difference that the Transfer Functions are used not only to map requirements from one process area onto another, but also to map the corresponding defect densities from one area into another.

The model is built by connecting the defect density profile from the initial VoC process area to the RD process area and then to VER & PI, which stand for the implementation process areas. The Transfer Functions tell us the size of the requirements on the levels VoC, RD, and TS. The VER and VAL testing, and finally customer operations (Ops) experiences, provide the actual LeORs that allow calculating the defect density for each level. Once the VoC profile is validated, we can work on Business Requirements (RD) and Technical Solutions (TS) until we get small Convergence Gaps, i.e., Eigenvectors for our profiles. If the Convergence Gaps of our Transfer Functions are small, then our Transfer Functions are almost linear in the neighbourhood of our profiles, and we can predict defect density from the beginning, i.e., from the first functional sizing onwards.


Figure 4: Defect Prediction Model. (The model chains Customer’s Needs (VoC), Use Cases (RD), Technical Features (TS), Integration & Tests (VER & PI), Application Test (VAL) and Acceptance Test (Ops) through the Transfer Functions RD → VoC, TS → RD, VER → TS, VAL → RD and Ops → VoC, with sizes expressed in #Cfsu and #UFP and defect densities as #LeOR. Each level starts with an initial Sigma and, after subtracting the learning, review, unit test / peer review or testing effectiveness (the removed LeORs of the preceding levels), reaches a final target Sigma; the Customer level’s initial Sigma is the target.)

5.2. Calibrating the Model

Our model deals with profiles. Profiles work well with transfer functions; however, a profile describes the relative distribution of defects, and not the absolute defect numbers we need to calculate defect density.

To get defect density from the profiles, we have two options:
• Use historical data for defect removal efficiency and review & test effectiveness.
• Use historical data from previous projects with similar scope in the same market.

Calibration of the model can occur at every level; for instance, using measurements at the TS / VER level.

5.3. The Starting Point

We can also infer the LeOR for VoC from previous experiences and their final success. It tells us how many A-Defects we had to remove before getting successful feedback from the market or customer. The Sigma scale presented in section 3.5 is applicable for marketers as well. If everything fails, we rely on the assumption that our initial LeOR corresponds to Sigma = 1.0.

6. Conclusions

Although defect prediction is not simple, it is easily adoptable by an organisation that regularly uses Design for Six Sigma (DfSS) – or Quality Function Deployment – to control its software development. Sensitivity analysis proves the usefulness of the model even with minimum calibration. The better the calibration data, the more reliable the prediction becomes.

Experience reports have not been published so far; however, calculation tools and examples are available.


7. References
[1] Soni, M. (2009), “Defect Prevention: Reducing Costs and Enhancing Quality”, published on http://software.isixsigma.com/, online in March 2009.
[2] Fehlmann, Th. (2005), “Six Sigma in der SW–Entwicklung“, Vieweg–Verlag, Braunschweig–Wiesbaden.
[3] “CMMI® for Development, Version 1.2”, Carnegie-Mellon University, Software Engineering Institute, Pittsburgh, PA, August 2006.
[4] “IT Infrastructure Library (ITIL®) V3”, published on http://www.itil.org/, June 2007.
[5] Fehlmann, Th. (2006), “Statistical Process Control for Software Development – Six Sigma for Software revisited”, in: EuroSPI 2006 Industrial Proceedings, pp. 10.15, Joensuu, Finland.
[6] Fehlmann, Th., Santillo, L. (2007), “Defect Density Prediction with Six Sigma”, in: SMEF Roma 2007, Roma, Italy.
[7] Fehlmann, Th. (2004), “The Impact of Linear Algebra on QFD”, in: International Journal of Quality & Reliability Management, Vol. 21 No. 9, Emerald, Bradford, UK.
[8] Fenton, N., Krause, P., Neil, M. (1999), “A Probabilistic Model for Software Defect Prediction”, IEEE Transactions on Software Engineering, New York, NY.
[9] Abran, A. et al. (2007), “COSMIC FFP Measurement Manual 3.0”, www.lrgl.uqam.ca/cosmic-fpp.
[10] International Function Points User Group IFPUG (2004), “IFPUG Function Point Counting Practices Manual”, Release 4.2, Princeton, NJ.


Using metrics to evaluate user interfaces automatically

Izzat Alsmadi, Muhammad AlKaabi

Abstract
User interfaces have special characteristics that differentiate them from the rest of the software code. Typical software metrics that indicate complexity and quality may not be able to distinguish a complex or high-quality GUI from one that is not. This paper suggests and introduces some GUI structural metrics that can be gathered dynamically using a test automation tool. Rather than measuring quality or usability, the goal of those metrics is to measure GUI testability, i.e., how hard or easy it is to test a particular user interface. We evaluate GUIs for several reasons, such as usability and testability. In usability, users evaluate a particular user interface for how easy, convenient, and fast it is to deal with. In our testability evaluation, we want to automate the process of measuring the complexity of the user interface from a testing perspective. Such metrics can be used as a tool to estimate the resources required to test a particular application.

Keywords: Layout complexity, GUI metrics, interface usability.

1. Introduction

The purpose of software metrics is to obtain better measurements in terms of risk management, reliability forecasting, cost reduction, project scheduling, and improving the overall software quality. GUI code has its own characteristics that make the typical software metrics, such as lines of code, cyclomatic complexity, and other static or dynamic metrics, impractical; they may not distinguish a complex GUI from one that is not. There are several different ways of evaluating a GUI, including formal, heuristic, and manual testing. Other classifications of user evaluation techniques include predictive and experimental. Unlike typical software, some of those evaluation techniques may depend solely on users and may never be automated or calculated numerically. This paper is a continuation of the paper in [21], in which we introduced some of the suggested GUI structural metrics. In this paper, we elaborate on those earlier metrics, suggest new ones, and also make some evaluation in relation to the execution and verification process. We selected some open source projects for the experiments. Most of the selected projects are relatively small; however, GUI complexity is not always in direct relation with size or other code complexity metrics.

2. Related Work

Usability findings can vary widely when different evaluators study the same user interface, even if they use the same evaluation technique. Ivory et al. surveyed 75 GUI usability evaluation methods and presented a taxonomy for comparing those various methods [1]. GUI usability evaluation typically covers only a subset of the possible actions users might take. For these reasons, usability experts often recommend using several different evaluation techniques [1]. Many automatic web evaluation tools have been developed for automatically detecting and reporting ergonomic violations (usability, accessibility, etc.) and, in some cases, making suggestions for fixing them [4, 5, 6, 7].

Highly disordered or visually chaotic GUI layouts reduce usability, but too much regularity is unappealing and makes features hard to distinguish.


Balbo [4] surveyed GUI evaluation automation and classified several techniques for processing log files as automatic analysis methods. Some of those listed methods are not as fully automated as suggested.

Some interface complexity metrics that have been reported in the literature include:

• The number of controls in an interface (e.g., Controls’ Count, CC). In some applications [3], only certain controls are counted, such as the front panel ones. Our tool is used to gather this metric from several programs (as will be explained later).

• The longest sequence of distinct controls that must be employed to perform a specific task. In terms of the GUI XML tree, it is the longest or deepest path, in other words the tree depth.

• The maximum number of choices among controls with which a user can be presented at any point in using that interface. We implemented this metric by counting the maximum number of children a control has in the tree.

• The amount of time it takes a user to perform certain events in the GUI [19]. This may include key strokes, mouse clicks, pointing, selecting an item from a list, and several other tasks that can be performed in a GUI. This performance metric may vary from a novice GUI user to an expert one. Automating this metric will reflect the API performance, which is usually expected to be faster than a normal user. In typical applications, users don’t just type keys or click the mouse: some events require a response and thinking time, others require calculations. Time synchronisation is one of the major challenges in GUI test automation.

It is possible, and usually complex, to calculate GUI efficiency through its theoretical optimum [2, 14]. Each next move is associated with an information amount that is calculated given the current state, the history, and knowledge of the probabilities of all next candidate moves.

The complexity of measuring GUI metrics stems from the fact that they are coupled with user-related factors such as thinking and response time, the speed of typing or moving the mouse, etc. Structural metrics can be implemented as functions that are added to the code being developed and removed when development is complete [13].

Complexity metrics are used in several ways with respect to user interfaces. One or more complexity metrics can be employed to identify how much work would be required to implement or to test that interface. It can also be used to identify code or controls that are more complex than a predefined threshold value for a function of the metrics. Metrics could be used to predict how long users might require to become comfortable with or expert at a given user interface. Metrics could be used to estimate how much work was done by the developers or what percentage of a user interface’s implementation was complete. Metrics could classify an application among the categories: user-interface dominated, processing code dominated, editing dominated, or input/output dominated. Finally, metrics could be used to determine what percentage of a user interface is tested by a given test suite or what percentage of an interface design is addressed by a given implementation.

Several papers have been presented in the area of GUI structural layout complexity metrics. Tullis studied layout complexity and demonstrated it to be useful for GUI usability [11]. However, he found that it did not help in predicting the time it takes for a user to find information [12]. Tullis defined arrangement (or layout) complexity as the extent to which the arrangement of items on the screen follows a predictable visual scheme [13]. In other words, the less predictable a user interface is, the more complex it is expected to be.


The majority of layout complexity papers discussed structural complexity in terms of the visual objects’ size, distribution, and position. We will study other layout factors such as the GUI tree structure, the total number of controls in a GUI, or the tree layout. Our research considers a few metrics from widely different categories in this complex space. The selected user interface structural layout metrics are calculated automatically using a developed tool. Structural metrics are based on the structure and the components of the user interface; they do not include semantic or procedural metrics that depend on user actions and judgment (which can barely be calculated automatically). These selections are meant to illustrate the richness of this space, not to be comprehensive.

3. Goals and Approaches

Our approach uses two constraints:
• The metric or combination of metrics must be computed automatically from the user interface metadata.
• The metric or combination of metrics provides a value we can use automatically in test case generation or evaluation.

GUI metrics should guide the testing process in aiming at those areas that require more focus in testing. Since the work reported in this paper is the start of a much larger project, we will consider only single metrics. Combinations of metrics are left to future work. A good choice of such metrics should include metrics that can be calculated relatively easily and that can indicate the relative quality of the user interface design or implementation.

Below are some metrics that are implemented dynamically as part of a GUI test automation tool [18]. We developed a tool that automatically generates an XML tree to represent the GUI structure. The tool creates test cases from the tree and executes them on the actual application. More details on the developed tool can be found in the authors’ references.

• Total number of controls, or Controls’ Count (CC). This metric is comparable to the simple Lines of Code (LOC) count in software metrics. Typically, a program that has millions of lines of code is expected to be more complex than a program that has thousands. Similarly, a GUI program that has a large number of controls is expected to be more complex than smaller ones. As in the LOC case, this is not perfectly true, because some controls are easier to test than others.

It should be mentioned that in .NET applications, forms and similar objects have GUI components, whereas other project components, such as classes, have no GUI forms. As a result, the controls’ count by itself is irrelevant as a reflection of the overall program size or complexity. To make this metric reflect the overall components of the code and not only the GUI, the metric is modified to count the number of controls in a program divided by the total number of lines of code. Table 1 shows the results.

Page 82: Smef2009

Proceedings 6th Software Measurement European Forum, Rome 2009

72

Table 1: Controls/LOC percentage.

AUTs               CC    LOC    CC/LOC
Notepad            200   4233   4.72%
CDiese Test        36    5271   0.68%
FormAnimation App  121   5868   2.06%
winFomThreading    6     316    1.90%
WordInDotNet       26    1732   1.50%
WeatherNotify      83    13039  0.64%
Note1              45    870    5.17%
Note2              14    584    2.40%
Note3              53    2037   2.60%
GUIControls        98    5768   1.70%
ReverseGame        63    3041   2.07%
MathMaze           27    1138   2.37%
PacSnake           45    2047   2.20%
TicTacToe          52    1954   2.66%
Bridges Game       26    1942   1.34%
Hexomania          29    1059   2.74%

The controls/LOC value indicates how much the program is GUI oriented. The majority of the gathered values are located around 2%. Perhaps at some point we will be able to provide data sheets of the CC/LOC metric to divide GUI applications into heavily GUI oriented, medium, and low. A comprehensive study evaluating many applications is required to come up with the best values for such a classification.

The CC/LOC metric does not differentiate an organised GUI from a cluttered one (as long as they have the same controls and LOC values). Another factor that can be introduced to this metric is the number of controls on the GUI front page. A complex GUI could be one that has all controls situated flat on the screen. As a result, we can normalise the total number of controls to those on the front page: 1 means very complex, and the lower the value, the less complex the GUI is. The last factor that should be considered in this metric is the control type. Controls should not all be treated the same in terms of the effort required for testing. This factor can be added by using controls’ classification and weight factors. We may classify controls into three levels: hard to test (type factor = 3), medium (type factor = 2), and low (type factor = 1). Such a classification can be heuristic, depending on factors like the number of parameters of the control, its size, the possible user actions on the control, etc.
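A possible way to combine these factors in code is sketched below; the weight values and the sample controls are our own illustrative assumptions, not taken from the paper:

```python
# Sketch: a combined controls-based complexity indicator.
# Type weights (hard=3, medium=2, easy=1) and the sample numbers are
# illustrative assumptions, not values taken from the paper.

TYPE_WEIGHT = {"hard": 3, "medium": 2, "easy": 1}

def gui_complexity(controls, loc, front_page_controls):
    """controls: list of (name, type) pairs for all GUI controls."""
    weighted_cc = sum(TYPE_WEIGHT[t] for _, t in controls)
    cc = len(controls)
    return {
        "CC/LOC": cc / loc,                            # GUI-orientation of the code
        "front-page ratio": front_page_controls / cc,  # 1.0 = everything on one screen
        "weighted CC": weighted_cc,                    # testing-effort oriented count
    }

controls = [("File", "easy"), ("Open...", "medium"), ("Find dialog", "hard"),
            ("Exit", "easy")]
print(gui_complexity(controls, loc=4233, front_page_controls=1))
```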

3.1. The GUI tree depth

The GUI structure can be transformed into a tree model that represents the structural relations among controls. The depth of the tree is calculated as the depth of its deepest leaf node.

We implemented the tree depth metric in a dynamic test case reduction technique [20]. In the algorithm, a test scenario is arbitrarily selected. The selected scenario includes controls from the different levels. Starting from the lowest level control, the algorithm excludes from selection all those controls that share the same parent with the selected control. This reduction should not exceed half of the tree depth. For example, if the depth of the tree is four levels, the algorithm should exclude controls from levels three and four only. We assume that three controls are the least required for a test scenario (such as Notepad – File – Exit).

Page 83: Smef2009

Proceedings 6th Software Measurement European Forum, Rome 2009

73

We repeatedly select five test scenarios using the same reduction process described above. The choice of five test scenarios is heuristic; the idea is to select the smallest number of test scenarios that can best represent the whole GUI.

3.2. The structure of the tree

A tree that has most of its controls toward the bottom is expected to be less complex, from a testing viewpoint, than a tree that has the majority of its controls toward the top, as the latter offers more user choices or selections. If a GUI has several entry points and its main interface is condensed with many controls, the tree is more complex. The more tree paths a GUI has, the more test cases it requires for branch coverage.

Tree paths’ count is a metric that differentiates a complex GUI from a simple one. For simplicity, to calculate the number of leaf nodes automatically (i.e., the GUI’s number of paths), all controls in the tree that have no children are counted.

Table 2: Tree paths, depth and average tree edges per level metrics.

AUTs               Tree paths’ number  Edges/tree depth  Tree max depth
Notepad            176                 39                5
CDiese Test        32                  17                2
FormAnimation App  116                 40                3
winFomThreading    5                   2                 2
WordInDotNet       23                  12                2
WeatherNotify      82                  41                2
Note1              39                  8                 5
Note2              13                  6                 3
Note3              43                  13                4
GUIControls        87                  32                3
ReverseGame        57                  31                3
MathMaze           22                  13                3
PacSnake           40                  22                3
TicTacToe          49                  25                3
Bridges Game       16                  6                 4
Hexomania          22                  5                 5

3.3. Choice or edges/tree depth

To find the average number of edges at each tree level, the total number of choices in the tree is divided by the tree depth. The total number of edges is calculated through the number of parent-child relations. Each control has one parent except the entry point, which makes the number of edges equal to the number of controls minus 1. Table 2 shows the tree paths, depth, and edges/tree depth metrics for the selected AUTs.

The edges/tree depth can be seen as a normalised tree-paths metric in which the average of tree paths per level is calculated.

3.4. Maximum number of edges leaving any node

The number of children a GUI parent can have determines the number of choices at that node. The tree depth metric represents the maximum vertical height of the tree, while this metric represents the maximum horizontal width of the tree. Those two metrics are major factors in GUI complexity, as they determine the number of decisions or choices the GUI can present.
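All of the structural metrics above (controls’ count, tree depth, tree paths, edges per level, maximum branching) can be read directly off the GUI XML tree mentioned earlier. The sketch below is our own illustration and assumes a simple nested <control> schema, not the authors’ actual tool output:

```python
# Sketch: computing GUI tree metrics from an XML representation of the GUI.
# The XML layout is an assumption; the authors' tool may use another schema.
import xml.etree.ElementTree as ET

GUI_XML = """
<control name="Notepad">
  <control name="File">
    <control name="New"/><control name="Open"/><control name="Exit"/>
  </control>
  <control name="Edit">
    <control name="Find">
      <control name="FindNext"/>
    </control>
  </control>
</control>
"""

def depth(node):                      # tree max depth (root = level 1)
    kids = list(node)
    return 1 + (max(depth(k) for k in kids) if kids else 0)

def leaves(node):                     # number of tree paths = leaf controls
    kids = list(node)
    return 1 if not kids else sum(leaves(k) for k in kids)

def count(node):                      # total number of controls (CC)
    return 1 + sum(count(k) for k in node)

def max_children(node):               # maximum number of edges leaving a node
    kids = list(node)
    return max([len(kids)] + [max_children(k) for k in kids]) if kids else 0

root = ET.fromstring(GUI_XML)
cc, d = count(root), depth(root)
print(f"CC={cc}, depth={d}, paths={leaves(root)}, "
      f"edges/depth={(cc - 1) / d:.1f}, max branching={max_children(root)}")
```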


In most cases, metrics values are consistent with each other; a GUI that is complex in terms of one metric is complex in most of the others.

To relate the metrics to GUI testability, we studied applying one test generation algorithm developed as part of this research to all AUTs using the same number of test cases. Table 3 shows the results from applying the AI3 algorithm [17, 20] to the selected AUTs. Some of those AUTs are relatively small in terms of the number of GUI controls. As a result, many of the 50 or 100 test-case runs reach 100% effectiveness, which means that they discover all tree branches. The two AUTs that have the fewest controls (i.e., winFomThreading and Note2) achieve 100% test effectiveness in all three columns.

Comparing the earlier tables with Table 3, we find that, for example, the AUT that is most complex in terms of number of controls, tree depth, and tree path count (i.e., Notepad) has the lowest test effectiveness values in all three columns: 25, 50, and 100 test cases.

Table 3: Test case generation algorithms’ results.

Effectiveness (using AI3)
AUT                25 test cases  50 test cases  100 test cases
Notepad            0.11           0.235          0.385
CDiese Test        0.472          0.888          1
FormAnimation App  0.149          0.248          0.463
winFomThreading    1              1              1
WordInDotNet       0.615          1              1
WeatherNotify      0.157          0.313          0.615
Note1              0.378          0.8            1
Note2              1              1              1
Note3              0.358          0.736          1
GUIControls        0.228          0.416          0.713
ReverseGame        0.254          0.508          0.921
MathMaze           0.630          1              1
PacSnake           0.378          0.733          1
TicTacToe          0.308          0.578          1
Bridges Game       0.731          1              1
Hexomania          0.655          1              1

Selecting the lowest five of the 16 AUTs in terms of test effectiveness (i.e., Notepad, FormAnimation App, WeatherNotify, GUIControls, and ReverseGame), two are among the highest five in terms of number of controls, tree paths’ number, and edges/tree depth, and three in terms of controls/LOC and tree depth. The results do not match exactly in this case, but they give a good indication of some of the complex GUI applications. Future research should include a comprehensive list of open source projects, which may give better confidence in the produced results.

Next, the same AUTs are executed using 25, 50, and 100 test cases. The logging verification procedure [18] is calculated for each case. Some of those values are averages, as the executed-to-generated percentage is not always the same (due to some differences in the number of controls executed each time). Table 4 shows the verification effectiveness (test execution effectiveness using AI3 [17, 20]) for all tests.


Table 4: Execution effectiveness for the selected AUTs.

AUT                25 test cases  50 test cases  100 test cases  Average
Notepad            0.85           0.91           0.808           0.856
CDiese Test        0.256          0.31           0.283           0.283
FormAnimation App  1              1              0.93            0.976
winFomThreading    0.67           0.67           0.67            0.67
WordInDotNet       0.83           0.67           0.88            0.79
WeatherNotify      NA             NA             NA              NA
Note1              0.511          0.265          0.364           0.38
Note2              0.316          0.29           0.17            0.259
Note3              1              0.855          0.835           0.9
GUIControls        0.79           1              1               0.93
ReverseGame        0.371          0.217          0.317           0.302
MathMaze           NA             NA             NA              NA
PacSnake           0.375          0.333          0.46            0.39
TicTacToe          0.39           0.75           0.57            0.57
Bridges Game       1              1              1               1
Hexomania          0.619          0.386          0.346           0.450

The logging verification process implemented here is still in an early development stage. We were hoping that this test could reflect the GUI structural complexity with proportional values. Of the five most complex AUTs in terms of this verification process (i.e., Note2, CDiese, ReverseGame, Note1, and PacSnake), only one is listed among the measured GUI metrics. That can be either a reflection of the immaturity of the verification technique implemented, or due to other complexity factors related to the code, the environment, or other aspects. The strength of this track is that all the previous measurements (the GUI metrics, test generation effectiveness, and execution effectiveness) are calculated automatically without user involvement. Tuning those techniques and testing them extensively will make them powerful and useful tools.

We tried to find out whether the layout complexity metrics suggested above are directly related to GUI testability. In other words, if an AUT’s GUI has high metric values, does that correctly indicate that this GUI is likely to be less testable?

4. Conclusions and Future Work

Software metrics support project management activities such as planning for testing and maintenance. Gathering GUI metrics automatically is a powerful technique that brings the advantages of metrics without the need for separate time or resources. In this research, we introduced and intuitively evaluated some user interface metrics. We tried to correlate those metrics with results from test case generation and execution. Future work will include extensively evaluating those metrics using several open source projects. In some scenarios, we will use manual evaluation of those user interfaces to see whether the dynamically gathered metrics indicate the same level of complexity as the manual evaluations. An ultimate goal for those dynamic metrics is to be implemented as built-in tools in a user interface test automation process. The metrics will be used as a management tool to direct certain processes in the automated process: if the metrics find that the application, or the part of the application under test, is complex, resources will be automatically adjusted to include more time for test case generation, execution, and evaluation.


5. References
[1] Ivory, Melody, Marti Hearst. The state of the art in automating usability evaluation of user interfaces. ACM Computing Surveys (CSUR), Volume 33, Issue 4, Pages 470-516. 2001.
[2] Goetz, Philip. Too many clicks! Unit-based interfaces considered harmful. Gamasutra. <http://gamasutra.com/features/20060823/goetz_01.shtml>. 2006.
[3] LabView 8.2 Help. User interface metrics. National Instruments. http://zone.ni.com/reference/enXX/help/371361B01/lvhowto/user-interface_statistics/. 2006.
[4] Balbo, S. Automatic evaluation of user interface usability: dream or reality. In proceedings of the Queensland computer-human interaction symposium, QCHI’95, Bond University, Australia. 1995.
[5] Farenc, Ch., Palanque, Ph. A generic framework based on ergonomic rules for computer-aided design of user interface. In proceedings of the 3rd international conference on Computer-Aided Design of User Interfaces CADUI’99. <http://lis.univtlse1fr/farenc/papers/-cadui-99.ps>. 1999.
[6] Farenc, Ch., Palanque, Ph., Bastide, R. Embedding ergonomic rules as generic requirements in the development process of interactive software. In proceedings of the 7th IFIP conference on human-computer interaction Interact’99, UK. <http://lis.univtlse1fr/farenc/-papers/interact-99.ps>. 1999.
[7] Beirekdar, Abdo, Jean Vanderdonckt, and Monique Noirhomme-Fraiture. KWARESMI: Knowledge-based web automated evaluation with reconfigurable guidelines optimization. <citeseer.ist.psu.edu/article/beirekdar02kwaresmi.html>. 2002.
[8] Robins, Kay. Course website. User interfaces and usability lectures. http://vip.cs.utsa.edu/-classes/cs623s2006/. 2006.
[9] Ritter, Frank, Dirk Van Rooy, and Robert St. Amant. A user modeling design tool for comparing interfaces. <citeseer.ist.psu.edu-/450618.html>. In proceedings of the 4th international conference on Computer-Aided Design of User Interfaces CADUI'2002, Pages 111-118. 2002.
[10] Deshpande, Mukund, and George Karypis. Selective Markov models for predicting web-page accesses. ACM Transactions on Internet Technology (TOIT), Volume 4, Issue 2, Pages 163-184. 2004.
[11] Tullis, T. S. Screen design. Handbook of human-computer interaction. Elsevier Science Publishers, The Netherlands, Pages 377-411. 1988.
[12] Tullis, T. S. A system for evaluating screen formats. In proceedings for advances in human-computer interaction, Pages 214-286. 1988.
[13] Tullis, T. S. The formatting of alphanumeric displays: A review and analysis. Human Factors, Pages 657-683. 1983.
[14] Comber, T. and Maltby, J. R. Investigating layout complexity. 3rd International Eurographics Workshop on Design, Specification and Verification of Interactive Systems, Belgium, Pages 209-227. 1996.
[15] Raskin, Jeff. Humane Interface: New Directions for Designing Interactive Systems. Addison-Wesley, Boston, MA, USA. 2000.
[16] Thomas, C. and Bevan, N. Usability Context Analysis: A practical guide, Version 4. National Physical Laboratory, Teddington, UK. 1996.
[17] Alsmadi, Izzat and Kenneth Magel. GUI path oriented test generation algorithms. In proceedings of the Human-Computer Interaction conference, IASTED HCI, Chamonix, France. 2007.
[18] Alsmadi, Izzat and Kenneth Magel. GUI test automation framework. In proceedings of the International Conference on Software Engineering Research and Practice (SERP'07). 2007.
[19] Pastel, Robert. Human-computer interactions design and practice. Course. http://www.csl.mtu.edu/cs4760/www/. 2007.
[20] Alsmadi, Izzat and Kenneth Magel. GUI path oriented test case generation. In proceedings of the international conference on Software Engineering Theory and Practice (SETP07). 2007.
[21] Magel, K. and Alsmadi, I. GUI structural metrics and testability testing. IASTED SEA 2007, Boston, USA. 2007.


Estimating Web Application Development Effort Employing COSMIC: A Comparison between the use of a Cross-Company and a Single-Company Dataset

Filomena Ferrucci, Carmine Gravino, Sergio Di Martino, Luigi Buglione

Abstract
Due to their composite and particular nature, Web projects (and the related applications) are quite difficult to define (and measure) with a single size unit, and the subsequent effort estimation process for a web project still remains a critical activity for a project manager, who is in charge of delivering the final application on time, on quality, and within budget. 1st generation FSM (Functional Size Measurement) methods, such as IFPUG FPA, produced case studies and interpretative guidelines trying to “capture” all the particularities of a web project in their counting guidelines, but without success. COSMIC, the 2nd generation FSM method, seems instead to better capture and size these kinds of projects, and several studies carried out during the last years would confirm such a hypothesis.

This paper presents a further case study in this direction, presenting and discussing the results of an empirical study carried out using data from an Italian single-company dataset as well as from the public benchmarking repository ISBSG r10, analysed in the light of the OLSR (Ordinary Least Squares Regression) and CBR (Case-Based Reasoning) estimation techniques.

1. Introduction

Early effort estimation is a critical activity for planning and monitoring web project development, as well as for delivering the product on time, on quality, and within budget. Several studies have been conducted applying 1st generation FSM (Functional Size Measurement) methods (i.e., IFPUG FPA) as the (product) size unit for building an effort estimation model. COSMIC [9][8]1, the latest FSM method created and definable as a 2nd generation method, can be applied to several domains, including the prediction of web application development effort, and interesting results were recently reported [13].

This paper presents and analyses the results of an empirical study on the effectiveness of COSMIC, carried out using data from a single-company dataset and from the public benchmarking repository ISBSG r10 [15]. Regarding the employed datasets, the single-company dataset was obtained by collecting information about the total project effort and the product functional size (measured with COSMIC) of 15 web applications developed by an Italian software company. The second dataset was obtained by selecting the 16 ISBSG r10 projects sized with COSMIC v2.2 [9] and classified as ‘web’ projects.

As for the estimation methods, two widely used techniques (OLSR – Ordinary Least Squares Regression [26] and CBR – Case-Based Reasoning [1]) were applied in order to construct the prediction models, while a leave-one-out cross validation was exploited to assess the accuracy of the estimates obtained with the models. In the following we highlight in detail the differences between the two datasets, discuss the issues related to the use of a cross-company dataset by a software organisation, and derive suggestions about the way public data could be exploited in order to get more accurate estimates of development effort.

1 Already an ISO standard (19761:2003) in its version 2.1.


The paper is organised as follows: Section 2 discusses related work on web effort estimation, from which this experience moves on. Section 3 introduces the descriptions of the two observed datasets, with information on the filtering criteria adopted for choosing the sample to be analysed in a (possibly) homogeneous manner. Section 4 describes the research method adopted, providing high-level details on the two chosen techniques (OLSR and CBR) and the rationale for this choice. Section 5 presents the results of the analysis with a discussion of the obtained results. Finally, Section 6 reports conclusions from this experience and prospects for future work on this issue.

2. Related Works

2.1. Sizing and Estimating Web Applications with COSMIC

The COSMIC method has been applied to Web applications by several researchers in recent years [1][22][28]. In particular, the difficulties of applying the IFPUG method to size an Internet banking system first motivated Rollo to use COSMIC in the context of Web applications [28]. However, he did not present any empirical results supporting his thesis. Subsequently, Mendes et al. applied the COSMIC approach to Web sites, i.e., without server-side elaborations [22]. Using data from 37 Web systems developed by academic students, an effort estimation model was built applying OLSR. Unfortunately, this model did not provide good estimations, and replications of the empirical study were highly recommended to find possible biases in the collection of the data and/or in the application of the method. Subsequently, the observation that dynamic Web applications are mainly characterised by data movements (from a Web server to the client browser and vice versa) suggested applying the principles of the COSMIC method to size this type of Web application [1]. An empirical study based on 44 Web applications developed by academic students was performed to assess the COSMIC approach [1]. The effort estimation model obtained by employing OLSR provided encouraging results. Recently, three of the authors of this paper performed a case study aiming at assessing the effectiveness of a COSMIC-based model in estimating web application development effort by exploiting a single-company dataset (obtained from a set of 15 Web applications developed by an Italian software company) [13]. The Web Objects size measure, proposed by Reifer for the web [27], was also applied. Web Objects are characterised by the introduction of four new Web-related components together with the five function types of the Function Point Analysis method, namely Multimedia Files, Web Building Blocks, Scripts, and Links. The estimation models were built by applying OLSR and were validated by using a hold-out validation approach. In particular, the performance of the obtained models was evaluated by using a dataset of a further 4 Web applications developed by the same software company some time after the first 15 Web applications. The results revealed that both COSMIC and Web Objects were good indicators of the development effort.

2.2. Effectiveness of Web Estimation Models (single-company vs cross-company datasets)

The problem of analysing and comparing the effectiveness of single-company and cross-company datasets in estimating development effort is particularly relevant, and a few case studies have been conducted in recent years. A discussion of these works is provided in [3]. To the best of our knowledge, the only empirical analyses that addressed this problem in the context of web applications are those reported in [21][23]. The datasets used in these analyses were obtained from the Tukutuku2 database.

2 http://www.metriq.biz/tukutuku


Observe that the study reported in [21] includes the discussion and comparison of the results of two studies that exploited different cross-company and single-company datasets from the Tukutuku database.

The authors applied the OLSR and the CBR as estimation techniques and the results revealed that, as expected, the models obtained with the single-company dataset provided much better estimation than those obtained with the cross-company dataset. A replication of these studies was reported in [23], where a different and larger cross-company dataset was obtained from the Tukutuku database. As for single-company dataset the set of 15 web applications developed by an Italian software company was employed. Observe that this is the same dataset exploited in [13] that will be also used in this paper. The results reported in [23] confirmed those showed in the replicated study.

2.3. Sizing and Estimating Web Applications with other Size Measures

Measures other than COSMIC have been proposed in the literature to size web applications and to estimate web application development effort. Besides Web Objects [27], already introduced above, the OOmFPWeb method maps the FP concepts onto the primitives used in the conceptual modelling phase of OOWS [2], a method for producing software for the Web. In a recent work [3], a preliminary evaluation of OOmFPWeb has been provided. Two empirical studies based on a dataset of 12 Web applications from the industrial world were performed to establish whether Web Objects can be affordably used to estimate the development effort of Web applications [29]. In particular, in the first study OLSR was employed, while in the second one Web-COBRA was exploited, gathering in both cases encouraging results.

Some authors investigated the usefulness of size measures specific to Web applications, such as the number of Web pages, media elements, internal links, etc. [24]. Those measures were used to provide effort estimations by employing several techniques, such as CBR, Linear and Stepwise Regression, and Regression Trees, using Web applications developed by academic students and by companies. The results of these studies showed that Stepwise Regression provided better estimations with respect to the other techniques. Two sets of size measures were compared in [10]: the first included some length measures (such as the number of Web pages, the number of server-side scripts and applications, etc.), while the second contained some functional measures (in particular, the components used to evaluate the Web Objects).

The results of this study revealed that the first set provided better estimates when using CBR, while the second set provided better estimates when using OLSR. However, the analysis also suggested that there were no significant differences in the estimations and the residuals obtained with the two sets of measures. The empirical study presented in [12] compared the following sets of measures: Web Objects, the two sets used in [10], and the set of measures proposed in [24]. The empirical results showed that all the measures provided good predictions in terms of the MMRE, MdMRE, and Pred(0.25) statistics, and the study largely confirmed the results of previous works. Baresi and Morasca defined several measures on the basis of attributes obtained from design artifacts automatically generated with W2000, and conducted an empirical study and two replications for identifying the attributes that may be related to the effort required for designing Web applications (i.e., they did not focus on the total development effort) [4]. These studies involved the students of advanced university classes on modelling Web applications, and the authors argued that the designers’ background impacted the achieved results and that further analysis is required.


3. Data Sets Descriptions

The availability of reliable project data coming from the industrial world is a crucial aspect in any empirical software engineering study. Indeed, the lack of publicly available datasets is often the main problem in the empirical software engineering field, preventing researchers from effectively and efficiently verifying hypotheses and ideas and from transferring to the industrial world proposals conceived in the academic field. In the following, we provide the description of two industrial datasets. The first is a single-company dataset, since it includes information on web applications developed by a single software company, while the second is a cross-company dataset, since it consists of information on a set of web applications from a public repository that includes data on software systems developed by several software companies.

3.1. Single-company dataset

The single-company dataset employed in the empirical study reported in this paper has been provided by an Italian software company with about 50 employees, whose core business is the development of enterprise information systems, mainly for local and central government [13]. Among its clients there are also health organisations, research centres, industries, and other public institutions. The company is specialised in the design, development, and management of solutions for Web portals, enterprise intranet/extranet applications (such as Content Management Systems, e-commerce, work-flow managers, etc.), and Geographical Information Systems. It is ISO 9001:2000 certified, and it is also a certified partner of Microsoft, Oracle, and ESRI. This company provided us with information on 15 Web applications (e-Government, e-Banking, Web portals, and Intranet applications). They have been developed by exploiting a wide range of Web-oriented technologies, such as J2EE, ASP.NET, etc. Oracle has been the most commonly adopted DBMS, but SQL Server, Access, and MySQL have also been employed in some cases. Regarding the people involved in the development process, 12 applications were carried out by teams of 6 people, while the other 3 were developed by 7 people. In all cases, the teams worked in the same building.

The 15 web applications were sized using COSMIC v2.2 [9]3. The basic idea underlying this approach is that, for many kinds of software, most of the development effort is devoted to handling data movements from/to the persistent storage and the users. Thus, their number can provide a meaningful view of the system size. To identify data movements, COSMIC requires defining a proper context model for the specific application, bounding the software from its operating environment. Basically, each data movement crossing a boundary has to be counted. The method considers four kinds of data movements, i.e., Entry, Exit, Read, and Write, and the functional size of the software, in terms of COSMIC functional size units (cfsu) [15]4, is given by the sum of all these data movements.
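In code, the COSMIC size is simply the sum of the identified data movements over all functional processes; the sketch below uses invented functional processes and counts purely for illustration:

```python
# Sketch: COSMIC size as the sum of data movements (Entry, Exit, Read, Write).
# The functional processes and counts below are invented examples.

functional_processes = {
    "Create order":  {"E": 2, "X": 1, "R": 1, "W": 1},
    "Search orders": {"E": 1, "X": 2, "R": 2, "W": 0},
}

size_cfsu = sum(sum(moves.values()) for moves in functional_processes.values())
print(f"Functional size = {size_cfsu} cfsu")   # each data movement = 1 cfsu
```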

The collection of information about the application size measure and the actual development effort is the main difficulty in carrying out this kind of study. Concerning the effort collection, the software company used timesheets to keep track of this information: each team member annotates the information about his/her development effort, and weekly each project manager stores the sum of the efforts for the team. In order to collect all the significant information needed to calculate the values of the size measures, the authors defined a template to be filled in by the project managers. All the project managers were trained in the use of this form. They took into account the COSMIC Measurement Manual, version 2.2 [8], and the rules provided in [1] for the collection required to apply COSMIC to their projects.

3 The current version of the Measurement Manual is v3.0 [6].
4 With v3.0, ‘cfsu’ are now called CFP (COSMIC Function Points).


One of the authors analysed the filled templates and the analysis and design documents, in order to cross-check the provided information. The same author calculated the values of the size measures.

3.2. Cross-company dataset

Regarding the second, cross-company dataset, the best-known and most used repository including FSM-sized data is the one by the International Software Benchmarking Standards Group (ISBSG)5. ISBSG is a non-profit organisation whose members are only National Software Metrics/Measurement Associations (SMAs), currently including (in alphabetical order): AEMES (Spain), CSBSG (China), DASMA (Germany), FISMA (Finland), GUFPI-ISMA (Italy), KOSMA (Korea), IFPUG (USA), NASSCOM (India), NESMA (Netherlands), QESP (Australia), SwiSMA (Switzerland), and UKSMA (United Kingdom).

ISBSG started its software benchmarking initiative more than 10 years ago, in the mid ‘90s. The current full version of the repository (r10) includes 4106 projects, and more specific repositories have already been created (the ones for Software Development & Enhancement projects, Software Maintenance & Support, and Software Package Acquisition & Implementation).

The repository takes into account more than 100 project factors/attributes6, and from a statistical viewpoint it could be harder to identify trends in a more varied dataset than in a more homogeneous one. Therefore, there is the need to create homogeneous subsets. Several steps were run for filtering the project attributes of interest, as summarised in Table 1.

Table 1: ISBSG r10 dataset: filtration for COSMIC projects (n=117).

Step  Attribute                  Filter               Projects Excluded  Remaining Projects
1     Count Approach7            = COSMIC-FFP         3989               117
2     Data Quality Rating (DQR)  = {A | B}            7                  110
3     Web Development            = {Non-blanks}       84                 26
4     Application Type           = {New Development}  10                 16
                                 = {Enhancement}      20                 6
                                 = {Re-Development}   0                  0

The projects highlighted in the original table (the 16 new development web projects) are the ‘remaining projects’ to be analysed and compared with the single-company dataset.

3.3. Descriptive statistics

Table 2 reports the descriptive statistics of the variable denoting the development effort, expressed in terms of person-hours (namely, EFH), for both datasets. It can be observed that the development effort refers to the total of all the time that team members have spent on requirements analysis, design, implementation, and testing of the Web application. Regarding the kind of applications, there are of course many differences between the two datasets.

5 URL: http://www.isbsg.org/
6 In order to see the list of gathered attributes, it is possible to download the ‘Data Collection Questionnaires’ for the FSMM of interest from the ISBSG website [7]. A new initiative with COSMIC data was just launched in January 2009. Full details @ http://www.isbsg.org/isbsgnew.nsf/WebPages/5B209139C303A799CA25754B0075E601?open
7 No filter was applied for the different COSMIC-FFP versions.


The product functional size of the applications is quite different, with the average cross-company sizes in a 7:1 ratio against those from the single-company dataset, in terms of person/hours.

Table 2: Descriptive statistics.

Single-company Dataset
VAR    OBS  MIN   MAX   MEAN  MEDIAN    STD.DEV
EFH    15   1176  3712  1389  2677.867  827.115
CFP    15   264   1022  2792  602.4     249.113
Prod.  15   1.09  2.36  1.77  1.75      0.313

Cross-company Dataset
VAR    OBS  MIN  MAX    MEAN     MEDIAN  STD.DEV
EFH    16   147  46687  10276.6  5603    13719.535
CFP    16   79   1670   503.6    338.5   418.586
Prod.  16   0.1  40.9   3.5      1.3     10.000

In order to analyse the dataset samples properly, some premises on productivity are needed. First, any functional size unit (fsu) is a product (and not a project) size, and it measures only the Functional User Requirements (FUR), not the whole set of possible requirements (thus excluding the non-functional ones), while the reported effort is related to the ‘project’ entity, including all those organisational and support processes along the software lifecycle that do not directly contribute to the creation of the product functional size. Therefore the common formula applied for software projects should be regarded as a ‘nominal’ productivity to be analysed in deeper detail, since it is fundamental to understand at least the balance between the percentage of effort derived from FURs and the remaining one (which can simply be referred to as ‘non-functional’-related effort8). Second, 1st generation FSM methods such as IFPUG FPA measure only the application layer, while COSMIC allows declaring and counting multiple software layers (further information on this issue has been proposed in the new simplified ISBSG data questionnaire on projects sized with COSMIC). Third, well-known and visible outliers must be removed from the dataset in order to maintain the consistency of the obtained results.

4. Research Method

4.1. Modelling techniques

Ordinary Least Squares Regression (OLSR) is a statistical technique that explores the relationship between a dependent variable and one or more independent variables [22], providing a prediction model described by an equation:

(1)  y = b1·x1 + b2·x2 + ... + bn·xn + c

where y is the dependent variable (the effort), x1, x2, ..., xn are the independent variables (the cost drivers) with coefficients bi, and c is the intercept. In our empirical study we have exploited the OLSR analysis to obtain linear regression models that use the variable representing the effort as dependent (namely EFH) and the variable denoting the COSMIC size measure (namely CFP) as independent.

Several crucial indicators have been taken into account to evaluate the quality of the resulting models [20]. In particular, to determine the goodness of fit of the regression models we used R2, which measures the percentage of variation in the dependent variable explained by the independent variable. The R2 value is an indicator of the goodness of the prediction model. Other useful indicators are the F value and the corresponding p-value (denoted by Sign F), whose high and low values, respectively, denote a high degree of confidence for the prediction.

8 For a wider discussion on this issue, please refer to [17].

Moreover, we performed a t-test and determined the p-value and the t-value of the coefficient and the intercept for each model in order to evaluate their statistical significance. The p-value provides the probability that the coefficient of the variable is zero, while the t-value can be used to evaluate the importance of the variable for the generated model. A p-value less than 0.05 shows that we can reject the null hypothesis that the variable is not a significant predictor, with a confidence of 95%. As for the t-value, a variable is significant if the corresponding t-value is greater than 1.5.
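As an illustration only (the sketch below is ours, with invented EFH/CFP data points rather than the paper's datasets), such an OLSR fit and its basic quality indicators can be obtained with a few lines of Python:

```python
# Sketch: OLSR of effort (EFH) on COSMIC size (CFP) with the usual
# quality indicators. The data points below are invented.
import numpy as np
from scipy import stats

cfp = np.array([264, 310, 420, 515, 602, 730, 845, 1022], dtype=float)
efh = np.array([1200, 1450, 1900, 2300, 2650, 3000, 3320, 3700], dtype=float)

res = stats.linregress(cfp, efh)          # EFH = slope*CFP + intercept
print(f"EFH = {res.slope:.2f} * CFP + {res.intercept:.1f}")
print(f"R^2 = {res.rvalue**2:.3f}, p-value of slope = {res.pvalue:.2e}")
```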

As for CBR, the idea behind the use of this technique is to predict the effort of a new project by considering similar projects previously developed. In particular, completed projects are characterised in terms of a set of p features and form the case base. The new project is also characterised in terms of the same p attributes and is referred to as the target case. Then, the similarity between the target case and the other cases in the p-dimensional feature space is measured, and the most similar cases (or projects) are used, possibly with adaptations, to obtain a prediction for the target case [30]. To apply the method, we had to select: the relevant project features, the appropriate similarity function, the number of analogies (i.e., the number of similar projects to consider for the estimation), and the analogy adaptation strategy for generating the estimation.
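A minimal sketch of this analogy-based idea follows (a simplified, assumed re-implementation, not the ANGEL tool itself): similarity is the Euclidean distance in the feature space, and the estimate is the mean effort of the k most similar cases. All names and data are hypothetical.

```python
# Illustrative sketch of case-based reasoning for effort estimation.
import numpy as np

def cbr_estimate(case_features, case_efforts, target_features, k=2):
    """Mean-of-k-analogies adaptation over Euclidean similarity."""
    cases = np.asarray(case_features, dtype=float)
    target = np.asarray(target_features, dtype=float)
    distances = np.linalg.norm(cases - target, axis=1)    # Euclidean distance to each case
    nearest = np.argsort(distances)[:k]                   # indices of the k most similar cases
    return float(np.mean(np.asarray(case_efforts, dtype=float)[nearest]))

# Hypothetical case base: one feature (CFP) per project, with actual efforts (EFH).
case_base = [[264], [455], [610], [880], [1022]]
efforts = [1176, 1950, 2650, 3400, 3712]
print(cbr_estimate(case_base, efforts, [500], k=2))
```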

4.2. Evaluation Criteria

To evaluate the accuracy of the derived effort estimations, we used some summary measures, namely MMRE and Pred(0.25)9 [6]. In the following we briefly recall the main concepts underlying the MMRE and Pred(0.25). The Magnitude of Relative Error [6] is defined as:

(2) MRE = |EFHreal - EFHpred| / EFHreal

where EFHreal and EFHpred are the actual and the predicted efforts, respectively. MRE has to be calculated for each observation in the dataset. To obtain a central tendency measure of the error, all the MRE values are aggregated across the observations using the mean or the median, giving rise to the Mean MRE (MMRE) and the Median MRE (MdMRE), where the latter is less sensitive to extreme values [16].

The prediction at level 0.25 [6] is defined as:

(3) Pred(0.25) = k / N

where k is the number of observations whose MRE is less than or equal to 0.25, and N is the total number of observations. In other words, Pred(0.25) quantifies the percentage of predictions whose error is less than 25%. According to Conte et al., a good effort estimation model should have MMRE <= 0.25 and Pred(0.25) >= 0.75, meaning that the mean prediction error should be less than 25% and at least 75% of the predicted values should fall within 25% of their actual values [6].

9 This is a well-known and referenced threshold in Software Engineering. Anyway, in everyday practice the quest is for lower acceptable ranges, down to ±10-15%.
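The summary measures follow directly from the two equations above; the sketch below (an assumed helper, not from the paper) computes them for a set of actual and predicted efforts.

```python
# Illustrative sketch: MRE per observation, then MMRE, MdMRE and Pred(0.25).
import numpy as np

def accuracy_summary(actual, predicted, level=0.25):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mre = np.abs(actual - predicted) / actual           # equation (2), per observation
    return {
        "MMRE": float(np.mean(mre)),
        "MdMRE": float(np.median(mre)),
        "Pred(0.25)": float(np.mean(mre <= level)),     # equation (3): share of MRE <= 0.25
    }

# Hypothetical example values.
print(accuracy_summary([1000, 2000, 3000], [900, 2600, 3100]))
```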


5. Results & Discussions

5.1. Obtaining estimates with OLSR

In order to apply the OLSR analysis we verified the following assumptions for each training set: linearity (i.e., the existence of a linear relationship between the independent variable and the dependent variable); homoscedasticity (i.e., the constant variance of the error terms for all the values of the independent variable); and residual normality (i.e., the normal distribution of the error terms). As for the cross-company dataset, the analysis revealed that the variables EFH and CFP were highly skewed, and the homoscedasticity and residual normality assumptions were not satisfied, as verified by applying the Breusch-Pagan test and the Shapiro test, respectively. Thus, in order to perform the OLSR analysis we transformed the variables by applying the natural log (Ln) to them, which allows obtaining data values closer to each other [22].
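A possible workflow for these checks is sketched below (assumed, using statsmodels and scipy rather than the authors' tooling): fit the model, test homoscedasticity with Breusch-Pagan and residual normality with Shapiro-Wilk, and re-fit on Ln-transformed variables when either test fails.

```python
# Illustrative sketch: OLSR assumption checks with an automatic log transform.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import shapiro

def fit_with_checks(cfp, efh, alpha=0.05):
    X = sm.add_constant(np.asarray(cfp, dtype=float))
    y = np.asarray(efh, dtype=float)
    model = sm.OLS(y, X).fit()
    bp_p = het_breuschpagan(model.resid, model.model.exog)[1]   # p-value of the LM test
    sh_p = shapiro(model.resid)[1]                              # p-value of Shapiro-Wilk
    if bp_p < alpha or sh_p < alpha:
        # Assumptions violated: re-fit on Ln-transformed variables.
        X_log = sm.add_constant(np.log(np.asarray(cfp, dtype=float)))
        model = sm.OLS(np.log(y), X_log).fit()
    return model
```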

Table 3 shows the results of the OLSR analysis, with some useful indicators to verify the quality of the model. We can observe that the OLSR analysis was successfully applied to the single-company dataset. Indeed, the model is characterised by a high R2 value (0.824), a high F value and a low p-value, indicating that the prediction is available with a high degree of confidence. The t-values and p-values for the corresponding coefficient and intercept are greater than 1.5 and less than 0.05, respectively. On the other hand, the performed analysis suggested that it was not possible to apply OLSR to the ISBSG dataset, since the hypotheses underlying the OLSR (and the key indicators of goodness of the obtained model) did not hold, even after applying data transformation. Indeed, the dataset is not homogeneous in either the effort or the size variable, and the productivity varies considerably among the considered projects, as previously introduced in Section 2.3.

Table 3: The results of the OLSR analysis.

                Value     Std. Err   t-value   p-value   R2      Std Err   F        Sign F
Single-company dataset
  Coefficient   3.014     0.386      7.801     0.000     0.824   360.099   60.861   0.000
  Intercept     862.281   250.613    3.441     0.004
Cross-company dataset
  Coefficient   0.126     0.4611     0.273     0.789     0.005   1.621     0.074    0.7889
  Intercept     7.608     2.739      2.778     0.015

To obtain effort estimation models that satisfy the hypotheses underlying the OLSR analysis and the thresholds for the indicators R2, F value and corresponding Sign F value, and the t-statistics of coefficient and intercept, we tried to cluster the cross-company dataset by grouping the observations according to a specific parameter. Taking into account the information available in the ISBSG dataset, we decided to exploit the productivity to classify the observations. In particular, we observed that the productivity of the cross-company dataset ranges from 0.1 to 2.64. We did not consider one project with a productivity value equal to 40.9, since it can be considered an outlier with respect to those data. We decided to consider two ranges to group the observations in the cross-company dataset: [0, 1.21[ and [1.21, 3], where 1.21 represents the median value of productivity. The descriptive statistics of these two datasets are reported in Table 4. Then, by exploiting OLSR analysis, we built two estimation models by considering the training set (Set1) formed by the observations characterised by a productivity in the range [0, 1.21[ and the training set (Set2) formed by the observations characterised by a productivity in the range [1.21, 3] (see Table 5).
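The clustering step can be expressed compactly; the sketch below (assumed, using the pandas library with hypothetical records and column names) drops the outlier and splits the remaining observations around the median productivity.

```python
# Illustrative sketch: median-based split of a cross-company dataset by productivity.
import pandas as pd

# Hypothetical records: size, effort and a precomputed productivity value per project.
df = pd.DataFrame({
    "CFP":  [94, 294, 678, 79, 708, 1670],
    "EFH":  [6188, 11165, 46787, 408, 2590, 10206],
    "Prod": [0.3, 1.0, 1.2, 1.9, 2.2, 2.6],
})
df = df[df["Prod"] <= 3]                      # discard visible outliers (e.g. the 40.9 project)
median_prod = df["Prod"].median()
set1 = df[df["Prod"] < median_prod]           # training set for the range [0, median[
set2 = df[df["Prod"] >= median_prod]          # training set for the range [median, 3]
print(len(set1), len(set2), median_prod)
```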


Table 4: Descriptive statistics of cross-company Set1 and Set2.

VAR   OBS   MIN    MAX     MEAN      MEDIAN   STD.DEV
Cross-company Set1
EFH   7     6188   46787   19711.4   11165    16515.5
CFP   7     94     678     286.3     294      202.9
Cross-company Set2
EFH   8     408    10206   3287.8    2590.5   3169.9
CFP   8     79     1670    662.8     708.5    505.2

Concerning the regression model obtained with cross-company Set2, it is characterised by a high R2 value (0.920), a high F value, and a low p-value (<0.001), indicating that the prediction is indeed possible with a high degree of confidence. The coefficient (value 1.001) and the intercept (value 1.521) both present t-values greater than 1.5. Furthermore, the coefficient is significant at the 0.05 level, while the intercept is significant at the 0.09 level (very close to 0.05), as shown by the t-test.

Table 5: The results of the OLSR analysis applied on cross-company Set1 and Set2.

                Value    Std. Err   t-value   p-value   R2      Std Err   F       Sign F
Cross-company Set1
  Coefficient   0.837    0.366      2.286     0.071     0.511   0.644     5.225   0.071
  Intercept     5.030    2.007      2.507     0.054
Cross-company Set2
  Coefficient   1.001    0.121      8.288     0.000     0.920   0.313     68.69   0.000
  Intercept     1.521    0.752      2.023     0.089

Moreover, we assessed the effectiveness of the model obtained with Set2 by verifying whether it could be exploited to estimate the effort of the applications in the single-company dataset, which are characterised by productivity values in the range [1, 3]. The results in terms of the summary measures MMRE, MdMRE, and Pred(0.25) are reported and discussed in Section 5.3.

5.2. Obtaining estimates with CBR

In our case study, to apply the CBR technique we exploited the ANGEL tool (see Figure 1) [30]. To calculate the estimation, ANGEL allows the user to choose the relevant predictors, the similarity function, the number of analogies, and the analogy adaptation technique. The predictor used in this study is the variable CFP, while the Euclidean distance is used as similarity measure, since this is the measure adopted by ANGEL. In addition, all the project attributes considered by the similarity function had equal influence upon the selection of the most similar project(s). Selecting the number of analogies is a key task, since it refers to the number of similar cases to use for estimating the effort required by the target case. Since we dealt with a small dataset, we used 1, 2, and 3 analogies, as suggested by many researchers (see e.g. [5][21][30]). To generate the estimation from the selected similar projects, we employed the mean of the k analogies (simple average) [30].


Figure 1: ANGEL tool: a snapshot.

We applied ANGEL by considering as case base the 16 observations of the initial cross-company dataset reported in Table 2, in order to obtain the effort estimations for the 15 observations in the single-company dataset. Furthermore, to be consistent with the analysis performed with the OLSR, we also applied ANGEL to the 7 observations of the cross-company Set2 reported in Table 4. The accuracy of these estimations has been evaluated in terms of MMRE, MdMRE, and Pred(0.25), which are reported and discussed in Section 5.3.

5.3. Comparing the accuracy of the obtained estimates

Table 6 contains the results we obtained in terms of the summary measures MMRE, MdMRE, and Pred(0.25). We have reported the results obtained by applying OLSR and CBR on the cross-company Set2, which contains observations characterised by a productivity similar to that of the observations in the single-company dataset, as described in the previous section. Table 6 also contains the results achieved by applying OLSR and CBR on the initial cross-company dataset (i.e., without exploiting information on productivity to classify the observations).

We can observe that the best results have been obtained by using OLSR with the cross-company Set2. Indeed, these results satisfy the acceptability thresholds defined in [6], since the MMRE and MdMRE values are less than 0.25 and the Pred(0.25) value is greater than 0.75. Observe that the CBR technique on the cross-company Set2 does not provide good results. As expected, the predictions obtained with OLSR and CBR by exploiting the whole cross-company dataset are not accurate enough to estimate actual development effort.

Thus, the use of productivity seems to be useful for classifying the observations of a cross-company dataset with the aim of deriving accurate estimates for the observations contained in a single-company dataset, when the estimation technique employed is OLSR.


Table 6: Descriptive accuracy evaluation.

VALIDATION                                                                                        MMRE    MdMRE   Pred(0.25)
Estimating the effort of single-company observations using the cross-company Set2 with OLSR      0.12    0.05    0.80
Estimating the effort of single-company observations using the cross-company Set1 with OLSR      10.98   11.04   0.00
Estimating the effort of single-company observations using the cross-company Set2 with CBR       0.59    0.68    0.07
Estimating the effort of single-company observations using the cross-company dataset with OLSR   0.84    0.61    0.00
Estimating the effort of single-company observations using the cross-company dataset with CBR    1.84    0.77    0.00

6. Conclusions and Prospects

We have presented the results of an empirical study carried out using data from a single company dataset and from the public benchmarking repository ISBSG r10 [15]. COSMIC was used as size measure, while two widely used techniques, OLSR (Ordinary Least Square Regression) and CBR (Case-Based Reasoning) were used as estimation techniques.

The aim of the case study was to highlight the differences between the two datasets, to discuss the issues related to the use of a cross-company dataset by a software organisation, and to derive suggestions about the way public data could be exploited in order to get more accurate estimates of development effort. The analysis has revealed the difficulties of using a cross-company dataset to obtain effort prediction models using OLSR and CBR. Furthermore, we have shown that some premises on productivity are needed to properly analyse dataset samples.

Some notes and thoughts after running this study, possibly useful for a next experiment on a larger dataset in the near future, follow:

• Unfortunately, the number of web projects sized with an FSM method is currently not high enough to allow a further analysis of statistically significant sub-datasets. A decomposition of the data sample by size ranges is also needed in order not to compare 'apples with oranges'.

• Productivity in software projects, as currently calculated, compares non-homogeneous quantities: the numerator is a product-level entity covering solely the FURs, while the denominator refers to the overall project effort, including the effort derived from the Non-Functional Requirements. This alone can lead to under-estimation.

• A cross-company database therefore inherits some of these elements/issues within its data, but repositories such as ISBSG have been, are and will be fundamental for software organisations in order to achieve higher and higher awareness of their own data gathering process, shaping it according to well-known and widely used characteristics and attributes (from the organisational down to the product level). A further initiative started in recent years is PROMISE, born as a yearly event and evolved also into a series of datasets gathered from technical papers, provided also by external sources and shared across the Software Engineering community.

• An increasingly interesting element to be considered for benchmarks is the quality of the data to be analysed. The new ISO/IEC 25012:2008 standard [17] can represent a new landmark for evaluating this fundamental issue, as ISO/IEC 9126-x already did for product quality.


• Aside from analyses on specific datasets, from a process-oriented viewpoint this issue can be seen as a continuous improvement mechanism, where the increasing maturity in collecting data within single organisations can improve the cross-organisational dataset level when those organisations also feed the ISBSG repository or other cross-organisational datasets. This, in turn, will influence the software organisations using the benchmarking data, possibly stimulating a new data gathering process that could, hopefully, contribute back to the overall community.

The more experience an organisation gathers from its own process, the lower the project costs it has to pay. The crucial issue remains to plan (and not simply execute) the data gathering process from the start, and to share data within the organisation to spread awareness first and knowledge afterwards.

7. References
[1] Aamodt A. and Plaza E., "Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches", Artificial Intelligence Communications, Vol. 7, No. 1, 1994, pp. 39-52.
[2] Abrahão S. and Pastor O., "Measuring the functional size of web applications", International Journal of Web Engineering and Technology, 1(1), 2003, pp. 5-16.
[3] Abrahão S., Pastor O. and Poels G., "Evaluating a functional size measurement method for web applications: An empirical analysis", in Proceedings of the International Software Metrics Symposium, IEEE Press, 2004, pp. 358-369.
[4] Baresi L. and Morasca S., "Three empirical studies on estimating the design effort of web applications", Transactions on Software Engineering and Methodology, 16(4), 2007.
[5] Briand L., El-Emam K., Surmann D., Wieczorek I. and Maxwell K., "An Assessment and Comparison of Common Software Cost Estimation Modeling Techniques", in Proceedings of the International Conference on Software Engineering (ICSE'99), Los Angeles, USA, 1999.
[6] Buglione L., "Some thoughts on Productivity in ICT projects", WP-2008-02, White Paper, version 1.2, July 2008, URL: www.geocities.com/lbu_measure/fpa/fsm-prod-120e.pdf
[7] Conte D., Dunsmore H.E. and Shen V.Y., "Software engineering metrics and models", Benjamin/Cummings Publishing Company, Inc., 1986, ISBN 0805321624.
[8] COSMIC, "COSMIC v.3.0, Measurement Manual", Common Software Measurement International Consortium, 2007, URL: www.cosmicon.com; www.gelog.etsmtl.ca/cosmic-ffp/
[9] COSMIC, "COSMIC Full Function Points v.2.2, Measurement Manual", Common Software Measurement International Consortium, 2004, URL: www.cosmicon.com; www.gelog.etsmtl.ca/cosmic-ffp/
[10] Costagliola G., Di Martino S., Ferrucci F., Gravino C., Tortora G. and Vitiello G., "A COSMIC-FFP Approach to Predict Web Application Development Effort", Journal of Web Engineering, 5(2), 2006, pp. 93-120.
[11] Costagliola G., Di Martino S., Ferrucci F., Gravino C., Tortora G. and Vitiello G., "Effort estimation modeling techniques: A case study for web applications", in Proceedings of the International Conference on Web Engineering, ACM Press, 2006, pp. 161-165.
[12] Di Martino S., Ferrucci F., Gravino C. and Mendes E., "Comparing size measures for predicting web application development effort: A case study", in Proceedings of Empirical Software Engineering and Measurement, IEEE Press, 2007, pp. 324-333.
[13] Ferrucci F., Gravino C. and Di Martino S., "A Case Study Using Web Objects and COSMIC for Effort Estimation of Web Applications", in Proceedings of the 34th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2008), Parma, Italy, 3-5 September 2008, ISBN 978-0-7695-3276-9, pp. 441-448.
[14] IFPUG, "Function Point Counting Practices Manual (release 4.2)", International Function Point Users Group, Westerville, Ohio, January 2004, URL: www.ifpug.org
[15] ISBSG, "ISBSG Data Repository r10", January 2007, URL: www.isbsg.org
[16] ISBSG, "Data Collection Questionnaire: New development, Redevelopment or Enhancement Sized Using COSMIC-FFP Function Points", version 5.10, International Software Benchmarking Standards Group, 26/09/2007.


[17] ISO/IEC IS 25012:2008, "Software Engineering – Software Product Quality Requirements and Evaluation (SQuaRE) – Data Quality Model", International Organisation for Standardization, 03/12/2008.
[18] Kitchenham B.A., Pickard L.M., MacDonell S.G. and Shepperd M.J., "What accuracy statistics really measure", IEE Proceedings – Software, 148(3), 2001, pp. 81-85.
[19] Kitchenham B., Mendes E. and Travassos G.H., "Cross versus within-company cost estimation studies: A systematic review", IEEE Transactions on Software Engineering, 33(5), 2007, pp. 316-329.
[20] Maxwell K., "Applied Statistics for Software Managers", Software Quality Institute Series, Prentice Hall, 2002, ISBN 0130417890.
[21] Mendes E. and Kitchenham B., "Further Comparison of Cross-company and Within-company Effort Estimation Models for Web Applications", in Proceedings of the International Software Metrics Symposium (METRICS'04), 2004, pp. 348-357.
[22] Mendes E., Counsell S., Mosley N., Triggs C. and Watson I., "A Comparative Study of Cost Estimation Models for Web Hypermedia Applications", Empirical Software Engineering, 8(2), 2003, pp. 163-196.
[23] Mendes E., Di Martino S., Ferrucci F. and Gravino C., "Cross-company vs. Single-company Web effort models using the Tukutuku Database: an Extended Study", Journal of Systems and Software, Special Issue on Software Process and Product Measurement, 81(5), May 2008, pp. 673-690, DOI: 10.1016/j.jss.2007.07.044.
[24] Mendes E., Counsell S., Mosley N., Triggs C. and Watson I., "A Comparative Study of Cost Estimation Models for Web Hypermedia Applications", Empirical Software Engineering, 8(2), 2003, pp. 163-196.
[25] Mendes E. and Kitchenham B., "Further Comparison of Cross-company and Within-company Effort Estimation Models for Web Applications", in Proceedings of the International Software Metrics Symposium (METRICS'04), 2004, pp. 348-357.
[26] Montgomery D., Peck E. and Vining G., "Introduction to Linear Regression Analysis", 4th edition, John Wiley & Sons, Inc., 2006, ISBN 978-0-471-75495-4.
[27] Reifer D., "Web-Development: Estimating Quick-Time-to-Market Software", IEEE Software, 17(8), 2000, pp. 57-64.
[28] Rollo T., "Sizing E-Commerce", in Proceedings of ACOSM 2000 - Australian Conference on Software Measurement, 2000, URL: http://www.gifpa.co.uk/library/Papers/Rollo/sizing_e-com/v2a.pdf
[29] Ruhe M., Jeffery R. and Wieczorek I., "Cost estimation for Web applications", in Proceedings of the International Conference on Software Engineering (ICSE'03), 2003, pp. 285-294.
[30] Shepperd M. and Schofield C., "Estimating Software Project Effort Using Analogies", IEEE Transactions on Software Engineering, 23(11), 2000, pp. 736-743.


From performance measurement to project estimating using COSMIC functional sizing

Cigdem Gencel, Charles Symons

Abstract

This paper introduces the role and importance of the measurement of software sizes when measuring the performance of software activities and when estimating effort, etc., for new software projects. We then describe three standard software sizing methods (the Albrecht/IFPUG, MkII FPA and COSMIC methods) that are claimed to be functional size measurement methods and examine the true nature of their size scales. We find that sizes produced by the Albrecht/IFPUG and MkII FPA methods are actually on a scale of 'standard (development) effort'. Only the COSMIC method gives a true measure of functional size and is thus best suited for performance comparisons across projects using different technologies.

We conclude that, in principle, different types of scales, one measuring functional size and the other standard effort, may be desirable for the different purposes of performance measurement and estimating, respectively. It then follows that for estimating purposes it might be attractive to define a COSMIC-enabled estimating process in which weights are assigned to the components of a COSMIC functional size, dependent on the technology to be used by the project being estimated and maybe on other factors. The resulting COSMIC 'standard effort' scale for the given technology, etc., might be expected to yield more accurate project effort estimates than using the standard COSMIC functional size scale. Initial promising results for such an approach are also presented.

1. Introduction

The ability to measure the size of a piece of software in a meaningful and reliable way is the foundation-stone for the quantitative management of software activities. The two most important and common uses of software sizes are (a) as a measure of the work-output when determining the performance of software development and enhancement activities and (b) as the starting point for estimating the effort for a software activity. These two primary uses may be illustrated by the following equations.

As regards use (a), one of the most important ways of measuring the performance of a software project is to determine the project productivity, defined as follows:

(1) Project productivity = Software size / Project effort

where 'software size' represents the work-output of the project. Re-arranging and re-phrasing equation (1), we get the simplest equation (2) for estimating the effort to develop a new piece of software, i.e. for use (b):

(2) Estimated project effort = Estimated software size / Assumed project productivity


The size for the new piece of software must be estimated in some way and the ‘assumed project productivity’ is obtained either by using equation (1) to measure productivity on comparable internal projects, or by using external benchmark productivity data.

This simplest equation (2) for estimating effort can then be refined by taking into account a variety of cost factors to give a more general top-down estimating equation (3):

(3) Estimated project effort = {Estimated software size / Assumed project productivity} x {Adjustment for cost factors}

The cost factors can be project, product or development organisation related. Among the cost factors investigated in a number of studies [1][2][3][4][5][6][7], team size, programming language type, organisation type, business area type, application type and development platform have been found to affect the relationship of software product size to project effort at different levels of significance.
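The shape of equations (2) and (3) can be made concrete with a small sketch (assumed code, not taken from any of the cited estimating methods); the productivity value and the cost-factor multipliers are hypothetical.

```python
# Illustrative sketch of top-down effort estimation, equations (2) and (3).
def estimate_effort(size_cfp, assumed_productivity_cfp_per_hour, cost_factor_multipliers=()):
    """Return estimated effort in person-hours.

    size_cfp: estimated functional size of the new software.
    assumed_productivity_cfp_per_hour: productivity from equation (1), measured on
        comparable internal projects or taken from external benchmark data.
    cost_factor_multipliers: optional adjustments (e.g. team size, platform), each
        expressed as a multiplicative factor around 1.0.
    """
    effort = size_cfp / assumed_productivity_cfp_per_hour   # equation (2)
    for factor in cost_factor_multipliers:                   # equation (3) refinement
        effort *= factor
    return effort

# Hypothetical usage: 500 CFP, 0.2 CFP per person-hour, two cost-factor adjustments.
print(estimate_effort(500, 0.2, (1.10, 0.95)))
```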

All so-called ‘top-down’ estimating methods such as COCOMO II [8], Putnam’s Model/SLIM [9], SoftCost [10], Price-S [11] are derived from this basic equation (3), with varying degrees of sophistication. Whatever the approach, all require as primary input an estimate of the size of the software to be developed.

It would seem obvious that for the above three equations to be used successfully, the same software sizing method must be used for measurements of performance on completed projects or for the external benchmark data, as will be used for estimating the size of a new project.

The selection of a credible size measurement scale is therefore extremely important. An error in estimating the size of the software to be built usually translates directly into an error in the estimate of project effort. All the other cost factors are, of course, important and data on these should be collected to fine-tune the estimation process, but research has shown that size is usually by far the biggest driver of effort.

Measures of software size can be broadly divided into two categories, namely ‘functional sizes’ and ‘physical sizes’.

Functional sizes are supposed to be a measure of the ‘Functional User Requirements’ (or FUR) of the software. FUR are defined broadly as ‘a sub-set of the user requirements that describe what the software shall do, in terms of tasks and services; they exclude technical and quality requirements’ [12]. The key point is that functional sizes should be completely independent of all non-functional requirements such as the technology used for the software, project and quality constraints, etc.

Physical sizes measure some characteristic of the actual piece of software such as a count of the number of ‘source lines of code’ (SLOC) used for the program, a count of the number of modules or object-classes, etc. Clearly, physical sizes depend on the technology used to develop the software.

In principle, functional sizes have two major advantages over physical sizes for our two common uses. First, the technology-independence of functional sizes means that they are ideal for performance measurement, for example when the task is to compare productivity across projects developed using different technologies. Second, estimating a functional size (from the requirements) of a new piece of software can normally be carried out much earlier in the life-cycle of a new project than when a physical size can be realistically estimated to the same accuracy. Functional sizes should therefore be more useful for project estimating purposes, especially early in a new project life-cycle.


In this paper, we elaborate on how a single total functional software size figure is obtained using the attribute-definition models of three widely-used Functional Size Measurement (FSM) methods, namely the Albrecht/IFPUG, MkII FPA and COSMIC methods. Then, we discuss whether these methods really measure functional size or something else, and the consequences of our conclusions. Finally, we suggest possible ways in which measurements of COSMIC functional sizes could be used for more accurate effort estimation.

2. What do FSM methods actually measure?

Perhaps resulting from the fact that there is no useful definition of software ‘functionality’, there are many ‘Functional Size Measurement’ (FSM) methods, all assuming different models of software functionality. Five methods have been published as international standards [13][14][15][16][17] under an ISO policy of ‘let the market decide’ which method to use. We will describe our three chosen methods in the sequence in which they were developed and only in sufficient detail to illustrate our argument.

FSM methods define two phases for the measurement of software [18]. In the first phase, the FUR of the software are extracted from the available artifacts and expressed in a form suitable for measuring a functional size. The FUR represent a set of 'Transactions' and 'Data Types' [19]. A Transaction takes Data Type(s) as input, processes them, and produces outputs as a result of the processing. FSM methods use Transactions and Data Types as their Base Functional Components (BFC's)10. Each BFC is then categorised into BFC Types, and the attributes of the BFC Types relevant for obtaining the base counts are identified. In the second phase, the functional size of each BFC is calculated by applying a measurement function to the BFC Types and the related attributes. Then the results are aggregated to compute the overall size of the software system.

In the next sub-sections, we discuss the attribute definition models of the Albrecht/IFPUG, MkII FPA and COSMIC methods. FSM methods all define some form of compound measure [20][21] derived by aggregating BFC Types. We therefore discuss what these FSM methods really measure.

2.1. The Albrecht/IFPUG methods

The first proposal that one could measure a functional size of a piece of software from its requirements came in a brilliant piece of original thinking from Allan Albrecht of IBM. He named his method 'Function Point Analysis' [22]. The definition of this FSM method has been refined over the last 30 years and is now maintained by the International Function Point Users Group (IFPUG), but the underlying principles of the method have not changed.

The first full account [23] of the model that is still the basis of the IFPUG method proposed that the FUR of a piece of software can be analysed into three types of transactions, called 'elementary processes' (External Inputs (EI), External Outputs (EO) and External Inquiries (EQ)), and two types of 'data files' (Internal Logical Files (ILF) and External Interface Files (EIF)). Each of these BFC's is classified as simple, average or complex depending on various criteria. To size a piece of software, its BFC's must be identified, counted and assigned sizes which we call 'weights'. The sum of the weighted counts of the BFC's gives the size of the software in units of 'Unadjusted Function Points' (UFP). The weights of an elementary process can range from 3 to 7 UFP and a file from 7 to 15 UFP.

10 BFC: an elementary unit of FUR defined by and used by an FSM Method for measurement purposes. BFC's are categorised into different types, called BFC Types [12].


The IFPUG method also still has a further factor, the ‘Value Adjustment Factor’ (VAF), which attempts to measure the contribution to ‘size’ of 14 technical and quality requirements. The product of the size in UFP and the VAF gives the size of the software in Function Points (FP).
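For illustration only, the following sketch shows the IFPUG-style aggregation just described; the per-BFC weights are assumed average values within the ranges quoted above, whereas a real count uses the complexity matrices of the IFPUG manual [14], and the VAF is treated here simply as a given input.

```python
# Illustrative sketch of the IFPUG-style weighted aggregation (simplified).
bfc_counts = {"EI": 10, "EO": 6, "EQ": 4, "ILF": 5, "EIF": 2}   # hypothetical BFC counts
weights = {"EI": 4, "EO": 5, "EQ": 4, "ILF": 10, "EIF": 7}      # assumed average weights

ufp = sum(bfc_counts[t] * weights[t] for t in bfc_counts)       # Unadjusted Function Points
vaf = 1.05                                                      # Value Adjustment Factor, given
fp = ufp * vaf                                                  # adjusted Function Points
print(ufp, fp)
```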

The important question for this paper is: what basis did Albrecht use to determine the weights of the various BFC types and of the components of the VAF? Some form of weighting is clearly necessary since the total size measure depends on counting disparate types of BFC. Further, the weights of the BFC types depend in turn on various measures (number of DETs, number of FTRs, number of RETs – see [24]), so you cannot simply add the counts of BFC's together across all types.

Albrecht's classic paper [22] focuses on productivity measurement. He describes his size as a 'relative measure of function value delivered to the user that (is) independent of the particular technology or approach used'. He mentions that 'the basis for this method was developed ... from the DP Services estimating experience', but is vague on how the weights were actually derived ('they have given us good results'). However, it is clear from a later paper [24] that the weights were derived from correlations between measures of his 'size and complexity factors' and development effort, for projects developed in the period 1974-83. This paper refers to several internal IBM estimating guidelines.

We conclude that the IFPUG method is not, strictly speaking, a true FSM method. Abran and Robillard also reached a similar conclusion in [25]. The weights were originally derived from measurements on 22 IBM projects, mostly (16) developed using COBOL, all from the domain of business application software. (Note: the VAF is clearly technology-dependent, and has been omitted in the ISO standard for the IFPUG method.) The IFPUG method should therefore be correctly termed an 'estimating method', since its size scale was designed to predict effort – see further below.

2.2. The Mark II Function Point Analysis method

The ‘MkII FPA’ sizing method was developed by Charles Symons who was also working in the domain of business application software [26]. He had used the Albrecht method but found that its size measurements did not adequately cope with very complex transactions that had to navigate through complex file structures. He therefore proposed the MkII FPA method which aims to account for the use of files by transactions, rather than, in part, counting them for their existence within the software boundary as in the IFPUG method [24].

For the MkII FPA method, Symons proposed that the FUR of a piece of software can be analysed into transactions, called 'logical transactions' (roughly equivalent to Albrecht's elementary processes), which each consist of three components: input, processing and output. To measure the size of the input and output phases of a logical transaction, the number of data element types (DET's) on the input and on the output are counted, respectively. For the size of the processing phase, the number of logical entity-types referenced in the processing (ET's) is counted. Note that an important difference from the IFPUG method is that there is no upper limit to the size of an individual logical transaction, thus making it, it is claimed, a better measurement scale for transactions that can vary from extremely simple to extremely complex.


Since this method defines two attributes relevant to measuring the sizes of BFC's (DET's and ET's), again they must be weighted before they can be added to obtain the size of a logical transaction. Symons determined the weights to be proportional to the relative average effort to develop one DET and one ET.

The 'relative average effort' was determined using data from 64 projects from 10 different development groups, with the intention that the weights should be truly independent of any particular technology or development approach, etc. In spite of this intention, the MkII FPA method is also not a true FSM method but is, strictly speaking, an 'estimating method'. (The MkII FPA method also originally included a 'Technical Complexity Adjustment', the equivalent of the IFPUG VAF, but this was discarded as unhelpful.)

2.3. The COSMIC FSM method

In 1998, a group of software metrics experts formed COSMIC, the Common Software Measurement International Consortium. Their aim was to develop a true FSM method, i.e. one that conforms fully to the principles of the ISO standard for FSM [12], and which would be applicable to size software from the domains of business application, real-time and infrastructure software. In view of this cross-domain goal, the COSMIC method was defined using concepts and terminology that are neutral across all the target domains.

The COSMIC method [27] proposed that the FUR of a piece of software can be analysed into transactions, called 'functional processes' (the equivalent of MkII FPA logical transactions). Each functional process can be analysed into a set of 'data movements', of which there are four types. Entries and Exits move data into and out of the software respectively. Writes and Reads store and retrieve data to and from persistent storage respectively. Each data movement is presumed to account for the associated data manipulation. (No standard FSM method can properly account for data manipulation, so all are unsuitable for sizing mathematically-rich software.)

The COSMIC method BFC type, a data movement, is measured by convention as one COSMIC Function Point (CFP). The smallest possible functional process has size 2 CFP; there is no upper limit to its size. Since the method has only one BFC type, its four sub-types (Entries, Exits, Writes and Reads) are considered to account for the same amount of functionality. And since each has the same unit of size, these can be added without the need for weights. The COSMIC method is a true, maybe the only true, FSM method.

3. Functional Sizes or Estimating Method Sizes?

Although neither the IFPUG nor the MkII FPA methods are true FSM methods, this does not mean they are useless for either performance measurement or for estimating.

As we have seen, both methods in fact measure a form of 'standard effort' on an arbitrary size scale. We do not know how Albrecht established his size scale in absolute terms. The origin seems to be in a weighted questionnaire designed to produce a 'size and complexity factor' for an estimating method; the weights were adjusted so that this factor correlated with effort [24] and the factor was then taken as the size measure. Symons simply adjusted his MkII FPA scale so that it produces sizes similar to Albrecht's in the range 200-400 FP. Both then confused matters by claiming their scales measured a functional size and naming their size scales 'function points'. With hindsight this was a mistake.


In spite of all this, it is perfectly valid, subject to one proviso, to monitor productivity across different projects by comparing actual effort (e.g. in units of person-hours) against a standard effort (on an arbitrary size scale). And, subject to the same proviso, it is valid to use sizes measured on a scale of standard effort as input to a method to estimate actual effort. The proviso is that, for the productivity comparison, all the projects should really be carried out under the same common conditions (e.g. the same software type, technology, etc.) and, for estimating, that the same conditions apply to the project being estimated as to the projects whose performance was used for calibrating the estimating method. That is why it has been perfectly valid, in principle, to use these two methods for the last 20-30 years for performance measurement and estimating. The concept of 'standard effort' for a given standard task came from work-study practices pioneered by Frederick Taylor over a century ago [31].

Still, it does place a question mark over whether these measures, whose weights were calibrated on a limited number of projects, over 20 years ago using technology of that vintage, are still best suited for both purposes in today's circumstances.

For example, would one expect a size measurement method calibrated 30 years ago on a set of largely COBOL projects to be ideal for comparing productivity across projects developed recently using widely different technologies? Would one expect an estimated effort determined by such a method to be the best input for estimating the effort for a software development project that will use modern technology? In fact, some current studies such as [28] discuss how to calibrate IFPUG FP weights to reflect recent software industry trends.

The COSMIC FSM method, being totally independent of technology and measuring a pure functional size, is ideal for our first use (a) of comparing performance of projects using different technologies, etc. (assuming it is agreed to be a good measure of software functionality in all other respects). But this does not necessarily make it ideal for our second use (b), to measure a size for input to a new project effort estimate where, perhaps, the size should be related to the technology to be used.

We conclude that, ideally, two different measurement scales may be required for our two uses. For use (a) of comparing performance of projects using different technologies, a technology-independent true FSM size scale is needed. For use (b) of project estimating, a standard effort scale that is related to the technology to be used should, in theory, give more accurate results.

It is notable that several well-known estimating methods will accept FP sizes as their primary input, but first convert these to SLOC sizes, which are clearly technology-dependent. These estimating methods have been calibrated using SLOC sizes. But it is also well known that FP/SLOC conversion ratios are fraught with uncertainty [29][30]. Can we do better by having two size scales, one a true standard FSM size scale for performance measurements and the other a locally-calibrated size scale for estimating, where both scales use the same BFC's?

4. Using COSMIC size measurement data for project estimating

Referring to equation 3 above, the factor in the first braces {Estimated software size divided by Assumed project productivity} is actually a measure of ‘standard effort’ to develop the project according to the sizing method used and given the assumed (standard) productivity; its unit of measurement would typically be ‘standard person-hours’.


Now the size of a functional process according to the COSMIC method is computed from the following equation:

(4) Functional size (CFP) = NE + NX + NR + NW

where the terms refer to the number N of Entries (E), Exits (X), Reads (R) and Writes (W) in the functional process, respectively. The functional size of a piece of software is determined by summing the sizes of all its functional processes.

However, for estimating purposes, there is no reason to assume that the (standard) effort to analyse, design, develop and test one Entry is the same as that for any of the three other data movement types. And these standard efforts might vary relatively depending on any particular set of 'common conditions' ('C'). Such conditions might include the technology to be used for the software, the 'functional profile' of the software itself (different types of software may have quite different proportions of E's, X's, R's and W's) and maybe other factors. But if, for each condition 'C', we could determine the standard productivity 'SCE', in units of standard person-hours to develop the amount of functionality of one Entry, and similarly 'SCX', 'SCR' and 'SCW' for the three other data movement types, then we can compute the standard effort (in standard person-hours) to develop the amount of functionality (in CFP) of each functional process directly from the following equation:

(5) Standard effort = SCE x NE + SCX x NX + SCR x NR + SCW x NW

Again, the standard effort required to develop the whole software under the conditions C is obtained by summing the standard effort values for all its functional processes. Our estimating equation (3) can then become:

(6) Estimated project effort = {Total standard effort for the whole software (in standard person-hours)} x {Adjustment for cost factors}

Provided the standard effort (in standard person-hours) per CFP for each data movement type is calibrated locally, for the local productivity and for the conditions to be used for the new development, this equation (6) is expected to give more accurate effort estimates than equation (3).
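A compact sketch of equations (4) to (6) follows (assumed code, not the authors' tooling); the data movement counts, the locally calibrated weights and the cost-factor adjustment are all hypothetical.

```python
# Illustrative sketch: COSMIC size (eq. 4), standard effort per process (eq. 5),
# and the resulting whole-software estimate (eq. 6).
def cosmic_size(n_e, n_x, n_r, n_w):
    """Equation (4): functional size of a functional process in CFP."""
    return n_e + n_x + n_r + n_w

def standard_effort(n_e, n_x, n_r, n_w, s_ce, s_cx, s_cr, s_cw):
    """Equation (5): standard person-hours for one functional process under conditions C."""
    return s_ce * n_e + s_cx * n_x + s_cr * n_r + s_cw * n_w

# Hypothetical functional processes: (Entries, Exits, Reads, Writes).
processes = [(1, 1, 2, 1), (1, 3, 2, 0), (2, 1, 3, 2)]
# Hypothetical locally calibrated weights (standard person-hours per data movement).
weights = dict(s_ce=5.2, s_cx=4.3, s_cr=3.0, s_cw=2.9)

total_cfp = sum(cosmic_size(*p) for p in processes)
total_standard_effort = sum(standard_effort(*p, **weights) for p in processes)
estimated_effort = total_standard_effort * 1.10   # equation (6), hypothetical cost-factor adjustment
print(total_cfp, total_standard_effort, estimated_effort)
```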

In fact, in our previous studies [32][33][34], we investigated whether effort estimation models based on COSMIC BFC types, rather than on a single total functional size value, would improve estimation reliability, by making statistical analyses on the ISBSG dataset [36]. The results of these preliminary studies showed that the improvement in accuracy from using equation (6) rather than (3) is likely to be significant.

In [35], the concept of a software 'functional profile' was defined as the relative distribution of its four BFC Types for any particular project. This study investigated whether or not the size-effort relationship was stronger if a project had a functional profile close to the average for the sample studied. It was observed that identifying the functional profile of a project and comparing it with the average profile of the sample from which it was taken can help in selecting the estimation models best suited to its own functional profile. The findings of these studies therefore lend weight to the idea that it is important to take into account a piece of software's functional profile to obtain the best estimating accuracy.

In the next paragraphs, we shall give an example approach of how such local calibrations might be done to estimate effort from COSMIC functional size in a software organisation.


We will also explore how sensitive effort estimation might be, in extreme cases, to using as input counts of COSMIC BFC’s weighted by ‘standard effort weights’ as opposed to using pure COSMIC functional sizes.

5. Calibrating COSMIC Standard Effort weights

In this paper, we explore two calibration methods for COSMIC. These are discussed in the following sections.

5.1. By statistical analysis

First and most obviously, given a set of project data where the counts of BFC Types (data movements of each type, i.e. Entry, Exit, Read and Write) and the actual development effort are available for each project, a statistical correlation of the effort figures versus the corresponding four BFC counts should produce the standard effort weights directly.

Such an analysis depends on having data from a large-enough set of projects that were developed under the same or very similar 'common conditions' (e.g. the same technology, the same application type, etc.), as described above. The resulting weights would then be valid for those 'common conditions'. Unfortunately, although plenty of data are available to us, at the moment there are not enough projects with any one set of common conditions for us to obtain meaningful standard effort weights by statistical analysis. The maximum number of data points we could obtain after forming a subset of similar projects from the project data in the ISBSG dataset [36] is 11, which is too low to draw any meaningful conclusion. Therefore, until we have statistically significant amounts of data from which to derive the weights, we need another approach.
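For completeness, the sketch below shows what such a calibration could look like once enough homogeneous data were available (assumed code with hypothetical project records): actual effort is regressed on the four BFC counts, and the fitted coefficients are the standard effort weights.

```python
# Illustrative sketch: deriving standard effort weights from project data by regression.
import numpy as np
import statsmodels.api as sm

# Columns: number of Entries, Exits, Reads, Writes per project (hypothetical).
dm_counts = np.array([
    [40, 55, 70, 30],
    [25, 50, 45, 10],
    [60, 70, 95, 55],
    [15, 30, 20, 10],
    [50, 45, 85, 40],
], dtype=float)
efforts = np.array([3200, 2100, 4800, 1300, 4100], dtype=float)  # hypothetical person-hours

# Regression through the origin: effort = w_E*N_E + w_X*N_X + w_R*N_R + w_W*N_W.
model = sm.OLS(efforts, dm_counts).fit()
print("standard effort weights (E, X, R, W):", model.params)
```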

5.2. By modelling the functional process profile for a particular domain

A second approach is to construct a model, by expert judgment or by using real project data, of the functional processes of the software from a particular domain of interest, and to use the result of the model to calculate the standard effort weights directly. First we describe the general process to construct the model, and then we illustrate the process as far as we are able at this stage.

A model may be constructed in two Steps, as follows. For each Step, there are two possible approaches, namely either using expert judgment or using real project data.

Step 1 (using expert judgment)
a) Define a set of functional process types that are typical of the domain concerned, and define the relative frequency of occurrence of those functional process types11.
b) For each functional process type, define the average number of data movements by type (E, X, R and W), i.e. the BFC's.
c) Sum the number of data movements over all functional process types for the software, weighted by the frequency of occurrence of the functional processes in which they occur.
d) Scale the numbers of data movements by type from the previous step to produce a relative distribution of data movements by type for an average functional process over the domain.

11 Functional process type: Different types of functional processes might occur in a system depending on the pattern of information processing logic of the software domain. For this study we use the well-known pattern of functional processes (Create, Read, Update, Delete types) of the business application software domain.


Step 1 (using real project data)
e) For a real set of software representative of the domain, measure all the sizes using the COSMIC method.
f) Sum the number of data movements by type over all the software.
g) Scale the numbers of data movements by type from the previous step to produce a relative distribution of data movements by type for an average functional process for the software.

Step 2 (using expert judgment to obtain relative standard effort per BFC)
a) Determine, from experience or intelligent guesswork, the relative effort per data movement type (this will almost certainly vary according to the functional process type as determined in Step 1).
b) Weight the output of Step 1d) above, i.e. the relative distribution of data movements by type, by the 'relative effort per data movement type' from the previous step 2a). The result is the standard effort by data movement type for the functional processes of the model.
c) Scale the standard effort numbers from the previous step to obtain a relative standard effort by data movement type for the domain.

Step 2 (using real project data to obtain calibrated, i.e. actual, standard effort weights per data movement type)
d) Determine, by collecting real project effort data, the distribution of effort over the four data movement types. (This may be difficult to obtain by direct effort measurement on projects; it will probably be necessary to use a 'Delphi' approach, i.e. to ask project team members for their assessment of the proportions of total project effort needed for each data movement type.)
e) Using the total effort for each project, the distribution of effort over the data movement types from the previous step 2d) and the distribution of numbers of data movement types from Step 1g), determine the standard effort per data movement type.

5.3. An Illustrative Model for determining standard effort weights for COSMIC BFC's

We next constructed a model using both expert judgment and real project data, where available, to illustrate the application of the above process to software in the domain of business applications. We carried out Step 1 by both approaches but, lacking real project data, we were only able to calculate relative (as opposed to absolute) standard effort weights at this stage in Step 2.

Step 1 (using expert judgment)
We suppose a large portfolio of business applications which has been developed under common conditions. From the 'CRUD' rule (create, read, update, delete) for the life-cycles of primary entity types, we would expect the following pattern of functional processes and data movements. (We ignore delete functional process types in this first attempt at a model.) There will also be some functional processes to maintain non-primary entity types.

Numbers of functional processes to maintain primary entities
• There will be more enquiry ('read') functional processes than 'inputs' (creates and updates), because for all data that is entered it must be possible to enquire about it and, in addition, there are all the enquiries on derived and aggregated data (data that is not stored).


• There will be more updates than creates because many primary entities have several stages to their life-cycle.

Numbers of data movements of these functional processes
• For every 'input' functional process there will be one or more Entries. For every Entry in the functional process there will be:
  o the same number of Writes,
  o some Reads (usually more than the number of Entries) for validation of entered data,
  o an Exit for error messages.
• For every 'output' functional process there will be:
  o one Entry, one or more Reads and one or more Exits, probably more Exits than Reads, especially if an error message is included.

Functional processes to maintain secondary entities
• A few simple functional processes of simple data movements.

Now we can build (guess) a model of an 'average functional portfolio' to determine the relative number of functional processes (FP) and data movements (DM) (see Table 1).

Step 1 (using real project data)
In order to show an example of the approach using real project data, we obtained data from the ISBSG database on 22 business application projects measured using the COSMIC method and added data from 4 more business application projects that we measured ourselves.

We obtained the relative number of data movements based on the average number of E, X, R and W, and normalised the counts with respect to the number of E. In Table 2, we provide the results from both approaches.

Table 1: Average functional portfolio model.

Type of FP                             Relative #FP's*  #E's   #W's   #R's   #X's
FP's to maintain primary entities
  Create                               1                1.5    1.5    2      1
  Update                               1.5              1.5    1.5    2      1
  Read                                 2                1      0      2      3
FP's to maintain secondary entities
  Create                               0.1              1      1      0      1
  Read                                 0.1              1      0      1      1
  Delete                               0.1              1      1      0      1
#DM's weighted by relative #FP's                        6.05   3.95   9.1    8.8
Relative #DM's (normalised)                             1.00   0.65   1.50   1.45
* (# = number of)


Table 2: The relative number of data movement types by the two approaches.

Data Source                         #E's   #W's   #R's   #X's
Expert judgment model               1.00   0.65   1.50   1.45
26 business application projects    1.00   0.53   1.22   1.37

Considering that the 'expert judgment model' is an unrefined first attempt and that the 26 projects are a far-from-homogeneous set, the agreement between these two sets of figures is encouraging. The reader is welcome to substitute his own expert judgment model.

Step 2 (using expert judgment)
We assume the following effort per data movement (see Table 3):
• The effort for a Read is the same, on average, as that for a Write.
• Every Entry of an Input functional process requires more effort than every Exit of an Output functional process, since the former involves validation as well as formatting.

We assume the following for the relative effort needed to develop each DM type.

Maintain primary entities
• 'V Hi' for E's in Create and Update FP's
• 'Hi' for X's in Read FP's, except for all error messages, which are 'Lo'
• 'Lo' for E's in Read FP's
• 'Med' for all other DM types

Maintain secondary entities
• 'Lo' for all DM's

We judged the effort proportions of V Hi, to Hi, to Med, to Lo to be 1.33 / 1.00 / 0.5 / 0.1, respectively. We initially classified the relative efforts as 'V Hi', 'Hi', 'Med' and 'Lo' for ease of understanding, but this is an ordinal scale. We therefore devised a ratio scale based on these relative proportions of effort.

Applying these proportions to the above table gives Table 3, where 'Eff.' = Effort.

Table 3: The relative standard effort weights for the DMs.

Type of FP                                      Relative #FP's  Eff. E's  Eff. W's  Eff. R's  Eff. X's
FP's to maintain primary entities
  Create                                        1               2.00      0.75      1.00      0.1
  Update                                        1.5             2.00      0.75      1.00      0.1
  Read                                          2               0.10      0         1.00      2.1
FP's to maintain secondary entities
  Create                                        0.1             0.1       0.1       0         0.1
  Read                                          0.1             0.1       0         0.1       0.1
  Delete                                        0.1             0.1       0.1       0         0.1
Total standard effort per DM type
weighted by #FP's                                               5.23      1.90      4.51      6.28
Relative #DM's (from table above)                               1.00      0.65      1.50      1.45
Relative av. effort per DM type                                 5.23      2.90      3.00      4.32
Relative av. effort per DM type (normalised)                    1.74      0.97      1.00      1.44


For Step 2, using this expert judgment model for the (rounded) relative standard effort weights for the four types of data movement, we conclude that a single Entry requires 1.7 times as much effort to develop, and that an Exit requires roughly 1.4 times as much effort to develop, as does a single Read or Write.
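As a cross-check of this arithmetic, the sketch below (a hypothetical re-implementation, not part of the original study) aggregates the expert-judgment model of Tables 1 and 3: it weights the per-process data movement counts and relative efforts by the relative FP frequencies and normalises against the Read/Write effort, reproducing (up to rounding) the normalised weights of about 1.74 / 0.97 / 1.00 / 1.44. Treating the Exits of Read processes uniformly as 'Hi' is a simplifying assumption of this sketch.

```python
# Illustrative sketch: Step 1/Step 2 aggregation for relative standard effort weights.
DM_TYPES = ("E", "W", "R", "X")

# (relative #FP's, {DM type: count per FP}, {DM type: relative effort per DM})
model = [
    (1.0, {"E": 1.5, "W": 1.5, "R": 2, "X": 1}, {"E": 1.33, "W": 0.5, "R": 0.5, "X": 0.1}),  # Create (primary)
    (1.5, {"E": 1.5, "W": 1.5, "R": 2, "X": 1}, {"E": 1.33, "W": 0.5, "R": 0.5, "X": 0.1}),  # Update (primary)
    (2.0, {"E": 1, "W": 0, "R": 2, "X": 3}, {"E": 0.1, "W": 0.5, "R": 0.5, "X": 1.0}),       # Read (primary), Exits all 'Hi'
    (0.1, {"E": 1, "W": 1, "R": 0, "X": 1}, {"E": 0.1, "W": 0.1, "R": 0.1, "X": 0.1}),       # Create (secondary)
    (0.1, {"E": 1, "W": 0, "R": 1, "X": 1}, {"E": 0.1, "W": 0.1, "R": 0.1, "X": 0.1}),       # Read (secondary)
    (0.1, {"E": 1, "W": 1, "R": 0, "X": 1}, {"E": 0.1, "W": 0.1, "R": 0.1, "X": 0.1}),       # Delete (secondary)
]

dm_totals = {t: sum(freq * counts[t] for freq, counts, _ in model) for t in DM_TYPES}
eff_totals = {t: sum(freq * counts[t] * effort[t] for freq, counts, effort in model) for t in DM_TYPES}
avg_effort = {t: eff_totals[t] / dm_totals[t] for t in DM_TYPES}
normalised = {t: avg_effort[t] / avg_effort["R"] for t in DM_TYPES}   # Read/Write effort as reference
print(dm_totals, normalised)
```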

We then tested how sensitive an estimation method would be to using equation (3) above (i.e. using a pure COSMIC functional size as input) versus using equation (6) (i.e. using a standard effort figure as input). We did this by comparing the effort estimates for two pieces of software, one dominated by 'Create' and 'Update' functional processes (see Table 4) and one dominated by 'Read' functional processes (see Table 5). This is a fairly extreme assumption.

Table 4: Case 1 - A software project dominated by create and update transactions to maintain primary entities (ignoring table maintenance).

Parameter                                                           E's    W's    R's    X's    Total
# of C or U DM's on primary entities                                1.5    1.5    2      1
Functional size (FS) per DM type                                    1      1      1      1
Relative av. effort per DM type using FS as the measure of effort   1.5    1.5    2      1      6
Relative av. effort per DM type from above                          2.00   0.75   1.0    0.10   3.85

Table 5: Case 2 - A software project dominated by read transactions on primary entities.

Parameter | E's | W's | R's | X's | Total
# R DM's on primary entities | 1 | 0 | 2 | 3 |
Functional size (FS) per DM type | 1 | 1 | 1 | 1 |
Relative av. effort per DM type using FS as the measure of effort | 1.5 | 1.5 | 2 | 1 | 6
Relative av. effort per DM type from above | 0.10 | 0.00 | 1.00 | 2.10 | 3.20

The relative error in estimated effort from using a functional size versus using a standard

effort measure for the software systems in these two tables is therefore 3.85 / 3.2 = 1.20, i.e. the relative error could be 20 per cent.
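As a minimal illustration, the comparison in Tables 4 and 5 can be reproduced with the following Python sketch (the per-DM figures are copied from the tables; the structure of the code is our own):

# Compare an estimate based on pure functional size (1 CFP per data movement)
# with one based on the standard effort figures, for the two extreme cases.
dm_counts = {
    'Case 1 (create/update)': {'E': 1.5, 'W': 1.5, 'R': 2, 'X': 1},   # Table 4
    'Case 2 (read)':          {'E': 1,   'W': 0,   'R': 2, 'X': 3},   # Table 5
}
std_effort = {
    'Case 1 (create/update)': {'E': 2.00, 'W': 0.75, 'R': 1.00, 'X': 0.10},
    'Case 2 (read)':          {'E': 0.10, 'W': 0.00, 'R': 1.00, 'X': 2.10},
}

size_based   = {c: sum(v.values()) for c, v in dm_counts.items()}    # both cases give 6
effort_based = {c: sum(v.values()) for c, v in std_effort.items()}   # 3.85 versus 3.20

# The functional sizes are identical, the standard effort figures are not;
# their ratio is the ~20% relative error quoted above.
print(round(effort_based['Case 1 (create/update)'] / effort_based['Case 2 (read)'], 2))   # -> 1.2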

We must emphasise that these results are very preliminary, for a particular software domain. We would not expect to obtain the same results for e.g. telecommunications or process control software, which might have quite different distributions of functional process types and data movement types than we have assumed in our illustrative model.

6. Conclusions

We have explored three main methods of sizing software from their requirements and have concluded that two methods (Albrecht/IFPUG and MkII FPA) actually produce sizes on a scale of relative standard effort. Whilst these size scales can be used for performance measurement and for estimating, they are not ideal for either use given they were calibrated using data from relatively small sets of projects from one particular software domain, some 20 – 30 years ago.


The COSMIC method, on the other hand, measures a pure functional size and is thus ideal for carrying out performance measurement comparisons of projects developed using different technology, etc. Functional sizes measured using the COSMIC method can also be used as input to estimating methods. However, our first observation above that it seems ‘obvious’ that the same size measurement scale should be used for both performance measurement and for estimating purposes turns out to be sub-optimal. We now conclude that the use of pure COSMIC functional sizes as input to an estimating method could be improved by first weighting the BFC counts-per-BFC-type for the piece of software being estimated by a different ‘standard effort’ per BFC type.

The standard effort figures would be expected to be quite different for different software domains and even different when using different technologies within those domains. Standard effort figures should therefore be calibrated separately (and ideally locally) for each set of ‘common conditions’, i.e. for each domain and technology combination that is important to the measurer.

This first exploratory study indicates that using weighted rather than un-weighted COSMIC BFC’s could increase the relative accuracy of estimated effort (other factors being equal) across different types of software within the same domain and using the same technology by a range of up to 20% in the extreme, which is very significant. The size of the improvement in estimating accuracy would be expected to vary relatively for estimates for software from different domains and/or when using different technologies within the same domain. We are therefore encouraged to continue this line of research.

7. References
[1] Angelis, L., Stamelos, I., Morisio, M., “Building a Cost Estimation Model Based on Categorical Data”,

7th IEEE Int. Software Metrics Symposium (METRICS 2001), London, April 2001. [2] Forselius, P., “Benchmarking Software-Development Productivity”. IEEE Software, Vol. 17, No. 1,

Jan./ Feb. 2000, pp.80-88. [3] Lokan, C., Wright, T., Hill, P.R., Stringer, M., “Organisational Benchmarking Using the ISBSG Data

Repository”. IEEE Software, Vol. 18, No. 5, Sept./Oct. 2001, pp.26-32. [4] Maxwell, K.D., “Collecting Data for Comparability: Benchmarking Software Development

Productivity”, IEEE Software, Vol. 18, No. 5, Sept./Oct. 2001,pp. 22-25. [5] Morasca, S., Russo, G., “An Empirical Study of Software Productivity”, In Proc. of the 25th Intern.

Computer Software and Applications Conf. on Invigorating Software Development 2001, pp. 317-322. [6] Premraj, R., Shepperd, M.J., Kitchenham, B., Forselius, P., “An Empirical Analysis of Software

Productivity over Time”, 11th IEEE International Symposium on Software Metrics (Metrics 2005), IEEE Computer Society, 2005.

[7] Card, D.N., “The Challenge of Productivity Measurement”, Proc. Of the Pacific Northwest Software Quality Conference, 2006.

[8] Boehm, B.W., Horowitz, E., Madachy, R., Reifer, D., Bradford K.C., Steece, B., Brown, A.W., Chulani, S., Abts, C., “Software Cost Estimation with COCOMO II”, Prentice Hall, New Jersey, 2000.

[9] Putnam, L.H., “A general empirical solution to the macro software sizing and estimating problem”, IEEE Transactions on Software Engineering, July 1978, pp. 345-361.

[10] Tausworthe, R., “Deep Space Network Software Cost Estimation Model”, Jet Propulsion Laboratory Publication 81-7, 1981.

[11] Park, R. E., “PRICE S: The calculation within and why”, Proceedings of ISPA Tenth Annual Conference, Brighton, England, July 1988.

[12] ISO/IEC 14143-1:2007, “Software and Systems Engineering - Software measurement - Functional size measurement - Definition of concepts”, The International Organisation for Standardization, 2007.

[13] ISO/IEC 19761:2003, Software Engineering – COSMIC-FFP: A Functional Size Measurement Method, International Organisation for Standardization, 2003.

[14] ISO/IEC 20926:2003, Software Engineering-IFPUG 4.1 Unadjusted Functional Size Measurement Method - Counting Practices Manual, International Organisation for Standardization, 2003.


[15] ISO/IEC 20968:2002, Software Engineering - Mk II Function Point Analysis Counting Practices Manual, International Organisation for Standardization, 2002.

[16] ISO/IEC 24570:2005, Software Engineering - NESMA functional size measurement method version 2.1 – Definitions and counting guidelines for the application of Function Point Analysis, International Organisation for Standardization, 2005.

[17] ISO/IEC 29881:2008, Software Engineering - FiSMA functional size measurement method version 1.1, International Organisation for Standardization, 2008.

[18] Demirors, O., and Gencel, C. “Conceptual Association of Functional Size Measurement Methods”, scheduled for publication in IEEE Software, 2009.

[19] ISO/IEC TR 14143-5: Information Technology- Software Measurement - Functional Size Measurement - Part 5: Determination of Functional Domains for Use with Functional Size Measurement, 2004.

[20] Habra, N., Abran, A., Lopez, M., Sellami, A., “A framework for the design and verification of software measurement methods”, Journal of Systems and Software, Vol.81 , Issue 5, May 2008, pp. 633-648.

[21] Kitchenham, B., Pfleeger, S.L., Fenton, N., “Towards a Framework for Software Measurement Validation”, IEEE Transactions on Software Engineering, Vol.21, No.12, Dec. 1995, pp.929-944.

[22] Albrecht, A.J., “Measuring Application Development Productivity”, IBM Application Development Symposium, Monterey, California, October 1979.

[23] Albrecht, A.J. and Gaffney, J.E., “Software Function, Source Lines of Code, and Development Effort Prediction: a Software Science Validation”, IEEE Transactions on Software Engineering, Vol SE-9, No. 6, November 1983.

[24] Albrecht, A.J., “Where Function Points (and Weights) Came From”, (hand-written slides from a presentation dated 19th February 1986).

[25] Abran, A., and Robillard, P.N., “Function Points: A Study of Their Measurement Processes and Scale Transformations”, Journal of Systems and Software, Vol. 25, 1994, pp.171-184.

[26] Symons, C.R., “Function Point Analysis: Difficulties and Improvements”, IEEE Transactions on Software Engineering, Vol 14, No. 1, January 1988, pp. 2-11.

[27] COSMIC, “The COSMIC Functional Size Measurement Method, version 3.0, Measurement Manual”, obtainable from www.gelog.etsmtl.ca/cosmic-ffp, 2007.

[28] Xia, W., Capretz, L.F, Ho, D., Ahmed, F., “A new calibration for Function Point complexity weights”, Information and Software Technology, Vol. 50, Issue 7-8, June 2008, pp. 670-683.

[29] Santillo, L., “Error Propagation in Software Measurement and Estimation”, in IWSM/Metrikon 2006 conference proceedings, Potsdam, Berlin, Germany, 2-3 November 2006.

[30] Dekkers, C. and Gunter, I., “Using Backfiring to Accurately Size Software: More Wishful Thinking Than Science?”, IT Metrics Strategies, Vol. VI, No.11, 2000, 1-8.

[31] Taylor, F.W., “Principles of Scientific Management”, Harper & Bros., 1911.
[32] Gencel, C., and Buglione, L., “Do Different Functionality Types Affect the Relationship between Software Functional Size and Effort?”, MENSURA 2007, LNCS 4895, Springer-Verlag Berlin Heidelberg 2008, pp. 72–85.

[33] Buglione, L., and Gencel, C., “Impact of Base Functional Component Types on Software Functional Size based Effort Estimation”, 9th Intern. Conf. on Product Focused Software Process Improvement (Profes 2008), 23-25 June, Rome, Italy, LNCS 5089, Springer-Verlag Berlin Heidelberg 2008, pp. 75–89.

[34] Gencel, C., “How to Use COSMIC Functional Size in Effort Estimation Models?” Selected papers of IWSM / MetriKon / Mensura 2008, LNCS 5338, 2008, pp. 205–216.

[35] Abran, A., Panteliuc, A., “Estimation Models Based on Functional Profiles”, III Taller Internacional de Calidad en Technologias de Information et de Communications, Cuba, February 15-16, 2007.

[36] ISBSG Dataset 10, http://www.isbsg.org, 2007.


A ‘middle-out’ approach to Balanced Scorecard (BSC) design and implementation for Service Management: Case Study in Offshore IT-Service Delivery

Srinivasa-Desikan Raghavan, Monika Sethi, Dayal Sunder Singh, Subhash Jogia

Abstract
In this paper, we describe a BSC approach to Service Management for a portfolio of projects at Tata Consultancy Services Ltd. (TCS), India, with one of its valuable customers, a leading Global Financial Services Company. During the growth of their business relationship, there was a need to manage a critical portfolio of projects in ‘Straight-Through-Processing’ (STP) services, with special reference to customer feedback and KPI management. We chose the BSC approach to manage and control this flagship program, for its ease of design and for the clarity of communication it offers amongst stakeholders. The design characteristics of the scorecard and the experience of its implementation are highlighted here.

1. Introduction and background
The Balanced Scorecard (BSC, henceforth) has been in practice for Corporate Performance Management and for strategy deployment purposes since the early 1990s [1, 2, 3]. Since then, numerous cases of its usage, both successful and failed, abound in the corporate case history. Beyond the example of a corporate scorecard being cascaded to individual teams' level, there are cases where the BSC has also been used for a ‘Project focused IT Organisation’ [4]. From the design view point, there are many organisations that specialise both in BSC tools (there are many tools in the market, besides in-house developed ones) and in BSC practices and training [“Performance Management & 3rd Generation Balanced Scorecard”; ‘www.2gc.co.uk’] [5].

In this paper, we describe our attempt to design and implement BSC, which can be a

‘middle-out’ approach compared to the traditional top down way of arriving at scorecards.

1.1. STP highlights
Tata Consultancy Services Ltd. (TCS, henceforth) is India's largest IT Services firm, a US$ 5.7 billion global software and services company, and is part of the well known Tata Sons group; it has many Fortune-10 and Fortune-100 organisations in its customer base.

In this paper, we describe our experience in implementing the BSC for the purpose of moving up the value chain in the Vendor – Customer Relationship (Relationship, henceforth) for a specific customer.

The customer is one of the largest multinational corporations (Customer, henceforth) in the Financial Services industry. The following figure summarises the Relationship status, as per the last quarter (January – March 2009) data.


Figure 1: Global Financial Services Firm - TCS Relationship Overview.

During the year 2008, a new portfolio of projects in ‘Straight-Through-Processing’ systems (STP, henceforth) was launched with the following objectives:
• To Establish a Decision Framework and Roadmap to Optimise the STP Operating Model.
• To Profile the STP portfolio, grouping work into the appropriate delivery categories, viz.:
o Work best delivered in the prevalent out-tasked model.
o Work best delivered in a proposed managed services model.
o Work appropriately delivered in the current model.
• To identify opportunities for improvement (OFIs).
• To develop specific plans of action to achieve near-term and longer term improvements.

2. Balanced Scorecard Implementation – The Challenge

Toward the objective of ‘establishing a new decision framework’, the BSC approach was selected because multiple measurement-based project monitoring practices already existed, in-house tools were in use at individual projects' level, and the idea of using the BSC for monitoring and appraisal purposes (from the Human Resource management division) was already in practice amongst project teams. These facts obviated the need for creating awareness of the BSC concept. The challenge was more in designing scorecards for the STP projects and implementing them successfully across the STP projects' horizon.


2.1. Prevailing Governance mechanism

Given the complexity of the projects’ affiliation to various Customer sponsorship units, spanning across the globe with multiple sets of IT Services on different platforms, the prevailing governance mechanism was focused more on individual sponsorship units and the monitoring method driven by Service Level Agreements (SLAs, henceforth) across individual contracts. The sources of data for this purpose were derived from multiple systems, both at TCS and from Customer systems. A generic governance structure, as shown below, was in practice for monitoring and controlling the specific sets of projects.

Table 1: Governance mechanism.

Frequency: Bi-weekly
Participants: Joint Steering Committee, Program Management Team, Working Groups
Agenda (STP wide governance):
• Review overall program progress and set directions
• Review overall program Key Performance Indicators (KPIs)
• Review exit criteria
• Jointly assess checkpoints for switch over between different project modes

Frequency: Bi-weekly
Participants: TCS Steering Committee, Core Team (TCS), Working Groups
Agenda:
• Review overall program progress and set directions
• Review overall program Risks / Issues

Frequency: Weekly
Participants: Working groups, Core Team (TCS)
Agenda (Application groups governance):
• Progress review at application group level
• Discuss challenges, risks & issues, exit criteria
• Review and detail Knowledge Management (KM) processes and Service Delivery processes for application groups

Frequency: Weekly
Participants: Program Management Team
Agenda:
• Overall progress
• Plan updates, Issues, Risks
• Issues that need escalation

2.2. BSC Implementation Plan

A new program to launch BSC was sanctioned for the STP portfolio and a core program management team was announced, with members from TCS and from Customer teams. A new reporting structure was drawn up and apart from the governance mechanism present at that point, it was decided to create a data system exclusively for BSC reports.

For deploying the BSC, five projects were chosen as ‘pilots’ that have their individual

SLAs closely aligned to STP objectives. Milestones were identified for achieving the STP objectives with BSC implementation being a primary one across these five pilot projects.


Figure 2: STP Program Roadmap.

3. BSC design – the ‘Middle out’

There are cases in the literature [4] where the BSC was used as a pure Project Management element, complementing the traditional project management and control mechanism; there, however, the BSC design was attempted top down. Goold et al. [6] describe three types of ‘Parenting Styles’, viz. strategic planning, strategic control and financial control, for the roles and responsibilities between corporate and organisational units. These styles also influence the role the corporate would adopt in the design and usage of the BSC across corporate and business units [7]. We have adopted a method that parallels the ‘strategic control’ model in our situation, wherein the corporate (the Relationship, in our case) influences the design of the scorecard, but it is the business units (the STP – BSC Program and the constituent projects) that influence its usage.

As mentioned before, when the program was sanctioned, there were projects with their

own measures to monitor, but they existed as disparate systems. After having discussions with the program steering committee, the stake holders and the project teams, a first cut BSC was drawn for the program, much akin to a Corporate BSC, but with the focus aimed at customer service levels and KPIs. In fact, we found that the financial measure was more of a derived benefit (Total Cost of Ownership) rather than a starting point!

The program core team worked out multiple iterations, to arrive at individual scorecards

across the pilot projects (re-using many prevailing measures) and connecting them to the Program BSC, to arrive at a consensus that was aligned with the proposed governance requirements. We were able to retain many measures that were used at projects’ level, while choosing the ones that would get ‘aggregated’ at program level scorecard. Thus, Relationship expectations were typically ‘cascaded’ downwards as BSC measures (from Program BSC to Projects’ BSC), while re-used individual project measures were ‘aggregated’ upwards.


3.1. Characteristics of BSC design – the ‘Middle out’
The design process was typically recursive: each time a new project was added to the program portfolio, we found that the participating projects contributed most to the design by carrying forward their own sets of measures. We therefore prefer to call this approach ‘Middle-out’, in contrast to the top down mode of designing scorecards.

The following steps describe the process of this design approach:
• Start-up (or a previous steady state phase): Existing islands of projects in the Relationship portfolio (with independent SLAs, KPIs and measures) focus on their operational efficiency, project management and control, besides monitoring for Risk management.
• Coalescence phase: When a new program is sanctioned, driven by the goals and changes in the objectives of the Relationship, coalescence comes into play. The steps in this phase are:
o Select pilot projects that have similar and comparable SLAs and KPIs.
o Derive ‘tactical themes’ as opposed to Corporate Strategic Themes (for example, “Move the maximum number of projects to ‘Managed Services’ mode (MSM) from ‘Time and Materials’ mode (MTM)”).
o Develop the Strategy Map from the new business goals and the identified program benefits, and derive the new set of KPIs and measures.
o Assign targets with tolerance ranges (Green / Amber / Red) for the finalised measures that would drive the SLAs to fruition.
o Apply the Data Quality Framework (explained in a later section) to individual measures and identify the support projects and initiatives required to achieve it.
o Re-draw (or edit) the program and project plans.
o Analyse new risk profiles and mitigation plans.
o Derive the new governance model and get approval for the same.
• Communication phase: Publish the scorecard to stakeholders and draw up communication and change management plans (town hall meetings, training, kiosks for demonstrations, etc. as required).
• Implementation phase: Go live and monitor (closure / start of a steady state phase).
• Iterate from ‘Coalescence’ when new projects join.


The following figure depicts the design elements.

Figure 3: Strategy Maps for BSC Design Middle-Out.

We can compare the traditional Balanced Scorecard approach (the first generation) with

the middle-out approach in the following way.

Table 2: BSC Top down Vs Middle-Out.

BSC Top Down: Starts at the Organisational top; Corporate Vision driven.
BSC Middle-Out: Starts from SLAs / KPIs; focus on the Customer – Vendor Relationship and Portfolio / Program objectives; benefits-driven SLAs.

BSC Top Down: Long term planned (3-5 years).
BSC Middle-Out: Short term focused (1-2 years).

BSC Top Down: Start from Financial goals (Perspective) and derive other Perspectives; identify Strategic Initiatives (as relevant).
BSC Middle-Out: Start from Customer Expectations on Portfolio Benefits and distribute SLAs across the relevant BSC Perspectives.

BSC Top Down: Usually a top-down approach to BSC design.
BSC Middle-Out: ‘Middle-out’ design; an iterative process of top-down (from Portfolio SLAs) and bottom-up, where the quantum of contribution is more from the Projects' level (operational parameters for arriving at measures and targets).

BSC Top Down: Strategy Maps are enablers for BSC design; they validate the Strategic Themes.
BSC Middle-Out: Strategy Maps drive the design.

BSC Top Down: Changes to dashboard measures are generally minimal at Corporate level BSC.
BSC Middle-Out: Flexible to changes to measures or their targets, both at Projects' level and at the ‘Internal Processes’ Perspective of individual scorecards.


The advantages of this middle-out approach can be summed up as follows:
• The Program Scorecard can evolve from the vendor – customer relationship, while contributing to the respective organisational scorecards, at specific KPIs and at individual Perspectives of the scorecard.
• The scorecard structure (parent – children scorecards) can be extended to more projects, at different ‘coalescence’ phases, as the maturity of the vendor – customer relationship grows.

3.2. Data Quality Framework

During the ‘coalescence’ phase of gathering all the individual scorecards and lists of measures, in order to derive a consistent set of scorecards at program and at individual projects' level, the following parameters are important for ensuring high data integrity, viz.:
• Single source of data (the Customer or TCS) vs. disparate sources.
• Atomic data vs. derived data.
• Manual data entry vs. automatic updating.
• Testing done (a one-time activity) for the BSC data of individual measures, for logic and expected result, with the Customer managers, vs. testing not done.

The 16 combinations of these parameters can then be grouped into 5 ratings (assuming a normal distribution) for what we call the ‘certainty factor’ (CF), so that appropriate actions or initiatives can be taken at the Program level. These are given in the following two tables.

Table 3: Combinations of selected parameters into CF ratings.

Type of Data | Data Entry | Data Testing (Yes / No) | Single source of data | Disparate sources of data
Atomic | Automatic | Yes | High | High
Atomic | Automatic | No | Medium-High | Medium
Atomic | Manual | Yes | Medium | Medium
Atomic | Manual | No | Low | Low
Derived | Automatic | Yes | High | Medium-High
Derived | Automatic | No | Medium-High | Medium
Derived | Manual | Yes | Medium-Low | Low
Derived | Manual | No | Low | Low

(The last two columns give the final CF ratings.)
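Purely as an illustration (not the authors' tooling), the mapping of Table 3 can be expressed as a simple lookup in Python; the key layout and function name below are our own:

# The 16 combinations of Table 3 as a lookup from
# (type of data, data entry, tested?, data source) to the CF rating.
CF_RATING = {
    ('atomic',  'automatic', True,  'single'):    'High',
    ('atomic',  'automatic', True,  'disparate'): 'High',
    ('atomic',  'automatic', False, 'single'):    'Medium-High',
    ('atomic',  'automatic', False, 'disparate'): 'Medium',
    ('atomic',  'manual',    True,  'single'):    'Medium',
    ('atomic',  'manual',    True,  'disparate'): 'Medium',
    ('atomic',  'manual',    False, 'single'):    'Low',
    ('atomic',  'manual',    False, 'disparate'): 'Low',
    ('derived', 'automatic', True,  'single'):    'High',
    ('derived', 'automatic', True,  'disparate'): 'Medium-High',
    ('derived', 'automatic', False, 'single'):    'Medium-High',
    ('derived', 'automatic', False, 'disparate'): 'Medium',
    ('derived', 'manual',    True,  'single'):    'Medium-Low',
    ('derived', 'manual',    True,  'disparate'): 'Low',
    ('derived', 'manual',    False, 'single'):    'Low',
    ('derived', 'manual',    False, 'disparate'): 'Low',
}

def cf_rating(data_type, data_entry, tested, source):
    """Return the CF rating of one measure, as per Table 3."""
    return CF_RATING[(data_type, data_entry, tested, source)]

# Example: a derived measure, entered manually, untested, from disparate sources
print(cf_rating('derived', 'manual', False, 'disparate'))   # -> 'Low'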


Table 4: Recommendations for improvements (with specific reference to ratings and combinations).

CF Rating | Approximate CF percent range | Things to improve
High | >90 | Do nothing
Medium-High | 80-90 | Do one system test; move from disparate to one data source
Medium | 70-79 | Reduce / avoid manual data entry; move from disparate to one data source
Medium-Low | 60-69 | Move from disparate to one data source
Low | <60 | Do one system test; move from disparate to one data source; reduce / avoid manual data entry

While re-planning after the data quality analysis, we identified special projects that would feed the benefits of data quality improvements across the program data sources. These were added to the program plan, budgeted accordingly and monitored through the governance mechanism.

3.3. Performance Index

For the purpose of monitoring performance, as well as for rewards and recognition, the individual measures were given ‘weights’ (though during piloting the weights were all set to a value of 1) and their performance deviation was measured at regular intervals. The individual measures' performance values were then aggregated for specific BSC perspectives, as well as at individual scorecard level. Thus we had various ‘weighted performances of measures’, called Performance Indices (PI), on the scorecards. This idea helped the STP program in a significant way, by allowing PIs to be compared across perspectives, across scorecards and across individual projects.

Given below is a simplified version of the PI formula (exceptions and other indeterminate results were given separate heuristics in the system).

PI = ∑ (MP × W) / ∑ (W)

• Primary parameters (for the design of PI):
o Target (‘T’)
o Actual (‘A’)
o Directionality (‘D’)
o Weight (‘W’)
• Derived parameter (for the design of PI):
o Metric Performance (MP):
MP = A/T (if D is ‘>=’)
MP = T/A (if D is ‘<=’)
• The MP gives the ‘% performance’, viz. the extent to which the target has been achieved, based on the Directionality.
• Directionality captures the goal of either maximisation or minimisation of the ‘metric intent’ (for example, a measure like ‘profit’ will have a Directionality of ‘>=’, while the ‘metric intent’ for ‘cost’ is to be minimised (‘<=’)).
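A minimal sketch of this calculation in Python (the measure names and figures in the example are invented for illustration; the special heuristics for exceptions and indeterminate results mentioned above are not reproduced):

def metric_performance(actual, target, directionality):
    """MP = A/T when the goal is 'the higher the better' ('>='),
    MP = T/A when the goal is 'the lower the better' ('<=')."""
    return actual / target if directionality == '>=' else target / actual

def performance_index(measures):
    """PI = sum(MP * W) / sum(W) over the measures of a perspective or scorecard."""
    weighted = sum(metric_performance(a, t, d) * w for (a, t, d, w) in measures)
    return weighted / sum(w for (_, _, _, w) in measures)

# (actual, target, directionality, weight) -- illustrative values, weights all 1 as in the pilot
measures = [
    (96.0, 95.0, '>=', 1),   # e.g. an SLA compliance measure, in %
    (4.0,  5.0,  '<=', 1),   # e.g. a defect count, where fewer is better
]
print(round(performance_index(measures), 2))   # -> 1.13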


4. STP – BSC Program Implementation

The launch preparation phase lasted about 4 weeks, during which an internal marketing campaign was conducted. The project teams, their managers from TCS and from the Client teams, and the program core team froze the scorecard elements (the measures, the initiatives for achieving the desired data quality, the negotiated targets and their target deviation zones for the traffic-light metaphor of monitoring). A detailed launch plan was drawn up (at weekly and daily level of activities) and these activities were executed.

4.1. Go-Live time lines

As mentioned before, some of the activities executed during the launch preparation phase were freezing the measures and addressing the program risks and issues. We conducted many town hall meetings with individual teams and stakeholders, prepared collaterals for internal marketing purposes and launched self-running demonstration kits.

The following diagram gives the launch plan we had.

Figure 4: STP – BSC Roll out Plan.

In the final mode of governance, we superimposed the new BSC-based STP-Program review on the (then) existing review mechanisms at the project level, retaining the latter wherever required. This helped the program to track important program-specific measures, while facilitating need-driven data drill down at individual projects' level.


The following table depicts an interesting observation:

Table 5: BSC Usage.

Month / Year | # Hits
September 2008 | 1*
October 2008 | 1*
November 2008 | 2*
December 2008 | 41
January 2009 | 17
February 2009 | 28
(* the figures denote user accounts, not number of hits)

The BSC skeleton system was uploaded in September 2008 on the Customer portal, but

was kept with just 1 user account (essentially for the administrator, to manage the design, development and data feed); in November 2008, we had created one more user account for training and demonstration. Actual ‘going live’ happened in December 2008; we found the steady state usage from January 2009 onwards.

Given below is the Program BSC (for the sake of confidentiality, we have masked the actual numbers; data points from April-08 onwards were re-constructed for display).

Sr # | Performance Measure | Unit | KPI | Target | Frequency | Apr '08 | May '08 | Jun '08
Finance | | | | | | | |
1 | TCO Savings (Direct / Indirect) | $ | N | | Half yearly | | |
Customer | | | | | | 0.80 | 0.80 | 0.80
2 | Customer Satisfaction Index (Overall) | % | Y | 90% | Half yearly | 84% | 84% | 84%
3 | CSI - Most important parameters rated low | % | Y | 10% | Half yearly | 11% | 11% | 11%
4 | CSI - Most important Service & Business Goals parameters rated high | % | Y | 80% | Half yearly | 86% | 86% | 86%
5 | Customer Appreciations | # | N | | Monthly | 8 | 9 | 6
6 | Customer Complaints | # | N | | Monthly | 0 | 0 | 3
7 | Quality of Service (from annual survey) | # | N | | Yearly | | |
Process & Delivery | | | | | | 0.54 | 0.54 | 0.66
8 | Post Delivery Defects | # | Y | 5 | Monthly | 2 | 4 | 4
9 | Steering Committee Meeting | % | Y | 100% | Half yearly | | |
10 | Monthly Governance | % | Y | 100% | Monthly | 67% | 100% | 100%
11 | Outages (severity 1 & 2) | # | Y | 3 | Monthly | 3 | 3 | 1
12 | Projects delivered on time | % | Y | 95% | Monthly | | |
13 | Projects delivered on budget | % | Y | 95% | Monthly | | |
14 | SLA compliance to response time | % | Y | 95% | Monthly | 99.7% | 99.6% | 99.4%
15 | SLA compliance to resolution time | % | Y | 95% | Monthly | 97.0% | 96.6% | 96.7%
16 | Alerts resolved w/o error | % | Y | 95% | Monthly | 100% | 100% | 100%
Learning, People & Competency | | | | | | 0.40 | 0.90 | 0.90
17 | Compliance to minimum competency level | % | Y | 100% | Monthly | 80% | 80% | 80%
18 | Unplanned Attrition in critical phases | # | Y | 0 | Monthly | 1 | 0 | 0
19 | Upload activity of assets into KM system | # | N | | Monthly | 0 | 0 | 0
20 | Reference activity of assets in KM system | # | N | | Monthly | 0 | 0 | 0
STP Performance Index | | | | | | 0.58 | 0.58 | 0.72

Figure 5: STP – BSC Program scorecard.

4.2. Change Management related activities

It is evident that, for a program trying to achieve new BSC-based governance where the constituent projects are using a variety of other monitoring methods, the people-related change management initiatives become vital for program success.


TCS has developed a robust framework, called the ‘3A Model’, for managing change in organisations. On the ‘people dimension’, the TCS team helps customer organisations achieve buy-in and support for their various business change initiatives through three states of people transformation, namely,

• Creating Awareness and bringing about common understanding of intended change aligned with individual values, among all involved entities (stakeholders) in the organisation.

• Building Acceptance by creating an environment conducive for changes in the mindset and creating a sense of ownership.

• Accomplishing Adoption of the change initiative (viz. organisation-wide technology implementation, merger of two business entities) through continued visible commitment from Senior Management and competency development.

The following figure explains the generic roadmap of activities that are carried out during

the Change Management engagement. It also depicts sequencing and dependency of various activities.

Figure 6: ‘3A Framework’ for Change Management.

For this STP – BSC program, as part of ‘Awareness’, collaterals were prepared for the internal marketing campaign. These posters and electronic presentations highlighted the characteristics and benefits of the BSC framework for program monitoring and control, and for keeping everyone aligned on a single page.

The town hall meetings for each project team and the support staff were attended by the respective managers from TCS and from the Customer counterpart team. The commitment of the top management thus ensured that the ‘culture’ took solid root during the ‘Acceptance’ phase.


The communications around this initiative to the respective teams used phrases like ‘culture’, ‘vocabulary’ and ‘socialising’, besides broadcasting the core message of the importance and relevance of the BSC initiative. Though the steady state condition had not yet been reached, the ‘Adoption’ phase was very hectic, with mock-up reviews before the program was declared live.

5. Way Forward

We intend to carry the lessons learned from this success story across the Relationship as a value-add to its maturity level. Also, as a ‘continuously learning’ organisation, we have generated knowledge artifacts for quicker deployment of a similar initiative in future.

5.1. Critical Success Factors

Some of the critical success factors for this BSC implementation were as follows:
• Adopting an effective change management approach to implementation, identifying early adopters and champions amongst the project teams, and maintaining regular communication through training and town hall meets (we call it ‘socialising’);
• Involving the stakeholders and the project managers, through iterative discussions on the objectives of the program and on the elements from the SLAs and the KPIs at the Relationship level; this became the leaven for useful scorecards with well defined project management metrics, delivery performance (quality) metrics, customer satisfaction metrics and knowledge management metrics;
• Designing scorecards with measures that are independent at their scorecard level, besides measures whose performance values are aggregated from lower levels; this facilitated quick identification of ‘root causes’ and ‘relationships’ (if any) amongst scorecard elements while trouble shooting;
• Evolving the PI (weighted average) based method of monitoring measures, for BSC perspectives and scorecards; this helped in comparing the projects' performance across the program;
• Making the scorecards visually ‘pleasing’ (we found this to be important after a quick deployment of a proof of concept!) and useful, by having important measures tracked for their trends.

6. References
[1] Robert S. Kaplan and David P. Norton, “Putting the Balanced Scorecard to Work”, Harvard Business Review, September – October 1993, pp. 139.
[2] Robert S. Kaplan and David P. Norton, “The balanced scorecard: measures that drive performance”,

Harvard Business Review, January-February 1992, pp. 71-79 [3] Robert S. Kaplan and David P. Norton, “The Strategy-Focused Organisation: How balanced scorecard

companies thrive in the new business environment”, Harvard Business School Press, Boston, Mass., 2000

[4] Glen B. Alleman, “Using Balanced Scorecard to Build a Project Focused IT Organisation”, in Balanced Scorecard Conference, IQPC proceedings, San Francisco, Oct. 28-30, 2003

[5] “Performance Management & 3rd Generation Balanced Scorecard”; ‘www.2gc.co.uk’ [6] Goold, M., Campbell, A. and Alexander, M., “Corporate Level Strategy: Creating value in the

multibusiness organisation”, Wiley, New York, 1994 [7] Andre de Waal, “Strategic Performance Management: A managerial and behavioural approach”,

Palgrave Macmillan, New York, 2007


FP in RAI: the implementation of software evaluation process

Marina Fiore, Anna Perrone, Monica Persello, Giorgio Poggioli

Abstract
Since 2006 the RAI Information & Communication Technology Department has introduced a new relationship with suppliers for software development, moving from the use of time and material to fixed price contracts, i.e. a fixed total price for a defined product to be provided.

Estimating the software dimension in advance has become the first fundamental step to establish the price, cost and time to be negotiated with Suppliers.

RAI ICT adopted the IFPUG Function Point metric to estimate the functional size of projects, because it is a well established method and an international standard. As an additional metric, RAI ICT also used the Early & Quick Function Point methodology, which is very useful when requirements are not well defined, for short term projects, when contracting with partners who know the context they work in very deeply, and especially now that the FP price is continuously decreasing.

Some data will be presented to show the experience and some useful tips will be suggested to implement a software functional estimate process.

Through a critical analysis of the IFPUG methodology and the direct experience, the paper will review the following questions:

• What are the benefits of implementing a software estimate process?
• Are FP still relevant, and do they meet the requirement of giving a quantitative measurement of developed software?
• Why are they not so widely used?
• Can FP be used in any kind of software application? If yes, how?

1. Introduction This paper is not a theoretical description of a new application of Function Point

methodology nor an example of how to use it on a particular technology. On the contrary this paper presents the experience gained by four people who have been working for two years using Function Point methodology on software development evaluation. Some lessons learned will be presented and some remarks will be made about the very well known problem of Function Point usability.

The paper describes some solutions adopted to solve the above mentioned open point, but its most important aim is to open a discussion with other FP user groups in order to make FP still actual and really useful.

2. RAI Company and ICT Department

RAI Radiotelevisione Italiana, created in 1924, is the Italian Public Service Broadcaster. It operates three terrestrial television channels, five radio channels, and several satellite and digital terrestrial channels.

RAI is governed by a nine-member Administrative Council. Seven of these nine members are elected by a parliamentary committee; the remaining two (one of whom is the President) are nominated by the largest shareholder, that is, the Finance Ministry.


Figure 1: RAI Group.

At present RAI is structured into six areas that work to guarantee the development of editorial plans on the channels and the corporate governance. The editorial areas conceive, develop and realise the programs for transmission and delivery (TV, Radio, satellite, terrestrial digital broadcast, new media, …). The staff area oversees the managerial, economic and operational efficiency of the Company, while the other areas are the reference for realising the multimedia, digital and commercial strategy of the Group and for on-air broadcasting.

The ICT Department is placed in an aggregate of staff departments named ‘Acquisti e Servizi’, which also includes the Acquisti Department and the Servizi Generali Department. The ICT Department mission is: “To provide to the RAI Group the necessary IT infrastructures, optimising the assigned resources”.

The ICT Department provides IT and TELCO products and services to internal customers and to the RAI Group companies. It therefore works in an internal and protected market, distributed all over Italy, with dozens of small branches all over the world.

The role and the responsibilities of the ICT Department are related to two linked dimensions, operational dimension and enabling dimension. The operational dimension takes care of the provided ICT services management. Its goal is operational excellence. The enabling dimension is focused on new value creation, business strategic alignment and performance measurement, thus to enable the enterprise to take full advantage of its information, thereby maximising benefits, capitalising on opportunities and gaining a competitive advantage.

The ICT Department has a Project Management Office (PMO) that overviews the project management process; the role of PMO is across the organisation, checking that project management support is assured for all ICT lines units. Its main responsibilities are:

• Managing the project approval cycle (all projects must be authorised before starting).
• Staffing projects with internal and external resources.
• Controlling project plans (time, cost, resources, risk, …).


To supply products and services to the whole RAI Group, the ICT Department, which is made up of 130 persons distributed over five organisational units, works with external partners through outsourcing contracts that change on the basis of the required services. External partners selected through competition for contract consequently become the sole suppliers for the specific ICT services object of the competition. Contracts usually last for three years and represent the tool that the ICT Department uses to guarantee the quality of the required products and services and to regulate the partnership with suppliers. For medium/small size projects RAI ICT issues purchase orders based on Framework Contracts, while for big projects (more than 200.000 €) the supplier is selected through a specific competition for contract.

Since 2007 all the development and maintenance activities on the software estate have been awarded, through competition for contract, to external suppliers that have a preferred partnership with RAI ICT during the contract validity period.

The RAI software Portfolio is very wide and heterogeneous and ranges from market products (SAP, Siebel, ScheduAll,…) to customised products, which are integrated with each other, on different technological platforms and different programming languages. There are consolidated environments, for example the mainframe context, and more recent environments that use heavily the multimedia technologies.

3. Kind of managed projects

Besides the daily activities of existing systems maintenance, every year RAI ICT plans a specific budget for new project realisation; these projects allow the development of new solutions or the enhancement of existing systems, but also guarantee the evolution of technological platforms to maintain an adequate infrastructure for the required Service Level Agreements.

The first project type is called ‘business project’, the second one ‘infrastructural project’, and on average about a hundred projects are rolled out every year.

This paper will analyse only business projects, because these are the initiatives covered by the competition for software development and maintenance contracts and involved in the software evaluation process that is the object of the next paragraphs.

The business projects are usually of small and medium size, deal with the development of new solutions or the evolution of existing systems, and last on average 3-6 months. These systems are web intranet applications developed in a Microsoft environment (except for some mainframe parts); considering that RAI is a broadcasting Company, there are also many multimedia applications that manage video and audio.

As the ICT Department has a matrix organisation, there is no organisational unit devoted to project management: the Project Manager is functionally responsible for maintenance activity on existing applications and can be assigned to one or more projects at the same time.

Moreover, the project teams are small, composed of one internal representative (the PM of the initiative) and 2-3 external people.

From the context described above, it is logical to deduce that:
• Project requirements are never well detailed.
• Projects have a short life cycle.
• The PM is also the system analyst.

These considerations are important to understand the process of evaluating software

defined in the outsourcing development context.


4. The procurement strategy for ICT software development
Following the company decision to proceed with the outsourcing of software development and maintenance services, during 2006 RAI ICT wrote a request document to launch the competition for contract to select new suppliers.

The whole Application Portfolio is divided into six lots that include systems with similar characteristics with respect to context, business area, technologies, size and logical/functional features.

Since January 2007 we have had 6 different contracts, one for each lot, all regulated by the same rules. Each contract includes both Operations and Projects on a specific application area.

The contracts establish that the project dimension is determined through the Function Point metric according to the rules defined in the IFPUG (International Function Point User Group) Counting Practices Manual (CPM), version 4.2 [1]. Moreover, to guarantee alignment with the ISO/IEC 14143 standard, which requires counting methods to be independent of technology, the Unadjusted Function Point method was chosen, which means that no Value Adjustment Factor (VAF) is applied. The contracts also mention the possible use of the Early & Quick 2.0 methodology.

Since every lot is characterised by homogeneous features, framework contracts have been arranged with an average cost per Function Point that differs from lot to lot.

For projects in particular, it has been established that the initiatives must be realised with fixed price contracts, that is to say that the entire realisation phase is delegated to the external group according to the requirements that RAI ICT provides to the supplier.

When a new Project is approved, the Project Manager prepares a specification document to describe the user requirements. The first step is to evaluate whether the specific project can be dimensioned using the Function Point methodology: in fact, sometimes we don't use it. In these cases the effort estimate is made per professional skill and the project cost is calculated using the relative rates. Some examples are SAP customising and SAP integration with other applications, re-engineering of the application architecture or a new Database design. Otherwise we make our estimate in FP, using SFERA, a tool distributed by DPO.

The RAI ICT Department sends a Request for Proposal to the supplier in charge of that application area, specifying whether the estimate must be in FP or FTE, and whether there are particular time constraints. In some cases it is possible to dimension the project using both FP and FTE, if particular tasks are required.


Figure 2: The process to determine project cost evaluation.

Then the supplier proposal is compared with our estimate, and if we are not satisfied we discuss it with the supplier and ask for a proposal review. One proposal review is generally enough to reach an agreement, in which case we issue a purchase order. It does not happen very often that we fail to come to an agreement, but in that case we can send a Request for Proposal to one or more of the other Suppliers.

The Purchase order fixes the objectives, times and cost of the project. User change requests are managed with a new purchase order.

In most cases the Specification Document used for the Request for Proposal is not a detailed analysis (that task is assigned to the Supplier). In this situation we found the Early & Quick methodology very useful, because we can use different levels of Functional Requirement in the same estimate. If more details are available we can use the elementary functional processes (EI, EO, EQ); otherwise we often use the typical processes, and sometimes the general processes. Considering that our project dimensions are small/medium, we never use macro processes, because the FP estimate range is too wide.

5. The process setting

It was decided to concentrate FP competence in the PMO team. The group is made up of three people and one group leader; they began learning the FP methodology with classroom training and then continued with one year of training on the job with the help of qualified teachers.

At the end of the training period the PMO began measuring the estimated functions to be developed in fixed price contract projects.

This was neither the only nor necessarily the best way to organise the process; another possibility would have been to share FP knowledge with every project manager, but our choice has two advantages: it is more economical and faster to realise, because it is easier to train three people than forty, and an independent third party can estimate projects in a more objective way, without knowing anything about the technology, developer or customer and without any external influence. Moreover, it introduced the positive practice of clearly fixing all user requirements in advance and sharing them with customers.


This could seem obvious, but in the past, working with the time and materials approach, user requirements were much more variable and undefined. Now project managers are forced to write a specification document which must be comprehensible to anyone not involved in the specific issues and which describes user functionality and the related data in a more explicit way.

In the process of project cost estimation it is important to have an idea of the project dimension before issuing a purchase order, in order to establish a fair price. At the beginning the PMO team worked together on measuring projects, and this was very useful to build a unique and homogeneous way of measuring, sharing knowledge, improving counting practice in the same way and organising the process so as to be independent of the individual counter. The adopted approach was to read the available documentation together, interview the project manager to better understand all functional and data aspects, and make FP estimates discussing every choice together. At the beginning documentation was often incomplete, due to a lack of experience of what should be written in order to make an FP estimate, so the project manager often wrote a second version of the documentation after the interview and the estimate was made afterwards. Considering all the activities mentioned above, an average of half a day for three people was initially necessary to complete an estimate.

As time passed we became independent and everyone is now able to make estimates by themselves; we ask each other for help only in very complex situations or to support particular estimates.

The time needed to make an FP estimate also decreased with experience: at the beginning we needed more time because of our lack of experience in counting and in understanding the documentation and the functionality to be developed, but as time passed both the counters became more expert and the project managers learned to write documentation in a more standard manner, suitable for FP counting. As shown in the following graphic, considering medium size projects, we passed from an average time per estimate of half a day for three people, to one hour for three people, and finally to one hour for one person.

Figure 3: Average time per FP estimates (y-axis: hours, 0-12; x-axis: years 2007-2009).


6. Analytical data
Since 2007, 12.100 FP have been estimated, spread over 40 projects. From this we can say that the average size per project is 302 FP.

Total nr of FP estimated | 12.100 FP
Total nr of projects measured in FP | 40 projects
Average size per project | 302 FP

Some projects can have more than one purchase order (for example because they need to work with two different suppliers or because there are requests for change); considering this, we made 45 different contracts in FP, so the average size of every contract was 268 FP.

Total nr of FP measured | 12.100 FP
Total nr of contracts made in FP | 45 contracts
Average size per contract | 268 FP

The total number of software development projects (business projects) carried out by ICT since 2007 is 70, so we can say that 57% of them have been counted in FP.

Total nr of development projects | 70 projects
Total nr of development projects measured in FP | 40 projects
Share of development projects measured in FP | 57%
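As a quick check of the arithmetic (values copied from the tables above), in Python:

total_fp, fp_projects, fp_contracts, all_projects = 12100, 40, 45, 70
print(round(total_fp / fp_projects, 1))           # -> 302.5  (reported as 302 FP per project)
print(round(total_fp / fp_contracts, 1))          # -> 268.9  (reported as 268 FP per contract)
print(round(100 * fp_projects / all_projects))    # -> 57     (share of projects measured in FP, %)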

The average size distribution was the following:

Figure 4: FP project size distribution (x-axis: FP, from 100 to 900; y-axis: number of projects).

As mentioned above, the average size of the estimated projects was over 300 FP, which means that we managed small projects with a duration of about three months; only two projects above the average size were estimated during the analysed period.


7. What advantages in FP adoption?
Analysing the data and experience gathered during two years, it should be said that the introduction of FP represented a very useful opportunity to clearly define the project scope, by defining in a specification document all the functionalities that must be developed according to customer requirements. Moreover, FP are an objective method that allows estimates made by different people to be compared, because you have to classify all the functions and all the data you want to develop, and this often lets you discover that something has been forgotten. For this reason FP analysis is useful for sharing information with suppliers about the functionality to be developed in the project, before the project cost and purchase orders are defined.

FPs are also important because they are a de facto standard in software estimation and they permit the sharing and discussion of estimates, function by function; in particular, as they are focused on user functionality, they force project managers to bring out the customer point of view instead of technological aspects.

FPs are also a way to get internal project managers to write functional documentation in a standard way that permits counting.

Finally, FPs prove to be a good aid for estimating projects when project managers have no idea of how much a project will cost.

8. Some points to highlight

Some points should be highlighted to analyse the adoption of the FP process in a critical way.

FPs don't cover all aspects of a software project; they represent only a part of all the work to be done. They capture the user point of view, but there are other elements that contribute to the project dimension, and the measuring effort should be proportional to project duration and cost. To account for all the elements forming the total project cost, different components should be considered, for example:
• Desired quality.
• Management aspects (such as project documentation).
• Test activities.
• Activities required to install hardware or software components or to deploy the product to a different customer.
In these cases our project cost is made up of two components: the FP cost, and the effort, per professional skill, to deliver the above mentioned elements.

Independence from technology is a big limit of this metric, because it is completely different to realise the same functionality in a mainframe or in a web environment, just as it is different to realise an interface between two systems using a text file or a web service. What is the same for an end user can be completely different from the system performance point of view, and in total cost. An attempt to solve this problem was made with the VAF, but IFPUG now suggests not using it, in order to preserve the original idea that FP must be independent of technology. In addition, our contracts do not permit any adjustment factor related to technology, because they must be fully aligned with the IFPUG indications.

FP are not applicable to technical IT projects like software deployment or operating system installation on servers, so they don't cover every kind of project developed in an IT department.


Even if FPs are a consolidated standard for software development, they are not so widely known, and on the Italian market there are very few professionals with great experience of the IFPUG methodology. At the beginning of our experience none of the six partners had skilled resources on FP, and it was very difficult to compare the two estimates because user functions were generally classified in a completely different way.

We also found it difficult to share our experience with other companies, because everyone had adopted different criteria to evaluate projects and there is no single way to proceed.

Finally, it is difficult to adapt the method to current technologies. In the FP literature there are very few examples of FP used in ERP [2] or more generally in COTS, data warehouse [3] or web environments, and many papers [4] argue that FP are not suitable for these realities. It is very hard to define a system that has at least 10 years of life as “new technology”, and it is a pity that the FP manual has not been changed to adapt its terminology to the present environment. Just to give an example, the manual [1] presents examples of mainframe screens and never mentions the graphical aspects of software development.

There are two important things to consider when adopting a software development estimation process. The first is that the aim of the process is to produce a project cost estimate, not just an FP number; the second is that CIOs need answers in a short time, and the answers must be realistic, complete and easy to understand even for non-technical people. Most CIOs need to know how much a project will cost and when it will end; they are not interested in complex mathematical models explaining software productivity, or in different metrics explaining estimates for different parts of the project. They need only one piece of information, to decide whether the project can be done or not.

9. Adopted solution

Like every other organisation that uses FPs, we have adopted some solutions to overcome the problems listed above with our suppliers and to come to an agreement on their proposals.

The most important one has been to share with them templates for similar projects. In our experience, we noticed that projects with the same characteristics and the same technical solutions could also have the same scheme of functionality to be developed. This made it possible to estimate FPs on the same basis and to compare estimates made by different suppliers. In this way we do not use an international standard, but we solve the problem of adapting FP estimates to current technologies.

To cover the non-functional aspects of software development, such as quality and management aspects (see above), we defined which activities are not included in FPs and we estimate them in FTE. In particular, it has been agreed that FPs cover all activities from detailed analysis through architectural design, coding and unit test, but they do not cover the integration tests performed on our systems, because these vary from one project to another and are strongly influenced by end-user characteristics; they do not cover delivery activities and training either. So, depending on the project, we can set up mixed contracts with suppliers: part in FPs and part in FTE.
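As an illustration of how such a mixed contract adds up, the sketch below combines an FP-priced part with an effort-priced (FTE) part per professional skill. All figures (FP count, price per FP, days and daily rates) are hypothetical and only show the shape of the calculation, not our actual contract values.

# Minimal sketch of a mixed FP + FTE contract calculation.
# All numbers are hypothetical examples, not actual contract figures.

def mixed_contract_cost(function_points, price_per_fp, fte_days_by_skill, daily_rate_by_skill):
    """Total price = FP-based part + effort-based (FTE) part per professional skill."""
    fp_part = function_points * price_per_fp
    fte_part = sum(days * daily_rate_by_skill[skill]
                   for skill, days in fte_days_by_skill.items())
    return fp_part + fte_part

total = mixed_contract_cost(
    function_points=250,                      # size of the functional part
    price_per_fp=400.0,                       # hypothetical unit price
    fte_days_by_skill={"integration test": 30, "training": 10},
    daily_rate_by_skill={"integration test": 450.0, "training": 500.0},
)
print(f"Total contract value: {total:.2f}")   # 250*400 + 30*450 + 10*500 = 118500.00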

To address the technical aspects of development, we decided to consider not only the initial input and the final output, as the theory prescribes, but also the intermediate outputs necessary to realise certain functionalities. So, if the best technical solution requires additional steps (which are not visible to the end user), we take this additional effort into account.


Just to give an example of this solution, consider our data warehouse architecture, made up of two different data levels, in which the first level is not linked to any user functionality but is fundamental to ensure correct data, high stability and flexibility. In this case we agree with suppliers to also consider the activities necessary to load the first level, since we ask suppliers to develop this intermediate step. It is a sort of intermediate user that we have introduced into our process in order to fill the gap between FP theory and software development reality.
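A minimal numerical sketch of the effect of this agreement is given below: the size on which the supplier is paid is the user-facing count plus the agreed size of the intermediate (first-level) load, which a strict end-user count would ignore. The FP values are invented for illustration only.

# Illustrative only: the FP values below are invented to show the effect of
# also counting the agreed intermediate (first data level) load step.

user_facing_fp = 180          # functions visible to the end user (reports, queries, ...)
intermediate_load_fp = 45     # agreed size of the first-level load processes

standard_count = user_facing_fp                       # strict end-user point of view
agreed_count = user_facing_fp + intermediate_load_fp  # contract basis with the supplier

print(f"Standard count: {standard_count} FP")
print(f"Agreed count:   {agreed_count} FP "
      f"(+{intermediate_load_fp / standard_count:.0%} for the intermediate level)")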

10. Conclusions

In conclusion let’s try to answer questions asked in the abstract. • What are the benefits of implementing a software estimate process?

The FP estimate process adoption has some advantages as we mentioned in our paper, the most important one is that it forces project managers to fix in the functionality to be developed in advance and so to collect better documentation on projects. This is important in particular when an ICT department realise a lot of small projects in a short time, where the project contest is well known because it’s always the same, users and their needs are known, suppliers know the contest too and where a shortage of time often stops project managers from paying attention to project documentation. Another benefit is due to the way the process has be organised: concentrating FP competence in one single point represents a support to project managers helping them in software estimate by an objective consultancy not linked to any particular system. To concentrate competence is also important to allow the FP technique be used with more frequency than if used by any single project manager few times.

• Are FPs still relevant, and do they meet the requirement of giving a quantitative measure of software development?
FPs are important because they cover the need for functional estimation, but the marketplace is changing fast and you must be ready to face this challenge. FPs must be able to evolve in order to survive. The marketplace needs easy and agile standard tools that give an idea of how much a project will cost, and IFPUG should commit to producing guidelines that help all FP users in their everyday work. These guidelines should be built by collecting and unifying different experiences; IFPUG cannot leave individual FP users to define rules by themselves. When will the new version of the counting practices manual be ready? Will this new version be more user-friendly and easier to consult? When will we see examples of web applications in the manual instead of mainframe screenshots?

• Why are they not so widely used?
FPs were born in 1975, so they are almost 35 years old, but they are not widely used and known because they hardly fit all the characteristics of software development. They were conceived as a standard method for software sizing when software was based on mainframe applications; they were also intended to be as general as possible, adaptable to every software environment, as a standard unit of measure, like the metre. But software is not like a room that can be measured in length and width with a metre: it has many particular characteristics that make it always different, and when you measure software using FPs as the standard unit you realise you cannot measure everything with them; so people prefer other ways to produce software development estimates. Perhaps they are used much less because they do not fit software development estimation needs well, and because counting all the different kinds of applications an IT department has, and keeping the counts up to date using the FP metric, is too hard.


• Can FPs be used in any kind of software application? If yes, how?
FPs can be used across all the different technologies, because every software development project has to realise functionalities requested by users; the difference lies in how those functionalities have to be developed. FP users need hints on how to proceed in different situations. Simply discussing with colleagues or in user groups is not sufficient and does not guarantee that the methods adopted are standard. When a big company buys services on the market using the FP metric, it should use standard, shared guidelines, not personal solutions that are not widely known. We think that FP theory should be adapted to different technologies, perhaps by distinguishing between software environments or by providing guidelines for platforms other than the mainframe. This evolution may be the only way to help FP theory survive in the coming years. But it should be done quickly!

11. References
[1] IFPUG, "Function Point Counting Practices Manual", version 4.2.
[2] A. Cavallo, M. Martellucci, F. Stilo, N. Lucchetti, D. Natale, "Impiego della FPA nella stima dei costi di personalizzazione di sistemi ERP", in GUFPI-ISMA, Metriche del software, ed. Franco Angeli.
[3] L. Santillo, "Size and estimation of data warehouse system", FESMA 2001.
[4] CNIPA, "Strategie di acquisizione delle forniture ICT", version 3.2, 2008.


Implementing a Metrics Program: MOUSE will help you

Ton Dekkers

Abstract
Just like an information system, a method, a technique, a tool or an approach supports the achievement of an objective. Following this line of thought, implementing a method, a technique and so on should in many ways be comparable to the development of an information system. In this paper the implementation approach MOUSE is applied to the implementation of a metrics program. The people, the process and the product all get the attention that is needed to make it a success.

1. A Metrics Program
It doesn't matter where a metrics or measurement program is initiated; implementing a metrics program can be seen as 'just another' staged IT project. An IT project brings together software, hardware, infrastructure, organisation and people, and is structured with stages for development, transition to support, run & maintain, and implementation. Why not apply the same structure to implementing a metrics program? All items from a 'real' IT project have to be addressed too. However, a metrics program is not the same as an information system, so it requires different activities, and the stages are somewhat different.

In figure 1 the baseline for the project lifecycle of the metrics program implementation is given, and in the following paragraphs each stage is explained in detail.

Figure 1: Baseline project lifecycle for the metrics program implementation

1.1. Preparation
Before the decision to implement a metrics program is made, the goals that the program should serve need to be defined clearly. In the preparation phase the scope and the boundaries of the solution space are set. Most of the time the estimating process [1] for software development and/or performance benchmarking is driving the metrics program to be implemented. A good framework to decide which goals, and which metrics are needed for the defined goals, is the Goal-Question-(Indicator-)Metric (GQ-I-M) method [2].
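As a minimal sketch of the goal-to-metric derivation that the GQ-I-M method prescribes, the structure below links one hypothetical estimation goal to questions, indicators and base metrics. The goal, questions and metric names are illustrative assumptions, not taken from the method's own templates.

# A minimal, hypothetical Goal-Question-(Indicator-)Metric breakdown.
# The goal, questions and metrics are examples, not prescribed by the method itself.

gqim = {
    "goal": "Improve the reliability of project effort estimates",
    "questions": [
        {
            "question": "How far are our estimates from actual effort?",
            "indicator": "Estimation accuracy trend per quarter",
            "metrics": ["estimated effort (hours)", "actual effort (hours)"],
        },
        {
            "question": "Does project size explain the deviations?",
            "indicator": "Effort deviation plotted against functional size",
            "metrics": ["functional size (FP)", "effort deviation (%)"],
        },
    ],
}

for q in gqim["questions"]:
    print(f"{gqim['goal']} -> {q['question']} -> {', '.join(q['metrics'])}")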


These goals are the basis for the organisation-specific elements in the implementation of a metrics program.

1.2. Assessment & Planning

As shown in figure 1, this phase has two main activities: inventory and introduction. This phase can be compared with the requirements definition phase: the requirements have to be gathered and agreed with the stakeholders.

During this stage the current working methods and procedures are assessed together with all aspects that might have a relation with the metrics program to be implemented, such as:

• Already implemented measurements: (functional) size, effort, defects, duration.

• Software development methodology: development process(es), stages, milestones, activities, deliverables, guidelines and standards.

• Software development environments (platform).
• Measurement process(es) and related data collection procedure(s).
• Common project organisation and related procedures.
• Generic / organisation-specific characteristics and the project-specific ones.
• Effort and work allocation breakdown.
• Risk management procedures.

After the analysis of the current situation, the results have to be mapped onto the objectives of the measurement program. The assessment is executed using the agreed upon solution space and the implementation approach MOUSE that is described later in this document. This 'gap analysis' is used to determine which activities and which stakeholders will be affected by the metrics program. Those stakeholders have to be informed or invited to participate and, when necessary, trained to work with the metrics program.

At the end of the Assessment & Planning stage there is a documented consensus about what the metrics program is expected to monitor, who will be involved and in what way (implementation plan vs. project plan). This implementation plan describes the transition from the current state to the desired situation. In addition to a description of the current state (including findings and recommendations) and the desired situation, the plan includes:

• The necessary changes.
• Per change the transformation, i.e. the steps to make the change.
• The identified training needs.
• An awareness / communication program.
• The agreed upon base data collection set.
• Possible pilot and research projects.
• Required staff and roles in the implementation phase.

1.3. Implementation

Training

Employees in the organisation who are affected by the metrics program will have to be trained to use the metrics in a correct manner. Depending on the role, this training can range from an introductory presentation to a multiple-day training course.


For the introduction of a metrics program typically five categories of employees are identified:

• Management: The management must have and convey commitment to the implementation of a metrics program. The management needs to be informed about the possibilities and impossibilities of the metrics program. They also must be aware of the possible consequences a metrics program might have for the organisation. It is well known that employees can feel threatened by the introduction of a metrics program and in some cases react in quite a hostile way. Such feelings can usually be prevented by open and correct communication by management about the true purposes of the metrics program and the expected benefits.

• Metrics analysts: The specialists who are responsible for analysing and reporting the metrics data. They are also responsible for measuring and improving the quality of the metrics program itself. They might have been trained in the preparation stage; otherwise they have to be trained in the basic and advanced topics. Depending on the implementation, additional tool training is required.

• Metrics collectors: The employees that are actively involved in collecting or calculating the metrics have to know all the details and consequences of the metrics, to assure correct and consistent data. If the metrics that are used in the metrics program come from activities that are already common practice, the training may only take a couple of hours. If the metrics are not common practice or involve specialist training, for instance if functional sizes have to be derived from design documents, the training may take a substantial amount of time. In the latter case this involves serious planning, especially in matrix organisations: it will not only consume time of the employee involved, but it will also affect his or her other activities. Depending on the implementation, additional tool training might be required.

• Software developers: Usually most of the employees that are involved in the software development will be affected, directly or indirectly, by the metrics program, because they 'produce' the information the metrics program uses. They need to have an understanding of the metrics and the corresponding vocabulary. For them the emphasis of the training needs to be on awareness, on understanding the use and importance of a metrics program for the organisation, because they usually do not expect any benefit from the program for their personal activities. However, there are also benefits for them. In addition they may need to change some of their products to make measurement possible or consistent.

• End-users or clients: Although a metrics program is set up primarily for the use of the implementing organisation, end-users or clients can also benefit from it. In particular sizing metrics are useful in the communication between the client and the supplier: how much functionality will the client get for the investment. Whether this audience will be part of the training stage for a metrics program depends on the openness of the implementing organisation: are they willing to share information about the performance of their projects?


At the end of the training stage everyone who will be affected directly or indirectly by the metrics program has sufficient knowledge about this program. It may seem obvious, but it is essential that the training stage is finished before (but preferably not too long before) the actual implementation of the metrics program starts.

Research

In this stage the metrics to be implemented are mapped onto the activities within the organisation that will have to provide the metrics data. The exact process of collecting the metrics data is determined and described in such detail that at the start of the implementation it is unambiguous which metrics data will be collected and how.

In this stage it is useful to determine what the range of the metrics data might be. An approach that is very helpful is Planguage [3]. The word Planguage is a combination of Planning and Language. The purpose of Planguage is to describe a concept (e.g. a metric) in relation to stakeholders, scope and qualifiers. In the case of a metric, the qualifiers are the type of measurement, the scale, the base measurement and the level to achieve. A wrong perception of the possible result of metrics data can kill a metrics program at the start. It is also important to establish at least an idea of the expected bandwidth of the metrics data beforehand, to know what deviations can be considered acceptable and what deviations call for immediate action.
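A sketch of how a single metric could be captured with Planguage-like qualifiers (scale, meter, baseline, target and acceptable bandwidth) is shown below, together with a simple classification of deviations. The field names and values are an interpretation for illustration only, not the official Planguage keyword set.

# Hypothetical, Planguage-inspired definition of one metric, with an expected
# bandwidth so that deviations can be classified early. Field names are illustrative.

estimation_accuracy = {
    "name": "Effort estimation accuracy",
    "stakeholders": ["project managers", "CIO", "expertise group"],
    "scale": "relative deviation of actual vs. estimated effort, in %",
    "meter": "compare booked hours with the approved estimate at project closure",
    "baseline": 30,      # % deviation observed before the program (assumed)
    "target": 15,        # % deviation the organisation wants to achieve (assumed)
    "bandwidth": 20,     # % deviation still considered acceptable (assumed)
}

def classify(deviation_pct, metric=estimation_accuracy):
    """Flag a measured deviation against the agreed target and bandwidth."""
    if deviation_pct <= metric["target"]:
        return "on target"
    if deviation_pct <= metric["bandwidth"]:
        return "acceptable"
    return "immediate action"

print(classify(12), classify(18), classify(35))  # on target acceptable immediate action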

At the end of the research stage all procedures to collect metrics data are described, target values are known and allowable bandwidths are established for each metric in the metrics program.

Organisational Impact

Until now the metrics program has had little impact on the organisation, because only a limited number of employees have been involved in the pilot. The organisational implementation of a metrics program will have an impact on the organisation, because the organisation has formulated goals which the metrics program will monitor. These goals may not have been communicated before, or may not have been made explicitly visible. Metrics will have to be collected at specified moments or about specified products or processes. This could mean a change in procedures. For the employees involved this is a change process, which can trigger resistance or even quite hostile reactions.

Over 10 years of experience show that the most suitable organisational structure for a metrics program is to concentrate expertise, knowledge and responsibilities in an independent body: a separate expertise group, or one incorporated in the Project Management Office or Audit group. An independent body has many advantages over other organisational structures. For example, when activities are assigned to individuals in projects, many additional measures have to be taken to control the quality of the measurements, the continuation of the measurement activities and the retention of expertise about the metrics program. When responsibilities for (parts of) the metrics program are assigned to projects, additional enforcing measures have to be taken to guarantee adequate attention from the project to metrics program assignments over other project activities. Installing an independent body to oversee and/or carry out the metrics program is essential for achieving the goals the metrics program was set up for. This independent body can be either a person or a group, within or outside the organisation. How this body should operate is laid down in the MOUSE approach, which will be described in detail later on.


At the end of the implementation stage the metrics program is fully operational and available throughout the organisation.

Pilot

Unless the organisation is very confident that a metrics program will work properly from the start, it is best to start the implementation with a pilot. In a pilot metrics are collected from a limited number of activities and with a limited number of people involved. In a pilot all procedures are checked, experience is built up with these procedures and the first metrics data are recorded. In this way the assumptions about the metrics values and bandwidths from the research stage can be validated.

After the completion of the pilot all procedures and assumptions are evaluated and modified if necessary. When the modifications are substantial it may be necessary to test them in another pilot before the final organisational implementation of the metrics program can start.

The pilot and the evaluation of the pilot can be considered the technical implementation of the metrics program. After completion of this stage the metrics program is technically ready to be implemented.

1.4. Operation

This stage is actually not a stage anymore. The metrics program has been implemented and is now part of the day-to-day operations. The metrics program is carried out in conformance with the way it is defined and is producing information that helps the organisation keep track of the way it is moving towards its goals.

A mature metrics program gives continuous insight into the effectiveness of current working procedures in reaching the organisational goals. If the effectiveness is lower than desired, adjustments to the procedures should be made. The metrics program itself can then be used to track whether the adjustments result in the expected improvement. If working procedures change, it is also possible that adjustments have to be made to the metrics program.

Organisational goals tend to change over time. A mature metrics program contains regular checks to validate whether it is still delivering useful metrics in relation to the organisational goals. All these aspects are covered in the MOUSE approach.

2. MOUSE
Implementing a metrics program is more than just training people and defining the use of metrics. All the lessons learned from implementations in (globally operating) industry, banking and government were the basis for MOUSE, an approach that helps to set up the right implementation and to create an environment in which the method is fit for purpose. MOUSE describes all activities and services that need to be addressed to get a metrics program set up and to keep it running.

MOUSE is a clustering of all the activities and services into groups of key issues, described in the table below:


Table 1: Key issues of MOUSE
Market View: Communication, Evaluation, Improvement, Investigation
Operation: Application, Review, Analysis, Advice
Utilisation: Training, Procedures, Organisation
Service: Helpdesk, Guidelines, Information, Promotion
Exploitation: Registration, Control

In the following paragraphs the five groups of MOUSE will be explained, in some cases illustrated with examples of the implementation within an IT department.

2.1. Market view
Communication in the MOUSE approach is an exchange of information about the metrics program both internally (the organisation itself) and externally (metrics organisations, universities, interest groups, …). Internal communication is essential to keep up the awareness of the goals for which the metrics program is set up. For example, company publications (newsletters) and an intranet website are very useful for sharing information.

Communication with outside organisations is important to keep informed about the latest developments. Usually an important metric in a metrics program in an IT environment is the functional size of software. The International Function Point User Group (IFPUG [4]) and local organisations like the Netherlands Software Metrics Association (NESMA [5]) and the United Kingdom Software Metrics Association (UKSMA, MK II [6]) are the platforms for Function Point Analysis. Cosmicon is a platform for COSMIC Function Points [7]. In principle all SMAs can help with the use of Functional Size Measurement (FSM) methods. The requirements for this knowledge transfer depend on the organisational situation and demands. This interaction can be outsourced to a partner / supplier in case the partner already has contacts in these international networks and offers the possibility to share them. Collaboration with universities and specific interest groups (SIGs) from other professional organisations (e.g. the Metrics SIG of the Project Management Institute – PMI [8]) is useful too.

If the independent body is located within the client's organisation, direct and open communication is possible with the stakeholders of the metrics program, to evaluate whether the metrics program is still supporting the goals that initially drove the program. When the independent body is positioned outside the client's organisation, more formal ways to exchange information about the metrics program are required. Regular evaluations or some other form of assessment of the measurement process work well for an open communication about the metrics program.

The findings that the evaluations provide are direct input for continuous improvement of the metrics program. Depending upon the type of finding (operational, conceptual or managerial), further investigation may be required before a finding can be translated into a measurement process improvement.

Investigation can be both theoretical and empirical. Theoretical investigation consists of studying literature, visiting seminars and conferences, or following workshops. Empirical investigation consists of evaluating selected measurement tools and analysing experience data. Usually these two ways of investigation are used in combination.


2.2. Operation
Application includes all activities that are directly related to the application of the metrics program. This includes activities like executing measurements (for example functional size measurements, tallying hours spent and identifying project variables). Within MOUSE the organisation can choose to assign the functional sizing part of the operation either to the independent body or to members of the projects in the scope of the metrics program.

The best way to guarantee the quality of the measurement data is to incorporate review steps into the metrics program. The purpose of reviewing is threefold:

• To ensure correct use of the metrics (rules and concepts).
• To keep track of the applicability of the metrics program.
• To stay informed about developments in the organisation that might influence the metrics program.

During the research stage all procedures to collect metrics data are described for each metric in the metrics program. These procedures are described in a way that supports the organisational goals for which the metrics program was set up. Some metrics data can also be used to support project purposes, and the expertise group can advise on and support the usage of the metrics for these purposes. For example, an aspect of the metrics program can be the measurement of the scope creep of projects during their lifetime. Functional size is measured at various stages of the project to keep track of the size as the project progresses. These functional size measures can also be used for checking the budget, as a second opinion next to a budget based on, for example, a work breakdown structure. The independent body can give advice about the translation of the creep ratio in the functional size into a possible increase of the budget.
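The sketch below illustrates how the independent body might translate measured creep in the functional size into a suggested budget adjustment. The linear translation and the figures are assumptions for illustration only; the actual advice will depend on the contract and on the productivity data.

# Hypothetical example: functional size measured at two project stages is used
# to derive a creep ratio and a proportional (assumed linear) budget adjustment.

baseline_fp = 400        # size counted at the end of the requirements stage
current_fp = 460         # size counted at the latest re-count
approved_budget = 200_000.0

creep_ratio = (current_fp - baseline_fp) / baseline_fp
suggested_budget = approved_budget * (1 + creep_ratio)

print(f"Scope creep: {creep_ratio:.1%}")             # 15.0%
print(f"Suggested budget: {suggested_budget:,.0f}")  # 230,000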

During the research stage target values and allowable bandwidths are established for each metric in the metrics program. The independent body will have to analyse whether these target values were realistic at the beginning of the metrics program and whether they are still realistic at present. One of the organisational goals might be to get improving values for certain metrics. In that case, the target values for those metrics and/or their allowed bandwidth will change over time, e.g. the effort estimation bandwidth should improve from 20% to 15%.
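A small sketch of how the independent body could check whether a tightened bandwidth (here the 20% to 15% example) is already being met by recently closed projects is given below; the project data are invented.

# Invented closed-project data: (estimated effort, actual effort) in hours.
projects = [(800, 890), (1200, 1150), (500, 590), (950, 1010), (300, 310)]

def within_bandwidth(estimated, actual, bandwidth):
    """True if the relative deviation between estimate and actual stays inside the bandwidth."""
    return abs(actual - estimated) / estimated <= bandwidth

for bw in (0.20, 0.15):
    hits = sum(within_bandwidth(est, act, bw) for est, act in projects)
    print(f"Within {bw:.0%} bandwidth: {hits}/{len(projects)} projects")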

2.3. Utilisation

Next to the basic training at the start of a metrics program, it is necessary to maintain knowledge about the metrics program at an appropriate level. The personnel of the independent body should have refresher training on a regular basis, covering new developments (rules, regulations) in the area of the applied methods. The independent body can then decide whether it is necessary to train or inform other people involved in the metrics program about these developments. In case the independent body is outsourced, the supplier can be made responsible for keeping the knowledge up-to-date.

To guarantee the correct use of a method, procedures related to measurement activities of the metrics program are necessary. They are usually initiated and established in the research stage of the implementation.


Not only the measurement activities themselves need to be described, but also facilitating processes like:

• Project management.
• Change management control.
• Project registration.
• (Project) Evaluation.
• Performance analysis.

After the initial definitions in the research stage, the independent body should monitor that all the relevant definitions are kept up-to-date.

As stated earlier, the independent body can reside within or outside the organisation where the metrics program is carried out. The decision about this organisational aspect is usually combined with the number of people involved in the metrics program. If the metrics program is small enough to be carried out by one person part-time, the tasks of the independent body are usually assigned to an external supplier. If the metrics program is large enough to engage one or more persons full-time, the tasks of the independent body are usually assigned to employees of the organisation. Depending on the type of organisation this might not always be the best solution for a metrics program. When the goals the organisation wants to achieve are of such a nature that they involve sensitive information, calling in external consultants might be a bad option, no matter how small the metrics program might be. If employees have to be trained to carry out the tasks of the independent body, they might perceive that as narrowing their options for a career within the organisation. In that case it might be wise to assign these tasks to an external party specialising in these kinds of assignments, no matter how large the metrics program is. Outsourcing these assignments to an external party has another advantage: it simplifies the processes within the client's organisation. Yet another advantage of outsourcing the independent body could be political: to have a truly independent body do the measurement, or at least a counter-measurement.

2.4. Service

To support the metrics program a helpdesk (e.g. a focal point within the expertise group) needs to be set up. All questions regarding the metrics program should be directed to this helpdesk. The helpdesk should be able to answer questions with limited impact immediately and should be able to find the answers to more difficult questions within a reasonable timeframe. It is essential that the helpdesk reacts adequately to all kinds of requests related to the metrics program. In most cases the employees that staff the independent body constitute the helpdesk.

Decisions made regarding the applicability of a specific metric in the metrics program need to be recorded, in order to incorporate such decisions into the 'corporate memory' and to be able to verify the validity of these decisions at a later date. Usually such decisions are documented in organisation-specific guidelines for the use of that specific metric.

The success of a metrics program depends on the quality of the collected data. It is important that those who supply the data are prepared to provide this data. The best way to stimulate this is to give them information about the data in the form of analyses. These should provide answers to frequently asked questions, such as: "What is the current project delivery rate for this specific platform?", "What is the reliability of the estimations?", "What is the effect of team size?"
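As an illustration of the kind of analysis meant here, the sketch below derives a project delivery rate (hours per function point) per platform from a small, invented set of historical records; in practice the figures would come from the experience database or an ISBSG-style repository.

from collections import defaultdict

# Invented historical records: (platform, functional size in FP, effort in hours).
history = [
    ("web", 250, 2200), ("web", 400, 3800),
    ("mainframe", 300, 3300), ("mainframe", 150, 1800),
]

totals = defaultdict(lambda: [0, 0])   # platform -> [total FP, total hours]
for platform, size_fp, effort_h in history:
    totals[platform][0] += size_fp
    totals[platform][1] += effort_h

for platform, (size_fp, effort_h) in totals.items():
    print(f"{platform}: {effort_h / size_fp:.1f} hours per FP")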


Usually the historical data can answer the questions related to internal performance per functional size unit. When this is not (yet) available, historical data from third parties can be used, e.g. the repositories of the International Software Benchmarking Standards Group – ISBSG [9].

Promotion is the result of a proactive attitude of the independent body. The independent body should market the benefits of the metrics program and should ‘sell’ the services it can provide based on the collected metrics. Promotion is necessary for the continuation and extension of the metrics program.

2.5. Exploitation

The registration part of a metrics program consists of two components: the measurement results and the analysis results. In a metrics program in an IT environment all metrics will be filed digitally without discussion. Here a proper registration usually deals with keeping the necessary data available and accessible for future analysis. For most metrics programs it is desirable that the analysis data is stored in some form of experience database. In this way the results of the analyses can be used to inform or advise people in the organisation.

Control procedures are required to keep procedures, guidelines and the like up-to-date. If they no longer serve the metrics program or the goals the organisation wants to achieve, they should be adjusted or discarded. Special attention needs to be given to the procedures for storing metrics data. That data should be available for as long as is necessary for the metrics program. This might be longer than the life of individual projects, so it is usually advisable to store the data in a place that is independent of the projects the data comes from.

3. Tools

Although a tool is not required to implement a metrics program, it definitely supports the operation of the process and the acceptance. When the introduction of tool(s) is part of the implementation, some of the items in MOUSE need additional attention.

In addition, the tool(s) may require additional data collection to maximise the benefits of the tooling. This will influence the scope and solution space.

3.1. Types of tools
The activities described in this document can be supported by tooling in a number of areas:

• (Functional) Sizing.
• Data collection.
• Historical data storage.
• Estimation & control.
• (Performance) Analysis.
• Benchmarking.

In practice, the most common solutions are found in series of Excel sheets, sometimes linked with Access or SQL databases. This might be cheap and flexible; however, from a quality, maintenance, governance and consistency perspective it is not the most appropriate option.


In the sizing area, (effective) source lines of code (SLOC) is quite often the easiest choice for a size measure; in that case a tool for counting the lines of code is almost unavoidable. When selecting a tool, the SLOC counted should follow clear definitions and preferably well-respected public standards (e.g. the standard from the Software Engineering Institute – SEI).
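A minimal sketch of what such a counting tool does is shown below: it counts 'effective' lines by skipping blank lines and full-line comments. The conventions used (Python source, '#' comments) are assumptions for illustration; a real counter should implement a published counting standard such as the SEI's.

# Minimal effective-SLOC counter: counts lines that are neither blank nor
# full-line comments. The comment convention ('#') is an assumption; real
# counters follow a published standard and handle block comments, strings, etc.

def effective_sloc(path, comment_prefix="#"):
    count = 0
    with open(path, encoding="utf-8") as source:
        for line in source:
            stripped = line.strip()
            if stripped and not stripped.startswith(comment_prefix):
                count += 1
    return count

print(effective_sloc(__file__), "effective source lines in this file")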

My preference for sizing is the ISO-certified functional sizing methods, e.g. Function Point Analysis or the COSMIC Functional Size Measurement method. In the market various tools are available for registering detailed counting results. There are good tools in the commercial domain as well as in the freeware domain.

The same applies to tools in the other areas. It will not be surprising that I would like to inform you that the Galorath SEER suite [10] is a professional solution integrating the areas of historical data storage (SEER HD – Historical Data), estimating (SEER for Software, SEER IT – Information Technology), control (SEER PPMC – Parametric Project Monitoring & Control) and benchmarking (SEER Metrics, which also supports the ISBSG repository). SEER, by comparison, supports sizing based on analogy, whatever sizing unit is chosen, provided of course that historical data is available.

3.2. Implementation of tools

When tools are selected to support the measurement program and to utilise the collected data, the groups and items in MOUSE that need special attention are:

• Market View (improvement, investigation).
• Operation (application, analysis).
• Utilisation (training, procedures, organisation).
• Service (helpdesk, guidelines).
• Exploitation (registration, control).

Initially the basic requirements for the use of the tools have to be set (investigation), in line with the purpose and goals of the metrics program. Most of the tools offer more functionality than initially required, which might become useful when the metrics program gets more mature (improvement).

A tool has to be integrated in the application of the measurement program. Part of this is that the reporting and analysis provided by the tool(s) needs to be tailored or translated to the definitions and needs of the organisation.

It’s obvious that a tool requires additional training. Especially when referring to the

metrics program, the tools are mostly expert tools. (Advanced) Training is almost mandatory, at least for the expertise group that needs to support the organisation. It has to be embedded in the procedures and responsibilities have to get in line with other organisational structures.

A focal point for setting the guidelines for the usage of the tool(s) and for support (e.g. a helpdesk) has to be organised.

And last but not least, practical things like back-up, storage, access and security must be put in place (registration, control).


4. References
[1] Galorath, Evens, "Software Sizing, Estimation and Risk Management", 2006, Auerbach Publications, ISBN 0-8493-3593-0.
[2] Van Solingen, Berghout, "The Goal/Question/Metric Method", 1999, McGraw-Hill, ISBN 0-07-709553-7.
[3] Gilb, "Competitive Engineering", 2005, Elsevier, ISBN 0-7506-6507-6.
[4] International Function Point User Group, "Function Point Counting Practices Manual version 4.2.1", 2005, IFPUG, www.ifpug.org.
[5] Netherlands Software Metrics Association, "Definitions and counting guidelines for the application of function point analysis, NESMA Functional Size Measurement method conform ISO/IEC 24570, version 2.1", 2004, NESMA, ISBN 978-90-76258-19-5, www.nesma.org.
[6] United Kingdom Software Metrics Association, "MK II Function Point Analysis Counting Practices Manual, Version 1.3.1", 1998, UKSMA, www.uksma.co.uk.
[7] Common Software Measurement International Consortium, "COSMIC Functional Size Measurement Method, version 3.0", 2007, COSMIC, www.gelog.etsmtl.ca/cosmic-ffp.
[8] Project Management Institute, www.pmi.org.
[9] International Software Benchmarking Standards Group, www.isbsg.org.
[10] Galorath, "SEER suite", www.galorath.com.


SOFTWARE MEASUREMENT

EUROPEAN FORUM

PROCEEDINGS

28 - 29 May 2009

NH LEONARDO da VINCI, ROME (ITALY)

APPENDIX


THURSDAY, 28 MAY 2009

08:45 Registration
09:15 Workshop: Functional Size Measurement & Modern Software Architectures
11:10 Coffee
11:30 Workshop: Software Measurement in Contract Management
13:00 Lunch
14:00 KEYNOTE

How to bring "software measurement" from research labs to the operational business processes
Roberto Meli
D.P.O. – Data Processing Organisation, Italy

14:35 Results of an empirical study on measurement in Maturity Model-based software process improvement
Rob Kusters, Jos Trienekens, Jana Samalikova
Eindhoven University of Technology, Netherlands

The use of measurement in improving the management of software development is promoted by many experts and in the literature. E.g. the SEI provides sets of software measures that can be used on the different levels of the Capability Maturity Model, see Baumert [1992]. Pfleeger [1995] presents an approach for the development of metrics plans, and also gives an overview of the types of metrics (i.e. project, process, product) that can be used on the different levels of maturity.

The goal of this paper is to identify, on the basis of empirical research data, the actual use of metrics in the software development groups of the multinational organisation Philips. The data have been collected in the various software groups of this organisation. The number of responding software groups was 49 (out of 74).

The key questions in the research project are:
• What types of metrics are used in software groups?
• What types of metrics are used on the different CMM levels? And is this consistent with recommendations in the literature?

In the survey 17 predefined metrics were included by the research group, with questions about their usage. Organisations could also add metrics, resulting in 117 additional metrics. In these metrics three clear clusters could be identified: Project metrics (progress, effort, cost), Product metrics (quality, stability, size) and Process metrics (quality, stability, reuse). This classification is derived from the existing classifications of Baumert [1992] and Pfleeger [1995]. Only a small number of metrics, fewer than 10, could be classified as process improvement metrics.


The main results of the analysis are as follows:
• On each of the CMM levels, software groups make use of quite a large number of project metrics (i.e. to measure progress, effort and cost).
• Software groups on CMM level 1 (the Performed process level) make, or try to make, use of a large diversity of metrics, also from the product and process categories. This contradicts recommendations from the literature, which stress the usage of project metrics (mainly to provide baseline data as a start for further improvement).
• On CMM level 2 (the Managed process level) software groups are definitely restricting themselves to project metrics; the usage of product metrics shows a sharp decrease. Obviously software groups on this level know better what they are doing, or what they should do, i.e. project monitoring and controlling. As a stepping stone towards level three, a number of product quality metrics can also be found here.
• Software groups on CMM level 3 (the Defined level) clearly show a strong increase in product quality measurement. Obviously, the higher level of maturity reached enables software groups to effectively apply product metrics in order to quantitatively express product quality.
• Remarkably, the analysis of the metrics showed that software groups on CMM level 3 and above don't show a strong increase in the usage of process metrics (quality, stability, reuse) or of process improvement metrics. The main instruments in software measurement to reach a satisfactory maturity level seem to be the project control and product quality metrics.

Conclusions:
The research shows a clear emphasis on project progress and product quality metrics in software groups that are striving for a higher level of maturity. It is also interesting to see that on each of the first three CMM levels software groups make use of about the same large number of project metrics, so project control is important on all levels. Surprisingly, and in contradiction with the literature, process metrics are hardly used (even by software groups on the higher CMM levels).

Benefits:
This paper is written to provide useful information to organisations with or without a software measurement program. An organisation without a measurement program may use this paper to determine which metrics can be used on the different CMM levels, and possibly prevent the organisation from using metrics which are not yet effective. An organisation with a measurement program may use this paper to compare its metrics with the metrics identified in this paper and, where appropriate, add or delete metrics accordingly.

15:10 Combining Measurement with Process Assessment to Align Process Improvement Efforts with The Company Vision
Fabio Bella
Kugler Maag CIE, Germany

Withdrawn due to credit crisis.

15:45 Coffee


16:15 Functional size measurement - Accuracy versus costs - Is it really worth it?
Harold van Heeringen, Edwin van Gorp, Theo Prins
Sogeti Nederland B.V., Netherlands

Withdrawn due to credit crisis.

16:50 Practical viewpoints for improving software measurement utilisation
Jari Soini
Tampere University of Technology, Finland

Measurement is widely recognised as an essential part of understanding, predicting and evaluating software development and maintenance projects. However, in the context of software engineering, its utilisation is recognised as a challenging issue. An increasing need for measurement data used in decision-making on all levels of the organisation has been observed when implementing the software process, but on the other hand measurement of the software process is in practice utilised to a rather limited extent in software companies. There seems to be a gap between need and use on this issue. Information relating to the software production process is available for measurement during software development work, and there are plenty of potential measurement objects and metrics related to this process. Therefore, it must be assumed that the basic problem is not a lack of measurement objects or software metrics. The issue of measurement utilisation is not the lack of measurement data, so in all likelihood there must be some other reasons for this phenomenon. In this paper we present the viewpoints of Finnish software companies on measurement issues and their experiences. Based on the empirical information obtained through interviews, this paper discusses the justification as well as potential reasons for the challenging utilisation of measurement in the software process. The research is based on a series of interviews and a questionnaire created to collect the experiences of software companies about software measurement in general. The aim of the two-year-long research project was to investigate the current status and experiences of measuring the software production process in Finnish software organisations and to enhance the empirical knowledge on this theme.

This paper presents the viewpoints of the software organisations - advantages and disadvantages - related to establishing or using software measurement in the software process. It also discusses the key factors that, at least from the users' perspective, should be carefully taken into account in connection with measurement, as well as some observed points that must be improved in current measurement practices if the organisations wish to enhance the utilisation of measurement in the software process.

These results can be used to evaluate and focus on the factors that must be noted when trying to determine the reasons why measurement is not utilised in software production processes in practice. This empirical information gives us some hints as to which are the important issues when designing, establishing and implementing measurement in the context of software production. The results also show some of the potential directions for further research on this theme. With the information provided by the research it is possible to take a short step on the road toward increased utilisation of measurement in software engineering.

Benefits:
The study that we have executed provides empirical information related to software measurement. This paper aims to enhance awareness of the relationship between software processes and process measurement in practice. The research (…described in the paper) is based on a series of interviews created to collect the experiences that software companies have of software measurement in general. The paper discusses the justification as well as potential reasons for the challenging utilisation of measurement in the software process. The empirical information presented highlights some focal issues which must be taken into account when using measurement and its results in the context of software production. The results presented in the paper are based on the thesis "Measurement in the software process: How practice meets theory?" published by the author in November 2008, Tampere University of Technology (TUT), Pori, Finland.

17:25 IFPUG Function Point or COSMIC Function Point?
Gianfranco Lanza
Csi Piemonte, Italy

The choice of the most suitable metric to obtain the functional measure of a software application is not as easy a task as it might appear.

Nowadays a software application is composed of many modules, each of them with its own characteristics, its own programming language and so on. In some cases one metric could appear more suitable than another to measure one particular piece of software, so in the process of measuring a whole application it is possible to apply different metrics; obviously what you cannot do is add one measure to the other (you can't add oranges to apples!).

In our firm we are starting to use COSMIC Function Points instead of IFPUG to measure "batch processes" and "middleware environments", while we use IFPUG Function Points in all other situations.

It’s important to observe that our “batch processes” are usually a stream of steps, one linked to the other (from the user point of view it’s a unique elementary process).

One of the possible problems is: "When management asks you for only one number for the functional dimension of an application, what can you do?"

It’s possible to obtain only one number using the conversion factors from IFPUG to Cosmic and viceversa, but, in my opinion, it would be better to maintain separate the two measures.

We are starting to compare the two metrics in different environments: in some cases there is a significant difference in the functional measure.

Using a metric is mainly a matter of culture. It's important to know what a metric can give you and what it cannot.

Using the functional measure to obtain an effort estimate is another non-trivial process. "Can I use the same productivity for both measures, COSMIC and IFPUG?" Obviously not! In the case of COSMIC it is important to fix the rules that determine the different layers of counting (for example: in a client/server application, do I have to consider two layers or only one?).

We are collecting data to establish productivity in different environments. In conclusion it is not an easy task: you have to pay attention not to mix the two metrics, but, if you do this in the right way, it can be very useful!

Benefits:
To understand that the process of choosing the right metric for evaluating pieces of software has to be done with intelligence and attention.

18:00 Closing


FRIDAY, 29 MAY 2009

09:15 Cognitive Perspective on Analogy-based Project Estimation
Carolyn Mair (a), Martin Shepperd (b), Miriam Martincova, Mark Stephens
(a) Southampton Solent University, United Kingdom
(b) Brunel University, United Kingdom

Individuals have been found to solve problems using analogies, drawing from episodic memory. This knowledge has been applied to the design of knowledge management tools, including those which use analogical or case-based reasoning (CBR). CBR is based on the premise that history repeats itself, but not exactly, and has been used to address many software engineering problems including cost or effort prediction. However, the variability of results is difficult to interpret.

Recent research interest in CBR, as a knowledge management tool, has emphasised algorithmic approaches typically used for well-defined problem solving. In contrast, an ill-defined problem is one in which the starting position, the allowable operations, or the goal state are not clearly specified, or a unique solution cannot be shown to exist. Project managers need to solve ill-defined (non-trivial) problems which demand cognitive strategies other than the use of algorithms. These higher cognitive strategies are the main focus of investigation in our study. However, research has found that personality impacts cognitive processes and will therefore affect problem solving ability and strategy. Thus we are also assessing project managers' personality using the Eysenck Personality Questionnaire.

The current pilot study is integrating knowledge from cognitive psychology and computer science to investigate how the cognitive aspects of accurate software cost estimation can best be applied when using analogy-based tools. We adopted a grounded theory approach in order to eliminate preconceived hypotheses about underlying cognitive processes. We assessed personality and interviewed a sample of experienced project managers employed on cost estimation tasks to gain an understanding of their background and the problem solving strategies they currently employ. Following these sessions, the project managers were given a typical task encountered when estimating project cost and asked to complete the task using the 'think aloud' protocol whilst utilising a CBR tool. The results are being used to develop more effective CBR tools and a better understanding of how to use them. This contributes to our ultimate goal, which is to enhance the accuracy and reliability of software project effort prediction.

Benefits:
This research aims to better understand the cognitive processes and the impact of personality on analogical problem solving. We are investigating how these psychological aspects impinge on cost estimation, specifically when the estimator is using software tools to facilitate his or her decisions. Knowledge management technology, for example case-based reasoning tools, is loosely based on human cognitive processes. However, we believe that knowledge of psychological processes has developed sufficiently over the past decade that these tools can now be improved by being based on a more realistic model. The work is both timely and novel and is therefore of interest to practitioners as well as researchers involved in project cost estimation.


10:05 Effective Project Portfolio balance by forecasting project's expected effort and its breakdown according to competence spread over expected project duration
Gaetano Lombardi, Vania Toccalini
Ericsson Telecomunicazioni, Italy

The proper allocation of an organisation's finite resources is crucial to its long-term prospects. The most successful organisations are those that have a project portfolio planning process in place. Because of the large number of scenarios, taking the right decision requires the use of quantitative techniques. At any given time, an organisation has a finite capacity to perform work, and although this capacity can be modified, the process of acquiring or reducing resources takes time. Because of this, the organisation needs to plan how much work to take in before taking any decisions to change the current capacity. Lack of early planning can lead to paralysis as a result of firefighting, or to inefficiencies in using available resources. On the other hand, making an early, reliable project plan starting from rough and incomplete market requirements is very difficult, and the error in planning is quite high, typically leading to underestimation and therefore to firefighting and inefficiencies.

For that reason, in the Project Office we have started a Six Sigma project to identify, from past project data, planning constants for time, effort and the effort distribution among the competencies needed in projects.

Specifically, the Six Sigma project has defined:
• Limits on total project duration and effort, using Monte Carlo simulation on historical data (see the sketch after this list). Optimal probability distribution functions fitting the historical data have been obtained using Normality Test and Individual Distribution Identification tools. Results have been grouped according to the current project categories.
• The average effort distribution among competencies, computed by Exploratory Data Analysis using box plots.
• The effort for each competence, spread over the duration of the project, modelled using Multiple Regression Analysis. Best Subsets Regression and correlation analysis have been carried out to find a suitable subset of predictors for each competence profile. Finally, the fitted models have been validated by residual plot analysis.
• A new process, Project Formulation and Scenario Analysis, describing how to use the previous indicators; it has been documented and is currently being piloted.
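The abstract does not publish the actual simulation; as a minimal sketch of the Monte Carlo step under assumed inputs, the Python snippet below fits a lognormal distribution to hypothetical historical effort data for one project category and derives planning limits from simulated draws. The distribution, the data and the percentiles are illustrative assumptions, not the authors' calibrated model.

import numpy as np
from scipy import stats

# Hypothetical historical total efforts (person-months) for one project category.
historical_effort = np.array([14, 18, 22, 25, 27, 31, 35, 40, 44, 52], dtype=float)

# Fit a candidate distribution (lognormal assumed here purely for illustration).
shape, loc, scale = stats.lognorm.fit(historical_effort, floc=0)

# Monte Carlo simulation: draw many plausible project efforts from the fitted model.
rng = np.random.default_rng(seed=42)
simulated = stats.lognorm.rvs(shape, loc=loc, scale=scale, size=100_000, random_state=rng)

# Planning limits: e.g. the 10th and 90th percentiles of the simulated efforts.
lower, upper = np.percentile(simulated, [10, 90])
print(f"Planning limits for this category: {lower:.1f} - {upper:.1f} person-months")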

Besides, the overall Resource Management Planning process has been reviewed and redefined with the introduction of a resource planning tool. The usage of the resource planning tool has been stabilised by releasing a work instruction on how and when to update resource requests and allocations in the database, based on project prioritisation. Resource planning is now regularly performed and reviewed on a monthly basis and before specific decision points in the projects. Therefore the data in the system are more reliable and can be used as a basis for scenario planning and to evaluate the application of the newly introduced Project Formulation and Scenario Analysis.

Benefits: The paper shows the results of a Six Sigma project aimed at improving the early planning capability in the Project Portfolio process. The application of the practice, based on the analysis of measurements and indicators, can help organisations improve their capability of controlling and exploiting their development capacity.


10:25 Defect Density Prediction with Six Sigma Methods Thomas Fehlmann Euro Project Office AG, Switzerland

Is it possible to know defect density in advance for software that’s going into production?

The answer is "yes", if you apply statistical methods to requirements and start a suitable measurement program. The Six Sigma toolbox provides statistical methods for building such prediction models.

This paper explains the basics of these statistical methods assuming the readers are familiar with Design for Six Sigma and Quality Function Deployment.

Defect Density Prediction
Testers would love to know how many defects remain undetected when they deliver to a customer or user of the software. For capacity planning and in service management, knowing in advance how many people will be needed for application support would be welcome.

Intuitively it seems impossible to predict the unknown future and to predict how many defects customers and users will detect when they start using the new application. In fact, not even the Six Sigma "Silver Bullet" knows the future. However, statistical methods exist for predicting the probability of finding defects by calculating the expected defect density. This works much like a weather forecast, where predicting humidity levels and temperature leads to rain or snowfall forecasts.

Statistical methods are the key; however, we must apply them to requirements and to the semantics of language statements rather than to physical entities.

Incidentally, the statistical methods for requirements engineering are those known from Six Sigma.

What are defects in Software Development?
In Six Sigma for Software, we distinguish two kinds of defects: requirements not seen or recognised, and badly implemented requirements. The former are much more adverse to success than the latter; they can impact project success far more heavily. Software crashes and malfunctions are more easily detectable and therefore cumbersome, but fixable.

The first kind of issue, missing requirements, we call "A-Defects"; the second kind, badly implemented requirements, we call "B-Defects" ("B" refers to "Bugs"). B-Defects block the application behaviour as specified; A-Defects block the overall usefulness of an application. B-Defects can be found by Verification; A-Defects sometimes by Validation, but only when using additional techniques for requirements completeness, such as Quality Function Deployment.

The foundation of any software measurement program therefore is to distinguish those two kinds of defects. Reviews and Tests in the software development cycle produce the measurement data needed.

Towards a Prediction Model for Defect Density
The prediction model uses standard techniques from physics. We use a multidimensional vector space with a topology that represents the defect density during the various stages of software development. In mathematics, this structure is called a "Manifold". The defect density function changes from the proposal stage to the requirements definition stage, to the design and implementation stage. In practice, these stages correspond to different views on the software requirements definition process.


This topology generates measurable signals when verifying and validating application software.

Nevertheless, measuring defects is not that easy. It is not sufficient to count entries in bug lists; these have no statistical relevance. More relevant are measurements that target the effort needed to avoid, find, and correct defects.

We prefer to call such effort "Learning Opportunities" rather than "bug cost" or "quality cost", because we want people to enjoy finding a mistake that otherwise might become a defect. The measurement problem is that we must rely on people to get data; there are few other possibilities to get evidence, e.g. from a project repository. We can count check-ins and check-outs, but we do not know whether these are due to bug fixes or to planned enhancements.

Thus we want software developers and testers to record their effort as Learning Opportunities rather than as bug fixing, and we get nervous when they record no such effort, rather than when they fix bugs.

The more Learning Opportunities are taken in the early stages of the project, the fewer defects we expect users to encounter at the end.

The second problem to address is the origins of defects. For this we use Deming’s generic process model adapted for software requirements, see Figure 1: Deming's Model for Software Requirements.

This model is well known and understood from the Six Sigma discipline of "Quality Function Deployment". The model uses different views into the multidimensional requirements space: the "Voice of the Customer" view, the "Voice of the Engineer" view, but also views from less well-known stakeholders in the project such as the competition ("New Lanchester Strategy") or, on the other hand, software development process engineers (SEPG) and software testers.

Between the different views exist the "Transfer Functions" known from Design for Six Sigma (DfSS). According to the well-known Walter-Wintersteiger principle, we can represent these transfer functions as linear mappings ("Quality is linear"). With this framework we can predict defect density in our multidimensional requirements space and create a Key Performance Indicator (KPI) for Software Development: the "Learning Opportunity Ratio" (LeOR), based on data from our measurement program.

It remains to explain the prediction model itself. Thomas Saaty, the inventor of the Analytic Hierarchy Process used for decision-making, found the mathematics behind it quite straightforward, although it is still not widely known: we need to find the eigenvectors of the transfer functions. Algorithms to do this effectively are readily available.
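The paper does not prescribe a specific algorithm; one common way to obtain the principal eigenvector of such a transfer matrix (as in Saaty's AHP) is power iteration, sketched below in Python with a made-up 3x3 matrix.

import numpy as np

def principal_eigenvector(matrix, iterations=100, tol=1e-10):
    # Power iteration: repeatedly apply the matrix to a vector and normalise,
    # converging to the eigenvector of the dominant eigenvalue.
    A = np.asarray(matrix, dtype=float)
    v = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(iterations):
        w = A @ v
        w = w / np.linalg.norm(w)
        if np.linalg.norm(w - v) < tol:
            break
        v = w
    return v / v.sum()   # scale so the weights add up to 1

# Hypothetical 3x3 transfer matrix relating, say, three business objectives
# to three customer needs (illustrative numbers only).
T = [[1.0, 0.4, 0.2],
     [0.6, 1.0, 0.3],
     [0.3, 0.5, 1.0]]
print(principal_eigenvector(T))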


Figure 1: Deming's Model for Software Requirements (diagram relating Customer's Needs, Business Objectives, Functionality, Voice of the Customer, Critical to Quality and Capability Maturity through transfer functions such as AcT → CN, AT → BO, FT → FN and CMM → CtQ, with measurements like #FP, #Bugs, #Market Share and #CMMI level attached to the respective views).

The Prediction Model thus has a static part, the transfer functions calculated from the eigenvectors for the requirements based on Quality Function Deployment, and a dynamic part, based on the effectiveness of the defect-removal techniques, which defines the LeOR. Predicting the defect density is then simply a calculation based on several linear matrices applied to the LeOR.
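No worked formula is given in the abstract; as a very rough reading of "linear matrices applied to the LeOR", the sketch below propagates a per-view learning-opportunity vector through an assumed transfer matrix and scales the result by an assumed functional size and injection rate to obtain an expected remaining-defect figure. Every number, and the specific combination, is an illustrative assumption only.

import numpy as np

# Hypothetical LeOR values per requirements view (proposal, definition, design),
# i.e. the share of potential defects addressed as Learning Opportunities.
leor = np.array([0.30, 0.45, 0.60])

# Hypothetical transfer matrix mapping view-level defect potential
# into the delivered product (rows: delivered defect classes, columns: views).
transfer = np.array([[0.5, 0.3, 0.2],
                     [0.2, 0.5, 0.3]])

# Defects that were *not* removed in each view propagate forward.
escaped_fraction = transfer @ (1.0 - leor)

# Calibrate with functional size: assume a baseline injection rate per size unit.
size_fp = 400                      # functional size (e.g. COSMIC CFP or IFPUG FP)
injection_rate = 0.05              # assumed defects injected per size unit
expected_remaining = escaped_fraction * injection_rate * size_fp
print(expected_remaining)          # expected remaining defects per defect class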

We are almost finished; however, the LeOR metric is a ratio and must be calibrated with the application size.

Functional Size for Calibrating the Model
In order to learn how many defects remain once the LeOR and its propagation through the development process are known, we only need the application size. To obtain such a sizing metric, we use the international standards ISO/IEC 20926 (IFPUG 4.2 Function Point Analysis, unadjusted, [FPA]) for measuring the size of business requirements, and ISO/IEC 19761 (COSMIC 3.0 Full Function Points, [FFP]) for technical views on performance and application architecture. Which one to choose depends upon the specific environment of the project. Based on these measurement instruments, we can effectively predict the defect density of software applications.

11:10 Coffee


11:35 Using metrics to evaluate user interfaces automatically Izzat Alsmadi, Muhammad AlKaabi Department of Computer Science and Information Technology, Yarmouk University, Jordan

User interfaces have special characteristics that differentiate them from the rest of the software code. Typical software metrics that indicate complexity and quality may not be able to distinguish a complex or high-quality GUI from one that is not. This paper suggests and introduces some GUI structural metrics that can be gathered dynamically using a test automation tool.

Rather than measuring quality or usability, the goal of the developed metrics is to measure GUI testability, that is, how hard or easy it is to test a particular user interface.

We evaluate GUIs for several reasons, such as usability and testability. In usability, users evaluate a particular user interface for how easy, convenient and fast it is to deal with. In our testability evaluation, we want to automate the process of measuring the complexity of the user interface from a testing perspective. Such metrics can be used as a tool to estimate the resources required to test a particular application.
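The abstract does not enumerate the proposed metrics; as a minimal illustration of the kind of structural measure that can be computed over a widget tree, the Python sketch below counts the controls and the maximum nesting depth of a hypothetical GUI hierarchy. The tree representation and both metrics are assumptions for illustration, not the authors' tool.

# Hypothetical widget tree: each node is (widget_name, [children]).
gui_tree = ("MainWindow", [
    ("MenuBar", [("FileMenu", []), ("EditMenu", [])]),
    ("Form", [("NameTextBox", []), ("OkButton", []), ("CancelButton", [])]),
])

def count_widgets(node):
    # Total number of controls in the tree (a simple size metric).
    _, children = node
    return 1 + sum(count_widgets(child) for child in children)

def max_depth(node):
    # Maximum nesting depth (deeper trees tend to be harder to reach in tests).
    _, children = node
    return 1 + max((max_depth(child) for child in children), default=0)

print("widgets:", count_widgets(gui_tree), "depth:", max_depth(gui_tree))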


12:05 Estimating web application development effort employing Cosmic size measure: a comparison between the use of a cross-company and a single company dataset Filomena Ferrucci a, Sergio Di Martino a, Carmine Gravino a, Luigi Buglione b a University of Salerno, Italy b École de Technologie Supérieure - ETS – Université du Québec, Canada Nexen - Gruppo Engineering, Italy

Early effort estimation is a critical activity for planning and monitoring web project development, as well as for delivering a high-quality product on time and within budget. Several studies have been conducted applying 1st generation FSM (Functional Size Measurement) methods (i.e. IFPUG FPA) as the (product) size unit for building an effort estimation model. The most recent FSM method is COSMIC. The method has been applied to several domains, including the prediction of web application development effort, and interesting results have been reported recently.

This paper will present and analyse the results of an empirical study carried out using data both from a single-company dataset and from the public benchmarking repository ISBSG r10.

In particular, the single-company dataset was obtained by collecting information about the total effort and the functional size (measured with COSMIC) of 15 web applications developed by an Italian software company. The second dataset was obtained by selecting the 16 ISBSG r10 projects sized with COSMIC and classified as 'web' applications. As for the estimation methods, two widely used techniques (OLS, Ordinary Least Squares regression, and CBR, Case-Based Reasoning) were applied to construct the prediction models, while leave-one-out cross validation was used to assess the accuracy of the estimates obtained with the models.
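The datasets themselves are not reproduced in the abstract; as a generic sketch of the validation procedure described (leave-one-out cross validation of an OLS size-to-effort model), the Python snippet below uses made-up size/effort pairs and reports the mean absolute error of the held-out estimates.

import numpy as np

# Hypothetical (COSMIC size, effort in person-hours) pairs for a small dataset.
size = np.array([120., 85., 200., 150., 95., 260., 175., 140.])
effort = np.array([900., 650., 1500., 1100., 700., 2000., 1300., 1050.])

abs_errors = []
for i in range(len(size)):
    # Leave one project out ...
    train = np.arange(len(size)) != i
    # ... fit a simple OLS line effort = a * size + b on the rest ...
    a, b = np.polyfit(size[train], effort[train], deg=1)
    # ... and estimate the held-out project.
    predicted = a * size[i] + b
    abs_errors.append(abs(predicted - effort[i]))

print("Mean absolute error:", np.mean(abs_errors))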

The OLS analysis was successfully applied to the single-company dataset. Indeed, the obtained prediction model was characterised by a good fit, and the estimates turned out to be accurate according to widely used evaluation criteria and thresholds.

On the other hand, it was not possible to apply OLS to the ISBSG dataset since the hypotheses underlying the method did not hold (even after applying data transformations). Indeed, the dataset is not homogeneous in either the effort or the size variables, and productivity varies considerably among the considered projects.

As for the application of CBR, good results were obtained with the single-company dataset, while poor prediction accuracy was obtained using the data from the ISBSG repository.

The paper will highlight in detail the differences between the two datasets, discuss the issues related to the use of a cross-company dataset by a software organisation, and derive suggestions about how public data could be exploited in order to obtain more accurate estimates of development effort.

Benefits:
• To receive suggestions about the usage of public repository data from software projects in order to improve the selection of the proper dataset for building the estimation model.
• To discuss the way web projects fit with FSM methods for estimation purposes.
• To discuss the profitable usage of cross-company datasets by a software organisation.


12:05 From Performance Measurement to Project Estimating using COSMIC Functional Sizing Cigdem Gencel a, Charles Symons b a Blekinge Institute of Technology, School of Engineering, Sweden b United Kingdom

The starting point for methods of estimating the effort for a project to develop a new piece of software early in its life is usually to determine the size of the new software and to use past measures of performance made on comparable projects, using the same sizing method, to produce the estimate. The COSMIC method of Functional Size Measurement (FSM) has been used both for performance measurement and for estimating, and good results have been reported.

Sizes measured using the COSMIC method are pure functional sizes, that is, they are totally independent of any technical or quality requirements for the software. Such sizes are therefore ideal for cross-technology performance comparisons.

Other commonly-used FSM methods, such as those of Albrecht (now IFPUG) and MkII FPA, use weights for their various base functional components (BFCs) that were originally derived from measurements of performance on limited numbers of projects, using a limited range of technologies. These and most other related FSM methods were, in fact, originally calibrated for use in estimating and actually define measures of 'standard-effort'; they are not pure measures of functional size. ('Standard-hour' measurements for defined tasks are commonly used in work-study measurement and in 'bottom-up' software project estimating methods.) A consequence is that the resulting size measures are, to a degree, dependent on the technology of the projects whose performance was originally measured and so are, in principle, less suitable for cross-technology performance comparisons.

This paper will examine these different ways of measuring sizes related to software. Our analysis suggests that one could apply different locally-calibrated weights to the counts of BFCs (Entries, Exits, Reads and Writes) of a COSMIC functional size measurement, depending on the technology, etc., to be used for the development. These weights would convert the BFC counts to produce local, COSMIC-based standard-effort measures in units of standard-hours. Such measures might enable even more accurate project effort estimates than using standard COSMIC functional sizes as input to an estimation process in the traditional way. We will present very early findings on the results of this exploration.
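As a rough illustration of the idea of locally calibrated weights (not the authors' actual procedure), the Python sketch below fits per-BFC weights by least squares on hypothetical historical projects and then converts a new measurement's Entry/Exit/Read/Write counts into a 'standard-effort' figure.

import numpy as np

# Hypothetical history: per-project counts of Entries, Exits, Reads, Writes
# and the measured effort in hours.
bfc_counts = np.array([[40, 35, 50, 20],
                       [60, 55, 80, 30],
                       [25, 20, 30, 10],
                       [80, 70, 110, 45],
                       [50, 45, 65, 25]], dtype=float)
effort_hours = np.array([820., 1250., 510., 1700., 1020.])

# Locally calibrated weights: least-squares fit of effort on the BFC counts
# (hours per Entry, Exit, Read and Write in this environment).
weights, *_ = np.linalg.lstsq(bfc_counts, effort_hours, rcond=None)
print("hours per E, X, R, W:", np.round(weights, 2))

# Convert a new project's BFC counts into a local 'standard-effort' measure.
new_project = np.array([55, 50, 70, 28], dtype=float)
print("standard-effort (hours):", round(float(new_project @ weights), 1))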

Benefits:
• A clear understanding of the origin of the weights of the various well-known functional size measurement methods and the consequences for current measurement practice, a subject that is poorly understood and rarely discussed nowadays.
• A description of the basic principles of 'top-down' estimating methods for use early in the life of a software project, when the primary input is a measure of the size of the functional user requirements, and of the importance of calibrating the parameters of such methods using local performance measurements.
• An exploration of how the components of a COSMIC functional size measurement might be weighted differentially, so as to produce locally-calibrated 'standard-effort' measures (in units of standard-hours) that would enable more accurate project effort estimation.


13:15 Lunch

14:30 A 'middle-out' approach to Balanced Scorecard (BSC) design and implementation for Service Management: Case Study in Off-shore IT-Service Delivery Srinivasa-Desikan Raghavan a, Monika Sethi a, Dayal Sunder Singh a, Subhash Jogia b a Tata Consultancy Services Ltd. (TCS), India b India

In this paper, we describe a BSC approach to Service Management for a portfolio of projects at Tata Consultancy Services Ltd. (TCS), India, with one of its valuable customers, a leading global Financial Services company. During the growth of their business relationship, there was a need to manage a critical portfolio of projects in 'Straight-Through-Processing' (STP) services, with special reference to customer feedback and KPI management. We chose the BSC approach to manage and control this flagship program, for the ease of design and for the clarity of communication amongst its stakeholders.

Since some of the projects were already in existence when the STP Program was sanctioned, we had to evolve a 'middle-out' approach to BSC design, instead of a traditional top-down mode. From the new set of 'objectives' of the STP Program, we designed the scorecards with appropriate lead and lag measures by iterating through both top-down and bottom-up approaches; we arrived at the measures and their targets by keeping the KPIs in focus, both at the STP-Program level and at the individual project level. While designing the scorecards, we re-used some of the (then) existing review measures and governance mechanisms. In the final mode of governance, we superimposed the new BSC-based STP-Program review while retaining the (then) existing review mechanisms at the project level. This has helped the program to track important program-specific measures, while facilitating need-driven data drill-down at individual projects.

Some of the critical success factors for BSC implementation were as follows:
• An effective change management approach to managing the implementation, by identifying early adopters and champions amongst the project teams, and by maintaining regular communication through training and town hall meetings (we call it 'socialising').
• Involving the stakeholders and the project managers through iterative discussions on the objectives of the Program and on the elements from the SLA and the KPIs at the customer relationship level; this became the leaven for useful scorecards with well-defined project management metrics, delivery performance (quality) metrics, customer satisfaction metrics and Knowledge Management metrics.
• Designing scorecards with measures that are independent at their scorecard level, besides the measures whose performance values are aggregated from those of lower levels; this has facilitated quick identification of 'root causes' and 'relationships' (if any) amongst scorecard elements while troubleshooting.
• An index (weighted average) based method of monitoring measures, for BSC perspectives and scorecards; this has helped in comparing the projects' performance (a minimal sketch of such an index follows this list).
• Making the scorecards visually 'pleasing' (we found this to be important after a quick deployment of a proof of concept!) and useful, by tracking the trends of important measures.
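The paper does not spell out its index formula; one common reading of an 'index (weighted average) based method' is sketched below in Python: each measure is normalised against its target and combined with weights into a single scorecard index. Measures, targets and weights are illustrative assumptions.

# Hypothetical scorecard: per measure, the achieved value, its target and a weight.
measures = {
    "on-time delivery %":    {"value": 92.0, "target": 95.0, "weight": 0.35},
    "defect removal eff. %": {"value": 88.0, "target": 90.0, "weight": 0.25},
    "customer satisfaction": {"value": 4.2,  "target": 4.5,  "weight": 0.25},
    "knowledge reuse index": {"value": 0.6,  "target": 0.8,  "weight": 0.15},
}

def scorecard_index(measures):
    # Weighted average of target attainment, capped at 100% per measure.
    total = 0.0
    for m in measures.values():
        attainment = min(m["value"] / m["target"], 1.0)
        total += m["weight"] * attainment
    return 100.0 * total / sum(m["weight"] for m in measures.values())

print(f"Scorecard index: {scorecard_index(measures):.1f} / 100")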


Benefits:
• A practical (industry) situation where Program Performance Management was designed and implemented.
• A mixed mode of both top-down and bottom-up approaches to BSC design and implementation.
• An index (weighted average) based method of monitoring measures, for BSC perspectives and for individual scorecards.
• A case of multi-location, KPI-driven scorecard implementation, with a Change Management focus.

14:00 Assessment of Software Process and Metrics to Support Quantitative Understanding: Experiences from an Undefined Task Management Process Ayca Tarhan a, Onur Demirors b a Computer Engineering Department of Hacettepe University, Turkey b Informatics Institute of Middle East Technical University, Turkey

Withdrawn due to insufficient research results.

15:40 Coffee


16:05 FP in RAI: the implementation of software evaluation process Anna Perrone, Marina Fiore, Monica Persello, Giorgio Poggioli RAI - Radiotelevisione Italiana, Italy

Since 2006 the RAI Information & Communication Technology Department has introduced a new relationship with its software development partners, moving from the use of time and materials to fixed-price contracts, which involve a fixed total price for a defined product to be provided.

Suppliers are legally obliged to complete such contracts: RAI ICT must precisely specify the product or service being procured, and any additional cost due to adverse performance is the responsibility of the Supplier, who is obliged to complete the effort.

In this new perspective, the advance estimate of software size has become the first fundamental step to establish price, cost and time and to negotiate with Suppliers. RAI ICT adopted the IFPUG Function Point metric to estimate projects' functional size, because it is a well-established method known as an international standard. As an additional metric, RAI ICT also used the Early & Quick Function Point methodology, which seems very useful when requirements are not well defined, for short-term projects, when contracting with Suppliers who know the context they work in very deeply, and especially now that the FP price is continuously decreasing.

The resulting FP estimate is then converted into cost, using the terms of the contract, to fix the project price.
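The actual contract terms are not given in the abstract; purely as an illustrative worked example of converting a functional size estimate into a fixed price, the Python snippet below multiplies an assumed FP count by an assumed unit price. All figures are hypothetical.

# Hypothetical conversion of a functional size estimate into a contract price.
estimated_fp = 320          # Function Points from the (E&Q or detailed) count
price_per_fp = 250.0        # contractual unit price in EUR per FP (assumed)
contingency = 0.10          # assumed margin for scope uncertainty

project_price = estimated_fp * price_per_fp * (1 + contingency)
print(f"Fixed project price: EUR {project_price:,.2f}")   # EUR 88,000.00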

To implement the estimation process we involved different stakeholders: first of all the PMO, which became the Competence Center for projects' functional estimates; the Project Managers, who used to estimate projects using FTEs; and the RAI Suppliers who develop the software.

Two years have gone by since the start of this process, and now RAI ICT can draw some conclusions from this experience. In particular, it is possible to analyse the investment in resource training, how to move beyond the traditional FTE measurements and, finally, to make some considerations about the positive aspects of this experience and the still open items.

Some data will be presented to give an idea of the experience, and some useful tips will be suggested for implementing a software functional estimation process.

Through a critical analysis of the IFPUG methodology and the direct experience, the paper will try to answer the following questions:
• What are the benefits of implementing a software estimation process?
• Are FP still relevant, and do they meet the requirement of providing a quantitative software development measurement?
• Why are they not so widely used?
• Can FP be used in any kind of software application? If yes, how?

Benefits:
• An interesting use case of FP as a metric of software size in outsourcing contracts.
• Sharing a practical software estimation process experience.
• Some considerations about IFPUG vs E&Q methods.
• Lessons learned from the estimation process implementation.


16:40 KEYNOTE: Implementation of a metrics program in a large organisation Ton Dekkers Galorath International Ltd, United Kingdom/Netherlands

17:15 Closing


Authors' affiliations

Muhammad AlKaabi
Department of Computer Science and Information Technology, Yarmouk University, Jordan
[email protected]
67 Using metrics to evaluate user interfaces automatically

Izzat Alsmadi
Department of Computer Science and Information Technology, Yarmouk University, Jordan
[email protected]
67 Using metrics to evaluate user interfaces automatically

Dr. Luigi Buglione
École de Technologie Supérieure - ETS – Université du Québec, Canada / Nexen - Gruppo Engineering, Italy
[email protected]
77 Estimating web application development effort employing Cosmic size measure: a comparison between the use of a cross-company and a single company dataset

Dr. Luigi Buglione is an Associate Professor at the École de Technologie Supérieure (ETS) – Université du Québec, Canada, and is currently working as a Quality & Process Engineer at Engineering.IT (formerly Atos Origin Italy and SchlumbergerSema) in Rome, Italy. Previously, he worked as a Software Process Engineer at the European Software Institute (ESI) in Bilbao, Spain. Dr. Buglione is a regular speaker at international conferences on Software Measurement and Quality and is part of the Board of Directors of the Italian Software Metrics Association (GUFPI-ISMA), where he coordinates the Software Measurement Committee (SMC).

He developed and took part in ESPRIT and Basque Government projects on metrics programs, EFQM models, the Balanced IT Scorecard and QFD for software, and was a reviewer of the SWEBOK project. He received a Ph.D. in Management Information Systems from LUISS Guido Carli University (Rome, Italy) and a degree in Economics from the University of Rome "La Sapienza", Italy. He is a Certified Software Measurement Specialist (IFPUG CSMS) Level 3.


Ton Dekkers
Galorath International Ltd, United Kingdom / Netherlands
[email protected]
129 KEYNOTE: Implementation of a metrics program in a large organisation

Ton Dekkers has been working as a practitioner, consultant, manager and trainer in the areas of project support, software measurement and quality assurance for over 15 years. Within these areas he specialises in estimating, performance measurement (Sizing, Goal-Question-Metric, Benchmarking), risk analysis and scope management.

In 1992 he developed the "FPA in enhancement" methodology, which has evolved into an alternative method to estimate enhancement projects. In 1996 he was involved in the development of Test Point Analysis, an extension to FPA for estimating strategy-based testing. He also plays a role in the development and promotion of the COSMIC Functional Size Measurement Method.

His current position is Director of Consulting for Galorath International. In addition to his regular job he is Immediate Past President of the International Software Benchmarking Standards Group (ISBSG), Vice President of the Netherlands Software Measurement Association (NESMA), member of the International Advisory Committee of COSMIC and ISBSG Director-At-Large in the Project Management Institute (PMI) Metrics Specific Interest Group (MetSIG).

Thomas Fehlmann
Euro Project Office AG, Switzerland
[email protected]
59 Defect Density Prediction with Six Sigma Methods

Marina Fiore
RAI - Radiotelevisione Italiana, Italy
117 FP in RAI: the implementation of software evaluation process

Marina Fiore received her degree in Physics in 1992. She has worked in the RAI ICT Department since 1997, at first as a business analyst and since 1998 in the PMO, where she has gained relevant experience in project management methodology, tools and techniques. Since 2005 she has been learning FPA and E&Q FP and she has practical experience of FP software estimation.

In 2007 she became PMP certified.


Prof. Filomena Ferrucci
University of Salerno, Italy
[email protected]
77 Estimating web application development effort employing Cosmic size measure: a comparison between the use of a cross-company and a single company dataset

Prof. Filomena Ferrucci received the Laurea degree in computer science (cum laude) from the University of Salerno, Italy, in 1990 and the PhD degree in applied mathematics and computer science from the University of Naples, Italy, in 1995. From 1995 to 2001 she was a research associate at the University of Salerno, where she is currently an Associate Professor of computer science and teaches courses on software engineering and Web information systems. She was program co-chair of the 14th International Conference on Software Engineering and Knowledge Engineering and program co-chair of the 2nd, 3rd, 4th, 5th and 6th editions of the International Summer School on Software Engineering. She has served as a program committee member for several international conferences in the areas of software engineering, web engineering and human-computer interaction. Her main research interests are in software metrics for the effort estimation of OO systems and Web applications. Her research interests also include software-development environments, visual languages, human-computer interaction and e-learning. She is co-author of several scientific papers published in international journals, books and proceedings of refereed international conferences.

Dr. Cigdem Gencel
Blekinge Institute of Technology, School of Engineering, Department of Systems and Software Engineering, Sweden
[email protected]
91 From Performance Measurement to Project Estimating using COSMIC Functional Sizing

Cigdem Gencel is an assistant professor at the Department of Systems and Software Engineering of Blekinge Institute of Technology, Sweden. She holds a Ph.D. degree from the Information Systems Department of the Middle East Technical University in Turkey and completed her post-doctoral research at the same university. She also works as a part-time consultant on software measurement, estimation and process improvement. She has been giving software size and effort estimation training to software organisations since 2004. She is a member of the International Advisory Committee of COSMIC. Her other interest areas include software project management and software requirements elicitation.


Dr. Carmine Gravino
University of Salerno, Italy
[email protected]
77 Estimating web application development effort employing Cosmic size measure: a comparison between the use of a cross-company and a single company dataset

Dr. Carmine Gravino received the Laurea degree in Computer Science (cum laude) in 1999 and his PhD in Computer Science from the University of Salerno (Italy) in 2003. Since March 2006 he has been Assistant Professor in the Department of Mathematics and Informatics at the University of Salerno. His research interests include software metrics to estimate web application development effort, software-development environments, and design pattern recovery from object-oriented code.

Subhash Jogia
India
105 A 'middle-out' approach to Balanced Scorecard (BSC) design and implementation for Service Management: Case Study in Off-shore IT-Service Delivery

Subhash Jogia is experienced in Program Management and holds a graduate degree in Aeronautical Engineering from Imperial College, University of London. He has 21 years of professional experience in the IT industry in the Banking and Financial Services domain.

Prof.dr. Rob J. Kusters (1957)
Eindhoven University of Technology, Netherlands
1 Results of an empirical study on measurement in Maturity Model-based software process improvement

Rob Kusters obtained his master degree in econometrics at the Catholic University of Brabant in 1982 and his PhD in operations management at Eindhoven University of Technology in 1988. He is professor of 'ICT and Business Processes' at the Dutch Open University in Heerlen, where he is responsible for the master program 'Business Process Management and IT'. He is also an associate professor of 'IT Enabled Business Process Redesign' at Eindhoven University of Technology, where he is an associate member of the research school BETA, which focuses on operations management issues. He has published over 70 papers in international journals and conference proceedings and co-authored five books. His research interests include enterprise modelling, software quality and software management.


Gianfranco Lanza
CSI Piemonte, Italy
[email protected]
25 IFPUG Function Point or COSMIC Function Point?

Gianfranco Lanza graduated in Computer Science in 1985. His first experience with Function Points dates from 1998, and he has been a CFPS since January 2007. In CSI Piemonte he handles the functional sizing process for software applications and the effort prediction model for their implementation.

Gianfranco participated in the translation into Italian of the IFPUG Counting Rules 4.2 and the COSMIC Manual 3.0.

He has attended assemblies of GUFPI-ISMA (Gruppo User Function Point Italia – Italian Software Metrics Association), the CPC (Counting Practice Committee) and the SBC (Software Benchmarking Committee). He also attended previous SMEF editions.

Gianfranco presented at GUFPI and at SMEF 2007 on the application of software metrics in CSI Piemonte.

Since the first of January 2009 he has been a member of the board of GUFPI-ISMA.

Gaetano Lombardi
Ericsson Telecomunicasioni Italia SpA, Italy
[email protected]
49 Effective Project Portfolio balance by forecasting project's expected effort and its break down according to competence spread over expected project duration

Gaetano Lombardi is currently Project Office Manager at Ericsson Telecomunicasioni Italy, with more than 15 years of experience in handling projects to develop complex telecommunications systems. He has cooperated with several Italian universities to investigate decision models and, in particular, together with CNR in Pisa, he has published several papers on Software Testing and Software Metrics. In 1996-1998 he was first responsible for implementing practices according to CMM Level 3 and then Engineering Group Leader with responsibility for implementing practices according to CMM Level 4 in Ericsson R&D Italy.

Dr. Carolyn Mair
Southampton Solent University, United Kingdom
[email protected]
37 A Cognitive Perspective on Analogy-based Project Estimation

Miriam Martincova
Southampton Solent University, United Kingdom
[email protected]
37 A Cognitive Perspective on Analogy-based Project Estimation


Dr. Sergio Di Martino
University of Salerno, Italy
[email protected]
77 Estimating web application development effort employing Cosmic size measure: a comparison between the use of a cross-company and a single company dataset

Dr. Sergio Di Martino received the Laurea degree in Computer Science (cum laude) in 2001 and his PhD in Computer Science from the University of Salerno (Italy) in 2005. Since 2007 he has been Assistant Professor in the Department of Physics at the University of Naples "Federico II". His research interests include software metrics to estimate web application development effort, Human-Computer Interaction, and Data Visualisation.

Roberto Meli
D.P.O. – Data Processing Organization Srl., Italy
[email protected]
- KEYNOTE How to bring "software measurement" from research labs to the operational business processes

Roberto Meli graduated summa cum laude in Computer Science in 1984. Since 1996 he has been General Manager of DPO Srl.

For the past 15 years he has worked as an expert in project management and software measurement and has written articles and papers for technical magazines and international conferences. In 1996 and 2001 Meli passed and renewed the IFPUG exams to become a Certified Function Points Specialist (CFPS). He is a consultant and lecturer in training courses on project management and software measurement for many major Italian companies and public organisations.

He invented and developed the Early & Quick Function Point Analysis method. Roberto is an active member of the Project Management Institute and past Chairperson of the COSMIC Measurement Practices Committee.

Anna Perrone
RAI - Radiotelevisione Italiana, Italy
117 FP in RAI: the implementation of software evaluation process

Anna Perrone received her degree in Computer Science in 1990. After a period as a software consultant, in 1992 she started working in the RAI ICT Department. At the beginning she worked as a software engineer, and in 1998 she became head of the PMO, which is responsible for coordinating the ICT projects, programs and portfolio. She has significant experience in project management, in FP methodology and in supplier contract management.


Monica Persello
RAI - Radiotelevisione Italiana, Italy
[email protected]
117 FP in RAI: the implementation of software evaluation process

Monica Persello received her degree in Physics in 1994. After two years as a fellow at CERN, in 1997 she started working in the RAI ICT Department, at first as a business analyst and since 1998 in the PMO, where she has gained relevant experience in project management methodology, tools and techniques. Since 2005 she has been learning FPA and E&Q FP and she has practical experience of FP software estimation.

In 2006 she became PMP certified.

Giorgio Poggioli
RAI - Radiotelevisione Italiana, Italy
117 FP in RAI: the implementation of software evaluation process

Giorgio Poggioli has been working in the RAI ICT Department since 1987. He has significant experience in mainframe and client-server systems, and since 2000 he has worked in the PMO, where he has gained relevant experience in project management methodology, tools and techniques. He is the MS Project administrator; since 2005 he has been learning FPA and E&Q FP and he has practical experience of FP software estimation.

Srinivasa-Desikan Raghavan
Tata Consultancy Services Ltd. (TCS), India
[email protected]
105 A 'middle-out' approach to Balanced Scorecard (BSC) design and implementation for Service Management: Case Study in Off-shore IT-Service Delivery

Srinivasa-Desikan Raghavan is a certified PMP and holds a doctoral degree from the Indian Institute of Management – Bangalore, as well as a Bachelor's degree in Technology (B.Tech.). He has over 20 years of professional experience spanning the IT industry, academia and Government.


Ir. Jana Samalikova (1974)
Eindhoven University of Technology, Netherlands
1 Results of an empirical study on measurement in Maturity Model-based software process improvement

Jana Samalikova obtained her master degree in business information systems at the University of Economics in Bratislava, Slovakia, in 2000. She obtained her professional doctorate in engineering at the Eindhoven University of Technology in 2003. She started her PhD in process-mining based software process improvement in the research school BETA in 2007. Her research interests include software process improvement, quality assurance and process mining.

Monika Sethi
Tata Consultancy Services Ltd. (TCS), India
105 A 'middle-out' approach to Balanced Scorecard (BSC) design and implementation for Service Management: Case Study in Off-shore IT-Service Delivery

Monika Sethi is a Management graduate from a premier school and holds a Bachelor's degree in Engineering (B.E.). She has 2 years of professional experience in the IT industry, primarily in the Project Management space.

Prof. Martin Shepperd
Brunel University, United Kingdom
[email protected]
37 A Cognitive Perspective on Analogy-based Project Estimation

Dayal Sunder Singh
Tata Consultancy Services Ltd. (TCS), India
[email protected]
105 A 'middle-out' approach to Balanced Scorecard (BSC) design and implementation for Service Management: Case Study in Off-shore IT-Service Delivery

Dayal Sunder Singh is a certified PMP and holds a post-graduate degree from the College of Engineering, Chennai. He has 16 years of professional experience in the IT industry in the Banking and Financial Services domain.


Dr. Jari Soini
Tampere University of Technology (TUT), Finland
[email protected]
11 Practical viewpoints for improving software measurement utilisation

Jari Soini works at Tampere University of Technology (TUT), Pori, Finland, as a researcher. His research interests include software process improvement (SPI) and especially software metrics and measurement. He has four years' experience of developing and implementing software products, and currently his duties at TUT include managing software research projects and participating in teaching software engineering. He is also a member of the Centre of Software Expertise unit (CoSE) at TUT, Pori (http://www.tut.fi/cose).

Mark Stephens
United Kingdom
[email protected]
37 A Cognitive Perspective on Analogy-based Project Estimation

Charles Symons
United Kingdom
[email protected]
91 From Performance Measurement to Project Estimating using COSMIC Functional Sizing

Charles Symons has almost 50 years of experience in the use of computers for business and scientific purposes, in all the major disciplines of the Information Systems function. He has published original work in computer use accounting, data analysis, computer security, and software measurement and estimating. He has led consulting projects to develop IS strategies and to improve IS performance.

His interest in software measurement and estimating began in the 1980s, when he developed the MkII FP sizing and estimating methods. He is now semi-retired, but continues as joint project leader of COSMIC, the Common Software Measurement International Consortium.


Vania Toccalini
Ericsson Telecomunicasioni Italia SpA, Italy
[email protected]
49 Effective Project Portfolio balance by forecasting project's expected effort and its break down according to competence spread over expected project duration

Vania Toccalini is Project Quality Manager at Ericsson Telecomunicasioni Italia, with more than 8 years of experience in Product Quality activity management, mainly for transmission and transport network products. She is currently a candidate for Six Sigma Black Belt certification, being responsible for an improvement project on Project Planning practices in a multi-project environment.

Dr.ir. Jos J.M. Trienekens (1952)
Eindhoven University of Technology, Netherlands
[email protected]
1 Results of an empirical study on measurement in Maturity Model-based software process improvement

Jos Trienekens is an Associate Professor at TU Eindhoven (University of Technology – Eindhoven) in the area of ICT systems development. His current research interests include software process improvement, software quality and software metrics. He is working on a research program on Software Management and is an associate member of the research school BETA. Over the last ten years Jos Trienekens has published various books and papers in international journals and conference proceedings. He has joined several international conferences as a member of the organisation committees and PCs.
