Assessing and Improving the Quality of Modeling

A Series of Empirical Studies about the UML

PROEFSCHRIFT (dissertation)

to obtain the degree of doctor at the Technische Universiteit Eindhoven, on the authority of the Rector Magnificus, prof.dr.ir. C.J. van Duijn, to be defended in public before a committee appointed by the College voor Promoties on Wednesday 24 October 2007 at 16:00

by

Christian Franz Josef Lange

born in Tegelen


This dissertation has been approved by the promotors:

prof.dr. S. Demeyer and prof.dr. M.G.J. van den Brand

Copromotor: dr. M.R.V. Chaudron


First promotor: prof.dr. Serge Demeyer (Universiteit Antwerpen)
Second promotor: prof.dr. Mark G.J. van den Brand (Technische Universiteit Eindhoven)
Copromotor: dr. Michel R.V. Chaudron (Technische Universiteit Eindhoven)

Other members of the core committee:
prof.dr. Lionel Briand (Simula Research Laboratory, Oslo, Norway)
prof.dr. Arie van Deursen (Technische Universiteit Delft)
dr. Johan Lukkien (Technische Universiteit Eindhoven)

The work in this thesis has been carried out under the auspices of the research school IPA (Institute for Programming research and Algorithmics).

IPA dissertation series 2007-14.

© Christian F.J. Lange, 2007.

Printing: Printservice Technische Universiteit Eindhoven
Cover design: Oranje Vormgevers, Eindhoven
The figure on the cover page resembles a UML-City view, which is described in Chapter 7.

CIP-DATA LIBRARY TECHNISCHE UNIVERSITEIT EINDHOVEN

Lange, Christian Franz Josef

Assessing and Improving the Quality of Modeling :A Series of Empirical Studies about the UML /door Christian Franz Josef Lange. –Eindhoven : Technische Universiteit Eindhoven, 2007.Proefschrift. – ISBN 978-90-386-1107-5NUR 980Subject headings: software development / design / formal languages ; UML /software quality / human perception errorsCR Subject Classification (1998): D.2.2, D.2.7, D.2.8, D.2.9, H.5.2, K.6.3


Für meine Eltern (For my parents)


Preface

During the process of my doctoral studies, I have gained many valuable experiences about software engineering, about research, and about myself. I am very pleased to finally have the opportunity to express my gratitude to the people who have contributed to this process.

First of all, I owe my sincere gratitude to my supervisor Michel Chaudron, who inspired my interest in software engineering and modeling. Thank you very much for all the interesting discussions, your endless encouragement to conduct case studies and experiments, the large freedom you allowed me for my research, and for a personal and friendly collaboration.

I would like to thank Serge Demeyer and Mark van den Brand for their willingness to adopt me as a PhD student and to become my eerste promotor and tweede promotor, respectively. I am very thankful for your guidance and constructive comments concerning my dissertation.

Due to my plan of finishing this work in October, the thesis had to be reviewed during the holiday period. For reading my work during their well-deserved holidays and for serving on my doctoral committee I owe my thanks to Lionel Briand, Arie van Deursen, and Johan Lukkien.

I owe many thanks to those with whom I collaborated as part of my dissertation project. The MSc students Marcel Wijns and Maurice Termeer did a great job on the implementation of the tool MetricView Evolution. I enjoyed the productive collaboration with Marcel van Amstel and Dennis van Opzeeland during their MSc projects within the EmpAnADa project. It was a great pleasure to collaborate with Bart DuBois on the modeling conventions experiment. The collaboration with Johan Muskens in the early phase of this project was very inspiring. A herzliches Dankeschön to Teade Punter for very useful discussions about empirical software engineering, for valuable hints, and for his pleasurable comradeship on trips to Shanghai and Genoa.

Within our research group SAN, there is a great atmosphere for doing research. I would like to thank all group members for creating this atmosphere, for many fruitful discussions, and for all the enjoyable lunches. In particular I would like to thank Reinder Bril for excellent reviews of several papers, Richard Verhoeven, who is always willing to help with technical problems but also has an eye for other issues, Cecile Brouwer for her anticipatory assistance and her cordiality, and my former roommate Louis van Gool for keeping up the spirit of Limburg. I would like to thank my roommates that have not been mentioned so far for their pleasurable collegiality and cooperation: Bas Flaton, Martijn van der Horst, Dimitri Jarnikov, René Ladan, and Jiang Zhang.

I am very thankful to all my friends and my family for their support over the years. In particular I would like to thank Marc van Eyk and Rob Laumans for being my seconds.

This work would not have been possible without my loving parents Marlies and Eberhard Lange. Thank you for your support, your love, and for always being there when I need you. Your presence enables me to accomplish all this. Therefore I dedicate this thesis to you.

Christian Lange
Nettetal, August 24th, 2007


Publications

The following papers were published during my PhD project. The list is given in chronological order.

1. Konsistenz und Vollständigkeit industrieller UML Modelle [Consistency and Completeness of Industrial UML Models].
C.F.J. Lange and M.R.V. Chaudron.
Proceedings of Modellierung 2004. Gesellschaft für Informatik. March 2004.

2. An Empirical Assessment of Completeness in UML Designs.
C.F.J. Lange and M.R.V. Chaudron.
Proceedings of the 8th Conference on Empirical Assessment in Software Engineering (EASE04). May 2004. [112]
Chapter 4 is an extension of this paper.

3. Investigations in Applying Metrics to Multi-View Architecture Models.
J. Muskens, M.R.V. Chaudron, and C.F.J. Lange.
Proceedings of the 30th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA). September 2004. [148]

4. An Exploratory Study on the Industrial Use of UML: Improving Control over Design Quality.
C.F.J. Lange and M.R.V. Chaudron.
Proceedings of the JACQUARD Conference. February 2005.

5. Combining Metrics Data and the Structure of UML Models using GIS Visualization Approaches.
C.F.J. Lange and M.R.V. Chaudron.
Proceedings of the IEEE International Conference on Information Technology. April 2005.

6. Quantitative Techniques for the Assessment of Correspondence between UML Designs and Implementations.
D.J.A. van Opzeeland, C.F.J. Lange, and M.R.V. Chaudron.
Proceedings of the 9th Workshop on Quantitative Approaches in Object-Oriented Software Engineering (QAOOSE), co-located with ECOOP'05. July 2005. [190]

7. Managing Model Quality in UML-based Software Development.
C.F.J. Lange and M.R.V. Chaudron.
Proceedings of the IEEE Conference on Software Technology and Engineering Practice (STEP), co-located with ICSM'05. September 2005.
Chapter 3 is an adapted version of this paper.

8. Visual Exploration of Combined Architectural and Metric Information.
M. Termeer, C.F.J. Lange, A. Telea, and M.R.V. Chaudron.
Proceedings of the 3rd IEEE International Workshop on Visualizing Software for Understanding and Analysis (VISSOFT'05), co-located with ICSM'05. September 2005.

9. In Practice: UML Software Architecture and Design Description.
C.F.J. Lange, M.R.V. Chaudron, and J. Muskens.
IEEE Software, Volume 23, Issue 2, March 2006.

10. Improving the Quality of UML Models in Practice.
C.F.J. Lange.
Proceedings of the 28th International Conference on Software Engineering (ICSE'06), Doctoral Symposium. May 2006.

11. Effects of Defects in UML Models - An Experimental Investigation.
C.F.J. Lange and M.R.V. Chaudron.
Proceedings of the 28th International Conference on Software Engineering (ICSE'06), Research Track. May 2006. [115]
Chapter 5 is based on this paper.

12. Towards Task-Oriented Modeling using UML.
C.F.J. Lange, M.A.M. Wijns, and M.R.V. Chaudron.
Proceedings of the 10th Workshop on Quantitative Approaches in Object-Oriented Software Engineering (QAOOSE), co-located with ECOOP'06. July 2006.
This paper is a preliminary version of [120].

13. An Experimental Investigation of UML Modeling Conventions.
C.F.J. Lange, B. DuBois, M.R.V. Chaudron, and S. Demeyer.
Proceedings of the ACM/IEEE International Conference on Model Driven Engineering Languages and Systems (MoDELS'06). October 2006. [118]
Chapter 6 is based on this paper.


14. Model Size Matters.
C.F.J. Lange.
Proceedings of the 1st Workshop on Model Size Metrics (co-located with MoDELS'06). October 2006.
Selected as Best Paper.

15. A Quantitative Investigation of UML Modeling Conventions.
B. DuBois, C.F.J. Lange, S. Demeyer, and M.R.V. Chaudron.
Proceedings of the 1st Workshop on Quality in Modeling (co-located with MoDELS'06). October 2006.
Selected as Best Paper.

16. A Visualization Framework for Task-Oriented Modeling using UML.
C.F.J. Lange, M.A.M. Wijns, and M.R.V. Chaudron.
Proceedings of the 40th Annual Hawaii International Conference on System Sciences (HICSS'07). January 2007. [120]
Chapter 7 is an adapted version of this paper.

17. MetricView Evolution: UML-based Views for Monitoring Model Evolution and Quality.
C.F.J. Lange, M.A.M. Wijns, and M.R.V. Chaudron.
Proceedings of the 11th European Conference on Software Maintenance and Reengineering (CSMR 2007), Tool Demo. March 2007.

18. Interactive Views to Improve the Comprehension of UML Models - An Experimental Validation.
C.F.J. Lange, M.A.M. Wijns, and M.R.V. Chaudron.
Proceedings of the 15th IEEE International Conference on Program Comprehension (ICPC'07). June 2007. [116]
Chapter 8 is based on this paper.

19. Four Automated Approaches to Analyze the Quality of UML Sequence Diagrams.
M.F. van Amstel, C.F.J. Lange, and M.R.V. Chaudron.
Proceedings of the 1st IEEE International Workshop on Quality-Oriented Reuse of Software (QUORS'07), co-located with COMPSAC'07. July 2007.

20. Supporting Task-Oriented Modeling using Interactive UML Views.
C.F.J. Lange, M.A.M. Wijns, and M.R.V. Chaudron.
Accepted for publication in the Journal of Visual Languages and Computing (JVLC), Elsevier. To appear in 2007.
This paper is an extended version of [120].

21. Defects in Industrial UML Models – A Multiple Case Study.
C.F.J. Lange and M.R.V. Chaudron.
Proceedings of the 2nd Workshop on Quality in Modeling (co-located with MoDELS'07). October 2007.
Chapter 4 is an adapted version of this paper.


Contents

1 Introduction
  1.1 General Introduction
  1.2 The Unified Modeling Language
    1.2.1 Description of the Diagram Types
    1.2.2 Characteristics and Risks
    1.2.3 Use of the UML
  1.3 Problem Statement
  1.4 Research Questions
  1.5 Empirical Research Method
  1.6 Contributions and Outline

2 Background
  2.1 Purposes of Using the UML
  2.2 Studying the Use of the UML
    2.2.1 General Modeling
    2.2.2 Particular Aspect
  2.3 Managing Quality in UML-based Software Engineering
  2.4 UML Tools
    2.4.1 Activities supported by UML Tools
    2.4.2 Gentleware Poseidon
    2.4.3 SDMetrics
  2.5 Conclusions

3 How to Evaluate Model Quality
  3.1 Introduction
  3.2 Software Quality
    3.2.1 Perspective on Quality
    3.2.2 Existing Approaches
  3.3 Model Quality
  3.4 A Quality Model for UML
    3.4.1 Concepts in this Model and their Definitions
    3.4.2 Relations between Concepts
  3.5 Applying and Tailoring the Quality Model
  3.6 Conclusions

4 Defects in UML Models
  4.1 Introduction
    4.1.1 Related Work
    4.1.2 Structure of this Chapter
  4.2 Description
    4.2.1 Design of the Case Study
    4.2.2 Approach
    4.2.3 Set of Defect Types
  4.3 Results
    4.3.1 Analysis of the Models
    4.3.2 Observations
    4.3.3 Threats to Validity
  4.4 Conclusions

5 Effects of Defects
  5.1 Introduction
  5.2 Related Work
  5.3 Defect Types
  5.4 Experiment Design
    5.4.1 Design
    5.4.2 Objects and Task
    5.4.3 Subjects
    5.4.4 Preparation
    5.4.5 Operation
    5.4.6 Variables
    5.4.7 Hypotheses
    5.4.8 Analysis Techniques
  5.5 Results
    5.5.1 Outlier Analysis
    5.5.2 LRQ1: Defect Detection
    5.5.3 LRQ2: Variation of Interpretations
  5.6 Additional Observations
    5.6.1 Domain Knowledge
    5.6.2 Prevailing Diagram
    5.6.3 Comparing Students' and Professionals' Results
  5.7 Threats to Validity
    5.7.1 Internal Validity
    5.7.2 External Validity
    5.7.3 Construct Validity
    5.7.4 Conclusion Validity
  5.8 Conclusions and Future Work

6 Modeling Conventions to prevent Defects
  6.1 Introduction
  6.2 Modeling Conventions
    6.2.1 Related Work
    6.2.2 Modeling Conventions in this Experiment
  6.3 Experiment Design
    6.3.1 Purpose and Hypotheses
    6.3.2 Design
    6.3.3 Objects and Task
    6.3.4 Subjects
    6.3.5 Operation
    6.3.6 Data Collection
    6.3.7 Analysis Techniques
  6.4 Results
    6.4.1 Outlier Analysis
    6.4.2 H1: Presence of Defects
    6.4.3 H2: Effort
    6.4.4 Attitude
    6.4.5 Adherence to the Treatment
  6.5 Threats to Validity
    6.5.1 Internal Validity
    6.5.2 External Validity
    6.5.3 Construct Validity
    6.5.4 Conclusion Validity
  6.6 Conclusions

7 Task-Oriented Views
  7.1 Introduction
  7.2 A Framework for Task-Oriented Views
    7.2.1 Tasks
    7.2.2 Properties
    7.2.3 Views
  7.3 Proposed Task-Oriented Views
    7.3.1 MetaView
    7.3.2 MetricView
    7.3.3 UML-City View
    7.3.4 Quality Tree View
    7.3.5 Context View
    7.3.6 Evolution View
    7.3.7 Search and Highlight
  7.4 Related Work

8 Validation of Task-Oriented Views for Comprehension
  8.1 Introduction
  8.2 Experiment Design
    8.2.1 Purpose and Hypotheses
    8.2.2 Task, Objects and Treatment
    8.2.3 Subjects
    8.2.4 Design
    8.2.5 Preparation
    8.2.6 Operation
    8.2.7 Variables
    8.2.8 Analysis Techniques
  8.3 Results
    8.3.1 Outlier Analysis
    8.3.2 Descriptive Statistics
    8.3.3 Hypothesis Testing
    8.3.4 Comparing the Results of both Runs
    8.3.5 Subjective Evaluation
  8.4 Threats to Validity
    8.4.1 Internal Validity
    8.4.2 External Validity
    8.4.3 Construct Validity
    8.4.4 Conclusion Validity
  8.5 Conclusions

9 Conclusions
  9.1 RQ1: Quality
    9.1.1 Contributions
    9.1.2 Future Work
  9.2 RQ2: Industrial Case Studies
    9.2.1 Contributions
    9.2.2 Future Work
  9.3 RQ3: Effects of Defects
    9.3.1 Contributions
    9.3.2 Future Work
  9.4 RQ4: Modeling Conventions
    9.4.1 Contributions
    9.4.2 Future Work
  9.5 RQ5: Task-oriented Views
    9.5.1 Contributions
    9.5.2 Future Work

Bibliography

A Effects of Defects Experiment
  A.1 The Agreement Measure
  A.2 Raw Result Data

B Modeling Conventions Experiment
  B.1 Modeling Conventions
  B.2 Post-test Questionnaire

C Task-Oriented Views Experiment
  C.1 Subjects
  C.2 Assumptions of the Student t-test

Summary
Samenvatting
Zusammenfassung
Curriculum Vitae

Chapter 1

Introduction

1.1 General Introduction

The size of software systems has become extremely large, ranging up to millions of lines of source code. It is common that projects creating these systems require the contributions of hundreds of project members and last several years. This emphasizes the software industry's challenge of managing the large complexity of developing and maintaining software systems. To tackle these challenges, the software industry needs proficient tools and methods.

Software engineering is the discipline of developing and systematically applying theories, tools, and methods for the development and maintenance of software [178]. A large variety of methods is applied within software engineering to serve particular purposes.

The software engineering activity of particular interest in this thesis is modeling. Models are representations of a (software) system at an abstract level [170]. The purpose of models is to reduce the aforementioned complexity of software to a level that is comprehensible to humans [33][149]. Models describe the relevant issues, such as important design decisions, and omit non-relevant issues. Models are described textually or using a graphical notation. Within software engineering, models are used in various phases of the software life cycle [191]. For example, in the requirements engineering phase, models are used to specify requirements and to communicate between users, clients, and developers. Further along the life cycle, in the architecture and design phases, models describe the system at different levels of abstraction. These models are used to communicate design decisions between developers. During the maintenance phase, models are used as documentation of the system. We will detail further applications from the continuously growing range of modeling purposes where needed.


1.2 The Unified Modeling Language

The Unified Modeling Language (UML) [152] is the de facto standard for modeling software systems. Therefore we focus on the UML in this thesis; however, we expect most results to be transferable to other modeling languages with characteristics similar to the UML's. The UML is a graphical notation. Its first version dates from 1997. Since then, several revised and extended versions have appeared. In its most recent version, UML 2.0, the language provides thirteen diagram types, such as the use case diagram, the class diagram, and the activity diagram. Each diagram type provides a different view on the described system. Standards such as IEEE 1471 for Architectural Descriptions [88] suggest using views to address stakeholder concerns. Various kinds of stakeholders take part in software projects: programmers, architects, clients, prospective users, and many more. All of them have different information needs and are interested in different concerns. By providing different diagrams and views, the UML can cope with this variety of stakeholder needs. By separating concerns [52] and hiding unrequired details, a UML representation of a system reduces the system's complexity to improve the stakeholders' understanding of it.

The UML emerged from the so-called method war fought in the early 1990s. The creators of the competing notations for object-oriented design (Booch [22], Rumbaugh et al. [169], Jacobson et al. [90], and others) joined forces and combined the most useful concepts of their notations. As a result, the UML is a general-purpose modeling language that has been widely accepted since.

The UML is specified by means of the UML Superstructure [152] using a meta model. The meta model defines the model element types of the UML, such as classes, packages, and states, and their relationships. A UML model consists of model elements and the relationships between them. The model elements may appear in the diagrams that belong to the model.
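To make the distinction between a model and its diagrams concrete, the following minimal sketch (illustrative Python, not taken from the thesis or from any UML tool; all class and field names are our own assumptions) represents a model as elements plus relationships, with diagrams merely referencing subsets of the elements:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelElement:
    """A model element, e.g. a class, package, or state."""
    element_id: str
    kind: str  # e.g. "Class", "Package", "State"
    name: str

@dataclass
class Diagram:
    """A diagram shows a subset of the model's elements."""
    name: str
    element_ids: set = field(default_factory=set)

@dataclass
class Model:
    """A UML model: elements and relationships; diagrams are views on them."""
    elements: dict = field(default_factory=dict)       # element_id -> ModelElement
    relationships: list = field(default_factory=list)  # (kind, from_id, to_id) triples
    diagrams: list = field(default_factory=list)

    def add(self, element: ModelElement) -> None:
        self.elements[element.element_id] = element

# The same element can appear in several diagrams:
m = Model()
m.add(ModelElement("c1", "Class", "Order"))
m.diagrams.append(Diagram("Class Diagram", {"c1"}))
m.diagrams.append(Diagram("Sequence Diagram", {"c1"}))  # same element, two views
```

On this reading, a change to an element affects every diagram that shows it, which is the source of the consistency risks discussed in Section 1.2.2.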

1.2.1 Description of the Diagram Types

In this thesis we focus on the most widely used diagram types [53][83]. Examples of the diagram types are shown in Figure 1.1, and we briefly describe them in the sequel.

• Use Case Diagram. The use case diagram contains use cases, actors, and their relationships. The use cases specify the functionality required of the described system.

• Class Diagram. The class diagram contains classes, which are the smallest stand-alone entities in object-oriented systems, including their attributes and methods, and the relationships between the classes. This diagram type is a structural view of the system.

• Sequence Diagram. The sequence diagram describes the interaction between classifier instances in a sequential manner. It is a behavioral view of the system.

Figure 1.1. Use Case Diagram (top), Class Diagram (center), Sequence Diagram (bottom)

The other diagram types are the state chart diagram, activity diagram, communication diagram, component diagram, composite structure diagram, deployment diagram, interaction overview diagram, object diagram, package diagram, and timing diagram. It is beyond the scope of this thesis to give an elaborate discussion of the UML's diagram types, but we will give more details where needed. More information about the UML can be found, for example, in [70] and [152].

1.2.2 Characteristics and Risks

The UML has the following characteristics, which threaten the quality of the models, as discussed in the sequel:

• Lack of formal semantics. In the UML Superstructure [152] the UML meta model is accompanied by well-formedness rules expressed in the Object Constraint Language (OCL) [150] and by a description of the semantics in natural language rather than a formal language. Therefore the meaning of UML models is not precisely defined, and the risk of different interpretations of the models is inherent to the UML [64]. Different interpretations between software project members can result in poor quality and loss of productivity due to communication overhead or the resolution of problems. It is additionally stated in [64] that the lack of formality prevents rigorous analysis of UML models. As a result, the models are validated and verified informally.

• Multi-diagram notation. As described above, the UML consists of a variety of diagram types that each describe parts of the software system from a particular viewpoint. Eventually all diagrams of a model describe the same system, so there is overlap between diagrams [179]. The overlap entails the risk of contradictions between diagrams, so-called inconsistencies or consistency defects. An example of a consistency defect is a call to the method open() on an instance of class X while class X does not have a method called open() in the class diagrams (see the sketch after this list). It is common to create models at different abstraction levels describing the same system during software development. Therefore consistency defects can occur not only within a model between different diagrams, but also between models at different abstraction levels. A risk of inconsistency between views and incompleteness of views is inherent to multi-view modeling languages [68] such as the UML. Examples of causes of consistency defects are lack of agreement between developers, evolution taking place in one of the diagrams without the corresponding diagrams being adapted, or simply human error. Handling consistency defects in models is of major importance [76][77] to assure the quality of the models and the resulting implementation. A major challenge in handling consistency defects is the aforementioned lack of a formal semantics, which precludes defining consistency based on the UML's specification. Besides the aforementioned efforts to formalize the semantics of the UML, there are other approaches aimed at dealing with inconsistency.

• Complexity of the UML. The UML is a large language, with 13 diagram types and, in its first version, already more than 150 symbols in the alphabet [54]. Siau et al. [176] have shown that the UML is between two and eleven times more complex than other object-oriented techniques. In practice, the size [121] of UML models ranges up to hundreds of classes, and the models are used by large development teams. This huge complexity, combined with the lack of a formal semantics and the absence of guidelines on how to use the language concepts, provides the user with a large degree of freedom. The UML user has freedom in choosing the model elements, the abstraction level, and the level of detail. The large degree of freedom in using the UML may result in different styles of modeling and is hence a possible cause of miscommunication.
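The consistency defect described in the multi-diagram bullet above lends itself to simple automation. The following sketch (illustrative Python; the model representation and all names are our own assumptions, not the analysis tooling used in this thesis) flags sequence-diagram messages whose method is missing from the receiver's class:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UmlClass:
    name: str
    methods: frozenset  # method names declared in the class diagrams

@dataclass(frozen=True)
class Message:
    receiver_class: str  # class of the receiving instance
    method: str          # method being called in a sequence diagram

def consistency_defects(classes, messages):
    """Report messages whose method is missing from the receiver's class."""
    methods_by_class = {c.name: c.methods for c in classes}
    defects = []
    for m in messages:
        known = methods_by_class.get(m.receiver_class)
        if known is None:
            defects.append(f"class {m.receiver_class} not found in any class diagram")
        elif m.method not in known:
            defects.append(f"{m.receiver_class} has no method {m.method}()")
    return defects

# Example: a sequence diagram calls open() on an instance of X,
# but class X only declares close() -- a consistency defect.
classes = [UmlClass("X", frozenset({"close"}))]
messages = [Message("X", "open")]
print(consistency_defects(classes, messages))  # ['X has no method open()']
```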

The aforementioned characteristics are inherent to the UML and its relation to its users. In this thesis we will address the risks caused by these characteristics. However, the range of risks that occur in software engineering with UML models is broader. For example, there are risks associated with the UML's relation to a particular programming paradigm. A characteristic of the UML is that several of its concepts are based on the object-oriented paradigm. This makes the UML more suitable for modeling object-oriented systems than systems using other paradigms. The suitability of the UML's meta model for round-trip engineering has been criticized [51]. Another cause of risk is the synchronization of UML models and the corresponding implementations. This problem is addressed in [190].

1.2.3 Use of the UML

Originally the UML was designed as 'a graphical language for visualizing, specifying, constructing and documenting the artifacts of a software intensive system', as stated by Bézivin and Muller in their editorial introduction to the proceedings of the First International Workshop on the UML [18]. A similar observation is stated by Engels et al. [63]. This statement indicates that originally the source code was regarded as the primary artifact in software engineering. Models were used as initial sketches of a software system or as visual representations at a higher abstraction level, created based on existing software to better understand the system (also referred to as reverse engineering [42][24]). That situation is characterized on the left-hand side of the modeling spectrum of Brown [34] (Figure 1.2). Bézivin and Muller [18] forecast that the introduction of the UML would initiate the transition from code-centric software engineering to model-centric software engineering, which would mean a move towards the right-hand side of Brown's modeling spectrum. This trend is ongoing, as an effort to cope with the aforementioned challenges of increasing system complexity and demand. The ultimate goal of initiatives such as Model-Driven Engineering (MDE [174]) and the Object Management Group's (OMG [153]) so-called Model Driven Architecture (MDA [151]) is to make models the primary artifacts in software engineering. In that case, represented by the rightmost box of Brown's modeling spectrum, the implementation is automatically generated from the model. The Gartner Group's 2006 'Hype Cycle' report [183] still categorizes MDA technologies as 'on the rise', which indicates that the goal of MDE has not yet been reached, but that software engineering is becoming more model-centric. As a result, UML models are nowadays used for a much broader range of purposes than in the language's early years. The range of purposes includes but is not limited to: communication with stakeholders [170], documentation for maintenance [9][185], effort estimation, quality predictions [74][11], and testing [28][173].

Figure 1.2. The modeling spectrum (from [34])

We have not yet reached the goal of MDA, but software engineering is definitely becoming more model-centric. The broad range of activities in which the UML is used and the UML's wide acceptance in industry underline that UML models play a critical role in software engineering. Hence, the quality of UML models is an important asset for successful software projects [75].


1.3 Problem Statement

Based on the above discussion, we formulate the problem statement that is central to this thesis as:

The UML is a graphical multi-diagram modeling language that does not have a formal semantics and that provides the user with a large degree of freedom and complexity. With these characteristics, the UML has become the de facto standard general-purpose modeling language in software engineering. However, the same characteristics pose a risk to the quality of UML models. These risks can lead to defects in UML models. Defects are flaws in the model that impair quality attributes such as correctness, comprehensibility, consistency, non-redundancy, completeness, or unambiguity. Therefore there is a need for techniques that reduce the risks and enhance the quality of UML modeling.

1.4 Research Questions

Most existing approaches that aim at tackling the risks inherent to the UML require changes to the UML specification (e.g. formalizing the semantics) or make use of formal approaches (such as the Object Constraint Language, OCL [150]) to deal with the degrees of freedom offered by the UML. Therefore they require a large amount of rigor in creation and maintenance and more knowledge of the UML's theoretical concepts, like the meta model. This would reduce the ease of using the UML and, possibly, reduce its acceptance by practitioners. Therefore we are interested in improving the use of the language in its current form with light-weight techniques that do not put a burden on the language user.

To address the problem statement, we formulate the overall research question to be addressed in this thesis as:

• RQoverall: How can the quality of UML models be improved, in particular with respect to defects, miscommunication, and complexity?

We decompose the overall research question into more detailed research questions. The first step in addressing it is to define the perspective on quality that we take. Therefore we address the research question:

• RQ1: How can the quality of UML models be decomposed into quality notions for particular purposes of modeling?

As a starting point for the further studies, we are interested in the quality of UML models in practice, which leads us to the following research question:


• RQ2: What is the quality of industrial UML models, in particular with respect to defect containment?

To get further insight into the risk caused by defects, we address the following research question:

• RQ3: What is the effect of defects in UML models, in particular with respect to detection and misinterpretation?

We are interested in techniques that reduce the quality risks inherent to the UML. We propose modeling conventions, as an analogy to coding conventions in programming. Modeling conventions are a preventive technique to ensure that model developers adhere to quality norms during the development of models. We study modeling conventions in the following research question:

• RQ4: Can modeling conventions for developing UML models improve model quality?

And finally, we propose task-oriented views: views on UML models that visualize the model information needed by the developer to fulfill a particular task. Here we are interested in the following research question:

• RQ5: Can task-oriented views of UML models enhance the comprehension of the models?

1.5 Empirical Research Method

In software engineering, many techniques, methods, and tools have been adopted without thorough evaluation [25][184]. Where methods are adopted without evaluation, there is no scientific foundation regarding the desired enhancement of productivity and quality, nor regarding the absence of negative side effects.

The software engineering community has realized the need for thorough evaluation of proposed techniques, methods, and tools. The acceptance of empirical studies as a means for evaluation is rising [203][12]. Empirical evidence reduces the risk [184] of adopting new techniques and facilitates the transfer of sound techniques to industrial practice [95].

Empirical studies are essential for developing and validating our knowledge of software engineering in general and of the quality of UML modeling in particular. The UML has been around for ten years now, but the number of empirical studies addressing its use and quality is still relatively small compared to its popularity in practice and the number of suggested changes and improvements for the UML.


In this thesis we investigate the quality of UML models, and techniques to improve that quality, using empirical studies. We conduct case studies to explore the defect containment of UML models in practice (RQ2). Additionally, we report a series of large-scale controlled experiments to build knowledge about the effects of defects (RQ3) and to evaluate suggested methods to improve the quality of UML models (RQ4 and RQ5). Chapters 5, 6, and 8 report on the large-scale experiments. Therefore the structure of these chapters is very similar. However, the similarity of their structure does not imply a similarity of content: the content differs, but is related as described in the following section.

The conclusions have implications for the successful use of the UML. Additionally, the reported studies form a basis for further empirical studies with respect to the quality of UML modeling. We enable external replications [32][184] of our experiments by providing replication packages containing the experimental material.

1.6 Contributions and Outline

In this thesis we report the results of our research addressing the quality of UML models. This section gives an outline of the chapters in this thesis. Additionally, we summarize our contributions to the knowledge about software engineering, in particular with respect to UML model quality, and we indicate the relation between parts of this thesis and our previous publications.

• Chapter 2: Background. We discuss existing approaches that address the described quality problems in the UML, and we point out how they relate to our work as well as their advantages and disadvantages.

• Chapter 3: (RQ1) A quality model for the UML. We propose an initial quality model for UML models, which takes the quality of both the description and the described system into account. The quality model relates metrics and defects to quality attributes. It is a guideline for selecting metrics and rules to assess quality for a particular purpose of modeling. Additionally, we present a decomposition of model quality into quality notions: system quality, semantic quality, pragmatic quality, social quality, communicative quality, and correspondence. We use these notions to position the studies presented in the remaining chapters of this thesis. This chapter is based on [114].

• Chapter 4: (RQ2) Survey of the practice of quality in UML modeling. Most results in the literature are based on models produced by subjects with little or no industrial experience. We contribute explorative data about the defect containment of UML models used in industrial projects from different application domains. This chapter is based on [112] and [117].


• Chapter 5: (RQ3). This chapter is based on [115], and the contribution is two-fold:

– Insights into the effects of defects. In a large-scale controlled experiment we investigated the effects of several defect types that occur in UML models. The results quantify the likelihood that the defects remain undetected and the likelihood that the defects cause misinterpretation amongst different readers.

– An objective defect classification. We provide an objective classification of defect types with respect to the likelihood of detection and misinterpretation. The classification describes the risk caused by the defect types and can be used to focus defect-removal effort on the defects with the highest risk. Additionally, the design of the conducted experiment can be reused as an instrument for the classification of a larger number of defect types.

• Chapter 6: (RQ4) Guidelines for using modeling conventions. We propose modeling conventions as a means to create UML models of better quality. Extending the existing literature, we conducted explorative experiments that provide data about the effectiveness and efficiency of modeling conventions. Based on the experimental results we provide guidelines to further improve the use of modeling conventions. Additionally, the experimental results are a starting point for future investigations into the use of modeling conventions. This chapter is based on [118].

• Chapter 7: Task-oriented Views. We propose task-oriented views, a set of new visualizations of UML models and related metrics data. The purpose of the views is to provide developers with the model information needed to fulfill a particular development task. This chapter is based on [120].

• Chapter 8: (RQ5) Enhancement of model comprehension and quality analysis using task-oriented views. We validated the task-oriented views in a controlled experiment. The results show significant increases in correctness and productivity for comprehension and quality-analysis tasks. This chapter is based on [116].

• Chapter 9: Conclusions. We draw conclusions, reflect on our work including its limitations, and give an outlook on opportunities for future work.

• Appendices. The appendices contain additional details about the experiments reported in this thesis.


Chapter 2

Background

The purpose of this chapter is to provide a background of work that is related to the topic of this thesis in general. Related work that is specific to one chapter is discussed in that chapter.

2.1 Purposes of Using the UML

We motivated the need for quality in UML modeling in Chapter 1 by the growing range of purposes for which the UML is used. The purpose for which the UML is used in a specific project poses quality requirements on the UML model, as we will discuss in further chapters, especially Chapter 3 (quality model) and Chapter 7 (task-oriented views). Here we give an overview of a selection of purposes of UML modeling, and we present previous work that studied the use of the UML for these specific purposes. The range of purposes includes but is not limited to:

• Communication with stakeholders. As stated by Rumbaugh, Jacobson, and Booch [170], a primary purpose of the UML is to support communication between stakeholders.

• Comprehension. Models are abstract representations of software systems. One of the purposes of UML models is comprehension of the described software system.

• Documentation for maintenance. The UML is a type of graphical documentation of a system that can be used during maintenance to locate system elements that are affected by a maintenance task. Arisholm et al. [9] conducted experiments that showed improvement of maintenance task quality and correctness when UML documentation was available; however, no time savings were observed. Tilley et al. [185] qualitatively assessed the suitability of the UML as a means for documentation. In situations where no documentation of a system is available, the UML is used as the target language of reverse engineering activities [167]: UML documentation is created based on facts extracted from the source code [29][100][165].

• Code Generation. The goal of initiatives such as Model-Driven Engineering (MDE [174]) and the Object Management Group's (OMG [153]) so-called Model Driven Architecture (MDA [151]) is to develop techniques to automatically transform models into implementations, i.e. source code.

• Effort estimation. As UML models are available early in the development life-cycle, they are a possible basis for early estimations of project effort. Carbone et al. [38] proposed an approach based on different UML diagram types to estimate the size of the implementation, which is a substitute for effort. Uemura et al. [187] proposed a method that takes a UML model as the basis for a Function Point effort estimation. Mohagheghi et al. [143] report on a case study where effort prediction was based on UML use cases. (A simplified illustration of this style of estimation follows this list.)

• Quality predictions. It is known that changing or improving a software system requires less effort at an earlier stage of the development life-cycle. Therefore it is desirable to predict quality properties of a system at an early stage, such that possible changes can be made economically. Genero et al. [74] experimentally investigated the suitability of UML model metrics for predicting the quality attribute maintainability. Cortellessa et al. [49] presented a UML-based approach for reliability prediction. Balsamo et al. conducted a comprehensive literature review of model-based performance prediction methods [11].

• Testing. Testing verifies the conformance between an implementation and its specification. Because the UML is used for specifying software systems, it is used to support testing. Briand et al. [28] present a method that supports system testing based on UML specifications. Schieferdecker et al. [173] discuss several approaches to UML-based testing.
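To make the effort-estimation purpose concrete, the following sketch shows a use-case-points style calculation in the spirit of the methods cited above (illustrative Python; the weights and the default of 20 hours per point are commonly published textbook values, not the calibrated figures of the cited studies):

```python
# Simplified use-case-points (UCP) style effort estimate.
# Weights follow the commonly published UCP scheme; treat them as assumptions.
ACTOR_WEIGHTS = {"simple": 1, "average": 2, "complex": 3}
USE_CASE_WEIGHTS = {"simple": 5, "average": 10, "complex": 15}

def estimate_effort_hours(actors, use_cases,
                          technical_factor=1.0,
                          environment_factor=1.0,
                          hours_per_point=20.0):
    """Estimate effort from counts of actors and use cases per complexity class."""
    uaw = sum(ACTOR_WEIGHTS[k] * n for k, n in actors.items())        # unadjusted actor weight
    uucw = sum(USE_CASE_WEIGHTS[k] * n for k, n in use_cases.items()) # unadjusted use-case weight
    ucp = (uaw + uucw) * technical_factor * environment_factor
    return ucp * hours_per_point

# A model with 2 simple and 1 complex actor, 4 average and 2 complex use cases:
hours = estimate_effort_hours({"simple": 2, "complex": 1},
                              {"average": 4, "complex": 2})
print(f"estimated effort: {hours:.0f} person-hours")  # (2*1+1*3 + 4*10+2*15) * 20 = 1500
```

The point of such methods is not the particular weights but that the estimate can be derived mechanically from a model that exists early in the life-cycle.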

2.2 Studying the Use of the UML

A starting point for improving UML modeling is to find out how the language is used. This allows us to identify good practices and weak points; based on the obtained knowledge, the practices of UML modeling can be improved. In this section we review related work that investigated good and bad practices in the use of the UML. Studies regarding particular techniques for improving UML modeling are discussed in the next section.

In our literature review we found work in these categories:


• General Modeling. The objective of studies in this category is to find general observations about UML modeling.

• Particular Aspect. The objective of studies in this category is to investigate a particular aspect of modeling.

We will discuss studies of these categories in the sequel.

2.2.1 General Modeling

Current practices in UML modeling were studied in an OMG-supported survey by Dobing and Parsons [53]. They investigated the frequency of use of the different UML diagram types and the reasons for using or not using them. The results show that class diagrams, sequence diagrams, and use case diagrams are the most frequently used diagram types, whereas state charts, collaboration diagrams, and activity diagrams are used less frequently. The reason for this pattern of use is the adequacy of the information in the diagram types as perceived by the model users. However, these observations rely solely on the subjective opinion of the survey participants. The use of diagram types presumably also depends on factors such as development process, implementation technology, and application domain, but the survey does not provide information about such factors. A problem emphasized by the results of the survey is the overwhelming complexity of the UML. The authors conclude that engineers using the UML should receive better training on how to choose the parts of the UML that are suitable for their purpose. This conclusion motivates the studies in the category 'particular aspect' and the need for guidelines and modeling conventions.

Grossman et al. [83] conducted a survey to find out whether the UML fits the tasks of its user community. The overall result was that the fit is slightly above neutral, but still far from perfect. The authors concluded that there is no consensus on how to use the UML or whether its use is beneficial at all. Similar to the study by Dobing and Parsons, the authors conclude that UML users were overwhelmed by the language's complexity and lacked an understanding of how to use the language to fit their tasks.

Anda et al. [7] and Staron [180] conducted case studies in large-scale industrial projects. Their objective was to investigate the benefits and shortcomings of UML modeling and to identify conditions necessary for successful UML modeling. Anda et al. report that the introduction of UML-based development improved several of the aforementioned purposes of using the UML: test case development, communication, documentation, traceability from requirements to code, and code design. However, several problems reduced the benefits of using the UML: the infeasibility of reverse engineering legacy code, organizational problems, insufficient training, and inappropriate functionality of the available modeling tools. Staron reports experiences from two companies that introduced modeling in their software development process. The conclusion of Staron's study is that model-driven development should be introduced in stages, from using modeling at specific points to using models as the main artifacts. This conclusion is based on the observations that 'the software development methods are not fitted to use models as the main artifacts' and that environments including tooling are not yet 'mature enough to support companies to a sufficient extent'. To summarize, the case studies of Anda et al. and Staron demonstrate that the tooling, the methods, and the knowledge on how to use the UML need to be improved to better exploit the benefits of UML modeling.

Leung et al. [128] conducted an experiment to assess the actual quality of UML models. They inspected UML models created by novice developers for defects. The defects were classified according to the framework of Lindland et al. [130] (see also Chapter 3 of this thesis). An outcome of the experiment is the likelihood of occurrence of different defect types. This result can be used to focus quality assurance effort and teaching effort on preventing the identified types of defects.

2.2.2 Particular Aspect

Tilley et al. [185] studied modeling for the purpose of comprehension. They conducted an experiment to investigate the effect of UML documentation on program comprehension. The results show that 'the UML's efficacy in support of program understanding is limited by factors such as ill-defined syntax and semantics, spatial layout, and domain knowledge'. The usability of the UML is the central topic of interest in an empirical study by Agarwal et al. [4]. In their study they asked business-oriented students to evaluate the ease of use of different UML diagrams after the students had used the UML in a course project. The findings show that the usability of use case and state chart diagrams receives the highest score, whereas class and sequence diagrams receive the lowest score. The question arises whether the results would be the same for students with a different background, e.g. technical computer science, or for practitioners, who have more experience in using the UML. The overall conclusion is that all UML diagrams have room to improve their ease of use. As in the studies on the general use of the UML, it is also concluded that a problem of the UML is its large size in terms of language constructs. Both studies motivate the need to improve the comprehensibility and usability of the UML. We address the usability of the UML for different purposes in our work on task-oriented views presented in Chapter 7, and for the comprehensibility of UML models we demonstrate significant improvements using our views.

A step towards providing the community with guidelines on when to use which UML diagram type is to study the suitability of the diagram types for different purposes. Otero et al. [156] compared the suitability of three UML diagram types: sequence diagrams, collaboration diagrams, and state charts. They conducted an experiment to evaluate comprehension correctness and effort. They conclude that sequence diagrams are most suitable for modeling the dynamic behavior of a system. Another direction of guidelines for the UML concerns the level of abstraction in modeling. Verelst [192] reports on two experiments in which he studied the effect of the abstraction level of conceptual models on the correctness and effort of maintenance tasks. The results show that models of a lower abstraction level are beneficial for small and easy maintenance tasks. A lower abstraction level also proved beneficial for drastic tasks that impact large parts of the model. However, for complex maintenance tasks consisting of a large number of easy changes, models at a higher abstraction level proved beneficial. Russell et al. [171] evaluated UML activity diagrams for their suitability for business process modeling. The results of the study point out strengths and weaknesses for this purpose and provide recommendations to improve the suitability of UML activity diagrams for business process modeling.

2.3 Managing Quality in UML-based Software Engineering

In Chapter 1 the need for quality in UML modeling is motivated. Lack of a formal semantics, multi-diagram nature, and huge complexity are characteristics inherent to the UML. These characteristics pose risks to a beneficial use of the UML to serve its aforementioned purposes. We define quality management of UML modeling as: controlling and reducing the risks inherent to UML modeling to improve the use of the UML in serving its purposes.

Our review of the related literature revealed that the existing work can be classified into approaches changing the UML itself and approaches changing the way the UML is used. In this thesis we focus on techniques that aim at managing quality by improving the way the UML is used. In the sequel we will discuss existing work and put our work into context.

Formalizing the UML. An approach to address the risks inherent to the UML is to change the language. The 'Precise UML Group' (pUML) [1] conducts efforts to formalize the semantics of the UML. The goal of this effort is to improve consistency, precision and the ability to verify UML models. However, the pUML group acknowledges that 'text-based formal techniques tend to produce models that are difficult to read and interpret, and, as a result, can hinder the understanding of UML concepts' [71]. A formalization 'requires in-depth knowledge of the formal notation' and 'is often a significant barrier to industrial use' [64]. Indeed, from the UML's first version in 1997 up to its current version the degree of formality has not changed significantly. Improving the quality of UML modeling by formalizing the language specification requires major changes to existing tooling, more rigor in creating and reading the models, thorough knowledge and training, and possibly causes compatibility problems with existing UML models and tools.


Providing conventions. The huge complexity of the UML is a source of uncertainty in the way the language is used. The use of guidelines or conventions is an approach to preventing quality problems by tackling the UML's complexity during modeling. These conventions should assist the modeler in his choice of what to model, how to model it, and which language constructs to use. An example is Ambler's set of conventions for UML style [6]. Ambler provides an extensive set of conventions for each UML diagram type. The guidelines focus on issues such as layout, level of detail and naming. Furthermore, guidelines for using the UML in object-oriented development methods are presented by Brugge [35], Gomaa [80], and Larman [126]. Conventions for modeling are similar to the well-established concept of coding conventions for programming [154]. As there are more degrees of freedom in modeling than in programming, modeling conventions are more comprehensive than coding conventions. Coding conventions mainly deal with naming, layout of the text (e.g. indentation), and comments, whereas modeling conventions additionally give guidance with respect to e.g. the level of abstraction, the model elements to use, etc. So far there exists no scientific knowledge on the quality improvements obtained by using modeling conventions, or on how much extra effort is introduced by using conventions. If modeling conventions can be proven to be beneficial, experience is needed regarding the right choice of modeling conventions, training, and possible tool support. In Chapter 6 of this thesis we provide an in-depth discussion and evaluation of modeling conventions.

Handling defects. The multi-diagram architecture and the lack of formal semantics of the UML create the risk of defects. Handling defects in UML modeling comprises specification, detection, and removal of the defects. The approaches can be categorized as automated, semi-automated and manual. There is a series of workshops dedicated to UML consistency defects [105]. For defect detection several automatic approaches exist, such as the approaches by Liu et al. [131], Berenbach [16], and Campbell et al. [36], who developed tools for automatic detection of consistency defects. Muskens et al. [147] use relation partition algebra as a formalism to specify consistency defects. Egyed [57] developed an automated technique for 'instant consistency checking'. This defect detection technique is very efficient and informs the developer immediately about defects that are introduced. Some interesting work exists about automated approaches for defect detection and support for removal. The approaches differ in the underlying formalism that is used. Mens et al. [140] developed a tool based on graph transformations. Kuster [104] addresses consistency defect handling by defining a mapping from UML to a semantic domain, which is a language with a formal semantics. Consistency defects can be identified and resolved in the semantic domain. Van der Straeten [188] proposes a mapping from UML to description logics to resolve consistency problems. However, in practice the number of defects detected by automated tools is relatively large even for small UML models. Therefore it is difficult for the user to decide where to start resolving defects. Additionally, resolving the large list of defects might seem infeasible to the user. Users need guidance as to which defect types are most severe and how specific defects relate to specific quality attributes. Our work on the effects of defects (Chapter 5) led to an objective classification of defect types with respect to severity. Such a classification, combined with the location of the defect, could be used to prioritize defects.

Another direction of defect detection is manual inspection. In inspection techniques [66] [78] humans read the model in a systematic manner to detect defects. Most existing research aims at inspecting source code; only little research has been done on UML inspection. Laitenberger et al. [108] and Cantone et al. [37] evaluated existing inspection techniques for their effectiveness and efficiency in the context of UML models. Both evaluation experiments showed promising results for the application of inspection techniques to UML. Conradi et al. [48] adjusted existing inspection techniques to better fit UML models. The results of their evaluation experiment showed an improvement over the non-adjusted inspection technique. We found no studies that compare manual and automated defect handling techniques. A comparison with respect to effectiveness, effort, and defect types that can be handled would be an interesting research direction.

Impact analysis. During their life-cycle UML models are changed due to corrections, improvements and changing requirements. As a result of the complexity of UML models and the interrelationships between different diagrams, a simple change in a model may affect a large number of other model elements at various locations in the model. This chain reaction is likely to introduce consistency defects at model elements that are directly or indirectly related to the changed model element. To control this chain reaction it is necessary to analyze the impact of a change. Impact analysis is a preventive approach that aims at identifying all model elements that are affected by a change. This technique supports the developer in avoiding or correcting defects that might be caused by a change. Briand et al. [30] developed a technique for impact analysis of UML models and implemented it in a tool that is empirically validated.

Formal complement. Briand et al. [31] showed in an experiment that formal OCL specifications augmenting UML models lead to an improvement in defect detection, comprehension of system logic and impact analysis of changes. The improvement is reached after an initial learning period. A basis for the improvement is a substantial and thorough training of the software engineers.

Diagram improvement. The UML is a diagrammatic language. A user of a UML model perceives the model mainly by viewing the diagrams. Therefore the presentation of the diagrams is important for comprehending the model correctly. Despite the fact that the UML is a standard, there are notational variations between UML tools and different publications containing UML diagrams. As an example, two different representations of the inheritance notation are depicted in Figure 2.1. Both representations in the figure are semantically equal. Purchase et al. [161] conducted an experiment to find out which notational variations are better for comprehension. The results can be used to improve the representation of UML diagrams. Another diagram issue that influences model comprehensibility is the layout, i.e. the way the diagram elements are arranged. The layout is usually created manually or by layout algorithms [59] [60]. A class diagram can for example be laid out such that it is directed, so that reading the diagram from left to right is supported, or it can be laid out such that the layout contains semantic information, e.g. a layered layout emphasizes a layered design. Purchase et al. studied the effect of layout on comprehension [162] and user preference [160]. Both studies yield results that are a basis for layout improvements. Purchase's conclusion is that a model's semantic domain and the task that the model should serve should be taken into account when creating the layout of the diagrams. Wong et al. [197] present an overview of layout guidelines based on perceptual theory. They evaluate the layouts generated by two popular UML tools using their guidelines, revealing differences between the tools and suggestions for tool improvement.

Figure 2.1. Representational differences for the inheritance notation

A young but interesting research direction in understanding how diagram layout affects model comprehension is the use of eye-tracking. Gueheneuc [84] conducted an initial study using eye-tracking during a class diagram comprehension task. He discovered that the participants mainly focussed on classes and that their eyes hardly followed the relations between the classes, such as associations and inheritance relations. Similar work is done by Yusuf et al. [202]. This type of research can lead to a systematic understanding of how engineers perceive UML diagrams. Hence, the diagrams can be further improved.

2.4 UML Tools

UML-related software engineering tasks are supported by UML tools. Therefore the tools used have a major impact on the quality of UML modeling. Here we will give an overview of the functionality offered by tools that are commonly used in today's software engineering tasks. We will have a closer look at two tools that were used in several of the studies reported in this thesis: Poseidon and SDMetrics.


2.4.1 Activities supported by UML Tools

Here we present a list of activities that are supported by commonly used UML tools. Typical activities are:

• Editing. A model is created and changed by adding and removing model elements. In general the editing of a model is done by editing the diagrams in a UML tool. In fact the actual model is an internal representation in a data structure based on the UML meta model. As the diagrams are the user's view on the model, the tool changes the model according to the user's changes to the diagram. Because model elements can occur in more than one diagram, a change in one diagram can cause changes in other diagrams.

• Viewing. Tools display the model's diagrams, and most tools also display the model in a tree representation that is used for navigating through the model (also called 'navigation pane').

• Code Generation. Some tools support creating a skeleton of the source code based on the UML model. Such a skeleton consists of the structure (empty definitions of classes, attributes, methods etc.) and can be used as a starting point for programming (a sketch of such a skeleton follows this list).

• Reverse Engineering. Some tools support the automatic creation of UML models based on existing source code. The source code is parsed and the information necessary for creating the model is extracted from it. Typically, the layout of the diagrams must then be created manually or generated using a layout algorithm.

• Round Trip Engineering. Some tools are integrated with a programming environment and support keeping the source code consistent with the model. When the user edits the model, the source code is automatically updated, and vice versa.

• Model Transformation. A key concept in MDA [151] (model-driven architecture) is transforming a model into another model. Transforming a Platform Independent Model (PIM) into a Platform Specific Model (PSM) is a typical example. This activity is supported by some tools.

• Metrics Analysis. Tools supporting metrics analysis collect metrics from the UML model and present the results to the user. The metrics results are used to evaluate quality characteristics of the model. A more elaborate discussion of the use of metrics in quality analysis is given in Chapter 3.

• Rule Checking. Another quality analysis activity is checking rules. The model is checked for violations of predefined rules and the detected defects are reported to the user.
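To make the code generation activity concrete, the following sketch shows the kind of Java skeleton a tool might generate for a hypothetical UML class Route with an attribute name : String and an operation getLength() : int. The class and its members are illustrative; the exact output differs per tool.

    // Skeleton generated from a hypothetical UML class "Route"
    // (attribute name : String, operation getLength() : int).
    // Only the structure is generated; behavior must be programmed.
    public class Route {
        private String name;

        public int getLength() {
            // TODO: implement; method bodies are not derived from the model
            return 0;
        }
    }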


Other activities that are less relevant for this thesis are simulation, model composition, and test case generation.

Most UML tools support several of the described activities. However, there are specialized UML tools that support only some of the activities. UML tools can be categorized according to the activities that they support. Relevant for this thesis are UML modeling tools and UML analysis tools. Integral to UML modeling tools is that they support editing and viewing. Nowadays most UML modeling tools also support a large number of other activities. Examples of UML modeling tools are the tools by IBM (Rational Rose, XDE) [87], Borland Together [23], Interactive Objects ArcStyler [79], ArgoUML [8] and Gentleware Poseidon [3]. UML analysis tools are typically stand-alone tools or plug-ins on top of a modeling tool. Analysis tools read a representation of the model and report analysis results to the user. Examples of analysis tools are SDMetrics [198], SAAT [146][111], DesignAdvisor [16], and SQUAM [44].

2.4.2 Gentleware Poseidon

Poseidon [3] is a commercial UML modeling tool made by the company Gentleware. The tool was originally based on the open source tool ArgoUML [8].

Because Poseidon is a typical UML modeling tool that was freely available for our purposes, we used it for several of the studies reported in this thesis. We used Poseidon in our experiments for editing and viewing models. Therefore we will focus on the activities editing and viewing in this brief discussion of the tool; however, the tool also supports other activities such as code generation and round-trip engineering.

Figure 2.2 shows a screenshot of Poseidon. The user interface is divided into four main areas (indicated as Area 1 through Area 4 in the figure) as follows:

• Area 1: Navigation pane. The main functionality of this area is navigation through the model. Model elements such as diagrams, classes, and use cases are displayed in a tree-like representation.

• Area 2: Diagram pane. This area is used to edit and view the diagrams. The area can display one diagram of the UML model at a time. Zooming and panning functionality is provided to support diagrams that are too large to be displayed entirely in the area.

• Area 3: Details pane. This area displays all details of a selected model element and allows for editing them. For example, in Figure 2.2 the class Route is selected and all its properties are displayed. Additional details such as style, documentation, or the related source code can be selected using a series of tabs.


Figure 2.2. Screenshot of the UML modeling tool Gentleware Poseidon

• Area 4: Overview pane. This area supports viewing and navigating within a model. It provides an overview of the diagram displayed in the diagram pane and provides zooming and panning functionality.

Navigation is supported by hyperlinks that are associated to elements in the navigation, diagram and details panes. The user can navigate to the elements using the hyperlinks, and the three panes are updated to display the selected element.

Additionally, the user interface contains a regular menu bar and a tool bar that offers buttons as shortcuts to functionality related to files, diagrams and searching.

2.4.3 SDMetrics

SDMetrics [198] is a representative UML analysis tool that we used during several of the studies reported in this thesis. It is a stand-alone tool that imports UML model files that were created using a UML modeling tool. The tool extracts the required facts from the UML model file and stores the data in an internal representation that is based on the UML meta model. SDMetrics has a list of metric definitions and a list of rule definitions that can be applied to the stored UML model to perform metrics and rule analysis. Therefore, SDMetrics supports the activities metrics analysis and rule checking, both of which were used during studies reported in this thesis. Other functionality offered by SDMetrics includes exporting analysis results, comparing models with respect to metrics, and editing the metric and rule definitions. SDMetrics does not support any of the modeling activities (e.g. editing, reverse engineering). We will briefly illustrate the functionality of the tool that was used during our studies.

Figure 2.3. Screenshot of the table-view in SDMetrics
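As an illustration of this import step, the sketch below counts the classes in an exported model file using standard Java XML parsing. This is a minimal sketch that assumes a UML 1.x XMI dialect in which classes appear as UML:Class elements; SDMetrics' actual import is more elaborate and handles several XMI versions.

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class XmiClassCounter {
        public static void main(String[] args) throws Exception {
            // Parse the XMI file exported by a UML modeling tool.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(args[0]);
            // Count elements tagged "UML:Class" (UML 1.x XMI dialects;
            // other XMI versions use different element names).
            int classCount = doc.getElementsByTagName("UML:Class").getLength();
            System.out.println("Classes in model: " + classCount);
        }
    }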

Metrics Analysis. SDMetrics offers a large and extensible set of metrics for all diagram types of the UML. The results of the metrics analysis are presented in a table or in graphs. In the table, the rows contain the model elements and the columns contain the metrics (Figure 2.3); functionality for sorting and filtering metric values is provided. Figure 2.4 shows a radar chart of all metrics of a class on the left and a histogram displaying the distribution of a metric over all model elements on the right. In the histogram the x-axis represents the values of a selected metric and the bars represent the number of model elements having the metric value corresponding to the bar's position on the x-axis. Additionally, descriptive statistics of the metrics analysis are provided. Metrics that are part of the default set are, for example, the Number of Public Operations in a Class [132], the Number of Attributes in other Classes that have this Class as their Type [27], and the Cyclomatic Complexity of the State-transition Graph [142]. The metrics are organized in categories such as size, coupling, and complexity.

Figure 2.4. Screenshots of the radar chart and the histogram in SDMetrics
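To illustrate how such a metric can be computed, the sketch below calculates the Number of Public Operations in a Class over a simple in-memory model representation. The record types and names are hypothetical and only stand in for the meta-model-based representation that a tool like SDMetrics maintains internally.

    import java.util.List;

    // Hypothetical, simplified model representation.
    record Operation(String name, boolean isPublic) {}
    record UmlClass(String name, List<Operation> operations) {}

    public class NumPubOps {
        // Number of Public Operations in a Class (a size metric).
        static long numPubOps(UmlClass c) {
            return c.operations().stream().filter(Operation::isPublic).count();
        }

        public static void main(String[] args) {
            UmlClass route = new UmlClass("Route", List.of(
                    new Operation("getLength", true),
                    new Operation("recalculate", false)));
            System.out.println(numPubOps(route)); // prints 1
        }
    }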

Figure 2.5. Screenshot of the rule checker in SDMetrics

Rule Checking. SDMetrics contains a large and extensible set of rules that are used to detect defects in UML models. The result of a rule checking analysis is a table of elements that contain defects, i.e. rule violations. An example of the results of the rule checker is given in Figure 2.5. For each defect, details are given such as the affected model element, the violated rule and its description, the category of the rule, the rule's severity, and a value if applicable. Examples of the standard rules are: Abstract Class has no Child Class [166], Operation has a long Parameter List with five or more Parameters [69], and, from our previous work, Abstract Class has a Parent Class that is not abstract [111].
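As an illustration of rule checking, the sketch below detects violations of the rule Abstract Class has no Child Class over the same kind of simplified model representation; the representation and names are hypothetical, not SDMetrics' actual rule engine.

    import java.util.List;

    // Hypothetical, simplified model representation: each class knows
    // its name, whether it is abstract, and its parent class (or null).
    record ModelClass(String name, boolean isAbstract, String parent) {}

    public class RuleChecker {
        // Rule "Abstract Class has no Child Class": report every abstract
        // class that no other class names as its parent.
        static List<ModelClass> abstractWithoutChild(List<ModelClass> model) {
            return model.stream()
                    .filter(ModelClass::isAbstract)
                    .filter(c -> model.stream()
                            .noneMatch(d -> c.name().equals(d.parent())))
                    .toList();
        }

        public static void main(String[] args) {
            List<ModelClass> model = List.of(
                    new ModelClass("Shape", true, null),
                    new ModelClass("Circle", false, "Shape"),
                    new ModelClass("Renderer", true, null)); // violates the rule
            System.out.println(abstractWithoutChild(model));
        }
    }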

2.5 Conclusions

In this chapter we have discussed existing work related to quality in UML modeling. We have discussed for which purposes the UML is used, and we have presented studies about the use of the UML, approaches aiming at improving the quality of UML modeling, and state-of-the-art UML tools. Existing studies show that developers are overwhelmed by the complexity of the UML. There exists no consensus about when to use which modeling construct for which purpose. Initial studies exist that aim at finding out how to model for particular purposes. For example, Verelst [192] explored the level of abstraction of modeling and Otero [156] explored the usefulness of different diagram types for particular purposes. More work is needed to establish validated guidelines for modeling. We can rely on several proposed sets of guidelines, but none of them is empirically validated.

Industrial case studies have shown that the use of UML models is beneficial. Several activities, such as code design, test case development, and communication, were improved due to modeling. However, it is reported that the benefits can be larger when methods and tools for modeling are further improved. Possible directions for further improvements are better visual representations of models and better handling of model defects.

We can rely on several studies that addressed the visual representation of models to improve model comprehension. Most studies aim at improving the layout of individual UML diagrams. More work is needed to improve the comprehension of the entire model, which consists of several related diagrams. Additionally, further work is needed to improve the comprehension of the model and related data, such as source code, bug data, and metrics.

Several approaches for defect handling exist. However, only limited work exists that reports on applying the approaches in practice to gain knowledge about defect occurrences in real-world modeling. As defect detection techniques may result in large lists of detected defects, work is needed to classify these defects to enable focussing defect removal effort on the most severe defects first.


Chapter 3

How to Evaluate Model Quality

This chapter addresses how we evaluate the quality of UML models in this thesis. Based on existing work we decompose the concept of quality into subnotions, such as syntactic quality, semantic quality, pragmatic quality, and social quality. We use these subnotions throughout this thesis to define the context with respect to quality of the presented studies. In this chapter we propose a quality model for UML modeling. The purpose of the quality model is to support quality evaluations of UML models by providing guidelines for selecting metrics and rules to measure the quality for a particular purpose of using the UML. The quality model is based on a literature review and on experiences from our industrial case studies. The quality model is specific to UML modeling, because it distinguishes between the quality of a UML model as a description and the quality of the system that is described by a UML model. Most existing quality models do not make this distinction.

3.1 Introduction

In this thesis we address the quality of UML models. As a basis for evaluating and improving the quality of UML models, it is necessary to have an understanding of what quality really means. The purpose of this chapter is to present the notion of quality that we will use throughout this thesis. As stated in Chapter 1, we address the research question:

• RQ1: How can the quality of UML models be decomposed into quality notions for particular purposes of modeling?


Table 3.1. Differences between source code and UML models

                 Source Code    UML Model
Abstraction      low            high - medium
Precision        high           medium - low
Completeness     high           low
Consistency      high           medium - low

In this chapter we will make the distinction between quality models for software in general (which are mainly source code based) and quality models for conceptual models such as UML models. We will present a decomposition of quality that is based on the work of Lindland [130] and Krogstie [101]. This framework will serve as a reference for the studies around the quality of UML modeling that we report throughout this thesis.

Early actions for quality improvement are less resource intensive and, hence, less cost intensive than later actions [20]. Additionally, new model-centric development techniques such as MDA [151] require high-quality models. Therefore models should be subject to continuous quality control, as is common for source code. There are, however, significant differences between source code and models. Source code closely corresponds to the executing system, as the latter can be automatically derived from the former. This correspondence is much weaker for models. UML models and source code differ in abstraction level, precision, completeness, consistency and correspondence to the ultimate system (depicted in Table 3.1). As a consequence, a model does not describe the system unambiguously. Unlike for source code, there is a gap between the artefact model and the system described by it. The dominant existing quality models described in Section 3.2 do not distinguish between quality characteristics that are inherent to the system and quality characteristics that are inherent to its description. However, models are extensively used during development and during maintenance. Hence, there is a need to assess quality characteristics of models independent of the implementation. The quality model presented in this chapter distinguishes between quality characteristics of the model as a description and the system described by it. These observations motivated us to propose a quality model for UML models that takes into account the aforementioned characteristics of models as well as the manners and phases in which an organization aims to use the models. The quality model can be seen as complementary to existing quality models that focus on the quality of the system. The purpose of the proposed quality model is to provide guidance in selecting metrics and rules to assess the quality of UML models.

This chapter is structured as follows. Section 3.2 presents our perspective on quality and discusses related work. Section 3.3 discusses the decomposition of quality into subnotions that is used as a reference throughout this thesis. Section 3.4 presents the quality model and Section 3.5 gives guidelines on how to tailor the quality model for particular purposes. Section 3.6 concludes this chapter.

3.2 Software Quality

In this section we review the relevant literature on software quality and relate our approach to the quality of UML models to it.

3.2.1 Perspective on Quality

The quality of (software) products is widely discussed, and most organizations and stakeholders agree that quality is an important property of products. But there are many definitions of quality. To achieve a common quality target, one first has to agree upon a definition. Garvin [73] defines five views of product quality:

• Transcendental view: quality is seen as something that can be perceived but not defined.

• User view: quality is seen as the fitness for the user's purpose.

• Manufacturing view: quality is seen as conformance to the product's specification.

• Value-based view: quality is seen in terms of a product's potential to generate economic turnover.

• Product view: quality is decomposed into several characteristics that are inherent to the product.

The main application of our quality model is during development of the product. In this phase neither the user's perception nor the product's economic potential can be measured. During development only internal properties of the product and its model can be measured; hence, we apply the product view of quality, since we decompose quality into characteristics that are inherent to the product.

3.2.2 Existing Approaches

Early quality models that still serve as references are the models proposed by Boehm [21] and by McCall [137]. The ISO 9126 [89] standard for software quality contains a quality model that is based on McCall's model. These models, together with the model used by Rombach [168], also decompose quality into subcharacteristics that are inherent to the product.

A prominent way of defining subcharacteristics is by means of metrics. In the literature many metrics [67] are defined that are based on source code (e.g. Chidamber and Kemerer's suite [41] or McCabe's Cyclomatic Complexity Measure [135]). The quality of a system is assessed using metrics by relating metrics to quality attributes as described in a quality model.
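As a reminder of how such a metric is computed (the standard definition, not specific to this thesis), McCabe's measure is derived from the control-flow graph G:

\[ V(G) = E - N + 2P \]

where E is the number of edges, N the number of nodes, and P the number of connected components of G. For example, a single routine (P = 1) whose control-flow graph has N = 7 nodes and E = 8 edges yields V(G) = 8 - 7 + 2 = 3.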

The models by Boehm and McCall structure the subcharacteristics into three hierarchical levels, where each level represents a different class of characteristics. The three levels of McCall contain Uses, Factors and Criteria, respectively. Boehm uses different vocabulary with similar meanings. The levels in Rombach's model are only hierarchically ordered and do not denote different classes of characteristics. We also use the three-level approach with a specific class of characteristics for each level in our quality model.

The aforementioned quality models target a system during its entire life-cycle and also take the user view and the manufacturing view into account. These models have a rather general view on quality, with an emphasis on the product in operation and in maintenance. Both the Boehm and the McCall model define three Uses at the highest level of the quality tree: operation, maintenance and transition. (Note that Boehm and McCall use different terminology for some concepts.)

The existing quality models lack the distinction between the system and its description, which is important for controlling the quality of UML models. In the next section we introduce a new concept at the top level of the quality tree to include this distinction in our development-centric quality model.

Balla et al. [10] proposed the QMIM framework to classify existing quality approaches and to aid users of the models in choosing one of (or a combination of) the different approaches using the framework's classification. The classification is two-dimensional, with the object of interest in one dimension (process, product or project/resource) and the quality specification in the other dimension (metric, quality attribute or definition). Our proposed quality model covers the product and the process on the object dimension and all levels on the quality-specification dimension.

3.3 Model Quality

Figure 3.1. Concepts and relations of the quality framework (based on [130] and [101])

We present a framework for model quality to clarify which notions of quality are addressed in the remainder of this thesis. The framework used in this thesis is an adapted version of the work by Lindland [130] and Krogstie [101]. The notions of quality used in this thesis are based on the following observation: 'models are used to describe a domain and users interpret the model to understand the domain'. The concepts and relations that are part of this observation are depicted in Figure 3.1. As in [130] and [101], we use a set-theoretic approach to reason about the concepts and their relations. The elements of the sets are statements, which are defined as 'a sentence representing one property of a certain phenomenon' [101]. First, we will describe the concepts (see Figure 3.1):

• Domain. This is the part of reality which is described by the model. The domain can be a problem, or the solution of a problem, that is usually (to be) implemented in some programming language. Often the actors that the software interacts with are also regarded as part of the domain. We denote the set of all domain statements as D.

• Model. This is a representation that describes the domain at a higher abstraction level. Depending on whether a problem or a solution is described, the model is an analysis model or a design model, respectively. In this thesis we focus on UML models. The set of all statements of a model is denoted as M. It is common in software development to create several models at different abstraction layers for one domain. We use indices to distinguish the sets of statements of different versions of a model (e.g. Mi).

• User. A user is someone who interprets models to get an understanding of the domain. Note that this concept is called 'audience' in [130]. We denote the set of statements in a user's interpretation as U. As a model is usually interpreted by several users, we use indices to distinguish the interpretation sets of different users (e.g. Ui represents the interpretation of user i).

Table 3.2. Relations between chapters of this thesis and quality notions

Chapter 3 (all quality notions): An initial quality model for UML models.
Chapter 4 (syntactic, system): The case studies address syntactic and system quality of UML models in practice.
Chapter 5 (social): The experiment addresses the level to which defected and undefected UML models are interpreted by different users.
Chapter 6 (syntactic): We validate the effect of modeling conventions with respect to different quality notions.
Chapters 7, 8 (pragmatic): We validate the proposed views with respect to pragmatic quality.

Based on these entities, Lindland [130] proposed the following notions of model quality: syntactic, semantic and pragmatic. Krogstie [101] extended this set with the notion of social quality. Our extension is communicative quality [55]. These notions form the quality framework that we will use throughout this thesis. The notions are discussed in the sequel, and a set-theoretic sketch of some of them follows the list (note that the numbers in Figure 3.1 correspond to the numbers in brackets in the bullet list):

• System Quality (1). This notion is not inherent to a model, but to a system described by the model (i.e. the domain). Quality models such as ISO 9126 [89], McCall [136] and Boehm [20] address system quality. In this thesis we will focus on the other quality notions, rather than addressing system quality in detail.

• Model Quality. (Note that this quality notion is not explicitly mentioned in Figure 3.1, because it is the superordinate concept; all following quality notions are part of Model Quality.) This notion addresses the quality characteristics of a conceptual model, e.g. a UML model. Model quality addresses the model's suitability to describe a system.

• Syntactic Quality (2). This notion addresses correctness of language use in the model. In addition to the adherence to the language specification, we also regard the absence of consistency defects1 as part of syntactic quality. Syntactic quality is measured by counting consistency defects and deviations from the language specification. We will give more details on measuring syntactic quality in the relevant chapters.

1Note that consistency defects refer to the consistency between diagrams within a model. We refer to the consistency between models or between a model and source code as correspondence.

• Semantic Quality (3). This notion addresses the degree of correspondence between a model M and the domain D, i.e. it indicates the level to which the model describes the problem it should describe.

• Pragmatic Quality (4). This notion addresses the degree of correspondence between a user's interpretation Ui and the model M, i.e. whether a user understood the model.

• Social Quality (5). This notion addresses the degree of correspondence between the interpretations of different users, i.e. whether the model causes miscommunication and misinterpretations. The measurement of social quality of model M is based on all sets Ui where i is a user of M.

• Communicative Quality (6). This notion addresses the degree to which a user's interpretation Ui reflects the modeled domain D, i.e. whether a user understood the modeled domain.

• Correspondence (7). This notion addresses the degree to which different models that describe the same domain correspond to each other, i.e. the overlap between models Mi and Mj. In this thesis we do not address correspondence between models. Approaches addressing the correspondence between models can be found in [189] [190] [199] [200].
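The following sketch phrases some of these notions set-theoretically, stating the ideal goals in the spirit of Lindland's framework [130]; in practice only feasible, cost-bounded approximations of these goals are demanded:

\[
\begin{aligned}
\text{semantic validity:}\quad & M \setminus D = \emptyset \\
\text{semantic completeness:}\quad & D \setminus M = \emptyset \\
\text{pragmatic quality (user } i\text{):}\quad & U_i = M \\
\text{social quality:}\quad & U_i = U_j \ \text{for all users } i, j
\end{aligned}
\]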

Each of the studies presented in this thesis addresses one or more of the quality notions. Table 3.2 describes for each chapter of this thesis which quality notions are addressed.

3.4 A Quality Model for UML

Here we describe the specific needs for a quality model that focusses on model-based development, the quality model itself, and its concepts.

3.4.1 Concepts in this Model and their Definitions

In this section we describe the levels of our quality model and the concepts that belong to these levels.

Level 1: Use. The top level of our quality model describes the high-level use of the artefact. These uses are:


Table 3.3. Purposes of modeling

Modification: The model and the system enable easy modification. Possible modifications are corrections, i.e. removal of errors, extensions of the system, or adaptive changes due to changed requirements.

Testing: The model is used to generate test cases.

Comprehension: The model is used to comprehend the system.

Communication: The model enables efficient communication about the system's elements, their behavior and the design decisions. Communication includes communication during the development phase with different stakeholders and documentation for understanding the system in later phases such as maintenance.

Analysis: The purpose of the model is to explore and analyse the problem domain including its key concepts and making some early design decisions.

Prediction: The model is used to make predictions about quality attributes of the eventually implemented system. These predictions are used to make early and therefore cheaper improvements of the architecture and design.

Implementation: The model is used as a basis for (manual) implementation of the source code of the system.

Code Generation: The model is used to automatically generate the source code of the system (as in MDA or MDE). Code generation can be complete or partial (only a skeleton of the source code is generated).

• Development: this use combines quality characteristics that concern the product and its artefacts in the phases before the product is finished.

• Maintenance: this use combines quality characteristics that concern the product when it is changed.

UML models are used in the development and maintenance phases. Therefore our quality model takes these two uses into account. The decomposition of these uses is presented in the sequel of this chapter. Note that we added Development to the uses that are defined in the established quality models [21] [137] (described in Section 3.2). In contrast to those quality models, our quality model does not take into account the following two uses, because they are not specific to UML modeling:

• Operation: this use combines quality characteristics that concern the product when it is operational. This implies that the system is readily implemented. The characteristics concerning the system's operation relate to the user view of product quality. We focus on the product view; hence, the operational use is out of the scope of our quality model.


• Transition: this use combines quality characteristics that concern the product when it is moved to another environment, e.g. a different operating system. Transition issues are implementation dependent and are not addressed by modeling in the design phase of a system. Therefore the transition use is out of the scope of our quality model.

Level 2: Purposes of Modeling. The second level of the quality model contains the purposes of the artefact. A purpose describes why the artefact (i.e. the model) is used. Usually an artefact is not used for all purposes at the same time. The purposes of modeling are presented and described in Table 3.3. The table contains the most common purposes. However, it is possible that there are other purposes that are important in particular application domains (e.g. analysis of timeliness in real-time systems).

Level 3: Characteristics. The third level of our quality model contains the inherent characteristics of the artefact. The characteristics are described in Table 3.4. According to the distinction between characteristics of the model and characteristics of the system, we indicate for each characteristic whether it is a model characteristic (column M) or a system characteristic (column S). Some characteristics are defined for both model and system. The only exception is correspondence: this is a characteristic of a model-system pair. Correspondence between a UML model and the source code of a system is needed to enable using the model to understand the system.

Level 4: Metrics and Rules. The concepts of the three aforementioned levels of the quality model cannot be measured directly from the artefact. To enable quantification, we propose to use metrics and rules. A metric is a mapping from the empirical domain (e.g. the model) to the numerical domain, such that its value reflects the level of some property of the artefact or an element of the artefact. Rules are special cases of metrics: they are mappings to a binary value, true or false. Rules are usually defined for elements of artefacts (e.g. 'Abstract class X must have a subclass'). Hence, a rule induces a numeric metric for an entire artefact (i.e. the number of elements violating the rule). The characteristics of Level 3 can be measured using metrics and rules. Table 3.5 presents the collection of metrics that are used in the quality model. The number of metrics proposed in the literature is huge (e.g. [67, 91, 41]). For the sake of brevity some rows of the table summarize an entire family of related metrics, instead of presenting each particular metric. For example, we present only one row containing 'coupling', whereas there exists a large variety of metrics quantifying particular notions of coupling (e.g. temporal coupling, data type coupling, coupling based on interactions, coupling based on associations, ...). Additionally, we do not claim that this list of metrics and rules is complete. The relation between the quality model's level of metrics and rules and the level of characteristics is explained in the sequel.
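A minimal formal sketch of this relationship (our notation for illustration, not taken verbatim from the cited literature): for an artefact A with elements E(A),

\[ m : A \to \mathbb{R}, \qquad r : E(A) \to \{\text{true}, \text{false}\}, \qquad m_r(A) = \left|\{\, e \in E(A) \mid r(e) = \text{false} \,\}\right| \]

so every rule r induces an artefact-level metric m_r that counts the elements violating r.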


Table 3.4. Characteristics (M = model characteristic, S = system characteristic)

Complexity (M, S): Measures the effort required to understand a model/system. [67]

Traceability (M, S): The extent to which relations between design decisions are described explicitly. [163]

Modularity (S): The extent to which its parts are systematically structured and separated such that they can be understood in isolation. [159]

Completeness (M, S): A system possesses completeness to the extent to which its functionality covers all requirements. A model possesses completeness to the extent to which overlapping parts of different views contain the same elements and the system is completely described by the model. [112]

Consistency (M): The extent to which no conflicting information is contained. [56]

Communicativeness (M): The extent to which it facilitates the specification of inputs and provides outputs whose form and content are easy to assimilate and useful. [21]

Self-Descriptiveness (M, S): The extent to which it contains enough information for a reader to determine its objectives, assumptions, constraints, inputs, outputs, components, and status. [21]

Detailedness (M): The extent to which it describes relevant details of the system. [47]

Balance (M, S): A model possesses balance to the extent to which all parts of the system are described at an equal degree of detail. A system possesses balance to the extent to which all parts of it have a similar level of all other system characteristics.

Conciseness (M): The extent to which the system is described to the point and not unnecessarily extensive. [21]

Esthetics (M): The extent to which its graphical layout enables ease of understanding of the described system. [161]

Correspondence (model-system pair): A pair of model and system possesses correspondence to the extent to which system elements, their relations and design decisions are the same in the model and the system. [189]


Table 3.5. Metrics and rules

Dynamicity: Measures the complexity of a class' internal behavior. Dynamicity is based on the assumption that message calls correspond to state transitions. It takes information from the interaction diagrams into account. [111]

Ratios: Ratios between numbers of elements (e.g. number of methods per class). [148]

DIT: Depth of Inheritance Tree. [41]

Coupling: The number of other classes a class is related to. [41]

Cohesion: Measures the extent to which parts of a class are needed to perform a single task. [41]

Class Complexity: The effort required to understand a class. [148]

NCU: Number of classes per use case. [148]

NUC: Number of use cases per class. [148]

Fan-In: The number of incoming association relations of a class. Measures the extent to which other classes use the class' provided services. [86]

Fan-Out: The number of outgoing association relations of a class. Measures the extent to which the class uses services provided by other classes. [86]

Naming Conventions: Adherence to naming conventions. [154]

Design Patterns: Adherence to design patterns. [72]

NCL: Number of crossing lines in a diagram. [161]

Multi defs.: Multiple definitions of an element (e.g. class) under the same name.

ID Coverage: Interaction diagram coverage. These metrics measure the extent to which the interaction diagrams cover the elements described in the structural diagrams. Examples are the percentage of classes instantiated in interaction diagrams and the percentage of use cases described by interaction diagrams. [111]

Message needs Method: This consistency rule states that each message in an interaction diagram must correspond to a method defined in the class diagram. [112]

Code Matching: This family of metrics measures the extent to which the code matches the model. Example: percentage of model classes that occur in the code. [190]

Comment: Measures the extent to which the model contains comments. Example: lines of comment per class. [155]


Figure 3.2. Quality model

3.4.2 Relations between Concepts

Now we are ready to present the relations between the aforementioned concepts. The concepts together with the relations between them form the actual quality model. Figure 3.2 shows the quality model. Due to limited space, Figure 3.2 only contains three levels of the quality model: Use, Purpose, and Characteristic. The fourth level, Metrics and Rules, and its relations to level three are depicted in Table 3.6 (where a checkmark indicates that the characteristic (level 3) is related to the metric or rule (level 4)). The arrows in the figure indicate relations between two concepts of different levels. The arrows are interpreted as follows: a lower-level concept is part of all higher-level concepts to which it is related by an arrow, and a higher-level concept contains the related lower-level concepts. That is, a concept at a lower level of the quality model contributes to the related concepts at the next higher level.


Table 3.6. Relations between metrics and rules and characteristics

[Table 3.6 is a matrix relating the metrics and rules of Table 3.5 (rows: Dynamicity, Ratios, DIT, Coupling, Cohesion, Class Complexity, NCU, NUC, Fan-In, Fan-Out, Naming Conventions, Design Patterns, Layout-Guidelines, Multi defs., ID Coverage, Message needs Method, Code Matching, Comment) to the characteristics of Table 3.4 (columns: Modularity, Complexity, Completeness, Consistency, Communicativeness, Self-Descriptiveness, Detailedness, Balance, Conciseness, Esthetics, Correspondence); checkmarks indicate which characteristics each metric or rule measures.]

3.5 Applying and Tailoring the Quality Model

We have explained the quality model for UML models. This section describes how the quality model fits into the different phases of development.

The quality model covers eight purposes of modeling. The purposes of modeling, and, hence, the quality requirements, differ per project phase. When using the quality model, the important purposes, and, hence, the important characteristics, should be carefully selected. The importance of purposes depends on the project phase, the development process, and characteristics (and taste) of the development team. For example, in a development team that is geographically distributed over different countries, comprehension and communication are important purposes. In a project that develops a very large system containing many risks, early prediction of the quality attributes of the system is important to enable cost-effective improvements.

Figure 3.3. Relations between modeling purposes and project phases

In our case studies we have discussed the importance of modeling purposes with project members. As a result, we have built a schema that relates purposes of modeling to project phases. Figure 3.3 shows the relations between project phases and modeling purposes. The shaded boxes show that the purpose is seen as important for that specific phase. Darker shading indicates higher importance. This schema can be seen as an indication, but different priorities for individual projects are of course possible.

3.6 Conclusions

In order to support the management of quality from the early phases of architecture and design, techniques are needed for assessing the quality of models. In this chapter we presented a quality framework that is an adaptation of the frameworks of Lindland [130] and Krogstie [101]. The framework consists of quality notions that relate to the process of understanding a part of the reality that is described using a model. In each of the following chapters we use this framework to describe which quality notions we address.

Furthermore, we have proposed an initial quality model that is specific to UML modeling. This quality model takes into account the different uses of models in a project as well as the phase in which a model is used. The purpose of the quality model is to provide guidelines to developers for choosing suitable metrics and rules for analyzing the quality of a UML model for a particular purpose. Thereby, the quality model supports identifying the need for quality improvement already in early stages of the life-cycle.

The main contribution of this chapter is that we have developed a quality model that takes into account quality characteristics of the model as well as quality characteristics of the system. Our quality model is organized in a four-level decompositional structure. At the highest level we have introduced a new use: development. At the second level of our quality model we propose purposes of modeling. We relate these purposes to different phases in the life-cycle of the product. The set of important purposes differs per project phase, such that the quality model can be tailored according to the purposes of the current phase. The lowest level of the model is related to a set of metrics and rules that quantify the properties inherent to UML models.

In further studies the relations between metrics and rules and the quality characteristics, as well as those between the different layers of the quality model, shall be empirically investigated.


Chapter 4

Defects in UML Models

This chapter reports on an industrial case study. The purpose of the case study is to investigate quality problems that occur in real-world modeling, in particular with respect to defect containment. We analyzed 16 industrial UML models using established model analysis tooling. The analysis results were discussed with the model developers to obtain their feedback, interpretations and possible additional findings. The results of the study show a large number of defects occurring in industrial practice. Additionally, we discuss context factors of the models, such as model size, development tools, or team size, to explore their effects on defect occurrences.

4.1 Introduction

In Chapter 1 we identified risks to the quality of UML models. As a basis for our further investigations into the quality of UML modeling, we explore actual quality problems that occur in UML models. If we know which quality problems are common in UML modeling, we can further investigate these problems to understand their cause and effect. Based on this knowledge, techniques can be tailored to effectively prevent or resolve such quality problems. Therefore we study quality problems in industrial UML modeling in this chapter. However, due to the limited amount of resources, we limit the scope to syntactic quality, in particular defect containment. Hence, we address the following research question in this chapter, as stated in Chapter 1:

• RQ2: What is the quality of industrial UML models, in particular with respect to defect containment?


We are mainly interested in analyzing the number of occurrences of several defect types. Therefore our case study is exploratory.

We define defects as flaws in the model that impair quality attributes such as correctness, comprehensibility, consistency, non-redundancy, completeness, or unambiguity.

4.1.1 Related Work

As we discussed in Chapter 2, there exist few surveys [53][83] and case studies [7][180] that address the use of models in industrial software engineering projects. Their observations mainly address the usage of diagram types, or the factors that impair the benefits of using models, such as organizational problems, training, and tooling. We are interested in observations of a finer granularity, i.e. defect containment in UML models. There exist some studies that have analyzed UML models with respect to their defect containment. Most of these studies, however, have analyzed rather small models that were created in an academic context, such as the studies by Leung et al. [128] and Kuzniarz et al. [106]. It is questionable whether observations obtained from small, academic models are generalizable to industrial practice. Therefore studies are needed that analyze industrial models to obtain generalizable results. There exist few studies that report on industrial case studies. We will briefly discuss the related industrial studies in the sequel.

Conradi et al. [48] conducted an experiment to compare the effectiveness and efficiency of two different inspection techniques (more precisely, object-oriented reading techniques [186]). Their experiment addressed a UML model used by Ericsson, accompanied with some related artefacts such as textual descriptions of the use cases. However, the analyzed model is rather small: one class diagram with five classes and 20 interfaces, two sequence diagrams and one state diagram. The authors report the number of defects found in the model and the related artefacts. In total 25 and 31 defects were found, respectively, depending on the applied inspection technique. Besides a classification of the defects into categories such as omission, incorrect fact, or inconsistency, no details about the defects were given.

Cheng et al. [40] report higher defect numbers from two large industrial models of Siemens. The total number of defects found ranges ‘up to an order of magnitude higher than the number of classes in the model’. The authors give details about the number of detected instances of particular defect types. However, most defect types considered in the study by Cheng et al. were not specific to the UML, but were general design defects, such as ‘abstract class not inherited’, ‘number of associations above user defined threshold’, and ‘operation missing preconditions’. Hence, using the quality notions given in Chapter 3, Cheng et al. assessed system quality, whereas we are mainly interested in syntactic model quality. In contrast to their defect types, which occur in class diagrams or source code, we address defect types that occur in several diagram types of the UML and are therefore model-specific.

Berenbach [16] analyzed five industrial UML models of Siemens. Similar to Cheng et al., Berenbach uses an automated tool for defect detection. The size of the models analyzed by Berenbach ranges from several hundred classes up to 1500 classes. Only the total numbers of detected defects are reported: they range from about three times the number of classes up to almost 20 times the number of classes. In Berenbach’s analysis both UML-specific and system defects are analyzed. However, no detailed information is given on the occurrences of particular defect types.

To build a knowledge base for further improvements of quality assurance techniques for the UML, it is, however, important to have detailed information about the occurrences of defect types. In some of the existing work, either only the detected defect types without the number of occurrences are reported, or only the total number of detected defects is reported. In our exploratory case study we will report detailed occurrence numbers of several UML-specific defect types.

4.1.2 Structure of this Chapter

This chapter is structured as follows: in Section 4.2 we describe the design and operation of our case study, in Section 4.3 we present and discuss the results, and Section 4.4 concludes this chapter.

4.2 Description

We are interested in exploring defects in real-world UML models. The context of UML models consists of many factors that potentially influence the quality of the models. The factors include for example the purpose of modeling, the abstraction level, the size of the model, the tool used, the development process, and many other factors. Hence, the number of variables influencing the quality of UML models is very large. For exploring defects in real-world models, there are too many variables to be controlled in experiments. Therefore we decided to choose the case study as our research method. In this section we will describe the design of our case study according to Yin [201], we will describe the approach used to conduct the case study, and we will describe the set of defect types that we studied.

4.2.1 Design of the Case Study

Since the purpose of our case study is to explore the quality of UML models, the type of the case study is exploratory. Hence, the case study aims at providing knowledge as a basis to develop further theories. Generally, for an exploratory case study no propositions are stated in the design of the study. Possible units of analysis for our case study are amongst others software projects, models, or model repositories. We select the UML model as the unit of analysis, as it fits best with our research question. We already mentioned that the number of variables in the context of UML models is large. Since our study is exploratory, we are interested in building a broader knowledge base, i.e. studying models that differ in context. Hence, we conduct a so-called multiple case study as opposed to a single case study, where only a single model (typically an extreme case) would be studied. In the selection of models for inclusion in our case study, we take into account that the context variables of the models are diverse. In the discussion of the case studies in the following section we give as much detail about model and context variables as possible.

4.2.2 Approach

We are interested in the quality of UML models in practice. Therefore we needed to obtain industrial UML models and assess their quality. Here we describe how we obtained the models and how we assessed their quality.

Obtaining software artefacts, i.e. models, from industry is for several reasons a challenging task. First, a person responsible for a model in a company must be found and contacted. Then the person must be convinced that he or she can trust that we adhere to the company’s confidentiality requirements. Additionally, the person must be convinced that the effort of providing us the model and the related information is worth it, i.e. preferably there should be a benefit for the company. And finally, the obtained model and related information should suit the goals of the case study.

During our project we contacted several industrial practitioners in the Netherlands for collaboration in the case studies. Additionally, we were contacted by some practitioners who found out about our activities by reading our publications or website. The approach of the case study was as follows (note that after each of the first two steps a decision was taken with both parties whether to proceed):

1. Initiation. In one or more meetings we informed the potential collaboration partner about the goal of the collaboration, about the required information, about his required involvement, and about his potential benefits (i.e. feedback about the quality of his model). Questions and concerns (often related to issues regarding the company’s intellectual property) were discussed. Additionally, we collected context information about the model, such as its purpose. Note that we tried in all cases to obtain artefacts and data related to the available model, such as source code, bug reports, or effort data. However, in most cases we were unable to obtain this additional data.


2. Selection. To obtain meaningful results from the case studies it is necessary that the models satisfy certain selection criteria. Therefore we decided, based on a model’s characteristics and based on context information obtained in the initiation phase, whether we would include the model in this study. To select a model for our study, we applied the following criteria:

• It must be used in practice, hence, it is not only a ‘toy’ model,

• its size must be reasonably large (at least 30 classes),

• its context is such that we have a variation of context variables for all models in our study,

• it is not obviously damaged (e.g. due to errors in transformation or configuration management), and

• it can be transformed into a format that is readable by our tooling.

Typically, after mutual consent in the initiation and a positive outcome of the selection, a non-disclosure agreement was signed.

3. Analysis. We analyzed the model mainly using the tools SDMetrics [198] and our SAAT tool that was developed in our research group [146][111]. The analysis consisted of checking rules to detect defects and collection of metrics. For some models, additional detailed inspections were conducted manually. Most analysis sessions were conducted on the site of the industrial collaboration partner. Access to the models was only provided for the analysis session. Typically, an analysis session took up to one day.

4. Feedback and Discussion. A report was written containing the measurement results and our interpretations. The model owner received the report and typically a feedback session was held. In the feedback session we presented our findings. Clarifying questions from both the model owner and us were discussed in the feedback session. In some companies one representative participated in the feedback session (usually the expert of the analyzed model), and in other companies entire teams that worked with the model participated in the session. Note that for a few models no feedback session could be held, because the company could not participate in the meeting due to time constraints.

5. Possible Follow-up. For some case studies, both parties agreed upon a follow-up, to analyze a related model or a different model from within the company.

4.2.3 Set of Defect Types

The defect types analyzed in this study take into account a subset of the UML diagrams: class diagrams, sequence diagrams and use case diagrams.


The following descriptions consist of a name, an abbreviation (which is used in the sequel), and a brief informal description of the defect. The effects of most of these defects are studied in Chapter 5. Note that the abbreviations are created from the first characters of the main words in the descriptive name. An exception is the capital letter ‘E’, which stands for ‘Message’ (since ‘M’ stands for Method). No strict scheme is applied.

Use Case without Sequence Diagram (UCnSD)
Sequence diagrams illustrate scenarios of use cases. Hence, the classes instantiated by a particular sequence diagram contribute to the functionality needed for the corresponding use case. This defect exists if there is a use case that is not illustrated by any sequence diagram.

Multiple definitions of Classes with equal Names (Cm)
This defect occurs if in a single model more than one class has the same name. The different classes with the same name may be defined in the same diagram or in different diagrams.

Abstract Classes in Sequence Diagram (ACSD)
Many languages disallow instantiation of abstract classes. Based on this assumption an instantiation of an abstract class as an object in a sequence diagram can be regarded as a defect.

Classes without Methods (CnoM)
A class usually provides its functionality via its methods. Hence, a class without methods in fact does not provide functionality. (Note that a class without methods but with attributes could be a data class whose data could be accessed directly. However, that is regarded as poor programming style, hence, we do not consider this special case.)

Method not called in Sequence Diagram (MnSD)
This defect occurs if there is a method of a class that is not called as a message in any sequence diagram.

Class not instantiated in Sequence Diagram (CnSD)
The objects in sequence diagrams should be instantiations of classes. If a class that is defined in a class diagram of the model is not instantiated in any sequence diagram, this defect is present.

Interface not instantiated in Sequence Diagram (InSD)
Similar to the previous defect type, but for instantiations of interfaces.

Objects without Name (OnN)
In sequence diagrams objects are instantiations of classes. The objects should be annotated with a role name that describes their role in the interaction. In case no role name describes the object, this defect is present.

Object has no Class in Class Diagram (CnCD)
This defect occurs if there is an object in a sequence diagram and no corresponding class is defined in any class diagram.

Message without Method (EnM)
A message from one object to another means that the first object calls a method that is provided by the second object. The name annotating the message ideally corresponds to the name of the called method. In case there is no correspondence between the message name and a provided method name, this defect is present.

Message without Name (EnN)
In sequence diagrams objects exchange messages. The arrows representing the messages should be annotated with a name that describes the message. In case no name describes the message, this defect is present.
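To make the rule-based nature of these defect types concrete, the sketch below shows how two of them (OnN and EnM) could be checked automatically over a minimal model representation. The actual checks in this study were run with SDMetrics and our SAAT tool (see Section 4.2.2); the code here is illustrative only, and all type and function names are ours, not those of the tools.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class UmlClass:
        name: str
        methods: set = field(default_factory=set)

    @dataclass
    class SeqObject:
        role_name: Optional[str]    # None means the object is unnamed (OnN)
        class_name: Optional[str]   # None means no class in any class diagram (CnCD)

    @dataclass
    class SeqMessage:
        name: Optional[str]         # None means the message is unnamed (EnN)
        receiver: SeqObject

    def check_OnN(objects):
        # Objects without Name: sequence-diagram objects lacking a role name.
        return [o for o in objects if not o.role_name]

    def check_EnM(messages, classes):
        # Message without Method: named messages whose name matches no method
        # of the receiving object's class.
        methods_of = {c.name: c.methods for c in classes}
        return [m for m in messages
                if m.name and m.name not in methods_of.get(m.receiver.class_name, set())]

    # Toy example: an unnamed object and a message calling a non-existent method.
    atm = UmlClass("ATM", {"getCardInserted", "acknowledge"})
    obj = SeqObject(role_name=None, class_name="ATM")
    msg = SeqMessage(name="open", receiver=obj)
    print(len(check_OnN([obj])), len(check_EnM([msg], [atm])))   # prints: 1 1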

4.3 Results

In this section we describe the analyzed models and we discuss the observed results. The reported study was conducted between November 2003 and November 2005. Based on the aforementioned selection criteria we selected 16 models. Due to the non-disclosure agreements with the providers of the models, the names of the models are made anonymous and only limited context information, such as application domain, can be provided.

Table 4.1 summarizes the descriptive variables of the analyzed models. The naming of the models is such that related models are identified by the same character. They can be distinguished by different numbers. Models with the same character stem from the same organization. Models without a number in their name do not relate to any other of the analyzed models. The relations between individual models will be described in the sequel. Table 4.1 characterizes the models by providing their size in terms of various types of model elements. Additionally, the table provides information about the modeling tool, the application domain, the main purpose of each model, and the number of persons involved in the creation of the model.

Table 4.2 shows the results of the defect analysis of the models. For each model the percentages of occurrences per defect type are given. The percentages indicate how many of the model elements that the particular defect type is based on are affected by an occurrence of the defect. E.g. the percentage for A1 and the defect type ‘Objects without Name’ is 52.0%. This means that 52.0% of all objects in the model do not have a name. A low percentage reflects good quality, whereas a high percentage indicates possible quality problems. Additionally, we provide descriptive statistics (minimum, maximum, mean, standard deviation) for each defect type. Note that for some models not all defects are reported. The symbol [ indicates that the data is not available. During the two-year period we have extended our set of defect detection rules based on the experiences from the earlier analyses. Since we no longer had access to the earlier models, we could not measure some of the defect types for all models. The symbol \ indicates that the particular metric is not applicable. The reason for this is that the element on which the defect type is based does not occur in the model (e.g. in case there are no methods defined in a model, it does not make sense to report the number of methods that are not called in a sequence diagram).

4.3.1 Analysis of the Models

In the sequel we will discuss the models individually. For some models we have additional context information that is not given in the tables. We will report this additional information, interpret the analysis results, and provide observations from the feedback discussions with the model providers.

Models A1 and A2

The models A1 and A2 describe different subsystems of a medical imaging system. Both models belong to the same release. The two models were built by different designers within the same development team. The size of the models A1 and A2 is medium; in terms of number of classes they have 142 and 168 classes, respectively. The purpose of the models was to provide a specification of the system for implementation. No quality assurance was applied during development of the models. The development process is not known. Many context factors, such as modeling tool, level of abstraction, purpose, or size are the same or similar for both models. However, in Table 4.2 there are relatively large differences in defect occurrences between the two models. We expect that developer characteristics, such as skill and training, are the cause of these differences. Unfortunately, no feedback discussion was possible with the developers of these models.

Models B1 and B2

The models B1 and B2 describe the same system. Model B2 is a later version; however, it was developed from scratch and no model elements from B1 were reused. The models describe an embedded controller at the analysis level (as described by Jacobson [90]). Their size is small (34 and 69 classes) for industrial models, but typical for analysis models (as for G1 and H1). The developer chose to focus on identifying the structure only, and therefore defining details such as methods and attributes was postponed. The large difference in terms of number of sequence diagrams and related elements drew our attention. Based on this observation the model developer discovered during the feedback session that model B2 contained a few scratch sequence diagrams. The model was supposed to have no sequence diagrams at all, but the wrong version of the model was checked into the configuration management system. No quality assurance was applied during development of the models. The development process is not known.


Table 4.1. Variables describing the UML models

Name                    A1    A2   B1   B2   C1   C2     D     E   F1   F2   F3   G1   G2   H1   H2    H3
Classes                142   168   34   69   45   46   538   443   48   65   73   59  395   56  250   338
Methods                340   406    0    0  142  180  1803   618  116  196  233  160  605    0  163  2252
Interfaces              34    64    0   28   30   31    44    19   11   15   16   22   92    0   10   338
Packages               245   286   54   49   15   15   108    97   63   61   71   18   53   10   46   396
Associations           161   254   30  139   48   52   637   128   55   90  188   22  261   95  201   832
Inheritance relations   36   113   12   29   28   25   355    92   21   39   51  156  384   24  100   208
Use Cases              250    21   32   49   11   11   100   105   13   25   45   16   27    0   21    74
Sequence Diagrams       30    65   46    7   20   20   178   418    7   16   19   11   32   39   25   209
Objects                200   544  297   30  103  100   939  1497   45   81  118   88  234  101  164   874
Messages               613   853  705   51  210  219  1730  2982   60   96  120  143  481  216  324  2015
Persons                  1     1    1    1    2    2    10     5    1    1    1    4    6    4    3     8

[The table’s rows for Attributes and for the per-model marks of UML tool (Rational Rose, Rational XDE, Borland Together), application domain (imaging system, embedded controller, consumer electronics, information system, video surveillance), and main purpose (implementation, analysis, comprehension, communication) could not be recovered from the source; each model’s domain and purpose are described in Section 4.3.1.]

Table 4.2. Results of the defect analysis ([ = data not available, \ = rule not applicable)

ID     Name                     A1      A2      B1      B2      C1      C2      D       E       F1      F2
UCnSD  UC without SD            [       [       [       [       [       [       100.0%  [       100.0%  64.0%
Cm     Multiple def’s of Class  [       [       [       [       [       [       0.0%    3.6%    58.3%   43.1%
ACSD   Abstract Classes in SD   [       [       [       [       [       [       0.0%    0.0%    0.0%    0.0%
CnoM   Classes without Meth.    45.8%   51.2%   100.0%  100.0%  20.0%   23.9%   66.5%   78.6%   29.2%   24.6%
MnSD   Methods not in SD        67.6%   77.6%   \       \       40.1%   55.0%   8.7%    23.3%   67.2%   61.7%
CnSD   Classes not in SD        46.5%   59.5%   35.3%   84.1%   42.2%   43.5%   67.3%   30.2%   68.8%   64.6%
InSD   Interfaces not in SD     100.0%  87.5%   \       100.0%  70.0%   71.0%   54.5%   21.1%   0.0%    0.0%
CnCD   Objects without Class    [       [       [       [       [       [       3.9%    6.5%    0.0%    0.0%
OnN    Objects without Name     52.0%   61.6%   91.9%   86.7%   76.7%   75.0%   21.4%   0.6%    0.0%    0.0%
EnM    Message without Meth.    71.9%   79.4%   \       \       81.9%   81.7%   25.4%   24.3%   15.0%   9.4%
EnN    Message without Name     0.0%    0.0%    0.3%    0.0%    0.0%    0.5%    0.5%    14.6%   0.0%    0.0%

ID     Name                     F3      G1      G2      H1      H2      H3      Min     Max     Mean    StDev
UCnSD  UC without SD            42.2%   100.0%  100.0%  [       [       [       42.2%   100.0%  84.4%   25.17
Cm     Multiple def’s of Class  0.0%    10.2%   23.3%   14.3%   48.8%   42.6%   0.0%    58.3%   24.4%   21.99
ACSD   Abstract Classes in SD   0.0%    5.1%    0.0%    [       [       [       0.0%    5.1%    0.7%    1.92
CnoM   Classes without Meth.    17.8%   64.4%   55.9%   100.0%  98.8%   29.3%   17.8%   100.0%  56.6%   31.26
MnSD   Methods not in SD        58.4%   58.1%   38.3%   \       28.2%   78.5%   8.7%    78.5%   51.0%   21.59
CnSD   Classes not in SD        61.6%   74.6%   83.8%   32.1%   98.4%   71.6%   30.2%   98.4%   60.3%   20.27
InSD   Interfaces not in SD     0.0%    100.0%  97.8%   \       0.0%    100.0%  0.0%    100.0%  57.3%   43.55
CnCD   Objects without Class    0.0%    0.0%    0.0%    76.2%   3.0%    72.9%   0.0%    76.2%   16.3%   30.82
OnN    Objects without Name     0.0%    19.3%   71.4%   83.2%   62.8%   32.5%   0.0%    91.9%   45.9%   34.68
EnM    Message without Meth.    8.3%    22.4%   19.3%   \       90.7%   76.2%   8.3%    90.7%   31.7%   48.4
EnN    Message without Name     0.0%    2.8%    5.8%    0.0%    0.0%    0.0%    0.0%    14.6%   1.5%    3.82


Models C1 and C2

The models C1 and C2 describe an embedded controller, similar to the system described by B1 and B2. Model C2 is the final version of the model and is the successor version of C1 after adding some details. The main purpose of the models is to serve as documentation for comprehension of the system. In contrast to all other models in this case study, the project that delivered C1 and C2 was off the critical path; hence, the time pressure was less intense than for the other models. However, no quality assurance techniques such as inspections or similar were applied to C1 and C2. Despite the absence of time pressure, the level of defect occurrences is not structurally lower than for the other analyzed models. In the feedback discussion all detected defects were acknowledged by the developers and the feedback was regarded useful for improving the quality of future models.

Model D

Model D describes the embedded software of a consumer electronics device. The model is a detailed description of the system. The main purpose of the model is to support communication within the project team during development of the system. The project mainly aims at improving the performance of the system by future modifications. The model was developed by 10 people over a period of two years. With 538 classes it is the largest model analyzed in our case study. None of the 100 use cases in the model is connected to sequence diagrams. Therefore no traceability is possible from use cases via sequence diagrams to classes. We observed that about two thirds of the classes are not instantiated and two thirds of the classes do not have methods. A deeper analysis showed that most of the classes without methods are not instantiated in sequence diagrams. In the feedback session the lead architect of the project acknowledged that this is on purpose. Most of the identified classes are contained in a part of the model that was intended as a higher-level model. Hence, in fact two descriptions of the same system at different abstraction levels were contained in the model. No quality assurance was applied during development of the model. The development process is not known.

Model E

Model E describes an information system. The model is a detailed description of the system. Its purpose is to specify the system as a basis for the implementation. The model has 418 sequence diagrams. Hence, the interactions of the system are extensively described. This is also reflected in the sequence-diagram-related defects: the ratio of classes and methods not occurring in sequence diagrams is relatively low compared to the other analyzed models. Still, the percentage of messages without method is relatively high, about 25%. Unfortunately, no feedback discussion was possible to discuss the findings with the developers.


Models F1, F2 and F3

The models F1, F2 and F3 describe an embedded controller. The models are subsequent versions during development. The status of version F3 is almost final. Versions F2 and F1 are approximately one and two months older, respectively. The models are high-level descriptions of the system and the purpose of the models is to communicate the system architecture within the organization. One person developed the models. An ad-hoc development process was used to create the models, whose size is small, but typical for high-level models. Quality assurance was applied in the sense that we analyzed both F1 and F2 immediately after they were finished. We reported our findings in feedback sessions to the developer shortly after we obtained the models. This feedback is a possible reason for the reduction of defects in the subsequent versions. For some defects, such as ‘UC without SD’ or ‘Multiple Class Definitions with equal Names’, the improvement was very large. For most defect types the percentages are below the average of all models.

Models G1 and G2

Models G1 and G2 describe a subsystem of the same embedded controller as F1-F3. G1 and G2 are developed within the same project team but with different goals: model G1 is an analysis model, whereas model G2 serves as specification of the system for implementation. Therefore the size and the level of detail of the models differ: G1 is smaller and its level of abstraction is higher than that of G2. Four persons worked on G1. After finishing G1, two persons joined the team and the entire team worked on G2, which was delivered about six months after G1. However, the version of G2 that we received was not the final version of the model. Despite differences in defect occurrences between G1 and G2, the distribution of the percentages over the defect types is similar for both models. This is interesting because all observed context variables except for level of detail and purpose of modeling remained unchanged. Hence, in this case the level of detail and the purpose of modeling did not significantly affect the distribution of the observed defects.

Models H1, H2 and H3

Models H1, H2 and H3 describe the control software of a video surveillance system. The three models differ in their level of abstraction and in their purpose: H1 is a high-level analysis model, H2 is more detailed and is mainly used for communication, and H3 is a detailed model for specifying the implementation. Accordingly, the models differ in size and in the number of persons working on them, as can be seen in Table 4.1. No quality assurance was applied. The development in the project team was organized using the Rational Unified Process (RUP [103]). It is noticeable that H2 has very high percentages for five defect types. In the feedback discussion it turned out that the development of H2 suffered most under time pressure. The developers were aware of the fact that not enough rigor was applied during the development of H2. H3 was reported to be the most extensively used model of the three. Additionally, it was discovered in the discussion that a large amount of defects could be traced back to two individual developers. Hence, the factor skill was identified as an influential context variable in this case. Regarding the classes that are not instantiated in sequence diagrams, the developers argued that ‘trivial’ sequence diagrams would not be modeled.

4.3.2 Observations

Besides the aforementioned observations about individual models, we made observations concerning several models. The most frequent defect types are related to traceability between different diagram types. These defect types are ‘Use Case without Sequence Diagram’ (with a mean of 84.4%), ‘Methods not in Sequence Diagram’ (51.0%), ‘Classes not instantiated in Sequence Diagrams’ (60.3%), and ‘Interfaces not instantiated in Sequence Diagram’ (57.3%). Chapter 8 addresses the role of the relationships between different diagrams for comprehension. Another defect type that was detected frequently is ‘Classes without Methods’ (56.6%). The results indicate that early models, which are typically at a high abstraction level, have more classes without methods than models at a lower level of abstraction. As not all details are known at an early stage, the absence of methods is not necessarily a defect. However, in later stages, where a model is used to describe a detailed design and is used as a basis for implementation, the absence of methods is a defect.

In several models we detected default names for model elements, such as the name ‘Class’ for classes, or ‘Method’ for methods. The cause of this problem is that UML tools suggest these default names and, due to carelessness of developers, the names are not changed to more meaningful names. This problem was detected as the defect ‘Multiple Class Definitions with equal Names’. However, a deeper analysis revealed that also single occurrences of default names exist. The problem of carelessness of developers in dealing with default suggestions of tools is broader than naming alone. Spot checks in the models and discussions with the developers revealed that this problem also affects structural properties of a system, such as the visibility of class members or the cardinality of association ends. Default names mainly affect the comprehensibility of the model, but problems regarding structural properties directly lead to unintended implementations in case the model is used for code generation. A naive check for such leftover default names is sketched below.
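As an illustration of how simple such a check can be, the following sketch flags names that were apparently left at a tool default. The default-name pattern is an assumption of ours; each modeling tool suggests its own defaults (often numbered variants such as ‘Class1’).

    import re

    # Assumed tool-default pattern: names like "Class", "Class1", "Method2", ...
    DEFAULT_NAME = re.compile(r"^(Class|Method|Attribute|Object)\d*$")

    def default_named(names):
        # Return the element names that were apparently never changed by the developer.
        return [n for n in names if DEFAULT_NAME.match(n)]

    print(default_named(["Class", "Class2", "Account", "Method1"]))
    # -> ['Class', 'Class2', 'Method1']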

Another problem related to naming found in some models is that several variants are used to classify model elements. For example, we found in one model (G2) five ways to indicate an interface: classes with prefix naming (e.g. iname), classes with postfix naming (e.g. nameIF and nameInterface), classes with a stereotype ‘interface’, and use of the UML model element ‘Interface’. Besides comprehensibility, this problem also affects automated transformation and automated analysis: all variations must be known to tune a tool such that all occurrences of the classified model elements are detected. Graaf et al. [81][82] also encountered this problem in a project on model transformation. They addressed it by a manual normalization step. This problem is caused by the absence of clear conventions and a lack of rigor.

In some of the feedback sessions for models that were created by more than one person, it was confirmed that parts of models with different defect densities were developed by different persons. Hence, the influence of the human factor on quality was confirmed. However, it could not be analyzed whether skill, training, motivation or another factor caused the differences.

4.3.3 Threats to Validity

First we address the validity of the observations. In the reported multiple case study we have measured the size of models in terms of element counts and the number of occurrences of defects. The measurement was done using solid and established tooling. Since the measured concepts are simple counts, we argue that there is no threat to the construct validity concerning the measurements. In addition to the measurements, we obtained information from discussions with the developers of the models. To obtain context information about the models (e.g. number of team members) we asked a predefined set of questions. The discussion of the analysis results was structured according to the analysis report. Additionally, in all discussions we were accompanied by at least one of our colleagues, such that the risk of misunderstandings and loss of information was minimized. Given the described precautions, we argue that there is no threat to the construct validity concerning the information obtained from discussions with developers.

Now we address the validity of the observations that can be drawn from the case studies. The external validity concerns the level to which the results of a study are generalizable to a broader context. A possible threat to the external validity is the bias towards Dutch companies. All models were developed in the Netherlands. However, no differences between software engineering practices in the Netherlands and in other industrial countries are known. Furthermore, most models stem from large international companies that collaborate with divisions in other countries. Hence, we conclude that the geographical focus of this study does not compromise the external validity. Furthermore, our case study consists of 16 analyzed models. This is rather large for a case study. The selection criteria for the case studies ensured that the set of models was heterogeneous with respect to several factors, such as model size or modeling tool. Hence, we argue that there is no threat to external validity based on focussing on a too specific case. A possible limitation of the generalizability is the fact that only a limited set of defect types was analyzed, and that some purposes of modeling are missing in the set of models (e.g. code generation, or reverse engineered models).

A threat to conclusion validity is the lack of statistical inference testing. Due to the exploratory nature of the case study, and to increase external validity, we chose a heterogeneous set of models. Hence, there is a large degree of variation in the variables that describe the models and their context. Therefore we have not conducted statistical inference testing, as advised by Miller [141]. According to Yin [201], internal validity does not apply to exploratory case studies.

4.4 Conclusions

We have explored the quality of UML modeling in practice, in particular with respect to defect containment, in a multiple case study. In the case study we have analyzed 16 industrial UML models and we discussed our findings in feedback sessions with the developers of the models. As the study was exploratory, the models were selected such that we analyzed a heterogeneous set of models that differed in size, modeling tool, application domain, purpose, team size and other factors. We analyzed the models with respect to the occurrence of a set of defect types, and per defect type we report the frequency of occurrence. Moreover, we described additional problems and interpretations found during the feedback discussions with the model developers.

In the case study we found that the number of defects in industrial UML models is alarmingly large. A particular contribution is the quantification of the occurrences per defect type. Therefore prevention techniques such as modeling conventions, training, and tooling can be adjusted to focus on common defect types, such as multiple definitions of classes or use cases under the same name, large numbers of classes and interfaces without methods, messages in sequence diagrams that do not correspond to methods in class diagrams, or messages without names.

Additionally we have explored context factors and their relation to defect occurrences. We found slight indications for the influence of the following factors on defect occurrences:

• Time pressure. In the collection of the models H1, H2 and H3, the amount of defects in H2 was very large compared to the other models. The developers reported that the time pressure for H2 was very high.

• Human factor. In A1 and A2, as well as in H1, H2, and H3, differences in quality were observed between model parts that were developed by different developers.

• Quality assurance. F1, F2 and F3 is the only collection of models where quality assurance was applied in the sense that feedback was given after each version. We observed an improvement of the quality in subsequent versions. (Quality assurance in the form of modeling conventions is studied in detail in Chapter 6.)

No indications were found for the influence of the following factors on model quality: purpose, level of detail, application domain, model size, modeling tool, and team size.

However, these observations are provisional because of possible interactions between factors and possible other factors that could not be observed. The data and observations of this exploratory study can be used as a basis for future studies. Future studies should investigate whether there are causal relationships between the aforementioned factors and the occurrences of defects in UML models. Knowledge about such relationships could be used to improve quality control by preventing causes of defects or by focussing defect removal actions on defects that are likely to occur. For studies that focus on the relationship between particular factors and defect occurrences, more control over the factors is needed.


Chapter 5

Effects of Defects

In the previous chapter we have shown that industrial UML models contain a large number of defects. These models are used as a basis for implementation and maintenance. In this chapter we investigate to what extent implementers detect defects and to what extent defects cause different interpretations by different readers. We performed two controlled experiments with a large group of students (111) and a group of industrial practitioners (48). The experiments’ results show that defects often remain undetected and cause misinterpretations. We present a classification of defect types based on a ranking of detection rate and risk of misinterpretation. Additionally we observed effects of using domain knowledge to compensate for defects. The results can be used for improving quality assurance techniques for UML-based development.

5.1 Introduction

Notwithstanding advances in modern UML CASE tools and techniques, the number of defects that remain undetected in practice is alarmingly large. This is reported in the previous chapter, where a large amount of defects was found in several industrial UML models.

UML models are used for describing solutions, analyzing their properties, and as a means for communication between stakeholders. Defects in UML models are likely to affect these uses, but no research on such effects is known. Therefore this chapter describes our study on the effects of defects in UML models. We address research question RQ3, as stated in Chapter 1:

• RQ3: What is the effect of defects in UML models, in particular with respect to detection and misinterpretation during the implementation phase?


Table 5.1. GQM template

Analyze                     defects in UML models
for the purpose of          identifying risks
with respect to             detection and misinterpretation
from the point of view of   the researcher
in the context of           masters students at the TU Eindhoven and professionals

In terms of the quality notions defined in Chapter 3 we study whether syntactic quality of a model affects its social quality. We refine the research question into two local research questions (LRQ):

• LRQ1: Are defects in UML models detected by implementers?

• LRQ2: How do undetected defects impact the interpretation of the model by different implementers?

The goal is summarized by the GQM template [14] (Table 5.1).

To answer these questions we have conducted an experiment with 111 students and replicated the experiment with 48 professionals. Section 5.2 discusses related work. Section 5.3 explains the defect types that we have investigated. In Section 5.4 the experimental design, its execution and the two groups of subjects are described. The results and major findings of the experiment are presented in Section 5.5. Additional observations are discussed in Section 5.6. Threats to the validity of the experiment are discussed in Section 5.7. Concluding remarks and future work are given in Section 5.8.

5.2 Related Work

There exists only limited research on the effects of defects that occur in UML models at the stage of implementing the system. Therefore we have widened our literature review to defect detection techniques and defect classification approaches. This enabled us to compare our work to previous research.

Software inspection [65] is an efficient and effective means for quality improvement in software engineering. Defect detection is, besides planning, defect collection and defect correction, a central activity in software inspection [109]. In inspections, so-called reading techniques are applied for defect detection. The reading techniques most applied in industry are ad hoc reading and checklist-based reading (CBR) [65, 109]. Ad hoc reading is not structured and does not provide the inspector with any advice on how to proceed, whereas in CBR checklists use yes/no questions to guide the inspector. Fagan [66] and Gilb and Graham [78] claim that inspection leads to the detection of 50% to 90% of the defects in a software document. These results stem from the time before the UML existed.

Few results exist where inspection techniques for UML and related modeling techniques are analyzed. Since UML is a modeling language based on object-oriented concepts, the more recent object-oriented reading techniques (OORT) such as scenario-based reading (SBR) [13] seem more applicable for inspection of UML models. Scenario-based reading techniques provide the inspector with so-called scenarios that describe what must be checked and how to perform a check. One of the specializations of scenario-based reading techniques is perspective-based reading (PBR) [15], where the scenarios are based on a particular stakeholder’s perspective. Example perspectives are the tester’s perspective, the developer’s perspective, or the user’s perspective.

The literature review shows that initial empirical research has been done on applying reading techniques to UML models. Laitenberger et al. [108] conducted an experiment with 18 practitioners as subjects to compare checklist-based reading with perspective-based reading for UML (both reading techniques applied in teams). The members of each team that applied PBR took the perspectives of designer, implementer and tester. The results show that perspective-based reading leads to detecting 41% more defects than checklist-based reading, and that perspective-based reading is less cost-intensive than checklist-based reading. The defect detection effectiveness using perspective-based reading is between 45% and 75%.

Wohlin and Aurum [195] conducted an experiment with 486 students as subjects to evaluate checklist-based reading for entity-relationship diagrams, which are comparable to UML class diagrams. The results show that the median effectiveness is 46%. Conradi et al. [48] investigated the cost-efficiency of reading techniques in an industrial experiment. Since their measurements are based on industrial UML models and the total number of defects is not known, they cannot report defect detection effectiveness like the other studies. However, they report cost-effectiveness in terms of defects detected per person hour (d/ph). For individual readers the cost-effectiveness is around 1.7 d/ph and for team meetings it is less than 1 d/ph. The range of the reported values for the effectiveness of defect detection is large. However, even in the best reported case, 25% of the defects remain undetected and propagate to later phases. The experiences from our industrial case studies (Chapter 4 and [112]) show that a large number of defects still exist in industrial UML models and that not all organizations use defect detection techniques for UML models. Therefore an investigation, as described in this chapter, of the effects of defects in UML models that propagate to later phases is needed. Our study does not investigate inspection or OORTs, but the effect of defects that outlive inspection.

Defect classification is a means to assign priorities to defects to enable cost-effective resource usage to fix defects (‘most severe defects first’). Defect classifications are subjective [62] and many organizations use simple categories such as Minor, Major or Severe [92]. Since classifications are subjective, there are approaches that provide guidelines to enable repeatable defect classification, such as IBM’s Orthogonal Defect Classification (ODC) [43] and an improved variant of it by El Emam et al. [62]. These approaches are not specifically validated for UML defects. The aforementioned study by Wohlin and Aurum [195] is the only work found reporting a defect classification for ER-diagram defects. However, the classification reported there is subjective and no guideline is given on how it can be repeated. There is no literature known to the author describing a defect classification based on empirical validation.

5.3 Defect Types

If overlapping parts of diagrams conflict, then the model contains a consistency defect. The defect types analyzed in this study take into account a subset of the UML diagrams: class diagrams, sequence diagrams and use case diagrams.

In this chapter we analyze the effects of the following defects, which were described in Chapter 4 and have been found in the industrial case studies reported in that chapter:

• Message without Name (EnN)

• Message without Method (EnM)

• Class not instantiated in Sequence Diagram (CnSD)

• Object has no Class in Class Diagram (CnCD)

• Use Case without Sequence Diagram (UCnSD)

• Multiple Definitions of Classes with equal Names (Cm)

• Method not called in Sequence Diagram (MnSD)

Additionally we study the effects of the defect type:

• Message in the wrong Direction (ED)
This inconsistency occurs if there is a message from an object of class A to an object of class B, but the method corresponding to the message is a member of class A instead of class B. This is a special instance of “Message name does not correspond to Method”.


5.4 Experiment Design

We conducted two similar experiments. The subjects of the first experiment were students and the subjects of the second experiment were professionals. In this section the experiments are described according to Wohlin et al. [196]. Most parts are the same for both experiments. The differences are explained where applicable.

5.4.1 Design

The treatment in this experiment is the defect injection and the different levels are “no defect” and the eight defect types defined in Section 5.3.

We assume that the subjects are not influenced in successor questions by treatments of previous questions; therefore the experiment is designed as a nested same-subject design, i.e. all subjects are exposed to all treatment levels. Hence, the design is by definition balanced.

5.4.2 Objects and Task

The subjects were given fragments of UML models. A fragment consists of two or three diagrams. For each fragment the subjects had to answer a multiple-choice question that asked how they would implement the system given the UML model fragments. Subjects were not asked to inspect the model according to a particular reading technique, but to answer the question from the perspective of a person who has to implement the system. Therefore they implicitly followed an ad hoc reading approach.

For each of the eight selected defect types that were presented in Section 5.3 we constructed a UML model fragment that contains an instance of the defect type. With each fragment we presented a question that focusses on a specific aspect of the model fragment. For each multiple-choice question there were four possible answers that represent four interpretations. The questions about model fragments containing an injected defect were paired with a similar control question that focusses on the same aspect but whose model fragment does not contain a defect. The control questions served in the analysis of the results to compare the answer behavior in case of a defect to the ideal case without a defect.

The four answer options that were provided with each question were designed according to the following schema:

For questions about a defected model: A defect between two or more diagrams means that there is conflicting information between the diagrams. The answer options are therefore designed such that for each of the diagrams there is at least one answer that corresponds to the system as described in that diagram. If possible, one answer option is a combination of the given diagrams, and at least one answer option is incorrect with respect to all given diagrams. This is illustrated in the example in Table 5.2. The table shows question Q2 from the experiment, including the diagrams, the question, and the answer options. The critical call is message open() from object atm to object a. Answer option A corresponds to the sequence diagram, options B and D correspond to the class diagram, and option C does not correspond to any of the diagrams.

For control questions: one correct answer option, with the other three answers being incorrect.

All questions had a fifth answer option where the subjects could indicate that they detected a defect. The subjects were asked to give an explanatory motivation of their answer.

5.4.3 Subjects

Students

In total 111 students participated in the experiment. The experiment was conducted within the course “Software Architecting” at the Eindhoven University of Technology (TUE). This course is taught in the first year of the Masters program in computer science, hence all subjects hold a bachelor’s degree or equivalent. Most students have some experience in using the UML and object-oriented programming through university courses and industrial internships.

Professionals

In total 48 professionals from 18 companies in 10 countries participated in the experiment by completing the online questionnaire. Most of the professionals were contacted through our collaboration partners who work in the companies. Additionally, the URL of the experiment questionnaire was announced on newsgroups with related topics. Some of the subjects did not complete all questions, but all questions were answered by at least 27 subjects. We removed subjects who entered ‘student’ as job description or who had less than two years of work experience. The average work experience of all remaining subjects is 10.7 years. The most frequent job descriptions (of all subjects who entered a job description) were ‘architect’, ‘designer’ and ‘engineer’. In a self-assessment on a scale from 1 (no experience) to 5 (several years of experience) the professionals indicated the following: designing (average: 4.5), UML (4.3), implementing (4.0), code review (3.8), inspections (3.7) and design review (3.4).


Table 5.2. Example question

Question: Suppose you are a developer in this banking software project. It is your task to implement class ATM. Please indicate how you would implement the ATM class given these two UML diagrams. [The class diagram and sequence diagram belonging to this question are not reproduced here.]

Answer option A:

Class ATM{
    method getCardInserted(){
        c.requestPIN();
        dosomething;
        a.open()}
    method acknowledge (){
        dosomething;
        c.seeFromMenu()}}

Answer option B:

Class ATM{
    method getCardInserted(){
        c.requestPIN();
        dosomething;
        a.lock()}
    method acknowledge (){
        dosomething;
        c.seeMenu()}}

Answer option C:

Class ATM{
    method getCardInserted(){
        c.requestPIN();
        dosomething;
        a.acknowledge()}
    method acknowledge (){
        dosomething;
        c.seeMenu()}}

Answer option D:

Class ATM{
    method getCardInserted(){
        c.requestPIN();
        dosomething;
        a.validate()}
    method acknowledge (){
        dosomething;
        c.seeMenu()}}

Answer option E: No interpretation possible because of an error in the model.


5.4.4 Preparation

Prior to the experiment we conducted a pilot run to evaluate the experimental design and the experiment materials. The subjects of the pilot experiment did not participate in the actual experiment.

UML was presented and explained to the students as part of the “Software Architecting” course. In the five weeks before the experiment was conducted, the students had to develop and evaluate a UML model as part of a design assignment. Those who had only limited UML experience familiarized themselves with the UML during the assignment.

The professionals’ experiment was conducted as an online questionnaire. Besides setting up the website and the database to collect the results, no preparation was needed.

5.4.5 Operation

Student Experiment

The student experiment was conducted in two runs. The first run contained questions Q1 to Q10, the second run (five weeks later) contained questions R1 to R5. The procedure of operation was the same for both runs.

The incentive for the subjects was to gain bonus points for their grade by participating in the experiment. The subjects’ achievement on the experiment questions had no influence on the grade. The experiment conforms to the ethical guidelines proposed by Carver et al. [39].

The experiment was conducted in a classroom with the subjects spread out in an exam-like setting. The subjects were given the experiment material containing instructions, the model fragments, questions and answer options. For the experiment the subjects had 90 minutes available. The average time for completing the first run was 67 minutes, hence there was no time pressure for the subjects. For the second run the time was not collected. In addition to the written instructions we gave instructions at the beginning of the run. During the run the subjects were allowed to ask questions for clarification. The subjects were not told the goal of the experiment, to avoid biased results.

After the first run the subjects had to complete a questionnaire to assess their academic background, work experience (e.g. internships, previous jobs), experience with UML and other relevant software engineering related experience.


Professionals Experiment

Because of the professionals’ time constraints, they performed only one experiment run. The run contained questions Q1 to Q10. Since we intended to allow professionals from all organizations and from all over the world to participate in the experiment, we executed the professionals’ experiment as an online questionnaire. Subjects who preferred pen and paper for the experiment could download a printable version of the experiment material from the experiment website and fax or mail it to us. We announced the URL of the experiment website on several related newsgroups and asked industrial contacts to participate in the experiment and to forward the request to colleagues in their organization. The professionals’ experiment contained the same diagrams and questions as the first run of the student experiment. The professionals’ questionnaire also contained background questions to gain insight into the subjects’ experience (which enabled us to remove results from subjects that could not be regarded as “professionals” in the sense of this experiment).

5.4.6 Variables

The factor of interest in this study is the defect type. This variable has nine levels: the eight defect types and ‘no defect’ for control questions. Additionally we controlled the variable ‘domain knowledge’ by designing some questions with symbolic names instead of meaningful names to make observations about the effect of domain knowledge. This is not related to our main research questions and is further discussed in Section 5.6.

The subjects’ answers to the multiple-choice questions are summarized in a vector with five fields. The fields contain the frequencies of answers for the four answer options and the fifth option, which indicates an error. Based on this data we measure two dependent variables that relate to our research questions stated in Section 5.1.

In LRQ1 we are interested in whether implementers detect a defect for a given model fragment. The corresponding dependent variable is the detection rate for each question. The detection rate of a question is the number of subjects indicating that they cannot give an implementation because a defect is present, divided by the total number of subjects that answered the question.

Ideally, if a defect is present, it should be detected and no implementation should be given. Given the motivations of the subjects’ answers, we also regard the case where multiple answers are given as defect detection (by giving multiple answers the subject indicates that the underspecification or ambiguity has been detected). When no defect is present, all subjects should ideally give the same implementation.

In LRQ2 we are interested in whether undetected defects impact the interpretation of a model.


Figure 5.1. Informal explanation of the AgrM measure

An undetected defect is not necessarily problematic. In case the writer(s) of a model and all readers have the same interpretation, there is no problem. But since defects are in most cases mismatches between diagrams, it is possible that conflicting information leads to different interpretations. To measure the degree of spread of the answers over the four possible options we have developed a so-called agreement measure (short AgrM). AgrM maps the four fields of the answer vector representing the answer frequencies to a scalar. The agreement measure is 0 if answers are distributed equally over all options (maximal disagreement), and 1 if one option receives all answers (everyone agrees). AgrM is informally explained in Figure 5.1 as the opposite of entropy. AgrM is described formally in Appendix A. We use AgrM for descriptive statistics and defect classification in Section 5.5.
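To make the two dependent variables concrete, the sketch below computes a detection rate and an entropy-based agreement measure in Python. This is only an illustration under the assumption that AgrM is the complement of the normalized Shannon entropy of the answer distribution, as Figure 5.1 suggests; the authoritative definition is the one in Appendix A, and the function names are ours.

    import math

    def detection_rate(n_indicating_defect: int, n_total: int) -> float:
        # Fraction of subjects stating that no implementation can be given
        # because of a defect, over all subjects answering the question.
        return n_indicating_defect / n_total

    def agrm(frequencies: list[int]) -> float:
        # frequencies: answer counts for the four interpretation options
        # (the fifth field of the answer vector, 'error', is discarded).
        total = sum(frequencies)
        probs = [f / total for f in frequencies if f > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        max_entropy = math.log(len(frequencies))
        # 1 = all answers on one option, 0 = equal spread over all options
        return 1.0 - entropy / max_entropy

    print(agrm([44, 2, 1, 1]))     # high agreement, close to 1
    print(agrm([12, 12, 12, 12]))  # maximal disagreement: 0.0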

5.4.7 Hypotheses

Ideally, when a defect is present, it is detected by everyone (detection rate = 1) and when no defect is present, no one reports a defect (detection rate = 0). We expect that defects are not detected by everyone. This would lead to the null hypothesis ‘the detection rate for defected models = 1’ and the alternative hypothesis ‘the detection rate for defected models < 1’. However, in practice false negatives and false positives appear. False negatives reduce the maximum measured detection rate below 1. We assume that the likelihood of false positives equals the likelihood of false negatives. This justifies hypothesis H1 for LRQ1, which is tested for each defect type (d-rate_def and d-rate_control denote the detection rates for defected questions and for control questions, respectively):

• H1_0: d-rate_def = 1 − d-rate_control

• H1_alt: d-rate_def < 1 − d-rate_control

In LRQ2 we address the risk for misinterpretation caused by a defect. For LRQ2 we only consider the four fields of the answer vector that represent the answer frequencies for the four options that correspond to some choice of implementation.


The field containing the frequency of indicated errors is discarded. Ideally, everyone interprets a given model in the same way (no misinterpretation). In this case 100% of the subjects give the same answer. The results of the control questions (Table 5.3) show that even without a defect not all subjects agree upon the same interpretation (possibly due to human error or other causes). The results of the control questions show that it is reasonable to expect that 98% of the subjects agree on one interpretation, while 2% are spread over the other three options. This leads to hypothesis H2 for LRQ2, which is tested for each defect type:

• H2_0: for a defected model fragment ≥ 98% of the implementers agree upon the same interpretation.

• H2_alt: for a defected model fragment < 98% of the implementers agree upon the same interpretation.

5.4.8 Analysis Techniques

To test H1 we consider, for each defect type, the number of subjects that indicated a defect and the number of subjects that did not indicate a defect, and the same data for the corresponding control question. We applied McNemar’s test [5, 138] for marginal homogeneity. This test is similar to the χ2 test, but does not require independent samples. We have a same-subject design, hence our samples are dependent.
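For illustration, the uncorrected McNemar statistic can be computed directly from the discordant pairs of answers. This is a sketch of the standard textbook formula with hypothetical data, not the exact computation performed in the thesis, and the mapping of the paired answers onto the 2x2 table is our simplification.

    def mcnemar_chi2(pairs: list[tuple[bool, bool]]) -> float:
        # pairs: one (defect indicated on the defected question,
        #             defect indicated on the control question) per subject.
        b = sum(1 for det, ctrl in pairs if det and not ctrl)
        c = sum(1 for det, ctrl in pairs if ctrl and not det)
        # Only discordant pairs enter the statistic; compare against the
        # threshold 3.84 (chi-square, alpha = 0.05, one degree of freedom).
        return (b - c) ** 2 / (b + c)

    # Hypothetical data: 8 subjects flag only the defected question,
    # 2 flag only the control question, 3 flag both.
    pairs = [(True, False)] * 8 + [(False, True)] * 2 + [(True, True)] * 3
    print(mcnemar_chi2(pairs))  # (8 - 2)**2 / 10 = 3.6, below 3.84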

To test H2 we consider the distribution (l, r), where l is the frequency of the most prominent interpretation and r combines the frequencies of the three other interpretations. More formally: l is the maximum frequency of the four interpretation options, and r is the sum of the frequencies of the three other interpretation options. The sum s of l and r is the total number of answers of subjects not indicating an error. We construct a reference distribution (0.98 · s, 0.02 · s), which represents the ideal case where 98% of the readers agree on one interpretation. Since these distributions are independent we can use the χ2 test to test the two distributions for equality.

We used Microsoft Excel for hypothesis testing.
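As an illustration of the H2 test, the two-cell Pearson chi-square statistic against the (0.98 · s, 0.02 · s) reference distribution is simple enough to write out. The helper below is our own sketch of the computation described above, not the spreadsheet actually used.

    def chi2_vs_reference(l: int, r: int, p_ref: float = 0.98) -> float:
        # (l, r): answers for the most prominent interpretation and the
        # combined answers for the three other interpretation options.
        s = l + r
        expected = [p_ref * s, (1.0 - p_ref) * s]
        # Pearson chi-square, one degree of freedom; reject H2_0 above 3.84.
        return sum((o - e) ** 2 / e for o, e in zip([l, r], expected))

    # Example: of 30 subjects giving an interpretation, 25 agree on one option
    print(chi2_vs_reference(25, 5))  # about 32.9, so H2_0 would be rejected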

5.5 Results

The results of this experiment are shown in Table 5.3. The first column shows the identifier of the question, the second column shows the defect type (as in Section 5.3). In the following columns, S and P indicate the results from the student experiment and the professionals experiment, respectively.


The column N shows the number of subjects participating in the question. The following columns give the results for detection rate (d-rate) and agreement measure (AgrM). The column s indicates with a checkmark which questions used symbolic names in the model fragments. All questions (except Q10) are paired with a control question that does not contain a defect (the paired control question is given in the column Control Question). The results are discussed in this section.

In Tables 5.4 and 5.5 we give an objective classification of defect types based on the experiments.

5.5.1 Outlier Analysis

The results of this experiment might be biased by subjects who lack motivation or have insufficient expertise to answer the questions. Therefore the answers of such subjects should be excluded from the results. We analyzed the subjects’ answer behavior to identify subjects with the mentioned characteristics.

The answer options of the questions contained wrong answers (one wrong answer for defected questions, three wrong answers for control questions). We use the number of wrong answers per subject to identify subjects who are guessing. The first run of the student experiment and the professionals’ experiment contained 18 questions (including subquestions). 81.9% of the students gave no or one wrong answer. No student gave more than three wrong answers. The results for the professionals experiment are similar. Therefore no subject was removed. The second run of the student experiment contained 9 questions; two students gave three wrong answers (which we regarded as suspect of guessing) and were therefore removed from the results.

5.5.2 LRQ1: Defect Detection

Research question LRQ1 investigates whether implementers detect a defect in a UML model.

The data of all defect types from Table 5.3 is summarized in Figure 5.2. The boxplot shows the detection rate for questions containing a defect compared to control questions. In case of a control question the vast majority of the subjects gives an implementation and only a small fraction wrongly detects a (non-present) defect. Figure 5.2 shows that even in case of a model defect, there are more subjects that give an implementation than indicate a defect. Defect types marked with an s are based on model fragments with symbolic names.

We tested hypothesis H1 to investigate LRQ1. The test statistic χ2 of McNemar’s test is given in Table 5.3. The threshold value at a significance level of α = 0.05 and one degree of freedom is 3.84. We indicate rejection of the null hypothesis with ‘r’; failure to reject is indicated with ‘f’.


Table 5.3. Complete results (values given as S/P, where S is the student experiment and P the professionals experiment; [: not part of professionals experiment; \: no control question; ‘r’: hypothesis rejected; ‘f’: failed to reject hypothesis)

| Qu.  | Defect Type             | N      | d-rate  | AgrM    | s | Control Question | χ2 (H1)           | χ2 (H2)           |
|------|-------------------------|--------|---------|---------|---|------------------|-------------------|-------------------|
| Q1.1 | Msg. without Name       | 111/48 | .69/.60 | .47/.44 | √ | Q1.2             | 16.00 r / 8.17 r  | 87.37 r / 19.16 r |
| Q2   | Msg. without Method     | 111/40 | .39/.38 | .84/.90 |   | Q4               | 50.97 r / 13.37 r | 11.68 r / 1.78 f  |
| Q8   | Msg. without Method     | 111/30 | .49/.33 | .86/.94 | √ | Q4               | 45.63 r / 13.76 r | 7.17 r / 1.01 f   |
| R1.2 | Msg. without Method     | 110/[  | .49/[   | .69/[   | √ | R1.1             | 42.12 r / [       | 51.56 r / [       |
| Q3   | Msg. in wrong direction | 111/34 | .60/.58 | .47/.95 | √ | Q1.2             | 73.96 r / 4.26 r  | 25.11 r / 0.42 f  |
| Q6.1 | Obj. has no Class in CD | 111/31 | .14/.21 | .81/.97 |   | Q6.2             | 73.09 r / 21.16 r | 13.09 r / 0.20 f  |
| Q7.1 | Use Case without SD     | 109/30 | .50/.52 | .83/.44 |   | Q7.2             | 47.08 r / 6.25 r  | 9.17 r / 10.62 r  |
| Q5.2 | Class not inst. in SD   | 111/34 | .47/.68 | .49/.64 |   | Q5.1             | 84.05 r / 4.76 r  | 75.40 r / 6.52 r  |
| Q9.2 | Class not inst. in SD   | 109/30 | .95/.96 | .54/.14 | √ | Q9.1             | 1.00 f / 0.67 f   | 85.17 r / 7.26 r  |
| Q10  | Multiple Class defs.    | 107/27 | .10/.33 | .92/.68 |   | \                |                   | 5.27 r / 9.70 r   |
| R3   | Method not called in SD | 109/[  | .14/[   | .67/[   | √ | Q1.2             | 76.05 r / [       | 55.70 r / [       |
| Q1.2 | Control question        | 110/48 | .08/.16 | .97/.89 | √ |                  |                   | 0.18 f / 1.50 f   |
| Q4   | Control question        | 111/34 | .05/.13 | .95/.91 |   |                  |                   | 1.10 f / 1.66 f   |
| R1.1 | Control question        | 110/[  | .06/[   | .92/[   | √ |                  |                   | 3.42 f / [        |
| Q6.2 | Control question        | 111/31 | .05/.07 | .96/.97 |   |                  |                   | 0.13 f / 0.15 f   |
| Q7.2 | Control question        | 110/30 | .05/.22 | .95/.92 |   |                  |                   | 2.69 f / 0.94 f   |
| Q5.1 | Control question        | 111/34 | .05/.13 | .94/.95 |   |                  |                   | 1.85 f / 0.88 f   |
| Q9.1 | Control question        | 110/30 | .04/.07 | .99/1.0 | √ |                  |                   | 0.40 f / 0.50 f   |


Figure 5.2. Boxplots: Detection Rate (left) and Agreement (right)

The null hypothesis is rejected for all defect types except for ‘Class not instantiated in SD’.

Table 5.4 presents a classification of defect types by detection rate (of the student experiment); defect types at the top of the table remain undetected in most cases. The least detected defects are Multiple classes with the same definition, Method not in SD (symbolic) and Object has no class in CD. The most detected defect is Class not in SD (symbolic).

5.5.3 LRQ2: Variation of Interpretations

With research question LRQ2 we investigate the risk for misinterpretation caused by a defect. The data from Table 5.3 is summarized in Figure 5.2. The boxplot shows the agreement measure of questions containing a defect compared to control questions. The control questions have a high AgrM value (average: 0.91), which indicates that most subjects have the same interpretation of the model. The AgrM value of defected models is widely spread between .14 and 1.00 (average: 0.73). The results show that defected models have a larger variety of misinterpretations and, hence, carry a higher risk for misinterpretation and miscommunication.

We tested hypothesis H2 to investigate LRQ2. The test statistic of the χ2 test is given in Table 5.3. The threshold value at a significance level of α = 0.05 and one degree of freedom is 3.84. We indicate rejection of the null hypothesis with ‘r’; failure to reject is indicated with ‘f’. The null hypothesis is rejected for all defect types in the student experiment.


Table 5.4. Classification by detection rate

| Defect                              | S   | P   | Question |
|-------------------------------------|-----|-----|----------|
| Multiple Class Defs                 | .10 | .33 | Q10      |
| Method not in SD (s.)               | .14 | n/a | R3       |
| Object has no Class in SD           | .14 | .21 | Q6.1     |
| Message without Method              | .39 | .38 | Q2       |
| Class not in SD                     | .47 | .48 | Q5.2     |
| Message without Method (s.)         | .49 | .33 | Q8       |
| UC without SD                       | .50 | .52 | Q7.1     |
| Message in the wrong Direction (s.) | .60 | .48 | Q3       |
| Message without Name (s.)           | .69 | .60 | Q1.1     |
| Class not in SD (s.)                | .95 | .96 | Q9.2     |

In the professionals experiment we fail to reject the null hypothesis for four defect types. These defect types are inconsistencies between sequence diagrams and class diagrams. We found that the practitioners tend to regard the sequence diagram as leading. Amongst students this behavior was observed to a much lesser degree.

H2 compares the results of a question with a reference distribution (as described in Section 5.4.8). This allows us to test the hypothesis for defected questions as well as control questions. As expected, we failed to reject the null hypothesis for all control questions, indicating that our assumption of an agreement of 98% for the non-defected situation (i.e. control questions) is valid.

Table 5.5 presents a classification of defect types by AgrM (of the student experiment); defect types with the largest spread over different interpretations are at the top. Most misinterpretation is caused by the defect type Class not in SD (symbolic version).

5.6 Additional Observations

5.6.1 Domain Knowledge

When reading a text that contains errors it is often possible to understand the intended meaning of the text. Understanding the correct meaning is sometimes even possible if the errors introduce ambiguity or change the meaning. This is because the reader knows the language and is familiar with the context, which enables him to infer the correct meaning.

In this study we investigate the effects of defects in UML models. Hence we are also interested in whether context knowledge enables the reader to infer the right interpretation from the defected model.


Table 5.5. Classification by AgrM

| Defect                              | S   | P   | Question |
|-------------------------------------|-----|-----|----------|
| Class not in SD (s.)                | .34 | .14 | Q9.2     |
| Message without Name (s.)           | .47 | .44 | Q1.1     |
| Message in the wrong Direction (s.) | .47 | .95 | Q3       |
| Class not in SD                     | .49 | .64 | Q5.2     |
| Method not in SD (s.)               | .67 | n/a | R3       |
| Object has no Class in SD           | .81 | .97 | Q6.1     |
| UC without SD                       | .83 | .44 | Q7.1     |
| Message without Method              | .84 | .90 | Q2       |
| Message without Method (s.)         | .86 | .94 | Q8       |
| Multiple Class Definitions          | .92 | .68 | Q10      |

To be able to analyze the use of context (or domain) knowledge, we designed pairs of models for the defects Message name does not correspond to method name (EnM) and Class from SD not in CD (CnSD) such that one model was taken from a familiar domain (ATM machine and train crossing) and the other model is essentially equal, but its elements have symbolic names without a particular meaning (e.g. class A, method3).

For the defect EnM the results of students and professionals are almost the same in the cases with and without domain knowledge (see Table 5.3). For the defect CnSD there is a large difference between the model with and without context, for the detection rate as well as for AgrM, in both groups of subjects (Figure 5.3). When the reader cannot use domain knowledge to compensate for the defect CnSD the detection rate is higher, i.e. 95% of the students and 96% of the professionals detect the defect. Subjects who compensate for the defect using domain knowledge have different interpretations of the model, resulting in low scores for AgrM.

The question for this defect (using domain knowledge) was to describe the behavior of the classes that control the traffic light events based on events from the gate sensors and the rail sensors. Because the order of events at train crossings might be slightly different in different countries, we analyzed the results to detect whether subjects from the same country (i.e. having common domain knowledge) would have the same interpretation, but even this was not the case.

5.6.2 Prevailing Diagram

The questions about defects between a sequence diagram and a class diagram were designed such that the answer options included at least one option that compensated for the defect by regarding the sequence diagram as correct and at least one option that compensated for the defect by regarding the class diagram as correct.


Figure 5.3. Results for defect CnSD

Interestingly, for all defect types that allow either way of interpretation, the option regarding the sequence diagram as correct (i.e. prevailing) received the largest number of answers. These defect types are: ED, MnSD, CnCD and EnM.

The defect types Class not in SD (CnSD) and Message without Name (EnN) also concern sequence diagrams and class diagrams, but in these cases information is missing in the sequence diagrams (instead of a mismatch). Therefore these defects cannot be compensated for by using the sequence diagram. CnSD was discussed in Subsection 5.6.1. The majority of subjects compensates for a defect of type EnN using the most straightforward interpretation of the class diagram (not taking inheritance into account). But the degree of misinterpretation induced by this defect type is rather high (i.e. low AgrM values: .47 resp. .44).

The defect type Multiple Class Definitions under the same Name involves only class diagrams. Most subjects compensate for an instance of this defect by taking the union of all methods and classes of both definitions of the class.

5.6.3 Comparing Students’ and Professionals’ Results

Table 5.3 shows that the results of students and professionals are consistent for most questions, i.e. defect types. However, there are some questions where larger differences between the two groups occur. We take a closer look at questions where the difference in detection rate or AgrM is larger than 0.15.


For both Q10 (Multiple Class Definitions) and Q5.2 (Class not instantiated in Sequence Diagram) the detection rate is larger for professionals. We assume that the professionals’ larger experience is the cause of the difference. For Q8 (Message without Method, symbolic) the detection rate for the students is larger. Despite the opposite direction of the difference, we assume that also in this case the larger experience of the professionals is the cause. We base our assumption on the large agreement amongst the professionals for this question (AgrM = 0.94). The large AgrM value indicates that the professionals are convinced that they know the correct interpretation, and therefore they do not report a defect.

For the agreement measure (AgrM) there are more questions with differences between students and professionals: three questions yield a larger AgrM for professionals (Q3, Q6.1, and Q5.2) and two questions yield a larger AgrM for students (Q7.1 and Q10). We do not take the AgrM of Q9.2 into account, because its values are based on a small data set only and, hence, are not reliable (as the detection rate for this question is 0.95 and 0.96, the AgrM is based on the answers of only 5% and 4% of the population). From these observations we conclude that the detection rate and AgrM of some defect types are affected differently by differences in background characteristics of the reader. As we observed the differences in this experiment between professionals and students, we expect that experience is the influencing background characteristic in this case. However, further research is needed to get better insight into the influence of background characteristics.

5.7 Threats to Validity

In this section we discuss the threats to validity in order of decreasing priority: internal, external, construct and conclusion validity (according to [196]).

5.7.1 Internal Validity

Threats to internal validity are influences on the causal relation between the controlled factors and the dependent variable.

In both experiments described in this chapter the order of the questions is the same for all subjects, hence there is a potential for order effects. Order effects occur when there is interaction between the objects of the study. To avoid interaction we constructed the objects, i.e. the model fragments, such that all fragments are in different domains, and the naming of elements (classes, methods, ...) is chosen such that each pair of model fragments has no common names for its elements. As described in Section 5.4, each question has three types of possible answers. To prevent the subjects from predicting the correct answer, the order of the answers was randomized. In all runs, there were no two questions with the same combination of injected defect and domain knowledge, hence we assume that there were no learning effects.


The second run of the first experiment contained questions that were almost equal to questions from the first run. The results were almost the same, hence there were no learning effects.

Fatigue during completion of the questionnaire is a possible threat to validity. The number of obviously wrong answers is almost the same for questions at the end of the questionnaire as for questions at the beginning. Therefore there is no indication of a decrease in performance.

Communication between the subjects influences the subjects’ answer behavior. This threat can be ruled out for the student subjects, since the experiment was executed in an observed exam session where the students were not allowed to communicate. There is a risk that the professionals communicated. After participating in the experiment we interviewed some of the subjects, who indicated that they did not communicate about the experiment during execution.

A difference between the students and the professionals is that the professionals volunteered to participate in the experiment and may be more motivated for the task. We cannot completely exclude this threat, but since the results of both subject groups are very similar we assume that the volunteer effect has no significant influence on our observations.

Subjects’ knowledge of an application domain can influence the dependent variable. Since there are differences in the cultural and educational background of the subjects, this effect is a threat to the internal validity of this experiment. To compensate for this effect we have chosen objects from various application domains and some objects unrelated to any application domain (because symbolic names are used).

5.7.2 External Validity

Threats to external validity concern whether the results of the experiment can be generalized to a professional software context.

Students as subjects could be a threat to the external validity of the experiment. The subjects in the first experiment are all MSc students with experience in UML. According to Kitchenham et al. [97] students can be used as subjects.

The size of the model fragments ranges between three and nine classes. In industrial models we have found between fifty and several hundred classes. Therefore the size can be considered a threat to validity concerning the generalizability of the outcomes. When acting (e.g. reading, modifying) on industrial models, designers focus on only a subset of the model, which decreases the size gap for this experiment on cognitive effects. As the complexity of industrial models is in general larger, the effects of defects in industrial models are expected to be at least as severe as the effects reported here. In terms of number of classes, industrial models are larger than our fragments by a factor ranging between ten and one hundred.


Similar experiments show the same size factor, e.g. Deligiannis et al. [50] use source code fragments of 18 classes and Purchase et al. [161] use a model fragment of ten classes. Multiplying the sizes of these experiments by a factor between ten and 100 yields sizes that are common for source code in industrial projects.

5.7.3 Construct Validity

Construct validity is the degree to which the dependent and independent variables accurately capture the concepts that should be measured by this study.

Since the experiment is designed as a multiple-choice test, four possible interpretations are explicitly stated to the subject. This situation is different from a real software development process, where the subject is not guided by a set of predefined interpretations. In practice, the subject has to choose from an infinite set of interpretations. Therefore the results in the experiment might differ from practice. As the set of possible interpretations is much larger in practice and the subject is not guided and will most likely not guess the interpretation, we expect the detection rate and agreement measure to be even lower than in the experiment.

5.7.4 Conclusion Validity

Conclusion validity is the extent to which correct conclusions about the relations between the treatment and the outcome of the experiment can be drawn. We have carefully taken precautions to avoid issues threatening the conclusion validity. In particular, we took precautions to guarantee homogeneity of the subject groups with respect to the subjects’ background, satisfaction of the assumptions of the statistical tests, and accuracy of measurements (described in Section 5.4).

5.8 Conclusions and Future Work

In this study we investigated the effects of defects in UML models. The two major contributions are the investigations into defect detection and into misinterpretations caused by undetected defects. The results show that some defect types are detected by almost all subjects (e.g. 96% of the subjects detect Class not in Sequence Diagram) whereas other defect types are hardly detected (e.g. Multiple Definitions of the same Class is detected by only 10%). Most of the analyzed defect types are detected by less than 50% of the subjects. The risks for misinterpretations are similarly alarming. Some defect types cause a large variation in interpretations amongst readers (e.g. Class not in Sequence Diagram has an AgrM of 0.14)


and other defect types hardly cause any misinterpretations (e.g. Message without Method has an AgrM of 0.94).

We presented a classification of defect types based on detection rate and risk for misinterpretations. In contrast to most defect classifications found in literature and industrial practice, this classification is objective and based on empirical evidence. The results show that most defect types are hardly detected and that there is no implicit consensus about the interpretation of undetected defects. Therefore defects are potential risks that can cause misinterpretation and, hence, miscommunication. These results are generalizable to professional UML designers.

We observed that the presence of domain knowledge affects the interpretation of UML models. We found an instance of a defect type where the presence of domain knowledge strongly decreased the detection rate. This observation gives rise to the assumption that domain knowledge supports implicit assumptions that might be wrong and cause misinterpretations. The validity of this assumption should be investigated in further studies.

Our findings can be used to improve the practice of software modeling in the following ways: defect prevention can be improved through the use of guidelines for creating UML models that minimize the risk of misinterpretation. Using the classification, defect removal activities can be improved by focusing on the most risky defects first.

In further studies the impact of misinterpretations should be investigated. For example, questions like “which implementation errors will be caused by model defects?” and “when will errors caused by model defects be detected and what is the cost of repairing them?” should be addressed. We invite other researchers to replicate this experiment using other groups of subjects.


Chapter 6

Modeling Conventions to prevent Defects

To prevent syntactic quality problems we propose modeling conventions, analogous to coding conventions for programming. This chapter reports on a controlled experiment to explore the effect of modeling conventions on defect density and modeling effort. 106 masters’ students participated over a six-week period. Our results indicate that decreased defect density is attainable at the cost of increased effort when using modeling conventions, and moreover, that this trade-off is larger if tool-support is provided. Additionally we report observations on the subjects’ adherence to and attitude towards modeling conventions. Our observations indicate that efficient integration of convention support in the modeling process, e.g. through training and seamless tool integration, forms a promising direction towards preventing defects.

6.1 Introduction

The UML carries a risk of quality problems due to its multi-diagram nature, its lack of a formal semantics and the large degree of freedom in using it. The large degree of freedom and the lack of guidelines mean that the UML is used in several different ways, leading to differences in rigor, level of detail, style of modeling and number of defects. In Chapter 4 we gave empirical evidence that the number of defects is large in practice and that the UML is used in different ways by different users. Moreover, in Chapter 5 we have shown experimentally that defects in UML models are often not detected and cause misinterpretations by the reader.

The effort for quality assurance is typically divided between prevention effort and appraisal effort [177].


Prevention effort aims at preventing deviations from quality norms; appraisal effort is associated with evaluating an artifact to identify and correct deviations from these quality norms. There are techniques in software development to detect and correct deviations from quality norms: reviews, inspections and automated detection techniques are used in practice to detect weak spots. They are associated with appraisal effort. In programming, preventive techniques to assure a uniform style and comprehensibility of the source code are established as coding conventions or coding standards [154]. As an analogy for UML modeling we propose modeling conventions to prevent modelers from deviating from quality norms. We define modeling conventions as:

Conventions to prevent defects and ensure a uniform manner of modeling.

The main purpose of this chapter is to explore experimentally the effectiveness of modeling conventions for UML models with respect to the prevention of defects.

An additional purpose of this study is to explore subjects’ attitude towards modeling conventions and how modeling conventions are used. The observations can be used to improve the future use of modeling conventions.

In this chapter we address research question RQ4 as stated in Chapter 1:

• RQ4: How can modeling conventions for developing UML models improve model quality?

In terms of the quality notions defined in Chapter 3 we study the influence of modeling conventions on syntactic quality. Our study of the influence of modeling conventions on semantic, social and communicative quality is reported in [55].

This chapter is structured as follows: Section 6.2 describes modeling conventions and related work. Section 6.3 describes the design of the experiment. Section 6.4 presents and discusses the results. Section 6.5 discusses the threats to the validity of the experiment and Section 6.6 discusses conclusions and future work.

6.2 Modeling Conventions

6.2.1 Related Work

There is a large variety of coding conventions (also known as guidelines, rules, standards, or style) for almost all programming languages. The amount of research addressing coding conventions is rather limited, though. Oman and Cook [154] present a taxonomy for coding conventions which is based on an extensive review of existing coding conventions.


They identify four main categories of coding conventions: general programming practice, typographic style, control structure style and information style. They found that there are several conflicting coding conventions and that there is only little work on theoretical or empirical validation of coding conventions. Bieman [19] investigates the adherence of source code and other software artefacts to ‘style guidelines’, i.e. conventions, in real-world software systems. Initial results reveal a large number of violations.

Our review of literature related to modeling conventions for the UML revealed the following categories: design conventions, syntax conventions, diagram conventions and application-domain specific conventions.

Design conventions address the design of the software system in general, i.e. they are not specific to UML. Design conventions such as those by Coad and Yourdon [45] aim at the maintainability of OO-systems. The conventions, which include for example high cohesion and low coupling, are empirically validated by Briand et al. [26]. The results of their experiment show that these conventions have a beneficial effect on the maintainability of object-oriented systems.

Syntax conventions deal with the correct use of the language. Ambler [6] presents a collection of 308 conventions for the style of UML. His conventions aim at understandability and consistency and address syntactical issues, naming issues, layout issues and the simplicity of design. Object-oriented reading techniques (OORT) are used in inspections to detect defects in software artefacts. OORTs for UML are related to modeling conventions in the sense that the rules they prescribe for UML models can be used in a forward-oriented way during the development of UML models to prevent defects. Conradi et al. [48] conducted an industrial experiment where OORTs were applied for defect detection (i.e. an appraisal effort). The results show defect detection rates between 68% and 98% in UML models.

Diagram conventions deal with issues related to the visual representation of UML models in diagrams. Diagram conventions were addressed in the Fourth Workshop on Graphical Documentation [145]. Koning et al. [99] provide a collection of diagram conventions that are not specific to UML models but apply to IT-architecture diagrams in general. Their conventions aim at improving the readability of diagrams, and they provide a lightweight validation in which industrial architects acknowledged the usefulness of 97% of the presented conventions. MacKinnon et al. [133] present a collaborative process to improve the readability of UML diagrams for technical documentation. A validation of the process is left as future work. Purchase et al. [160] present diagram conventions for the layout of UML class diagrams and collaboration diagrams based on experiments. Eichelberger [58] proposes 14 layout conventions for class diagrams aiming at algorithms for automatic layout of class diagrams.

Application-domain specific conventions. A purpose of UML profiles is to support modeling in a particular application domain.


Hence, profiles are in fact application-domain specific conventions. Kuzniarz et al. [107] conducted an experiment on the effect of using stereotypes to improve the understandability of UML models. Their results show that stereotypes improve the correctness of understanding UML class diagrams by 25%.

6.2.2 Modeling Conventions in this Experiment

Based on the literature review and the experience from our case studies, we selected a set of modeling conventions. To keep the set of modeling conventions manageable and comprehensible, we decided that it should fit on one A4 page. This led to 23 modeling conventions after applying these selection criteria:

• Relevance. The modeling convention should be relevant to improve the quality of the UML model by preventing frequent defects as found in practice [117] (Chapter 4).

• Comprehensibility. The modeling convention should be easy to comprehend (e.g. it relates to well-known model elements).

• Measurability. The effect of the modeling convention should be measurable.

• Didactic value. Applying the modeling convention should improve the subjects’ UML modeling skills.

The entire set of modeling conventions of this experiment can be found in Appendix B. Examples are given in Table 6.1. In this experiment we focus on assessing syntactic quality, but we deliberately do not limit the collection of modeling conventions to syntactic conventions only. As described by Oman and Cook [154] there can be interaction between several conventions. To obtain realistic results it is necessary to use a representative set of modeling conventions. Therefore we chose conventions from all categories presented in Section 6.2.1.

6.3 Experiment Design

6.3.1 Purpose and Hypotheses

We formulate the goal of this experiment according to the Goal-Question-Metric paradigm by Basili et al. [14] in Table 6.2.

Modeling conventions require model developers to adhere to specific rules. Therefore we expect the quality of models to be better, i.e. there are fewer defects in a model that is created using modeling conventions. When additionally using a tool to check for adherence to the modeling conventions, we expect the model quality to be even better than without tool-support.


Table 6.1. Examples of modeling conventions used in this experiment

| ID | Name                          | Description                                                                                                            |
|----|-------------------------------|------------------------------------------------------------------------------------------------------------------------|
| 4  | Homogeneity of Accessor Usage | When you specify getters/setters/constructors for a class, specify them for all classes                                 |
| 9  | Model Class Interaction       | All classes that interact with other classes should be described in a sequence diagram                                  |
| 10 | Use Case Instantiation        | Each Use Case must be described by at least one Sequence Diagram                                                        |
| 14 | Specify Message Types         | Each message must correspond to a method (operation)                                                                    |
| 15 | No Abstract Leafs             | Abstract classes should not be leafs (i.e. they should have subclasses)                                                 |
| 19 | Low Coupling                  | Your classes should have low coupling. (The number of relations between each class and other classes should be small)   |

Table 6.2. GQM template

Analyze modeling conventions for UML
for the purpose of investigating their effectiveness
with respect to investigating their effectiveness
from the point of view of the researcher
in the context of masters students at the TU Eindhoven.

In other words, we formulate the null hypothesis that there is no difference between the treatments:

• H1_0: There is no difference between the syntactic quality of UML models that are created without modeling conventions, with modeling conventions and with tool-supported modeling conventions.

Adherence to modeling conventions requires special diligence. We expect that this leads to higher effort for modeling. When additionally using the tool for monitoring the adherence to the conventions, the expected effort is even higher. Therefore we formulate the second hypothesis of this experiment as follows:

• H2_0: There is no difference between the effort for modeling UML models that are created without modeling conventions, with modeling conventions and with tool-supported modeling conventions.


6.3.2 Design

The purpose of this experiment is to investigate the effect of modeling conventions. Therefore the treatment is to apply modeling conventions, with and without tool-support, during modeling. We define three treatment levels:

NoMC: no modeling conventions. The subjects use no modeling conventions. This is the control group.

MC: modeling conventions. The subjects use the modeling conventions that are listed in Appendix B.1.

MC+T: tool-supported modeling conventions. The subjects use the modeling conventions and the analysis tool to support adherence. The analysis tool is configured such that deviations from the modeling conventions are reported.

The experimental task was carried out in teams of three subjects. We randomly assigned subjects to teams and teams to treatments. According to [67] this allows us to assume independence between the treatment groups. Each team performed the task for one treatment level. Hence we have an unrelated between-subjects design with twelve teams for each treatment level.

6.3.3 Objects and Task

The task of the subjects was to develop a UML model of the architecture of an information system for an insurance company. The required functionality of the system is described in a document of four pages [110]. The system involves multiple user roles, administration and processing of several data types. The complexity of the required system was chosen such that on the one hand the subjects were challenged but on the other hand there was enough spare time for possible overhead effort due to the experimental treatment. The subjects used the Poseidon UML tool [3] (version 3.2) to create the UML models. This tool does not assist in adhering to the modeling conventions and preventing model flaws.

The task of the teams with treatments MC and MC+T was to apply modeling conventions during development of the UML model. The modeling conventions description contains for each convention a unique identifier, a brief descriptive name, a textual description of the convention, and the name of the metric or rule in the analysis tool that it relates to.

The subjects of treatment MC+T used the SDMetrics UML analysis tool [198] (version 1.3) to assure their adherence to the modeling conventions. SDMetrics calculates metrics and performs rule-checking on UML models. We have customized the set of metrics



and rules to allow checking adherence to the modeling conventions used in this experiment [110].
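To give an impression of what such a rule check amounts to, the sketch below tests modeling convention 14 (‘each message must correspond to a method’) on a toy model representation. The data structures are invented for illustration only and do not reflect SDMetrics’ actual rule language or its XMI processing.

    # Toy model: the class diagram as {class name: set of method names},
    # the sequence diagrams as (receiver class, message name) pairs.
    class_diagram = {
        "Account": {"deposit", "withdraw"},
        "Customer": {"getName"},
    }
    messages = [
        ("Account", "deposit"),
        ("Account", "transfer"),  # violation: no such method in the CD
    ]

    def messages_without_method(classes, msgs):
        # Report every message whose name does not correspond to a method
        # (operation) of the receiving class.
        return [(cls, m) for cls, m in msgs
                if m not in classes.get(cls, set())]

    print(messages_without_method(class_diagram, messages))
    # -> [('Account', 'transfer')]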

6.3.4 Subjects

In total 106 MSc students participated in the experiment, which was conducted within the course “Software Architecting” in the fall term of 2005 at the Eindhoven University of Technology (TU/e). All subjects hold a bachelor degree or equivalent. Most students have some experience in using the UML and object-oriented programming through university courses and industrial internships. We analyzed the results of the students’ self-assessment from the post-test questionnaire and found no statistically significant differences.

The students were motivated to perform well in the task, because it was part of an assignment which was mandatory to pass the course (see Section 6.4.4).

The students were not familiar with the goal and the underlying research question of the experiment, to avoid biased behavior.

6.3.5 Operation

Prior to the experiment we conducted a pilot run to evaluate and improve the comprehensibility of the experiment materials. The subjects of the pilot experiment did not participate in the actual experiment.

In addition to the students’ prior UML knowledge, we presented and explained UML during the course before the experiment. The assignment started with an instruction session to explain the task and the tooling to all students. Additionally the subjects were provided with the assignment material including a detailed task description, the description of the insurance company system, and instructions for the tools [110]. The modeling conventions and the SDMetrics tool were only provided to the teams which had to use them. The teams of treatments MC and MC+T were explicitly instructed to apply the treatment regularly and to contact the instructors in case of questions about the treatment. The experiment was executed over a period of six weeks.

6.3.6 Data Collection

We collected the defect data of the delivered UML models using SDMetrics, because the majority of the applied modeling conventions is related to rules and metrics that we defined for SDMetrics.

The subjects were provided with an Excel logbook template to record the time spent during the assignment in a uniform manner. They recorded their time for


the three activities related to the development of the UML model: modeling itself, reviewing the model and meetings related to the model.

We used a post-test questionnaire to collect data about the subjects’ educational background, experience, how the task was executed and the subjects’ attitude towards the task. The 17 questions of the questionnaire were distributed through the university’s internal survey system. The questionnaire can be found in Appendix B.

6.3.7 Analysis Techniques

For quality and effort we analyze the number of defects and the time in minutes, respectively. These metrics are measured on a ratio scale. We use descriptive statistics to summarize the data. For hypothesis testing we compare the means using a one-way ANOVA test. We have analyzed the data with respect to the assumptions of the ANOVA test and have found no severe violations. The analysis is conducted using the SPSS [2] tool, version 12.0. As this is an exploratory study we reject the null hypothesis at the significance level of 0.10 (p < 0.10). To enable comparison of our results with other studies we report the effect size using Cohen’s d [46].
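As a sketch of this analysis (the thesis used SPSS; the team data below is hypothetical), scipy provides a one-way ANOVA, and Cohen’s d can be computed from the group statistics. The pooled standard deviation used here follows the common convention; the thesis does not spell out its exact pooling.

    import math
    from scipy.stats import f_oneway

    def cohens_d(a: list[float], b: list[float]) -> float:
        # Standardized mean difference using a pooled standard deviation.
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
        vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
        pooled = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb)
                           / (len(a) + len(b) - 2))
        return (mb - ma) / pooled  # sign indicates the direction

    # Hypothetical normalized defect counts per team
    nomc = [1.8, 1.2, 1.6, 1.4]
    mc = [1.5, 1.1, 1.4, 1.3]
    mct = [1.2, 0.9, 1.3, 1.1]

    f_stat, p_value = f_oneway(nomc, mc, mct)
    print(f_stat, p_value)     # reject the null hypothesis if p < 0.10
    print(cohens_d(nomc, mc))  # effect of MC relative to NoMC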

The data from the post-test questionnaire, which was designed as a multiple-choice questionnaire, are answers on a five-point Likert scale. Hence, they are measured on an ordinal scale. We summarize the data by presenting the frequencies as percentages for each answer option and providing additional descriptive statistics where appropriate. The answer distributions of the different treatment groups are compared using the χ2-test [139]. Microsoft Excel was used for this test. We apply the threshold of p < 0.10 for statistical significance. When comparing three distributions (NoMC, MC and MC+T), a χ2 value greater than 13.36 implies that p < 0.10. When comparing only two distributions the threshold is χ2 = 7.78.
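A sketch of this distribution comparison, using scipy’s chi-square test of independence on a treatments-by-answer-options contingency table (the counts are hypothetical): with three treatments and five options the test has (3-1)(5-1) = 8 degrees of freedom, which is where the 13.36 threshold comes from; comparing two distributions gives 4 degrees of freedom and the 7.78 threshold.

    from scipy.stats import chi2_contingency

    # Rows: NoMC, MC, MC+T; columns: Likert points 1..5 (hypothetical counts)
    table = [
        [0, 8, 21, 4, 1],
        [1, 7, 19, 9, 0],
        [2, 14, 12, 5, 0],
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    print(chi2, dof, p)  # dof = 8; significant at the 10% level if chi2 > 13.36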

6.4 Results

6.4.1 Outlier Analysis

During the experiment eight subjects dropped out (7.5%). The affected teams were distributed evenly over all treatments, therefore we do not exclude their data. One team in group MC+T dropped out completely, therefore we exclude its data.

To check whether the data is reasonable and to identify invalid data sets, we analyze the outliers. Figure 6.1 shows the boxplots for the size of the obtained models (number of classes, on the left) and the total amount of time needed by the teams to complete the task (on the right). According to Wohlin [196], the reasons for an outlier should be analyzed in order to decide whether to include or to exclude the data point in the analysis.


Figure 6.1. Boxplots for number of classes and total time

Figure 6.2. Boxplots for absolute number of defects and defect density

We scrutinized the outliers and came to the conclusion that they are not due to a rare event that can never happen again. As these outliers can happen in other situations as well, we decided to include them in the analysis.


6.4.2 H1: Presence of Defects

Total Number of Defects.

We assess the quality of the UML model in terms of the number of defects as described in Section 6.3.2. Figure 6.2 shows the boxplot for the total number of defects (on the left) and the number of defects normalized by the size of the model (on the right). Table 6.3 shows the descriptive statistics. The percentages in Table 6.3 are relative to the treatment level NoMC. The descriptive statistics for the normalized number of defects show that modeling conventions (MC) reduce the mean and the median. Tool-supported modeling conventions (MC+T) result in a larger reduction of defects. However, according to the ANOVA test (see Table 6.4) the results are not statistically significant and we cannot reject the null hypothesis H1_0. Additionally, we provide Cohen’s d [46] in Table 6.3 as a measure for effect size. The reported values give the effect size of MC and MC+T relative to the control group NoMC. Since Cohen [46] states that an effect size of 0.5 is medium, the size of the observed effects is slightly below medium.

Detailed Results.

In addition to the total number of defects, which is discussed above, we have conducted a detailed analysis of 19 metrics and rules that are related to the modeling conventions applied in this experiment. The set of defects analyzed in this experiment contains defects that were detected in the case studies (Chapter 4) and addressed in the experiment on effects of defects (Chapter 5).

For nine of these metrics the results for both MC and MC+T are better than for the control group. An example is the metric Number of Sequence Diagrams per Use Case, which indicates how well the functionality defined in use cases is specified by the sequence diagrams. Compared to the control group this metric is 30.8% greater for MC and 80.5% greater for MC+T (these results are statistically significant). Moreover, improvements were observed with respect to the following defects that were analyzed in Chapters 4 and 5: Object without Class, Object without Name, Message without Name, Class not in Sequence Diagram, Method not in Sequence Diagram, and Classes without Methods.

Three metrics show an improvement for MC+T but a decrease for MC. An example is the metric Number of Objects.

The metric Coupling between Objects (CBO) is the only one that has worse results for both MC and MC+T than for the control group. A possible explanation could be that the subjects applying modeling conventions model associations between classes more explicitly, resulting in a higher CBO. The results of six metrics are inconclusive because of the small number of occurrences of the rule violations.


Table 6.3. Descriptive statistics for defects and modeling effort (effort in minutes); percentages are relative to treatment NoMC

| Measure              | Treatment | Mean    | Perc.  | Median  | Perc.  | Std.Dev. | Max   | Min   | Cohen's d |
|----------------------|-----------|---------|--------|---------|--------|----------|-------|-------|-----------|
| Defects (total)      | NoMC      | 102.42  | 100.0% | 55.50   | 100.0% | 157.280  | 572   | 42    |           |
|                      | MC        | 53.67   | 52.4%  | 49.00   | 88.3%  | 34.102   | 135   | 9     | 0.428     |
|                      | MC+T      | 46.91   | 45.8%  | 29.00   | 52.3%  | 40.990   | 154   | 8     | 0.483     |
| Defects (normalized) | NoMC      | 1.51    | 100.0% | 1.47    | 100.0% | 0.396    | 2.312 | 1.032 |           |
|                      | MC        | 1.37    | 90.5%  | 1.36    | 92.1%  | 0.412    | 2.045 | 0.607 | 0.366     |
|                      | MC+T      | 1.24    | 82.0%  | 1.22    | 82.8%  | 0.667    | 2.406 | 0.320 | 0.499     |
| Effort (Modeling)    | NoMC      | 1069.17 | 100.0% | 910.00  | 100.0% | 670.220  | 2125  | 120   |           |
|                      | MC        | 1157.92 | 108.3% | 982.50  | 108.0% | 718.225  | 2280  | 105   | 0.127     |
|                      | MC+T      | 1885.00 | 176.3% | 2010.00 | 220.9% | 834.554  | 3130  | 540   | 1.048     |
| Effort (Reviewing)   | NoMC      | 367.50  | 100.0% | 300.00  | 100.0% | 329.224  | 1155  | 0     |           |
|                      | MC        | 385.83  | 105.0% | 272.50  | 90.8%  | 299.400  | 900   | 75    | 0.058     |
|                      | MC+T      | 524.55  | 142.7% | 600.00  | 200.0% | 379.727  | 1250  | 0     | 0.442     |
| Effort (Meeting)     | NoMC      | 555.42  | 100.0% | 375.00  | 100.0% | 499.297  | 1710  | 0     |           |
|                      | MC        | 720.00  | 129.6% | 640.00  | 170.7% | 632.488  | 1770  | 0     | 0.289     |
|                      | MC+T      | 862.73  | 155.3% | 690.00  | 184.0% | 839.069  | 3060  | 0     | 0.445     |
| Effort (Total)       | NoMC      | 1992.08 | 100.0% | 2062.50 | 100.0% | 1187.498 | 4150  | 480   |           |
|                      | MC        | 2245.42 | 112.7% | 2545.00 | 123.4% | 852.471  | 3265  | 690   | 0.245     |
|                      | MC+T      | 3272.27 | 164.3% | 3330.00 | 161.5% | 1151.838 | 4590  | 650   | 1.094     |


Table 6.4. Results of the ANOVA test for defects and effort

| Measure              |                | Σ of Squares | df | Mean Square | F     | Significance | Hypothesis            |
|----------------------|----------------|--------------|----|-------------|-------|--------------|-----------------------|
| Defects (total)      | Between Groups | 2.157E+4     | 2  | 1.078E+4    | 1.144 | .331         | H1_0 failed to reject |
|                      | Within Groups  | 3.017E+5     | 32 | 9.428E+3    |       |              |                       |
|                      | Total          | 3.232E+5     | 34 |             |       |              |                       |
| Defects (normalized) | Between Groups | 4.320E-1     | 2  | 2.160E-1    | .858  | .433         | H1_0 failed to reject |
|                      | Within Groups  | 8.048E+0     | 32 | 2.510E-1    |       |              |                       |
|                      | Total          | 8.479E+0     | 34 |             |       |              |                       |
| Effort (Modeling)    | Between Groups | 4.536E+6     | 2  | 2.268E+6    | 4.129 | .025         | rejected              |
|                      | Within Groups  | 1.758E+7     | 32 | 5.493E+5    |       |              |                       |
|                      | Total          | 2.211E+7     | 34 |             |       |              |                       |
| Effort (Reviewing)   | Between Groups | 1.669E+5     | 2  | 8.348E+4    | .738  | .486         | failed to reject      |
|                      | Within Groups  | 3.620E+6     | 32 | 1.131E+5    |       |              |                       |
|                      | Total          | 3.787E+6     | 34 |             |       |              |                       |
| Effort (Meeting)     | Between Groups | 5.444E+5     | 2  | 2.722E+5    | .614  | .547         | failed to reject      |
|                      | Within Groups  | 1.418E+7     | 32 | 4.432E+5    |       |              |                       |
|                      | Total          | 1.472E+7     | 34 |             |       |              |                       |
| Effort (Total)       | Between Groups | 1.042E+7     | 2  | 5.210E+6    | 4.535 | .018         | H2_0 rejected         |
|                      | Within Groups  | 3.677E+7     | 32 | 1.149E+6    |       |              |                       |
|                      | Total          | 4.719E+7     | 34 |             |       |              |                       |


6.4.3 H2: Effort

We measure the effort to develop the UML model in minutes using logbooks. Table 6.3 shows the descriptive statistics for modeling, reviewing and team meetings. The columns showing percentages are relative to the treatment level NoMC. The descriptive statistics show that both the mean and the median increase for MC and are even higher for MC+T. The last column of the table shows Cohen’s d [46], a measure of effect size. The reported values give the effect size of MC and MC+T relative to the control group NoMC. Regarding the total effort the effect for MC is small (d = 0.245), whereas the effect for MC+T is large (d = 1.094). Note that Cohen [46] states that an effect size of 0.2 is small, 0.5 is medium and 0.8 is large. The data for the different activities modeling, reviewing and meeting shows that for MC+T the effect sizes for reviewing and meetings are between small and medium. MC+T mainly affects the effort required for modeling, where the effect is large (d = 1.048).

Additionally we performed an ANOVA test for hypothesis testing. The results of the ANOVA test are shown in Table 6.4. The results for the total effort are statistically significant. Hence, we reject the null hypothesis H2_0. However, when we analyze at the level of activities, we see that only the results for modeling are statistically significant.

6.4.4 Attitude

To fully investigate the usefulness of modeling conventions it is necessary to assess the subjects’ attitude towards modeling conventions. We investigated the subjects’ attitude using the post-test questionnaire. The questions are multiple-choice questions with answers on a Likert scale ranging from 1 (very low agreement) to 5 (very high agreement). The results are summarized in Table 6.5.

The subjects perceived the difficulty of the task as medium. The perceived difficulty of performing the task with tool-supported modeling conventions is about 10% higher than for MC.

There is a statistically significant difference in the degree to which the subjects enjoyed the task. The mean for the control group (NoMC) is almost one point higher than for the other two treatment groups. The lower enjoyment might be caused by the extra effort (see Section 6.4.3).

The results show that the subjects of all treatment groups indicate slight confidence in the quality of their models. There is no significant difference between the treatment groups.

The results show that the task and the treatment were well understood and that the subjects were well motivated. This is necessary to be able to draw valid conclusions from the experiment.


The χ2-test did not show significant differences between the treatment groups.

6.4.5 Adherence to the Treatment

We used the answers to the post-test questionnaire to investigate the subjects’ adherence to treatments MC and MC+T. The answers are summarized in Table 6.6. The table shows the percentages for the points ‘1’ (very low adherence) to ‘5’ (very high adherence). On average both treatment groups adhere better than neutral to the modeling conventions (the mean is greater than 3). The χ2-test shows that the difference between MC and MC+T is not statistically significant.

The reported average adherence to the analysis tool is below the neutral point (3). We conducted a χ2-test to find out whether this adherence differs significantly from the adherence to the modeling conventions of the same treatment group. The difference is statistically significant at the 10% significance level.

Furthermore we asked the subjects how they applied the treatment. For both treatment groups that applied modeling conventions, more than 80% of the subjects indicate that they read the modeling conventions several times during the project. The tool was used up to ten times during the project, at an average of 3.32 times. The two instructors of the course (including the author) report that they received questions about both the modeling conventions and the analysis tool starting from the second week of the experiment. This indicates that the subjects started using the modeling conventions and the analysis tool.

6.5 Threats to Validity

6.5.1 Internal Validity

Threats to internal validity can affect the independent variables of an experiment. A possible threat to internal validity is that the treatment groups behave differently because of a confounding factor such as differences in skills, experience or motivation. Our analysis results show no significant differences between the treatment groups for these factors.

A risk is that subjects apply a treatment they should not apply, because they are eager to learn about new technology. We minimized this risk by (i) not telling the subjects the goal of the experiment, (ii) informing the subjects that their grade was not influenced by the treatment group that they were in, (iii) making modeling conventions and tool available only to the appropriate teams, and (iv) informing the subjects that all technology would be made available to all subjects after completion of the task. In the case that subjects would have received a different treatment despite these precautions, it would only decrease the effect between the treatment groups. Hence, in case this happened, the effect would be larger in reality.

Table 6.5. Subjects' attitudes towards the task

                       Treatment  N   χ2      Mean  1       2       3       4       5
Difficulty             NoMC       34  11.860  2.94  0.00%   23.53%  61.76%  11.76%  2.94%
                       MC         36          3.00  2.78%   19.44%  52.78%  25.00%  0.00%
                       MC+T       33          2.61  6.06%   42.42%  36.36%  15.15%  0.00%
Enjoy                  NoMC       34  18.886  3.47  0.00%   14.71%  32.35%  44.12%  8.82%
                       MC         36          2.58  16.67%  27.78%  36.11%  19.44%  0.00%
                       MC+T       33          2.58  21.21%  21.21%  36.36%  21.21%  0.00%
Confidence in Quality  NoMC       34  5.526   3.18  2.94%   17.65%  41.18%  35.29%  2.94%
                       MC         36          3.31  0.00%   11.11%  47.22%  41.67%  0.00%
                       MC+T       33          3.24  3.03%   21.21%  27.27%  45.45%  3.03%
Understanding Task     NoMC       34  4.089   3.18  8.82%   14.71%  35.29%  32.35%  8.82%
                       MC         36          3.08  2.78%   27.78%  33.33%  30.56%  5.56%
                       MC+T       33          2.91  9.09%   27.27%  30.30%  30.30%  3.03%
Motivation             NoMC       34  3.862   3.56  5.88%   8.82%   23.53%  47.06%  14.71%
                       MC         36          3.44  5.56%   5.56%   36.11%  44.44%  8.33%
                       MC+T       33          3.67  3.03%   3.03%   30.30%  51.52%  12.12%


Table 6.6. Adherence to the treatment

Adherence to          Treatment  N   χ2     Mean   1       2       3       4       5
Modeling Conventions  MC         36  5.027  3.638  0.00%   5.56%   33.33%  52.78%  8.33%
                      MC+T       33         3.303  3.03%   6.06%   54.55%  30.30%  6.06%
Analysis Tool         MC+T       33  9.326  2.727  12.12%  27.27%  42.42%  12.12%  6.06%


6.5.2 External Validity

Threats to external validity reduce the generalizability of the results to industrial practice. As described in Section 6.3 the experiment is designed to render a realistic situation. Hence, the experimental environment is designed to maximize generalizability (at the cost of statistical significance). We use students as subjects, which might be a threat to external validity. However, all students in this experiment hold a BSc degree in computer science and have relevant experience. The subjects conducted the task as a team instead of individually; this renders a realistic situation. The size of the model to be developed is smaller than most models in industrial practice. However, we assume that the effect of modeling conventions is the same or larger for large models as it is for small models.

Due to curricular constraints the amount of training and, hence, experience with modeling conventions and the analysis tool is limited. This resembles the situation in the introduction phase of the technology. We assume that more experience results in a reduction of extra effort and possibly a larger effect on model quality.

6.5.3 Construct Validity

Construct validity is the degree to which the variables measure the concepts they are intended to measure. The concept of quality is difficult to measure and it consists of several dimensions [96]. It is not feasible to cover all dimensions in a single experiment. We limit the scope of this experiment to defect containment. Using well-established tooling to measure defect containment, we are confident that we measure this dimension of model quality correctly.

6.5.4 Conclusion Validity

Conclusion validity is concerned with the relation between the treatment and the outcome. The statistical analysis of the results is reliable, as we used robust statistical methods.

We minimized possible understanding problems by testing the experiment material in a pilot experiment and improving it according to the observed issues. The course instructors were available to the students for clarification questions. The results of the post-test questionnaire show that the task was well understood. Hence, we conclude that there were no understanding problems threatening the validity of the reported experiment.


The metrics of the UML models (defects, size, etc.) were collected using an analysis tool and are therefore repeatable and reliable. A possible threat to the conclusion validity is the reliability of the measured time and the data from the post-test questionnaire. For time collection a logbook template was used to assure uniformity. The author analyzed the data for validity and no obvious problems were found.

6.6 Conclusions

The UML consists of different diagram types, has no formal semantics and does not provide guidelines on how to use the language features. Inherent to these characteristics is the risk of syntactic quality problems, i.e. defects. In this study we propose modeling conventions as a forward-oriented means to reduce these quality problems. Our literature review shows that existing work focuses on particular categories of conventions for UML modeling and that there is a lack of empirical validation of conventions for UML modeling.

Our main contribution is an experiment that provides empirical data about the application of modeling conventions in a realistic environment. Our results show that the defect density in UML models is reduced through the use of modeling conventions. However, the improvement is not statistically significant. Given that there is a relation between the modeling conventions and the analyzed defects and metrics, a larger improvement would have been expected. We observed that the adherence to the modeling conventions was not strict. In the sequel we discuss possible causes for the weak adherence and give recommendations to increase the adherence.

We provide data about the additional effort needed to apply modeling conventions with and without tool-support. The presented data quantifies the trade-off between improved model quality by using modeling conventions and the cost of extra effort. Additional observations describe the developers' attitude towards modeling conventions. The subjects using modeling conventions enjoyed their task slightly less than the subjects who did not use modeling conventions, indicating that the commitment in using modeling conventions can be improved.

Due to the time constraints of the experiment, we provided the subjects with a set of modeling conventions, instead of letting them select the conventions themselves. However, the subjects had no experience regarding whether the modeling conventions were useful for their task, and the subjects received no reward for delivering a model of better quality (the typical reward would be less effort during use of the UML models in a later phase). In practice it would be desirable if the developers who must eventually use the conventions participate in establishing the set of modeling conventions. This would increase their knowledge about and trust in the conventions and we expect they would have more commitment in using modeling conventions.


(However, the process of establishing the set of conventions to be used should be moderated in a productive manner, such that endless discussions are avoided.) We expect that the commitment will also be improved in a practical situation because the models will be used after they have been developed, resulting in rewarding the models' quality. We observed that it is difficult to control the adherence to modeling conventions. Hence, there is potential for improving the adherence of the developers. The subjects in this experiment did not get a reward for adhering well to the modeling conventions and the analysis tool. In an industrial environment the incentive to deliver good quality models is bigger, because the quality of the models pays back in later phases (as opposed to the experiment, where the project stops when the models are delivered). We expect that this would result in a better adherence to the modeling conventions and, hence, an increased effect on the model quality. The subjects in this experiment were not experienced in using modeling conventions or the analysis tool. Therefore the experiment resembles the introduction of modeling conventions to a project. We expect that for more experienced developers the quality improvement is larger and the amount of extra effort will be less than in our experiment.

The tool-support for adherence to the modeling conventions was given by a stand-alone tool. We expect that integrating adherence checks into UML development tools will decrease the extra effort and result in higher adherence, because of a shorter feedback loop. Egyed's instant consistency checking [57] is a promising technique for short feedback loops.

The observations made in this experiment potentially lead to the following guidelines for applying UML modeling conventions:

• Attention must be paid to control the adherence to the modeling conventions.

• Commitment of the developers increases the adherence to the modeling conventions.

• Modeling conventions should be tailored to a specific purpose of modeling.

• Tool support to enforce adherence to the modeling conventions increases the quality improvement. A short feedback loop is required to minimize the amount of necessary rework.

Future studies of modeling conventions should extend the set of modeling conventions to conventions for the process of modeling, instead of focusing on model-specific conventions. Conventions for the process of modeling would guide the modeler in his choices about diagram types, model elements, or abstraction level depending on the phase of modeling. Moreover, the effect of adherence and experience on the effectiveness and efficiency of modeling conventions should be investigated in more detail. External replications of the reported experiment should be conducted to further confirm our findings.


Similar to Chapters 4 and 5, the focus of this chapter is on defect containment. This is a limitation of this experiment, since the impact of modeling conventions on developers' ability to use a model is also of large interest. In [55] we report on an exploratory experiment where we studied the impact of modeling conventions on developers' interpretation of models. For that experiment the models created in this experiment were used as objects. Future work should address the impact of modeling conventions on model properties such as communicativeness and changeability in more detail.


Chapter 7

Task-Oriented Views

Software development is becoming more and more model-centric. As a result, models are used for a large variety of purposes, such as quality analysis, comprehension, and maintenance. We argue that existing UML diagrams and related existing tooling do not provide the information developers need to complete software engineering tasks at a high level of quality and efficiency. For example, relations between diagrams are difficult to find and metrics data is not intuitively connected to model elements. In this chapter we propose task-oriented views of UML models. The views are based on a framework and designed such that developers are supported in completing particular tasks.

7.1 Introduction

To successfully fulfill a model-related software engineering task, developers need to access a specific subset of the information contained in the model. They may also need information from artifacts related to the model, such as source code, test data, or evolution data. In its current version, the UML offers 13 related diagram types. Therefore, there is a large amount of inter-diagram information, such as relations between classes (in class diagrams) and their instantiations as objects in sequence diagrams. However, such information is scattered over different diagrams. For example, a class can occur in different class diagrams. However, the UML itself and also current tooling (refer to Chapter 2) hardly provide any support for developers to discover relations between diagrams nor to cope with information that is scattered over different diagrams. It is very tedious for developers to find this kind of information and to relate other information such as metrics or evolution data to UML models. We argue that in model-centric software engineering, views on the available data must be aligned with the tasks in which the views are used. In this chapter we present a framework which is the basis for developing views that support tasks that are required in model-centric development. We will identify typical tasks, available (and necessary) data, and we will propose new views and visualization techniques, to improve the use of UML modeling for practitioners with respect to fulfilling their tasks. To answer research question RQ5 we will evaluate these views in the following chapter.

7.2 A Framework for Task-Oriented Views

We aim at developing views that support tasks in model-centric software engineering. Therefore, we use a framework that relates tasks to views as a starting point to define the views. Maletic [134] proposed a framework to classify software visualizations focusing on the relation between tasks and views. In this study we reuse an adapted version of Maletic's framework. In this section the three underlying concepts of the framework for task-oriented modeling and their relations are described. The three underlying concepts are: Properties, Views and Tasks. The relations between these concepts are illustrated in Fig. 7.1. Note some adjustments that we have made to Maletic's framework: we refer to 'view' to describe the visualization and its representation, whereas Maletic's framework has a separate dimension called 'representation'. Additionally we refer to 'properties' of a model, whereas Maletic uses the term 'target' for the same concept. For simplicity's sake we omit in the framework the concept 'audience', which describes the stakeholder associated with a particular visualization, and the dimension 'medium', which would be 'color monitor' in all cases.

Properties are characteristics of UML model elements (e.g. classes, use cases, or sequence diagrams). We will describe an initial overview of properties in this section. Then we define the concept of views, which are the basis for visualizing the identified properties. A software developer has to fulfill tasks that have to be performed on model properties and that are supported by views. In the remainder of this section we explain the concepts in more detail and provide examples for each concept. However, we do not claim that the given examples are exhaustive.

Figure 7.1. The three underlying concepts and their relations


7.2.1 Tasks

We define a task in the context of this work as a unit of work accomplished by a software engineer on a software artefact to fulfill a purpose.

Our framework is a basis for developing model-centric views that support fulfilling the tasks. Hence, the tasks are the starting point for developing views. Note that our definition of a task does not define a task's granularity. In this section we give examples of task categories that combine tasks with a common goal:

Comprehension.

The need for comprehension (or in this specific context model understanding) can have different reasons. One reason is that a new developer joins an existing team, and needs to understand the system before he is able to make useful contributions. Examples of activities related to this task category are: identifying key classes, identifying which classes implement which functionality, identifying relations between classes and identifying complex interactions.

Model development.

Creation of models is often an incremental and iterative process including many changes. Either parts of the system are modeled in each step or the system as a whole is modeled from a high abstraction level down to a more detailed level. Examples of activities employed in this type of task are: adding, changing or removing elements.

Testing.

Testing is used to detect defects in software. A testing task common in model-centric software development is the (automatic) generation of test cases from sequence diagrams.

Model maintenance.

The process of changing a software system due to corrections, improvements or adjustments is called maintenance. The tasks related to changes in the model belong to this task category. Some activities in which UML models are involved include: extension of a system, bug fixing, handling change requests and performing impact analysis before making a change.

Quality Evaluation.

As the correction of quality problems is much cheaper at an early stage, i.e. at the modeling stage, than at the implementation stage, it is important to evaluate a system's quality before implementing it. The evaluation of the quality of a model can be performed at several abstraction levels, e.g. separate elements, diagrams or the system as a whole. Besides evaluating a single version of a model one can also investigate a series of versions, to detect trends.


Completeness / Maturity Evaluation.

Related to quality evaluation is the evaluation of the completeness or the maturity of a model. This task category includes analyzing whether a model reflects all requirements and whether the diagrams of the model describe the system completely.

7.2.2 Properties

We define a property in the context of this work as a directly or indirectly measurable characteristic of a model element. The set of its properties (or a subset of it) uniquely identifies a model element. Properties belong to the information needed by software engineers to perform tasks. Model elements are the building blocks of UML models. The model element types are defined in the UML meta model. For each model element type, such as class, association, classifier instantiation, use case etc., a number of properties are defined. We identify three different types of properties for model elements:

Direct Internal: Those properties of an element that are solely and directly based on information that is present in the model. General examples of this kind of property are the name of an element or the owner of an element. Example properties for a class are its operations, its attributes and its relations to other classes.

Indirect Internal: Besides the information that is directly present within the model, we identify properties associated with model elements that can be derived from the model. The information contained in indirect internal properties is derived from the model element itself or related model elements that may also be in different diagrams. For example, multi-view metrics as proposed in [117], such as 'number of use cases per class', are indirect internal properties that combine information from different diagrams. General examples of this kind of property are metrics and history data. Other examples of indirect internal properties of the model element 'class' are the number of methods, the number of instantiations of the class or the complexity of the class (for example based on an associated state diagram).

External: A third type of property is based on information from outside the model. This type of property is ignored by the UML specification and to the best of our knowledge there exist no commercial tools that take external properties of model elements into account. Sources of external properties are other artifacts, such as source code, requirements documents or test documents. Additionally, data that is recorded during software development and that is traceable to model elements is an external property. For example, bug reports can be seen as external properties of classes or can even be traced to use cases. Configuration management systems allow to collect evolution data about a software artefact. In that case the number of changes or the number of different developers working on a model element are examples of external properties. As UML is used nowadays not only in design, but also in other phases, such as maintenance, external property data is available for UML models.
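To make the distinction between the three property types concrete, they could be modeled roughly as follows. This Python sketch is purely illustrative; the class and field names are invented and do not reflect the implementation of any tool discussed in this thesis.

    from dataclasses import dataclass, field
    from enum import Enum

    class PropertyKind(Enum):
        DIRECT_INTERNAL = "direct internal"      # present in the model itself
        INDIRECT_INTERNAL = "indirect internal"  # derived from the model, e.g. metrics
        EXTERNAL = "external"                    # from other artifacts, e.g. bug reports

    @dataclass
    class Property:
        name: str
        kind: PropertyKind
        value: object

    @dataclass
    class ModelElement:
        name: str                                # itself a direct internal property
        properties: list = field(default_factory=list)

    route = ModelElement("Route")
    route.properties.append(Property("number_of_methods", PropertyKind.INDIRECT_INTERNAL, 12))
    route.properties.append(Property("number_of_bug_reports", PropertyKind.EXTERNAL, 3))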

7.2.3 Views

We define a view [88][102] in the context of this work as a visible projection of a subset of properties of a model's elements that is needed to fulfill a task. Typically, UML models are visually represented in diagrams. The UML specification [152] defines a variety of diagram types as views on a model from a certain perspective. This specification is only concerned with internal properties and not even all of these properties are viewable in UML diagrams. The relations between model elements in different diagram types are often not intuitively presented. It is, for example, often difficult and tedious to find out at which places in the model a class is instantiated, because this relation is not explicitly present in the diagrams. We argue that in the design of the UML the choice of which properties can be viewed in UML diagrams and the visualization techniques used to represent them are not optimal for common tasks in software engineering.

Views offer visual representations of a model. In a view, model element properties are visualized by creating a mapping to visual properties. Examples of these visual properties are: Position (Layout), Size (Width, Height, Depth), Color (Hue, Saturation, Luminance), Shape and Orientation.

7.3 Proposed Task-Oriented Views

This section describes the views we propose to support comprehension of UML models. In addition to the textual description, the views are summarized according to the Task-Oriented Modeling framework in Table 7.1. The views are implemented in the tool MetricView Evolution [194].

7.3.1 MetaView

Figure 7.2 shows a MetaView in which inter-diagram relations are visualized. The different UML diagrams that are created during the software engineering process offer different levels of abstraction. Typically, starting at a high abstraction level, we find the use cases. These use cases are described in more detail by sequence diagrams. In sequence diagrams objects representing classes occur. Classes and their relations to each other are described in class diagrams. The internal behavior of these classes can be described by state machines.


Table 7.1. Description of the views according to the Task-Oriented Modeling framework

View            Task                                                        Properties
MetaView        comprehension, maintenance, completeness analysis          implicit and explicit relations between diagrams
MetricView      comprehension, quality evaluation, completeness evaluation metrics (direct and indirect internal, external)
UML-City View   comprehension, quality evaluation, completeness evaluation metrics (direct and indirect internal, external)
ContextDiagram  comprehension                                              implicit and explicit relations between diagrams
QualityView     quality evaluation                                         metrics (direct and indirect internal, external)
EvolutionView   quality evaluation, identification of trends               metrics (direct and indirect internal, external), different versions

A problem that exists in regular UML tools is that each of the diagrams is shown separately; this hides the relations between different diagrams and model elements. Our proposed solution to these problems is the MetaView. It gives an overview of the diagrams that describe the model and makes it possible to show the relations between (elements on) different diagrams. This last feature allows tracing through the different abstraction levels that the different types of diagrams offer.

Figure 7.2 shows the four types of elements that take part in this example: a use case, an object that occurs in the sequence diagram describing the use case, the object's class, and the state diagram describing the class.
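Conceptually, this tracing is a traversal over the inter-diagram relations. The following Python sketch illustrates the idea on an invented, strongly simplified model representation; it is not the data structure of MetricView Evolution.

    # invented toy representation of inter-diagram relations
    model = {
        "use_case": {"Leave Route": ["SD_LeaveRoute"]},   # use case -> sequence diagrams
        "sequence": {"SD_LeaveRoute": ["route: Route"]},  # sequence diagram -> objects
        "object": {"route: Route": "Route"},              # object -> class
        "state": {"Route": ["SM_Route"]},                 # class -> state machines
    }

    def trace(use_case):
        """Trace from a use case down to the state machines of the involved classes."""
        for sd in model["use_case"][use_case]:
            for obj in model["sequence"][sd]:
                cls = model["object"][obj]
                for sm in model["state"].get(cls, []):
                    yield (use_case, sd, obj, cls, sm)

    for chain in trace("Leave Route"):
        print(" -> ".join(chain))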

Tasks. The MetaView can be applied in comprehension, maintenance, and completeness analysis tasks. Browsing through a model for instance is a comprehension task that is actively supported by this view. Another example is impact prediction, for which the visualization of inter-diagram relations can be useful.

Properties. Explicit and implicit relations can be viewed using the MetaView (direct and indirect internal).


Figure 7.2. MetaView: Tracing from an object to a use case and to a related class and state diagram

Figure 7.3. MetricView: Combining UML and metrics visualization


Figure 7.4. UML-City: Combining the MetaView and MetricView

Figure 7.5. Quality Tree view


7.3.2 MetricView

Figure 7.3 shows an example of the proposed MetricView, in which three different metrics are visualized on top of a regular class diagram. The idea of MetricView is to combine the existing layout of UML class diagrams with the visualization of metrics and UML models using a set of techniques adopted from geographical information systems (GIS) [113] [17]. Applying metrics to a UML model can result in an overwhelming amount of data. This data is usually presented in tables, such that the software engineer has to make the mapping between metrics values in the table and classes in the UML diagrams manually or in his mind. MetricView overcomes this problem by integrating the model and metric visualization. MetricView can represent the metrics using color, size and/or shape to visualize values.

The tasks supported by this view are: comprehension, quality evaluation and maturity/completeness evaluation. It can for instance help identify key classes by emphasizing them using metric visualization. It supports the evaluation of metrics at the level of model elements for a single version of a model. These metrics can be both indicators of the quality and of the maturity or completeness of a model.

The metrics for which the values are visualized are direct or indirect internal, or external properties.

7.3.3 UML-City View

Figure 7.4 shows an example UML-City. This view combines the concepts of the MetaView and MetricView. As metric visualization the '3D-heightbar' is used; this visualization shows a box on top of the model element, where the height and the color of the box indicate the value of the metric. Low metric values are depicted by flat green boxes while high values are depicted by tall red boxes. Applying this to the MetaView results in a view for which a city is the metaphor.
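A possible mapping from a metric value to the height and color of such a box is sketched below in Python; the normalization and the scaling constant are illustrative assumptions, not the values used in the tool.

    def heightbar(value, v_min, v_max, max_height=100.0):
        """Map a metric value to a box height and an RGB color (green = low, red = high)."""
        t = (value - v_min) / (v_max - v_min) if v_max > v_min else 0.0
        height = t * max_height                        # low values -> flat, high values -> tall
        color = (int(255 * t), int(255 * (1 - t)), 0)  # interpolate from green to red
        return height, color

    print(heightbar(28, v_min=0, v_max=30))  # e.g. a class with 'number of children' = 28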

Tasks and Properties. As the UML-City view is a combination of the MetaView and MetricView, it also combines their tasks and properties. It gives an overview of the elements with extreme values for a metric. For example, the UML-City can be used for model-wide identification of classes with extreme metric values. Depending on how the diagrams are arranged in the underlying MetaView, patterns can be detected to identify packages or parts of the model with particular metric values.

7.3.4 Quality Tree View

Quality models such as ISO 9126 [89] and the one proposed by Khosravi et al. [94] provide a structure to define what quality means in a given context. The most common approach to create such models is used in so-called decompositional quality models. Quality is decomposed in subconcepts such as maintainability and understandability. Each of these subconcepts can in turn be decomposed again. In this way a tree-like structure is constructed which builds a quality model. At the leaves of the tree are metrics. These metrics may be obtained from a UML model and the metric values are used to calculate values for each of the higher nodes in the quality model.
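The aggregation from the metric leaves up to the root can be sketched as a simple recursive computation over the tree. The structure, weights and metric values below are invented for illustration and do not correspond to the quality model of Chapter 3.

    # inner nodes map child names to weights; leaves carry metric values in [0, 1]
    tree = {
        "quality": {"maintainability": 0.5, "understandability": 0.5},
        "maintainability": {"coupling_score": 0.6, "size_score": 0.4},
        "understandability": {"naming_score": 1.0},
    }
    leaves = {"coupling_score": 0.7, "size_score": 0.9, "naming_score": 0.8}

    def node_value(name):
        if name in leaves:               # a metric at a leaf of the tree
            return leaves[name]
        return sum(weight * node_value(child) for child, weight in tree[name].items())

    print(round(node_value("quality"), 3))   # value of the root node (here 0.79)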

Figure 7.5 shows a Quality Tree view. In the Quality Tree view, the nodes represent the concepts of the quality model and the edges represent the relations between the concepts. Our contribution to the well-known concept of quality models is that in the Quality Tree view, color or graphs represent the value of each node. Additionally, our proposed Quality Tree View is implemented such that it interacts with the UML model and supports traceability between nodes in the quality model and elements in the UML model. In Figure 7.5 a three-dimensional variant of the Quality Tree is shown. The height bars represent metric values for each of the nodes.

As quality models will differ based on the context in which they are used, the Quality Tree View is configurable to represent any hierarchical quality model, such as the quality model for UML (Chapter 3). This tailoring is possible by changing the structure of the tree, as well as by changing the functions attached to the relations in the tree. Additionally it is possible to change the metrics that produce the input for the quality model.

Tasks. The Quality Tree View is designed to support quality evaluation tasks. This evaluation can start at a high abstraction level by investigating the value of the root node of the tree. From this root node the exploration can continue to lower abstraction levels until the level of metrics is reached.

Properties. Metrics can be indirect or direct internal properties, as well as external properties.

7.3.5 Context View

The context of a model element consists of all model elements it relates to. The elements of a model are typically scattered over several diagrams. UML diagrams are projections of the entire model; they typically do not contain all model elements. Accordingly, it often occurs that only a limited context of a model element is viewed in one diagram. To fully understand a model element it might be necessary to know its entire context. Therefore we propose the Context View. The model element whose context is viewed in a context view is centered in the diagram. All model elements that are directly related to the particular model element are viewed as a circle around the model element. As the context of a model element is potentially very large and only a specific subset of the context might be necessary for a task, it is desirable to filter the context. Two straightforward filtering criteria are the model element type or the relation type (i.e. association, dependency, or inheritance relation). Figure 7.6 shows the context view for a class. The context is filtered such that only model elements of the type 'class' that are related to the class via an 'inherits from' relationship are viewed. The right side of Figure 7.6 contains one of the class diagrams as shown in current UML tools. Only four classes that inherit from the particular class are shown.

Figure 7.6. Context View (left) showing all the children of a single class, compared with a regular class diagram (right)
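Filtering the context by relation type amounts to a simple selection over the relations of the model, as the following Python sketch illustrates; the class names and the data structure are invented.

    # invented relations of the form (source, relation type, target)
    relations = [
        ("RouteSegment", "inherits", "Route"),
        ("Waypoint", "association", "Route"),
        ("GuidedRoute", "inherits", "Route"),
    ]

    def context(element, relation_type=None):
        """All elements directly related to 'element', optionally filtered by relation type."""
        return [src for (src, rel, dst) in relations
                if dst == element and (relation_type is None or rel == relation_type)]

    print(context("Route", relation_type="inherits"))  # only the inheriting classes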

Tasks. The Context View is designed to support comprehension tasks. It can for example be used to explore why a metric of a class has a specific value. The example in the figure is a class where the metric 'number of children' is 28 (for explanations of common object-oriented system metrics see [41]). It would be tedious to analyze this outlier by browsing through all diagrams where inheritance relations of the class are viewed. The Context View can also be used for impact analysis (e.g. which classes depend on a particular class?).

Properties. Explicit and implicit relations can be viewed using the Context View (direct and indirect internal).

7.3.6 Evolution View

Figure 7.7 shows the Evolution View, in which the two concepts graph and calendar are combined to identify trends. The reason for using a graph is that it is an effective way to visualize the evolution of metric data. The purpose of the Evolution View is to enable users to spot trends in the values of quality attributes and/or metrics at multiple abstraction levels. At system level such a graph can be used to give an overview of changes in aggregated data. By combining it with the concept of a calendar, i.e. mapping time on the horizontal axis and values on the vertical axis, and adding color to indicate whether a given value is considered good or bad, it becomes a compact and intuitive way to enable the evaluation of the evolution of quality data. The same technique can be applied at diagram and element level to allow for different analysis granularity.

Figure 7.7. Evolution View: based on the calendar metaphor

Tasks. Quality evaluation, prediction, monitoring.

Properties. Metrics, i.e. direct and indirect internal and external properties.

7.3.7 Search and Highlight

The aforementioned views are complemented with search and highlight functionality. The results of search actions are visualized by highlighting the relevant diagrams and diagram elements with a distinct color (e.g. yellow). Additionally, relations between elements in the set of results and other elements in the model are drawn.

7.4 Related Work

In this section we discuss related approaches. We present the following categories of related work: approaches that aim to support tasks in model-centric software engineering, approaches using techniques similar to ours, and applications of our or similar techniques.

We start with the discussion of approaches that aim at supporting particular tasks in model-centric software engineering. The model analysis tool SDMetrics [198] calculates metrics for UML models and presents the data in tables and graphs (refer to Chapter 2). Using SDMetrics or similar tools, the user has to relate the metrics data to the model elements which are on his mental map. We propose the MetricView to combine metrics representation and the already existing layout of the class diagram. As the mental map often coincides with the layout in the class diagrams, MetricView supports the user in establishing the connection between metrics and model elements. In MetricView we enrich the existing layout of class diagrams with colors to represent metrics values. A similar technique is used by Schauer et al. [172]. They use color to indicate groups of classes belonging to the same design pattern [72].

Another direction of related work aiming at improving model-centric software engineering deals with layout algorithms. For example the work of Eiglsperger [60] and Eichelberger [59] aims at creating layouts for class diagrams which comply with rules from cognitive psychology. These layout algorithms are in particular useful for reverse engineered models, where no layout exists. An integral idea of MetricView is to keep the existing layout as created by the designer or generated by a layout algorithm. However, the idea of the Context View is to create a new class diagram only containing a class and its context. Kollmann and Gogolla [98] present a similar idea. They use metrics such as coupling, fan-in and fan-out to determine the set of classes which should be contained in their version of the Context View. In contrast to using metrics, we use the existing structure of relations in the model to determine the Context View.

Another category of related work contains approaches that use similar techniques as we do, but that do not focus on modeling. Similar visualizations to represent metrics are proposed by Langelier and Sahraoui [123]. Their visualizations mainly aim at visualizing metrics of source code; therefore they have to create a new layout, whereas we can use the already familiar layout of the UML class diagrams. In [124] several mappings from properties to visual properties, called polymetric views, are explored. The main difference to our work is that polymetric views are general software visualizations aiming at reverse-engineering tasks, while our work consists of UML-based visualizations targeted at various model-centric software engineering tasks.

In our literature study we found several studies about visualizing the evolution of source code, such as Voinea et al. [193] and Langelier and Sahraoui [122]. However, only little work addresses the evolution of UML models. Xing and Stroulia [200] present the UMLDiff algorithm for automatic detection of structural changes between two class diagrams. The output of the UMLDiff algorithm is a detailed report of structural differences between two versions of UML class diagrams. Our Evolution View presents the trend of metrics of the entire model over several versions. Different from UMLDiff, our Evolution View also takes external properties into account. However, as opposed to UMLDiff, which is fully automated, our Evolution View is an aid for humans to analyze models.

The main purpose of our MetaView is to give an overview of and navigate through an entire UML model by showing all diagrams from all views concurrently. All UML CASE tools known to us only provide a tree for navigating through the model. Besides this, our literature study did not yield any work related to navigational aids for UML models.


An approach related to ours is proposed by Kersten and Murphy [93] [144]. Their tool MYLAR presents parts of a source code base which are relevant for a given task. The tool highlights program elements which have a high 'degree-of-interest' (DOI) and filters program elements with a low DOI. Their tool is adaptive in the sense that the DOI-data is collected by the tool during usage. Hence, the interface adapts to a task, which is a difference to our approach, where we provide the user with pre-defined views for specific tasks. So far, their tool only takes internal properties into account. The main difference is that their tool works for source code and we focus on UML models. Their tool's views are in fact adaptive interfaces, which change according to the usage pattern.

The work of Hansen [85] is in the category of applications of our techniques. Hansen gives an application of the MetricView visualization: he visualized project data mapped to UML classes and packages and reports positive feedback from a case study within the company ABB.


Chapter 8

Validation of Task-Oriented Views for Comprehension

In this chapter we report on an experiment to validate four of the views that we proposed in the previous chapter: MetaView, ContextView, MetricView, and UML-CityView. The purpose of this experiment is to study whether there is a difference between the proposed views and the existing views with respect to comprehension correctness and comprehension effort. The comprehension task performed by the subjects was to answer a questionnaire about a model. 100 MSc students with relevant background knowledge participated in the experiment. The results are statistically significant and show that the correctness is improved by 4.5% and that the time needed is reduced by 20%.

8.1 Introduction

In the previous chapter we have proposed task-oriented views for UML models. The purpose of this chapter is to empirically validate our proposed views. We conducted two replications of a controlled experiment with 100 subjects to validate the effort reduction and comprehension correctness improvement achieved by using the views. In this chapter we will address research question RQ5 as stated in Chapter 1:

• RQ5: Can task-oriented views of UML models enhance the comprehensionof the models?

Due to the limited amount of time, this study is limited to the validation of the views with respect to comprehension and analysis tasks. The tasks in this study will mainly address the following views:


• MetaView,

• ContextView,

• MetricView,

• UMLCityView

and the search and highlight functionality.

Many researchers have studied and implemented visualization tools to support comprehension of source code. Examples of these tools are Rigi [182], SHriMP [181] and CodeCrawler [125].

As the importance of models such as UML models is increasing, we focus on the comprehensibility of UML models. UML diagrams are graphical representations of software and, hence, are expected to support comprehensibility. However, we believe that tools are needed to further improve the comprehension of UML models. Especially the increased use of UML models and the extended set of tasks using UML models justify this need.

Existing work addresses the comprehensibility of UML models [185] [164] [156] [157] and the layout of diagrams [197] [60] [59]. Only little work addresses how users perceive UML models [84]. We proposed task-oriented views for improving the comprehension of UML models [120] [119].

Tools vs. Views. The purpose of the experiment is to evaluate views for UML models. One possibility would be to provide the views as printouts or static screen images to the experimental subjects. However, in software engineering the views are usually implemented in CASE tools. The CASE tools make it possible to interactively adjust the views by clicking, zooming and scrolling. We decided to use CASE tools to present the views in this experiment, because this approach is much more realistic than using static snapshots of the views. Hence, we do not directly measure the views, but we measure tools that implement the views and interactive manipulation thereof. Note that due to this indirection we will often refer to tools in the remainder of this chapter.

This chapter is structured as follows: Section 8.2 describes the design of the experiment, Section 8.3 presents and discusses the results, Section 8.4 discusses threats to the validity, and Section 8.5 concludes this chapter.

8.2 Experiment Design

8.2.1 Purpose and Hypotheses

In Table 8.1 we summarize the purpose of the experiment according to the Goal-Question-Metric paradigm (GQM) [14].


Table 8.1. GQM template

Analyze                    comprehension with task-oriented views
for the purpose of         evaluation
with respect to            correctness and effort
from the point of view of  the researcher
in the context of          Master's students at the TU Eindhoven.

More specifically, we want to compare the usefulness of task-oriented views with the traditional views, which are used as a baseline in this experiment. Comprehension plays an important role in the development and maintenance of software systems. In particular two concepts are of interest for the evaluation of comprehension techniques: correctness and effort. Correctness is essential for comprehension. Incorrect comprehension of a system can lead to wrong actions introducing faults and communication overhead. Effort is relevant from an economical point of view.

To evaluate the task-oriented views we are interested in whether they differ from the traditional UML views. This leads us to the following hypotheses:

• H10: There is no significant difference between task-oriented and traditional views in terms of effort needed for comprehending the model.

• H1alt: There is a significant difference between task-oriented and traditional views in terms of effort needed for comprehending the model.

• H20: There is no significant difference between task-oriented and traditional views in terms of correctness of comprehension of the model.

• H2alt: There is a significant difference between task-oriented and traditional views in terms of correctness of comprehension of the model.

In addition to testing the described hypotheses, we are interested in information describing how the subjects experience the use of the different views. We expect this additional information to be useful to explain the results, to find opportunities for improvements and as an indicator for the likelihood of engineers adopting the technique in practice.

8.2.2 Task, Objects and Treatment

The task in the reported experiment was to answer comprehension questions about a UML model. In order to answer the questionnaire, the subjects had to analyze a given UML model using a tool. The experiment was carried out in two runs. In each of the two experimental runs, a different UML model had to be analyzed.

For each model, a specific questionnaire was used. However, the two questionnaires were conceptually equal. The main part of the questionnaire contained 29 multiple-choice questions for assessing the subjects' comprehension of the model. The questions addressed relations between model elements and diagrams, as well as metrics. The questions were designed such that they fit into one of the following categories:

• Category 1: Multiple Class Diagrams (CDs). The questions address information that is scattered over multiple class diagrams. Subcategories address coupling via associations, inheritance and occurrences of classes in several class diagrams.

• Category 2: Relations between Class Diagram and Sequence Diagram (CD – SD). The questions address relations between elements in class diagrams and sequence diagrams. Subcategories address message calls between class instantiations and occurrence of classes in sequence diagrams.

• Category 3: Relations between Use Cases, Sequence Diagrams and Class Diagrams (UC – SD – CD). The questions address relations between elements in use cases, sequence diagrams and class diagrams.

• Category 4: Metrics. The questions address metrics of the UML model.

Examples of questions are: 'Which classes contribute to implementing the use case "Leave Route"?' (Category 3) and 'Which classes receive method calls of class "Route"?' (Category 2). The entire questionnaires are available in the replication package [110].

In addition, the questionnaire contained a section of questions addressing the subjects' evaluation of the model, the task, and the tool. In the first experimental run an additional section was used to assess the subjects' background, and in the second run this section was used for qualitative feedback on the tools. The questionnaire was also used to log the time needed to fulfill the task (only for answering the questions of the questionnaire's main section). The questionnaire was the same for both treatment groups.

The objects in this experiment are the UML models that are used for the analysis task. Both models were of similar size and complexity. The size was chosen to be larger than a pure toy-example, but still allowing to fulfill the task in the given amount of time. The characteristics of the models are given in Table 8.2. The application domains were an insurance information system in the first run and a car navigation system in the second run. It is reasonable to assume that both application domains were equally familiar to the subjects.


Table 8.2. Model characteristics

                    Run 1      Run 2
Application Domain  Insurance  Car Navigation
Classes             39         38
Use Cases           11         11
Class Diagrams      5          5
Sequence Diagrams   5          6

We decided to measure the views indirectly by means of tools that implement the views. Therefore the treatment in this experiment is the use of different tools: MetricView Evolution and the combination of Poseidon (version 4.2.1) [3] and SDMetrics (version 2.0) [198]. Poseidon and SDMetrics represent the current state-of-the-practice in UML tools and these tools are used by the control group as a baseline. These tools are described in Chapter 2.

8.2.3 Subjects

In total 100 MSc students participated in the experiment, which was conducted within the course "Software Architecting" in the fall term of 2006 at the Technische Universiteit Eindhoven (TU/e). All subjects hold a bachelor degree or equivalent. 80% of the students have a bachelor degree in computer science, 13% have a bachelor in electrical engineering and 7% have a different background such as math or mechatronics (see Table C.2). 54% of the students received their bachelor from the TU Eindhoven, 26% from other Dutch institutions, 8% from other European universities and 12% from other countries (see Table C.1).

Using the debriefing questionnaire we measured the level of the subjects' relevant background knowledge. Table C.3 shows the results of the students' self-assessment on a Likert-scale from 1 (no knowledge) to 5 (applied the technology several times in an industrial context). The results show that more than 93% have UML knowledge and that more than 62% have experience using the UML. The level of experience for UML and metrics tools is slightly lower, but still the majority of the subjects has the relevant knowledge (94% have knowledge about UML tools and 81% have knowledge about metrics tools). Statistical hypothesis testing shows that there are no significant differences between the treatment groups with respect to background knowledge and motivation. The students were motivated to perform well in the task, because it was part of an assignment which was mandatory to pass the course. The students were not familiar with the goal and the underlying research question of the experiment, to avoid biased behavior.



Table 8.3. Experimental design

            MVE           POS
First Run   Group B (48)  Group A (52)
Second Run  Group A (50)  Group B (45)

8.2.4 Design

The design of the experiment is a between-subjects design as described in Table 8.3. The 100 subjects are assigned to two groups (A and B) by randomization, such that we can assume absence of variation [67] in subject background between the two groups. The number of subjects per group is given in Table 8.3. For technical reasons the groups are divided 48-52 instead of 50-50. The second run of the experiment is a replication of the first run. Note that for the second run five subjects of the first run did not show up, which results in a mortality rate of 5%. A pedagogical constraint of the course was that the subjects should use both tools during the assignment sessions. To avoid learning effects of the tool or the UML model, the groups had to switch tools and a new UML model was used as object in the second run. However, there is still the chance for general learning effects, which will be discussed in the next section.

8.2.5 Preparation

Before the experiment we conducted a pilot run to evaluate the experimental material and the approach. The experimental material was improved according to the feedback received from the participants of the pilot run. The participants of the pilot run were colleagues of ours and final year students, and they did not participate in the actual experiment.

The experiment was conducted as an assignment during the course 'Software Architecting'. Most students already had relevant knowledge in object-oriented system design and UML. Nonetheless we explained UML and all tools used during the experiment in lectures prior to the experiment. To familiarize the subjects with the tools they had to conduct two preparatory assignments which were similar to the experiment.

8.2.6 Operation

The experiment was conducted in two runs. Both runs were carried out as a classroom assignment in an exam-like setting. The subjects performed the task individually. The subjects used their own laptops with the tools properly installed. We distributed the models using our university's education-support system 'StudyWeb' (more information is available at http://studyweb.tue.nl/). To prevent subjects from switching between treatments, i.e. tools, we configured the StudyWeb system such that for each subject the model was only available in the file format for the tool it was supposed to use. The questionnaires were distributed as hardcopies, as for both treatment groups the questionnaire was the same. In the pilot run the subjects took up to 90 minutes to complete the task. As students tend to be more cautious, diligent and, hence, slower in an assignment, we set the time for the experimental runs to 120 minutes. During the runs, the author and two colleagues were available for the subjects to answer comprehension and technical questions and to observe the runs. The subjects were not allowed to communicate verbally or via network communication. This was enforced by spot checks.

8.2.7 Variables

The independent variables in this experiment are

• L (tool): MetricViewEvolution (MVE) or Poseidon+SDMetrics (PoS),

• R (run): 1 or 2, and

• M (UML model): Insurance Information System (IIS) or Car Navigation System (CNS).

The variables R and M coincide in our experimental design.

Since we are interested in the correctness of and the effort needed for the comprehension task, we have the following two dependent variables:

• Time T: total time in minutes to perform the task.

• Correctness C: the number of correct answers divided by the total number of questions.

We chose to measure correctness as a ratio instead of the absolute number of correct answers to make the results easily comparable to replications of this experiment with a different number of questions.
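For instance, with hypothetical numbers, a subject answering 26 of the 29 questions correctly would obtain:

    C = \frac{\text{number of correct answers}}{\text{total number of questions}} = \frac{26}{29} \approx 0.897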

Now we can express the hypotheses in terms of the variables (where Tx and Cx are the time and correctness of the task performed with tool x):

• H10: TMVE = TPOS

• H20: CMVE = CPOS


Page 138: Assessing and Improving the Quality of Modeling - Technische

120 Chapter 8 Validation of Task-Oriented Views for Comprehension

The alternative hypotheses are accordingly:

• H1alt: TMVE ≠ TPOS

• H2alt: CMVE ≠ CPOS

The time was logged by the students in slots on the questionnaire. The subjects were instructed to complete the questions in the given order, to log breaks, and to start timing when they started answering the questions after the tools had been set up. As each subject used a computer for the task, it was ensured that each subject could use a consistent clock. The begin time and the end time would have been sufficient for us to calculate the total time, but we also included four spots for split times in the questionnaire, to enable us to perform sanity checks on the logged time.

8.2.8 Analysis Techniques

We summarize the results of the experiment using established descriptive statistics. Each of the experimental runs is a between-subjects design. Therefore suitable tests for hypothesis testing are the independent-samples Student t-test and its non-parametric alternative, the Mann-Whitney test. We observed that the assumptions of the Student t-test do not hold in all cases (Appendix C). Therefore we report the results of both the Student t-test and the more robust Mann-Whitney test; the results are consistent. We used the tool SPSS (version 12.0.1) [2] for hypothesis testing. We apply the standard threshold of p<0.05 for statistical significance. To enable comparison of our results with other studies we report the effect size using Cohen’s d [46].
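The same analysis can be reproduced with open tools; the following is a minimal sketch in Python with SciPy (the thesis used SPSS), operating on hypothetical per-subject times rather than the experimental data.

```python
# Minimal sketch of the hypothesis tests in Python/SciPy; the data below is
# illustrative only.
from math import sqrt
from scipy import stats

mve_times = [54, 48, 61, 50, 57, 44, 62, 49]
pos_times = [67, 72, 58, 70, 65, 75, 61, 69]

# Levene's test checks the equal-variances assumption of the t-test.
lev_stat, lev_p = stats.levene(mve_times, pos_times)

# Independent-samples t-test; fall back to Welch's variant if Levene rejects.
t_stat, t_p = stats.ttest_ind(mve_times, pos_times, equal_var=(lev_p >= 0.05))

# Non-parametric alternative: the Mann-Whitney U test.
u_stat, u_p = stats.mannwhitneyu(mve_times, pos_times, alternative='two-sided')

def cohens_d(a, b):
    """Effect size based on the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

print(t_p, u_p, cohens_d(mve_times, pos_times))
```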

The subjective evaluation data is obtained from the post-test questionnaire, which was designed as a multiple-choice questionnaire. The answers are on a five-point Likert scale [129] and, hence, measured on an ordinal scale. We summarize the data by presenting the frequencies as percentages for each answer option and providing additional descriptive statistics where appropriate. We test whether the answer distributions of different treatment groups are equal using the χ²-test [139]. For this test we used Microsoft Excel. We apply the standard threshold of p<0.05 for statistical significance. When we compare distributions, a χ² value greater than 7.82 implies that p<0.05.
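A minimal sketch of this χ²-comparison, again in Python/SciPy rather than Excel; the answer frequencies per Likert option are hypothetical.

```python
# Minimal sketch of the chi-square test on answer distributions; the counts
# per Likert option (1..5) are illustrative only.
from scipy.stats import chi2_contingency

mve_counts = [2, 5, 14, 18, 6]
pos_counts = [4, 9, 16, 17, 5]

chi2, p, dof, expected = chi2_contingency([mve_counts, pos_counts])
print(chi2, p, dof)  # the distributions differ significantly when p < 0.05
```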

8.3 Results

In this section we present the results of the experiment. First we discuss how we dealt with outliers in the data, then we give descriptive statistics of the results, then we discuss the hypotheses, and finally we present the results of the subjects’ subjective evaluation.

8.3.1 Outlier Analysis

To be able to draw valid conclusions from data it is necessary to analyze and possibly remove outliers from the data. Outliers are extreme values that may influence the conclusions drawn from the collected data. According to Wohlin et al. [196], for each outlier it must be analyzed whether it is caused by an extraordinary exception, which is very unlikely to happen again, or whether the cause of the outlier can be expected to happen again. Examples of causes of the first kind are errors in the data collection or external factors such as a power failure during the execution of the experiment. Outliers of the first kind should be removed, because they distort the results. Outliers of the latter kind should not be removed, because they are expected to happen again.

We used boxplots to identify outliers for both dependent variables. Ten observations were found to be outliers (out of 400 observations: two variables, two runs, 100 subjects). All available information from the questionnaires about the identified observations and the corresponding subjects was taken into account, and we even contacted some of the subjects for further information. For four subjects we identified causes resulting in an exclusion of their data from the analysis of the experiment. The causes were:

• lack of knowledge in OO technologies (two subjects),

• lack of training (one subject did not participate in the preparatory assignments and therefore did not know the tools well enough), and

• technical problems (one subject had serious technical problems with the MetricView Evolution tool; therefore this subject’s data for the MVE run was discarded).

The other outliers were not caused by exceptional problems and therefore remained in the data set.

8.3.2 Descriptive Statistics

The data of the dependent variables correctness and time is summarized in Table 8.4. The data is presented separately for the first and the second run. The columns of the table show the treatment, the number of observations (N), the mean, the median, the minimum and the maximum value, and the standard deviation (StDev). Additionally the data is summarized in boxplots in Figure 8.1 (time) and Figure 8.2 (correctness). The small circles in these boxplots denote outliers, i.e. values that are outside the range median ± 2 × standard deviation. Note that the maximum time needed to complete the task was 110 minutes. Hence, all subjects were able to complete the task within the given two hours, which means that the experiment is not biased due to a possible ‘ceiling effect’.

Figure 8.1. Boxplot for time

Figure 8.2. Boxplot for correctness
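The outlier criterion used for these boxplots can be sketched as follows (the data is illustrative; the real analysis was done per dependent variable and run).

```python
# Minimal sketch: flag values outside median ± 2 * standard deviation.
from statistics import median, stdev

def outliers(values):
    m, s = median(values), stdev(values)
    return [v for v in values if v < m - 2 * s or v > m + 2 * s]

times = [44, 48, 49, 50, 52, 54, 56, 57, 61, 110]  # illustrative minutes
print(outliers(times))  # -> [110]
```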


Table 8.4. Descriptive statistics

First Run
Dep. Var. | Tool | N  | mean  | %      | median | Min  | Max  | StDev
Correct.  | MVE  | 45 | .838  | 104.8% | .833   | .630 | .960 | .0749
Correct.  | POS  | 51 | .799  | 100.0% | .791   | .540 | .960 | .0931
Time      | MVE  | 45 | 54.62 | 80.8%  | 54.00  | 32   | 84   | 12.145
Time      | POS  | 51 | 67.59 | 100.0% | 63.00  | 35   | 110  | 16.384

Second Run
Dep. Var. | Tool | N  | mean  | %      | median | Min  | Max  | StDev
Correct.  | MVE  | 48 | .921  | 104.5% | .931   | .759 | 1.00 | .0546
Correct.  | POS  | 43 | .881  | 100.0% | .896   | .759 | 1.00 | .0603
Time      | MVE  | 48 | 52.48 | 79.4%  | 50.50  | 36   | 80   | 9.789
Time      | POS  | 43 | 66.09 | 100.0% | 67.00  | 33   | 100  | 14.761

8.3.3 Hypothesis Testing

H1 – Time

The results of the hypothesis tests are summarized in Table 8.5. The first three columns present the dependent variable, the experimental run, and the significance of Levene’s test for equality of variances (cf. Appendix C). The table reports the test statistic (t), the degrees of freedom (df) and the significance (Sig.) for the Student t-test. For the Mann-Whitney test the test statistic (U) and the significance (Sig.) are reported. The next column reports the effect size according to Cohen’s d [46] and the last column reports whether the hypothesis can be rejected.

Table 8.5 shows that the Student t-test and the Mann-Whitney test have low p-values. Therefore we can reject the hypothesis H1_0, which states that there is no difference in time for completing the task using MVE or PoS. The effect size is 0.899 in the first run and 1.087 in the second run. This is a relatively large effect, since Cohen [46] states that an effect size of 0.8 is ‘large’. The descriptive statistics in Table 8.4 show a difference in time for both runs. On average the MVE users need around 20% less time for analyzing the UML model than the control group.

Time in Detail. A possible cause of the effort reduction for MVE compared to PoS is that the subjects in the PoS group use two separate tools, which might result in overhead time due to switching between the tools. However, the use of two different tools would only affect category 4 (metrics), because the metrics tool SDMetrics is not used for the tasks of categories 1, 2 and 3. If the use of two separate tools results in overhead time, we would expect the difference in time between MVE and PoS to be bigger for category 4. We therefore take a closer look at the potential overhead time for category 4. Table 8.6 summarizes the detailed analysis results for both runs. The table contains two data blocks of four columns each, for the combined analysis of categories 1, 2 and 3 and for the separate analysis of category 4, respectively. Each data block is organized as follows: the first column contains the mean time, the second and third columns contain the test statistic U of the Mann-Whitney test and the p-value, and the fourth column reports the effect size according to Cohen’s d. In both experimental runs the time difference for the combined categories 1, 2 and 3 is statistically significant and the effect size is around 1, hence a large effect. However, for category 4 the difference is not significant and the effect size is around 0.3, which is considered small, whereas a larger effect size was expected. As a result, we conclude that the results for time are not biased by the fact that the treatment PoS comprises the use of two separate tools.

H2 – Correctness

The results for testing the hypothesis concerning the correctness of the comprehension task (H2) are also presented in Table 8.5. For both experimental runs there is a significant difference in correctness of the task between the MVE users and the control group. As the statistical tests yielded p-values far below the significance level of 0.05, we reject H2_0. According to Cohen [46] an effect size of 0.5 is ‘medium’, which means the reported effect sizes of 0.452 in the first run and 0.695 in the second run are medium and above the medium level, respectively. Table 8.4 shows the differences between the mean values of the MVE group and the control group. In the first run the mean was increased by 4.77% by using MVE and in the second run the increase was 4.54%. We discuss the difference between the runs in Section 8.3.4.

Correctness in Detail. We take a closer look at the differences in correctness between the treatment groups, in particular per task category. Therefore we compared the average correctness of the MVE group with the control group’s average correctness per category. The correctness results and differences per category are summarized in Table 8.7. The differences for both runs are also shown in Figure 8.3, where the control group PoS is used as a baseline (e.g. the average correctness of MVE for category 2 in run 1 is about 14% higher than for the control group). Additionally we provide the differences between the two runs in Figure 8.4, where run 1 is the baseline and the percentage indicates the difference of run 2 compared to run 1.

For categories 1 and 3 the differences between treatments are similar to the overall difference discussed above. The results in Figure 8.3 for category 2, which addresses relations between class diagrams and sequence diagrams, are noticeable: in the first run the average correctness of MVE is 14% higher than for PoS. Even a detailed analysis at the level of individual questions did not reveal a particular cause for this large difference.


Table 8.5. Results of hypothesis tests

Dep. Variable | Run | Levene Sig. | t (T-test) | df     | Sig. (T-test) | U (Mann-Whitney) | Sig. (Mann-Whitney) | Cohen's d | Hypothesis
Time          | 1   | .081        | -4.436     | 91.391 | <0.001        | 597.0            | <0.001              | 0.899     | H1_0 rejected
Time          | 2   | .033        | -5.122     | 71.672 | <0.001        | 444.0            | <0.001              | 1.087     | H1_0 rejected
Correctness   | 1   | .151        | 2.219      | 93.245 | 0.029         | 873.5            | 0.041               | 0.452     | H2_0 rejected
Correctness   | 2   | .558        | 3.131      | 92.684 | 0.002         | 735.0            | 0.002               | 0.695     | H2_0 rejected


Table 8.6. Detailed analysis of time (* indicates statistically significant difference)

                  Categories 1, 2, 3                     Category 4
Treatment | Run | Mean  | U      | P-value | Cohen's d | Mean  | U     | P-value | Cohen's d
MVE       | 1   | 44.40 | 557.8* | <0.001  | 0.98      | 9.73  | 992.0 | 0.250   | 0.24
PoS       | 1   | 55.59 |        |         |           | 11.47 |       |         |
MVE       | 2   | 43.96 | 698.0* | <0.001  | 1.05      | 8.52  | 792.5 | 0.055   | 0.36
PoS       | 2   | 56.56 |        |         |           | 9.53  |       |         |


Table 8.7. Correctness: results and differences

      | Run | Category 1 | Category 2 | Category 3 | Category 4
MVE   | 1   | 0.64       | 0.89       | 0.79       | 0.93
PoS   | 1   | 0.62       | 0.78       | 0.76       | 0.95
Diff. | 1   | 3.3%       | 14.3%      | 4.1%       | -2.1%
MVE   | 2   | 0.89       | 0.93       | 0.97       | 0.89
PoS   | 2   | 0.88       | 0.89       | 0.94       | 0.84
Diff. | 2   | 1.3%       | 4.8%       | 3.1%       | 6.2%

Figure 8.3. Relative difference in correctness between tools per category

However, when we take the data from Figure 8.4 into account, we find that the improvement between runs for MVE in category 2 is rather low. Both observations indicate that it is especially intuitive to use MVE for tasks of category 2, i.e. users benefit from MVE without the need to invest a large amount of effort in learning the tool’s features (a ‘steep learning curve’).

Category 4 is the only category where the questions were slightly changed in the second run: some metrics questions in the second run contained an indirection (e.g. ‘What is the root of the inheritance hierarchy of the class with the largest DIT?’ vs. ‘What is the class with the largest DIT?’). There is a degradation in correctness between run 1 and run 2 (see Figure 8.4). This degradation indicates that the change in the questions led to more difficult questions. We assume that the change in the metrics-related questions also caused the large difference in Figure 8.3: in the first run the average correctness of the PoS group is slightly better, whereas in the second run the average correctness of the MVE group is better. Further differences between runs are addressed below.
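For clarity, the relative differences plotted in Figure 8.3 can be derived from the per-category means in Table 8.7 as sketched below; because the sketch starts from the rounded table entries, the percentages deviate slightly from the reported ones.

```python
# Minimal sketch: relative correctness difference per category (run 1),
# using the PoS group as baseline, from the rounded means in Table 8.7.
mve = {1: 0.64, 2: 0.89, 3: 0.79, 4: 0.93}
pos = {1: 0.62, 2: 0.78, 3: 0.76, 4: 0.95}

for cat in sorted(mve):
    rel_diff = (mve[cat] - pos[cat]) / pos[cat] * 100
    print(f"Category {cat}: {rel_diff:+.1f}%")  # e.g. Category 2: +14.1%
```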


8.3.4 Comparing the Results of both Runs

The experiment was conducted in two similar runs. In this section we compare the results of the two runs to analyze possible differences. The descriptive statistics in Table 8.4 show that for both treatment levels the dependent variables change over the two runs: the correctness increased in the second run and the time decreased in the second run. Using the Mann-Whitney test we tested whether the differences are significant. The results of the Mann-Whitney test are summarized in Table 8.8 and show that the difference for time is not significant, but the difference in correctness is significant. Therefore we discuss the possible causes for the difference in correctness in more detail.

We identify three potential causes for the differences in correctness (see Table 8.9):

• Different models. To counter learning effects we used different models in both runs. As described in Section 8.2 the models were very similar in size. Additionally the complexity and the level of abstraction of the models were also similar. The major difference between the models is the application domain: a car navigation system versus an insurance information system. We assume that the subjects have sufficient knowledge to understand both domains. Additionally we do not consider domain knowledge an important skill for performing the comprehension task. Therefore we conclude that it is unlikely that the different models caused the difference in correctness. This is supported by the analysis of the subjective evaluation as described below.

• Different room settings. The capacity of the rooms used for the first run was twice the number of subjects. Therefore the subjects were spread over the room and there was spare space between the subjects to prevent cheating by looking at the neighbors’ work. Unfortunately we were assigned smaller rooms in the second run, so most subjects were placed immediately next to each other. As a result, we cannot eliminate the different room settings as a potential cause. However, we observed the subjects and did not notice any cheating.

• Learning effects. To be able to compare the results of both runs the tasks were equivalent. Therefore there is potential for learning effects for the task. Additionally there is potential for learning effects with respect to tool knowledge, as the subjects had additional time between the runs to gain experience with the tools. Therefore we regard learning effects as the most likely cause of the correctness improvement. This should be studied in further replications of this experiment. The analysis of the subjective data as described below supports the assumption that learning effects caused the difference.


Table 8.8. Mann-Whitney test between runs

Dependent Variable | Tool | U      | p-value | Hypothesis
Time               | MVE  | 944.0  | .295    | not rejected
Time               | PoS  | 1086.0 | .936    | not rejected
Correctness        | MVE  | 401.0  | <.001   | rejected
Correctness        | PoS  | 568.0  | <.001   | rejected

Table 8.9. Differences in correctness between runs

Tool | Cat 1 | Cat 2 | Cat 3 | Cat 4
MVE  | 39.1% | 5.2%  | 22.1% | -3.8%
PoS  | 41.8% | 14.7% | 23.2% | -11.4%

8.3.5 Subjective Evaluation

In addition to the comprehension task the questionnaire contained subjective questions to enable a qualitative analysis of the models, the task and the tools. The results are summarized in Table 8.10. Each row corresponds to one question of the questionnaire’s subjective evaluation section. The first two columns contain the question’s internal identifier (ID) and a descriptive name. The questions were multiple-choice questions on a five-point Likert scale [129].

Figure 8.4. Relative differences in correctness between runs per category


The levels of the Likert scale are as follows: 1 = ‘very poor’, 2 = ‘poor’, 3 = ‘medium’, 4 = ‘good’ and 5 = ‘very good’. The table summarizes the answers for each question using the mean. For both runs the mean is given for both treatment levels (MVE and PoS). To find out whether there are statistically significant differences between the answers of the treatment groups we used the χ²-test; the results of the χ²-test are given in the table, and significant results are indicated using an asterisk (‘*’). Additionally we compared the answers between the two runs: the last two columns contain the χ² test statistic of the comparison between the two runs. Note that the multiple-choice questions about the tool were only asked in the first run, resulting in empty cells in the table. Instead, the second run contained preference questions about the tool. We discuss these results below.

Evaluating the Model

The three questions about the model addressed the quality, understandability and completeness of the model. The results are around ‘medium’ and reflect our own evaluation of the UML models. There were no significant differences between the evaluations of the two treatment groups. The subjects’ ratings were slightly higher in the second run, but the differences are not statistically significant. This supports the above-mentioned assumption that it is unlikely that the different models are a cause for the difference in correctness results.

Evaluating the Task

Regarding the task the subjects had to indicate how well they understood it, how difficult it was, how much they enjoyed it, and how motivated they were to perform well. It is important to have motivated subjects who understand the task well in order to obtain valid results from the experiment. For motivation and understandability the mean values range from ‘medium’ to ‘good’. For motivation there are no significant differences, i.e. we can exclude differences in motivation as a confounding factor.

However, for understandability and for perceived difficulty there is a significant difference in the first run: the control group PoS indicates lower values. As the task was the same for both treatment groups, we assume that the subjects included the understandability and difficulty of using the tool in their answers to this question. The understandability and difficulty results improved in the second run; for difficulty the difference was statistically significant. This supports our assumption stated above that the improved correctness in the second run is caused by a learning effect with respect to the tool and the task.

The level to which the subjects enjoyed the task is between ‘medium’ and ‘good’. In both runs the mean of the MVE group is higher than the control group’s mean. However, the difference is not statistically significant. A good level of enjoyment (combined with an acceptable level of perceived difficulty) is an important factor for successful adoption of a new technology in practice [127].


Table 8.10. Subjective evaluation (* indicates significant difference)

                                Run 1                 Run 2                 betw. Runs
ID  Name                    MVE   PoS   χ²        MVE   PoS   χ²        MVE    PoS
Model
M1  Quality                 3.09  3.20  1.37      3.41  3.39  0.74      5.71   3.78
M2  Understandability       3.34  3.10  2.81      3.49  3.28  2.27      2.44   1.91
M3  Completeness            2.37  2.55  1.56      3.10  2.95  2.96      2.51   7.32
Task
S1  Understandability       3.98  3.41  *13.54    4.00  3.77  4.14      1.57   5.82
S2  Difficulty              3.65  3.24  *19.18    3.71  3.48  3.66      4.31   *11.34
S3  Enjoy                   3.49  3.08  6.29      3.40  3.07  3.21      2.42   2.22
S4  Motivation              3.89  3.80  4.39      3.63  3.59  4.01      5.78   4.01
Tool
T1  Model Comprehension     3.77  3.49  6.52
T2  Quality Eval.           3.68  3.31  7.42
T3  Metrics Analysis        3.96  3.76  1.51
T4  Navigate through Model  3.43  3.29  6.97
T5  Usability               3.43  3.50  2.42


Figure 8.5. Tool preference

Enjoyment in using the technology increases the intrinsic motivation to use it.

Evaluating the Tools

In the first run the subjects rated the expected suitability of the tool they used for four activities: comprehending a model, evaluating a model’s quality, conducting a metrics analysis, and navigating through a model. Additionally we asked them to rate the tool’s usability. The results in Table 8.10 show that both tools are rated between ‘medium’ and ‘good’ on average. There are no significant differences between the two treatment levels.

We complement this analysis with a direct comparison of both tools in the second run. As the subjects had used both tools after the second run, they were able to compare them. The subjects had to indicate which tool they prefer for the aforementioned activities. Figure 8.5 shows the results for tool preference. The vast majority of the subjects prefers MVE for all four activities. It is surprising that the subjects even prefer the usability of MVE, because it is a research prototype and usability had a low priority during its implementation.

Tool Usage

To find out to which extent the features (or views) of the tools were used for the task, we asked the subjects to indicate the extent of usage in the questionnaires. For this purpose the questionnaires contained multiple-choice questions with a five-point Likert scale ranging from ‘not used’ (1) to ‘extensively used’ (5).


Figure 8.6. Level of feature usage per tool

The mean values are summarized in Figure 8.6. The results show that the features of MVE were used to a degree above ‘neutral’. The low values for Poseidon indicate that mainly the regular navigation through the model was used. For SDMetrics mainly the basic features ‘Table View’ and ‘Sort-by’ were used.

The differences in degree of usage are not due to differences in preparation, as all features of all tools were demonstrated to an equal degree during the preparation of the experiment. Additionally the subjects had access to the tools’ user manuals during the preparation and execution of the experiment. However, we regard the results about tool usage only as rough indications, because during the experiment we observed that some subjects had difficulties mapping the feature names given in the question to the features in the tools. These difficulties are a potential bias for the results. For replications of this experiment we advise adding logging to the tools to obtain more precise and reliable data about how the tools were used.
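As a rough illustration of the recommended instrumentation, a replication could record timestamped feature events along the following lines; the log format and hook points are hypothetical, as neither tool currently provides such logging.

```python
# Minimal sketch of in-tool usage logging for replications; a real tool
# would call log_feature() from its UI event handlers.
import time

usage_log = []

def log_feature(feature_name):
    """Record a timestamped usage event for a tool feature."""
    usage_log.append((time.time(), feature_name))

log_feature("Table View")
log_feature("Sort-by")
print(len(usage_log), "feature events recorded")
```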

8.4 Threats to Validity

Conducting experiments involves threats to the validity of the results. In this section we discuss how we dealt with the potential threats to the validity of the reported experiment. We structure our discussion according to Wohlin et al. [196] into internal validity, external validity, construct validity and conclusion validity.


8.4.1 Internal Validity

Threats to internal validity can affect the independent variables of an experiment. A possible threat to internal validity is that the treatment groups behave differently because of a confounding factor such as differences in skills, experience or motivation. We used randomization to assign subjects to treatment groups to avoid differences between the treatment groups. There were no significant differences in the subjects’ background and motivation (see Sections 8.2.3 and 8.3.5).

The experiment was conducted in a controlled environment, i.e. an exam-like setting. Due to limited room capacity the risk of cheating was somewhat higher in the second run, but due to the precautions taken and the spot checks we regard the chance of biased results due to cheating as very small. Another possible threat to validity is that subjects used a different tool for the analysis and therefore did not adhere to the treatment. We eliminated this threat by providing the subjects with the model in the file format of the tool belonging to their treatment group. Additionally we conducted spot checks during the experiment and found no subject deviating from their treatment.

8.4.2 External Validity

Threats to external validity reduce the generalizability of the results to industrial practice. The experiment is designed to render a realistic situation. We use students as subjects, which might be a threat to external validity. However, Kitchenham et al. [97] state that students can be used as subjects. All students in this experiment have relevant experience, hold a BSc degree, and have received their BSc from different universities and different countries. The objects in this experiment are UML models that describe systems from realistic application domains. The consistent results of both runs support that the observed effect can be expected in models of different application domains. The models are larger than ‘toy models’, but in practice models of much larger size are used. However, we assume that support for comprehension is even more important for larger models. Therefore we assume that the observed effect for larger models will be at least as large as observed in this experiment.

A possible issue is the representativeness of the experimental task. To the best of our knowledge there exist no published research results describing comprehension tasks for UML models. Pacione et al. [158] conducted a literature survey to compile a list of general comprehension activities. The comprehension questions of our experiment cover different conceptual categories. These categories can be related to the activities in Pacione’s list. Some of the activities listed by Pacione are not covered in this experiment; the uncovered activities are mainly related to information that is normally not present in UML models in practice, such as runtime information.


8.4.3 Construct Validity

Construct validity is the degree to which the variables measure the concepts they are intended to measure. The dependent variables in this experiment are comprehension correctness and time. Comprehension correctness is difficult to measure, as it is closely related to the representativeness of the comprehension task as discussed above. However, the measurement of correctness is objective and repeatable, as it is based on a multiple-choice questionnaire which is provided in the replication package [110]. The time is measured by logging the actual time in predefined spots on the questionnaire as described in Section 8.2.7, and we do not expect the results for time to be biased.

8.4.4 Conclusion Validity

Conclusion validity is concerned with the relation between the treatment and the outcome. The statistical analysis of the results is reliable, as we used robust statistical methods. The results are statistically significant with very low p-values.

We minimized possible comprehension problems by testing the experiment material in a pilot experiment and improving it according to the observed issues. The course instructors were available to the students for clarification questions. The results of the post-test questionnaire show that the task was well understood. Hence, we conclude that there were no comprehension problems threatening the validity of the reported experiment.

8.5 Conclusions

In this experiment we have validated that task-oriented views improve the comprehension of UML models. We compared the performance on comprehension tasks supported by the views implemented in our tool MetricView Evolution (MVE) with the performance on comprehension tasks supported by a traditional UML CASE tool (Poseidon) and a metrics tool (SDMetrics). The results show that the effort needed is reduced by approximately 20% and the correctness of the comprehension task is increased by approximately 4.5%. These results are statistically significant. Additional analysis reveals that the participants prefer our MVE tool over traditional tooling. This indicates that the adoption threshold for our MVE views, and hence for task-oriented views, is low.

Task-oriented views are simple and straightforward. They could be further improved by established techniques from information visualization, such as mechanisms for filtering and abstraction, and by using more sophisticated layout algorithms. As our simple implementation already led to considerable improvements, we conclude that improving the comprehension of UML models is a promising direction for future research. As software development and maintenance are becoming more model-centric, we should also shift our research efforts into this promising direction.

In future work task-oriented views should be integrated into model editors instead of pure model viewers. Using the enhanced editors, a similar experiment should be replicated where the set of tasks also includes change tasks instead of pure comprehension tasks.


Chapter 9

Conclusions

Software engineering is becoming more and more model-centric. The de facto standard for modeling in software engineering is the UML. Characteristics of the UML are its lack of a formal semantics, its multi-diagram notation, and its huge complexity. These characteristics cause quality problems in UML modeling. The topic of this thesis is defining, assessing, and improving the quality of UML modeling. To define the scope of this thesis, we have posed the overall research question:

• RQ_overall: How can the quality of UML models be improved, in particular with respect to defects, miscommunication and complexity?

We have decomposed the overall research question into five research questions. In this chapter we revisit the five research questions stated in the introduction. We draw conclusions based on the presented studies, summarize our contributions, and give directions for future work related to the topics of this thesis.

9.1 RQ1: How can the quality of UML models be decomposed into quality notions for particular purposes of modeling?

The central topic of this thesis is the quality of UML models. Therefore, our first step was to define a notion of quality of UML models to be applied in this thesis (Chapter 3). Based on existing literature we decomposed quality into different notions: system quality, semantic quality, pragmatic quality, social quality, communicative quality, and correspondence. We used these notions to classify the studies addressing the remaining quality notions. Additionally, we proposed a quality model for the UML.



9.1.1 Contributions

We proposed a quality model that is specific to UML models. Different from existing quality models, our quality model distinguishes between quality attributes of the system described by the model and quality attributes of the UML model itself (as a description of a system). The quality model consists of four levels: at the most abstract level the model’s use, then its purpose, its characteristics, and metrics. The goal of the quality model is to guide engineers in selecting metrics and rules based on the purpose of the UML models whose quality is to be assessed. An additional contribution is a guideline that relates the purposes of the quality model to phases of the software life-cycle. The goal of this guideline is to assist the engineer in choosing the purpose of modeling, and hence the metrics and rules, based on the particular phase.

9.1.2 Future Work

The proposed quality model is based on a literature study and discussions with industrial experts during our case studies. A possible direction for future research related to the quality model is to go further in its validation. The relation between the levels of the quality model, and the correctness and completeness of the elements of each level, should be validated.

9.2 RQ2: What is the quality of industrial UML models, in particular with respect to defect containment?

To study the quality of UML models in practice, we conducted a series of industrial case studies (Chapter 4). The purpose of these studies was to identify typical quality problems in UML modeling. This identification is a basis for focusing on realistic quality problems in the remaining research questions. The focus of the case studies was on syntactic quality, i.e. defect containment. We used a UML analysis tool for defect detection.

9.2.1 Contributions

We studied the quality of large-scale UML modeling in industrial practice, whereas most previous studies address the quality of small models and student projects. The results of the case studies reveal that the number of defects in practice is large. A particular contribution is the quantification of the occurrence of defect types. Prevention techniques such as modeling conventions, training, and tooling can therefore be adjusted to focus on common defect types, such as multiple definitions of classes or use cases under the same name, large numbers of classes and interfaces without methods, messages in sequence diagrams that do not correspond to methods in class diagrams, or messages without names. An additional contribution is the observation that there can be large differences between the defect occurrence frequencies of models created by different teams and even by different individuals within the same team.

9.2.2 Future Work

To analyze in more detail which factors influence the occurrence of defects, further case studies should be conducted. More detailed factors that should be taken into account include, for example, tooling, expertise and skill of the developer, schedule pressure, and development process. Additionally, the cause of the variation in defect occurrence between models of different developers should be studied in more detail. In the case studies we mainly addressed syntactic quality; other quality notions, such as semantic quality and social quality, should be studied as well.

9.3 RQ3: What is the effect of defects in UML models, in particular with respect to detection and misinterpretation?

We conducted an experiment to study whether defects in UML models are detected when developers use the models as a basis for implementation (Chapter 5). Additionally we studied to which degree UML defects cause misinterpretations amongst different readers. 111 students and 48 practitioners participated in the experiment. The results of the experiment are generalizable to industrial software development.

9.3.1 Contributions

The results of the experiment show that defects often remain undetected. Furthermore, the results show that undetected defects are indeed a problem, because they cause a variety of different interpretations of the model amongst different readers. Another contribution of the experiment is a classification of defect types regarding their likelihood of detection and their likelihood of misinterpretation. This classification can be used to prioritize defects according to their severity and to assign removal effort according to their priority. Additionally the experiment provides results showing that the presence of domain knowledge increases the probability of not detecting a defect. However, the results regarding domain knowledge are based on a small data set and are, hence, exploratory.

Page 158: Assessing and Improving the Quality of Modeling - Technische

140 Chapter 9 Conclusions

9.3.2 Future Work

The results of this experiment are statistically significant. However, the experiment is limited with respect to the set of defect types that are analyzed. More defect types should be analyzed in replications of this study; this would extend the classification of defect types. An important research direction is defect propagation. In this study we have observed that model defects lead to misinterpretations. However, further studies should reveal whether model defects in fact lead to implementation defects, i.e. whether defects propagate. We observed that the absence of domain knowledge led to an increased detection rate. A possible explanation is that developers with domain knowledge are more tempted to assume they know the correct interpretation and, hence, ignore the defect. However, a more in-depth investigation of the effect of domain knowledge on defect detection and model comprehension in general is needed.

9.4 RQ4: Can modeling conventions for developing UML models improve model quality?

We have shown that defects are common in UML models. In the case studies we observed that the UML is used differently by different developers. Additionally, the huge complexity of the UML is often criticized. Developers have a large degree of freedom in their use of the UML; this is often mentioned as a problem. In programming, coding conventions are an established technique to provide guidance to developers. In Chapter 6, we have proposed modeling conventions as a means to guide the developer in using the UML and to prevent defects. We conducted an exploratory experiment with 106 subjects to study the use of modeling conventions with and without tool support. The main research questions of the experiment addressed the effectiveness for defect prevention, i.e. syntactic quality, and the extra effort needed during model development.

9.4.1 Contributions

We provide empirical data about the effect of applying modeling conventions in a realistic environment. In the previous literature different categories of modeling conventions were proposed, but to the best of our knowledge no empirical validation has been published. The results of the experiment show that the defect density in UML models is reduced through the use of modeling conventions. However, the improvement is not statistically significant. Extra effort is needed to apply modeling conventions, especially when an analysis tool is used to monitor adherence to the conventions. Based on the experimental results and observations we identified factors that are expected to further improve the effectiveness of modeling conventions if they were controlled better. These factors include adherence to the modeling conventions, training and experience, and developer motivation. Motivation could be improved by rewarding adherence to the modeling conventions and by developer participation in the selection of modeling conventions.

9.4.2 Future Work

Our experiment was an exploratory study about modeling conventions. The results are promising, but further studies need to be conducted to build reliable knowledge and guidelines for applying modeling conventions. Of particular interest are studies addressing the identified factors, such as adherence, developer participation, training, and motivation, in more detail. In this thesis we have addressed the effect of modeling conventions on the syntactic quality of models. In addition, studies are needed that address the effect of modeling conventions on other quality notions. We have conducted an exploratory study focusing on semantic quality, communicative quality, and social quality [55].

9.5 RQ5: Can task-oriented views of UML models enhance the comprehension of the models?

UML models are used for a variety of software engineering tasks. We argue that some information needed by developers to fulfill these tasks is not provided efficiently by the existing UML diagrams. For example, inter-diagram relations and metrics data are not visible in existing diagrams.

9.5.1 Contributions

In Chapter 7 we have proposed six views on UML models to provide developers with the information that is required for particular tasks, e.g. comprehension of functionality or structure, or analysis of structural design quality. The views are: MetaView, ContextView, MetricView, UML-City-View, Quality Tree View, and Evolution View. The views are interactive, such that the developer can adjust the views according to his information needs. We implemented the views in our MetricView Evolution analysis tool.

We have conducted an experiment with 100 participants to validate the views with respect to model comprehension and analysis tasks (Chapter 8). The results of the experiment show that the correctness of model comprehension was improved by 4.5% and the effort needed was reduced by 20% compared to the views provided by existing tools. Hence, the views improve the productivity of model comprehension. We expect that the improved comprehension correctness also improves the quality of development and maintenance tasks related to models.


9.5.2 Future Work

The design of the views is based on a simple framework for aligning tasks with views. Still, the views are beneficial, as shown through the validation. We expect that the views could be improved, and other useful views could be designed, when there is a better understanding of the relation between tasks, the required information, and its visual representation. Further studies should improve the knowledge of this relation.

Currently the views are implemented in MetricView Evolution, which is a stand-alone analysis tool. To increase the benefits of the views for model development and maintenance tasks, the views should be integrated into a modeling tool. Additionally, further validation studies are needed that address a broader set of tasks, including model development and maintenance tasks. Besides validating tasks in isolation, further validation studies should investigate whether the use of interactive, task-oriented views leads to better quality of models and maintenance activities.


Bibliography

[1] The Precise UML Group. http://www.cs.york.ac.uk/puml/.

[2] SPSS, version 12.0. http://www.spss.com.

[3] Gentleware AG. Poseidon for UML, community edition. http://www.gentleware.com.

[4] Ritu Agarwal and Atish P. Sinha. Object-oriented modeling with UML: A study of developers’ perceptions. Communications of the ACM, 46(9):248–257, September 2003.

[5] Alan Agresti and Barbara Finlay. Statistical methods for the social sciences. Prentice Hall, 3rd edition, 1997.

[6] Scott W. Ambler. The Elements of UML 2.0 Style. Cambridge University Press, 2005.

[7] Bente Anda, Kai Hansen, Ingolf Gullesen, and Hanne Kristin Thorsen. Experiences from introducing UML-based development in a large safety-critical project. Empirical Software Engineering, 11(4):555–581, 2006.

[8] ArgoUML. Open source UML tool. http://argouml.tigris.org/.

[9] Erik Arisholm, Lionel C. Briand, Siw Elisabeth Hove, and Yvan Labiche. The impact of UML documentation on software maintenance: An experimental evaluation. Technical Report 6, June 2006.

[10] Katalin Balla, Theo Bemelmans, Rob Kusters, and Jos Trienekens. Quality through managed improvement and measurement (QMIM): Towards a phased development and implementation of a quality management system for a software company. Software Quality Journal, 9(3):177–193, November 2001.

[11] Simonetta Balsamo, Antinisca Di Marco, Paola Inverardi, and Marta Simeoni. Model-based performance prediction in software development: A survey. IEEE Transactions on Software Engineering, 30(5):295–310, 2004.

[12] Victor R. Basili. The role of experimentation in software engineering: past, current, and future. In ICSE ’96: Proceedings of the 18th International Conference on Software Engineering, pages 442–449, Washington, DC, USA, 1996. IEEE Computer Society.

[13] Victor R. Basili. Evolving and packaging reading technologies. Journal of Systems and Software, 38(1):3–12, 1997.


[14] Victor R. Basili, Gianluigi Caldiera, and H. Dieter Rombach. The goal question metric paradigm. In Encyclopedia of Software Engineering, volume 2, pages 528–532. John Wiley and Sons, Inc., 1994.

[15] Victor R. Basili, Scott Green, Oliver Laitenberger, Filippo Lanubile, Forrest Shull, Sivert Sørumgård, and Marvin Zelkowitz. The empirical investigation of perspective-based reading. Empirical Software Engineering, 1(2):133–144, 1996.

[16] Brian Berenbach. The evaluation of large, complex UML analysis and design models. In ICSE ’04: Proceedings of the 26th International Conference on Software Engineering, pages 232–241, Washington, DC, USA, 2004. IEEE Computer Society.

[17] Tor Bernhardsen. Geographic Information Systems: an Introduction. Wiley, 3rd edition, 2002.

[18] Jean Bézivin and Pierre-Alain Muller. UML: the birth and rise of a standard modeling notation. In Jean Bézivin and Pierre-Alain Muller, editors, Proceedings of the First International Workshop on the Unified Modeling Language (UML ‘98), volume 1618, pages 1–8. Springer, 1998.

[19] James M. Bieman, Roger Alexander, P. Willard Munger III, and Erin Meunier. Software design quality: Style and substance. In Proceedings of the Workshop on Software Quality (WoSQ). ACM, 2001.

[20] Barry W. Boehm. Software Engineering Economics. Prentice Hall, London, 1981.

[21] Barry W. Boehm, John R. Brown, Hans Kaspar, Myron Lipow, Gordon J. Macleod, and Michael J. Merrit. Characteristics of Software Quality, volume 1 of TRW Series of Software Technology. North-Holland Publishing Company, Amsterdam, 1978.

[22] Grady Booch. Object-Oriented Design with Applications. Benjamin-Cummings, Redwood City, California, 1991.

[23] Borland. Together. http://www.borland.com/us/products/together.

[24] Mark G. J. v. d. Brand, Paul Klint, and Chris Verhoef. Reverse engineering and system renovation – an annotated bibliography. Software Engineering Notes, 22(1):57–68, January 1997.

[25] Lionel C. Briand, Erik Arisholm, Steve Counsell, Frank Houdek, and Pascale Thévenod-Fosse. Empirical studies of object-oriented artifacts, methods, and processes: State of the art and future directions. Empirical Software Engineering, 4(4):387–404, 1999.

[26] Lionel C. Briand, Christian Bunse, and John William Daly. A controlled experiment for evaluating quality guidelines on the maintainability of object-oriented designs. IEEE Transactions on Software Engineering, 27(6):513–530, June 2001.

[27] Lionel C. Briand, Premkumar T. Devanbu, and Walcelio L. Melo. An investigation into coupling measures for C++. In Proceedings of the 19th International Conference on Software Engineering (ICSE’97), pages 412–421, 1997.

[28] Lionel C. Briand and Yvan Labiche. A UML-based approach to system testing. Software and Systems Modeling, 1(1):10–42, September 2002.

[29] Lionel C. Briand, Yvan Labiche, and Johanne Leduc. Toward the reverse engineering of UML sequence diagrams for distributed Java software. IEEE Transactions on Software Engineering, 32(9):642–663, September 2006.


[30] Lionel C. Briand, Yvan Labiche, L. O’Sullivan, and Michal M. Sowka. Automated impact analysis of UML models. Journal of Systems and Software, 79(3):339–352, March 2006.

[31] Lionel C. Briand, Yvan Labiche, Massimiliano Di Penta, and Han Yan-Bondoc. An experimental investigation of formality in UML-based development. IEEE Transactions on Software Engineering, 31(10):833–849, October 2005.

[32] Andrew Brooks, John Daly, John Miller, Mark Roper, and Murray Wood. Replication of experimental results in software engineering. Technical Report EFoCS-17-95, Livingstone Tower, Richmond Street, Glasgow G1 1XH, UK, 1995.

[33] Frederick P. Brooks. No silver bullet: Essence and accidents of software engineering. IEEE Computer, 20(4):10–19, 1987.

[34] Alan Brown. An introduction to model driven architecture. Part I: MDA and today’s systems. The Rational Edge, pages 13–17, February 2004.

[35] Bernd Brügge and Allen H. Dutoit. Object-Oriented Software Engineering: Using UML, Patterns and Java, Second Edition. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2003.

[36] Laura A. Campbell, Betty H. C. Cheng, William E. McCumber, and Kurt Stirewalt. Automatically detecting and visualising errors in UML diagrams. Requirements Engineering, 7:264–287, December 2002.

[37] Giovanni Cantone, Luca Colasanti, Zeiad A. Abdulnabi, Anna Lomartire, and Giuseppe Calavaro. Evaluating Checklist-Based and Use-Case Driven Reading Techniques as Applied to Software Analysis and Design UML Artifacts, volume 2765 of LNCS, pages 142–165. Springer Verlag, 2003.

[38] Massimo Carbone and Giuseppe Santucci. Fast & serious: a UML based metric for effort estimation. In Houari A. Sahraoui, Coral Calero, Michele Lanza, Geert Poels, and Fernando Brito e Abreu, editors, Proceedings of the 6th ECOOP Workshop on Quantitative Approaches in Object-Oriented Software Engineering (QAOOSE’02), 2002.

[39] Jeffrey Carver, Letizia Jaccheri, Sandro Morasca, and Forrest Shull. Issues in using students in empirical studies in software engineering education. In Proceedings of the Ninth International Software Metrics Symposium, pages 239–249, September 2003.

[40] Betty H. C. Cheng, Ryan Stephenson, and Brian Berenbach. Lessons learned from automated analysis of industrial UML class models (an experience report). In Lionel C. Briand and Clay Williams, editors, Proceedings of the 8th International Conference on Model Driven Engineering Languages and Systems (MoDELS 2005), Montego Bay, Jamaica, volume 3713 of LNCS, pages 324–338. Springer, 2005.

[41] Shyam R. Chidamber and Chris F. Kemerer. A metrics suite for object-oriented design. IEEE Transactions on Software Engineering, 20(6):476–493, 1994.

[42] Elliott J. Chikofsky and James H. Cross II. Reverse engineering and design recovery: A taxonomy. Software, 7(1):13–17, January 1990.

[43] Ram Chillarege, Inderpal S. Bhandari, Jarir K. Chaar, Michael J. Halliday, Diane S. Moebus, Bonnie K. Ray, and Man-Yuen Wong. Orthogonal defect classification – a concept for in-process measurements. IEEE Transactions on Software Engineering, 18(11):943–956, November 1992.


[44] Joanna Chimiak-Opoka and Chris Lenz. Use of OCL in a model assessment framework: An experience report. In Proceedings of the OCLApps workshop 2006, pages 53–67, Genova, October 2006.

[45] Peter Coad and Edward Yourdon. Object Oriented Design. Prentice-Hall, first edition, 1991.

[46] Jacob Cohen. A power primer. Psychological Bulletin, 112(1):155–159, July 1992.

[47] Jim Conallen. Modeling web application architectures with UML. Communications of the ACM, 42(10):63–70, 1999.

[48] Reidar Conradi, Parastoo Mohagheghi, Tayyaba Arif, Lars Christian Hedge, Geir Arne Bunde, and Anders Pedersen. Object-oriented reading techniques for inspection of UML models – an industrial experiment. In Proceedings of the European Conference on Object-Oriented Programming ECOOP’03, volume 2749 of LNCS, pages 483–501. Springer, July 2003.

[49] Vittorio Cortellessa, Harshinder Singh, and Bojan Cukic. Early reliability assessment of UML based software models. In WOSP ’02: Proceedings of the 3rd International Workshop on Software and Performance, pages 302–309, New York, NY, USA, 2002. ACM Press.

[50] Ignatios Deligiannis, Ioannis Stamelos, Lefteris Angelis, Manos Roumeliotis, and Martin Shepperd. A controlled experiment investigation of an object-oriented design heuristic for maintainability. Journal of Systems and Software, 72(2):129–143, 2004.

[51] Serge Demeyer, Stéphane Ducasse, and Sander Tichelaar. Why unified is not universal. UML shortcomings for coping with round-trip engineering. In Bernhard Rumpe, editor, Proceedings of the 2nd International Conference on the Unified Modeling Language (UML’99), volume 1723 of LNCS, Kaiserslautern, Germany, October 1999. Springer Verlag.

[52] Edsger Wybe Dijkstra. On the role of scientific thought (EWD447). In Selected Writings on Computing: A Personal Perspective, pages 60–66. Springer-Verlag, 1982.

[53] Brian Dobing and Jeffrey Parsons. Current practices in the use of UML. In Proceedings of the 1st Workshop on the Best Practices of UML, LNCS. Springer Verlag, 2005.

[54] Dov Dori. Why significant UML change is unlikely. Communications of the ACM, 45(11):82–85, 2002.

[55] Bart Du Bois, Christian F. J. Lange, Serge Demeyer, and Michel R. V. Chaudron. A qualitative investigation of UML modeling conventions. In Thomas Kühne, editor, Models in Software Engineering – Workshops and Symposia at MoDELS 2006. Reports and Revised Selected Papers, volume 4364 of LNCS, pages 91–100, Heidelberg, January 2007. Springer.

[56] Steve Easterbrook and Bashar Nuseibeh. Using ViewPoints for inconsistency management. BCS/IEE Software Engineering Journal, pages 31–43, January 1996.

[57] Alexander Egyed. Instant consistency checking for the UML. In Proceedings of the 28th International Conference on Software Engineering (ICSE‘06), pages 381–390. ACM, May 2006.


[58] Holger Eichelberger. Aesthetics of class diagrams. In Proceedings of the First IEEE International Workshop on Visualizing Software for Understanding and Analysis (VISSOFT 2002), pages 23–31. IEEE CS Press, 2002.

[59] Holger Eichelberger. Aesthetics and Automatic Layout of UML Class Diagrams. Ph.D. thesis, Fakultät für Mathematik und Informatik, Würzburg University, Germany, July 2005.

[60] Markus Eiglsperger. Automatic Layout of UML Class Diagrams: A Topology-Shape-Metrics Approach. PhD thesis, Universität Tübingen, Germany, November 2003.

[61] Technische Universiteit Eindhoven. Pollweb system. http://ai5.wtb.tue.nl/enquetes/pollweb_info.php.

[62] Khaled El Emam and Isabelle Wieczorek. The repeatability of code defect classifications. In Proceedings of the 9th International Symposium on Software Reliability Engineering, pages 322–333, 1998.

[63] Gregor Engels, Reiko Heckel, and Stefan Sauer. UML – a universal modeling language? In Proceedings of the 21st International Conference on Application and Theory of Petri Nets 2000, ICATPN 2000, volume 1825, pages 24–38. Springer, 2000.

[64] Andy Evans, Jean-Michel Bruel, Robert France, Kevin Lano, and Bernhard Rumpe. Making UML precise. In Luis Andrade, Ana Moreira, Akash Deshpande, and Stuart Kent, editors, Proceedings of the OOPSLA’98 Workshop on Formalizing UML. Why? How?, 1998.

[65] Michael E. Fagan. Design and code inspections to reduce errors in program development. IBM Systems Journal, 15(3):182–211, 1976.

[66] Michael E. Fagan. Advances in software inspections. IEEE Transactions on Software Engineering, 12(7):744–751, 1986.

[67] Norman E. Fenton and Shari Lawrence Pfleeger. Software Metrics: A Rigorous and Practical Approach. Thomson Computer Press, second edition, 1996.

[68] Anthony C. W. Finkelstein, Dov M. Gabbay, Anthony Hunter, Jeff Kramer, and Bashar Nuseibeh. Inconsistency handling in multi-perspective specifications. IEEE Transactions on Software Engineering, 20(8):569–578, August 1994.

[69] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley Co., Inc., November 1999.

[70] Martin Fowler and Kendall Scott. UML Distilled: A Brief Guide to the Standard Object Modeling Language. Addison Wesley, Boston, 3rd edition, 2004.

[71] Robert B. France, Andy Evans, Kevin Lano, and Bernhard Rumpe. The UML as a formal modeling notation. Computer Standards & Interfaces, 19(7):325–334, 1998.

[72] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison Wesley, 1994.

[73] David Garvin. What does ‘product quality’ really mean? Sloan Management Review, 26(1):25–45, 1984.


[74] Marcela Genero, Mario Piattini, Esperanza Manso, and Giovanni Cantone. Building UML class diagram maintainability prediction models based on early metrics. In Proceedings of the Ninth International Software Metrics Symposium (METRICS 2003), pages 263–275. IEEE, 2003.

[75] Carlo Ghezzi, Mehdi Jazayeri, and Dino Mandrioli. Fundamentals of Software Engineering. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1991.

[76] Carlo Ghezzi and Bashar Nuseibeh. Special issue on managing inconsistency in software development (1). IEEE Transactions on Software Engineering, 24(11), November 1998.

[77] Carlo Ghezzi and Bashar Nuseibeh. Special issue on managing inconsistency in software development (2). IEEE Transactions on Software Engineering, 24(11), November 1999.

[78] Tom Gilb and Dorothy Graham. Software Inspection. Addison Wesley Publishing Co., 1993.

[79] Interactive Objects Software GmbH. ArcStyler. http://www.interactive-objects.com/products/arcstyler.

[80] Hassan Gomaa. Designing Software Product Lines with UML: From Use Cases to Pattern-Based Software Architectures. Addison Wesley, first edition, 2004.

[81] Bas Graaf, Sven Weber, and Arie van Deursen. Migrating supervisory control architectures using model transformations. In 10th European Conference on Software Maintenance and Reengineering (CSMR 2006), pages 153–164. IEEE Computer Society, March 2006.

[82] Bas Graaf, Sven Weber, and Arie van Deursen. Model-driven migration of supervisory machine control architectures. Journal of Systems and Software, 2007. To appear.

[83] Martin Grossman, Jay E. Jackson, and Richard V. McCarthy. Does UML make the grade? Insights from the software development community. Information and Software Technology, 47(11):383–397, November 2005.

[84] Yann-Gaël Guéhéneuc. TAUPE: Towards understanding program comprehension. In Proceedings of the 16th IBM Centers for Advanced Studies Conference (CASCON), October 2006.

[85] Kai T. Hansen. Project visualization for software. IEEE Software, 23(4):84–92, July 2006.

[86] Sallie M. Henry and Dennis G. Kafura. Software structure metrics based on information flow. IEEE Transactions on Software Engineering, 7(5):510–518, 1981.

[87] IBM. Rational software. http://www.ibm.com/software/rational/.

[88] IEEE P1471-2000. IEEE recommended practice for architectural description of software-intensive systems, 2000.

[89] ISO/IEC FCD 9126-1.2. Information Technology – Software Product Quality, part 1: quality model edition, 1998.

[90] Ivar Hjalmar Jacobson, Magnus Christerson, Patrik Jonsson, and Gunnar Övergaard. Object-Oriented Software Engineering: A Use Case Driven Approach. Addison-Wesley, 1992.


[91] Stephen H. Kan. Metrics and Models in Software Quality Engineering. Addison Wesley Professional, 2nd edition, September 2002.

[92] Diane Kelly and Terry Shepard. A case study in the use of defect classification in inspections. In Proceedings of the IBM Centre of Advanced Studies Conference 2001, pages 26–39. IBM, 2001.

[93] Mik Kersten and Gail Murphy. Mylar: A degree-of-interest model for IDEs. In Proceedings of the 4th International Conference on Aspect-Oriented Software Development (AOSD 2005), pages 159–168, 2005.

[94] Khashayar Khosravi and Yann-Gaël Guéhéneuc. Open issues with quality models. In Fernando Brito e Abreu, Coral Calero, Michele Lanza, Geert Poels, and Houari A. Sahraoui, editors, Proceedings of the 9th ECOOP Workshop on Quantitative Approaches in Object-Oriented Software Engineering. Springer-Verlag, July 2005.

[95] Barbara A. Kitchenham, Tore Dybå, and Magne Jørgensen. Evidence-based software engineering. In ICSE '04: Proceedings of the 26th International Conference on Software Engineering, pages 273–281, Washington, DC, USA, 2004. IEEE Computer Society.

[96] Barbara A. Kitchenham and Shari Lawrence Pfleeger. Software quality: The elusive target. IEEE Software, 13(1):12–21, January 1996.

[97] Barbara A. Kitchenham, Shari Lawrence Pfleeger, Lesley M. Pickard, Peter W. Jones, David C. Hoaglin, Khaled El Emam, and Jarrett Rosenberg. Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering, 28(8):721–734, August 2002.

[98] Ralf Kollmann and Martin Gogolla. Metric-based selective representation of UML diagrams. In Tibor Gyimóthy and Fernando Brito e Abreu, editors, Proceedings of the 6th European Conference on Software Maintenance and Reengineering (CSMR 2002), pages 89–98, Los Alamitos, 2002. IEEE.

[99] Henk Koning, Claire Dormann, and Hans van Vliet. Practical guidelines for the readability of IT-architecture diagrams. In Proceedings of SIGDOC'02, pages 90–99, Toronto, Canada, 2002. ACM.

[100] Elena Korshunova, Marija Petkovic, Mark G. J. van den Brand, and MohammadReza Mousavi. CPP2XMI: Reverse engineering of UML class, sequence, and activity diagrams from C++ source code. In Proceedings of the 13th Working Conference on Reverse Engineering (WCRE 2006), pages 297–298, Washington, DC, USA, 2006. IEEE Computer Society.

[101] John Krogstie. Conceptual Modeling for Computerized Information Systems Support in Organizations. PhD thesis, Norwegian Institute of Technology, University of Trondheim, Trondheim, Norway, December 1995.

[102] Philippe Kruchten. The 4+1 view model of architecture. IEEE Software, 12(6):42–50, 1995.

[103] Philippe Kruchten. The Rational Unified Process, An Introduction (Third Edition). Addison-Wesley, 2003.

[104] Jochen Malte Küster. Consistency Management of Object-Oriented Behavioral Models. PhD thesis, Universität Paderborn, Paderborn, March 2004.


[105] Ludwik Kuzniarz, Zbigniew Huzar, Gianna Reggio, Jean-Louis Sourrouille, and Miroslaw Staron. 2nd Workshop on Consistency Problems in UML-based Software Development at the UML 2003. Blekinge Institute of Technology, 2003.

[106] Ludwik Kuzniarz and Miroslaw Staron. Inconsistencies in student designs. In Proceedings of the 2nd Workshop on Consistency Problems in UML-based Software Development, pages 9–17, 2003.

[107] Ludwik Kuzniarz, Miroslaw Staron, and Claes Wohlin. An empirical study on using stereotypes to improve understanding of UML models. In Proceedings of the 12th IEEE International Workshop on Program Comprehension (IWPC'04), pages 14–23. IEEE CS Press, 2004.

[108] Oliver Laitenberger, Colin Atkinson, Maud Schlich, and Khaled El Emam. An experimental comparison of reading techniques for defect detection in UML design documents. Technical Report NRC/ERB-1069, National Research Council Canada (NRC), Ottawa, Canada, December 1999.

[109] Oliver Laitenberger and Jean-Marc DeBaud. An encompassing life cycle centric survey of software inspection. Journal of Systems and Software, 50(1):5–31, 2000.

[110] Christian F. J. Lange. Replication packages of the experiments. http://www.win.tue.nl/~clange.

[111] Christian F. J. Lange. Empirical investigations in software architecture completeness. MSc thesis, Technische Universiteit Eindhoven, September 2003. No. 969.

[112] Christian F. J. Lange and Michel R. V. Chaudron. An empirical assessment of completeness in UML designs. In Proceedings of the 8th International Conference on Empirical Assessment in Software Engineering (EASE'04), pages 111–121, May 2004.

[113] Christian F. J. Lange and Michel R. V. Chaudron. Combining metrics data and the structure of UML models using GIS visualization approaches. In Proceedings of the IEEE International Conference on Information Technology 2005, volume 2, pages 322–326, April 2005.

[114] Christian F. J. Lange and Michel R. V. Chaudron. Managing model quality in UML-based software development. In Proceedings of the 13th IEEE International Workshop on Software Technology and Engineering Practice (STEP '05), 2005.

[115] Christian F. J. Lange and Michel R. V. Chaudron. Effects of defects in UML models – an experimental investigation. In Proceedings of the 28th International Conference on Software Engineering (ICSE'06), pages 401–411. ACM, May 2006.

[116] Christian F. J. Lange and Michel R. V. Chaudron. Interactive views to improve the comprehension of UML models – an experimental validation. In Proceedings of the 15th IEEE International Conference on Program Comprehension (ICPC 2007), pages 221–230. IEEE Computer Society, June 2007.

[117] Christian F. J. Lange, Michel R. V. Chaudron, and Johan Muskens. In practice: UML software architecture and design description. IEEE Software, 23(2):40–46, March 2006.


[118] Christian F. J. Lange, Bart DuBois, Michel R. V. Chaudron, and Serge Demeyer. An experimental investigation of UML modeling conventions. In Oscar Nierstrasz, Jon Whittle, David Harel, and Gianna Reggio, editors, Proceedings of the 9th International Conference on Model Driven Engineering Languages and Systems (MoDELS 2006), LNCS 4199, pages 27–41, Heidelberg, October 2006. Springer.

[119] Christian F. J. Lange, Martijn A. M. Wijns, and Michel R. V. Chaudron. MetricViewEvolution: UML-based views for monitoring model evolution and quality. In Proceedings of the 11th European Conference on Software Maintenance and Reengineering (CSMR 2007), 2007.

[120] Christian F. J. Lange, Martijn A. M. Wijns, and Michel R. V. Chaudron. A visualization framework for task-oriented modeling using UML. In Proceedings of the 40th Hawaii International Conference on System Sciences (HICSS). IEEE CS Press, January 2007.

[121] Christian F. J. Lange. Model size matters. In Thomas Kühne, editor, Models in Software Engineering – Workshops and Symposia at MoDELS 2006. Reports and Revised Selected Papers, volume 4364 of LNCS, pages 211–216, Heidelberg, 2007. Springer.

[122] Guillaume Langelier and Houari A. Sahraoui. Animation coherence in representing software evolution. In Proceedings of the 10th ECOOP Workshop on Quantitative Approaches in Object-Oriented Software Engineering (QAOOSE'06), pages 41–50, 2006.

[123] Guillaume Langelier, Houari A. Sahraoui, and Pierre Poulin. Visualization-based analysis of quality for large-scale software systems. In ASE '05: Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, pages 214–223, 2005.

[124] Michele Lanza and Stéphane Ducasse. Polymetric views – a lightweight visual approach to reverse engineering. IEEE Transactions on Software Engineering, 29(9):782–795, 2003.

[125] Michele Lanza, Stéphane Ducasse, Harald Gall, and Martin Pinzger. CodeCrawler: an information visualization tool for program comprehension. In Proceedings of the 27th International Conference on Software Engineering (ICSE'05), pages 672–673, 2005.

[126] Craig Larman. Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and Iterative Development (3rd Edition). Prentice-Hall, Inc., 2004.

[127] Diane Lending and Norman L. Chervany. The use of CASE tools. In SIGCPR '98: Proceedings of the 1998 ACM SIGCPR Conference on Computer Personnel Research, pages 49–58, New York, NY, USA, 1998. ACM Press.

[128] Felix Leung and Narasimha Bolloju. Analyzing the quality of domain models developed by novice systems analysts. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05), page 188.2, Washington, DC, USA, 2005. IEEE Computer Society.

[129] Rensis A. Likert. A technique for the measurement of attitudes. Archives of Psychology, (No. 140), 1932.


[130] Odd Ivar Lindland, Guttorm Sindre, and Arne Sølvberg. Understanding quality in conceptual modeling. IEEE Software, 11(2):42–49, March 1994.

[131] WenQian Liu, Steve M. Easterbrook, and John Mylopoulos. Rule-based detection of inconsistency in UML models. In L. Kuzniarz et al., editors, Proceedings of the Workshop on Consistency Problems in the UML, 2002.

[132] Mark Lorenz and Jeff Kidd. Object Oriented Software Metrics. Prentice Hall, 1994.

[133] Neil MacKinnon and Steve Murphy. Designing UML diagrams for technical documentation. In Proceedings of SIGDOC'03, pages 105–112, San Francisco, CA, 2003. ACM.

[134] Jonathan I. Maletic, Andrian Marcus, and Michael L. Collard. A task oriented view of software visualization. In Proceedings of the IEEE Workshop on Visualizing Software for Understanding and Analysis (VISSOFT 2002), pages 32–40. IEEE Computer Society, 2002.

[135] Thomas J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308–320, 1976.

[136] James A. McCall. Quality Factors in Encyclopedia of Software Engineering. Wiley-Interscience, second edition, 2001.

[137] James A. McCall, Paul K. Richards, and Gene F. Walters. Factors in Software Quality, volume 1–3 of AD/A-049-015/055. Springfield, 1977.

[138] Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12:153–157, 1947.

[139] Meerling. Methoden en technieken van psychologisch onderzoek, volume 2. Boom, Meppel, The Netherlands, 4th edition, 1989. In Dutch.

[140] Tom Mens, Ragnhild van der Straeten, and Jean-François Warny. Graph-based tool support to improve model quality. In Proceedings of the First Workshop on Quality in Modeling (QIM), co-located with MoDELS 2006, pages 47–62. Department of Mathematics and Computer Science, Eindhoven University of Technology, 2006.

[141] James Miller. Statistical significance testing – a panacea for software technology experiments? Journal of Systems and Software, 73:183–192, 2004.

[142] David Miranda, Marcela Genero, and Mario Piattini. Empirical validation of metrics for UML statechart diagrams. In Proceedings of the Fifth International Conference on Enterprise Information Systems (ICEIS'03), pages 87–95, 2003.

[143] Parastoo Mohagheghi, Bente Anda, and Reidar Conradi. Effort estimation of use cases for incremental large-scale software development. In ICSE '05: Proceedings of the 27th International Conference on Software Engineering, pages 303–311, 2005.

[144] Gail C. Murphy, Mik Kersten, Martin P. Robillard, and Davor Cubranic. The emergent structure of development tasks. In Andrew P. Black, editor, Proceedings of the 19th European Conference on Object-Oriented Programming (ECOOP 2005), LNCS, pages 33–48. Springer, July 2005.

[145] Steve Murphy, Scott R. Tilley, and Shihong Huang. Fourth workshop on graphical documentation: UML style guidelines. In Proceedings of SIGDOC'04, pages 118–119, Memphis, 2004. ACM.


[146] Johan Muskens. Software Architecture Analysis Tool. MSc thesis, Technische Universiteit Eindhoven, The Netherlands, 2002.

[147] Johan Muskens, Reinder J. Bril, and Michel R. V. Chaudron. Generalizing consistency checking between software views. In Proceedings of the 5th Working IEEE/IFIP Conference on Software Architecture (WICSA'05), pages 169–180, Washington, DC, USA, 2005. IEEE Computer Society.

[148] Johan Muskens, Christian F. J. Lange, and Michel R. V. Chaudron. Experiences in applying architecture and design metrics in multi-view models. In Proceedings of EUROMICRO 2004, Rennes, France, pages 372–379, August 2004.

[149] Bashar Nuseibeh. A Multi-Perspective Framework for Method Integration. PhD thesis, Imperial College of Science, Technology and Medicine, London, England, October 1994.

[150] Object Management Group. UML 2.0 OCL Specification (Object Constraint Language), formal/2006-05-01 edition, June 2006.

[151] Object Management Group. MDA Guide, Version 1.0.1, omg/03-06-01 edition, June 2003.

[152] Object Management Group. Unified Modeling Language, UML 2.0 Superstructure Specification, formal/07-03-03 edition, July 2007.

[153] Object Management Group (OMG). http://www.omg.org.

[154] Paul W. Oman and Curtis R. Cook. A taxonomy for programming style. In Proceedings of the 18th ACM Computer Science Conference, pages 244–250, 1990.

[155] Paul W. Oman and Jack R. Hagemeister. Metrics for assessing software system maintainability. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM'92), pages 227–344, Los Alamitos, CA, 1992. IEEE CS Press.

[156] Mari Carmen Otero and Jose Javier Dolado. An initial experimental assessment of the dynamic modelling in UML. Empirical Software Engineering, pages 27–47, July 2002.

[157] Mari Carmen Otero and Jose Javier Dolado. An empirical comparison of the dynamic modeling in OML and UML. Journal of Systems and Software, 77(2):91–102, August 2005.

[158] Michael J. Pacione, Marc Roper, and Murray Wood. A novel software visualisation model to support software comprehension. In Proceedings of the 11th Working Conference on Reverse Engineering (WCRE), pages 70–79. IEEE CS Press, 2004.

[159] David Lorge Parnas. On the criteria to be used in decomposing systems into modules. Communications of the ACM, 15(12):1053–1058, 1972.

[160] Helen C. Purchase, Jo-Anne Allder, and David Carrington. Graph layout aesthetics in UML diagrams: User preferences. Journal of Graph Algorithms and Applications, 6(3):255–279, 2002.

[161] Helen C. Purchase, Linda Colpoys, Matthew McGill, David Carrington, and Carol Britton. UML class diagram syntax: an empirical study of comprehension. In Proceedings of the 2001 Asia-Pacific Symposium on Information Visualisation (APVis '01), pages 113–120, Darlinghurst, Australia, 2001. Australian Computer Society, Inc.


[162] Helen C. Purchase, Matthew McGill, Linda Colpoys, and David Carrington. Graph drawing aesthetics and the comprehension of UML class diagrams: an empirical study. In Proceedings of the 2001 Asia-Pacific Symposium on Information Visualisation (APVis '01), pages 129–137, Darlinghurst, Australia, 2001. Australian Computer Society, Inc.

[163] Balasubramaniam Ramesh and Matthias Jarke. Toward reference models for requirements traceability. IEEE Transactions on Software Engineering, 27(1):58–93, 2001.

[164] Iris Reinhartz-Berger and Dov Dori. OPM vs. UML – experimenting with comprehension and construction of web application models. Empirical Software Engineering, 10:57–79, 2005.

[165] Thijs Reus, Hans Geers, and Arie van Deursen. Harvesting software systems for MDA-based reengineering. In Proceedings of the Second European Conference on Model Driven Architecture – Foundations and Applications (ECMDA-FA 2006), pages 213–225. Springer, July 2006.

[166] Arthur J. Riel. Object-Oriented Design Heuristics. Addison-Wesley, April 1996.

[167] Claudio Riva, Petri Selonen, Tarja Systä, and Jianli Xu. UML-based reverse engineering and model analysis approaches for software architecture maintenance. In Proceedings of the 20th IEEE International Conference on Software Maintenance (ICSM 2004), pages 50–59. IEEE, September 2004.

[168] H. Dieter Rombach. Quantitative Bewertung von Software-Qualitäts-Merkmalen auf Basis struktureller Kenngrößen. PhD thesis, Universität Kaiserslautern, June 1984. In German.

[169] James Rumbaugh, Michael Blaha, William Premerlani, Frederick Eddy, and William Lorensen. Object-Oriented Modelling and Design. Prentice Hall, New York, 1991.

[170] James Rumbaugh, Ivar Jacobson, and Grady Booch. The Unified Modeling Language Reference Manual. Addison Wesley, 1999.

[171] Nick Russell, Wil M. P. van der Aalst, Arthur H. M. ter Hofstede, and Petia Wohed. On the suitability of UML 2.0 activity diagrams for business process modelling. In APCCM '06: Proceedings of the 3rd Asia-Pacific Conference on Conceptual Modelling, pages 95–104, Darlinghurst, Australia, 2006. Australian Computer Society, Inc.

[172] Reinhard Schauer and Rudolf K. Keller. Pattern visualization for software comprehension. In Proceedings of the 6th International Workshop on Program Comprehension (IWPC'98), pages 4–12, Ischia, Italy, June 1998.

[173] Ina Schieferdecker, Zhen Ru Dai, Jens Grabowski, and Axel Rennoch. The UML 2.0 testing profile and its relation to TTCN-3. In Testing of Communication Systems, volume 2644 of LNCS, pages 79–94. Springer, October 2003.

[174] Douglas C. Schmidt. Model-driven engineering. Computer, 39(2):25–31, February 2006.

[175] Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, July 1948.


[176] Keng Siau and Qing Cao. Unified Modeling Language: A complexity analysis. Journal of Database Management, 12(1):26–34, January 2001.

[177] Sandra A. Slaughter, Donald E. Harter, and Mayuram S. Krishnan. Evaluating the cost of software quality. Communications of the ACM, 41(8):67–73, August 1998.

[178] Ian Sommerville. Software Engineering. Addison-Wesley, eighth edition, May 2006.

[179] George Spanoudakis, Anthony Finkelstein, and David Till. Overlaps in requirements engineering. Automated Software Engineering, 6(2):171–198, 1999.

[180] Miroslaw Staron. Adopting model driven development in industry – a case study at two companies. In Oscar Nierstrasz, Jon Whittle, David Harel, and Gianna Reggio, editors, Proceedings of the 9th International Conference on Model Driven Engineering Languages and Systems (MoDELS 2006), LNCS 4199, pages 57–72, Genova, Italy, October 2006. Springer.

[181] Margaret-Anne D. Storey, Casey Best, and Jeff Michaud. SHriMP views: an interactive environment for exploring Java programs. In Proceedings of the 9th International Workshop on Program Comprehension (IWPC 2001), pages 111–112. IEEE CS Press, May 2001.

[182] Margaret-Anne D. Storey, Kenny Wong, and Hausi A. Müller. Rigi: A visualization environment for reverse engineering. In Proceedings of the 19th International Conference on Software Engineering (ICSE '97), pages 606–607. IEEE CS Press, May 1997.

[183] The Gartner Group. Hype cycle for emerging technologies. http://www.gartner.com, July 2006.

[184] Walter F. Tichy. Should computer scientists experiment more? Computer, 31(5):32–40, 1998.

[185] Scott R. Tilley and Shihong Huang. A qualitative assessment of the efficacy of UML diagrams as a form of graphical documentation in aiding program understanding. In Proceedings of the 21st International Conference on Systems Documentation (SIGDOC 2003), pages 184–191. ACM, October 2003.

[186] Guilherme Travassos, Forrest Shull, Michael Fredericks, and Victor R. Basili. Detecting defects in object-oriented designs: using reading techniques to increase software quality. SIGPLAN Notices, 34(10):47–56, 1999.

[187] Takuya Uemura, Shinji Kusumoto, and Katsuro Inoue. Function point measurement tool for UML design specification. In METRICS '99: Proceedings of the 6th International Symposium on Software Metrics, page 62, Washington, DC, USA, 1999. IEEE Computer Society.

[188] Ragnhild van der Straeten. Inconsistentiebeheer in modelgebaseerde ontwikkeling. PhD thesis, Vrije Universiteit Brussel, Brussel, Belgium, 2005. In Dutch.

[189] Dennis J. A. van Opzeeland. Automated techniques for reconstructing and assessing correspondence between UML designs and implementations. MSc thesis, Technische Universiteit Eindhoven, Eindhoven, The Netherlands, August 2005.

[190] Dennis J. A. van Opzeeland, Christian F. J. Lange, and Michel R. V. Chaudron. Quantitative techniques for the assessment of correspondence between UML designs and implementation. In Houari A. Sahraoui, Coral Calero, Michele Lanza, Geert Poels, and Fernando Brito e Abreu, editors, Proceedings of the 9th ECOOP Workshop on Quantitative Approaches in Object-Oriented Software Engineering (QAOOSE'05), pages 1–18, 2005.

[191] Hans van Vliet. Software Engineering: Principles and Practice. John Wiley and Sons, 2000.

[192] Jan Verelst. The influence of the level of abstraction on the evolvability of conceptual models of information systems. Empirical Software Engineering, 10(4):467–494, 2005.

[193] Lucian Voinea, Alex Telea, and Jarke J. van Wijk. CVSscan: Visualization of code evolution. In Proceedings of the ACM Symposium on Software Visualization (SoftVis), pages 47–56, St. Louis, Missouri, 2005. ACM.

[194] Martijn A. M. Wijns. MetricView Evolution: Monitoring Architectural Quality. MSc thesis, Technische Universiteit Eindhoven, Eindhoven, The Netherlands, March 2006.

[195] Claes Wohlin and Aybüke Aurum. An evaluation of checklist-based reading for entity-relationship diagrams. In Proceedings of the Ninth International Software Metrics Symposium. IEEE CS, 2003.

[196] Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. Experimentation in Software Engineering – An Introduction. Kluwer Academic Publishers, 2000.

[197] Kenny Wong and Dabo Sun. On evaluating the layout of UML diagrams for program comprehension. Software Quality Journal, 14(3):233–259, September 2006.

[198] Jürgen Wüst. The software design metrics tool for the UML. http://www.sdmetrics.com.

[199] Zhenchang Xing and Eleni Stroulia. Analyzing the evolutionary history of the logical design of object-oriented software. IEEE Transactions on Software Engineering, 31(10):850–868, October 2005.

[200] Zhenchang Xing and Eleni Stroulia. UMLDiff: an algorithm for object-oriented design differencing. In David F. Redmiles, Thomas Ellman, and Andrea Zisman, editors, Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE 2005), pages 54–65. ACM, November 2005.

[201] Robert K. Yin. Case Study Research – Design and Methods, volume 5 of Applied Social Research Methods Series. Sage Publications, Inc., Thousand Oaks, California, 3rd edition, 2003.

[202] Shehnaaz Yusuf, Huzefa H. Kagdi, and Jonathan I. Maletic. Assessing the comprehension of UML class diagrams via eye tracking. In 15th International Conference on Program Comprehension (ICPC 2007), pages 113–122. IEEE Computer Society, June 2007.

[203] Marvin V. Zelkowitz and Dolores R. Wallace. Experimental models for validating technology. Computer, 31(5):23–31, 1998.


Appendix A

Effects of Defects Experiment

A.1 The Agreement Measure

For the analysis of the multiple-choice questions we were interested in the amount of misinterpretation that was caused by the presence of a model defect. Therefore we needed a measure that captures the degree of agreement in the distribution of answers to each question. In our experiment each question has four answer alternatives. Essentially we want to measure a distribution's magnitude of discrimination.

The measure should have the following properties:

Property 1 It should have its maximum value when only one alternative received answers.

Property 2 It should have its minimum value when all alternatives received an equal number of answers.

Property 3 It should have the range [0, 1].

Property 4 It should be larger for distributions where X alternatives received 'many' answers than for distributions where X+1 alternatives received 'many' answers.

Our agreement measure captures the attribute of a distribution that is the inverse of entropy. We also considered using Shannon's entropy measure [175]. The main difference between Shannon's entropy and our agreement measure is that Shannon's entropy measure behaves exponentially and is not normalized between 0 and 1. The linear behavior and the range between 0 and 1 of AgrM are similar to the detection rate. Therefore we have chosen to develop a new measure as described below.

First we define some abbreviations that we need for the explanation of the measure (see Table A.1).

Table A.1. Definitions for the agreement measure

Name | Description
K    | the number of alternatives for a question
k_i  | the number of times alternative i was selected, where 0 ≤ i < K and (∀i : 0 ≤ i < K−1 : k_i ≥ k_{i+1})
N    | the sum of answers over all alternatives: N = ∑_{0≤i<K} k_i

For simplicity's sake we begin by developing a function that behaves opposite to the desired function, i.e. it has its maximum value when all alternatives receive an equal number of answers (opposed to Property 1).

The function must be such that it is larger the broader the distribution over the alternatives is. This means the value must increase if the number of alternatives that received answers increases and if the difference in the number of answers between the alternative with the most answers and the other alternatives decreases. We accomplish this by multiplying the number of answers per alternative by a factor and adding the products. The factors are chosen such that alternatives with fewer answers receive a larger factor. This leads to the following equation for the weighted sum S, where we choose 0, 1, ..., K−1 as factors:

S = \sum_{0 \le i < K} k_i \cdot i \qquad (A.1)

In the optimal case only one alternative receives answers (k_0 > 0 and k_i = 0 for i > 0), hence the sum S = 0.

In the worst case, the answers are equally distributed over all alternatives k_i. In this case the sum reaches its maximum S_max, which can be calculated as follows:

S_{max} = \frac{N}{K} \sum_{0 \le i < K} i = \frac{N}{K} \cdot \frac{K(K-1)}{2} = \frac{N(K-1)}{2} \qquad (A.2)

We normalize the range of our measure such that it satisfies Property 3 by dividing S by S_max. This yields:

F = \frac{S}{S_{max}} \qquad (A.3)


Table A.2. Raw results of the student experiment (first run)

Question | Defect Type                       | A  | B   | C   | D  | det.
Q1.1     | Message without Name              | 2  | 38  | 31  | 63 | 77
Q1.2     | Control                           | 2  | 100 | 1   | 0  | 9
Q2       | Message without Method            | 71 | 6   | 0   | 9  | 43
Q3       | Message in the wrong Direction    | 34 | 7   | 8   | 8  | 67
Q4       | Control                           | 2  | 106 | 2   | 1  | 6
Q5.1     | Control                           | 1  | 103 | 3   | 2  | 5
Q5.2     | Class not instantiated in SD      | 18 | 53  | 4   | 39 | 52
Q6.1     | Class not in CD                   | 3  | 96  | 1   | 23 | 20
Q6.2     | Control                           | 0  | 78  | 42  | 0  | 9
Q7.1     | UC not in SD                      | 1  | 1   | 43  | 9  | 55
Q7.2     | Control                           | 0  | 100 | 1   | 6  | 5
Q8       | Message without Method (s.)       | 58 | 2   | 7   | 1  | 54
Q9.1     | Control                           | 1  | 105 | 0   | 0  | 4
Q9.2     | Class not instantiated in SD (s.) | 30 | 33  | 1   | 31 | 104
Q10      | Multiple class defs.              | 4  | 6   | 100 | 0  | 11

Now we have F, which satisfies Property 4 but has Properties 1 and 2 reversed. We simply subtract F from 1 to obtain our measure, which also satisfies Properties 1 and 2.

Hence, our agreement measure (called AgrM) for K > 1 alternatives is:

\mathrm{AgrM}(k_0, \ldots, k_{K-1}) = 1 - \frac{2 \sum_{0 \le i < K} k_i \cdot i}{N(K-1)} \qquad (A.4)
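For illustration, the following Python sketch computes AgrM for a single question. The function and its names are our own (hypothetical), but the computation follows Equation (A.4) and the definitions of Table A.1:

```python
def agreement(counts):
    """Agreement measure AgrM for one multiple-choice question.

    `counts` lists how often each alternative was chosen. Per Table A.1
    the counts are sorted in descending order, so the weight (factor) i
    grows as the alternatives become less popular.
    """
    K = len(counts)                       # number of alternatives
    if K < 2:
        raise ValueError("AgrM is only defined for K > 1 alternatives")
    N = sum(counts)                       # total number of answers
    k = sorted(counts, reverse=True)      # k_0 >= k_1 >= ... >= k_{K-1}
    S = sum(i * k_i for i, k_i in enumerate(k))   # weighted sum S, Eq. (A.1)
    return 1 - 2 * S / (N * (K - 1))      # AgrM, Eq. (A.4)

# Property 1: all answers on one alternative yields the maximum, 1.0
assert agreement([10, 0, 0, 0]) == 1.0
# Property 2: an even spread over all alternatives yields the minimum, 0.0
assert agreement([5, 5, 5, 5]) == 0.0
```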

A.2 Raw Result Data

The raw results of the student experiment can be found in Table A.2 (first run) and Table A.4 (second run). Table A.3 shows the raw results of the professionals' experiment.


Table A.3. Raw results of the professionals' experiment

Question | Defect Type                       | A  | B  | C  | D  | det.
Q1.1     | Message without Name              | 2  | 7  | 7  | 16 | 27
Q1.2     | Control                           | 1  | 32 | 1  | 1  | 7
Q2       | Message without Method            | 25 | 1  | 0  | 2  | 14
Q3       | Message in the wrong Direction    | 13 | 9  | 1  | 0  | 18
Q4       | Control                           | 2  | 28 | 1  | 0  | 4
Q5.1     | Control                           | 0  | 25 | 2  | 0  | 4
Q5.2     | Class not instantiated in SD      | 1  | 9  | 1  | 4  | 21
Q6.1     | Class not in CD                   | 1  | 26 | 0  | 1  | 3
Q6.2     | Control                           | 0  | 17 | 4  | 0  | 5
Q7.1     | UC not in SD                      | 0  | 5  | 5  | 3  | 14
Q7.2     | Control                           | 0  | 23 | 1  | 1  | 6
Q8       | Message without Method (s.)       | 21 | 0  | 2  | 0  | 9
Q9.1     | Control                           | 0  | 25 | 0  | 0  | 2
Q9.2     | Class not instantiated in SD (s.) | 2  | 2  | 1  | 2  | 26
Q10      | Multiple class defs.              | 2  | 7  | 14 | 0  | 6

Table A.4. Raw results of the student experiment (second run)

Question | Defect Type                   | A   | B  | C  | D  | det.
R1.1     | Control                       | 102 | 4  | 3  | 1  | 7
R1.2     | Message without Method (s.)   | 22  | 2  | 22 | 61 | 54
R2       | Message without Method        | 71  | 2  | 1  | 6  | 36
R3       | Method not instantiated in SD | 41  | 68 | 6  | 3  | 15
R4.1     | Class not in CD               | 5   | 93 | 0  | 14 | 23
R4.2     | Control                       | 0   | 76 | 39 | 0  | 15
R5       | Message without Method (s.)   | 1   | 0  | 5  | 81 | 33


Appendix B

Modeling Conventions Experiment

B.1 Modeling Conventions

The set of modeling conventions used in this experiment is shown in Table B.1.

Table B.1. List of modeling conventions used in this experiment

ID | Name | Description | Category
1  | Abstraction Level | Classes in the same package must be of the same abstraction level | Abstraction
2  | Unique Names | Classes, packages and use cases must have unique names | Abstraction
3  | Size of Use Cases | All use cases should cover a similar amount of functionality | Abstraction
4  | Homogeneity of Accessor Usage | When you specify getters/setters/constructors for a class, specify them for all classes | Balance
5  | Homogeneity of Visibility Usage | When you specify visibility somewhere, specify it everywhere | Balance
6  | Homogeneity of Method Specification | Specify methods for the classes that have methods! Don't make a difference in whether you specify or don't specify methods as long as there is not a strong difference between the classes. | Balance
7  | Homogeneity of Attribute Specification | Specify attributes for the classes that have attributes! Don't make a difference in whether you specify or don't specify attributes as long as there is not a strong difference between the classes. | Balance
8  | Dynamic Classes | For classes with a complex internal behaviour, specify the internal behaviour using a state diagram | Completeness
9  | Model Class Interaction | All classes that interact with other classes should be described in a sequence diagram | Completeness
10 | Use Case Instantiation | Each Use Case must be described by at least one Sequence Diagram | Completeness
11 | Specify Object Types | The type of ClassifierRoles (Objects) must be specified. (Which class is represented by the object?) | Completeness
12 | Call Methods | A method that is relevant for interaction between classes should be called in a Sequence Diagram to describe how it is used for interaction. | Completeness
13 | Role Names | ClassifierRoles (Objects) should have a role name | Completeness
14 | Specify Message Types | Each message must correspond to a method (operation) | Consistency
15 | No Abstract Leafs | Abstract classes should not be leafs (i.e. they should have subclasses) | Design
16 | DIT at most 7 | Inheritance trees should have no more than 7 levels | Design
17 | Abstract-Concrete | Abstract classes should not have concrete superclasses | Design
18 | High Cohesion | Classes should have high cohesion. Don't overload classes with unrelated functionality. | Design
19 | Low Coupling | Your classes should have low coupling. (The number of relations between each class and other classes should be small) | Design
20 | No Diagram Overload | Don't overload diagrams. Each diagram should focus on a specific concept/problem/functionality. | Layout
21 | No X-ing Lines | Diagrams should not contain crossed lines (relations) | Layout
22 | Use Names | Classes, use cases, operations, attributes, packages, etc. must have a name | Naming
23 | Meaningful Names | Naming should use commonly accepted terminology, be non-ambiguous and precisely express the function / role / characteristic of an element. | Naming
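To make concrete how such conventions can be checked mechanically (in the experiment this role was played by the SDMetrics-based conformance tool), the following Python sketch checks two of the conventions from Table B.1 on a toy model representation. All names in this sketch are our own illustration, not part of the experimental setup:

```python
# Illustrative, hypothetical model representation: classes with a name
# and an optional superclass.

class UmlClass:
    def __init__(self, name, superclass=None):
        self.name = name
        self.superclass = superclass  # parent UmlClass or None

def check_unique_names(classes):
    """Convention 2: classes must have unique names; returns duplicates."""
    names = [c.name for c in classes]
    return sorted(n for n in set(names) if names.count(n) > 1)

def depth_of_inheritance(cls):
    """Number of ancestors of a class (DIT)."""
    depth = 0
    while cls.superclass is not None:
        cls = cls.superclass
        depth += 1
    return depth

def check_dit_at_most_7(classes):
    """Convention 16: inheritance trees should have no more than 7 levels."""
    return [c.name for c in classes if depth_of_inheritance(c) > 7]
```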

B.2 Post-test Questionnaire

We present the questions of the post-test questionnaire in this section. The questionnaire was conducted as an online questionnaire using the PollWeb [61] system of the Technische Universiteit Eindhoven. The students were identified by the system while logging on. This enables us to relate the data from the questionnaire to the other data obtained during the experiment.

Introduction

During the execution of the course assignment you took part in a research experiment on the usefulness of modeling standards and metrics tooling. By completing this questionnaire you provide us with useful information for the analysis of the results. The information you give here does not influence your mark and will be treated confidentially! Please answer each question carefully. Thank you!

Background Questions

1. What is your practical work experience in software engineering (in years)?

2. What is your knowledge in the following fields?

Please answer according to this scale: 1 = no knowledge; 2 = gained knowledge through academic classes or literature study; 3 = applied it in academic context; 4 = applied it in one industrial project; 5 = applied it in more than one industrial project


                                  1   2   3   4   5
Unified Modeling Language (UML)   ○   ○   ○   ○   ○
Designing Software Systems        ○   ○   ○   ○   ○
Implementing Software Systems     ○   ○   ○   ○   ○
Reviewing Source Code             ○   ○   ○   ○   ○
Reviewing Software Designs        ○   ○   ○   ○   ○
Software Inspections              ○   ○   ○   ○   ○

About the Assignment

3. Indicate how difficult the UML modeling task was to you.
1 – very difficult   2 – difficult   3 – intermediate   4 – easy   5 – very easy
○ ○ ○ ○ ○

4. Indicate how well you understood what was required of you in the UML modeling task.
1 – very poor   2 – poor   3 – intermediate   4 – good   5 – very good
○ ○ ○ ○ ○

5. What is your confidence in the quality of the UML model you delivered?
1 – very poor   2 – poor   3 – intermediate   4 – good   5 – very good
○ ○ ○ ○ ○

6. How much did you enjoy developing the UML model using traditional modeling / modeling standards / modeling standards + analysis tool? (choose what is applicable for you)
1 – very poor   2 – poor   3 – intermediate   4 – good   5 – very good
○ ○ ○ ○ ○

7. Indicate how motivated you were to perform well in the UML modeling task.
1 – very poor   2 – poor   3 – intermediate   4 – good   5 – very good
○ ○ ○ ○ ○

8. How was the work organized within your group?
○ one person developed the model on his own
○ one person created the model, other persons reviewed the model (fixed roles)
○ one person created the model, other persons reviewed the model (roles changed over project)
○ several persons created the model and reviewed the model together
○ different approach:


Questions about Modeling Conventions

Only the subjects who used modeling conventions were asked to answer the questions of this section.

9. How well did you adhere to the modeling conventions?
1 – not at all   2 – poor   3 – intermediate   4 – good   5 – very good
○ ○ ○ ○ ○

10. How well did you understand the meaning of the modeling conventions?
1 – very poor   2 – poor   3 – intermediate   4 – good   5 – very good
○ ○ ○ ○ ○

11. Indicate how you applied the modeling conventions.
○ read the guidelines once in the beginning of the project and tried to follow them
○ read the guidelines several times during the project and tried to follow them
○ read the guidelines several times during the project and reviewed once whether the model adheres to the guidelines
○ read the guidelines several times during the project and reviewed several times whether the model adheres to the guidelines
○ different approach:

12. Do you think development using modeling standards leads to a better UML model?
1 – not at all   2 – rather not   3 – neutral   4 – probably yes   5 – yes, a lot
○ ○ ○ ○ ○

Questions about using the Conformance Tool

Only the subjects who used the conformance tool were asked to answer the questions of this section.

13. How well did you adhere to the critiques and analysis results of the SDMetrics tool?
1 – not at all   2 – poor   3 – intermediate   4 – good   5 – very good
○ ○ ○ ○ ○

14. How well did you understand the meaning of the tool output?
1 – not at all   2 – poor   3 – intermediate   4 – good   5 – very good
○ ○ ○ ○ ○

15. How many times did you analyze the model during the project using the tool?

16. Do you think tool-based model analysis leads to a better UML model?
1 – not at all   2 – rather not   3 – neutral   4 – probably yes   5 – yes, a lot
○ ○ ○ ○ ○


Appendix C

Task-Oriented Views Experiment

C.1 Subjects

Here we describe the background information of the subjects. Table C.1 shows the institute where the subjects received their previous degree, Table C.2 shows the major of the subjects' previous degree, and Table C.3 summarizes the relevant skill and knowledge level of the subjects, as indicated in the background questionnaire.

Table C.1. Subjects' previous institutes

Institute                 | Percent
TU/e                      | 54.0 %
Other Dutch institutes    | 26.0 %
Other European institutes | 8.0 %
Outside Europe            | 12.0 %

Table C.2. Subjects' major

Major                  | Percent
Computer Science       | 80.0 %
Electrical Engineering | 13.0 %
Different              | 7.0 %

C.2 Assumptions of the Student t-test

To get reliable results from parametric statistical tests such as the Student t-test, the analyzed data must fulfill the test's assumptions, such as adherence to the normal distribution. If the assumptions do not hold, a non-parametric test must be chosen. To choose the appropriate tests for hypothesis testing, we had to check the collected data for deviations from the normal distribution. We used the Kolmogorov-Smirnov goodness-of-fit test and the Shapiro-Wilk test for this analysis. Table C.4 shows the results of the Kolmogorov-Smirnov test and Table C.5 shows the results of the Shapiro-Wilk test. The null hypothesis in these tests states that the distribution of observed values does not deviate from the normal distribution. A significant p-value (less than 0.05) indicates that we can reject the null hypothesis. The results show that we can assume a normal distribution for the variable time (except for the PoS in the first run) and that we have to reject the assumption for the variable correctness. As a result we conducted both parametric and non-parametric tests. The experimental design of each run was a between-subjects design; therefore the appropriate tests were the Student t-test and the Mann-Whitney test. The results of the (weaker) non-parametric Mann-Whitney test were consistent with the results of the parametric Student t-test. As reported in Table 8.5, we conducted Levene's test for equality of variances in order to decide whether we could use the standard Student t-test, which assumes equality of variances of both samples. The significance of Levene's test results was relatively low for three out of four pairs; therefore we decided to apply the weaker variant of the Student t-test, which does not assume equality of variances.
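As an illustration of this test-selection procedure, the sketch below (assuming SciPy's stats module; the function and its fixed threshold are our own simplification, not the exact analysis script used in the experiment) first checks normality, falls back to the non-parametric Mann-Whitney test when normality is rejected, and otherwise lets Levene's test decide between the equal- and unequal-variance variants of the Student t-test:

```python
from scipy import stats

def compare_groups(sample_a, sample_b, alpha=0.05):
    """Between-subjects comparison following the procedure described above."""
    # Shapiro-Wilk: a significant p-value (< alpha) rejects normality.
    normal = (stats.shapiro(sample_a).pvalue >= alpha and
              stats.shapiro(sample_b).pvalue >= alpha)
    if not normal:
        # Non-parametric alternative for a between-subjects design.
        return stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
    # Levene's test for equality of variances selects the t-test variant.
    equal_var = stats.levene(sample_a, sample_b).pvalue >= alpha
    return stats.ttest_ind(sample_a, sample_b, equal_var=equal_var)
```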


Table C.3. Subjects' background (* indicates significant difference)

Knowledge                 | UML   | Design | Impl. | UML Tools | Metr. Tools
1 – No knowledge          | 6.1%  | 7.1%   | 10.2% | 5.1%      | 18.4%
2 – Learned in university | 31.6% | 21.4%  | 15.3% | 41.8%     | 54.1%
3 – Appl. in university   | 35.7% | 40.8%  | 37.8% | 35.7%     | 26.5%
4 – Appl. once in ind.    | 20.4% | 18.4%  | 20.4% | 11.2%     | 1.0%
5 – Appl. > once in ind.  | 6.1%  | 12.2%  | 16.3% | 6.1%      | 0.0%
Average Group A           | 2.79  | 3.06   | 3.19  | 2.60      | 2.02
Average Group B           | 2.96  | 3.06   | 3.13  | 2.81      | 2.17
χ²                        | 6.59  | 3.35   | 7.00  | 17.89 *   | 7.71

Table C.4. Kolmogorov-Smirnov results

Dependent Variable | Tool | First Run (Statistic, df, Sig.) | Second Run (Statistic, df, Sig.)
Correctness        | MVE  | .142, 45, .023                  | .168, 48, .002
Correctness        | PoS  | .182, 51, .000                  | .121, 43, .117
Time               | MVE  | .063, 45, > .200                | .104, 48, > .200
Time               | PoS  | .120, 51, .064                  | .111, 43, > .200

Table C.5. Shapiro-Wilk results

Dependent Variable | Tool | First Run (Statistic, df, Sig.) | Second Run (Statistic, df, Sig.)
Correctness        | MVE  | .937, 45, .017                  | .911, 48, .001
Correctness        | PoS  | .913, 51, .001                  | .956, 43, .095
Time               | MVE  | .978, 45, .548                  | .961, 48, .114
Time               | PoS  | .946, 51, .022                  | .986, 43, .858


Summary

Assessing and Improving the Quality of Modeling

A Series of Empirical Studies about the UML

This thesis addresses the assessment and improvement of the quality of modeling in software engineering. In particular, we focus on the Unified Modeling Language (UML), which is the de facto standard in industrial practice. The language is used for a wide variety of purposes, such as specification, maintenance, comprehension, communication, test case generation, and code generation. The UML has some inherent characteristics that cause risks to the quality of UML modeling. The characteristics of interest in this thesis are its lack of a formal semantics, its multi-diagram architecture, and its large complexity. These characteristics can lead to quality problems with respect to correctness, comprehensibility, consistency, non-redundancy, completeness, or unambiguity. In this thesis we assess the quality of modeling in practice, and we provide and evaluate techniques to improve the quality of modeling. We conducted three large-scale experiments with 365 participants in total. Additionally a series of industrial case studies was conducted.

To define the notion of quality that is used throughout this thesis, we present a framework that is based on existing work. The framework decomposes quality of models into the following quality notions: system, semantic, syntactic, pragmatic, social, and communicative quality, and correspondence between the model and the implementation. We use these notions throughout the thesis to denote which aspect of quality is addressed in each of the presented studies. Additionally, we propose a quality model for UML modeling. The purpose of the quality model is to support developers in selecting metrics and rules to analyze the quality of a model with respect to a particular purpose of modeling.

We report on a series of industrial case studies. We conducted the case studies to assess the quality of UML models in practice. The results of the case studies reveal the frequency of occurrence of defect types in modeling. This knowledge can be used to focus quality assurance techniques on common quality problems.


Through the case studies we discovered defects in UML models. We conducted an experiment to study the effects of several of the discovered defect types. The results of the experiment show that defects often remain undetected by developers. Furthermore, defects cause a variety of different interpretations of the model amongst developers and, hence, lead to misinterpretations and miscommunication. An additional result of the experiment is a classification of defect types with respect to their likelihood of non-detection and misinterpretation. This objective classification can be used to prioritize defects, such that removal effort is assigned to the most severe defects first.

As a preventive quality assurance technique we propose modeling conventions, similar to coding conventions for programming. Based on a literature review we provide a classification of modeling conventions. We report on an exploratory experiment that studied the effectiveness of modeling conventions with respect to defect prevention and the extra effort entailed by modeling conventions. The results show a slight improvement with respect to defect occurrences; however, these results are not statistically significant. The results showing an increase in development effort are significant. Based on the experiment we provide recommendations to improve the benefit obtained by using modeling conventions. The recommendations include improved adherence to the modeling conventions, training and experience, and developer motivation.

Finally, we propose task-oriented views for UML. Task-oriented views are visualizations of UML models that support developers by providing the information that is necessary for a particular task, such as maintenance, comprehension, or quality analysis. We argue that the existing UML diagram types and existing UML tools do not provide the required information effectively. For example, it is tedious to find the relations between model elements in different diagrams and to relate data such as metrics, bug data, or evolution data to model elements. We discuss a framework that we used as a basis for creating task-oriented views. The proposed views are: MetaView, ContextView, MetricView, UML-City-View, Quality Tree View, and Evolution View. We conducted an experiment to validate the views with respect to comprehension tasks. The results are very promising and show that the correctness of comprehension is improved by 4.5% and that the effort for comprehension is reduced by 20%. The proposed task-oriented views are implemented in the tool MetricView Evolution.


Samenvatting

Het onderwerp van dit proefschrift is de beoordeling en de verbetering van de kwaliteit van modellering van software. In het bijzonder concentreren wij ons op de modelleringstaal "Unified Modeling Language" (UML). De UML is de de-facto modelleringstaal in de industriële praktijk. Deze modelleringstaal wordt gebruikt voor een grote verscheidenheid van doeleinden, zoals specificatie en onderhoud van systemen, maar ook voor communicatie, testcase-generatie, en generatie van broncode. De taal UML heeft enkele kenmerken die problemen kunnen veroorzaken bij het modelleren. De kenmerken van UML die van belang zijn voor dit proefschrift zijn: het gebrek aan een formele semantiek, de multi-diagram-architectuur, en de grote hoeveelheid diagrammen en taal-constructies. Deze kenmerken kunnen leiden tot kwaliteitsproblemen met betrekking tot correctheid, begrijpelijkheid, consistentie, eenduidigheid, of volledigheid van een UML model. In dit proefschrift evalueren we industriële UML modellen om inzicht in de kwaliteit van UML modellen in de praktijk te verkrijgen. Op basis van deze bevindingen stellen we technieken voor voor de verbetering van de kwaliteit van UML modellen. In dit proefschrift beschrijven we een evaluatie van deze verbeteringstechnieken. Wij hebben drie grootschalige experimenten met in totaal 365 deelnemers uitgevoerd. Bovendien werd een reeks industriële case studies uitgevoerd.

Om de notie van kwaliteit die wij in dit proefschrift hanteren te bepalen, stellen wij een raamwerk voor. Dit raamwerk splitst het begrip "kwaliteit van model" op in de volgende delen: kwaliteit van het systeem, kwaliteit van de semantiek, kwaliteit van de syntax, en bovendien pragmatische-, sociale- en communicatieve-kwaliteit. Ook vormt de correspondentie tussen het model en de implementatie onderdeel van het kwaliteitsbegrip. Wij gebruiken deze begrippen van dit raamwerk om per onderdeel van dit proefschrift aan te duiden welk onderdeel van kwaliteit bedoeld is. Bovendien stellen wij een kwaliteitsmodel voor UML modellering voor. Het doel van dit kwaliteitsmodel is om ontwikkelaars te steunen bij het selecteren van metrieken en regels ten behoeve van het beoordelen en beheersen van de kwaliteit van UML modellen. Afhankelijk van het doeleinde van het te beoordelen UML model worden specifieke metrieken en regels geselecteerd.


In de reeks van case studies hebben wij 16 industriële UML modellen geanalyseerd om kwaliteitsproblemen in de praktijk te bestuderen. De resultaten van deze case studies onthullen welke defecten in de praktijk in UML modellering voorkomen. Deze kennis kan worden gebruikt om de technieken voor kwaliteitsverbetering zodanig verder te ontwikkelen, dat zij zich richten op vaak voorkomende defecten. De defecten die in UML modellering worden gemaakt hebben potentieel een effect op de activiteiten en producten die hier in de loop van een software-ontwikkelingsproject op gebaseerd worden. Om de gevolgen van de in de case studies gevonden defecten te analyseren hebben wij een experiment uitgevoerd. De resultaten van het experiment tonen aan dat de defecten vaak door ontwikkelaars niet ontdekt worden. Bovendien veroorzaken de defecten verschillende interpretaties van het model bij verschillende ontwikkelaars. Dit impliceert dat defecten tot misinterpretatie en miscommunicatie leiden. Een aanvullend resultaat van dit experiment is een objectieve classificatie van defecten met betrekking tot de waarschijnlijkheid dat deze defecten door ontwikkelaars niet ontdekt worden en dat deze defecten misinterpretaties veroorzaken. Deze classificatie kan worden gebruikt om bij de kwaliteitsverbetering voorrang te geven aan de meest gevaarlijke defecten.

Als preventieve techniek om de kwaliteit van UML modellen te verbeteren hebben wij modelleringsconventies ("modeling conventions") voorgesteld. Deze conventies zijn vergelijkbaar met de conventies voor broncode ("coding conventions"). Gebaseerd op de ervaringen uit de case studies en een literatuurstudie stellen wij een classificatie van modelleringsconventies voor. Wij beschrijven een experiment waarin wij modelleringsconventies bestuderen met betrekking tot hun effectiviteit om defecten te vermijden en evalueren de gevolgen van het gebruik van de conventies voor de productiviteit van het modelleren. De resultaten tonen een lichte afname aan van het aantal defecten bij het gebruik van modelleringsconventies. Echter, deze resultaten zijn niet statistisch significant. De tijd die nodig is om te modelleren neemt toe indien modelleringsconventies gebruikt worden. Gebaseerd op dit experiment geven wij aanbevelingen voor het verbeteren van het gebruik van modelleringsconventies. De belangrijkste aanbevelingen zijn de volgende: de modelleringsconventies moeten strikter gevolgd worden, de ontwikkelaars moeten beter getraind zijn voor het gebruik van de conventies, en de motivatie van de ontwikkelaars om de conventies te gebruiken moet verbeterd worden.

Tenslotte stellen wij task-oriented views voor de UML voor. Task-oriented views zijn visualisaties van UML modellen die de ontwikkelaars steunen door precies die informatie aan te bieden die voor het uitvoeren van een bepaalde taak benodigd is. Een taak is bijvoorbeeld het onderhouden of begrijpen van een UML model, of het uitvoeren van een kwaliteitsanalyse. Wij argumenteren dat bestaande UML tools en bestaande UML diagramtypes de vereiste informatie niet op een effectieve manier aanbieden. Bijvoorbeeld, in de huidige views is het uiterst moeilijk om de relaties tussen modelelementen uit verschillende diagrammen te vinden. Een ander voorbeeld is het met elkaar in verband brengen van metrieken en modelelementen. Wij beschrijven een raamwerk dat wij als basis gebruikt hebben om task-oriented views te creëren. De voorgestelde views zijn: MetaView, ContextView, MetricView, UML-city-view, Quality-tree-view, en Evolution View. Wij hebben een experiment uitgevoerd om te valideren of deze views tot een verbetering leiden ten opzichte van bestaande technieken bij het uitvoeren van begripstaken. De resultaten tonen aan dat de correctheid van het begrip met 4,5% verbeterd is en dat de benodigde tijd voor het uitvoeren van een taak met 20% verminderd wordt. De voorgestelde task-oriented views zijn geïmplementeerd in het tool MetricView Evolution.


Zusammenfassung

Der Bedarf an Software nimmt seit Jahren kontinuierlich zu. Gleichzeitig steigt die Komplexität von Softwaresystemen. Im Software Engineering werden Methoden zur Entwicklung und Instandhaltung von Softwaresystemen angewendet. Diese Methoden sollen Softwareentwickler bei der Bewältigung der Komplexität von Systemen unterstützen. Eine dieser Methoden ist die Modellierung. Dabei werden Modelle als abstrakte Repräsentation eines Softwaresystems erstellt, um Eigenschaften des Systems festzulegen, ohne die gesamte, detaillierte Komplexität des Systems darzustellen. Dadurch kann bereits zu einem Zeitpunkt, zu dem noch nicht alle Details eines Systems bekannt sind, eine Darstellung des Systems erstellt werden. Die Modellierungssprache "Unified Modeling Language" (UML) stellt einen Industriestandard zur Softwaremodellierung dar. Die Sprache bietet 13 verschiedene Diagrammtypen, um verschiedene Perspektiven eines Systems, etwa die strukturelle oder die interaktive Perspektive, abzubilden.

Die vorliegende Arbeit beschäftigt sich mit der Qualität der Modellierung im Software Engineering. Insbesondere befasst sie sich mit der UML, die für viele Zwecke verwendet wird, zum Beispiel zur Spezifikation, Instandhaltung und zum Verständnis von Softwaresystemen, zur Kommunikation, zur Erstellung von Testfällen und zur automatischen Generierung von Programmen. Die UML hat einige Eigenschaften, die die Qualität der Modellierung negativ beeinträchtigen können. Die insofern signifikanten Eigenschaften, mit denen sich die vorliegende Arbeit beschäftigt, sind das Fehlen einer formalen Semantik, die Multi-Diagramm-Architektur der Sprache und ihre große Komplexität. Diese Eigenschaften können zu Qualitätsproblemen führen und letztendlich die Korrektheit, die Verständlichkeit, die Konsistenz, die Nicht-Redundanz, die Vollständigkeit oder die Eindeutigkeit der Modelle verschlechtern. Die Qualität der Modellierung in der Praxis wird betrachtet, insbesondere werden Methoden zur Qualitätsverbesserung in der Modellierung vorgelegt und empirisch validiert. Es wurden drei große kontrollierte Experimente mit insgesamt 365 Teilnehmern und 16 industrielle Fallstudien durchgeführt.


A predefined framework establishes which notion of ‘quality’ is used in this thesis. This framework divides quality into several sub-notions: the quality of the described system, the semantics of the model, the syntax of the model, pragmatic properties of the model, social properties of the model, communicative properties of the model, and the consistency between model and implementation. For each study in this thesis, these sub-notions are used to indicate which aspect of model quality is addressed.

Furthermore, a quality model for UML modeling is presented. The goal of this quality model is to support software developers in the quality analysis of UML models. The support consists of criteria that, depending on the purpose of the UML model, guide the selection of suitable metrics and rules for quality assessment.
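
As an illustration of this selection mechanism, the sketch below maps a model's purpose to metrics and rules. The purposes, metrics, and rules shown are hypothetical placeholders, not the actual content of the quality model.

```python
# Illustrative sketch of a purpose-driven quality model: the purpose of a UML
# model determines which metrics and rules are selected for the assessment.
# The purposes, metrics, and rules below are hypothetical examples.
QUALITY_MODEL = {
    "communication": {
        "metrics": ["diagram size"],
        "rules": ["use consistent names across diagrams"],
    },
    "code generation": {
        "metrics": ["attribute completeness", "operation completeness"],
        "rules": ["every message corresponds to an operation"],
    },
}

def select_checks(purpose):
    """Return the metrics and rules relevant for a given model purpose."""
    try:
        return QUALITY_MODEL[purpose]
    except KeyError:
        raise ValueError(f"unknown model purpose: {purpose!r}") from None

print(select_checks("code generation")["metrics"])
# ['attribute completeness', 'operation completeness']
```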

Sixteen industrial case studies are described in which the quality of UML models was assessed in practice. The results show how frequently particular modeling problems occur. These data can be used to tune quality assurance methods such that especially the frequently occurring problems are prevented or resolved.

In a first controlled experiment, the effects of the defects (a particular kind of quality problem) discovered in the case studies were investigated. The experiment shows that defects frequently remain undetected by software developers when models are used as a basis for programming. Furthermore, it was found that defects cause UML models to be interpreted in different ways, so that misinterpretations and communication errors can arise. Another result of this experiment is an objective and quantitative classification of defect types with respect to the probability that they remain undetected and cause misinterpretations. This classification can be used to prioritize defect removal so that the most severe defects are addressed first. Such an approach ensures that, at every point in the removal process, the best possible result is achieved.
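
The following sketch illustrates how such a classification could drive prioritization, ranking defect types by the product of the probability of remaining undetected and the probability of causing a misinterpretation. The defect types and probabilities are invented placeholders, not the values measured in the experiment.

```python
# Hypothetical sketch: rank defect types by expected harm, estimated as the
# probability that a defect stays undetected times the probability that it
# causes a misinterpretation. The types and numbers are invented placeholders.
defect_types = {
    "message without corresponding operation": (0.7, 0.6),
    "class absent from all sequence diagrams": (0.5, 0.3),
    "unnamed method in class diagram":         (0.2, 0.4),
}

def harm(p_undetected, p_misinterpreted):
    """Expected harm of a defect type under the two estimated probabilities."""
    return p_undetected * p_misinterpreted

# Remove the most harmful defect types first.
for name in sorted(defect_types, key=lambda t: harm(*defect_types[t]), reverse=True):
    print(f"{harm(*defect_types[name]):.2f}  {name}")
# 0.42  message without corresponding operation
# 0.15  class absent from all sequence diagrams
# 0.08  unnamed method in class diagram
```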

As a preventive quality measure, this thesis proposes modeling conventions, analogous to coding conventions. Based on a literature survey and the experience gained in the case studies, a classification of modeling conventions is described. An experiment was conducted to analyze both the effectiveness of modeling conventions and their effect on productivity. The results show a slight decrease in defect frequency, although the results are not statistically significant. Furthermore, the results show a significant increase in the required modeling time. Supported by the experiment, a series of recommendations for improving the use of modeling conventions is given; examples are increasing adherence to the conventions, improved training, and increased motivation.
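
One way to increase adherence is to automate convention checks. The sketch below checks a single hypothetical naming convention for classes; the convention and the function names are illustrative assumptions, not conventions prescribed by this thesis.

```python
# Sketch of an automated check for one hypothetical modeling convention:
# "class names are written in CapitalizedCamelCase without spaces".
import re

CLASS_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*$")

def violating_class_names(class_names):
    """Return the class names that do not follow the naming convention."""
    return [name for name in class_names if not CLASS_NAME.match(name)]

print(violating_class_names(["Order", "invoice item", "Customer"]))
# ['invoice item']
```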

Finally, task-oriented views are presented. These views are visualizations of UML models that support software developers in performing tasks such as system maintenance or quality analysis. This is achieved by presenting exactly the information that is required to perform a given task. It is observed that the existing UML diagrams and visualization tools do not present the necessary information in a suitable way. For example, it is very tedious to find relations between model elements from different diagrams. Moreover, it is laborious to relate data such as metrics and defect data to model elements. A framework is described that was used to create task-oriented views. The proposed views are: MetaView, ContextView, MetricView, UML-City-View, Quality Tree View, and Evolution View. An experiment was conducted to validate the views with respect to their suitability for model comprehension. The results show that the correctness of model comprehension increases by 4.5%, while the required time is reduced by 20%. The proposed task-oriented views are implemented in the tool MetricView Evolution.


Curriculum Vitae

Christian Franz Josef Lange

• born on May 6th, 1978, in Tegelen, The Netherlands

• 1984 – 1988: Elementary school in Nettetal-Leuth, Germany

• 1988 – 1997: Werner-Jaeger-Gymnasium, Secondary School, in Nettetal, Germany

• 1997 – 1998: Community service (alternative to military service), in the municipal hospital of Nettetal, Germany

• 1998 – 2003: Study of Computer Science, Technische Universiteit Eindhoven, The Netherlands

– Master’s thesis under the supervision of Michel Chaudron

– cum laude

– Award ‘Best Master’s Thesis of the Department of Mathematics and Computer Science of the Year’

– Stipend of the Konrad-Adenauer-Stiftung e.V., Begabtenförderung (German foundation)

• 2001 – 2002: Visiting Student at the University of Calgary, Canada

• 2003 – 2007: PhD Student at the Technische Universiteit Eindhoven, The Netherlands


Titles in the IPA Dissertation Series since 2002

M.C. van Wezel. Neural Networks for Intelligent Data Analysis: theoretical and experimental aspects. Faculty of Mathematics and Natural Sciences, UL. 2002-01

V. Bos and J.J.T. Kleijn. Formal Specification and Analysis of Industrial Systems. Faculty of Mathematics and Computer Science and Faculty of Mechanical Engineering, TU/e. 2002-02

T. Kuipers. Techniques for Understanding Legacy Software Systems. Faculty of Natural Sciences, Mathematics and Computer Science, UvA. 2002-03

S.P. Luttik. Choice Quantification in Process Algebra. Faculty of Natural Sciences, Mathematics, and Computer Science, UvA. 2002-04

R.J. Willemen. School Timetable Construction: Algorithms and Complexity. Faculty of Mathematics and Computer Science, TU/e. 2002-05

M.I.A. Stoelinga. Alea Jacta Est: Verification of Probabilistic, Real-time and Parametric Systems. Faculty of Science, Mathematics and Computer Science, KUN. 2002-06

N. van Vugt. Models of Molecular Computing. Faculty of Mathematics and Natural Sciences, UL. 2002-07

A. Fehnker. Citius, Vilius, Melius: Guiding and Cost-Optimality in Model Checking of Timed and Hybrid Systems. Faculty of Science, Mathematics and Computer Science, KUN. 2002-08

R. van Stee. On-line Scheduling and Bin Packing. Faculty of Mathematics and Natural Sciences, UL. 2002-09

D. Tauritz. Adaptive Information Filtering: Concepts and Algorithms. Faculty of Mathematics and Natural Sciences, UL. 2002-10

M.B. van der Zwaag. Models and Logics for Process Algebra. Faculty of Natural Sciences, Mathematics, and Computer Science, UvA. 2002-11

J.I. den Hartog. Probabilistic Extensions of Semantical Models. Faculty of Sciences, Division of Mathematics and Computer Science, VUA. 2002-12

L. Moonen. Exploring Software Systems. Faculty of Natural Sciences, Mathematics, and Computer Science, UvA. 2002-13

J.I. van Hemert. Applying Evolutionary Computation to Constraint Satisfaction and Data Mining. Faculty of Mathematics and Natural Sciences, UL. 2002-14

S. Andova. Probabilistic Process Algebra. Faculty of Mathematics and Computer Science, TU/e. 2002-15

Y.S. Usenko. Linearization in µCRL. Faculty of Mathematics and Computer Science, TU/e. 2002-16

J.J.D. Aerts. Random Redundant Storage for Video on Demand. Faculty of Mathematics and Computer Science, TU/e. 2003-01

M. de Jonge. To Reuse or To Be Reused: Techniques for component composition and construction. Faculty of Natural Sciences, Mathematics, and Computer Science, UvA. 2003-02

J.M.W. Visser. Generic Traversal over Typed Source Code Representations. Faculty of Natural Sciences, Mathematics, and Computer Science, UvA. 2003-03

S.M. Bohte. Spiking Neural Networks. Faculty of Mathematics and Natural Sciences, UL. 2003-04

T.A.C. Willemse. Semantics and Verification in Process Algebras with Data and Timing. Faculty of Mathematics and Computer Science, TU/e. 2003-05

S.V. Nedea. Analysis and Simulations of Catalytic Reactions. Faculty of Mathematics and Computer Science, TU/e. 2003-06

M.E.M. Lijding. Real-time Scheduling of Tertiary Storage. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2003-07

H.P. Benz. Casual Multimedia Process Annotation – CoMPAs. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2003-08

D. Distefano. On Modelchecking the Dynamics of Object-based Software: a Foundational Approach. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2003-09

M.H. ter Beek. Team Automata – A Formal Approach to the Modeling of Collaboration Between System Components. Faculty of Mathematics and Natural Sciences, UL. 2003-10

D.J.P. Leijen. The λ Abroad – A Functional Approach to Software Components. Faculty of Mathematics and Computer Science, UU. 2003-11

W.P.A.J. Michiels. Performance Ratios for the Differencing Method. Faculty of Mathematics and Computer Science, TU/e. 2004-01

G.I. Jojgov. Incomplete Proofs and Terms and Their Use in Interactive Theorem Proving. Faculty of Mathematics and Computer Science, TU/e. 2004-02

P. Frisco. Theory of Molecular Computing – Splicing and Membrane systems. Faculty of Mathematics and Natural Sciences, UL. 2004-03

S. Maneth. Models of Tree Translation. Faculty of Mathematics and Natural Sciences, UL. 2004-04

Y. Qian. Data Synchronization and Browsing for Home Environments. Faculty of Mathematics and Computer Science and Faculty of Industrial Design, TU/e. 2004-05

F. Bartels. On Generalised Coinduction and Probabilistic Specification Formats. Faculty of Sciences, Division of Mathematics and Computer Science, VUA. 2004-06

L. Cruz-Filipe. Constructive Real Analysis: a Type-Theoretical Formalization and Applications. Faculty of Science, Mathematics and Computer Science, KUN. 2004-07

E.H. Gerding. Autonomous Agents in Bargaining Games: An Evolutionary Investigation of Fundamentals, Strategies, and Business Applications. Faculty of Technology Management, TU/e. 2004-08

N. Goga. Control and Selection Techniques for the Automated Testing of Reactive Systems. Faculty of Mathematics and Computer Science, TU/e. 2004-09

M. Niqui. Formalising Exact Arithmetic: Representations, Algorithms and Proofs. Faculty of Science, Mathematics and Computer Science, RU. 2004-10

A. Löh. Exploring Generic Haskell. Faculty of Mathematics and Computer Science, UU. 2004-11

I.C.M. Flinsenberg. Route Planning Algorithms for Car Navigation. Faculty of Mathematics and Computer Science, TU/e. 2004-12

R.J. Bril. Real-time Scheduling for Media Processing Using Conditionally Guaranteed Budgets. Faculty of Mathematics and Computer Science, TU/e. 2004-13

J. Pang. Formal Verification of Distributed Systems. Faculty of Sciences, Division of Mathematics and Computer Science, VUA. 2004-14

F. Alkemade. Evolutionary Agent-Based Economics. Faculty of Technology Management, TU/e. 2004-15

E.O. Dijk. Indoor Ultrasonic Position Estimation Using a Single Base Station. Faculty of Mathematics and Computer Science, TU/e. 2004-16

S.M. Orzan. On Distributed Verification and Verified Distribution. Faculty of Sciences, Division of Mathematics and Computer Science, VUA. 2004-17

M.M. Schrage. Proxima - A Presentation-oriented Editor for Structured Documents. Faculty of Mathematics and Computer Science, UU. 2004-18

E. Eskenazi and A. Fyukov. Quantitative Prediction of Quality Attributes for Component-Based Software Architectures. Faculty of Mathematics and Computer Science, TU/e. 2004-19

P.J.L. Cuijpers. Hybrid Process Algebra. Faculty of Mathematics and Computer Science, TU/e. 2004-20

N.J.M. van den Nieuwelaar. Supervisory Machine Control by Predictive-Reactive Scheduling. Faculty of Mechanical Engineering, TU/e. 2004-21

E. Ábrahám. An Assertional Proof System for Multithreaded Java – Theory and Tool Support. Faculty of Mathematics and Natural Sciences, UL. 2005-01

R. Ruimerman. Modeling and Remodeling in Bone Tissue. Faculty of Biomedical Engineering, TU/e. 2005-02

C.N. Chong. Experiments in Rights Control - Expression and Enforcement. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2005-03

H. Gao. Design and Verification of Lock-free Parallel Algorithms. Faculty of Mathematics and Computing Sciences, RUG. 2005-04

H.M.A. van Beek. Specification and Analysis of Internet Applications. Faculty of Mathematics and Computer Science, TU/e. 2005-05

M.T. Ionita. Scenario-Based System Architecting - A Systematic Approach to Developing Future-Proof System Architectures. Faculty of Mathematics and Computing Sciences, TU/e. 2005-06

G. Lenzini. Integration of Analysis Techniques in Security and Fault-Tolerance. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2005-07

I. Kurtev. Adaptability of Model Transformations. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2005-08

T. Wolle. Computational Aspects of Treewidth - Lower Bounds and Network Reliability. Faculty of Science, UU. 2005-09

O. Tveretina. Decision Procedures for Equality Logic with Uninterpreted Functions. Faculty of Mathematics and Computer Science, TU/e. 2005-10

A.M.L. Liekens. Evolution of Finite Populations in Dynamic Environments. Faculty of Biomedical Engineering, TU/e. 2005-11

J. Eggermont. Data Mining using Genetic Programming: Classification and Symbolic Regression. Faculty of Mathematics and Natural Sciences, UL. 2005-12

B.J. Heeren. Top Quality Type Error Messages. Faculty of Science, UU. 2005-13

G.F. Frehse. Compositional Verification of Hybrid Systems using Simulation Relations. Faculty of Science, Mathematics and Computer Science, RU. 2005-14

M.R. Mousavi. Structuring Structural Operational Semantics. Faculty of Mathematics and Computer Science, TU/e. 2005-15

A. Sokolova. Coalgebraic Analysis of Probabilistic Systems. Faculty of Mathematics and Computer Science, TU/e. 2005-16

T. Gelsema. Effective Models for the Structure of pi-Calculus Processes with Replication. Faculty of Mathematics and Natural Sciences, UL. 2005-17

P. Zoeteweij. Composing Constraint Solvers. Faculty of Natural Sciences, Mathematics, and Computer Science, UvA. 2005-18

J.J. Vinju. Analysis and Transformation of Source Code by Parsing and Rewriting. Faculty of Natural Sciences, Mathematics, and Computer Science, UvA. 2005-19

M. Valero Espada. Modal Abstraction and Replication of Processes with Data. Faculty of Sciences, Division of Mathematics and Computer Science, VUA. 2005-20

A. Dijkstra. Stepping through Haskell. Faculty of Science, UU. 2005-21

Y.W. Law. Key management and link-layer security of wireless sensor networks: energy-efficient attack and defense. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2005-22

E. Dolstra. The Purely Functional Software Deployment Model. Faculty of Science, UU. 2006-01

R.J. Corin. Analysis Models for Security Protocols. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2006-02

P.R.A. Verbaan. The Computational Complexity of Evolving Systems. Faculty of Science, UU. 2006-03

K.L. Man and R.R.H. Schiffelers. Formal Specification and Analysis of Hybrid Systems. Faculty of Mathematics and Computer Science and Faculty of Mechanical Engineering, TU/e. 2006-04

M. Kyas. Verifying OCL Specifications of UML Models: Tool Support and Compositionality. Faculty of Mathematics and Natural Sciences, UL. 2006-05

M. Hendriks. Model Checking Timed Automata - Techniques and Applications. Faculty of Science, Mathematics and Computer Science, RU. 2006-06

J. Ketema. Böhm-Like Trees for Rewriting. Faculty of Sciences, VUA. 2006-07

C.-B. Breunesse. On JML: topics in tool-assisted verification of JML programs. Faculty of Science, Mathematics and Computer Science, RU. 2006-08

B. Markvoort. Towards Hybrid Molecular Simulations. Faculty of Biomedical Engineering, TU/e. 2006-09

S.G.R. Nijssen. Mining Structured Data. Faculty of Mathematics and Natural Sciences, UL. 2006-10

G. Russello. Separation and Adaptation of Concerns in a Shared Data Space. Faculty of Mathematics and Computer Science, TU/e. 2006-11

L. Cheung. Reconciling Nondeterministic and Probabilistic Choices. Faculty of Science, Mathematics and Computer Science, RU. 2006-12

B. Badban. Verification techniques for Extensions of Equality Logic. Faculty of Sciences, Division of Mathematics and Computer Science, VUA. 2006-13

A.J. Mooij. Constructive formal methods and protocol standardization. Faculty of Mathematics and Computer Science, TU/e. 2006-14

T. Krilavicius. Hybrid Techniques for Hybrid Systems. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2006-15

M.E. Warnier. Language Based Security for Java and JML. Faculty of Science, Mathematics and Computer Science, RU. 2006-16

V. Sundramoorthy. At Home In Service Discovery. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2006-17

B. Gebremichael. Expressivity of Timed Automata Models. Faculty of Science, Mathematics and Computer Science, RU. 2006-18

L.C.M. van Gool. Formalising Interface Specifications. Faculty of Mathematics and Computer Science, TU/e. 2006-19

C.J.F. Cremers. Scyther - Semantics and Verification of Security Protocols. Faculty of Mathematics and Computer Science, TU/e. 2006-20

J.V. Guillen Scholten. Mobile Channels for Exogenous Coordination of Distributed Systems: Semantics, Implementation and Composition. Faculty of Mathematics and Natural Sciences, UL. 2006-21

H.A. de Jong. Flexible Heterogeneous Software Systems. Faculty of Natural Sciences, Mathematics, and Computer Science, UvA. 2007-01

N.K. Kavaldjiev. A run-time reconfigurable Network-on-Chip for streaming DSP applications. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2007-02

M. van Veelen. Considerations on Modeling for Early Detection of Abnormalities in Locally Autonomous Distributed Systems. Faculty of Mathematics and Computing Sciences, RUG. 2007-03

T.D. Vu. Semantics and Applications of Process and Program Algebra. Faculty of Natural Sciences, Mathematics, and Computer Science, UvA. 2007-04

L. Brandán Briones. Theories for Model-based Testing: Real-time and Coverage. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2007-05

I. Loeb. Natural Deduction: Sharing by Presentation. Faculty of Science, Mathematics and Computer Science, RU. 2007-06

M.W.A. Streppel. Multifunctional Geometric Data Structures. Faculty of Mathematics and Computer Science, TU/e. 2007-07

N. Trčka. Silent Steps in Transition Systems and Markov Chains. Faculty of Mathematics and Computer Science, TU/e. 2007-08

R. Brinkman. Searching in encrypted data. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2007-09

A. van Weelden. Putting types to good use. Faculty of Science, Mathematics and Computer Science, RU. 2007-10

J.A.R. Noppen. Imperfect Information in Software Development Processes. Faculty of Electrical Engineering, Mathematics & Computer Science, UT. 2007-11

R. Boumen. Integration and Test plans for Complex Manufacturing Systems. Faculty of Mechanical Engineering, TU/e. 2007-12

A.J. Wijs. What to do Next?: Analysing and Optimising System Behaviour in Time. Faculty of Sciences, Division of Mathematics and Computer Science, VUA. 2007-13

C.F.J. Lange. Assessing and Improving the Quality of Modeling: A Series of Empirical Studies about the UML. Faculty of Mathematics and Computer Science, TU/e. 2007-14