Page 1
Controlled Experiments on Pair Programming (PP): Making Sense of Heterogeneous Results

NIK’2010, Gjøvik, 22-24. Nov., 2010

Reidar Conradi, NTNU, Trondheim; Muhammad Ali Babar, ITU, Copenhagen

www.idi.ntnu.no/grupper/su/oss/nik2010-pairprog-slides.ppt

Page 2
Table of Contents

1. Motivation
2. Two PP experiments: Simula vs. Utah
3. Meta-analysis of 18 PP experiments
4. Cost/benefits of PP with extended Utah
5. Reflections
6. Conclusion

Page 3
1. Motivation (1)

• The Agile manifesto and methods are "hot": XP, Scrum, Pair Programming (PP), …
• Similarly with Evidence-Based Software Engineering (EBSE): systematically learn from primary, empirical studies.
• Over 100 secondary, thematic summary papers as Systematic Literature Reviews (SLRs) on Agile Methods, PP, controlled experiments, effort estimation, unit testing, …
• Guidelines to plan, execute and report such studies?
• Always a struggle between rigor and relevance:
  • "In vitro" controlled experiments with students on toy problems of 100-200 LOC.
  • "In vivo" case studies over N years with professionals on MLOC systems.
• And how to disseminate EBSE experience and knowledge in the ICT industry – by PP, job rotation, training, knowledge management, …? Avoid information graveyards!

Page 4
1. Motivation on Pair Programming, intro (2)

• PP: one of the 12 practices in XP, with roots in generous teamwork elsewhere – cf. the book by Gerald Weinberg: "The Psychology of Computer Programming", 1971.
• So, what do we know about PP – either more effort, shorter duration, better "quality" – or just happier people with more shared knowledge and experience, learning from one another?
• Moderate effects on programming-related factors: ±15%.
• What about longitudinal studies of social/cognitive factors to promote learning?

Page 5
1. Motivation, a bit formal (3)

PP: two persons work together as a pair, a "secretary" and a "coach"; seems like a waste of time and PP "yelling" – but …

Research goals and challenges:

RQ1. How to compare the different primary studies of PP? Both PP-specific and lifecycle results.

RQ2. What common metrics could be applied in the primary studies on PP?

Research challenge: How to study and promote the social-cognitive aspects of PP – beyond the fact that PP is "fun"?

Page 6
2. Comparing two PP experiments (1)

• Simula experiment (Arisholm2007): 295 (!!) professionals with different expertise, one session each in randomized groups, done distributedly in 20 companies in 5 countries.
  • Update – in 4 steps – a given Java program of 200-300 LOC, in an easy/complex variant, for a coffee vending machine. Many special rules and hypotheses.
• Utah experiment (Williams2000): 41 4th-year students in randomized groups with different expertise, done in a normal university lab in 4 sessions over a semester.
  • Write a C program of ca. 150 LOC from scratch for a specific (not documented) problem in each session.
• Both: unknown number of defects, using pre-made tests to check results.
• May compare results of soloist/pair groups with low/high expertise.
• Otherwise almost nothing in common – so how to generalize?

Page 7
3. Meta-analysis of 18 PP experiments (1)

Balance the "Project Triangle", with three corners:
• Quality/Correctness (as the main result)
• Effort
• Duration/deadline

Page 8
3. Meta-analysis of PP: massive heterogeneity (2)

No common experiment organization or treatment wrt.:
• Goals and objectives.
• Team composition: software expertise (student vs. practitioner, programming skills), soloists vs. pairs, application task, use of external reviewers, …
• Software technology: Eclipse; UML; Java, C++, C, …
• Randomized vs. self-chosen team composition, …

Three major dependent variables:
• Quality (hmm?) – see next slides
• Duration (hours) – elapsed time
• Effort (person-hours) – often just named "hours"

No volume/productivity measures (LOC, FP, …).

Page 9
3. Meta-analysis of 18 PP experiments (3)

Three main dependent variables: Quality (Correctness), Duration, Effort.

Using special statistical techniques, extended with graphical tree plots (a pooling sketch follows this list).

Still too much missing data and "apples and bananas":

• Quality: data available in 14 studies, i.e. missing in 4 studies!! Also 4 studies with data only for Q. Only 6 studies with all three variables Q, D and E.
• Duration: in 11 studies; together with Q in 8 of them, and with E in 8. 2 studies have only D and E (not interesting!), and 1 has only D.
• Effort: in 11 studies; together with Q in 8 of them, and with D in 8; 1 study has only E.
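For context, the "special statistical techniques" referred to here are typically standardized effect sizes pooled across studies and displayed in forest/tree plots. Below is a minimal, illustrative sketch of such pooling (Hedges' g with inverse-variance weights); the numbers are placeholders, not data from the 18 studies.

```python
# Minimal sketch of how per-study pair-vs-solo differences are typically
# pooled in a meta-analysis: standardize each study's result as Hedges' g,
# then combine with inverse-variance weights.
# The numbers below are illustrative placeholders, NOT data from the 18 PP studies.
from math import sqrt

def hedges_g(mean_pair, mean_solo, sd_pair, sd_solo, n_pair, n_solo):
    """Standardized mean difference with Hedges' small-sample correction."""
    df = n_pair + n_solo - 2
    pooled_sd = sqrt(((n_pair - 1) * sd_pair**2 + (n_solo - 1) * sd_solo**2) / df)
    d = (mean_pair - mean_solo) / pooled_sd
    g = d * (1 - 3 / (4 * df - 1))
    var_g = (n_pair + n_solo) / (n_pair * n_solo) + g**2 / (2 * (n_pair + n_solo))
    return g, var_g

# (g, variance) per study -- placeholder values for, e.g., a Quality measure
studies = [hedges_g(0.85, 0.75, 0.10, 0.12, 20, 20),
           hedges_g(0.70, 0.72, 0.15, 0.14, 15, 15)]

weights = [1 / v for _, v in studies]            # inverse-variance weights
pooled = sum(w * g for (g, _), w in zip(studies, weights)) / sum(weights)
print(f"pooled effect size g = {pooled:.2f}")
```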


Page 10
3. Meta-analysis of 18 PP experiments (4)

Studies and available dependent variables – Quality (in 14 studies), Duration (in 11), Effort (in 11):

P1  Arisholm et al. (2007):           Q, D, E
P2  Baheti et al. (2002):             Q, D, -
P3  Canfora et al. (2005):            - (Q missing), D, E
P4  Canfora et al. (2007):            Q, D, E
P5  Domino et al. (2007):             Q (only variable in this study), -, -
P6  Heiberg et al. (2003):            Q, D, -
P7  Madeyski (2006):                  Q (only variable in this study), -, -
P8  Madeyski (2007):                  Q (only variable in this study), -, -
P9  Müller (2005):                    Q, D, E
P10 Müller (2006):                    Q, D, E
P11 Nawrocki & Wojciechowski (2001):  - (Q missing), D, E
P12 Nosek (1998):                     Q, D, E
P13 Phongpaibul & Boehm (2006):       Q, -, E
P14 Phongpaibul & Boehm (2007):       Q, -, E
P15 Rostaher & Hericko (2002):        - (Q missing), D (only variable in this study), -
P16 Vanhanen & Lassenius (2005):      - (Q missing), -, E (only variable in this study)
P17 Williams et al. (2000):           Q (only variable in this study), -, -
P18 Xu & Rajlich (2006):              Q, D, E
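The availability counts on the previous slide can be reproduced directly from this table; a small sketch follows (Q/D/E flags transcribed from the rows above):

```python
# Sketch: reproduce the availability counts from the table above.
# Q/D/E flags are transcribed directly from the rows P1..P18.
available = {
    "P1": "QDE", "P2": "QD",  "P3": "DE",  "P4": "QDE", "P5": "Q",
    "P6": "QD",  "P7": "Q",   "P8": "Q",   "P9": "QDE", "P10": "QDE",
    "P11": "DE", "P12": "QDE", "P13": "QE", "P14": "QE", "P15": "D",
    "P16": "E",  "P17": "Q",  "P18": "QDE",
}

count_q = sum("Q" in v for v in available.values())   # 14 studies with Quality
count_d = sum("D" in v for v in available.values())   # 11 with Duration
count_e = sum("E" in v for v in available.values())   # 11 with Effort
all_three = [k for k, v in available.items() if set("QDE") <= set(v)]
only_q = [k for k, v in available.items() if v == "Q"]

print(count_q, count_d, count_e)    # 14 11 11
print(all_three)                    # 6 studies: P1, P4, P9, P10, P12, P18
print(only_q)                       # 4 studies: P5, P7, P8, P17
```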

Page 11
3. Meta-analysis, chaotic PP results (5)

Much too much heterogeneity:
• Programming task (part of treatment): 18 different variants.
• Quality (Correctness) measured in 12 different ways, by:
  • OO quality metrics,
  • quality rating of changed UML designs,
  • a mixture of OO metrics and questions to the programmers,
  • test coverage as the share of branches executed,
  • quality of requirements for later phases, using PP vs. inspections,
  • defining a threshold value for the share of test cases passed,
  • recording the actual share of test cases passed (4 cases, incl. Utah),
  • a mixture of test cases passed and correct programmer (?) answers about two programs,
  • counting the share of similar teams with all tests passed – plus final approval by an external censor (Simula),
  • grading program performance using an external censor,
  • normal student grades, or
  • simply missing (4 cases).

Maybe "success factor" would have been a better term?

Page 12
3. Meta-analysis, more chaotic PP results (6)

Too much heterogeneity (cont'd):

• Duration (clock hours, with four variants), as the time from start until:
  • the team itself decided to stop,
  • a pre-set time limit expired,
  • a threshold test level was achieved, or
  • the team had passed all tests and a censor had approved the work (Simula).

• Effort (person-hours, with two variants – see the small sketch below):
  • Duration multiplied by one for soloists or by two for pairs – for all studies.
  • But ignoring correction effort that does not improve quality (in Simula – though almost impossible to record).
  • Many primary papers also use "time" ambiguously for both Duration and Effort.
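The effort definition used across the studies reduces to one simple rule; a one-line sketch (the function name is mine, not from any of the papers):

```python
# Effort rule used across the studies: duration times team size
# (1 for a soloist, 2 for a pair). Function name is illustrative only.
def effort_person_hours(duration_hours: float, is_pair: bool) -> float:
    return duration_hours * (2 if is_pair else 1)

print(effort_person_hours(10, is_pair=True))   # 20 person-hours
print(effort_person_hours(10, is_pair=False))  # 10 person-hours
```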

Page 13
3. Meta-analysis of 18 PP experiments, data (7)

(Per study: Ntot; Stud./Prof.; Lang.; Domain; [Indep. variables;] Quality = ?)

P1 Arisholm et al. (2007): 295; Prof.; Java; coffee vending machine, maintenance extensions; Indep. vars: solo/pair, random, prog. expertise (S/M/L) + prog. skills, task complexity (S/L); Quality = % of teams in a group with all tests passed
P2 Baheti et al. (2002): 134; Stud.; Smalltalk or Java; OO prog. project; Quality = student grades (!)
P3 Canfora et al. (2005): 24; Stud.; LL; ??; Quality = (Q missing)
P4 Canfora et al. (2007): 18; Prof.; UML; ??; Quality = quality rating of changed UML design, scale 0..3
P5 Domino et al. (2007): 88; Prof. & Stud.; LL; ??; Quality = mixture of % test cases passed and % correct answers given for two programs
P6 Heiberg et al. (2003): 84/66; Stud.; LL; gamer system; Quality = % test cases passed
P7 Madeyski (2006): 188; Stud.; LL; financial system; Quality = some OO quality metrics
P8 Madeyski (2007): 98; Stud.; Java; Quality = test coverage by % branches executed
P9 Müller (2005): 38; Stud.; LL; Polynomial & Shuffle Puzzle; Quality = % test cases passed

Page 14
3. Meta-analysis of 18 PP experiments, data (8)

(Per study: Ntot; Stud./Prof.; Lang.; Domain; Quality = ?)

P10 Müller (2006): 18/16; Stud.; UML & LL; elevator system; Quality = effort to achieve the same level of correctness (i.e. the same % test cases passed)
P11 Nawrocki & Wojciechowski (2001): 15; Stud.; C++; 4 lab sessions; Quality = (Q missing)
P12 Nosek (1998): 15; Prof.; LL; DB consistency; Quality = grades of program performance, given by an external censor
P13 Phongpaibul & Boehm (2006): 95; Stud.; LL; class assignment; Quality = % test cases passed
P14 Phongpaibul & Boehm (2007): 36; Stud.; LL; team project, extend a system; Quality = dev. effort of rqmts with inspections vs. PP, plus that of later implementation
P15 Rostaher & Hericko (2002): 16; Prof.; LL; 4 user stories; Quality = (Q missing)
P16 Vanhanen & Lassenius (2005): 20; Stud.; LL (3-4000 LOC); team project; Quality = (Q missing)
P17 Williams et al. (2000): 41; Stud.; C?; 4 class assignments; Quality = % test cases passed
P18 Xu & Rajlich (2006): 12; Stud.; LL; class assignment; Quality = mixture of OO metrics and a survey with both concrete and general qualitative questions

Page 15
3. Meta-analysis, more SLRs on Agile Methods (9)

(Per study: context; productivity of XP vs. traditional; quality of XP vs. traditional)

S07 (Dalcher05), experiment: 4 different lifecycle models, students, 1 year, 15 teams, team size 3-4; productivity 3 -> 13.1 LOC/h for XP, but XP gives 3.5x more code; quality not recorded (in SLR).
S10 (Ilieva04), case study: two similar projects, professionals, 900 h, team size 4; productivity 3.8 -> 5.4 LOC/h for XP, mostly in 1st of 3 releases; 13% fewer defect reports by XP teams.
S14 (Layman04), case study: reimplementation project, professionals, 3.5 months, with 3 of 10 previous team members; productivity 300 -> 340 LOC/month for XP; much better pre/post-delivery quality/defects by XP team.
S15 (Macias03), experiment: 10 XP vs. 10 traditional teams, students, 1 semester, team size 4-5; no difference in productivity; no difference in quality/defects.
S32 (Wellington05), experiment: one XP and one traditional team, students, 1 semester, team size 16; traditional team wrote 78% more code than XP; XP team delivered much better quality.

Page 16
4. Cost/benefits of Utah experiment (1)

• Laurie Williams' PhD thesis from U. Utah, 2000, later extended with a Net Present Value (NPV) model by Hakan Erdogmus in 2003.
• Surprisingly small differences between soloists and pairs:
  • Soloists: 25 LOC (C code)/p-h, 75% of pre-delivery tests passed.
  • Pairs: 21.7 LOC (C code)/p-h, 85% of pre-delivery tests passed.
• #Pre-delivery defects: ??; correction effort (as testing) hidden in the Coding effort.
• #Post-delivery defects: ??; correction effort: ?? – see below.
• Solution: take industrial data from Capers Jones' book from 1997 (150 systems, median system size 2.15 MLOC!!!):
  • 8 function points (FPs) => 1000 LOC of C code (500 LOC Java).
  • 1 FP has 5 defects; 85% are corrected pre-delivery, so 15% must be corrected post-delivery, at an effort of 34 p-h/defect.

Page 17
4. Cost/benefits of Utah PP experiment (2)

Effort estimates for coding + post-delivery correction of coding defects (per 1000 LOC):

• Coding: soloist 40 p-h (1000 LOC / 25); pairs 46 p-h (1000 LOC / 21.7).
• Correcting post-delivery coding defects: soloist 10 of 40 defects, at 340 (10 * 34.0) p-h; pairs 6 of 40 defects, at 204 (6 * 34.0) p-h.
• Total: soloist 380 p-h; pairs 250 p-h (best!!).
• Total revised (next slide): soloist 40 + 340*0.103 = 40 + 35 = 75 p-h; pairs 46 + 204*0.103 = 46 + 21 = 67 p-h (yep!).
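These numbers follow from the Capers Jones figures on the previous slide (8 FP per 1000 LOC of C, 5 defects/FP, 34 p-h per post-delivery defect) combined with the Utah rates; a minimal sketch that reproduces them (variable names are mine):

```python
# Sketch reproducing the effort estimates above, using the Capers Jones
# figures from the previous slide and the Utah productivity/test-pass rates.
LOC = 1000
DEFECTS = 8 * 5                     # 8 FP * 5 defects/FP = 40 defects per 1000 LOC
FIX_PH = 34.0                       # p-h to fix one post-delivery defect (Jones, 1997)

def estimate(loc_per_ph, tests_passed):
    coding = LOC / loc_per_ph                       # coding effort, p-h
    post_defects = DEFECTS * (1 - tests_passed)     # defects escaping to post-delivery
    correction = post_defects * FIX_PH
    return coding, correction, coding + correction

print(estimate(25.0, 0.75))    # soloist: (40.0, 340.0, 380.0)
print(estimate(21.7, 0.85))    # pair:    (~46.1, ~204.0, ~250.1)
```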

Page 18
4. Cost/benefits of Utah PP experiment (3)

TWO ERRORS in the effort estimates, due to toy programs:

1. Too high post-delivery defect rate: 1 FP has 1.75 coding defects, not 5.0 defects of all kinds – i.e. toy programs have no faulty requirements, wrong test cases, …
2. The correction effort for a post-delivery coding defect is much too high when using the 34 p-h from Capers Jones:
   • Toy program: 25 LOC/p-h, 5-10 p-h per coding-defect correction.
   • Realistic one: 5 LOC/p-h, 10-80 p-h per all-defects correction.

So we must reduce the post-delivery coding-defect correction estimates by (1.75/5.0) * (10/34) = 0.103 (i.e. divide by roughly a factor of 10)!

NB: interest rates have been omitted from the NPV model here.

Even with the reduced defect-correction prognosis, PP looks promising.
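The reduction factor above, and the revised totals in the table two slides back, follow directly; a small sketch (continuing the assumptions above, variable names are mine):

```python
# Sketch of the two corrections argued above:
# 1. only 1.75 of the 5 defects/FP are coding defects, and
# 2. fixing a post-delivery coding defect in a toy program costs ~10 p-h,
#    not the 34 p-h industrial figure.
factor = (1.75 / 5.0) * (10 / 34)        # ~0.103, i.e. roughly "divide by 10"
print(round(factor, 3))                   # 0.103

# Revised totals from the earlier effort table:
print(40 + 340 * factor)                  # soloist: ~75 p-h
print(46 + 204 * factor)                  # pair:    ~67 p-h
```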

Page 19
4. Example: effort per software development lifecycle phase – the "4-2-4" rule (4)

Phases: Rqmts/Analysis/Design | Coding | Testing: unit, functional, integration | Testing: system

• SW Eng. projects – relative effort (since the 1980s!): 40% | 20% | 15% | 25%
  – or absolute effort, with "LOC" as the money unit: about 5 LOC per total p-h.
• Toy programs (like the PP experiments): (0) | 25 LOC per total p-h | ++ (mini-tests) | (0)
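One way to read the table: if coding accounts for only about 20% of total project effort, then a coding rate of 25 LOC per coding p-h corresponds to roughly 5 LOC per total p-h. A minimal sketch of that conversion (my reading of the table, not a formula from the slides):

```python
# Sketch: convert coding-only productivity to whole-lifecycle productivity
# using the relative effort split on this slide (the "4-2-4" rule).
effort_share = {"rqmts/analysis/design": 0.40,
                "coding": 0.20,
                "unit/func/integration test": 0.15,
                "system test": 0.25}

coding_rate = 25                      # LOC per coding p-h (toy programs / PP experiments)
lifecycle_rate = coding_rate * effort_share["coding"]
print(lifecycle_rate)                 # 5.0 LOC per total p-h, matching the table
```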

Page 20
4. Example: complete effort for software development (5)

Categories: RAD (Rqmts/Analysis/Design) + Coding + pre-system testing | System testing | Management | Other (SCM, process, documentation, …)

• Issues covered by traditional effort estimates: 33% + 17% + 10%.
• System issues (ca.): 20%, 10%, 20%.

Page 21
5. Reflections (1)

RQ1: How to promote more high-quality (rigorous, "comparable") primary studies?

• Better teaching?
• Refusing to publish poor empirical papers?
• But: how can a PhD student get sufficient credit for being, e.g., the 73rd replicator of experiment XX and the 27th replicator of case study YY?
• Lastly, can ISERN (International Software Engineering Research Network, http://isern.iese.de) play a role as an advisor, coordinator, or shared archive here? Treatments cannot be made public, either!
• Above all: cultural changes are needed!
  • Meta-study cooperation in Cochrane style?
  • Must demonstrate practical use.

Page 22
5. Reflections (2)

RQ2: How to define a partially standardized set of metrics for software? (A sketch of such a record follows the list below.)

• Code size: a volume measure for code (in LOC or function points). Also neutral, automatic and free measurement tools to perform reliable size calculations.
  NB: "much" code is not a goal in itself, only the minimum amount needed to satisfy the specifications. Also distinguish between new development (as in Utah) and maintenance (as in Simula).
• Quality: e.g. defined as defect rate (measured pre- or post-delivery).
• Effort (p-h) – must account for software reuse.
• Duration (wall-clock hours).
• Personal background and similar context.
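As an illustration only, here is a minimal sketch of what a common per-study reporting record along these lines could look like; the field names are mine, not an agreed standard, and the example values are taken from the Utah study as described earlier in these slides:

```python
# Sketch of a common per-study metrics record along the lines proposed above;
# field names are illustrative, not an agreed standard.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PPStudyRecord:
    study_id: str
    participants: int
    background: str                         # e.g. "students", "professionals"
    task_type: str                          # e.g. "new development", "maintenance"
    code_size_loc: Optional[int]            # volume, measured by a neutral tool
    defect_rate: Optional[float]            # quality: defects, pre- or post-delivery
    effort_person_hours: Optional[float]    # must account for reuse
    duration_hours: Optional[float]         # wall-clock time

example = PPStudyRecord("P17", 41, "students", "new development",
                        code_size_loc=150, defect_rate=None,
                        effort_person_hours=None, duration_hours=None)
print(example)
```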

Page 23
5. Reflections (3)

Research Challenge: how to study and promote the social and cognitive aspects of PP?

• Researchers should rather study the qualitative, socio-technical benefits of PP, since the quantitative effects on defects and effort seem slight anyway (±15%).
• Such aspects should then be studied in longitudinal case studies in industrial settings – and go deeper than finding that PP is "fun" to do.
• So: more PP studies wrt. learning, human factors, teamware (e.g. job rotation), and knowledge management.
• And it is not clear whether PP is well suited to help novices learn from experts, or whether other team structures are more suitable.

Page 24
6. Conclusion

Serious problems in aggregating findings from PP experiments, due to heterogeneous contextual variables and missing data.

Hence, using SLRs for EBSE becomes difficult.

Since the overall costs/benefits of adopting PP seem slight (±10-15%), we may rather apply and investigate PP to promote teamwork and learning.

We need experience-based cost/benefit models and estimates to compose teams optimally.