On Effective Testing of Health Care Simulation Software Christian Murphy, M.S. Raunak, Andrew King, Sanjian Chen, Christopher Imbriano, Gail Kaiser, Insup.

On Effective Testing of Health Care Simulation Software

Christian Murphy, M.S. Raunak, Andrew King,

Sanjian Chen, Christopher Imbriano, Gail Kaiser,

Insup Lee, Oleg Sokolsky, Lori Clarke, Lee Osterweil

University of Pennsylvania

Loyola University Maryland

Columbia University

University of Massachusetts Amherst

2 / 27

Overview

Simulation software is used widely in the field of health care

Simulators must not only accurately model the real world, but be free of software defects as well

It is particularly hard to test simulation software because often there is no “test oracle”

Our research shows that defects can be detected by checking whether known properties of the software are violated

3 / 27

Outline

Motivating examples

Overview of testing approach

Study #1: Demonstrating feasibility

Study #2: Measuring effectiveness

Future work & conclusion

4 / 27

Flow of Patients through ED

[Figure: Length of Stay versus Utilization. x-axis: number of beds; y-axes: units of time and percent utilization; series: LOS, Doctor Utilization, Nurse Utilization, Triage Utilization, Clerk Utilization]

Raunak et al., “Simulating patient flow through an emergency department using process-driven discrete event simulation”, SEHC’09

5 / 27

Glycemic Control (Insulin Pump)

King et al., “Prototyping closed loop physiologic control with the Medical Device Coordination Framework”, SEHC’10

6 / 27

Problem Statement

Partial oracles may exist for a limited subset of the input domain in simulation software

Obvious errors (e.g., crashes) can be detected with certain inputs or testing techniques

However, it is difficult to detect subtle computational defects in simulators without test oracles in the general case

7 / 27

What do I mean by “defect”?

Deviation of the implementation from the specification

Violation of a sound property of the software

“Discrete localized” calculation errors: off-by-one; incorrect sentinel values for loops; wrong comparison or mathematical operator

Misinterpretation of specification: parts of input domain not handled; incorrect assumptions made about input

8 / 27

Research Goals

Identify an approach for testing simulation software that is effective even without a test oracle

Reliably detect defects

Increase confidence that the software works

Demonstrate feasibility of the approach

Measure the effectiveness of the approach

10 / 27

Observation

Many programs without oracles have properties such that certain changes to the input yield predictable changes to the output

We can detect defects in these programs by looking for any violations of these “metamorphic properties”

This is known as “metamorphic testing”

T.Y. Chen et al., HKUST Tech Report, 1998

11 / 27

Metamorphic Testing

If new test case output f(t(x)) is as expected, it is not necessarily correct

However, if f(t(x)) is not as expected, either f(x) or f(t(x)) – or both! – is wrong

Initial test case: x → f → f(x)

New test case: t(x) → f → f(t(x))

t: transformation function based on metamorphic properties of f

f(x) and f(t(x)) are “pseudo-oracles”

12 / 27

Metamorphic Testing Example

Consider a function to determine the standard deviation of a set of numbers

Initial input: a, b, c, d, e, f → std_dev → s

New test case #1: c, e, b, a, f, d → std_dev → s ?

New test case #2: a+2, b+2, c+2, d+2, e+2, f+2 → std_dev → s ?

New test case #3: 2a, 2b, 2c, 2d, 2e, 2f → std_dev → 2s ?
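The three new test cases in the standard deviation example can be sketched in Python as follows. This is a minimal illustration, not the authors' tooling: `statistics.pstdev` stands in for the function under test, and `check_metamorphic_properties`, the sample values, and the tolerance are assumptions made for the example.

```python
import math
import random
import statistics


def std_dev(values):
    """Function under test (here simply the library's population std. deviation)."""
    return statistics.pstdev(values)


def check_metamorphic_properties(values, seed=0):
    """Return the names of any violated metamorphic properties (hypothetical helper)."""
    s = std_dev(values)

    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)

    # Each follow-up test case pairs a transformed input with its expected output.
    follow_ups = {
        "permute": (shuffled, s),                       # new test case #1
        "add 2": ([v + 2 for v in values], s),          # new test case #2
        "multiply by 2": ([2 * v for v in values], 2 * s),  # new test case #3
    }
    return [
        name
        for name, (new_input, expected) in follow_ups.items()
        if not math.isclose(std_dev(new_input), expected, rel_tol=1e-9)
    ]


violations = check_metamorphic_properties([3.0, 1.5, 4.0, 1.0, 5.5, 9.0])
print(violations)
```

An empty list means no property was violated. As the previous slide notes, that does not prove the outputs are correct.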

13 / 27

Related Work

Verification of simulation models: O. Balci, 1997 Winter Simulation Conf.; R. Sargent, 2005 Winter Simulation Conf.

Applying metamorphic testing to applications without test oracles: T.Y. Chen et al., Info. and Soft. Tech., 2002

15 / 27

Feasibility Study

Goal: Demonstrate that metamorphic testing is feasible for testing simulation software

We first identify metamorphic properties in the applications of interest

JSim: discrete event simulator (patients in ED)

GCS: glycemic control simulator (insulin pump)

We then apply metamorphic testing and look for defects

16 / 27

Metamorphic Properties

JSim: Flow of patients through ED

Increasing number of resources (e.g., beds) should not increase average patient length of stay

Increasing number of resources should not decrease other resources’ utilization rates

Multiplying the time necessary for each step by a positive constant c should increase the overall time by a factor of c

GCS: glycemic control system (insulin pump)

A patient who weighs more should get more insulin

A patient who produces more endogenous glucose should get more insulin

The modeled insulin absorption rate should vary inversely with the insulin distribution volume
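To illustrate how properties like these become executable checks, here is a sketch against a toy first-come-first-served bed model. It is not JSim: the model, the function `average_los`, and the sample arrival and service times are all assumptions made for illustration. In this toy version the arrival times are scaled together with the service times so that the scaling property holds exactly.

```python
import heapq
import math


def average_los(arrivals, service_times, beds):
    """Toy model (not JSim): each patient occupies one bed for a fixed time.

    Patients are served in arrival order; returns the average length of stay
    (waiting time plus service time).
    """
    free_at = [0.0] * beds  # time at which each bed next becomes free
    heapq.heapify(free_at)
    total = 0.0
    for arrival, service in zip(arrivals, service_times):
        start = max(arrival, heapq.heappop(free_at))  # earliest available bed
        departure = start + service
        heapq.heappush(free_at, departure)
        total += departure - arrival
    return total / len(arrivals)


arrivals = [0, 1, 2, 3, 4, 5]
service_times = [5, 3, 6, 2, 4, 3]

base = average_los(arrivals, service_times, beds=2)

# Property: adding a bed should not increase average length of stay.
more_beds_ok = average_los(arrivals, service_times, beds=3) <= base

# Property (toy version): scaling every time by c scales length of stay by c.
c = 10
scaled = average_los([c * a for a in arrivals], [c * s for s in service_times], beds=2)
scaling_ok = math.isclose(scaled, c * base, rel_tol=1e-9)

print(more_beds_ok, scaling_ok)
```

Neither check needs to know the correct average length of stay; each only compares two runs of the same program.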

17 / 27

JSim Findings

18 / 27

Unexpected JSim Findings

Average LOS with 1 nurse

ID   Arrival Time   Departure Time   Length of Stay
1    2              159              157
2    8              185              177
3    14             197              183
4    20             295              275
5    26             321              295
Average LOS: 217.4

Average LOS with 2 nurses

ID   Arrival Time   Departure Time   Length of Stay
1    2              159              157
2    8              185              177
3    14             194              180
4    20             312              292
5    26             321              295
Average LOS: 220.2
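The arithmetic behind this finding can be reproduced directly from the two tables. The sketch below copies the arrival and departure times from the slide; the variable names and the helper `average_los` are illustrative only.

```python
# (arrival time, departure time) per patient, copied from the two tables
one_nurse = [(2, 159), (8, 185), (14, 197), (20, 295), (26, 321)]
two_nurses = [(2, 159), (8, 185), (14, 194), (20, 312), (26, 321)]


def average_los(visits):
    """Average length of stay: mean of (departure - arrival)."""
    return sum(departure - arrival for arrival, departure in visits) / len(visits)


los_1 = average_los(one_nurse)
los_2 = average_los(two_nurses)

# Metamorphic property: adding a nurse should not increase average LOS.
property_holds = los_2 <= los_1

print(los_1, los_2, property_holds)  # prints: 217.4 220.2 False
```

The property is violated because patient 4 leaves later with two nurses (312 versus 295), which outweighs patient 3 leaving earlier (194 versus 197).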

20 / 27

Measuring Effectiveness

Goal: Estimate the effectiveness of metamorphic testing at detecting defects in simulators

We first systematically seed the software with defects

We then measure the number that are detected

21 / 27

Methodology

Mutation testing was used to seed defects into each application: reverse comparison operators; change math operators; introduce off-by-one errors

For each program, we created multiple versions, each with exactly one mutation

We ignored mutants that yielded outputs that were obviously wrong, caused crashes, etc.

Effectiveness is determined by measuring what percentage of the mutants were “killed”
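The methodology can be sketched on a small scale. The example below hand-writes three single-mutation variants of a standard deviation function and counts how many are killed by the permute, add-a-constant, and multiply-by-a-constant properties. It is an illustration only: the study mutated JSim and GCS, and every function name here is hypothetical.

```python
import math


def std_dev(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def mutant_skip_last(values):
    # Off-by-one: the last element is left out of the sum of squares.
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values[:-1]) / len(values))


def mutant_plus(values):
    # Math operator changed: "-" became "+".
    mean = sum(values) / len(values)
    return math.sqrt(sum((v + mean) ** 2 for v in values) / len(values))


def mutant_denominator(values):
    # Off-by-one in the denominator.
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def is_killed(f, values):
    """A version is killed if any metamorphic property is violated."""
    s = f(values)
    follow_ups = [
        (list(reversed(values)), s),          # permute
        ([v + 2 for v in values], s),         # add a constant
        ([2 * v for v in values], 2 * s),     # multiply by a constant
    ]
    return any(
        not math.isclose(f(x), expected, rel_tol=1e-9) for x, expected in follow_ups
    )


data = [3.0, 1.5, 4.0, 1.0, 5.5, 9.0]
mutants = [mutant_skip_last, mutant_plus, mutant_denominator]
killed = {m.__name__: is_killed(m, data) for m in mutants}
effectiveness = sum(killed.values()) / len(mutants)

print(killed, effectiveness)
```

Here `mutant_denominator` survives because it still satisfies all three properties, so a kill rate below 100% reflects the chosen properties as much as the technique.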

22 / 27

Results

Application         JSim    GCS Control   GCS Patient
Mutants generated   104     306           644
Usable mutants      25      237           487
Mutants detected    25      58            333
Effectiveness       100%    24.4%         68.4%

23 / 27

Analysis: JSim

“Statistical metamorphic testing” useful for killing mutants related to non-deterministic event timing

If timing range is [A, B] and observed mean is μ, then mean μ’ for range [10A, 10B] should be around 10μ

Because of mutant, range is actually [A, B-1]

Over many executions, observed mean μ’ has statistically significant difference from expected mean 10μ
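A minimal sketch of this statistical check follows, under several assumptions: event durations are drawn uniformly from an integer range, the seeded mutant shortens the range to [A, B-1], and a z-score with a deliberately conservative threshold of 4 stands in for whatever significance test the study used. The function names, sample size, and seed are illustrative.

```python
import math
import random
import statistics


def correct_duration(rng, low, high):
    return rng.randint(low, high)        # inclusive range [low, high]


def mutant_duration(rng, low, high):
    return rng.randint(low, high - 1)    # seeded off-by-one: [low, high - 1]


def z_score(draw, low, high, runs=20000, seed=1):
    """Compare the mean over [10*low, 10*high] with 10 times the mean over [low, high]."""
    rng = random.Random(seed)
    expected = [10 * draw(rng, low, high) for _ in range(runs)]
    observed = [draw(rng, 10 * low, 10 * high) for _ in range(runs)]
    std_error = math.sqrt(
        statistics.variance(expected) / runs + statistics.variance(observed) / runs
    )
    return (statistics.mean(observed) - statistics.mean(expected)) / std_error


THRESHOLD = 4.0  # assumed cut-off; larger values mean fewer false alarms

correct_violates = abs(z_score(correct_duration, 1, 10)) > THRESHOLD
mutant_violates = abs(z_score(mutant_duration, 1, 10)) > THRESHOLD

print(correct_violates, mutant_violates)
```

With the mutant, the scaled range becomes [10A, 10B-1] rather than [10A, 10(B-1)], so the observed mean sits about 4.5 above ten times the base mean. That gap is far larger than the sampling error at this sample size.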

24 / 27

Analysis: GCS

Metamorphic testing not as effective in control algorithm (rules for delivering insulin)

Rules are usually of the form “if patient blood sugar is x then adjust infusion rate by y”

Single mutants did not have much effect on overall insulin delivered

These may be detected by more “straightforward” software testing approaches

26 / 27

Future Work

Formalizing the process of identifying metamorphic properties for simulators

Consider the use of metamorphic testing for validation

If a property is violated, does that mean there is a defect, or is the property simply unsound?

If the property is unsound, is this simulator appropriate for the task it is meant to model?

27 / 27

Conclusion

We have demonstrated that metamorphic testing is an effective technique for testing simulation software

It can increase confidence in the implementation

It also helps increase understanding of how the software behaves

On Effective Testing of Health Care Simulation Software

Christian Murphy, University of Pennsylvania


M.S. Raunak, Loyola University Maryland

