Measuring System Behaviour in the Field
Brendan Murphy
Microsoft Research, Cambridge
[email protected]
Jan 13, 2016
Agenda
- History of monitoring systems in the field.
- Characterizing the behaviour of individual systems.
- Characterizing the behaviour of multiple systems and applications.
- Problems and opportunities.
Background to field measurements: computer manufacturers (mid-80s)
- Hardware failure rates were improving.
- Differences between theoretical and actual reliability.
- Software reliability was becoming a bigger driver of overall system reliability.
- Changing customer profile, and therefore changing expectations.
Initial observations from analysing system behaviour
- Hardware reliability could be measured.
- Software reliability was more difficult to measure:
  - Crash rate could be measured but was difficult to interpret: was the crash due to a defect or an operator error?
  - The software life cycle impacts its failure rate.
- Operator errors started to become more important (see Jim Gray's paper from the early 90s).
- Still unclear how to use metrics as a measurement of "goodness".
Does the following represent goodness?
Failure breakdown, by service company, of systems in Microsoft.

Cause           % of system failures   % of downtime
Hardware        25%                    35%
Software        66%                    61%
Network         7%                     3%
Maintenance     1%                     0%
No Explanation  1%                     1%
Measuring system reliability: the need for filtering
- Reliability calculations are impacted by clusters of crashes (NT data collected from DOT COM sites).

[Figure: Distribution of the length of system uptime — time between system events, 0–195 minutes; series: Bluescreens, System Reboots]
[Figure: Distribution of the length of system uptime — system uptime between events (minutes); series: Bluescreens, System Reboots, VAX Crashes, VAX Reboots]
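The filtering described above — collapsing a burst of crashes into a single incident, so a machine that bluescreens ten times in an hour does not count as ten failures — can be sketched as follows. The event representation and the 30-minute window are illustrative assumptions, not the original tooling.

```python
from datetime import datetime, timedelta

def collapse_clusters(event_times, window=timedelta(minutes=30)):
    """Merge events separated by less than `window` into one incident.

    Rapid crash/reboot cycles would otherwise inflate the failure
    count: many crashes minutes apart share one underlying cause.
    """
    incidents = []
    for t in sorted(event_times):
        if incidents and t - incidents[-1][-1] < window:
            incidents[-1].append(t)   # same incident: extend the cluster
        else:
            incidents.append([t])     # gap is large enough: new incident
    return incidents

crashes = [datetime(2016, 1, 4, 2, 0),
           datetime(2016, 1, 4, 2, 10),   # 10 min later: same incident
           datetime(2016, 1, 4, 2, 25),
           datetime(2016, 1, 9, 14, 0)]   # days later: new incident
print(len(collapse_clusters(crashes)))    # 2 incidents, not 4 events
```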
Reliability measurement: events to measure
System crashes/panics/bluescreens
- Good points: each event represents a defect.
- Bad points: does not include hangs; more a measure of fault management.
System reboots
- Good points: captures all defects.
- Bad points: captures all system management activity; can only be applied to servers.
The definition of a system event is operating-system dependent.
- A crash is an action taken by the system fault management that shuts down the system gracefully and writes the cause to a dump file and an event log.
  - Note: the event logs for UNIX and NT are derived from the VMS event log.
- A system outage is captured by a reboot event occurring in the event log.
- A hang can sometimes be recognized by a lack of outage information.
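The distinctions above (crash vs clean shutdown vs suspected hang) can be sketched by checking what, if anything, was logged immediately before each reboot. The event-type names here are illustrative, not the actual VMS or NT log schema.

```python
def classify_reboot(preceding_events):
    """Classify a reboot by the last event logged before it.

    preceding_events: event-type strings, oldest first, read from the
    log between the previous boot and this reboot.
    """
    last = preceding_events[-1] if preceding_events else None
    if last == "crash_dump":     # fault management wrote a dump: a crash
        return "crash"
    if last == "shutdown":       # a graceful shutdown was logged
        return "clean_shutdown"
    return "suspected_hang"      # no outage information: likely a hang

print(classify_reboot(["login", "crash_dump"]))  # crash
print(classify_reboot(["login", "shutdown"]))    # clean_shutdown
print(classify_reboot(["login"]))                # suspected_hang
```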
System availability measurement
- Using data in the event log:
  - If a shutdown and a reboot event are captured, availability is easy to calculate.
  - If only reboot events exist, use timestamps, or use the last event prior to the shutdown.
- Tools to monitor availability:
  - Pinging the system (dependent upon network availability).
  - A background process continually logging timestamps.
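The event-log calculation above can be sketched as follows: pair each shutdown (or the last event before the gap) with the following reboot, sum the downtime, and divide by the measurement period. Representing outages as minute offsets is an illustrative assumption.

```python
def availability(outages, period_minutes):
    """Fraction of the period the system was up.

    outages: list of (down_at, up_at) minute offsets within the
    period, e.g. paired shutdown/reboot events from the event log.
    """
    downtime = sum(up - down for down, up in outages)
    return 1.0 - downtime / period_minutes

# One week (10,080 minutes) with a 30-minute and a 12-minute outage:
week = 7 * 24 * 60
print(round(availability([(100, 130), (5000, 5012)], week), 4))  # 0.9958
```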
Interpreting system availability (VAX 6000): the problems start.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: Availability distribution captured during 1998 — systems sorted by availability (0–99% of systems) against availability from 99.0% to 100.0%; series: Peak, 24x7]

This level of availability implies the systems are unlikely to be in a production environment!
Measuring availability
- Ignore workstations/clients.
- "Intelligently" filter out long outages.
- "Intelligently" filter out non-production systems.
- Differentiate between system maintenance outages and those due to 'reliability'.
- Capture the cause of each outage from the system managers; beware, they do not always tell the truth!
- Assume usage by the time of the event.
Assuming usage based on the day of the event.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: Distribution of system outages by day of week (Monday–Sunday), VAX 6000 systems; series: System Outages, System Crashes]
[Figure: Distribution of outages on DOT COM sites by day of week (Monday–Sunday), Windows 2000 systems; series: System Reboots, Bluescreens]
Distribution of system outages based on the time of the event (measured Monday–Friday).
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: Distribution of outages on DOT COM sites by hour of day (Monday–Friday), Windows 2000; series: System Outages, Bluescreens]
[Figure: Distribution of system outages by hour of day (0–23, Monday–Friday), VAX 6000; series: System Outages, System Crashes]
Generic rules for measuring system behaviour
- All reliability analysis requires a filter.
- Workstation/client behaviour can only be characterized by its crash rate.
- Reliability of servers can be measured using either the crash rate or the system outage rate.
- Availability is most accurately measured during peak usage periods.
- Site availability is calculated using the median availability, or through removing outliers.
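The last rule can be made concrete: summarize a site by the median availability across its systems, or by a trimmed mean that drops the outliers (such as the non-production machines seen earlier). A minimal sketch with made-up numbers:

```python
def site_availability(per_system, trim=0.1):
    """Median and trimmed-mean summaries of per-system availability."""
    xs = sorted(per_system)
    n = len(xs)
    median = xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2
    k = int(n * trim)                  # drop the k lowest and k highest
    trimmed = xs[k:n - k] if k else xs
    return median, sum(trimmed) / len(trimmed)

# One 62%-available test machine would drag a plain mean far below
# what the production systems actually deliver.
avail = [0.62, 0.9980, 0.9985, 0.9988, 0.9990, 0.9992, 0.9993,
         0.9995, 0.9996, 0.9999]
median, trimmed = site_availability(avail)
print(round(median, 4), round(trimmed, 4))
```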
The hard part: interpretation!
- Availability and reliability are seasonal.
- Reliability of large servers is affected by their life cycle.
- Software reliability is affected by its time since installation and also by its life cycle.
- Comparisons between different products are very difficult.
Reliability of servers over their life cycle.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: VAX 7600 life cycle, Q4 1993 – Q1 1998 — system crash rate (0–6) and system outage rate (0–90) per quarter; series: System Crashes, System Outages]
Impact of installation on (VMS) operating system behaviour
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: Post-installation behaviour — rate of events (0–400) over weeks 1–10 from installation; series: System Outages, System Crashes]
Impact of the (VMS) software life cycle on its reliability.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: Operating system life cycle — system crash rate (0–8) and system outage rate (0–70) over the 1st to 4th six months following release; series: System Crashes, System Outages]

Operating system behaviour improves with age?
Few new patches are produced 6 months after the release of any version of the operating system.
Comparing (VMS) system reliability using data collected from the same point in the life cycle.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: OpenVMS metrics (running on VAX systems), post-installation behaviour — rate of system outages (0–100) by operating system version (V5.5, V5.5-1, V5.5-2, V6.0, V6.1, V6.2, V7.0, V7.1); series: Upper Confidence Bound, Average, Lower Confidence Bound]

Only includes systems installing the OS within 6 months of release.
Overall rules for characterizing system behaviour
- Hardware failure rates can be fully characterized.
- Software behaviour is characterized by its reliability, availability and instability.
- As software matures, crash rates are no longer the definitive measure of reliability.
- User perception is not necessarily reality.
- Comparisons between versions of software can be performed, with care.
- Comparisons between products are difficult, and require knowledge of product and usage characteristics.
Monitoring applications running across distributed systems
- Three perspectives of system behaviour:
  - Application behaviour on individual systems.
  - Total application behaviour from the system manager's perspective.
  - Total application behaviour from the user's perspective.
- Four measurements: reliability, availability, instability, degradation.
Problems associated with 'solution' analysis
- The relative importance of the metrics varies between users of the metrics.
- The users do not have a consistent set of requirements for the application.
- The configuration of the distributed solution changes over time.
- Very little research has been performed into the behaviour of solutions on customer sites, i.e. big opportunities.
Examples of analysis of a 'distributed' solution: VMS clusters
- The behaviour of individual systems was captured.
- The configuration of the cluster was captured.
- Correlating the data gives the cluster behaviour.

[Diagram: outage timelines for Node A, Node B and Node C — when is the cluster down?]
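The correlation step above can be sketched as interval arithmetic: if the cluster survives as long as any node is up, the cluster is down only while every node's outage intervals overlap, so cluster downtime is the intersection of the per-node outage sets. The interval representation is an illustrative assumption.

```python
def cluster_down_intervals(node_outages):
    """node_outages: per node, a list of (start, end) outage intervals.

    The cluster is down only when ALL nodes are down at once, so we
    intersect the nodes' outage interval sets pairwise.
    """
    def intersect(a, b):
        out = []
        for s1, e1 in a:
            for s2, e2 in b:
                s, e = max(s1, s2), min(e1, e2)
                if s < e:             # the two outages overlap
                    out.append((s, e))
        return out

    down = node_outages[0]
    for outages in node_outages[1:]:
        down = intersect(down, outages)
    return down

# Node A down 10-50, node B down 40-60, node C down 45-55:
# the whole cluster is only down 45-50.
print(cluster_down_intervals([[(10, 50)], [(40, 60)], [(45, 55)]]))
```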
Characterizing VMS cluster behaviour.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: OpenVMS VAX cluster behaviour — annual rate of outages (0–60) and average downtime (0–3.5) against the number of servers in the cluster (1–6); series: Cluster Reliability, Cluster Downtime]
Characterizing VMS cluster behaviour: characterizing instability (recoverability).
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: OpenVMS cluster behaviour using a 1-week filter — annual rate of outages (0–8) and period of instability (0–100) against the number of servers in the cluster (1–6); series: Cluster Reliability, Periods of Instability]
Opportunities for research into characterizing solution behaviour
- Developing metrics to characterize solution behaviour.
- Understanding the relationships between the metrics, e.g. identifying network availability as the difference between end-user and system availability.
- Correlating the relationship between configuration and end-user behaviour.
- The difficulty is monitoring production sites.
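The network-availability idea mentioned above can be made concrete as a first-order estimate: if the system itself measures 99.9% available but end users only see the service 99.5% of the time, the gap is attributable to the network. The function name and numbers are illustrative, and this ignores other components in the path.

```python
def network_unavailability_estimate(system_avail, user_avail):
    """First-order estimate: attribute the gap between what the
    system reports and what the end user experiences to the network.
    Assumes the network is the only other component in the path.
    """
    if user_avail > system_avail:
        raise ValueError("end-user availability cannot exceed system availability")
    return system_avail - user_avail

# System at 99.9%, end users at 99.5%: ~0.4% lost to the network.
print(round(network_unavailability_estimate(0.999, 0.995), 4))  # 0.004
```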