Measuring System Behaviour in the Field
Brendan Murphy
Microsoft Research, Cambridge
[email protected]
Jan 13, 2016
Agenda
- History of monitoring systems in the field.
- Characterizing the behaviour of individual systems.
- Characterizing the behaviour of multiple systems and applications.
- Problems and opportunities.
Background to field measurements: computer manufacturers (mid-80s)
- Hardware failure rates were improving.
- Differences between theoretical and actual reliability.
- Software reliability was becoming a bigger driver of overall system reliability.
- Changing customer profile, and therefore changing expectations.
Initial observations from analysing system behaviour
- Hardware reliability could be measured.
- Software reliability was more difficult to measure:
  - Crash rate could be measured but was difficult to interpret: was the crash due to a defect or an operator error?
  - The software life cycle impacts its failure rate.
- Operator errors started to become more important (see Jim Gray's paper from the early 90s).
- Still unclear how to use metrics as a measurement of "goodness".
Does the following represent goodness?
Failure breakdown, by service company, of systems in Microsoft.

Cause           % of system failures   % of downtime
Hardware        25%                    35%
Software        66%                    61%
Network         7%                     3%
Maintenance     1%                     0%
No Explanation  1%                     1%
Measuring system reliability: the need for filtering
- Reliability calculations are impacted by clusters of crashes (NT data collected from DOT COM sites).

[Figure: Distribution of the length of system uptime — time between system events, 0–195 minutes; series: Bluescreens, System Reboots]
[Figure: Distribution of the length of system uptime — system uptime between events (minutes); series: Bluescreens, System Reboots, VAX Crashes, VAX Reboots]
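The filtering described above — collapsing a burst of crashes into a single incident, so a machine that bluescreens ten times in an hour does not count as ten failures — can be sketched as follows. The event representation and the 30-minute window are illustrative assumptions, not the original tooling.

```python
from datetime import datetime, timedelta

def collapse_clusters(event_times, window=timedelta(minutes=30)):
    """Merge events separated by less than `window` into one incident.

    Rapid crash/reboot cycles would otherwise inflate the failure
    count: many crashes minutes apart share one underlying cause.
    """
    incidents = []
    for t in sorted(event_times):
        if incidents and t - incidents[-1][-1] < window:
            incidents[-1].append(t)   # same incident: extend the cluster
        else:
            incidents.append([t])     # gap is large enough: new incident
    return incidents

crashes = [datetime(2016, 1, 4, 2, 0),
           datetime(2016, 1, 4, 2, 10),   # 10 min later: same incident
           datetime(2016, 1, 4, 2, 25),
           datetime(2016, 1, 9, 14, 0)]   # days later: new incident
print(len(collapse_clusters(crashes)))    # 2 incidents, not 4 events
```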
Reliability measurement: events to measure
System crashes/panics/bluescreens
- Good points: each event represents a defect.
- Bad points: does not include hangs; more a measure of fault management.
System reboots
- Good points: captures all defects.
- Bad points: captures all system management activity; can only be applied to servers.
The definition of a system event is operating-system dependent.
- A crash is an action taken by the system fault management that shuts down the system gracefully and writes the cause to a dump file and an event log.
  - Note: the event logs for UNIX and NT are derived from the VMS event log.
- A system outage is captured by a reboot event occurring in the event log.
- A hang can sometimes be recognized by a lack of outage information.
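The distinctions above (crash vs clean shutdown vs suspected hang) can be sketched by checking what, if anything, was logged immediately before each reboot. The event-type names here are illustrative, not the actual VMS or NT log schema.

```python
def classify_reboot(preceding_events):
    """Classify a reboot by the last event logged before it.

    preceding_events: event-type strings, oldest first, read from the
    log between the previous boot and this reboot.
    """
    last = preceding_events[-1] if preceding_events else None
    if last == "crash_dump":     # fault management wrote a dump: a crash
        return "crash"
    if last == "shutdown":       # a graceful shutdown was logged
        return "clean_shutdown"
    return "suspected_hang"      # no outage information: likely a hang

print(classify_reboot(["login", "crash_dump"]))  # crash
print(classify_reboot(["login", "shutdown"]))    # clean_shutdown
print(classify_reboot(["login"]))                # suspected_hang
```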
System availability measurement
- Using data in the event log:
  - If a shutdown and a reboot event are captured, availability is easy to calculate.
  - If only reboot events exist, use timestamps, or use the last event prior to the shutdown.
- Tools to monitor availability:
  - Pinging the system (dependent upon network availability).
  - A background process continually logging timestamps.
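The event-log calculation above can be sketched as follows: pair each shutdown (or the last event before the gap) with the following reboot, sum the downtime, and divide by the measurement period. Representing outages as minute offsets is an illustrative assumption.

```python
def availability(outages, period_minutes):
    """Fraction of the period the system was up.

    outages: list of (down_at, up_at) minute offsets within the
    period, e.g. paired shutdown/reboot events from the event log.
    """
    downtime = sum(up - down for down, up in outages)
    return 1.0 - downtime / period_minutes

# One week (10,080 minutes) with a 30-minute and a 12-minute outage:
week = 7 * 24 * 60
print(round(availability([(100, 130), (5000, 5012)], week), 4))  # 0.9958
```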
Interpreting system availability (VAX 6000): the problems start.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: Availability distribution captured during 1998 — systems sorted by availability (0–99% of systems) against availability from 99.0% to 100.0%; series: Peak, 24x7]

This level of availability implies the systems are unlikely to be in a production environment!
Measuring availability
- Ignore workstations/clients.
- "Intelligently" filter out long outages.
- "Intelligently" filter out non-production systems.
- Differentiate between system maintenance outages and those due to 'reliability'.
- Capture the cause of each outage from the system managers; beware, they do not always tell the truth!
- Assume usage by the time of the event.
Assuming usage based on the day of the event.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: Distribution of system outages by day of week (Monday–Sunday), VAX 6000 systems; series: System Outages, System Crashes]
[Figure: Distribution of outages on DOT COM sites by day of week (Monday–Sunday), Windows 2000 systems; series: System Reboots, Bluescreens]
Distribution of system outages based on the time of the event (measured Monday–Friday).
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: Distribution of outages on DOT COM sites by hour of day (Monday–Friday), Windows 2000; series: System Outages, Bluescreens]
[Figure: Distribution of system outages by hour of day (0–23, Monday–Friday), VAX 6000; series: System Outages, System Crashes]
Generic rules for measuring system behaviour
- All reliability analysis requires a filter.
- Workstation/client behaviour can only be characterized by its crash rate.
- Reliability of servers can be measured using either the crash rate or the system outage rate.
- Availability is most accurately measured during peak usage periods.
- Site availability is calculated using the median availability, or through removing outliers.
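The last rule can be made concrete: summarize a site by the median availability across its systems, or by a trimmed mean that drops the outliers (such as the non-production machines seen earlier). A minimal sketch with made-up numbers:

```python
def site_availability(per_system, trim=0.1):
    """Median and trimmed-mean summaries of per-system availability."""
    xs = sorted(per_system)
    n = len(xs)
    median = xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2
    k = int(n * trim)                  # drop the k lowest and k highest
    trimmed = xs[k:n - k] if k else xs
    return median, sum(trimmed) / len(trimmed)

# One 62%-available test machine would drag a plain mean far below
# what the production systems actually deliver.
avail = [0.62, 0.9980, 0.9985, 0.9988, 0.9990, 0.9992, 0.9993,
         0.9995, 0.9996, 0.9999]
median, trimmed = site_availability(avail)
print(round(median, 4), round(trimmed, 4))
```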
The hard part: interpretation!
- Availability and reliability are seasonal.
- Reliability of large servers is affected by their life cycle.
- Software reliability is affected by its time since installation and also by its life cycle.
- Comparisons between different products are very difficult.
Reliability of servers over their life cycle.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: VAX 7600 life cycle, Q4 1993 – Q1 1998 — system crash rate (0–6) and system outage rate (0–90) per quarter; series: System Crashes, System Outages]
Impact of installation on (VMS) operating system behaviour
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: Post-installation behaviour — rate of events (0–400) over weeks 1–10 from installation; series: System Outages, System Crashes]
Impact of the (VMS) software life cycle on its reliability.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: Operating system life cycle — system crash rate (0–8) and system outage rate (0–70) over the 1st to 4th six months following release; series: System Crashes, System Outages]

Operating system behaviour improves with age?
Few new patches are produced 6 months after the release of any version of the operating system.
Comparing (VMS) system reliability using data collected from the same point in the life cycle.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: OpenVMS metrics (running on VAX systems), post-installation behaviour — rate of system outages (0–100) by operating system version (V5.5, V5.5-1, V5.5-2, V6.0, V6.1, V6.2, V7.0, V7.1); series: Upper Confidence Bound, Average, Lower Confidence Bound]

Only includes systems installing the OS within 6 months of release.
Overall rules for characterizing system behaviour
- Hardware failure rates can be fully characterized.
- Software behaviour is characterized by its reliability, availability and instability.
- As software matures, crash rates are no longer the definitive measure of reliability.
- User perception is not necessarily reality.
- Comparisons between versions of software can be performed, with care.
- Comparisons between products are difficult, and require knowledge of product and usage characteristics.
Monitoring applications running across distributed systems
- Three perspectives of system behaviour:
  - Application behaviour on individual systems.
  - Total application behaviour from the system manager's perspective.
  - Total application behaviour from the user's perspective.
- Four measurements: reliability, availability, instability, degradation.
Problems associated with 'solution' analysis
- The relative importance of the metrics varies between users of the metrics.
- The users do not have a consistent set of requirements for the application.
- The configuration of the distributed solution changes over time.
- Very little research has been performed into the behaviour of solutions on customer sites, i.e. big opportunities.
Examples of analysis of a 'distributed' solution: VMS clusters
- The behaviour of individual systems was captured.
- The configuration of the cluster was captured.
- Correlating the data gives the cluster behaviour.

[Diagram: outage timelines for Node A, Node B and Node C — when is the cluster down?]
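The correlation step above can be sketched as interval arithmetic: if the cluster survives as long as any node is up, the cluster is down only while every node's outage intervals overlap, so cluster downtime is the intersection of the per-node outage sets. The interval representation is an illustrative assumption.

```python
def cluster_down_intervals(node_outages):
    """node_outages: per node, a list of (start, end) outage intervals.

    The cluster is down only when ALL nodes are down at once, so we
    intersect the nodes' outage interval sets pairwise.
    """
    def intersect(a, b):
        out = []
        for s1, e1 in a:
            for s2, e2 in b:
                s, e = max(s1, s2), min(e1, e2)
                if s < e:             # the two outages overlap
                    out.append((s, e))
        return out

    down = node_outages[0]
    for outages in node_outages[1:]:
        down = intersect(down, outages)
    return down

# Node A down 10-50, node B down 40-60, node C down 45-55:
# the whole cluster is only down 45-50.
print(cluster_down_intervals([[(10, 50)], [(40, 60)], [(45, 55)]]))
```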
Characterizing VMS cluster behaviour.
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: OpenVMS VAX cluster behaviour — annual rate of outages (0–60) and average downtime (0–3.5) against the number of servers in the cluster (1–6); series: Cluster Reliability, Cluster Downtime]
Characterizing VMS cluster behaviour: characterizing instability (recoverability).
©FTSC 1999 Madison, Murphy, Davies, Compaq Corporation.

[Figure: OpenVMS cluster behaviour using a 1-week filter — annual rate of outages (0–8) and period of instability (0–100) against the number of servers in the cluster (1–6); series: Cluster Reliability, Periods of Instability]
Opportunities for research into characterizing solution behaviour
- Developing metrics to characterize solution behaviour.
- Understanding the relationships between the metrics, e.g. identifying network availability as the difference between end-user and system availability.
- Correlating the relationship between configuration and end-user behaviour.
- The difficulty is monitoring production sites.
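The network-availability idea mentioned above can be made concrete as a first-order estimate: if the system itself measures 99.9% available but end users only see the service 99.5% of the time, the gap is attributable to the network. The function name and numbers are illustrative, and this ignores other components in the path.

```python
def network_unavailability_estimate(system_avail, user_avail):
    """First-order estimate: attribute the gap between what the
    system reports and what the end user experiences to the network.
    Assumes the network is the only other component in the path.
    """
    if user_avail > system_avail:
        raise ValueError("end-user availability cannot exceed system availability")
    return system_avail - user_avail

# System at 99.9%, end users at 99.5%: ~0.4% lost to the network.
print(round(network_unavailability_estimate(0.999, 0.995), 4))  # 0.004
```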