Best Practices in Web Performance Monitoring
Alistair A. Croll, VP Product Management and Co-Founder
Jan 15, 2015
2
So you want to monitor things.
3
But there are too many toys out there…
4
A top-down approach to web performance monitoring
Metrics
Tools
Operating processes
Business goals
5
A top-down approach to web performance monitoring
Metrics
Tools
Operating processes
Business goals
Simplify & interpret
Start here!
6
What goals? (in plain English)
7
Goals
• Make the application available
  – I can use it
• Ensure user satisfaction
  – It’s fast & meets or exceeds my expectations
• Balance capacity with demand
  – It handles the peak loads
  – It doesn’t cost too much
• Minimize MTTR
  – When it breaks, I can fix it efficiently
• Align operations tasks with business priorities
  – I work on what matters first
8
They can use it
9
Make the application available
• The most basic goal
• App should be reachable, responsive, and functionally correct
• 3 completely different issues
  – Can I communicate with the service?
  – Can I get end-to-end responses in a timely manner?
  – Is the application behaving properly?
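The three availability questions can be checked in order for each test request. The sketch below is illustrative only: the `ProbeResult` fields, the 5-second limit, and the `"Welcome"` marker text are assumptions, not values from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Outcome of one hypothetical test request against the application."""
    connected: bool                   # did the connection to the service succeed?
    elapsed_seconds: Optional[float]  # end-to-end response time, None if no response
    body: str = ""                    # response content, empty if none


def availability_status(result: ProbeResult,
                        max_seconds: float = 5.0,
                        expected_text: str = "Welcome") -> str:
    """Classify a probe by the first availability question it fails."""
    if not result.connected:
        return "unreachable"    # cannot communicate with the service
    if result.elapsed_seconds is None or result.elapsed_seconds > max_seconds:
        return "unresponsive"   # no timely end-to-end response
    if expected_text not in result.body:
        return "incorrect"      # application is not behaving properly
    return "available"
```

A page that loads quickly but shows a database error would come back as `"incorrect"`, which a simple reachability check would miss.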
10
They’re happy & productive
11
Ensure user satisfaction
• How fast is fast enough?
• Depends on the task
  – Login versus reports
• Depends on user expectations
  – ATMs versus banking systems
• Depends on the user’s state of mind
  – Deeply engaged versus browsing
12
Balance capacity with demand
• Performance degrades with demand
[Chart: performance (end-to-end delay) plotted against load (requests per second), with the maximum acceptable delay and the maximum capacity marked]
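One way to read the chart: maximum capacity is the highest load at which delay is still within the acceptable limit. A minimal sketch, assuming load and delay have been measured as pairs (the function name and sample values are hypothetical):

```python
def maximum_capacity(samples, max_acceptable_delay):
    """Return the highest load whose measured delay is still acceptable.

    samples: iterable of (load_requests_per_second, end_to_end_delay_seconds).
    Walks the samples in order of increasing load and stops at the first
    one that exceeds the limit. Returns None if even the lowest load fails.
    """
    capacity = None
    for load, delay in sorted(samples):
        if delay > max_acceptable_delay:
            break
        capacity = load
    return capacity


samples = [(100, 0.4), (200, 0.6), (300, 1.1), (400, 2.5), (500, 6.0)]
print(maximum_capacity(samples, max_acceptable_delay=2.0))  # 300
```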
13
I can fix it fast
14
Minimize MTTR
• Fix it efficiently
• Know the costs of downtime
• Application- and business-dependent
  – Direct (operational) costs
  – Penalties
  – Opportunity costs
  – Abandonment costs
15
Minimize MTTR
• Don’t just think about lost revenue
16
Minimize MTTR
• And consider the whole resolution cycle
[Timeline of the time to recover: Event occurs → IT aware → Reproduced → Diagnosed → Resolved → Deployed → Verified]
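Breaking the resolution cycle into stages shows where recovery time actually goes. The sketch below assumes each stage is timestamped and that time to recover runs from the event to verification. The stage keys and the sample times are made up for illustration.

```python
from datetime import datetime

# Stage order taken from the resolution cycle; the key names are assumptions.
STAGES = ["event_occurs", "it_aware", "reproduced", "diagnosed",
          "resolved", "deployed", "verified"]


def resolution_breakdown(timestamps):
    """Return (minutes spent reaching each stage, total minutes to recover).

    timestamps: dict mapping every stage name in STAGES to a datetime.
    """
    minutes = {}
    for previous, current in zip(STAGES, STAGES[1:]):
        delta = timestamps[current] - timestamps[previous]
        minutes[current] = delta.total_seconds() / 60
    total = (timestamps[STAGES[-1]] - timestamps[STAGES[0]]).total_seconds() / 60
    return minutes, total


incident = {
    "event_occurs": datetime(2015, 1, 15, 9, 0),
    "it_aware": datetime(2015, 1, 15, 9, 20),
    "reproduced": datetime(2015, 1, 15, 9, 50),
    "diagnosed": datetime(2015, 1, 15, 10, 30),
    "resolved": datetime(2015, 1, 15, 10, 45),
    "deployed": datetime(2015, 1, 15, 11, 0),
    "verified": datetime(2015, 1, 15, 11, 10),
}
per_stage, total = resolution_breakdown(incident)
print(per_stage["diagnosed"], total)  # 40.0 130.0
```

In this invented example the actual fix takes 15 minutes, while detection, reproduction, and diagnosis account for 90 of the 130 minutes.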
17
I worry about what matters
18
Align operations tasks with business priorities
• Know what the business goals are
• Fix problems, not incidents
• Know the real impact of an issue
19
Align operations tasks with business priorities
• Tackle problems, not incidents
  – Incident: Bob from Houston had a 500 error
    (So did everyone else in Houston!)
  – SLM violation: 10% of requests are getting 500 errors
    (And they’re all coming from Houston!)
  – Problem: Houston can’t use the order app
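Splitting the error rate by a request attribute is one way to get from a site-wide violation to the underlying problem. A sketch under assumed inputs: each request is a `(region, http_status)` pair, and the region names and counts are invented to mirror the Houston example.

```python
from collections import defaultdict


def error_rate_by_region(requests):
    """Return {region: fraction of that region's requests that got HTTP 5xx}.

    requests: iterable of (region, http_status) pairs.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for region, status in requests:
        totals[region] += 1
        if 500 <= status <= 599:
            errors[region] += 1
    return {region: errors[region] / totals[region] for region in totals}


requests = ([("Houston", 500)] * 10
            + [("Dallas", 200)] * 50
            + [("Austin", 200)] * 40)
print(error_rate_by_region(requests))
# {'Houston': 1.0, 'Dallas': 0.0, 'Austin': 0.0}
```

Overall, 10 of 100 requests fail, which looks like a 10% problem for everyone. Split by region, it is a 100% problem for one city.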
20
Align operations tasks with business priorities
• Know the real impact of issues
[Chart: requests over time, split into good requests and errored requests, showing affected users and the total impact as the change from “normal”]
21
So I have these goals…
• Make the application available
• Ensure user satisfaction
• Balance capacity with demand
• Minimize MTTR
• Align operations tasks with business priorities
• How do I make sure I meet them repeatably and predictably?
22
Okay, got the goals
23
But how do I make this real?
24
A top-down approach to web performance monitoring
Metrics
Tools
Operating processes
Business goals
Goals drive processes
25
Processes
• Reporting & overcommunication
• Capacity planning
• SLA definition
• Problem detection
• Problem localization & resolution
26
Keep people informed
27
Reporting & overcommunication: Know the audience
Different stakeholders    The same data sources
Network operations        Network latency, throughput, retransmissions, service outages
Marketing                 Abandonment, conversion, demographics
Server operations         Host latency, server errors, session concurrency
Security                  Anomalies, fraudulent activity
Finance                   Capacity planning, time out of SLA, IT repair costs
28
I have enough juice
29
Capacity planning
• Define peak load
• Define acceptable performance & availability
• Select margin of error
  – Cost of being wrong
  – Variance and confidence in the data
• Build capacity & monitor
  – Performance versus load
30
Capacity planning
31
We all agree on what’s “good enough”
32
SLA definition
• Select a metric
• Select an SLA target
  – That you control
  – That can be reliably measured
• Define how many transactions can exceed this target before being in violation
• Monitor
  – Metric, percentile
33
SLA definition
• 95% of all searches by zipcode by all HR personnel will take under 2 seconds for the network to deliver
  – 95%: Percentiles, not averages
  – All searches by zipcode: Application function, not port
  – All HR personnel: User-centric, actual requests
  – Under 2 seconds: Performance metric
  – For the network to deliver: A specific element of delay
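An SLA phrased this way reduces to a simple check over measured requests. The sketch below assumes the network delivery time of each real zipcode search by HR personnel has already been collected. The function name and sample data are hypothetical.

```python
def sla_met(delivery_seconds, target_seconds=2.0, required_fraction=0.95):
    """Check whether enough measured requests met the SLA target.

    delivery_seconds: network delivery time of each measured request.
    Returns (met, fraction_within_target).
    """
    if not delivery_seconds:
        raise ValueError("no measurements to evaluate")
    within = sum(1 for seconds in delivery_seconds if seconds < target_seconds)
    fraction = within / len(delivery_seconds)
    return fraction >= required_fraction, fraction


# 96 fast searches and 4 slow ones: still inside the SLA.
print(sla_met([0.5] * 96 + [3.0] * 4))   # (True, 0.96)
# 90 fast searches and 10 slow ones: in violation.
print(sla_met([0.5] * 90 + [3.0] * 10))  # (False, 0.9)
```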
34
I know where problems are…
35
Problem detection
• Detect incidents as soon as they affect even one user
• Is the incident part of a bigger problem?
• Prioritize problems by business impact
  – Number of users affected
  – Dollar value lost
  – Severity of the issue
36
…and I can figure out what’s behind them
37
Problem localization & resolution
• Reproduction of the error
  – Capture a sample incident
• Deductive reasoning
  – Check tests to see what else is failing
  – Do incidents share a common element?
  – Do incidents happen at a certain load?
  – Do incidents recur around a certain time?
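The "common element" question can be asked mechanically: for each attribute recorded with an incident, does one value dominate? A minimal sketch, assuming incidents are captured as dictionaries. The attribute names and the 80% cutoff are illustrative choices.

```python
from collections import Counter


def common_elements(incidents, min_share=0.8):
    """Find attribute values shared by at least min_share of the incidents.

    incidents: list of dicts, e.g. {"region": ..., "server": ..., "page": ...}.
    Returns {attribute: (value, share)} for each attribute with a dominant value.
    """
    shared = {}
    if not incidents:
        return shared
    attributes = set().union(*(incident.keys() for incident in incidents))
    for attribute in attributes:
        counts = Counter(i[attribute] for i in incidents if attribute in i)
        value, count = counts.most_common(1)[0]
        share = count / len(incidents)
        if share >= min_share:
            shared[attribute] = (value, share)
    return shared


incidents = [
    {"region": "Houston", "server": "web1", "page": "/order"},
    {"region": "Houston", "server": "web2", "page": "/order"},
    {"region": "Houston", "server": "web1", "page": "/order"},
    {"region": "Houston", "server": "web3", "page": "/login"},
    {"region": "Houston", "server": "web2", "page": "/order"},
]
print(common_elements(incidents))
```

Here the region and the page stand out while the server does not, which points away from a single bad host.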
38
Problem localization & resolution
39
Problem localization & resolution
• What do they have in common?
40
Problem localization & resolution
41
A top-down approach to web performance monitoring
Metrics
Tools
Operating processes
Business goals
Select tools that make processes work best
42
Tools: The three-legged stool
Synthetic
Real User
Device
43
Device monitoring: Watching the infrastructure
• Less relation to application availability
• Vital for troubleshooting and localization
• Will show “hard down” errors
  – But good sites are redundant anyway
• Correlation between a metric (CPU, RAM) and performance degradation shows where to add capacity
44
Synthetic testing: Checking it yourself
• Local or outside
• Same test each time
• Excellent for network baselining when you can’t control the end user’s connection
• Use to check if a region or function is down for everyone
• Limited usefulness for problem re-creation
45
Synthetic testing: Checking it yourself
46
Real user monitoring: 2 main uses
• Tactical
  – Detect an incident as soon as 1 user gets it
  – Capture session forensics
• Long-term
  – Actual user service delivery
  – Performance/load relations
  – Capacity planning
47
Real user monitoring: 2 main uses
• Outlined in ITIL
Service support
  – Incident management
  – Problem management
Service delivery
  – Service level management
  – Availability management
  – Capacity planning
48
OK, I’ve got the tools. What do I look at?
49
A top-down approach to web performance monitoring
Metrics
Tools
Operating processes
Business goals
Use the right metrics for the audience & question
50
Metrics
• Measure everything
  – A full performance model
• Availability
  – Can I use it?
• User satisfaction
  – What’s the impact of bad performance?
• Use percentiles
  – Averages lie
51
A full performance model
• The HTTP data model
  – Redirects
  – Containers
  – Components
  – User sessions
• HTTP-specific latency
  – SSL
  – Redirect time
  – Host latency
  – Network latency
  – Idle time
  – Think time
52
Availability
• Network errors
  – High retransmissions, DNS resolution failure
53
Availability
• Client errors
  – 404 not found
54
Availability
• Application errors
  – HTTP 500
55
Availability
• Service errors
56
Availability
• Content & back-end errors
  – “ODBC Error #1234”
57
Availability
• Custom errors
  – Specific to your business
58
User satisfaction: Satisfied, tolerating, frustrated
What metric? What function?
Target performance
Impact on users
Percentile data
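The satisfied / tolerating / frustrated vocabulary matches the Apdex index. The slides do not name it, so treat the sketch below as an assumption: it uses the Apdex convention of a target time T, with "tolerating" up to 4T, and scores tolerating responses at half weight.

```python
def satisfaction_zones(response_seconds, target_seconds):
    """Count responses per zone and compute an Apdex-style score.

    Assumes the Apdex convention: satisfied up to the target T,
    tolerating up to 4T, frustrated beyond that.
    Returns (zone counts, score), with score None when there is no data.
    """
    zones = {"satisfied": 0, "tolerating": 0, "frustrated": 0}
    for seconds in response_seconds:
        if seconds <= target_seconds:
            zones["satisfied"] += 1
        elif seconds <= 4 * target_seconds:
            zones["tolerating"] += 1
        else:
            zones["frustrated"] += 1
    total = sum(zones.values())
    score = (zones["satisfied"] + zones["tolerating"] / 2) / total if total else None
    return zones, score


zones, score = satisfaction_zones([1, 1, 1, 1, 1, 1, 3, 3, 9, 9], target_seconds=2)
print(zones, score)
# {'satisfied': 6, 'tolerating': 2, 'frustrated': 2} 0.7
```

The target would be chosen per function, since a login and a report do not share the same expectation.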
59
Averages lie: Use percentiles
60
Averages lie: Use percentiles
Average varies wildly, making it hard to threshold properly or see a real slow-down.
61
Averages lie: Use percentiles
80th percentile only spikes once for a legitimate slow-down (20% of users affected)
62
Averages lie: Use percentiles
Setting a useful threshold on percentiles gives fewer false positives and more real alerts
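A small worked example of why the average misleads. The sketch uses a nearest-rank percentile, one of several common definitions, and invented response times: one interval with a single extreme outlier, and one where 30% of users are genuinely slow.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


one_outlier = [1.0] * 9 + [30.0]      # one user had a terrible request
real_slowdown = [1.0] * 7 + [6.0] * 3  # 30% of users are slow

print(mean(one_outlier), percentile(one_outlier, 80))      # 3.9 1.0
print(mean(real_slowdown), percentile(real_slowdown, 80))  # 2.5 6.0
```

The average is higher for the single outlier than for the real slow-down, so a threshold on it alerts on the wrong interval. The 80th percentile stays flat for the outlier and rises only when a meaningful share of users is affected.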
63
A top-down approach to web performance monitoring
Metrics
Tools
Operating processes
Business goals
64
Questions?
acroll<at>coradiant.com
(514) 944-2765