Service Monitoring CPTE 433 John Beckett
The Control Cycle
[Diagram: the control cycle. Boxes: Process, Actual Outcome, Desired Outcome, OK?, Evaluation Process, New Process Design, Other Information.]
[Example labels: Driving, Arrive at Destination, Location, Low on Gas, Get Gas, Revise ETA, Getting Gas.]
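One pass through the control cycle can be sketched in code using the driving example from the diagram labels. This is only an illustration: the function name, the low-fuel threshold, and the length of the fuel stop are hypothetical values, not taken from the slides.

```python
def control_cycle(fuel_fraction, eta_minutes, low_fuel=0.1, stop_minutes=10):
    """One pass: evaluate the actual outcome, then keep or revise the process.

    All parameter names and default values are hypothetical.
    """
    # Evaluation: is the actual state still OK for reaching the destination?
    if fuel_fraction >= low_fuel:
        return "Driving", eta_minutes
    # New process design: get gas, and revise the ETA to match.
    return "Getting Gas", eta_minutes + stop_minutes
```

With enough fuel the process is unchanged; when fuel is low, both the process and the expected outcome are revised together.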
The Control Cycle
[Diagram: the control cycle. Boxes: Process, Actual Outcome, Desired Outcome, OK?, Evaluation Process, New Process Design, Other Information.]
[Example labels: Use Network, Deliver Apps Remotely, It Works Slowly, Price of additional Bandwidth, Move to Clients, Revise Expectations, Change workflow accordingly, Use Client Also.]
Two Types of Monitoring
• Real-Time
– “Where is our bandwidth going?”
– “Who is doing ____?”
• Historical
– Analysis of trends over time
   • Perhaps a year
   • Perhaps a day
   • Perhaps an hour
– Tracking problems down
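Historical analysis at different time scales can be sketched as grouping samples into fixed windows and averaging each one. The function name and the `(timestamp, value)` sample format are assumptions made for this example; a real monitoring system would read from its own data store.

```python
from collections import defaultdict
from statistics import mean

def average_by_window(samples, window_seconds):
    """Group (timestamp, value) samples into fixed windows and average each.

    Hypothetical helper: timestamps are seconds, window_seconds sets the scale
    (3600 for an hour, 86400 for a day).
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}
```

The same samples can be re-run with a larger window to move from an hourly view to a daily or yearly one.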
Need a reason before you monitor
The reason comes later
Reasons to Monitor
• Business requirement
– Lost revenue if errors are made
– “E-commerce will probably need to implement everything presented in this chapter.”
• Legal requirement
– If you are required to prove you didn’t do something, abundant records of the alternative are all you can get.
   • Test: “Reasonable and prudent”
Real-Time
• Alerts/Notifications
• Drill-down
• Root-Cause Analysis
• Note: uses bandwidth!
Monitoring versus Alerting
• The Best System…
• Moves activity from monitoring to alerting
• Focuses work on productive measures
• Note emphasis on “automatic root-cause analysis”
• Does it work?
• It may be too early to tell
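Moving from monitoring to alerting means a person no longer watches the readings; a rule does, and it speaks up only when something is wrong. A minimal sketch, assuming readings and limits are simple name-to-number mappings (the function and metric names are hypothetical):

```python
def check_thresholds(readings, limits):
    """Return alert messages for readings that exceed their configured limit.

    Hypothetical helper: readings and limits both map metric name -> number.
    """
    alerts = []
    for name, value in readings.items():
        limit = limits.get(name)
        # Metrics with no configured limit are monitored but never alert.
        if limit is not None and value > limit:
            alerts.append(f"{name}: {value} exceeds {limit}")
    return alerts
```

For example, `check_thresholds({"cpu": 95, "disk": 40}, {"cpu": 90, "disk": 80})` reports only the CPU reading.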
Who Else Might Hear?
• “I’m Hot. I’m Wet.”
Built-in Escalation System
• Can your system “know” whom to notify?
• Do you give lower levels authority to select escalation path?
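One way a system can "know" whom to notify is a time-based escalation path: the longer an alert goes unacknowledged, the further up it goes. The sketch below is an assumption about how such a table might look; the contact names and minute thresholds are hypothetical.

```python
# Hypothetical escalation path: (minutes unacknowledged, contact), ascending.
ESCALATION_PATH = [(0, "on-call admin"), (15, "team lead"), (60, "IT manager")]

def who_to_notify(minutes_unacknowledged, escalation_path=ESCALATION_PATH):
    """Return the highest contact whose time threshold has been reached."""
    contact = escalation_path[0][1]
    for threshold, name in escalation_path:
        if minutes_unacknowledged >= threshold:
            contact = name
    return contact
```

Keeping the path as data rather than code is one way to let lower levels adjust it, if they are given that authority.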
System, Heal Thyself?
• Network routing systems have done this for decades.
– Symptom is often that the automatic routing looks strange.
– Good application for alert system
• The key is reporting and corrective action.
• Problems tend to have “children.”
– Automatic re-routing causes strain elsewhere
– Unknown dependencies
– Power systems chronically snowball problems
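Because problems have "children," a simple form of root-cause analysis is to report only the failures that no other known failure explains. This is a sketch under the assumption that dependencies are recorded in a mapping; the function and device names are hypothetical, and it cannot help with the unknown dependencies mentioned above.

```python
def root_causes(failed, depends_on):
    """Return failed items none of whose known dependencies have also failed.

    Hypothetical helper: depends_on maps an item to the items it relies on.
    """
    failed = set(failed)
    return sorted(
        item for item in failed
        # A failure is a "child" if something it depends on is also down.
        if not any(dep in failed for dep in depends_on.get(item, []))
    )
```

Given `{"web": ["db"], "db": ["switch"], "mail": ["switch"]}` and all four items down, only the switch is reported.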
Device Discovery
• Major key to successful monitoring
• Speeds installation of the monitoring system
• Finds unknown devices
• Catches unreported changes
• Finds rogue devices!
• And by the way…
– Improves service
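The value of discovery comes from comparing what is found on the network with what is on record. A minimal sketch, assuming both sides are mappings from MAC address to IP address (the function name and data format are hypothetical; the discovery scan itself is not shown):

```python
def compare_inventory(discovered, inventory):
    """Compare discovered devices with the recorded inventory.

    Hypothetical helper: both arguments map MAC address -> IP address.
    """
    found, recorded = set(discovered), set(inventory)
    return {
        # On the network but not on record: unknown or rogue devices.
        "unknown": sorted(found - recorded),
        # On record but not seen: removed or down.
        "missing": sorted(recorded - found),
        # Seen and recorded, but with a different address: unreported change.
        "changed": sorted(
            mac for mac in found & recorded if discovered[mac] != inventory[mac]
        ),
    }
```

Each of the three lists maps to a bullet above: unknown devices, rogue devices, and unreported changes.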
End-To-End Tests
• The Mailping case
• Develop metrics to watch for service degradation
• Excellent preparation for scaling-up
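An end-to-end test exercises the service the way a user would and records how long it takes, which gives a metric to watch for degradation. The sketch below assumes a mail-style probe: send a test message, then poll until it arrives. The `send` and `received` callables are hypothetical stand-ins for real mail operations, and the timeout and poll interval are arbitrary.

```python
import time

def end_to_end_probe(send, received, timeout=5.0, poll=0.01):
    """Send a test message; return seconds until it arrives, or None on timeout.

    Hypothetical interface: send() returns a token identifying the message,
    and received(token) returns True once that message has been delivered.
    """
    start = time.monotonic()
    token = send()
    while time.monotonic() - start < timeout:
        if received(token):
            return time.monotonic() - start
        time.sleep(poll)
    return None
```

Logging the returned delay on every run turns a pass/fail check into a trend that can be tracked as the service scales up.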