Understanding and Dealing with Operator Mistakes in Internet Services
Kiran Nagaraja, Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen
Rutgers University
Vivo Project: http://vivo.cs.rutgers.edu
Funding from NSF grants: #EIA-0103722, #EIA-9986046, and #CCR-0100798.
Motivation
Internet services are ubiquitous, e.g., Google, Yahoo!, Amazon, eBay.
Expectation of 24x7 availability, but service outages still happen!
Sorry....We apologize for the inconvenience, but the system is currently unavailable. Please try your request in an hour. If you require assistance please call Customer Service at 1-866-325-3457.
A significant number of outages in Internet services are a result of operator actions [Oppenheimer03]
#1: Architecture is complex
#2: Systems are constantly evolving
#3: Lack of tools for operators to reason about the impact of their actions
Offline testing, emulation, simulation
Very little detail on operator mistakes
Details strongly guarded by companies and administrators
Talk Outline
Approach and Contributions
Operator Study: Understanding the Mistakes
Validation: Preventing Exposure of Mistakes
Conclusion and Future Work
This Work
Understanding: Gather detailed data on operators' mistakes
What categories of mistakes?
What's the impact on the service?
How do mistakes correlate with experience and impact?
Approaches to deal with operator mistakes: prevention, recovery, automation
Validation: Allow operators to evaluate the correctness of their actions prior to exposing them to the service
Similar to offline testing, but:
Virtual environment (extension of online environment)
Real workload
Migration back and forth with minimal operator involvement
Contributions
Detailed information on operator tasks and mistakes: 43 experiments yielding detailed data on operator behavior, including 42 mistakes
64% immediately degraded throughput
57% were software configuration mistakes
Demonstrate that human experiments are possible and valuable
Designed and prototyped a validation infrastructure
Implemented on 2 cluster-based services: a cooperative Web server (PRESS) and a multi-tier auction service
2 techniques to allow operators to validate their actions
Demonstrated that validation is a promising technique for reducing impact of operator mistakes
66% of all mistakes observed in operator study were caught
6 of 9 mistakes caught in live operator experiments with validation
Successfully tested with synthetically injected mistakes
Talk Outline
Approach and Contributions
Operator Study: Understanding the Mistakes
Representative environment
Choice of human subjects and experiments
Results
Validation: Preventing Exposure of Mistakes
Conclusion and Future Work
Multi-Tiered Internet Services
[Diagram: three-tier Internet service. Client requests enter at Tier 1 (Web servers), which forward to Tier 2 (application servers), which query Tier 3 (the database server).]
Tasks, Operators & Training
Tasks: scheduled maintenance tasks (proactive), e.g., upgrade Apache
Diagnose-and-repair tasks (reactive), e.g., diagnose a disk failure
Operator composition: 14 computer science graduate students
5 professional programmers (Ask Jeeves)
2 sysadmins from our department
Categorization of operators, based on a filled-in questionnaire:
11 novices – some familiarity with the setup
5 intermediates – experience with a similar service
5 experts – in charge of a service requiring high uptime
Operator training: novice operators given warm-up tasks
Material describing the service, and detailed steps for the tasks
Experimental Setup
Service: 3-tier auction service, with a client emulator
Operator assistance & data capture: monitor service throughput
Modified bash shell for command and result tracing
Manual observation: noting anomalies in operator behavior
Bailing out ‘lost’ operators
Example Trace
Task: Add an application server
Mistake: Apache misconfiguration
Impact: Degraded throughput
[Throughput timeline: application server added; first Apache misconfigured and restarted; second Apache misconfigured and restarted.]
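To make this class of mistake concrete, here is a hedged sketch of the kind of Apache mod_jk worker configuration such a task touches. The hostnames and worker names are hypothetical, and directive names vary across mod_jk versions (older releases used balanced_workers):

    # workers.properties -- hypothetical hosts; directive names vary
    # by mod_jk version (older releases used balanced_workers)
    worker.list=lb
    # Existing application server
    worker.app1.type=ajp13
    worker.app1.host=app1.example.com
    worker.app1.port=8009
    # Newly added application server
    worker.app2.type=ajp13
    worker.app2.host=app2.example.com
    worker.app2.port=8009
    # Omitting app2 from this list is the "omission from the backend
    # member list" mistake sampled on the next slide
    worker.lb.type=lb
    worker.lb.balance_workers=app1,app2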
Sampling of Other Mistakes
Adding a new application server:
Omission of the new application server from the backend member list
Syntax errors, duplicate entries, wrong hostnames
Launching the wrong version of software
Migrating the database for a performance upgrade:
Incorrect privileges for accessing the database
Security vulnerability
Database installed on wrong disk
Operator Mistakes: Category Vs Impact
64% of all mistakes had an immediate impact on service performance; 36% resulted in latent faults.
Obs. #1: A significant number of mistakes can be caught by testing in a realistic environment
Obs. #2: Undetectable latent errors will still require online-recovery techniques
[Bar chart: number of mistakes per impact category — degraded throughput, service inaccessible, increased MTTR, incomplete component integration, security vulnerability, web server potentially inaccessible, reduced system capacity, potential database crash.]
Operator Mistakes
[Bar chart: number of mistakes per mistake category — local config, global config, incorrect restart, start of wrong SW version, unnecessary restart of SW, unnecessary HW replacement, wrong choice of HW.]
Misconfigurations account for 57% of all errors.
Configuration mistakes spanning multiple components are more likely.
Obs. #1: Tools to manipulate and check configurations are crucial
Obs. #2: Be extremely careful when maintaining multiple versions of s/w
Operator Categories
Experts also made mistakes! The complexity of the tasks executed by experts was higher.
[Bar chart: ratio of mistakes to experiments per mistake category, for novice, intermediate, and expert operators.]
Operator Study Summary
43 experiments, 42 mistakes
27 (64%) mistakes caused immediate impact on service performance
24 (57%) were software configuration mistakes
Mistakes were made across all operator categories
Trace of operator commands & service performance for all experiments
Available at http://vivo.cs.rutgers.edu
Talk Outline
Approach and Contributions
Operator Study: Understanding the Mistakes
Validation: Preventing Exposure of Mistakes
Technique
Experimental Evaluation
Conclusion and Future Work
Validation of Operator’s Actions
Validation: allow the operator to check the correctness of his/her actions prior to exposing their impact to the service interface (clients)
Correctness is tested in four steps (sketched below):
1. Migrate the component(s) to a virtual sandbox environment,
2. Subject them to a real load,
3. Compare their behavior to a known correct one, and
4. Migrate them back to the online environment
Types of validation:
Replica-based: compare with an online replica (real time)
Trace-based: compare with logged behavior
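A minimal sketch of this four-step loop, assuming hypothetical stand-in types (Component, Sandbox) rather than the actual Vivo infrastructure APIs:

    // ValidationSketch.java -- hypothetical sketch of the four validation
    // steps; none of these types are the actual Vivo APIs.
    import java.util.List;

    interface Component { String handle(String request); }

    interface Sandbox {
        void migrateIn(Component c);   // isolate component in a virtual sandbox
        void migrateOut(Component c);  // return component to the online slice
    }

    public class ValidationSketch {
        /** Returns true if the component's responses match known-good ones. */
        static boolean validate(Component c, Sandbox sandbox,
                                List<String> requests, List<String> expected) {
            sandbox.migrateIn(c);                          // step 1
            for (int i = 0; i < requests.size(); i++) {    // step 2: real load
                String observed = c.handle(requests.get(i));
                if (!expected.get(i).equals(observed)) {   // step 3: compare
                    return false;                          // mistake caught
                }
            }
            sandbox.migrateOut(c);                         // step 4
            return true;
        }
    }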
Validating a Component: Replica-Based
[Diagram: replica-based validation. The component under validation runs in a validation slice behind web server and database proxies; client requests are shunted in real time from an online replica, application state is copied over, and the responses of the two replicas are compared.]
Validating a Component: Trace-Based
[Diagram: trace-based validation. Same slice layout as above, but requests, responses, and state are replayed from a previously logged trace, and the component's behavior is compared against the logged behavior.]
Implementation Details
Shunting performed in the middleware layer. Each request is tagged with a unique ID all along the request path.
Component proxies can be constructed with little effort: reuse discovery and communication interfaces, and a common messaging core.
State management requires a well-defined export and import API. Stateful servers often support such an API.
Comparator functions detect errors: simple throughput, flow, and content comparators (a content comparator is sketched below).
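As an illustration, a hedged sketch of a content comparator in the spirit described above; the Response type and its fields are hypothetical stand-ins, not the infrastructure's actual types:

    // Hypothetical content comparator; Response and its fields are
    // illustrative stand-ins for the validation infrastructure's types.
    import java.util.Arrays;

    class Response {
        long requestId;   // unique ID tagged onto the request along its path
        byte[] body;      // response payload
    }

    class ContentComparator {
        /** Flags a mistake when two responses to the same request differ. */
        static boolean matches(Response online, Response validated) {
            if (online.requestId != validated.requestId) {
                throw new IllegalArgumentException("responses for different requests");
            }
            return Arrays.equals(online.body, validated.body);
        }
    }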
Validating Our Prototype: Results
Live operator experiments: operator given the choice of validation type and duration, and the option to skip validation
Validation caught 6 out of 9 mistakes from the 8 experiments with validation
Mistake-injection experiments: validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (an incorrect number of workers in the Web server, which degraded throughput)
Operator-emulation experiments: operator command scripts derived from the 42 operator mistakes
Both trace-based and replica-based validation caught 22 mistakes
Multi-component validation caught 4 latent (component-interaction) mistakes
Reduction in Impact with Validation
[Bar chart: number of mistakes per impact category, with and without validation — degraded throughput, service inaccessible, increased MTTR, incomplete component integration, security vulnerability, web server potentially inaccessible, reduced system capacity, potential database crash.]
Reduction in Mistakes with Validation
[Bar chart: number of mistakes per mistake category, with and without validation — local config, global config, incorrect restart, start of wrong SW version, unnecessary restart of SW, unnecessary HW replacement, wrong choice of HW.]
Shunting & Buffering Overheads
Shunting overhead for replica-based validation: 39% additional CPU, since all requests and responses are captured and forwarded to the validation slice
Trace-based validation is slightly cheaper: 32% additional CPU
The overhead is incurred on a single component, and only during validation
Various optimizations can reduce the overhead to 13-22%. Examples: response summaries (64-byte), sampling (at session boundaries); a summary sketch follows
Buffering capacity during state checkpointing and duplication: only about 150 requests need buffering for small state sizes
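One plausible realization of the 64-byte response summary is a cryptographic digest, since SHA-512 happens to produce exactly 64 bytes; whether the original system used hashing for its summaries is an assumption here:

    // Hypothetical response-summary optimization: instead of forwarding the
    // full response body to the validation slice, forward a fixed-size
    // digest. SHA-512 yields exactly 64 bytes; whether the original system
    // used hashing for its 64-byte summaries is an assumption.
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    class ResponseSummary {
        static byte[] summarize(byte[] responseBody) {
            try {
                MessageDigest md = MessageDigest.getInstance("SHA-512");
                return md.digest(responseBody); // 64-byte summary to compare
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError("SHA-512 is available in standard JDKs", e);
            }
        }
    }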
Caveats, Limitations, and Open Issues
Non-determinism increases the complexity of comparators and proxies, e.g., choice of back-end server, remote cache vs. local disk, pseudo-random session IDs, timestamps (see the normalization sketch below)
Hard-state management may require operator intervention: the component requires initialization prior to online migration
Bootstrapping the validation: validating an intended modification of service behavior – no traces or replica for comparison!
How long to validate? What types of validation? Time spent in validation implies reduced online capacity
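A common way to cope with such non-determinism is to normalize volatile fields before comparing; a hypothetical sketch, where the regular expressions and field formats are illustrative assumptions, not taken from the original system:

    // Hypothetical normalization of non-deterministic response fields
    // before comparison; the patterns below are illustrative assumptions.
    class ResponseNormalizer {
        static String normalize(String body) {
            return body
                // mask pseudo-random session IDs, e.g., "sessionid=3f9a..."
                .replaceAll("sessionid=[A-Za-z0-9]+", "sessionid=X")
                // mask timestamps, e.g., "2004-10-05 12:34:56"
                .replaceAll("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}", "TIME");
        }
    }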
Conclusions & Future Work
Gathered data on operator execution & mistakes: the majority of the mistakes were configuration errors; many of them degraded system throughput
Validation is an effective technique for catching operator mistakes: simple techniques caught the majority of mistakes; feasible in both overhead and implementation effort
'Validation ready' components: hooks for logging, forwarding & buffering messages, saving/restoring state
Future work, taking validation further: validate operator actions on databases and network components; combine validation with diagnosis to assist operators; other validation techniques, e.g., model-based validation
Acknowledgements
We are thankful to our volunteer operators: fellow students, professional programmers, and LCSR staff members
We also would like to express our gratitude to Christine Hung, Neeraj Krishnan, and Brian Russell for their help in building the monitoring infrastructure in the early stages of the project
Thank you!
Questions?
For more information and traces of operator experiments:
http://vivo.cs.rutgers.edu
Backup Slides
Operator Mistakes: Category Vs Impact
[Stacked bar chart: mistakes per impact category, broken down by mistake category. Legend counts: local misconfiguration (8), global misconfiguration (16), incorrect restart (3), start of wrong SW version (8), unnecessary restart of SW component (3), unnecessary HW replacement (3), wrong choice of HW component (1).]
64% of all mistakes had an immediate impact on service performance; 36% resulted in latent faults.
Obs. #1: A significant number of mistakes can be caught by testing with a realistic workload
Obs. #2: Undetectable latent errors will still require online-recovery techniques
Mendosus and Slice Isolation
Mendosus virtualizes a network of nodes on an Ethernet LAN
Injects network-level failures, including network partitions
Allows easy isolation of nodes into online and validation slices
Migration does not require any network level modifications
Validation Techniques
Trace-Based Validation: request/response trace logged to disk
State management: state checkpointed to disk is used for initialization
Validation scenarios: can have higher directed coverage
Replica-Based Validation: real-time forwarding from the online replica
State management: state from the replica is used directly for initialization
Validation scenarios: reflects current online characteristics
Multi-Component Validation: tests interaction with working components from the online slice
Implementation Details
Shunting performed in the middleware layer, e.g., in the auction service: Apache's mod_jk module, Tomcat valves, the JDBC driver
Each request is tagged with a unique ID all along the request path
Component proxies can be constructed with little effort: reuse discovery and communication interfaces, and add a common request/response messaging core
E.g., the auction service required 4 proxies, derived by adding/modifying only 232, 307, 274, and 384 lines of C/Java code
State management requires a well-defined export and import API (a sketch follows). Stateful servers often support such an API.
For the Tomcat app server, the regular state manager required a small modification to export state to the validation infrastructure
Simple comparator functions detect errors: throughput, flow, and content comparators
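A hedged sketch of what such an export/import state API might look like; the interface and method names are hypothetical, not Tomcat's actual session-manager API:

    // Hypothetical state export/import API for a stateful server
    // component; these names are illustrative, not Tomcat's interfaces.
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    interface StatefulComponent {
        /** Serialize component state (e.g., sessions) for the validation slice. */
        void exportState(OutputStream out) throws IOException;

        /** Initialize this component from previously exported state. */
        void importState(InputStream in) throws IOException;
    }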
Shunting & Buffering Overheads
Various optimizations can reduce the overhead to 13-22%. Examples: response summaries (64-byte), sampling (at session boundaries)
Buffering capacity during state checkpointing and duplication: only about 150 requests need buffering for small state sizes