/ Monitoring and Observability in Complex Architectures Tuesday, October 2, 12
Jan 15, 2015
Monitoring and Observability
in Complex Architectures
Tuesday, October 2, 12
Hi! I’m @postwait
I founded @OmniTI and @MessageSystems and @Circonus
Hi! I’m @postwait
I am very active in @TheOfficialACM, participating in @ACMQueue and the practitioners board.
Hi! I’m @postwait
I (regrettably) build complex systems.
Why we are here
We’re here to talk about coping with breakage
Rule #1
Direct observation of failure leads to quicker rectification.
Rule #2
You cannot correct what you cannot measure.
Solution Approach #1
Debugging failures requires either visibility into the precipitating state
Precipitating State
Single threaded applications
✓ Easy
Precipitating State
Multi-threaded applications
✓ Challenging
Precipitating State
Distributed applications
here there be dragons
Solution Approach #2
or direct observation of a (and likely very many) failing transaction
Direct Observation
Observing something fail... is priceless.
Direct Observation
Observation leads to intelligent questioning.
Direct Observation
Questioning leads to answers... but only through more observation.
and herein lies the rub.
Leaning Towards Scientific Process
In production you don’t have
• repeatability
• control groups
• external verification
... or do you?
What’s monitoring got to do with it?
Monitoring is all about the passive observation of telemetry data.
Monitoring Telemetry
cannot pinpoint problems
can provide evidence of the existence of a problem
Monitoring
Gives you evidence that there is a problem
Monitoring
Gives you evidence that you have fixed a problem (or at least the symptoms)
Monitoring Tactically
If it could be of interest, measure it and expose the measurement
Monitoring: embedded
statsd https://github.com/etsy/statsd
resmon http://labs.omniti.com/labs/resmon
metrics https://github.com/codahale/metrics
folsom https://github.com/boundary/folsom
metrics.js https://github.com/mikejihbe/metrics
metrics-net https://github.com/danielcrenna/metrics-net
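As a rough illustration of "measure it and expose the measurement", the sketch below times a block of code and emits the result as a statsd-style timer (`<name>:<value>|ms`) over UDP. The metric name `api.request_time`, the helper names, and the target `127.0.0.1:8125` (statsd's usual default port) are assumptions for illustration, not something the slides specify.

```python
import socket
import time
from contextlib import contextmanager


def format_timer(name, millis):
    """Render a timer in the statsd line format: <name>:<value>|ms"""
    return f"{name}:{millis}|ms"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; a lost metric should never break the app."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    except OSError:
        pass  # telemetry is best effort
    finally:
        sock.close()


@contextmanager
def timed(name, host="127.0.0.1", port=8125):
    """Measure the wall-clock time of the enclosed block and expose it."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        send_metric(format_timer(name, elapsed_ms), host, port)


if __name__ == "__main__":
    with timed("api.request_time"):  # hypothetical metric name
        time.sleep(0.05)             # stand-in for real work
```

UDP is used here because the measurement path should stay cheap and must not fail the request it is measuring.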
Monitoring: collection
reconnoiter http://labs.omniti.com/labs/reconnoiter
graphite http://graphite.wikidot.com/
OpenTSDB http://opentsdb.net/
circonus http://circonus.com/
librato https://metrics.librato.com/
Monitoring: Bling
visualizing an architecture rollout
Monitoring: Bling
visualizing the impact on service times
average API service time latency
actual API service time latency
http://www.slideshare.net/postwait/atldevops
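One way to see why an average can differ so much from the actual latency distribution is a small numeric sketch. The sample data below is entirely hypothetical (95 fast requests and 5 very slow ones), and the nearest-rank `percentile` helper is just one common definition, chosen here for simplicity.

```python
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Hypothetical service times in milliseconds: most requests are fast,
# a small fraction are pathologically slow.
latencies_ms = [10] * 95 + [1000] * 5

print("mean:", mean(latencies_ms))            # 59.5
print("p50: ", percentile(latencies_ms, 50))  # 10
print("p99: ", percentile(latencies_ms, 99))  # 1000
```

In this made-up data set no request actually took anywhere near 59.5 ms, which is the kind of detail a single averaged line hides.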
Monitoring: Bling
Repeatability is a Pipe Dream
Your production problem is a (hopefully pathological) outcome of circumstance.
A circumstance which often cannot be repeated.
Control Groups
Control groups can compensate for the inability to precisely repeat an experiment.
Control Groups
Most architectures have redundancy.
Control Groups
With the right design, you can turn that redundancy into a debugging environment.
[1] http://omniti.com/surge/2012/sessions/xtreme-deployment
Control Groups: Simple Example
I have 10 web servers
I fix 1
I verify 1 is fixed
I verify 9 are still broken
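The verification step can be sketched as a comparison of error rates between the fixed host and the untouched control hosts. The host names and the per-host counters below are hypothetical; in practice they would come from whatever telemetry is already being collected.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed (0.0 when there was no traffic)."""
    return errors / requests if requests else 0.0


def compare_groups(stats, treated):
    """stats maps host -> (errors, requests); treated is the set of fixed hosts.

    Returns (treated_rate, control_rate), each pooled across its group.
    """
    def pooled(hosts):
        errs = sum(stats[h][0] for h in hosts)
        reqs = sum(stats[h][1] for h in hosts)
        return error_rate(errs, reqs)

    control = [h for h in stats if h not in treated]
    return pooled(treated), pooled(control)


# Hypothetical counters: ten web servers, each failing 5% of requests...
stats = {f"web{i}": (50, 1000) for i in range(1, 11)}
# ...except web1, which received the candidate fix.
stats["web1"] = (0, 1000)

fixed_rate, control_rate = compare_groups(stats, {"web1"})
print(f"fixed: {fixed_rate:.1%}  control: {control_rate:.1%}")
```

If the fixed host improves while the nine control hosts stay broken over the same period, the change is a far more credible explanation than a coincidental shift in traffic.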
Control Groups: Seems Easy
Web servers tend to be:
• homogeneous
• share-(nothing|little)
• independent
Control Groups: Not So Easy
Most other services aren’t so homogeneous and equal: databases, batch processes (think billing), orchestration middleware, message queues
Observability
Some might claim that seeing telemetry data is observation...
It is doubly indirect at best.
Observability
I want to directly see the errant behaviour
Observability is forgiving
In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.
Observing the network
tcpdump / snoop
wireshark
Observing the network
Looking at just the arrival of new connections
tcpdump -nnq -tttt -s 384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
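The expression `tcp[13] & (2|16) == 2` in the filter above tests the TCP flags byte: bit 2 is SYN and bit 16 is ACK, so it matches segments with SYN set and ACK clear, which is the first packet of a new connection. The same check, written out as a small sketch (the constant and function names are ours):

```python
TCP_SYN = 0x02  # bit 2 of the flags byte, tcp[13]
TCP_ACK = 0x10  # bit 16 of the flags byte


def is_new_connection(flags):
    """True for a bare SYN: SYN set and ACK clear, as in the tcpdump filter."""
    return (flags & (TCP_SYN | TCP_ACK)) == TCP_SYN


print(is_new_connection(TCP_SYN))            # client's opening SYN
print(is_new_connection(TCP_SYN | TCP_ACK))  # server's SYN+ACK reply
print(is_new_connection(TCP_ACK))            # ordinary ACK
```

The three calls print True, False and False respectively, so only the client's opening packet is counted.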
Observing the network
Looking at just the data arrival and departure times
tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
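The arithmetic in that filter computes the TCP payload size: the IP total length, minus the IP header length, minus the TCP header length. A non-zero result means the packet carries data rather than being a bare ACK or handshake segment. The sketch below repeats the calculation on raw bytes, assuming the buffer starts at an IPv4 header with no link-layer framing; `build_packet` is a hypothetical helper that fills in only the fields the calculation reads.

```python
def tcp_payload_length(packet):
    """Payload bytes in an IPv4/TCP packet, mirroring the tcpdump expression."""
    total_length = int.from_bytes(packet[2:4], "big")  # ip[2:2]
    ip_header_len = (packet[0] & 0x0F) * 4             # (ip[0]&0xf)*4
    tcp_offset_byte = packet[ip_header_len + 12]       # tcp[12]
    tcp_header_len = (tcp_offset_byte & 0xF0) // 4     # (tcp[12]&0xf0)/4
    return total_length - ip_header_len - tcp_header_len


def build_packet(payload):
    """Minimal fake packet: 20-byte IP header, 20-byte TCP header, payload."""
    ip = bytearray(20)
    ip[0] = 0x45                                       # version 4, IHL 5 words
    ip[2:4] = (40 + len(payload)).to_bytes(2, "big")   # total length
    tcp = bytearray(20)
    tcp[12] = 0x50                                     # data offset 5 words
    return bytes(ip + tcp) + payload


print(tcp_payload_length(build_packet(b"")))       # 0: would be filtered out
print(tcp_payload_length(build_packet(b"GET /")))  # 5: carries data
```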
Observing the network
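Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter).
{ # split "addr.port" into separate fields and strip ":" and "="
  gsub(".[0-9]+(: | >)"," \& "); gsub("[:=]"," ");
  # EP is the client endpoint (address + port): whichever side is not port 80
  EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);
  # a server packet right after a client packet: print the elapsed time
  if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }
  # remember who spoke last on this endpoint, and when
  S[EP]= ($4==".80")?"S":"C"; L[EP]= $1; }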
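The pairing logic of the awk filter can be restated as a short sketch: for each client endpoint, remember who spoke last and when, and report the gap whenever a server packet follows a client packet. The tuple layout `(timestamp, (src_addr, src_port), (dst_addr, dst_port))` and the sample addresses are assumptions for illustration; the awk version works on tcpdump's text output instead.

```python
def response_latencies(packets, server_port=80):
    """Return [(client_endpoint, seconds)] for each client->server turnaround.

    packets: iterable of (timestamp, (src_addr, src_port), (dst_addr, dst_port))
    in capture order.
    """
    last_speaker = {}  # endpoint -> "C" or "S"
    last_time = {}     # endpoint -> timestamp of the previous packet
    results = []
    for ts, src, dst in packets:
        from_server = src[1] == server_port
        endpoint = dst if from_server else src  # always the client side
        if from_server and last_speaker.get(endpoint) == "C":
            results.append((endpoint, ts - last_time[endpoint]))
        last_speaker[endpoint] = "S" if from_server else "C"
        last_time[endpoint] = ts
    return results


# Hypothetical capture: one request, then two response packets.
client = ("10.0.0.1", 50000)
server = ("10.0.0.2", 80)
capture = [
    (1.00, client, server),  # question
    (1.25, server, client),  # first byte of the answer
    (1.50, server, client),  # more answer; not a new turnaround
]
print(response_latencies(capture))
```

Only the first server packet after a client packet produces a measurement, so multi-packet responses are not counted more than once.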
Observing the network
Observing user-space
strace[1] / truss
gstack / pstack
gcore + gdb / dbx / mdb[2]
[1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
[2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf
System call tracing
Watching sshd is a good way to get familiar.
truss -f -p `pgrep sshd`
System call tracing
An active web server is going to be like a firehose.
truss -f -p `pgrep httpd`
Observing the system
DTrace
Live production demo or GTFO.
Thank You
Questions?