Mike Brittain @ mikebrittain
Director of engineering, Infrastructure
Metrics-Driven Engineering
October 13, 2011
Tools and Process at Etsy
How do people use Etsy?
How many new visits?
How many listings created?
How many registrations?
How many convos sent?
How many purchases?
How many new shops?
What is the application doing?
Search indexing?
How fast are pages generating?
Async tasks currently in queue?
Developer API auth and rate limiting?
Images resized and stored?
Error and warning rates?
Replication slave lag?
Memcache hits/misses?
Available connections?
Are the servers in good shape?
Database queries per second?
Total outgoing bandwidth?
CPU, Memory, I/O?
Business Metrics
Application Metrics
System Metrics
Visibility EVERYWHERE
Constant Change
$314 Million GMS 2010
$180 Million GMS 2009
$87 Million GMS 2008
$26 Million GMS 2007
credit: pentarux (flickr)
25 Million Unique Visitors
1 Billion page views per month
credit: pentarux (flickr)
Engineering team grew 500% over 18 months
credit: martin_heigan (flickr)
Less talk, more do.
Always Be Shipping (even if it’s your first day)
credit: ibailemon (flickr)
90+ Engineers
40+ Deploys / day
credit: misswired (flickr)
credit: digidave (flickr)
Code Reviews
Automated Tests
$cfg = array(
    'checkout'   => array('enabled' => 'on'),
    'homepage'   => array('enabled' => 'on'),
    'profiles'   => array('enabled' => 'on'),
    'new_search' => array('enabled' => 'off'),
);
Config Flags
Enable and disable features quickly
Plus “admin-only,” percentage ramp-up, A/B testing, whitelists, blacklists, etc...
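A minimal Python sketch of how a flag check with "admin-only" and percentage ramp-up modes might look. The `FLAGS` layout beyond the `enabled` key, the `is_enabled` helper, and the hashing scheme are all assumptions made for illustration; the slides do not show how Etsy implements these modes.

```python
import hashlib

# Hypothetical flag table. Only the 'enabled' key appears in the slides;
# the 'admin' and 'rampup' modes and the 'percent' key are illustrative.
FLAGS = {
    "checkout": {"enabled": "on"},
    "new_search": {"enabled": "rampup", "percent": 10},
    "profiles_v2": {"enabled": "admin"},
    "old_feature": {"enabled": "off"},
}


def bucket(feature, user_id):
    """Map (feature, user) to a stable bucket in the range 0..99."""
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature, user_id, is_admin=False, flags=FLAGS):
    """Decide whether a feature is on for this user."""
    cfg = flags.get(feature, {"enabled": "off"})
    mode = cfg["enabled"]
    if mode == "on":
        return True
    if mode == "admin":
        return is_admin
    if mode == "rampup":
        # The same user always lands in the same bucket, so raising
        # 'percent' only ever adds users and never flips existing ones off.
        return bucket(feature, user_id) < cfg.get("percent", 0)
    return False
```

Hashing the feature name together with the user ID keeps the ramp-up groups independent across features, so the same users are not always the first to see every experiment.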
Failure is not an option
Failure is inevitable!
a learning opportunity!
DETECTABLE!
Access
Detect problems quickly
CONFIDENCE
A: Well, the Ops team manages the network, racks the servers, installs the monitoring tools, wears the pagers, blah, blah, blah...
Engineers build the application
OPS
Logging
Graphing
Trending
Alerting
ENG
“Engineers are too busy writing features to build metrics.”
Metrics are part of every feature...and so are config flags
Dead Simple
Simple, open source tools
Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)
Logging
Logster
StatsD
Ganglia
Cluster-oriented
Huge community contributed recipes
Custom metrics (gmetad)
Graphite
Single-instance
Create new metrics on the fly
Customize via URLs and display functions
Logging
It’s 2:48 PM.
Do you know where your logs are?
Logger::log_error("User login failed. Reason: $msg for $username", "login");
web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
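The log line above carries six fields: host, timestamp, level, namespace, a per-request token, and the message. A minimal Python sketch of a formatter that produces the same layout follows. The function names and the way the request token is generated are assumptions; the slides only show the PHP call and the resulting line.

```python
import socket
import time
import uuid


def new_request_id():
    """Hypothetical per-request token; the slides do not say how theirs is made."""
    return uuid.uuid4().hex[:10]


def format_log_line(level, namespace, request_id, message, host=None, now=None):
    """Build 'host [timestamp] [level] [namespace] [request_id] message'."""
    host = host or socket.gethostname()
    # time.localtime(None) means "now"; pass an epoch value to fix the time.
    stamp = time.strftime("%a %b %d %H:%M:%S %Y", time.localtime(now))
    return f"{host} [{stamp}] [{level}] [{namespace}] [{request_id}] {message}"
```

Tagging every line with the same request token is what makes it possible to pull together all log entries for a single page view with one search.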
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{True-Client-IP}i %l %t \"%r\" %>s %b \"%{Referer}i\" \
\"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V \
%{etsy_ab_selections}n %{etsy_request_uuid}n \
%{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n \
%{php_time_microsec}n %D" combined
apache_note()
grep "/listing/" access.log | \
  awk '{sum=sum+$(NF-2)} END {print sum/NR}'
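The same one-liner can be sketched in Python: keep lines containing `/listing/`, take the third field from the end of each (the equivalent of awk's `$(NF-2)`), and average it. The function name and parameters are illustrative. One deliberate difference from awk: lines where the field is not numeric are skipped here, whereas awk would count them as zero.

```python
def average_field(lines, needle="/listing/", field_from_end=3):
    """Average a numeric field, counted from the end of each matching line."""
    total = 0.0
    count = 0
    for line in lines:
        if needle not in line:
            continue
        parts = line.split()
        if len(parts) < field_from_end:
            continue
        try:
            total += float(parts[-field_from_end])
        except ValueError:
            continue  # field is not numeric (for example "-"), so skip the line
        count += 1
    return total / count if count else None
```

Because a file object iterates line by line, passing an open `access.log` handle to `average_field` processes the log without loading it into memory.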
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling
Fatals Errors Warnings
Logster
github.com/etsy
Run by cron
Keeps a cursor on your log file
Aggregate lines any way you want
Output to Ganglia or Graphite
Simple parsers
web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]
if (fields['log_level'] == "fatal"): self.fatals += 1
elif (fields['log_level'] == "error"): self.errors += 1
elif (fields['log_level'] == "warning"): self.warnings += 1
...
MetricObject("fatals", (self.fatals / self.duration), "per sec")
MetricObject("errors", (self.errors / self.duration), "per sec")
MetricObject("warning", (self.warnings / self.duration), "per sec")
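Put together, the fragments above amount to a small parser class. The following is a self-contained Python sketch, not the actual Logster code: `MetricObject` is a stand-in namedtuple so the example runs without Logster installed, and the `parse_line` / `get_state` method names follow Logster's parser interface as I understand it. The pattern uses `[^\]]+` character classes so that the capture lands on the level field and not on a later bracketed field.

```python
import re
from collections import namedtuple

# Stand-in for Logster's MetricObject so the sketch runs on its own.
MetricObject = namedtuple("MetricObject", ["name", "value", "units"])


class ErrorLevelParser:
    """Counts fatal/error/warning lines and reports per-second rates."""

    # host, then [timestamp], then [level]; anything after is ignored.
    LINE_RE = re.compile(r"^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]")

    def __init__(self):
        self.fatals = 0
        self.errors = 0
        self.warnings = 0

    def parse_line(self, line):
        """Called once per new log line since the last run."""
        match = self.LINE_RE.match(line)
        if not match:
            return  # ignore lines that do not fit the expected layout
        level = match.group("log_level")
        if level == "fatal":
            self.fatals += 1
        elif level == "error":
            self.errors += 1
        elif level == "warning":
            self.warnings += 1

    def get_state(self, duration):
        """Return rates over `duration` seconds since the previous run."""
        return [
            MetricObject("fatals", self.fatals / duration, "per sec"),
            MetricObject("errors", self.errors / duration, "per sec"),
            MetricObject("warning", self.warnings / duration, "per sec"),
        ]
```

Dividing by the run interval turns raw counts into rates, so the graph stays comparable even if the cron schedule changes.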
Fatals Errors Warnings
StatsD
github.com/etsy
Network daemon (node.js)
Accepts data over UDP
Flushes to Graphite every 10 sec
One line of code
StatsD::increment("logins.success");
logins
StatsD::timing("gearman.time", $msec);
90th pct
average
lower
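Under the hood, each of these calls sends one small UDP datagram. A minimal Python sketch of a client follows, using the StatsD wire format (`name:value|c` for counters, `name:value|ms` for timers). The function names and the default host are assumptions; 8125 is the port StatsD listens on by default.

```python
import socket


def counter_payload(name, delta=1):
    """Wire format for a counter increment."""
    return f"{name}:{delta}|c"


def timing_payload(name, msec):
    """Wire format for a timing sample, in milliseconds."""
    return f"{name}:{msec}|ms"


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: one UDP datagram, no reply expected."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
```

For example, `send(counter_payload("logins.success"))` mirrors the increment call above. Because UDP needs no connection and no acknowledgement, a slow or missing StatsD daemon cannot hold up the request that emitted the metric.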
Ad hoc
name value timestamp
echo "events.deploy.site 1 `date +%s`" \
  | nc graphite.etsycorp.com 2003
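The same ad hoc write can be sketched in Python: open a TCP connection to Graphite's plaintext port and send one `name value timestamp` line terminated by a newline. The function names, the default host, and the timeout are assumptions for illustration.

```python
import socket
import time


def graphite_line(name, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{name} {value} {ts}\n"


def send_event(name, value=1, host="localhost", port=2003, timeout=2.0):
    """Send a single data point, e.g. to mark a deploy."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(graphite_line(name, value).encode("ascii"))
```

Calling `send_event("events.deploy.site")` from a deploy script records a data point at the moment of each deploy, which can then be overlaid on any other graph.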
Vertical Line Technology!
target=drawAsInfinite(events.deploy.site)
We could stare at graphs all day...
http://graphite/render?from=-1hours&width=600&height=200
&target=webs.errorLog.warning&rawData=1
webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.0,1.0,1.0,None
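The raw response above has a simple shape: a header of `name,start,end,step`, a `|`, then comma-separated values, with `None` marking a missing point. A minimal Python sketch of a parser for one such line follows; the function name and the returned dict layout are assumptions.

```python
def parse_raw_series(line):
    """Parse one 'name,start,end,step|v1,v2,...' line into a dict."""
    header, _, body = line.strip().partition("|")
    # Split from the right: a target name may itself contain commas.
    name, start, end, step = header.rsplit(",", 3)
    start, end, step = int(start), int(end), int(step)
    values = [None if v == "None" else float(v) for v in body.split(",") if v]
    # Pair every value with the timestamp of its interval.
    points = [(start + i * step, v) for i, v in enumerate(values)]
    return {"name": name, "start": start, "end": end, "step": step, "points": points}
```

Once the series is a list of `(timestamp, value)` pairs, a script can compare recent values against a threshold without anyone having to watch the graph.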
Holt-Winters Confidence Bands
lower
upper
Holt-Winters Aberration
Business metrics
+ Confidence bands
_____________
Alertable metrics
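A minimal Python sketch of turning a metric plus its confidence bands into an alert decision. The bands are taken as inputs here (for instance, series produced by Graphite's `holtWintersConfidenceBands` function); the sketch does not compute the Holt-Winters forecast itself. The function names and the "N consecutive points" rule are assumptions, not something the slides specify.

```python
def aberration(value, lower, upper):
    """Signed distance outside the band; 0.0 when inside or when data is missing."""
    if value is None or lower is None or upper is None:
        return 0.0
    if value > upper:
        return value - upper
    if value < lower:
        return value - lower
    return 0.0


def should_alert(values, lowers, uppers, min_consecutive=3):
    """Alert when the last `min_consecutive` points all fall outside the band."""
    recent = list(zip(values, lowers, uppers))[-min_consecutive:]
    if len(recent) < min_consecutive:
        return False
    return all(aberration(v, lo, hi) != 0.0 for v, lo, hi in recent)
```

Requiring several consecutive out-of-band points is one way to keep a single noisy sample from paging someone.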
40,000+ metrics at Etsy
Systems, Applications, Business
Dashboards
<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>
Kind of Hard :-/
$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
echo $g->getDashboardHTML(280, 220);
Super Easy!
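The slides do not show what is inside the `Graphite` helper. A minimal Python sketch of the idea follows: collect a title and targets, then generate the render URL and the thumbnail-linking-to-full-size HTML. The class name, method names, and the 800x600 full-size default are hypothetical; the query parameters (`from`, `width`, `height`, `title`, `yMin`, `hideLegend`, `target`, `colorList`) are the ones visible in the hand-written link above.

```python
from html import escape
from urllib.parse import urlencode


class GraphiteGraph:
    """Hypothetical helper that builds Graphite render URLs."""

    def __init__(self, base_url, time_range="-1hours"):
        self.base_url = base_url.rstrip("/")
        self.time_range = time_range
        self.title = ""
        self.metrics = []  # list of (target, color) pairs

    def set_title(self, title):
        self.title = title

    def add_metric(self, target, color):
        self.metrics.append((target, color))

    def render_url(self, width, height, hide_legend=False):
        params = [("from", self.time_range), ("width", width),
                  ("height", height), ("title", self.title), ("yMin", 0)]
        if hide_legend:
            params.append(("hideLegend", 1))
        params += [("target", target) for target, _ in self.metrics]
        params.append(("colorList", ",".join(c for _, c in self.metrics)))
        # urlencode percent-escapes '#', '(' and ')' so nobody has to by hand.
        return f"{self.base_url}/render?{urlencode(params)}"

    def dashboard_html(self, width, height):
        """Thumbnail image that links to a larger version of the same graph."""
        full = escape(self.render_url(800, 600))
        thumb = escape(self.render_url(width, height, hide_legend=True))
        return f'<a href="{full}"><img src="{thumb}"></a>'
```

Keeping the URL assembly in one place means adding a standard overlay, such as deploy lines on every graph, is a single change to the helper.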
Metrics!
Metrics + Events
Metrics + Alerts
Metrics + Metrics
High-level, real-time visibility
Detect problems quickly
CONFIDENCE
Make them required features
Make them dead simple
Make them accessible
Make them!
Thank You
Homework
codeascraft.etsy.com
github.com/etsy
We’re always looking for people who are interested in this kind of stuff...
etsy.com/careers
Get in touch
mike @ etsy . com
@ mikebrittain