Failure Happens
F***, the F***ing thing is F***king F***ed*
*Official WebOps term from Artur Bergman
Jesse Robbins [email protected]
Jan 15, 2015
This will be on the test:
FAILURE HAPPENS!
25% Pyromaniac
75% Paranoid
Good Book!
“multiple and unexpected interactions of failures are inevitable”
- Charles Perrow
Failure Happens
define: Nines (roughly, downtime per year)
99% 5256 min (3.65 days)
99.9% 526 min (8.8 hours)
99.99% 53 min
99.999% 5 min
99.9999% 30 seconds
99.99999% 3 seconds
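The figures above follow from simple arithmetic: a year has 525,600 minutes, and each extra nine divides the allowed downtime by ten. A minimal sketch, assuming a 365-day year; the function names are illustrative and not from the talk:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability with `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def availability_label(nines: int) -> str:
    """Render 2 as '99%', 3 as '99.9%', and so on."""
    percent = 100 - 10.0 ** (2 - nines)
    return f"{percent:.{max(nines - 2, 0)}f}%"


def format_duration(minutes: float) -> str:
    """Pick a readable unit for a duration given in minutes."""
    if minutes >= 24 * 60:
        return f"{minutes / (24 * 60):.2f} days"
    if minutes >= 60:
        return f"{minutes / 60:.1f} hours"
    if minutes >= 1:
        return f"{minutes:.1f} min"
    return f"{minutes * 60:.1f} sec"


if __name__ == "__main__":
    # Print one row per level, from two nines to seven nines.
    for n in range(2, 8):
        print(f"{availability_label(n):>10}  {format_duration(downtime_minutes(n))}")
```

The slide rounds these values ("roughly"), so small differences from the computed numbers are expected.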
Internet Routing... won’t.
#googlefail
YOU
Continuous Power... isn’t
365 Main SF
364.96 Main SF
Failure happens
A single datacenter is the problem
• Since they all fail at some point
Recovery procedures after failure
• Power was gone ~45 minutes
• Most services took hours to come back
• Some unnamed ones more than 12 hours
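To put those outage durations in terms of nines, the arithmetic can be run in reverse. This is a rough sketch that assumes the outage was the only downtime in a 365-day year; the function names are hypothetical:

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def availability_percent(downtime_minutes: float) -> float:
    """Yearly availability, in percent, given total downtime in minutes."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


def whole_nines(downtime_minutes: float) -> int:
    """Number of complete nines still met with this much yearly downtime."""
    if downtime_minutes <= 0:
        raise ValueError("downtime must be positive")
    unavailability = downtime_minutes / MINUTES_PER_YEAR
    return math.floor(-math.log10(unavailability))


if __name__ == "__main__":
    # Durations taken from the slide: ~45 minutes without power,
    # and more than 12 hours for the slowest services (12 h is a lower bound).
    for label, minutes in [("power loss", 45), ("slowest recovery", 12 * 60)]:
        print(f"{label}: {availability_percent(minutes):.3f}% "
              f"({whole_nines(minutes)} nines)")
```

Under these assumptions the 45-minute power loss alone still fits a four-nines year, while a service that took 12 hours to recover falls below three nines.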
Truck 1, Rackspace 0
Geography is a Single Point of Failure
June 200[?]: Internet Threats
Impacted Countries
Darker colors in [illegible] represent countries with prefix outages of at least [value illegible]
Taser-wielding robbers
C I Host's Chicago facility robbed twice!
(the other two times were merely "break-ins where things were stolen")
Providers are baskets too.
Failure Happens.
Anyone promising otherwise is either foolish or lying
(or both).