1 Dependability in the Internet Era
31

1 Dependability in the Internet Era. 2 Outline The glorious past (Availability Progress) The dark ages (current scene) Some recommendations.

Mar 26, 2015

Audrey McMahon
Page 1: 1 Dependability in the Internet Era. 2 Outline The glorious past (Availability Progress) The dark ages (current scene) Some recommendations.

1

Dependability in the Internet Era

2

Outline

• The glorious past (Availability Progress)

• The dark ages (current scene)

• Some recommendations

3

Preview
The Last 5 Years: Availability Dark Ages. Ready for a Renaissance?

• Things got better, then things got a lot worse!

[Figure: availability (axis from 9% to 99.999%) versus year (1950 to 2000) for Computer Systems, Telephone Systems, Cellphones, and the Internet.]

4

DEPENDABILITY: The 3 ITIES

• RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1)
• AVAILABILITY: Does it now. (also 1 >> MTTR)
  Availability = MTTF / (MTTF + MTTR)
• System Availability: if 90% of terminals are up & 99% of DB is up? (=> 89% of transactions are serviced on time).
• Holistic vs. Reductionist view

[Diagram labels: Security, Integrity, Reliability, Availability]
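
The arithmetic on this slide can be checked with a short sketch. It assumes a transaction needs both its terminal and the database to be up and that the two fail independently, so the availabilities multiply. The function names are illustrative and do not come from the talk.

```python
def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability: the fraction of time a module is up."""
    return mttf / (mttf + mttr)


def series_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up.

    Assumes the parts fail independently of one another.
    """
    result = 1.0
    for a in parts:
        result *= a
    return result


# Slide example: 90% of terminals up and 99% of the DB up.
system = series_availability(0.90, 0.99)
print(f"{system:.1%}")  # roughly 89% of transactions serviced on time
```

The same product rule is why a long chain of individually good components can still give a poor end-to-end result.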

5

Fail-Fast is Good, Repair is Needed

Improving either MTTR or MTTF gives benefit

Simple redundancy does not help much.

[Diagram: lifecycle of a module: Fault, Detect, Repair, Return. Fail-fast gives short fault latency.]

High Availability is low UN-Availability

Unavailability ~ MTTR / MTTF
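
A rough sketch of the three claims on this slide. The approximation Unavailability ~ MTTR / MTTF is from the slide. The two duplex formulas are the usual textbook results for a pair of identical modules with independent, exponentially distributed failures; they are an assumption added here, as are the one-year MTTF and four-hour repair time.

```python
def unavailability(mttf: float, mttr: float) -> float:
    """Exact steady-state unavailability; close to mttr / mttf when mttr << mttf."""
    return mttr / (mttf + mttr)


def duplex_mttf_without_repair(mttf: float) -> float:
    """Pair with no repair: first failure after mttf / 2, the survivor lasts another mttf."""
    return 1.5 * mttf


def duplex_mttf_with_repair(mttf: float, mttr: float) -> float:
    """Pair with repair: an outage needs a second failure inside the repair window.

    Approximation that holds when mttr << mttf.
    """
    return mttf ** 2 / (2.0 * mttr)


MTTF_HOURS = 8760.0  # hypothetical module: one failure per year
MTTR_HOURS = 4.0     # hypothetical repair time

# Improving either MTTF or MTTR by 10x cuts unavailability about 10x.
print(unavailability(MTTF_HOURS, MTTR_HOURS))
print(unavailability(10 * MTTF_HOURS, MTTR_HOURS))
print(unavailability(MTTF_HOURS, MTTR_HOURS / 10))

# Redundancy alone gains 50%; redundancy plus repair gains orders of magnitude.
print(duplex_mttf_without_repair(MTTF_HOURS) / 8760.0, "years")
print(duplex_mttf_with_repair(MTTF_HOURS, MTTR_HOURS) / 8760.0, "years")
```

Under these assumptions the unrepaired pair lasts 1.5 years while the repaired pair lasts on the order of a thousand years, which is the sense in which simple redundancy "does not help much" without repair.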

6

Fault Model

• Failures are independent. So, single fault tolerance is a big win
• Hardware fails fast (dead disk, blue-screen)
• Software fails fast (or goes to sleep)
• Software often repaired by reboot:
  – Heisenbugs
• Operations tasks: major source of outage
  – Utility operations
  – Software upgrades

7

Disks (raid) the BIG Success Story

• Duplex or Parity: masks faults
• Disks @ 1M hours (~100 years)
• But
  – controllers fail and
  – have 1,000s of disks.

• Duplexing or parity, and dual path gives “perfect disks”

• Wal-Mart never lost a byte (thousands of disks, hundreds of failures).

• Only software/operations mistakes are left.
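
To see why a 1M-hour disk MTTF still means regular failures at scale, here is a minimal sketch. It assumes a constant failure rate and independent disks; the fleet sizes are illustrative and are not taken from the talk.

```python
HOURS_PER_YEAR = 8760


def expected_failures_per_year(n_disks: int, mttf_hours: float = 1_000_000) -> float:
    """Expected disk failures per year in a fleet, assuming a constant failure rate."""
    return n_disks * HOURS_PER_YEAR / mttf_hours


# A single disk: 1M hours is a bit over a century.
print(1_000_000 / HOURS_PER_YEAR, "years")

# Hypothetical fleet sizes: failures become routine events.
for n in (1, 1_000, 10_000):
    print(n, "disks ->", expected_failures_per_year(n), "failures/year")
```

With thousands of disks a site sees a failure every few weeks or days, so masking (duplexing or parity) matters more than the reliability of any one disk.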

8

Fault Tolerance vs Disaster Tolerance

• Fault-Tolerance: masks local faults
  – RAID disks
  – Uninterruptible Power Supplies
  – Cluster Failover
• Disaster Tolerance: masks site failures
  – Protects against fire, flood, sabotage,..
  – Redundant system and service at remote site.

9

Case Study - Japan

"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).

1,383 institutions reported (6/84 - 7/85)
7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES

MTTF by outage source:
Vendor (hardware and software)   5 Months
Application software             9 Months
Communications lines             1.5 Years
Operations                       2 Years
Environment                      2 Years
All sources combined             10 Weeks

To Get 10 Year MTTF, Must Attack All These Areas

[Pie chart of outages by source. Slices: 42%, 12%, 25%, 9.3%, 11.2%. Labels: Vendor, Environment, Operations, Application Software, Tele Comm lines.]
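
The per-source MTTFs above combine to roughly the reported 10-week figure if one assumes the sources fail independently at constant rates, so that their failure rates add. The sketch below makes that assumption explicit. The shares it prints are model outputs and need not match the survey's pie chart exactly.

```python
# MTTF per outage source, in months, as listed on the slide.
MTTF_MONTHS = {
    "Vendor (hardware and software)": 5.0,
    "Application software": 9.0,
    "Communications lines": 18.0,
    "Operations": 24.0,
    "Environment": 24.0,
}
WEEKS_PER_MONTH = 52.0 / 12.0


def combined_mttf(mttfs) -> float:
    """MTTF of a system that fails when any source fails (rates add)."""
    return 1.0 / sum(1.0 / m for m in mttfs)


def outage_shares(mttf_by_cause: dict) -> dict:
    """Fraction of all outages attributable to each cause under the same model."""
    total_rate = sum(1.0 / m for m in mttf_by_cause.values())
    return {cause: (1.0 / m) / total_rate for cause, m in mttf_by_cause.items()}


overall_weeks = combined_mttf(MTTF_MONTHS.values()) * WEEKS_PER_MONTH
print(f"combined MTTF: {overall_weeks:.1f} weeks")

for cause, share in outage_shares(MTTF_MONTHS).items():
    print(f"{cause}: {share:.0%}")

# Availability implied by a 10-week MTTF and 90-minute outages.
mttf_hours, mttr_hours = 10 * 7 * 24, 1.5
print(f"availability: {mttf_hours / (mttf_hours + mttr_hours):.4%}")
```

Because the rates add, eliminating any single source leaves the combined MTTF far short of 10 years, which is the slide's point about attacking all areas.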

10

Case Studies - Tandem Trends

• MTTF improved
• Shift from Hardware & Maintenance (from 50% to 10% of outages) to Software (62%) & Operations (15%)
• NOTE: Systematic under-reporting of Environment, Operations errors, Application Software

[Charts for 1985, 1987, 1989: Outages / 1000 System Years by Primary Cause, and % of Outages by Primary Cause. Legend: unknown, environment, operations, maintenance, hardware, software.]

11

Dependability Status circa 1995

• ~4-year MTTF => 5 9s for well-managed systems.

Fault Tolerance Works.

• Hardware is GREAT (maintenance and MTTF).

• Software masks most hardware faults.

• Many hidden software outages in operations:

–New Software.

–Utilities.

• Make all hardware/software changes ONLINE.

• Software seems to define a 30-year MTTF ceiling.
• Reasonable Goal: 100-year MTTF.
  class 4 today => class 6 tomorrow.
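
Two of the figures on this slide can be related with a small sketch. It takes "availability class" to mean the number of leading nines (class 4 = 99.99%, class 6 = 99.9999%) and solves A = MTTF / (MTTF + MTTR) for the repair time that a given target allows. Both function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760.0


def max_mttr_hours(availability_target: float, mttf_hours: float) -> float:
    """Largest MTTR that still meets the target, from A = MTTF / (MTTF + MTTR)."""
    return mttf_hours * (1.0 - availability_target) / availability_target


def availability_class(unavailability: float) -> int:
    """Number of leading nines; rounding first guards against floating-point noise."""
    return math.floor(round(-math.log10(unavailability), 9))


# A 4-year MTTF reaches five 9s only if outages are repaired in about 20 minutes.
mttr = max_mttr_hours(0.99999, 4 * HOURS_PER_YEAR)
print(f"{mttr * 60:.0f} minutes")

print(availability_class(1e-4))  # class 4
print(availability_class(1e-6))  # class 6
```

Read this way, moving from class 4 to class 6 means a hundredfold cut in unavailability, which has to come from longer MTTF, shorter MTTR, or both.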

12

What’s Happened Since Then?

• Hardware got better
• Software got better (even though it is more complex)
• RAID is standard, Snapshots coming standard
• Cluster in a box: commodity failover
• Remote replication is standard.

13

Availability

[Figure: availability rises from un-managed systems through the configurations below.]

• Well-managed nodes: masks some hardware failures.
• Well-managed packs & clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures.
• Well-managed GeoPlex: masks site failures (power, network, fire, move,…); masks some operations failures.

14

Outline

• The glorious past (Availability Progress)

• The dark ages (current scene)

• Some recommendations

15

Progress?

• MTTF improved from 1950-1995
• MTTR has not improved much since 1970 failover
• Hardware and Software online change (pNp) is now standard
• Then the Internet arrived:
  – No project can take more than 3 months.
  – Time to market is everything
  – Change is good.

16

The Internet Changed Expectations

1990

Phones delivered 99.999%

ATMs delivered 99.99%

Failures were front-page news.

Few hackers

Outages last an “hour”

2000

Cellphones deliver 90%

Web sites deliver 98%

Failures are business-page news

Many hackers.

Outages last a “day”

This is progress?
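
To put the percentages above on a common scale, the sketch below converts each availability figure into downtime per year. The percentages are the slide's; the conversion assumes a 365-day year, and the labels are just for printing.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime in a 365-day year at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


FIGURES = [
    ("Phones, 1990", 0.99999),
    ("ATMs, 1990", 0.9999),
    ("Web sites, 2000", 0.98),
    ("Cellphones, 2000", 0.90),
]

for label, a in FIGURES:
    minutes = downtime_minutes_per_year(a)
    print(f"{label}: {minutes:,.0f} min/year ({minutes / 1440:.1f} days)")
```

On this scale the 1990 phone system is down for minutes per year, while the 2000 figures correspond to days or weeks.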

17

Why (1) Complexity

• Internet sites are MUCH more complex.
  – NAP
  – Firewall/proxy/ipsprayer
  – Web
  – DMZ
  – App server
  – DB server
  – Links to other sites
  – tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os…

• Skill level is much reduced

18

One of the Data Centers (500 servers)

[Figure: Canyon Park Data Center, Microsoft.com Network Diagram. Cisco 7000 routers, Catalyst 5000 and Catalyst 2926 switches, and ASX-1000 switches (HSRP, ATM, and Fast Ethernet links) front farms of IIS web servers for the microsoft.com sites (www, search, register, support, windows, windowsmedia, msdn, download, ftp, and others), together with live, backup, and consolidator SQL servers, build servers, stagers, PPTP / terminal servers, and monitoring servers. Drawn by: Matt Groshong. Last Updated: April 12, 2000. IP addresses removed by Jim Gray to protect security.]

Microsoft.com Server Count
FTP                   6
Build Servers         32
IIS                   210
Application           2
Exchange              24
Network/Monitoring    12
SQL                   120
Search                2
NetShow               3
NNTP                  16
SMTP                  6
Stagers               26
Total                 459

19

A Schematic of HotMail

• ~7,000 servers
• 100 backend stores with 120TB (cooked)
• 3 data centers
• Links to
  – Passport
  – Ad-rotator
  – Internet Mail gateways
  – …
• ~ 1B messages per day
• 150M mailboxes, 100M active
• ~400,000 new per day.

[Diagram: the Internet feeds Local Directors in front of server farms (Front Doors, Incoming Mail Servers, AD Servers, Graphics Servers, Login Servers), connected by Switched Ethernet to the USTORES data stores and the Member Directory, with Telnet Management.]

20

Why (2) Velocity

• No project can take more than 13 weeks.

• Time to market is everything

• Functionality is everything

• Faster, cheaper, badder

[Diagram labels: Schedule, Quality, Functionality; trend]

21

Why (3) Hackers

• Hackers are a new and increased threat
• Any site can be attacked from anywhere
• Motives include ego, malice, and greed.
• Complexity makes it hard to protect sites.
• Concentration of wealth makes an attractive target:
  • Why did you rob banks?
  • Willie Sutton: Cause that’s where the money is!

Note: Eric Raymond’s How to Become a Hacker http://www.tuxedo.org/~esr/faqs/hacker-howto.html is the positive use of the term; here I mean malicious and anti-social hackers.

22

How Bad Is It?

http://www-iepm.slac.stanford.edu/

Connectivity is poor.

23

How Bad Is It?

• Median monthly % ping packet loss for 2/99

http://www-iepm.slac.stanford.edu/pinger/

24

Microsoft.Com

• Operations mis-configured a router

• Took a day to diagnose and repair.

• DOS attacks cost a fraction of a day.

• Regular security patches.

25

BackEnd Servers are More Stable

• Generally deliver 99.99%
• TerraServer, for example: single back-end failed after 2.5 y.
• Went to 4-node cluster
• Fails every 2 mo. Transparent failover in 30 sec. Online software upgrades. So… 99.999% in backend…

Year 1
                         Time          %
Total Up Time            8754:07:22    99.93%
Total Down Time          5:52:38       0.07%
Total Time               8760:00:00    100.00%
Scheduled Down           2:50:45
Scheduled Availability   8757:09:15    99.97%
Un-Scheduled Down        3:01:53

Through 18 Months
                         Time          %
Up Time                  12888:21:49   99.519%
Scheduled Down           4:00:25       0.031%
Unscheduled Down         58:20:46      0.451%
Total Time               12950:43:00   99.52%
Total Down               62:21:11      0.48%

Down 30 hours in July (hardware stop, auto restart failed, operations failure)

Down 26 hours in September (Backplane failure, I/O Bus failure)
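
The percentages in the two tables follow directly from the H:MM:SS totals, as the sketch below shows. The helper names are illustrative. The final estimate for the 4-node cluster assumes that "2 mo." is about 60 days and that every failure costs only the 30-second failover; both are assumptions made here, not statements from the talk.

```python
def hours(hms: str) -> float:
    """Convert an 'H:MM:SS' duration (hours may exceed 24) to fractional hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60.0 + s / 3600.0


def percent(part: str, total: str) -> float:
    """Share of `total` taken by `part`, in percent."""
    return 100.0 * hours(part) / hours(total)


year1_up = percent("8754:07:22", "8760:00:00")
through_18_up = percent("12888:21:49", "12950:43:00")
print(f"Year 1 up time: {year1_up:.2f}%")
print(f"Through 18 months up time: {through_18_up:.3f}%")

# Cluster estimate: one failure every ~60 days, masked by a 30-second failover.
cluster_mttf_s = 60 * 24 * 3600
cluster_mttr_s = 30
cluster_availability = cluster_mttf_s / (cluster_mttf_s + cluster_mttr_s)
print(f"Cluster availability: {cluster_availability:.5%}")
```

The two long outages dominate the 18-month figure, while the cluster estimate lands above five 9s because each failure is charged only 30 seconds.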

26

eBay: A very honest site

• Publishes operations log.

• Has 99% of scheduled uptime

• Schedules about 2 hours/week down.

• Has had some operations outages

• Has had some DOS problems.

http://www2.ebay.com/aw/announce.shtml#top
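
As a rough illustration of what these two figures mean to a user, the sketch below combines them. It treats "about 2 hours/week" and "99%" as exact, and it counts scheduled downtime against availability, which is one interpretation rather than something the slide states.

```python
HOURS_PER_WEEK = 168.0


def overall_availability(scheduled_down_hours_per_week: float,
                         availability_when_scheduled: float) -> float:
    """Fraction of all wall-clock time the site is actually up."""
    scheduled_fraction = 1.0 - scheduled_down_hours_per_week / HOURS_PER_WEEK
    return scheduled_fraction * availability_when_scheduled


# About 2 hours/week of planned downtime, 99% up during scheduled hours.
print(f"{overall_availability(2.0, 0.99):.1%}")
```

Under these assumptions the site is up a little under 98% of the time.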

27

Outline

• The glorious past (Availability Progress)

• The dark ages (current scene)

• Some recommendations

28

Not to throw stones but…

• Everyone has a serious problem.

• The BEST people publish their stats.

• The others HIDE their stats (check Netcraft to see who I mean).

• We have good NODE-level availability: 5-9s is reasonable.

• We have TERRIBLE system-level availability: 2-9s is the goal.

29

Recommendation #1

• Continue progress on back-ends.
  – Make management easier (AUTOMATE IT!!!)
  – Measure
  – Compare best practices
  – Continue to look for better algorithms.
• Live in fear
  – We are at 10,000 node servers
  – We are headed for 1,000,000 node servers

30

Recommendation #2

• Current security approach is unworkable:
  – Anonymous clients
  – Firewall is clueless
  – Incredible complexity
• We can’t win this game!
• So change the rules (redefine the problem):
  – No anonymity
  – Unified authentication/authorization model
  – Single-function devices (with simple interfaces)
  – Only one kind of interface (uddi/wsdl/soap/…).

31

References

Adams, E. (1984). “Optimizing Preventative Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15’th FTCS. 2-11.
Long, D.D., J. L. Carroll, and C.J. Park (1991). A study of the reliability of Internet sites. Proc 10’th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.
Darrell Long, Andrew Muir and Richard Golding, “A Longitudinal Study of Internet Host Reliability,” Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany: IEEE, September 1995, p. 2-9
http://www.netcraft.com/ They have even better for-fee data as well, but for-free is really excellent.
http://www2.ebay.com/aw/announce.shtml#top eBay is an Excellent benchmark of best Internet practices
http://www-iepm.slac.stanford.edu/pinger/ Network traffic/quality report, dated, but the others have died off!