1

Building Peta-Byte Servers

Jim Gray
Microsoft Research
[email protected]
http://www.Research.Microsoft.com/~Gray/talks

Kilo 10^3
Mega 10^6
Giga 10^9
Tera 10^12 (today, we are here)
Peta 10^15
Exa 10^18

2

Outline

• The challenge: Building GIANT data stores
– for example, the EOS/DIS 15 PB system
• Conclusion 1:
– Think about Maps and SCANS
• Conclusion 2:
– Think about Clusters

3

The Challenge -- EOS/DIS

• Antarctica is melting -- 77% of fresh water liberated
– sea level rises 70 meters
– Chico & Memphis are beach-front property
– New York, Washington, SF, SB, LA, London, Paris
• Let’s study it! Mission to Planet Earth
• EOS: Earth Observing System (17B$ => 10B$)
– 50 instruments on 10 satellites 1997-2001
– Landsat (added later)
• EOS DIS: Data Information System:
– 3-5 MB/s raw, 30-50 MB/s cooked.
– 4 TB/day,
– 15 PB by year 2007
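As a rough sanity check of the rates on this slide, the sketch below converts the cooked data rate into a daily volume and accumulates 4 TB/day over time. The ten-year collection period (roughly 1997 to 2007) and the use of decimal units are assumptions, not figures stated on the slide.

```python
# Sanity check of the EOS/DIS data volumes quoted on the slide.
# Assumptions (not from the slide): decimal units (1 TB = 10**6 MB)
# and a collection period of roughly 10 years (1997-2007).
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day(mb_per_second: float) -> float:
    """Convert a sustained rate in MB/s to TB/day."""
    return mb_per_second * SECONDS_PER_DAY / 1e6


def pb_after_years(tb_day: float, years: float) -> float:
    """Accumulated volume in PB at a constant daily rate."""
    return tb_day * 365 * years / 1e3


if __name__ == "__main__":
    for rate in (30, 50):
        print(f"{rate} MB/s cooked = {tb_per_day(rate):.1f} TB/day")
    print(f"4 TB/day for 10 years = {pb_after_years(4, 10):.1f} PB")
```

Under these assumptions the upper cooked rate corresponds to a little over 4 TB/day, and ten years at 4 TB/day comes to just under 15 PB, which is consistent with the slide.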

4

The Process Flow

• Data arrives and is pre-processed.
– instrument data is calibrated, gridded, averaged
– Geophysical data is derived
• Users ask for stored data OR to analyze and combine data.
• Can make the pull-push split dynamically

[Figure: Pull Processing, Push Processing, Other Data]

5

Designing EOS/DIS (for success)

• Expect that millions will use the system (online)
Three user categories:
– NASA 500 -- funded by NASA to do science
– Global Change 10 k -- other dirt bags
– Internet 20 m -- everyone else
Grain speculators, Environmental Impact Reports, school kids, New applications
=> discovery & access must be automatic
• Allow anyone to set up a peer-node (DAAC & SCF)
• Design for Ad Hoc queries, Not Just Standard Data Products
If push is 90%, then 10% of data is read (on average).
=> A failure: no one uses the data, in DSS, push is 1% or less.
=> computation demand is enormous (pull:push is 100:1)

6

The (UC alternative) Architecture

• 2+N data center design
• Scaleable DBMS to manage the data
• Emphasize Pull vs Push processing
• Storage hierarchy
• Data Pump
• Just in time acquisition

7

2+N Data Center Design

• Duplex the archive (for fault tolerance)
• Let anyone build an extract (the +N)
• Partition data by time and by space (store 2 or 4 ways).
• Each partition is a free-standing DBMS (similar to Tandem, Teradata designs).
• Clients and Partitions interact via standard protocols
– DCOM/CORBA, OLE-DB, HTTP,…
• Use the (Next Generation) Internet

8

Obvious Point: EOS/DIS will be a Cluster of SMPs

• It needs 16 PB storage
= 1 M disks in current technology
= 500K tapes in current technology
• It needs 100 TeraOps of processing
= 100K processors (current technology)
and ~ 100 Terabytes of DRAM
• 1997 requirements are 1000x smaller
– smaller data rate
– almost no re-processing work

9

Hardware Architecture

• 2 Huge Data Centers
• Each has 50 to 1,000 nodes in a cluster
– Each node has about 25…250 TB of storage (FY00 prices)
– SMP               .5 Bips to 50 Bips    20K$
– DRAM              50 GB to 1 TB         50K$
– 100 disks         2.3 TB to 230 TB      200K$
– 10 tape robots    50 TB to 500 TB       100K$
– 2 Interconnects   1 GBps to 100 GBps    20K$
• Node costs 500K$
• Data Center costs 25M$ (capital cost)
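The cost figures on this slide can be tallied in a few lines. The sketch below is illustrative only: it sums the itemized component prices and divides the quoted data-center capital cost by the quoted node price. The variable and function names are hypothetical.

```python
# Per-node component prices as itemized on the slide, in K$.
NODE_COMPONENTS_K = {
    "SMP": 20,
    "DRAM": 50,
    "100 disks": 200,
    "10 tape robots": 100,
    "2 interconnects": 20,
}
NODE_COST_K = 500            # node price quoted on the slide
DATA_CENTER_COST_K = 25_000  # 25 M$ capital cost quoted on the slide


def itemized_total_k(components: dict[str, int]) -> int:
    """Sum of the itemized component prices, in K$."""
    return sum(components.values())


def nodes_affordable(center_cost_k: int, node_cost_k: int) -> int:
    """Whole nodes that fit in the data-center capital budget."""
    return center_cost_k // node_cost_k


if __name__ == "__main__":
    print("itemized components:", itemized_total_k(NODE_COMPONENTS_K), "K$")
    print("nodes per 25 M$ center:",
          nodes_affordable(DATA_CENTER_COST_K, NODE_COST_K))
```

The itemized components sum to 390 K$; the rest of the 500 K$ node price is not itemized on the slide. At 500 K$ per node, 25 M$ buys 50 nodes, the low end of the stated 50 to 1,000 node range.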

10

Scaleable DBMS

• Adopt cluster approach (Tandem, Teradata, VMScluster,..)
• System must scale to many processors, disks, links
• Organize data as a Database, not a collection of files
– SQL rather than FTP as the metaphor
– add object types unique to EOS/DIS (Object Relational DB)
• DBMS based on standard object model
– CORBA or DCOM (not vendor specific)
• Grow by adding components
• System must be self-managing

11

Storage Hierarchy

• Cache hot 10% (1.5 PB) on disk.
• Keep cold 90% on near-line tape.
• Remember recent results on speculation
  research challenge: how to trade push + store vs. pull.
• (more on this later: Maps & SCANS)

[Figure: 500 nodes, 10-TB RAM; 10,000 drives, 1 PB of Disk; 4x1,000 robots, 15 PB of Tape Robot]

12

Data Pump

• Some queries require reading ALL the data (for reprocessing)
• Each Data Center scans the data every 2 days.
– Data rate 10 PB/day = 10 TB/node/day = 120 MB/s
• Compute on demand small jobs
– less than 1,000 tape mounts
– less than 100 M disk accesses
– less than 100 TeraOps.
– (less than 30 minute response time)
• For BIG JOBS scan entire 15PB database
• Queries (and extracts) “snoop” this data pump.
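The data-pump rates can be checked with simple arithmetic. The following sketch assumes 1,000 nodes per data center (implied by 10 PB/day = 10 TB/node/day, and the upper end of the node range given earlier) and decimal units; both are assumptions rather than statements from the slide.

```python
SECONDS_PER_DAY = 86_400


def per_node_mb_per_s(total_pb_per_day: float, nodes: int) -> float:
    """Sustained per-node rate in MB/s for an aggregate rate in PB/day."""
    tb_per_node_per_day = total_pb_per_day * 1e3 / nodes
    return tb_per_node_per_day * 1e6 / SECONDS_PER_DAY


def days_to_scan(database_pb: float, total_pb_per_day: float) -> float:
    """Days needed to read the whole database once at the aggregate rate."""
    return database_pb / total_pb_per_day


if __name__ == "__main__":
    rate = per_node_mb_per_s(total_pb_per_day=10, nodes=1000)
    print(f"per-node rate: {rate:.0f} MB/s")
    print(f"full scan of 15 PB: {days_to_scan(15, 10):.1f} days")
```

Under these assumptions each node sustains roughly 116 MB/s (the slide rounds to 120 MB/s), and a full pass over 15 PB takes about a day and a half, in line with the "every 2 days" scan cycle.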

13

Just-in-time acquisition 30%

• Hardware prices decline 20%-40%/year
• So buy at last moment
• Buy best product that day: commodity
• Depreciate over 3 years so that facility is fresh.
• (after 3 years, cost is 23% of original). 60% decline peaks at 10M$

[Chart: EOS DIS Disk Storage Size and Cost, 1994-2008; series: Storage Cost M$ and Data Need TB (log scale, 1 to 10^5); assume 40% price decline/year]
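The effect of a steady annual price decline is compound: after n years the same hardware costs (1 - r)^n of its original price. A minimal sketch, assuming a constant decline rate r each year:

```python
def remaining_cost_fraction(annual_decline: float, years: int) -> float:
    """Fraction of the original price left after compounding a yearly decline."""
    return (1.0 - annual_decline) ** years


if __name__ == "__main__":
    for decline in (0.20, 0.30, 0.40, 0.50):
        frac = remaining_cost_fraction(decline, 3)
        print(f"{decline:.0%}/year decline: {frac:.1%} of original after 3 years")
```

At the 40%/year decline assumed in the chart, three years leaves about 22% of the original price, close to the 23% quoted on the slide.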

14

Just-in-time acquisition 50%!!!!!!!

• Hardware prices decline 50%/year lately
• The PC revolution!
• It's amazing!

[Chart: EOS-DIS STORAGE NEEDS, 1994-2007, log scale 1 to 100,000; series: Total Storage Capacity (TB), Disk Cost (M$) at 40%/year price cut, Disk Cost (M$) at 50%/year price cut]

15

TPC-C improved fast (250%/year!)

[Chart: $/tpmC vs time, Mar-94 to Jan-98, log scale $10 to $1,000; annotated "250 %/year improvement!"]

[Chart: tpmC vs time, Mar-94 to Jan-98, log scale 100 to 100,000; annotated "250 %/year improvement!"]

40% hardware, 100% software, 100% PC Technology

16

Problems

• HSM (hierarchical storage management)
• Design and Meta-data
• Ingest
• Data discovery, search, and analysis
• reorganize-reprocess
• disaster recovery
• management/operations cost

17

http://msrlab/terraserver

Demo

http://t2b2c/terra-server/

18

Outline

– Think about Maps and SCANS



19

Meta-Message: Technology Ratios Are Important

• If everything gets faster & cheaper at the same rate THEN nothing really changes.
• Things getting MUCH BETTER:
– communication speed & cost 1,000x
– processor speed & cost 100x
– storage size & cost 100x
• Things staying about the same
– speed of light (more or less constant)
– people (10x more expensive)
– storage speed (only 10x better)

20

Today’s Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs

[Chart: Size vs Speed. Typical System (bytes), 10^3 to 10^15, vs Access Time (seconds), 10^-9 to 10^3; points: Cache, Main, Secondary (Disc), Online Tape, Nearline Tape, Offline Tape]

[Chart: Price vs Speed. $/MB, 10^-4 to 10^4, vs Access Time (seconds), 10^-9 to 10^3; points: Cache, Main, Secondary (Disc), Online Tape, Nearline Tape, Offline Tape]

21

Storage Ratios Changed

• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
• DRAM/DISK 100:1 to 10:10 to 50:1

[Chart: Disk Performance vs Time, 1980-2000; access time (ms) and bandwidth (MB/s), log scale 1 to 100]

[Chart: Disk Performance vs Time (accesses/second & Capacity), 1980-2000; Accesses per Second (1 to 100) and Disk Capacity (GB) (0.1 to 10)]

[Chart: Storage Price vs Time, 1980-2000; $/MB, log scale 0.01 to 10,000]

22

What's a Terabyte

1 Terabyte
  1,000,000,000 business letters    150 miles of bookshelf
  100,000,000 book pages            15 miles of bookshelf
  50,000,000 FAX images             7 miles of bookshelf
  10,000,000 TV pictures (mpeg)     10 days of video
  4,000 LandSat images

Library of Congress (in ASCII) is 25 TB

1980: 200 M$ of disc                10,000 discs
      5 M$ of tape silo             10,000 tapes

1997: 200 K$ of magnetic disc       120 discs
      250 K$ of optical disc robot  200 platters
      25 K$ of tape silo            25 tapes

Terror Byte !! .1% of a PetaByte!!!!!!!!!!!!!!!!!!

23

The Cost of Storage & Access

• File Cabinet:
  cabinet (4 drawer)       250$
  paper (24,000 sheets)    250$
  space (2x3 @ 10$/ft2)    180$
  total                    700$
  3 ¢/sheet
• Disk:
  disk (9 GB =)            2,000$
  ASCII: 5 m pages
  0.2 ¢/sheet (15x cheaper)
• Image: 200 k pages
  1 ¢/sheet (similar to paper)

24

Trends: Application Storage Demand Grew

• The New World:
– Billions of objects
– Big objects (1MB)

[Table: People (Name, Address, Papers, Picture, Voice); names: Mike, Won, David; addresses: NY, Berk, Austin]

• The Old World:
– Millions of objects
– 100-byte objects

[Table: People (Name, Address); names: Mike, Won, David; addresses: NY, Berk, Austin]

25

Trends: New Applications

The paperless office

Library of congress online (on your campus)

All information comes electronically: entertainment, publishing, business

Information Network, Knowledge Navigator, Information at Your Fingertips

Multimedia: Text, voice, image, video, ...

26

Thesis: Performance = Storage Accesses, not Instructions Executed

• In the “old days” we counted instructions and IO’s
• Now we count memory references
• Processors wait most of the time

[Chart: Where the time goes: clock ticks used by AlphaSort Components; segments: Sort, Disc Wait, OS, Memory Wait, D-Cache Miss, I-Cache Miss, B-Cache Data Miss]

27

The Pico Processor

[Figure: Pico Processor (1 MM^3) memory hierarchy]
  10 pico-second ram        megabyte
  10 nano-second ram        10 gigabyte
  10 microsecond ram        1 terabyte
  10 millisecond disc       100 terabyte
  10 second tape archive    100 petabyte

1 M SPECmarks
10^6 clocks/fault to bulk ram
Event-horizon on chip.
VM reincarnated
Multi-program cache
Terror Bytes!

28

Storage Latency: How Far Away is the Data?

[Figure: relative distance to each storage level, with a travel-time analogy]
  10^9    Tape / Optical Robot    Andromeda      2,000 Years
  10^6    Disk                    Pluto          2 Years
  100     Memory                  Sacramento     1.5 hr
  10      On Board Cache          This Campus    10 min
  2       On Chip Cache           This Room
  1       Registers               My Head        1 min

29

The Five Minute Rule

• Trade DRAM for Disk Accesses
• Cost of an access (DriveCost / Access_per_second)
• Cost of a DRAM page ($/MB / pages_per_MB)
• Break even has two terms:
• Technology term and an Economic term
• Grew page size to compensate for changing ratios.
• Still at 5 minute for random, 1 minute sequential

$$\text{BreakEvenReferenceInterval} = \frac{\text{PagesPerMBofDRAM}}{\text{AccessPerSecondPerDisk}} \times \frac{\text{PricePerDiskDrive}}{\text{PricePerMBofDRAM}}$$
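The break-even reference interval is the product of a technology term (DRAM pages per MB divided by disk accesses per second) and an economic term (price per disk drive divided by price per MB of DRAM). The sketch below evaluates it. The 2,000$ drive price and 15$/MB DRAM price come from the "good 1997 devices" slide later in this deck; the 8 KB page size (128 pages/MB) and 64 accesses per second per disk are assumptions chosen for illustration.

```python
def break_even_interval_seconds(
    pages_per_mb_dram: float,
    accesses_per_second_per_disk: float,
    price_per_disk_drive: float,
    price_per_mb_dram: float,
) -> float:
    """Reference interval (s) at which caching a page in DRAM costs the
    same as re-reading it from disk on every reference."""
    technology_term = pages_per_mb_dram / accesses_per_second_per_disk
    economic_term = price_per_disk_drive / price_per_mb_dram
    return technology_term * economic_term


if __name__ == "__main__":
    interval = break_even_interval_seconds(
        pages_per_mb_dram=128,            # assumption: 8 KB pages
        accesses_per_second_per_disk=64,  # assumption: random accesses/s
        price_per_disk_drive=2000,        # $ per drive, from the 1997 table
        price_per_mb_dram=15,             # $ per MB, from the 1997 table
    )
    print(f"break-even interval: {interval:.0f} s ({interval / 60:.1f} min)")
```

With these inputs the interval is a little under five minutes: pages referenced more often than that are cheaper to keep in DRAM, and colder pages are cheaper to leave on disk.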

30

Shows Best Index Page Size ~16KB

Index Page Utility vs Page Size and Index Element Size

Page Size (KB)     2      4      8      16     32     64     128
16 byte entries    0.64   0.72   0.78   0.82   0.79   0.69   0.54
32 byte            0.54   0.62   0.69   0.73   0.71   0.63   0.50
64 byte            0.44   0.53   0.60   0.64   0.64   0.57   0.45
128 byte           0.34   0.43   0.51   0.56   0.56   0.51   0.41

Index Page Utility vs Page Size and Disk Performance

Page Size (KB)     2      4      8      16     32     64     128
40 MB/s            0.65   0.74   0.83   0.91   0.97   0.99   0.94
10 MB/s            0.64   0.72   0.78   0.82   0.79   0.69   0.54
5 MB/s             0.62   0.69   0.73   0.71   0.63   0.50   0.34
3 MB/s             0.51   0.56   0.58   0.54   0.46   0.34   0.22
1 MB/s             0.40   0.44   0.44   0.41   0.33   0.24   0.16

31

Standard Storage Metrics

• Capacity:
– RAM: MB and $/MB: today at 10MB & 100$/MB
– Disk: GB and $/GB: today at 10 GB and 200$/GB
– Tape: TB and $/TB: today at .1TB and 25k$/TB (nearline)
• Access time (latency)
– RAM: 100 ns
– Disk: 10 ms
– Tape: 30 second pick, 30 second position
• Transfer rate
– RAM: 1 GB/s
– Disk: 5 MB/s - - - Arrays can go to 1GB/s
– Tape: 5 MB/s - - - striping is problematic

32

New Storage Metrics: Kaps, Maps, SCAN?

• Kaps: How many kilobyte objects served per second
– The file server, transaction processing metric
– This is the OLD metric.
• Maps: How many megabyte objects served per second
– The Multi-Media metric
• SCAN: How long to scan all the data
– the data mining and utility metric
• And
– Kaps/$, Maps/$, TBscan/$

33

For the Record (good 1997 devices)

                      DRAM      DISK      TAPE robot
Unit capacity (GB)    1         9         35 X 14
Unit price $          15000     2000      10000
$/GB                  15000     222       20
Latency (s)           1.E-7     1.E-2     3.E+1
Bandwidth (Mbps)      500       5         5
Kaps                  5.E+5     1.E+2     3.E-2
Maps                  5.E+2     4.76      3.E-2
Scan time (s/TB)      2         1800      98000
$/Kaps                3.E-10    2.E-7     3.E-3
$/Maps                3.E-7     4.E-6     3.E-3
$/TBscan              $0.32     $4        $296
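The Kaps, Maps, and scan-time rows of this table can be approximated from latency, bandwidth, and capacity alone. The sketch below is one plausible reading, not a formula given in the deck: it assumes an access costs one latency plus the transfer time of the object, that the bandwidth row is in megabytes per second, and that scan time is measured per unit (a tape robot being 14 tapes of 35 GB). The `Device` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    capacity_gb: float
    latency_s: float
    bandwidth_mb_s: float  # assumption: megabytes per second

    def accesses_per_second(self, object_mb: float) -> float:
        """Objects served per second: one latency plus one transfer each."""
        return 1.0 / (self.latency_s + object_mb / self.bandwidth_mb_s)

    def kaps(self) -> float:
        """Kilobyte-sized accesses per second."""
        return self.accesses_per_second(1e-3)

    def maps(self) -> float:
        """Megabyte-sized accesses per second."""
        return self.accesses_per_second(1.0)

    def scan_seconds(self) -> float:
        """Time to read the whole unit sequentially."""
        return self.capacity_gb * 1e3 / self.bandwidth_mb_s


DRAM = Device("DRAM", capacity_gb=1, latency_s=1e-7, bandwidth_mb_s=500)
DISK = Device("disk", capacity_gb=9, latency_s=1e-2, bandwidth_mb_s=5)
TAPE = Device("tape robot", capacity_gb=35 * 14, latency_s=30, bandwidth_mb_s=5)

if __name__ == "__main__":
    for dev in (DRAM, DISK, TAPE):
        print(f"{dev.name:10s} Kaps={dev.kaps():.3g} "
              f"Maps={dev.maps():.3g} scan={dev.scan_seconds():.0f} s")
```

Under these assumptions the disk comes out at about 100 Kaps, 4.76 Maps, and an 1,800 second scan, and the tape robot at a 98,000 second scan, matching the table entries.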

34

How To Get Lots of Maps, SCANs

• parallelism: use many little devices in parallel
• Beware of the media myth
• Beware of the access time myth

[Figure: 1 Terabyte at 10 MB/s: 1.2 days to scan. 1 Terabyte, 1,000 x parallel: 100 seconds SCAN.]

Parallelism: divide a big problem into many smaller ones to be solved in parallel.
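The scan-time arithmetic on this slide is easy to reproduce. This minimal sketch assumes the data is spread evenly across devices that all stream at the same rate, with decimal units (1 TB = 10^6 MB).

```python
def scan_seconds(data_tb: float, mb_per_s_per_device: float, devices: int = 1) -> float:
    """Seconds to read data_tb terabytes when it is partitioned evenly
    over `devices` units that each stream at mb_per_s_per_device."""
    return data_tb * 1e6 / (mb_per_s_per_device * devices)


if __name__ == "__main__":
    serial = scan_seconds(1, 10)
    parallel = scan_seconds(1, 10, devices=1000)
    print(f"1 device:      {serial / 86_400:.1f} days")
    print(f"1,000 devices: {parallel:.0f} seconds")
```

One device at 10 MB/s needs 100,000 seconds (about 1.2 days) to scan a terabyte; a thousand such devices in parallel need 100 seconds.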

35

The Disk Farm On a Card

The 100GB disc card
An array of discs
Can be used as
  100 discs
  1 striped disc
  10 Fault Tolerant discs
  ....etc
LOTS of accesses/second
  bandwidth

14"

Life is cheap, it's the accessories that cost ya.

Processors are cheap, it’s the peripherals that cost ya (a 10k$ disc card).

36

Tape Farms for Tertiary Storage, Not Mainframe Silos

Scan in 27 hours.
many independent tape robots (like a disc farm)

10K$ robot, 14 tapes: 500 GB, 5 MB/s, 20$/GB, 30 Maps

100 robots: 1M$, 50TB, 50$/GB, 3K Maps, 27 hr Scan

37

The Metrics: Disk and Tape Farms Win

[Chart: GB/K$, Kaps, Maps, and SCANS/Day (log scale, 0.01 to 1,000,000) for 1000 x Disc Farm, STC Tape Robot (6,000 tapes, 8 readers), and 100x DLT Tape Farm]

Data Motel: Data checks in, but it never checks out

38

Tape & Optical: Beware of the Media Myth

Optical is cheap: 200 $/platter, 2 GB/platter => 100$/GB (2x cheaper than disc)

Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc).

39

Tape & Optical Reality: Media is 10% of System Cost

Tape needs a robot (10 k$ ... 3 m$), 10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB (1x…10x cheaper than disc)

Optical needs a robot (100 k$), 100 platters = 200GB (TODAY) => 400 $/GB (more expensive than mag disc)

Robots have poor access times
Not good for Library of Congress (25TB)
Data motel: data checks in but it never checks out!

40

The Access Time Myth

The Myth: seek or pick time dominates
The reality:
  (1) Queuing dominates
  (2) Transfer dominates BLOBs
  (3) Disk seeks often short
Implication: many cheap servers better than one fast expensive server
– shorter queues
– parallel transfer
– lower cost/access and cost/byte
This is now obvious for disk arrays
This will be obvious for tape arrays

[Figure: access time breakdown: Seek, Rotate, Transfer compared with Wait, Seek, Rotate, Transfer]

41

Outline

– Think about Maps and SCAN & 5 minute rule



42

Scaleable Computers: BOTH SMP and Cluster

Grow Up with SMP
4xP6 is now standard
[Figure: Personal System, Departmental Server, SMP Super Server]

Grow Out with Cluster
Cluster has inexpensive parts
[Figure: Cluster of PCs]

43

What do TPC results say?

• Mainframes do not compete on performance or price
  They have great legacy code (MVS)
• PC nodes performance is 1/3 of high-end UNIX nodes
– 6xP6 vs 48xUltraSparc
• PC Technology is 3x cheaper than high-end UNIX
• Peak performance is a cluster
– Tandem 100 node cluster
– DEC Alpha 4x8 cluster
• Commodity solutions WILL come to this market

44

Cluster Advantages

• Clients and Servers made from the same stuff.
– Inexpensive: Built with commodity components
• Fault tolerance:
– Spare modules mask failures
• Modular growth
– grow by adding small modules
• Parallel data search
– use multiple processors and disks

45

Clusters being built

• Teradata 500 nodes (50k$/slice)
• Tandem, VMScluster 150 nodes (100k$/slice)
• Intel, 9,000 nodes @ 55M$ (6k$/slice)
• Teradata, Tandem, DEC moving to NT + low slice price
• IBM: 512 nodes ASCI @ 100m$ (200k$/slice)
• PC clusters (bare handed) at dozens of nodes: web servers (msn, PointCast,…), DB servers
• KEY TECHNOLOGY HERE IS THE APPS.
– Apps distribute data
– Apps distribute execution
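The per-slice prices quoted for the Intel and IBM machines follow from dividing the system price by the node count. A small sketch, using only the totals given on this slide (the function name is hypothetical):

```python
def cost_per_slice_k(total_dollars: float, nodes: int) -> float:
    """Average price per node (slice), in K$."""
    return total_dollars / nodes / 1e3


if __name__ == "__main__":
    systems = {
        "Intel, 9,000 nodes @ 55M$": (55e6, 9000),
        "IBM ASCI, 512 nodes @ 100m$": (100e6, 512),
    }
    for name, (price, nodes) in systems.items():
        print(f"{name}: {cost_per_slice_k(price, nodes):.0f} k$/slice")
```

This gives roughly 6 k$ per slice for the Intel system and roughly 195 k$ per slice for the IBM system, which the slide rounds to 200 k$.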

46

Clusters are winning the high end

• Until recently a 4x8 cluster had best TPC-C performance
• Clusters have best data mining story (TPC-D)
• This year, a 32xUltraSparc cluster won the MinuteSort

[Chart: Sort Records/second vs Time, 1985-2000, log scale 1.0E+02 to 1.0E+07; points: M68000, Cray YMP, IBM 3090, Tandem, Hardware Sorter, Sequent, Intel Hyper, SGI, IBM RS6000, NOW, Alpha, Next NOW (100 nodes)]

47

Clusters (Plumbing)

• Single system image
– naming
– protection/security
– management/load balance
• Fault Tolerance
• Hot Pluggable hardware & Software

48

So, What’s New?

• When slices cost 50k$, you buy 10 or 20.
• When slices cost 5k$ you buy 100 or 200.
• Manageability, programmability, usability become key issues (total cost of ownership).
• PCs are MUCH easier to use and program

[Figure: MPP Vicious Cycle: New MPP & New OS, New App, No Customers!]

[Figure: CP/Commodity Virtuous Cycle: Standard OS & Hardware, Apps, Customers. Standards allow progress and investment protection]

50

Where We Are Today

• Clusters moving fast
– OLTP
– Sort
– WolfPack
• Technology ahead of schedule
– cpus, disks, tapes, wires,..
• Databases are evolving
• Parallel DBMSs are evolving
• Operations (batch) has a long way to go on Unix/PC.

51

Outline

• The challenge: Building GIANT data stores

– Think about Maps and SCANs & 5 minute rule

• Slides & paper: http:\\www.research.Microsoft.com\~Gray\talks
  December SIGMOD RECORD
  http:\\www.research.Microsoft.com\~Gray\5_Min_Rule_Sigmod.doc
