Storage Bricks. Jim Gray, Microsoft Research, http://Research.Micrsoft.com/~Gray/talks. FAST 2002, Monterey, CA, 29 Jan 2002. Acknowledgements: Dave Patterson explained this to me long ago. Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.
Transcript
Page 1: Storage Bricks

Jim Gray
Microsoft Research
http://Research.Micrsoft.com/~Gray/talks
FAST 2002, Monterey, CA, 29 Jan 2002

Acknowledgements:
• Dave Patterson explained this to me long ago
• Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments

Page 2: First Disk 1956

• IBM 305 RAMAC
• 4 MB
• 50 x 24” disks
• 1200 rpm
• 100 ms access
• 35k$/y rent
• Included computer & accounting software (tubes, not transistors)

Page 3: 10 years later

[Photo; dimension label: 1.6 meters]

Page 4: Disk Evolution

• Capacity: 100x in 10 years
  – 1 TB 3.5” drive in 2005
  – 20 GB 1” micro-drive
• System on a chip
• High-speed SAN
• Disk replacing tape
• Disk is super computer!

[Scale graphic: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
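
The “100x in 10 years” capacity claim can be restated as an annual rate. The sketch below is only a compound-growth calculation; the function name `annual_growth` is illustrative and does not come from the talk.

```python
import math

def annual_growth(factor: float, years: float) -> float:
    """Compound annual growth rate implied by a total `factor` over `years`."""
    return factor ** (1.0 / years) - 1.0

rate = annual_growth(100, 10)                            # 100x in 10 years
doubling_months = 12 * math.log(2) / math.log(1 + rate)

print(f"annual growth: {rate:.0%}")                      # about 58% per year
print(f"doubling time: {doubling_months:.0f} months")    # about 18 months
```

Read this way, 100x per decade means capacity doubles roughly every year and a half.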

Page 5: Disks are becoming computers

• Smart drives
• Camera with micro-drive
• Replay / Tivo / Ultimate TV
• Phone with micro-drive
• MP3 players
• Tablet
• Xbox
• Many more…

[Diagram labels:]
Disk Ctlr + 1 GHz cpu + 1 GB RAM
Comm: Infiniband, Ethernet, radio…
Applications: Web, DBMS, Files
OS

Page 6: Data Gravity: Processing Moves to Transducers

smart displays, microphones, printers, NICs, disks

[Diagram labels: Storage, Network, Display, each with an ASIC]

Today: P = 50 mips, M = 2 MB
In a few years: P = 500 mips, M = 256 MB

Processing decentralized
Moving to data sources
Moving to power sources
Moving to sheet metal

? The end of computers ?

Page 7: It’s Already True of Printers

Peripheral = CyberBrick

• You buy a printer
• You get
  – several network interfaces
  – a Postscript engine
    • cpu,
    • memory,
    • software,
    • a spooler (soon)
  – and… a print engine.

Page 8: The Absurd Design?

• Segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips

[Diagram labels: Processors ~1 Tips; RAM ~1 TB; Disks ~100 TB; 100 GBps; 10 TBps]
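
A quick check of Amdahl's balance ratios for the ~1 Tips processor complex on this slide. This is a minimal sketch: the function name `amdahl_bandwidth` is made up for illustration, and it simply multiplies the slide's ratios (10 bytes of bus traffic and 1 bit of io per instruction per second) by the instruction rate.

```python
def amdahl_bandwidth(ips: float) -> tuple[float, float]:
    """Return (bus bytes/s, io bytes/s) for a processor running `ips` instructions/s.

    Uses the slide's ratios: bus = 10 bytes per ips, io = 1 bit per ips.
    """
    bus_bytes_per_s = 10 * ips
    io_bytes_per_s = ips / 8          # 1 bit per instruction/s, converted to bytes
    return bus_bytes_per_s, io_bytes_per_s

bus, io = amdahl_bandwidth(1e12)      # ~1 Tips, as on the slide
print(f"bus: {bus / 1e12:.0f} TBps")  # 10 TBps
print(f"io:  {io / 1e9:.0f} GBps")    # 125 GBps, the same order as the slide's 100 GBps
```

The results line up with the slide's 10 TBps and 100 GBps labels, which suggests those figures are the memory and disk bandwidths a balanced 1 Tips system would need.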

Page 9: The “Absurd” Disk

• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It’s a tape!
• Optimizations:
  – Reduce management costs
  – Caching
  – Sequential 100x faster than random

[Drive label: 1 TB, 100 MB/s, 200 Kaps, 200$]
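
The scan-time and access-density figures follow from the drive's label. A minimal sketch of the arithmetic, assuming decimal units and reading “200 Kaps” as 200 accesses per second (an interpretation, chosen because it reproduces the slide's 1 aps / 5 GB):

```python
CAPACITY_BYTES = 1e12        # 1 TB
BANDWIDTH_BPS = 100e6        # 100 MB/s sequential
ACCESSES_PER_S = 200         # "200 Kaps", read here as 200 accesses/s (assumption)

scan_hours = CAPACITY_BYTES / BANDWIDTH_BPS / 3600
gb_per_access_per_s = (CAPACITY_BYTES / 1e9) / ACCESSES_PER_S

print(f"full scan: {scan_hours:.1f} hours")             # 2.8 hours
print(f"1 access/s per {gb_per_access_per_s:.0f} GB")   # 5 GB
```

The computed scan time is about 2.8 hours, which the slide rounds to 2.5 hr.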

Page 10: Disk = Node

• magnetic storage (1 TB)
• processor + RAM + LAN
• Management interface (HTTP + SOAP)
• Application execution environment
• Application
  – File
  – DB2/Oracle/SQL
  – Notes/Exchange/TeamServer
  – SAP/Siebel/…
  – Quickbooks / Tivo / PC…

[Diagram labels: OS Kernel, LAN driver, Disk driver; File System, RPC, ...; Services, DBMS; Applications]

Page 11: Implications

Conventional
• Offload device handling to NIC/HBA
• higher level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and Cluster parallelism is important.

Radical
• Move app to NIC/device controller
• higher-higher level protocols: SOAP/DCOM/RMI..
• Cluster parallelism is VERY important.

[Diagram labels: Central Processor & Memory; Terabyte/s Backplane]

Page 12: Intermediate Step: Shared Logic

• Brick with 8-12 disk drives
• 200 mips/arm (or more)
• 2 x Gbps Ethernet
• General purpose OS
• 10k$/TB to 50k$/TB
• Shared
  – Sheet metal
  – Power
  – Support/Config
  – Security
  – Network ports
• These bricks could run applications (e.g. SQL or Mail or..)

[Product labels:]
Snap: ~1 TB, 12 x 80 GB, NAS
NetApp: ~0.5 TB, 8 x 70 GB, NAS
Maxtor: ~2 TB, 12 x 160 GB, NAS

Page 13: Example

• Homogeneous machines lead to quick response through reallocation
• HP desktop machines, 320 MB RAM, 3u high, 4 x 100 GB IDE drives
• $4k/TB (street)
• 2.5 processors/TB, 1 GB RAM/TB
• JIT storage & processing: 3 weeks from order to deploy

Slide courtesy of Brewster Kahle, @ Archive.org
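
The per-terabyte ratios on this slide can be recovered from the machine description. A small sketch, assuming one processor per desktop machine (the slide does not say) and decimal units; the implied per-machine price is derived here, not quoted in the talk.

```python
DRIVES_PER_MACHINE = 4
DRIVE_GB = 100
RAM_MB = 320
CPUS_PER_MACHINE = 1          # assumption: one processor per desktop
PRICE_PER_TB = 4000           # $4k/TB (street)

tb_per_machine = DRIVES_PER_MACHINE * DRIVE_GB / 1000
cpus_per_tb = CPUS_PER_MACHINE / tb_per_machine
ram_gb_per_tb = (RAM_MB / 1000) / tb_per_machine
implied_machine_price = PRICE_PER_TB * tb_per_machine

print(f"{tb_per_machine:.1f} TB per machine")            # 0.4 TB
print(f"{cpus_per_tb:.1f} processors/TB")                # 2.5
print(f"{ram_gb_per_tb:.1f} GB RAM/TB")                  # 0.8, rounded to 1 on the slide
print(f"${implied_machine_price:.0f} per machine")       # $1600
```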

Page 14: What if Disk Replaces Tape?

How does it work?

• Backup/Restore
  – RAID (among the federation)
  – Snapshot copies (in most OSs)
  – remote replicas (standard in DBMS and FS)
• Archive
  – Use “cold” 95% of disk space
• Interchange
  – Send computers not disks.

Page 15: It’s Hard to Archive a Petabyte

It takes a LONG time to restore it.

• At 1 GBps it takes 12 days!
• Store it in two (or more) places online: a geo-plex
• Scrub it continuously (look for errors)
• On failure,
  – use other copy until failure repaired,
  – refresh lost copy from safe copy.
• Can organize the two copies differently (e.g.: one by time, one by space)
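
The restore-time figure and the scrub-and-refresh idea can be sketched in a few lines. This is an illustration under stated assumptions, not the talk's design: the two copies are modeled as in-memory dicts of blocks, a SHA-256 digest per block stands in for whatever error detection a real geoplex would use, and the names `restore_days`, `digest`, and `scrub` are hypothetical.

```python
import hashlib

SECONDS_PER_DAY = 86_400

def restore_days(size_bytes: float, bytes_per_s: float) -> float:
    """Days needed to stream `size_bytes` at a sustained `bytes_per_s`."""
    return size_bytes / bytes_per_s / SECONDS_PER_DAY

def digest(block: bytes) -> str:
    """Checksum used to detect a damaged block."""
    return hashlib.sha256(block).hexdigest()

def scrub(copy_a: dict, copy_b: dict, checksums: dict) -> list:
    """Check every block of both copies; refresh a bad block from the good copy.

    Returns the ids of repaired blocks. A block that is bad in both copies is
    left alone, since there is no safe copy to refresh from.
    """
    repaired = []
    for block_id, expected in checksums.items():
        a_ok = digest(copy_a[block_id]) == expected
        b_ok = digest(copy_b[block_id]) == expected
        if a_ok and not b_ok:
            copy_b[block_id] = copy_a[block_id]
            repaired.append(block_id)
        elif b_ok and not a_ok:
            copy_a[block_id] = copy_b[block_id]
            repaired.append(block_id)
    return repaired

print(f"{restore_days(1e15, 1e9):.1f} days")   # 11.6 days, which the slide rounds to 12
```

Running `scrub` continuously keeps both copies usable, so a full petabyte restore is needed only if both sites lose the same data.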

Page 16: Archive to Disk

100 TB for 0.5M$ + 1.5 “free” petabytes

• If you have 100 TB active you need 10,000 mirrored disk arms (see tpcC)
• So you have 1.6 PB of (mirrored) storage (160 GB drives)
• Use the “empty” 95% for archive storage.
• No extra space or extra power cost.
• Very fast access (milliseconds vs hours).
• Snapshot is read-only (software enforced)
• Makes Admin easy (saves people costs)
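
The “free petabytes” claim is arithmetic on the arm count. A minimal sketch that follows the slide's own figures and, like the slide, counts the 100 TB active set once against the raw capacity:

```python
ARMS = 10_000
DRIVE_GB = 160
ACTIVE_TB = 100

total_tb = ARMS * DRIVE_GB / 1000            # raw capacity in TB
free_tb = total_tb - ACTIVE_TB               # capacity left for archive
free_fraction = free_tb / total_tb

print(f"total: {total_tb / 1000:.1f} PB")    # 1.6 PB
print(f"free:  {free_tb / 1000:.1f} PB")     # 1.5 PB
print(f"empty: {free_fraction:.0%}")         # 94%, which the slide rounds to 95%
```

The arms are bought for their access rate, so the capacity behind them is a by-product that the archive can use.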

Page 17: Disk as Tape Archive

• Tape is unreliable, specialized, slow, low density, not improving fast, and expensive
• Using removable hard drives to replace tape’s function has been successful
• When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
• Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.

Slide courtesy of Brewster Kahle, @ Archive.org

Page 18: Disk as Tape Interchange

• Tape interchange is frustrating (often unreadable)
• Beyond 1-10 GB send media not data
  – FTP takes too long (hour/GB)
  – Bandwidth still very expensive (1$/GB)
• Writing DVD not much faster than Internet
• New technology could change this
  – 100 GB DVD @ 10 MBps would be competitive.
• Write 1 TB disk in 2.5 hrs (at 100 MBps)
• But, how does interchange work?
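
To see why the slide says to send media rather than data, compare the figures it quotes for a 1 TB transfer. A rough sketch using only the slide's rates (hour/GB and 1$/GB for FTP, 100 MBps for a disk, 10 MBps for the hypothetical 100 GB DVD); the helper name `hours_to_write` is illustrative.

```python
def hours_to_write(size_gb: float, mb_per_s: float) -> float:
    """Hours to write `size_gb` at a sustained `mb_per_s` (decimal units)."""
    return size_gb * 1000 / mb_per_s / 3600

SIZE_GB = 1000                      # 1 TB payload
FTP_HOURS_PER_GB = 1.0              # "hour/GB"
NETWORK_DOLLARS_PER_GB = 1.0        # "1$/GB"

ftp_hours = SIZE_GB * FTP_HOURS_PER_GB
ftp_cost = SIZE_GB * NETWORK_DOLLARS_PER_GB
disk_hours = hours_to_write(SIZE_GB, 100)   # 1 TB disk at 100 MBps
dvd_hours = hours_to_write(100, 10)         # one 100 GB DVD at 10 MBps

print(f"FTP:  {ftp_hours / 24:.0f} days, ${ftp_cost:.0f}")   # 42 days, $1000
print(f"disk: {disk_hours:.1f} hours")                       # 2.8 hours
print(f"DVD:  {dvd_hours:.1f} hours per 100 GB")             # 2.8 hours
```

At these rates the network transfer takes weeks and costs about as much as several of the 200$ drives, while writing the disk takes an afternoon (shipping time not included).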

Page 19: Disk As Tape Interchange: What format?

• Today I send 160 GB NTFS/SQL disks.
• But that is not a good format for Linux/DB2 users.
• Solution: Ship NFS/CIFS/ODBC servers (not disks)
• Plug “disk” into LAN.
  – DHCP then file or DB server via standard interface.
  – “pull” data from server.

Page 20: Some Questions

• What is the product?
• How do I manage 10,000 nodes (disks)?
• How do I program 10,000 nodes (disks)?
• How does RAID work?
• How do I backup a PB?
• How do I restore a PB?

Page 21: What is the Product?

• Concept: Plug it in and it works!
• Music/Video/Photo appliance (home)
• Game appliance
• “PC”
• File server appliance
• Data archive/interchange appliance
• Web server appliance
• DB server
• eMail appliance
• Application appliance

[Diagram labels: power, network]

Page 22: How Does Scale Out Work?

• Files: well known designs:
  – rooted tree partitioned across nodes
  – Automatic cooling (migration)
  – Mirrors or Chained declustering
  – Snapshots for backup/archive
• Databases: well known designs
  – Partitioning, remote replication similar to files
  – distributed query processing.
• Applications: (hypothetical)
  – Must be designed as mobile objects
  – Middleware provides object migration system
    • Objects externalize methods to migrate ( == backup/restore/archive)
    • Web services seem to have key ideas (xml representation)
  – Example: eMail object is mailbox
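
Chained declustering, named above as one of the well known file designs, places each partition's backup on the next node in a ring. The sketch below shows only placement and failover under that assumption; it omits the load rebalancing along the chain that the full scheme performs after a failure, and the function names are illustrative.

```python
def chained_placement(partition: int, n_nodes: int) -> tuple[int, int]:
    """Return (primary node, backup node) for a partition.

    The backup of each partition lives on the next node in the ring, so the
    nodes form a chain of overlapping mirror pairs.
    """
    primary = partition % n_nodes
    backup = (primary + 1) % n_nodes
    return primary, backup

def node_for_read(partition: int, n_nodes: int, failed: frozenset = frozenset()) -> int:
    """Pick a live node holding the partition, preferring the primary."""
    for node in chained_placement(partition, n_nodes):
        if node not in failed:
            return node
    raise RuntimeError(f"partition {partition} unavailable: both copies failed")

print(chained_placement(3, 4))                      # (3, 0)
print(node_for_read(3, 4, failed=frozenset({3})))   # 0
```

Data becomes unavailable only when two adjacent nodes in the chain fail together.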

Page 23: Auto Manage Storage

• 1980 rule of thumb:
  – A DataAdmin per 10 GB, SysAdmin per mips
• 2000 rule of thumb
  – A DataAdmin per 5 TB
  – SysAdmin per 100 clones (varies with app).
• Problem:
  – 5 TB is 50k$ today, 5k$ in a few years.
  – Admin cost >> storage cost !!!!
• Challenge:
  – Automate ALL storage admin tasks

Page 24: Admin: TB and “guessed” $/TB

(does not include cost of application, overhead, not “substance”)

• Google: 1 : 100 TB, 5k$/TB/y
• Yahoo!: 1 : 50 TB, 20k$/TB/y
• DB: 1 : 5 TB, 60k$/TB/y
• Wall St.: 1 : 1 TB, 400k$/TB/y (reported)
• hardware dominant cost only @ Google.
• How can we waste hardware to save people cost?
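
The “hardware dominant cost only @ Google” remark can be checked against the talk's own numbers. This sketch takes the hardware price from the “5TB is 50k$ today” figure on the Auto Manage Storage slide and compares it with each site's guessed $/TB/y; setting one year of admin cost against the hardware purchase price is a simplifying assumption, not something the talk states.

```python
HARDWARE_PER_TB = 50_000 / 5     # "5TB is 50k$ today" gives 10k$/TB

# Guessed admin cost in $/TB/year, as listed on the slide.
admin_per_tb_year = {
    "Google": 5_000,
    "Yahoo!": 20_000,
    "DB": 60_000,
    "Wall St.": 400_000,
}

hardware_dominant = [
    site for site, cost in admin_per_tb_year.items() if cost < HARDWARE_PER_TB
]
print(hardware_dominant)         # ['Google']
```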

Page 25: How do I manage 10,000 nodes?

• You can’t manage 10,000 x (for any x).
• They manage themselves.
  – You manage exceptional exceptions.
• Auto Manage
  – Plug & Play hardware
  – Auto-load balance & placement storage & processing
  – Simple parallel programming model
  – Fault masking

Page 26: How do I program 10,000 nodes?

• You can’t program 10,000 x (for any x).
• They program themselves.
  – You write embarrassingly parallel programs
  – Examples: SQL, Web, Google, Inktomi, HotMail,….
  – PVM and MPI prove it must be automatic (unless you have a PhD)!
• Auto Parallelism is ESSENTIAL
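
An embarrassingly parallel program in miniature: the same function runs over each data partition independently and the partial results are combined at the end. This is only a sketch; worker threads stand in for storage nodes, the data is a toy example, and the names `count_matches` and `parallel_count` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(partition: list, needle: str) -> int:
    """Work done on one node: count the records containing `needle`."""
    return sum(1 for record in partition if needle in record)

def parallel_count(partitions: list, needle: str, workers: int = 4) -> int:
    """Run the same scan on every partition and add up the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(lambda part: count_matches(part, needle), partitions)
        return sum(partial_counts)

partitions = [["disk brick", "tape"], ["smart disk"], ["network", "disk arm", "printer"]]
print(parallel_count(partitions, "disk"))   # 3
```

The author of `count_matches` never deals with scheduling or communication, which is the sense in which the parallelism is automatic.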

Page 27: Summary

• Disks will become supercomputers so
  – Lots of computing to optimize the arm
  – Can put app close to the data (better modularity, locality)
  – Storage appliances (self-organizing)
• The arm/capacity tradeoff: “waste” space to save access.
  – Compression (saves bandwidth)
  – Mirrors
  – Online backup/restore
  – Online archive (vault to other drives or geoplex if possible)
• Not disks replace tapes: Storage appliances replace tapes.
• Self-organizing storage servers (file systems) (prototypes of this software exist)