Scaleability
Jim Gray
[email protected]
(with help from Gordon Bell, George Spix, Catharine van Ingen)

          Mon         Tue            Wed           Thur              Fri
 9:00     Overview    TP mons        Log           Files & Buffers   B-tree
11:00     Faults      Lock Theory    ResMgr        COM+              Access Paths
 1:30     Tolerance   Lock Techniq   CICS & Inet   Corba             Groupware
 3:30     T Models    Queues         Adv TM        Replication       Benchmark
 7:00     Party       Workflow       Cyberbrick    Party
A peta-op business app?
• P&G and friends pay for the web (like they paid for broadcast television) – no new money, but given Moore's law, traditional advertising revenues can pay for all of our connectivity: voice, video, data… (presuming we figure out how to, and allow them to, brand the experience).
• Advertisers pay for impressions and ability to analyze same.
• A terabyte sort a minute – to one a second.
• Bisection bandwidth of ~20 GB/s – to ~200 GB/s.
• Really a tera-op business app (today's portals)
Scaleability: Scale Up and Scale Out
[Figure: Personal System → Departmental Server → SMP Super Server, and beside it a Cluster of PCs]
Grow Up with SMP: 4xP6 is now standard.
Grow Out with Cluster: the cluster has inexpensive parts.
There'll be Billions, even Trillions, of Clients
• Every device will be “intelligent”
• Doors, rooms, cars…
• Computing will be ubiquitous
Billions of Clients Need Millions of Servers
[Figure: mobile and fixed clients connect to servers and super servers; scale labels: trillions of clients, billions of servers]
• All clients networked to servers; may be nomadic or on-demand
• Fast clients want faster servers
• Servers provide shared data, control, coordination, communication
Thesis: Many little beat few big
Smoking, hairy golf ball
How to connect the many little parts?
How to program the many little parts?
Fault tolerance & management?
[Figure: the price and size trend – mainframe ($1 million) → mini ($100 K) → micro ($10 K) → nano; disk form factors 14" → 9" → 5.25" → 3.5" → 2.5" → 1.8"; 1 M SPECmarks, 1 TFLOP]
What Happened? Where did the 100,000x come from?
• Moore's law: 100X (at most)
• Software improvements: 10X (at most)
• Commodity pricing: 100X (at least)
• Total: 100,000X
• 100x from commodity
– (DBMS was 100K$ to start; now 1K$ to start)
– IBM 390 MIPS is 7.5K$ today
– Intel MIPS is 10$ today
– Commodity disk is 50$/GB vs 1,500$/GB
– ...
Per sq ft     SGI O2K   UE10K   DELL 6350   Cray T3E   IBM SP2   PoPC
cpus            2.1       4.7      7.0        4.7        5.0      13.3
specint        29.0      60.5    132.7       79.3       72.3     253.3
ram (GB)        4.1       4.7      7.0        0.6        5.0       6.8
disks           1.3       0.5      5.2        0.0        2.5      13.3
Standard package, full height, fully populated, 3.5” disks
HP, DELL, and Compaq are trading places for the rack-mount lead.
PoPC – Celeron NLX shoeboxes – 1,000 nodes in 48 (24x2) sq ft, $650K from Arrow (3-year warranty!), on-chip at-speed L2.
Web & server farms, server consolidation / sqft
http://www.exodus.com (charges by mbps times sqft)
Application Taxonomy
Technical:
• General purpose, non-parallelizable codes – PCs have it!
• Vectorizable & parallelizable – supers & small DSMs
• Hand tuned, one-of: MPP coarse grain, MPP embarrassingly parallel – clusters of PCs
Commercial:
• Database/TP, web host, streaming audio/video
If central control & rich, then IBM or large SMPs; else PC clusters.
Peta scale w/ traditional balance

                                 2000                       2010
1 PIPS processing (10^15 ips)    10^6 cpus @ 10^9 ips       10^4 cpus @ 10^11 ips
10 PB of DRAM                    10^8 chips @ 10^7 bytes    10^6 chips @ 10^9 bytes
10 PBps memory bandwidth
1 PBps IO bandwidth              10^8 disks @ 10^7 Bps      10^7 disks @ 10^8 Bps
100 PB of disk storage           10^5 disks @ 10^10 B       10^3 disks @ 10^12 B
10 EB of tape storage            10^7 tapes @ 10^10 B       10^5 tapes @ 10^12 B

10x every 5 years, 100x every 10 (1000x in 20 if SC) – except memory & IO bandwidth
• How much can you sort for a penny?
  – Hardware and software cost
  – Depreciated over 3 years
  – 1M$ system gets about 1 second
  – 1K$ system gets about 1,000 seconds
  – Time (seconds) = 946,080 / SystemPrice ($)
    (a penny buys 0.01/price of the system's 3-year, ~94,608,000-second life)
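As a sanity check on that formula, here is a minimal C++ sketch that computes the penny budget for the two price points on the slide; the 3-year depreciation period is taken from the slide, everything else is straightforward arithmetic.

```cpp
// pennysort_budget.cpp -- how many seconds of machine time does a penny buy?
// Assumes the slide's model: hardware + software cost depreciated over 3 years.
#include <cstdio>

int main() {
    const double secondsIn3Years = 3.0 * 365.0 * 24 * 3600;   // ~94,608,000 s
    const double pennyFraction   = 0.01;                       // $0.01 of the system price
    const double prices[] = {1e6, 1e3};                        // $1M and $1K systems

    for (double price : prices) {
        double budget = pennyFraction * secondsIn3Years / price;  // = 946,080 / price
        std::printf("$%.0f system: a penny buys about %.3f seconds\n", price, budget);
    }
    return 0;   // $1M -> ~1 s, $1K -> ~946 s, matching the slide's rough numbers
}
```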
• Input and output are disk resident
• Input is
  – 100-byte records (random data)
  – key is the first 10 bytes
• Must create output file and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories
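A minimal in-memory sketch of the benchmark's record format and key comparison follows; the real benchmark is disk-to-disk and far larger than RAM, so the record count below and the absence of file I/O and external merge passes are my simplifications, not part of the rules.

```cpp
// sortbench_sketch.cpp -- 100-byte records, key = first 10 bytes, sort by key.
#include <algorithm>
#include <cstring>
#include <random>
#include <vector>

struct Record {
    char bytes[100];              // 100-byte record; bytes[0..9] are the key
};

static bool keyLess(const Record& a, const Record& b) {
    return std::memcmp(a.bytes, b.bytes, 10) < 0;   // compare only the 10-byte key
}

int main() {
    const size_t n = 1'000'000;   // hypothetical count; the real benchmark is disk resident
    std::vector<Record> records(n);

    std::mt19937_64 rng(42);
    for (auto& r : records)
        for (char& c : r.bytes) c = static_cast<char>(rng());   // random data

    std::sort(records.begin(), records.end(), keyLess);
    // A real entry would now create the output file and fill it with the sorted records.
    return 0;
}
```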
How Good is NT5 Sort?
• CPU and IO not overlapped
• System should be able to sort 2x more
• RAM has spare capacity
• Disk is space saturated (1.5 GB in, 1.5 GB out on a 3 GB drive); need an extra 3 GB drive or a >6 GB drive
… & cameras (vision)
– Storage: data storage and analysis
• System is “distributed” (a cluster/mob)
SAN: Standard Interconnect
[Chart: interconnect bandwidth]
  Gbps SAN: 110 MBps
  PCI:       70 MBps
  UW SCSI:   40 MBps
  FW SCSI:   20 MBps
  SCSI:       5 MBps
• LAN faster than memory bus?
• 1 GBps links in the lab
• 100$ port cost soon
• Port is the computer
• Winsock: 110 MBps (10% cpu utilization at each end)
RIP: FDDI, ATM, SCI, SCSI, FC, …?
Disk = Node
• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has execution environment
[Figure: software stack on the disk node – Applications; Services: file system, RPC, DBMS, ...; OS kernel: SAN driver, disk driver]
Standard Storage Metrics
• Capacity:
  – RAM: MB and $/MB: today at 10 MB & 100$/MB
  – Disk: GB and $/GB: today at 10 GB and 200$/GB
  – Tape: TB and $/TB: today at .1 TB and 25K$/TB (nearline)
• Access time (latency):
  – RAM: 100 ns
  – Disk: 10 ms
  – Tape: 30 second pick, 30 second position
• Transfer rate:
  – RAM: 1 GB/s
  – Disk: 5 MB/s (arrays can go to 1 GB/s)
  – Tape: 5 MB/s (striping is problematic)
New Storage Metrics:
Kaps, Maps, SCAN?
• Kaps: how many KB objects served per second
  – the file server, transaction processing metric
  – this is the OLD metric
• Maps: how many MB objects served per second
  – the multi-media metric
• SCAN: how long to scan all the data
  – the data mining and utility metric
• And: Kaps/$, Maps/$, TBscan/$
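A back-of-the-envelope sketch of the three metrics for a single disk, using the rough 1998 numbers from the previous slide (10 GB capacity, 10 ms access, 5 MB/s transfer); the one-request-at-a-time service model and object sizes are my assumptions.

```cpp
// storage_metrics.cpp -- rough Kaps, Maps, SCAN for one disk (1998-ish numbers).
#include <cstdio>

int main() {
    const double capacityMB  = 10'000.0;  // ~10 GB disk
    const double accessSec   = 0.010;     // ~10 ms seek + rotate per random access
    const double transferMBs = 5.0;       // ~5 MB/s sequential transfer

    // Kaps: random 1 KB reads per second (access time dominates).
    double kaps = 1.0 / (accessSec + (1.0 / 1024.0) / transferMBs);

    // Maps: random 1 MB reads per second (transfer starts to dominate).
    double maps = 1.0 / (accessSec + 1.0 / transferMBs);

    // SCAN: time to read the whole disk sequentially.
    double scanSec = capacityMB / transferMBs;

    std::printf("Kaps ~ %.0f/s, Maps ~ %.1f/s, SCAN ~ %.0f s (~%.1f hours)\n",
                kaps, maps, scanSec, scanSec / 3600.0);
    return 0;
}
```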
For the Record (good 1998 devices packaged in a system,
http://www.tpc.org/results/individual_results/Dell/dell.6100.9801.es.pdf)
[Table comparing DRAM, disk, and tape robot; values not recovered]
• ( more expensive than mag disc )
• Robots have poor access times
• Not good for Library of Congress (25 TB)
• Data motel: data checks in but it never checks out!
The Access Time Myth
The myth: seek or pick time dominates.
The reality:
  (1) Queuing dominates
  (2) Transfer dominates BLOBs
  (3) Disk seeks are often short
Implication: many cheap servers are better than one fast expensive server
  – shorter queues
  – parallel transfer
  – lower cost/access and cost/byte
This is now obvious for disk arrays; it will be obvious for tape arrays.
[Figure: anatomy of a request – wait, seek, rotate, transfer]
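To see why queuing, not seek time, can dominate, here is a small illustration using an M/M/1 response-time formula (the queueing model and the load numbers are my assumptions, not from the slide): splitting the load across many cheap servers cuts utilization, and the wait shrinks far more than any seek-time improvement could deliver.

```cpp
// queueing_sketch.cpp -- why queues dominate: M/M/1 response time vs. utilization.
// The M/M/1 model and the numbers below are illustrative assumptions.
#include <cstdio>

// Mean response time for an M/M/1 queue: service time / (1 - utilization).
double responseTime(double serviceSec, double utilization) {
    return serviceSec / (1.0 - utilization);
}

int main() {
    const double serviceSec    = 0.010;           // ~10 ms per disk access
    const double arrivalPerSec = 90.0;            // offered load: 90 requests/s

    // One fast, expensive server handling all the load:
    double rho1 = arrivalPerSec * serviceSec;     // utilization = 0.9
    std::printf("1 server:   utilization %.2f, response %.1f ms\n",
                rho1, 1000 * responseTime(serviceSec, rho1));

    // Ten cheap servers, load split evenly:
    double rho10 = (arrivalPerSec / 10.0) * serviceSec;   // utilization = 0.09
    std::printf("10 servers: utilization %.2f, response %.1f ms\n",
                rho10, 1000 * responseTime(serviceSec, rho10));
    return 0;   // ~100 ms vs ~11 ms: the queue, not the seek, is the problem
}
```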
What To Do About HIGH Availability
• Need a remote MIRRORED site to tolerate …
• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + Windows NT
• A supercomputer for 3M$
• Classic Fortran/MPI programming
• DCOM programming model

Your Tax Dollars At Work: ASCI for Stockpile Stewardship
Observations
• Uniprocessor RAP << PAP
  – real app performance << peak advertised performance
• Growth has slowed (Bell Prize)
  – 1987: 0.5 GFLOPS
  – 1988: 1.0 GFLOPS (1 year)
  – 1990: 14 GFLOPS (2 years)
  – 1994: 140 GFLOPS (4 years)
  – 1997: 604 GFLOPS
  – 1998: 1600 G__OPS (4 years)
Two Generic Kinds of Computing
• Many little
  – embarrassingly parallel
  – fit the RPC model
  – fit the partitioned data and computation model
  – random placement works OK
  – OLTP, file server, email, web, …
• Few big
  – sometimes not obviously parallel
  – do not fit the RPC model (BIG RPCs)
  – scientific, simulation, data mining, ...
Many Little Programming Model
• many small requests
• route requests to data
• encapsulate data with procedures (objects)
• three-tier computing
• RPC is a convenient/appropriate model
• Transactions are a big help in error handling
• Auto partition (e.g. hash data and computation; see the sketch below)
• Works fine
• Software CyberBricks
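A minimal sketch of the "route requests to data" / auto-partition idea: hash the request's key to pick the node that owns that partition, then ship the request there so data and computation stay together. The node count, key type, and the stubbed-out RPC are hypothetical illustrations.

```cpp
// route_by_hash.cpp -- sketch of hash-partitioned request routing.
// Node count, key type, and the stub RPC are hypothetical illustrations.
#include <cstdio>
#include <functional>
#include <string>

const int kNodes = 16;   // hypothetical cluster size

// Pick the node that owns this key's partition.
int ownerNode(const std::string& key) {
    return static_cast<int>(std::hash<std::string>{}(key) % kNodes);
}

// Stand-in for the real RPC (DCOM, CORBA, ...) to the owning node.
void sendRequest(int node, const std::string& key) {
    std::printf("request for key '%s' -> node %d\n", key.c_str(), node);
}

int main() {
    for (const std::string key : {"alice", "bob", "carol"})
        sendRequest(ownerNode(key), key);     // data and computation stay together
    return 0;
}
```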
Object Oriented Programming: Parallelism From Many Little Jobs
• Automatic parallelism
  – among transactions (locking)
  – within a transaction (parallel execution)
SQL a Non-Procedural Programming Language
• SQL: functional programming language describes answer set.
• Optimizer picks the best execution plan
  – picks the data flow web (pipeline)
  – degree of parallelism (partitioning)
  – other execution parameters (process placement, memory, ...)
[Figure: query processing stages – GUI and Schema feed the Optimizer, which produces a Plan; Execution Planning turns the plan into Executors connected by Rivers, watched by a Monitor]
Partitioned Execution
[Figure: a table partitioned into key ranges A...E, F...J, K...N, O...S, T...Z; a Count operator runs on each partition and the partial counts combine into a single Count]
Spreads computation and IO among processors
Partitioned data gives NATURAL parallelism
N x M way Parallelism
[Figure: the five key-range partitions A...E, F...J, K...N, O...S, T...Z each feed a Sort and a Join; the resulting streams combine through Merge operators]
N inputs, M outputs, no bottlenecks.
Partitioned data; partitioned and pipelined data flows.
Automatic Parallel Object Relational DB

  select image
  from   landsat
  where  date between 1970 and 1990
     and overlaps(location, :Rockies)
     and snow_cover(image) > .7;

[Figure: the Landsat table (date, loc, image) with temporal, spatial, and image access methods; dates run 1/2/72 ... 4/8/95, locations 33N120W ... 34N120W; the date, location, & image tests produce the Answer]

Assign one process per processor/disk:
  find images with the right date & location;
  analyze the image, and if 70% snow cover, return it.
Data Rivers: Split + Merge Streams
Producers add records to the river; consumers consume records from the river.
Purely sequential programming.
The river does flow control and buffering, and does partition and merge of data records.
River = Split/Merge in Gamma = Exchange operator in Volcano / SQL Server.
[Figure: N producers feed the river, M consumers drain it – N x M data streams]
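A minimal single-process sketch of the river idea: producers push records into a bounded buffer that does the flow control, and consumers pop them while staying purely sequential. The thread counts and the record type are assumptions; a real river would also partition and merge streams across nodes.

```cpp
// river_sketch.cpp -- N producers, M consumers; river = bounded buffer with flow control.
// Single-process illustration; record type and sizes are assumptions.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct River {
    std::queue<int> buf;                 // records in flight (ints stand in for records)
    size_t capacity = 64;                // flow control: producers block when full
    bool closed = false;
    std::mutex m;
    std::condition_variable notFull, notEmpty;

    void put(int rec) {
        std::unique_lock<std::mutex> lk(m);
        notFull.wait(lk, [&] { return buf.size() < capacity; });
        buf.push(rec);
        notEmpty.notify_one();
    }
    bool get(int& rec) {                 // returns false once the river is drained
        std::unique_lock<std::mutex> lk(m);
        notEmpty.wait(lk, [&] { return !buf.empty() || closed; });
        if (buf.empty()) return false;
        rec = buf.front(); buf.pop();
        notFull.notify_one();
        return true;
    }
    void close() {
        std::lock_guard<std::mutex> lk(m);
        closed = true;
        notEmpty.notify_all();
    }
};

int main() {
    River river;
    const int N = 4, M = 3;              // N producers, M consumers (assumed)

    std::vector<std::thread> producers, consumers;
    for (int p = 0; p < N; ++p)
        producers.emplace_back([&, p] {
            for (int i = 0; i < 1000; ++i) river.put(p * 1000 + i);
        });
    for (int c = 0; c < M; ++c)
        consumers.emplace_back([&, c] {
            int rec, count = 0;
            while (river.get(rec)) ++count;    // purely sequential consumer code
            std::printf("consumer %d got %d records\n", c, count);
        });

    for (auto& t : producers) t.join();
    river.close();
    for (auto& t : consumers) t.join();
    return 0;
}
```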
Generalization: Object-Oriented Rivers
• Rivers transport a sub-class of record-set (= stream of objects)
  – record type and partitioning are part of the subclass
• Node transformers are data pumps
  – an object with river inputs and outputs
  – do late binding to the record type
• Programming becomes data flow programming
  – specify the pipelines
• Compiler/Scheduler does data partitioning and “transformer” placement
NT Cluster Sort as a Prototype
• Using data generation and sort as a prototypical app
• “Hello world” of distributed processing
• goal: easy install & execute
Remote Install
• Add a Registry entry to each remote node:
  – RegConnectRegistry()
  – RegCreateKeyEx()
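A hedged sketch of that remote-install step using the two Win32 calls named on the slide (plus RegSetValueEx/RegCloseKey, which I have added); the node names, key path, and value are hypothetical placeholders, and error handling is reduced to early returns.

```cpp
// remote_install.cpp -- add a Registry entry on a remote node (Win32 sketch).
// Node names, key path, and value below are hypothetical placeholders.
#include <windows.h>
#include <cstdio>
#include <cstring>

bool AddRemoteEntry(const char* node, const char* subKey,
                    const char* valueName, const char* value) {
    HKEY remoteHklm = nullptr;
    // Connect to HKEY_LOCAL_MACHINE on the remote node (needs admin rights there).
    if (RegConnectRegistryA(node, HKEY_LOCAL_MACHINE, &remoteHklm) != ERROR_SUCCESS)
        return false;

    HKEY key = nullptr;
    DWORD disposition = 0;
    if (RegCreateKeyExA(remoteHklm, subKey, 0, nullptr, REG_OPTION_NON_VOLATILE,
                        KEY_WRITE, nullptr, &key, &disposition) != ERROR_SUCCESS) {
        RegCloseKey(remoteHklm);
        return false;
    }

    LONG rc = RegSetValueExA(key, valueName, 0, REG_SZ,
                             reinterpret_cast<const BYTE*>(value),
                             static_cast<DWORD>(std::strlen(value) + 1));
    RegCloseKey(key);
    RegCloseKey(remoteHklm);
    return rc == ERROR_SUCCESS;
}

int main() {
    // Hypothetical: register the sort app on each node of the cluster.
    const char* nodes[] = {"\\\\node01", "\\\\node02"};
    for (const char* n : nodes)
        std::printf("%s: %s\n", n,
                    AddRemoteEntry(n, "SOFTWARE\\ClusterSort", "Executable",
                                   "C:\\apps\\sort.exe") ? "ok" : "failed");
    return 0;
}
```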
Cluster Startup & Execution
• Setup: MULTI_QI struct, COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve the remote object handle from the MULTI_QI struct
• Invoke methods as usual
[Figure: three remote object handles, each invoking Sort()]
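A hedged C++ sketch of that startup sequence: fill a COSERVERINFO with the remote node name, ask CoCreateInstanceEx for an interface via a MULTI_QI, then call methods through the returned pointer. The CLSID, the ISort interface, and the node name are hypothetical stand-ins for the real sort component.

```cpp
// cluster_start.cpp -- start a remote COM object on a cluster node (DCOM sketch).
// The CLSID, the ISort interface, and the node name are hypothetical placeholders.
#include <objbase.h>
#include <cstdio>

// Hypothetical interface the remote sort component would expose.
struct ISort : public IUnknown {
    virtual HRESULT STDMETHODCALLTYPE Sort(const wchar_t* inFile,
                                           const wchar_t* outFile) = 0;
};

HRESULT StartRemoteSort(const wchar_t* node, REFCLSID clsid, REFIID iid,
                        IUnknown** out) {
    COSERVERINFO server = {};                 // where to create the object
    server.pwszName = const_cast<wchar_t*>(node);

    MULTI_QI qi = {};                         // which interface we want back
    qi.pIID = &iid;

    HRESULT hr = CoCreateInstanceEx(clsid, nullptr, CLSCTX_REMOTE_SERVER,
                                    &server, 1, &qi);
    if (FAILED(hr)) return hr;
    if (FAILED(qi.hr)) return qi.hr;

    *out = qi.pItf;                           // the remote object "handle"
    return S_OK;
}

int main() {
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);

    GUID clsidSortApp = {};                   // placeholders; the real CLSID and IID
    GUID iidISort = {};                       // come from the registered component
    IUnknown* obj = nullptr;

    if (SUCCEEDED(StartRemoteSort(L"node01", clsidSortApp, iidISort, &obj))) {
        ISort* sorter = static_cast<ISort*>(obj);
        sorter->Sort(L"\\\\node01\\d$\\in.dat", L"\\\\node01\\d$\\out.dat");
        sorter->Release();                    // invoke methods as usual, then release
    }
    CoUninitialize();
    return 0;
}
```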
Cluster Sort Conceptual Model
•Multiple Data Sources
•Multiple Data Destinations
•Multiple nodes
•Disks -> Sockets -> Disk -> Disk
[Figure: nodes A, B, and C each start with a mix of records (AAABBBCCC); after redistribution each node holds only its own keys (AAAAAAAAA, BBBBBBBBB, CCCCCCCCC)]
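A compact sketch of that data movement: scan the local input, route each record through a socket (stubbed here) to the node that owns its key range, and then each node sorts whatever it received. The three-node layout and the bucket-by-first-letter rule mirror the A/B/C picture above and are my assumptions about the real code.

```cpp
// cluster_sort_model.cpp -- one-pass distribution sort, conceptual sketch.
// Three nodes and "first letter picks the owner" mirror the A/B/C picture;
// the sendTo stub stands in for the real socket traffic.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

const int kNodes = 3;

// Which node owns this record? Here: bucket by the first character of the key.
int ownerOf(const std::string& rec) {
    return (rec.empty() ? 0 : (rec[0] - 'A')) % kNodes;
}

// Stub for "send this record over a socket to another node".
void sendTo(int node, const std::string& rec, std::vector<std::string>* cluster) {
    cluster[node].push_back(rec);
}

int main() {
    // Each node starts with a mix of A, B, and C records (disks -> sockets).
    std::vector<std::string> localInput[kNodes] = {
        {"A1", "B1", "C1"}, {"A2", "B2", "C2"}, {"A3", "B3", "C3"}};
    std::vector<std::string> received[kNodes];      // sockets -> disk

    for (int n = 0; n < kNodes; ++n)                // phase 1: scatter by key range
        for (const auto& rec : localInput[n])
            sendTo(ownerOf(rec), rec, received);

    for (int n = 0; n < kNodes; ++n) {              // phase 2: each node sorts its own range
        std::sort(received[n].begin(), received[n].end());
        std::printf("node %d:", n);
        for (const auto& rec : received[n]) std::printf(" %s", rec.c_str());
        std::printf("\n");
    }
    return 0;
}
```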
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: a federation
• Each node does not completely trust the others
• Nodes use RPC to talk to each other
– CORBA? DCOM? IIOP? RMI?
– One or all of the above.
• Huge leverage in high-level interfaces
• Same old distributed system story