Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Parallel Database Systems 101
Jim Gray & Gordon Bell, Microsoft Corporation
Presented at VLDB 95, Zurich, Switzerland, Sept 1995
• Detailed notes available from [email protected]
– This presentation is 120 of the 174 slides (time limit)
– Notes in PowerPoint7 and Word7
2Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
• Data storage, organization, and analysis is a challenge.
• That is what databases are about.
• DBs do a good job on “records”.
• Now working on text, spatial, image, and sound.
6Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Database Store ALL Data Types
• The Old World:
– Millions of objects
– 100-byte objects
[Figure: People table with columns Name, Address; names Mike, Won, David; addresses NY, Berk, Austin]
• The New World:
– Billions of objects
– Big objects (1MB)
– Objects have behavior (methods)
[Figure: People table with columns Name, Address, Papers, Picture, Voice]
• Paperless office
• Library of Congress online
• All information online: entertainment, publishing, business
• Information Network, Knowledge Navigator, Information at your fingertips
7Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Skew: If tasks get very small, variance > service time
69Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Benchmark Buyer's Guide
[Figure: Throughput vs. Processors & Discs; labels: The Whole Story (for any system), The Benchmark Report]
Things to ask:
• When does it stop scaling?
• Throughput numbers, not ratios.
Standard benchmarks allow
• comparison to others
• comparison to sequential
Ratios and non-standard benchmarks are red flags.
70Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Performance 101: Scan Rate
[Figure: Scan feeding Agg Count]
Disk is 3MB/s to 10MB/s
Record is 100B to 200B (TPC-D 110...160, Wisconsin 204)
So should be able to read 10kr/s to 100kr/s
Simple test: time this on a 1M record table
SELECT count(*) FROM T WHERE x < :infinity;
(table on one disk, turn off parallelism)
Typical problems:
• disk or controller is an antique
• no read-ahead in operating system or DB
• small page reads (2KB)
• data not clustered on disk
• big CPU overhead in record movement
Parallelism is not the cure for these problems
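The scan-rate estimate on this slide is simple division, sketched below. The function name and the decimal-megabyte convention are assumptions for illustration; the bandwidth and record sizes are the ones quoted on the slide.

```python
# Back-of-envelope scan rate: records/s = disk bandwidth / record size.
MB = 1_000_000  # assumption: decimal megabytes


def scan_rate(disk_bytes_per_s: float, record_bytes: float) -> float:
    """Records per second one disk can deliver if nothing else is the bottleneck."""
    return disk_bytes_per_s / record_bytes


slowest = scan_rate(3 * MB, 200)   # slow disk, large records
fastest = scan_rate(10 * MB, 100)  # fast disk, small records
print(f"{slowest:,.0f} to {fastest:,.0f} records/s")
```

A system that scans a 1M record table far more slowly than this range suggests is probably hitting one of the typical problems listed above.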
71Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Parallel Scan Rate
[Figure: four parallel Scan -> Agg Count pipelines feeding one Agg Sum]
Simplest parallel test: scale up the previous test:
• 4 disks, 4 controllers, 4 processors
• 4 times as many records, partitioned 4 ways
• Same query
Should have same elapsed time.
Some systems do.
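The shape of this test (one count per partition, then a sum of the sub-counts) can be sketched as follows. This is only a structural illustration under stated assumptions: the function names are hypothetical, the partitions are in-memory lists rather than disks, and Python threads do not give real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def local_count(partition, limit):
    """Scan one partition and count rows with x < limit (the per-partition Agg Count)."""
    return sum(1 for x in partition if x < limit)


def parallel_count(partitions, limit):
    """Run one scan per partition, then add up the sub-counts (the final Agg Sum)."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        sub_counts = list(pool.map(lambda p: local_count(p, limit), partitions))
    return sum(sub_counts)


rows = list(range(4_000))
partitions = [rows[i::4] for i in range(4)]  # 4-way partitioning of the table
total = parallel_count(partitions, float("inf"))
print(total)
```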
72Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Parallel Update Rate
[Figure: UPDATE writing to the Log]
Test: UPDATE T SET x = x + :one;
Test for million row T on 1 disk
Test for four million row T on 4 disks
Look for bottlenecks.
After each call, execute ROLLBACK WORK
See if UNDO runs at the DO speed
See if UNDO is parallel (scales up)
74Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
The records/$/second Metric
• Parallel database systems scan data
• An interesting metric (100 byte record):
– Record Scan Rate / System Cost
• Typical scan rates: 1k records/s to 30k records/s
• Each scaleable system has a “slice price” guess:
– Gateway: 15k$ (P5 + ATM + 2 disks + NT + SQLserver or Informix or
78Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Index Partitioning
Hash indices partition by hash
B-tree indices partition as a forest of trees: one tree per range
Primary index clusters data
[Figure: index partitions by key range: 0...9, 10..19, 20..29, 30..39, 40.. and A..C, D..F, G...M, N...R, S..Z]
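A minimal sketch of how a key could be routed to a partition under each scheme. The boundary list mirrors the numeric ranges in the figure; the function names and the use of CRC32 as the hash are assumptions for illustration, not what any surveyed product does.

```python
import bisect
import zlib

# Upper bounds of the ranges 0..9, 10..19, 20..29, 30..39; everything above goes to "40..".
RANGE_UPPER_BOUNDS = [9, 19, 29, 39]


def range_partition(key: int) -> int:
    """Index of the B-tree (one tree per range) that owns this key."""
    return bisect.bisect_left(RANGE_UPPER_BOUNDS, key)


def hash_partition(key: str, n_partitions: int) -> int:
    """Index of the hash partition that owns this key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


print([range_partition(k) for k in (0, 9, 10, 39, 40)])
print(hash_partition("Gray", 5))
```

Range partitioning keeps neighboring keys together, which is what lets a primary index cluster the data; hash partitioning gives that up in exchange for an even spread.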
79Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Secondary Index Partitioning
In shared nothing, secondary indices are problematic
Partition by base table key ranges:
• Insert: completely local (but what about unique?)
• Lookup: examines ALL trees (see figure)
• Unique index involves lookup on insert.
Partition by secondary key ranges:
• Insert: two nodes (base and index)
• Lookup: two nodes (index -> base)
• Uniqueness is easy
• Teradata solution
[Figure: one case shows secondary index ranges A..C, D..F, G...M, N...R, S.. over a Base Table; the other shows a full A..Z index at each Base Table partition]
80Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Kinds of Parallel Execution
Pipeline: [Figure: one Any Sequential Program feeding another Any Sequential Program]
Partition: outputs split N ways, inputs merge M ways [Figure: parallel copies of Any Sequential Program]
81Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Data Rivers Split + Merge Streams
[Figure: N producers -> River -> M consumers (N x M data streams)]
• Producers add records to the river; consumers consume records from the river.
• Purely sequential programming.
• River does flow control and buffering, and does partition and merge of data records.
• River = Split/Merge in Gamma = Exchange operator in Volcano.
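The split/merge idea can be illustrated with a toy river that routes records from N producers into M consumer streams. This is a sketch under stated assumptions: the `river` function and its parameters are hypothetical, and it buffers everything in memory, whereas a real river also does flow control.

```python
def river(producers, n_consumers, partition):
    """Merge records from every producer and split them across consumer streams.

    `partition` maps a record to an integer; the record goes to stream
    partition(record) % n_consumers.
    """
    streams = [[] for _ in range(n_consumers)]
    for producer in producers:      # N producers, each a plain sequential iterator
        for record in producer:
            streams[partition(record) % n_consumers].append(record)
    return streams                  # M streams, each read sequentially by one consumer


streams = river(
    [range(0, 5), range(5, 10), range(10, 15)],  # 3 producers
    n_consumers=2,
    partition=lambda r: r,
)
print(streams)
```

Neither the producers nor the consumers know about the partitioning, which is the sense in which the programming stays purely sequential.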
82Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Partitioned Execution
[Figure: a table partitioned into ranges A...E, F...J, K...N, O...S, T...Z; one Count per partition feeding a final Count]
Spreads computation and IO among processors
Partitioned data gives NATURAL parallelism
83Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
N x M way Parallelism
[Figure: partitions A...E, F...J, K...N, O...S, T...Z each feed a Sort and a Join; results flow into Merge operators]
N inputs, M outputs, no bottlenecks.
Partitioned data
Partitioned and pipelined data flows
84Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Picking Data Ranges
Disk Partitioning
• For range partitioning, sample load on disks; cool hot disks by making range smaller.
• For hash partitioning, cool hot disks by mapping some buckets to others.
River Partitioning
• Use hashing and assume uniform.
• If range partitioning, sample data and use histogram to level the bulk.
Teradata, Tandem, Oracle use these tricks
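One way to "sample data and use histogram to level the bulk" is to take equi-depth split points from a sorted sample. The sketch below assumes that approach; the function names are hypothetical and the vendors' actual methods are not described on the slide.

```python
import bisect


def range_boundaries(sample, n_partitions):
    """Pick n_partitions - 1 split points so each range holds about the same share of the sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // n_partitions] for i in range(1, n_partitions)]


def partition_of(key, boundaries):
    """Range partition that owns `key` for the given split points."""
    return bisect.bisect_right(boundaries, key)


boundaries = range_boundaries(range(100), 4)
counts = [0, 0, 0, 0]
for key in range(100):
    counts[partition_of(key, boundaries)] += 1
print(boundaries, counts)
```

With a skewed sample the split points crowd together where the data is dense, which is the "make the hot range smaller" effect.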
85Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Blocking Operators = Short Pipelines
An operator is blocking if it does not produce any output until it has consumed all its input.
Examples: Sort, Aggregates, Hash-Join (reads all of one operand)
Blocking operators kill pipeline parallelism.
They make partition parallelism all the more important.
[Figure: Database Load Template has three blocked phases. Labels: sources Tape, File, SQL Table, Process; operators Scan, Sort Runs, Merge Runs, Table Insert, Index Insert; targets SQL Table, Index 1, Index 2, Index 3]
86Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Simple Aggregates (sort or hash?)
Simple aggregates (count, min, max, ...) can use indices
• More compact
• Sometimes have aggregate info.
GROUP BY aggregates
• Scan in category order if possible (use indices)
• Else, if categories fit in RAM, use RAM category hash table
• Else make temp of <category, item>, sort by category, do math in merge step.
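The two GROUP BY fallbacks above can be sketched side by side. This is an illustration only: SUM stands in for any aggregate, the function names are hypothetical, and the in-memory `sorted` call stands in for an external sort of the temp file.

```python
from itertools import groupby
from operator import itemgetter


def hash_group_sum(rows):
    """Categories fit in RAM: keep one running total per category in a hash table."""
    totals = {}
    for category, item in rows:
        totals[category] = totals.get(category, 0) + item
    return totals


def sort_group_sum(rows):
    """Categories do not fit: sort <category, item> by category, aggregate while merging."""
    ordered = sorted(rows, key=itemgetter(0))
    return {
        category: sum(item for _, item in group)
        for category, group in groupby(ordered, key=itemgetter(0))
    }


rows = [("a", 1), ("b", 2), ("a", 3)]
print(hash_group_sum(rows), sort_group_sum(rows))
```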
88Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Sort
Used for
• loading and reorganization (sort makes them sequential)
• build B-trees
• reports
• non-equijoins
Rarely used for aggregates or equi-joins (if hash available)
[Figure: Input Data -> Sort -> Runs -> Merge -> Sorted Data]
89Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Parallel Sort
M input, N output sort design
[Figure: Scan or other source -> range or hash partition River -> sub-sorts generate runs -> merge runs]
River is range or hash partitioned
Disk and merge not needed if sort fits in memory
Scales linearly because $\frac{\log(10^{12})}{\log(10^{6})} = \frac{12}{6}$ => 2x slower
Sort is benchmark from hell for shared nothing machines: net traffic = disk bandwidth, no data filtering at the source
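One reading of the "2x slower" figure, assuming sort cost is modeled as $n \log n$ comparisons (the slide gives only the ratio, so the cost model is an assumption): the per-record cost grows as $\log n$, so going from $10^{6}$ to $10^{12}$ records only doubles it.

```latex
\frac{\text{cost per record at } n = 10^{12}}{\text{cost per record at } n = 10^{6}}
  = \frac{(n \log n)/n \,\big|_{n = 10^{12}}}{(n \log n)/n \,\big|_{n = 10^{6}}}
  = \frac{\log 10^{12}}{\log 10^{6}}
  = \frac{12}{6}
  = 2
```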
90Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
SIGMOD Sort Award
Datamation Sort: 1M records (100 B recs)
• 1000 seconds 1986
• 60 seconds 1990
• 7 seconds 1994
• 3.5 seconds 1995 (SGI Challenge)
• micros finally beat the mainframe!
• finally! a UNIX system that does IO
SIGMOD MinuteSort
• 1.1GB, Nyberg, 1994, Alpha 3cpu
• 1.6GB, Nyberg, 1995, SGI Challenge (12 cpu)
• no SIGMOD PennySort record
[Figure: Sort Time on an SGI Challenge, 1.6 GB (16 M 100-byte records), 12 cpu, 2.2 GB, 96 disk; y-axis: elapsed time (seconds, 0 to 250); x-axis: threads (sprocs) devoted to sorting (1, 2, 4, 6, 10); phases: pin, read-done, lists-sorted, lists merged, write done]
[Figure: Sort Records/second vs Time; x-axis: 1985 to 1995; y-axis: 1.0E+02 to 1.0E+06; systems: M68000, Cray YMP, IBM 3090, Tandem, Hardware Sorter, Sequent, Alpha, Intel HyperCube, SGI]
91Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Nested Loops Join
[Figure: Outer Table probing Inner Table]
If inner table indexed on join cols (b-tree or hash), then
• sequential scan outer (from start key)
• for each outer record, probe inner table for matching recs
Works best if inner is in RAM (=> small inner)
92Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Merge Join (and sort-merge join)
[Figure: Left Table merged with Right Table; NxM case: Cartesian product]
Partitions well: partition smaller to larger partition.
Works for all joins (outer, non-equijoins, Cartesian, exclusion,...)
If tables sorted on join cols (b-tree or hash), then sequential scan each (from start key):
• left < right: advance left
• left = right: match
• left > right: advance right
Nice sequential scan of data (disk speed)
(MxN case may cause backwards rescan)
Sort-merge join sorts before doing the merge
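The advance-left / match / advance-right rule can be sketched for an equi-join over two inputs already sorted on the join key. The function name and tuple layout are assumptions for illustration; the duplicate-key branch is the NxM case, where this version re-reads the matching run on the right side.

```python
def merge_join(left, right, key=lambda row: row[0]):
    """Equi-join two lists that are already sorted on the join key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1                      # left < right: advance left
        elif lk > rk:
            j += 1                      # left > right: advance right
        else:
            # left = right: find the run of equal keys on the right ...
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # ... and pair it with every equal-key row on the left (N x M case).
            while i < len(left) and key(left[i]) == lk:
                out.extend((left[i], right[jj]) for jj in range(j, j_end))
                i += 1
            j = j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = merge_join(left, right)
print(joined)
```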
93Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Hash Join
Hash smaller table into N buckets (hope N=1)
• If N=1, read larger table, hash to smaller
• Else, hash outer to disk, then bucket-by-bucket hash join.
Purely sequential data behavior
Always beats sort-merge and nested unless data is clustered.
Good for equi, outer, exclusion join
Lots of papers, products just appearing (what went wrong?)
Hash reduces skew
[Figure: Left Table and Right Table hashed into Hash Buckets]
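A minimal sketch of the N=1 case, where the smaller table fits in one in-memory hash table and the larger table is read once. The function name and row layout are assumptions; the spill-to-disk, bucket-by-bucket case is not shown.

```python
from collections import defaultdict


def hash_join(smaller, larger, key=lambda row: row[0]):
    """Build a hash table on the smaller input, then probe it with each row of the larger."""
    table = defaultdict(list)
    for row in smaller:                 # build phase: one sequential pass
        table[key(row)].append(row)
    return [
        (match, row)
        for row in larger               # probe phase: one sequential pass
        for match in table.get(key(row), [])
    ]


depts = [(10, "db"), (20, "os")]
emps = [(10, "mike"), (20, "won"), (10, "david"), (30, "nobody")]
joined = hash_join(depts, emps)
print(joined)
```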
95Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Parallel Hash Join
ICL implemented hash join with bitmaps in CAFS machine (1976)!
Kitsuregawa pointed out the parallelism benefits of hash join in early 1980’s (it partitions beautifully)
We ignored them! (why?) But now, everybody's doing it (or promises to do it).
Hashing minimizes skew, requires little thinking for redistribution
Hashing uses massive main memory
98Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Observations
It is easy to build a fast parallel execution environment (no one has done it, but it is just programming)
It is hard to write a robust and world-class query optimizer.
• There are many tricks
• One quickly hits the complexity barrier
Common approach:
• Pick best sequential plan
• Pick degree of parallelism based on bottleneck analysis
• Bind operators to process
99Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
What’s Wrong With That?
Why isn’t the best serial plan the best parallel plan?
Counter example:
• Table partitioned with local secondary index at two nodes
• Range query selects all of node 1 and 1% of node 2.
• Node 1 should do a scan of its partition.
• Node 2 should use secondary index.
SELECT * FROM telephone_book WHERE name < 'NoGood';
Sybase Navigator & DB2 PE should get this right.
We need theorems here (practitioners do not have them)
[Figure: the A..M partition does a Table Scan; the N..Z partition does an Index Scan]
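The counter example amounts to choosing the access path per partition rather than once for the whole query. A toy sketch, assuming a simple selectivity cutoff (the 5% threshold and the function name are hypothetical, not taken from any of the surveyed optimizers):

```python
def choose_plan(selectivity: float, index_threshold: float = 0.05) -> str:
    """Pick an access path for one partition from the fraction of its rows the query selects."""
    return "index scan" if selectivity < index_threshold else "table scan"


# name < 'NoGood' selects all of node 1 and about 1% of node 2.
plans = {"node 1": choose_plan(1.00), "node 2": choose_plan(0.01)}
print(plans)
```

A single serial plan would force the same choice on both nodes, so one of them would do far more work than it needs to.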
101Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
103Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
System Survey Ground Rules
Premise: The world does not need yet another PDB survey
It would be nice to have a survey of “real” systems
Visited each parallel DB vendor I could (time limited)
Asked not to be given confidential info.
Asked for public manuals and benchmarks
Asked that my notes be reviewed
I say only nice things (I am a PDB booster)
104Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Acknowledgments
Teradata: Todd Walter and Carrie Ballinger
Tandem: Susanne Englert, Don Slutz, HansJorge Zeller, Mike Pong
Oracle: Gary Hallmark, Bill Widdington
Informix: Gary Kelley, Hannes Spintzik, Frank Symonds, Dave Clay
Navigator: Rick Stellwagen, Brian Hart, Ilya Listvinsky, Bill Huffman, Bob McDonald, Jan Graveson Ron Chung Hu, Stuart Thompto
DB2: Chaitan Baru, Gilles Fecteau, James Hamilton, Hamid Pirahesh
Redbrick: Phil Fernandez, Donovan Schneider
105Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Teradata • Ship 1984, now an ATT GIS brand name
• Parallel DB server for decision support SQL in, tables out
• Support Heterogeneous data (convert to client format)
• Data hash partitioned among AMPs with fallback (mirror) hash.
• Applications run on clients
• Biggest installation: 476 nodes, 2.4 TB
• Ported to UNIX base
[Figure: Teradata architecture; labels: Application Processor, PEP, AMP; clients: IBM, PC, Mac, UNIX, VMS, AS400]
106Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
[Figure: Teradata architecture; labels: Application Processor, PEP, AMP; clients: IBM, PC, Mac, UNIX, VMS, AS400]
Parsing Engines
• Interface to IBM or Ethernet or...
• Accept SQL, return records and status.
• Support SQL 89, moving to SQL92
• Parse, plan & authorize SQL; cost based optimizer
• Issue requests to AMPs
• Merge AMP results to requester.
• Some global load control based on client priority (adaptive and GREAT!)
Access Modules
• Almost all work done in AMPs
• A shared nothing SQL engine: scans, inserts, joins, log, lock,....
• Manages up to 4 disks (as one logical volume)
• Easy design, manage, grow (just add disk)
107Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Data Layout: Hash Partitioning
• All data declustered to all nodes
• Each table has a hash key (may be compound)
• Key maps to one of 4,000 buckets
• Buckets map to one of the AMPs
• Non-unique secondary index partitioned by table criterion
• Fallback bucket maps to second AMP in cluster.
Typical cluster is 6 nodes (2 is mirroring).
Cluster limits failure scope: 2 failures only cause data outage if both in same cluster.
Within a node, each hash to cylinder then hash to “page”
Page is a heap with a sorted directory
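The key -> bucket -> AMP indirection, with a fallback AMP in the same cluster, can be sketched as below. Only the 4,000-bucket count and the 6-node cluster size come from the slide; the hash function, the round-robin primary map, and the "next AMP in the cluster" fallback rule are assumptions made for illustration.

```python
import zlib

N_BUCKETS = 4000


def bucket_of(key) -> int:
    """Map a (possibly compound) hash key to one of the 4,000 buckets."""
    return zlib.crc32(repr(key).encode("utf-8")) % N_BUCKETS


def build_maps(n_amps: int, cluster_size: int):
    """Return (primary, fallback) bucket-to-AMP maps; n_amps must be a multiple of cluster_size."""
    primary = [b % n_amps for b in range(N_BUCKETS)]
    fallback = []
    for amp in primary:
        start = (amp // cluster_size) * cluster_size           # first AMP of this cluster
        fallback.append(start + (amp - start + 1) % cluster_size)  # a different AMP, same cluster
    return primary, fallback


primary, fallback = build_maps(n_amps=12, cluster_size=6)
b = bucket_of(("Gray", "Jim"))
print(b, primary[b], fallback[b])
```

Because rows reach AMPs only through the bucket map, rebalancing means remapping buckets rather than rehashing every key.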
108Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Teradata Optimization & Execution
Sophisticated query optimizer(many tricks) Great emphasis on Joins & Aggregates.
Nested, merge, product, bitmap join (no hash join)
Automatic load balancing from hashing & load control
Excellent utilities for data loading, reorganize
Move > 1TB database from old to new in 6 days, in background while old system running
109Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Query Execution
Protocol:
• PE requests work
• AMP responds OK (or pushback)
• AMP works (if all OK)
• AMP declares finished
• When all finished, PE does 2PC and starts pull
Simple scan:
• PE broadcasts scan to each AMP
• Each AMP scans, produces answer spool file
• PE pulls spool file from AMPs via Ynet
If scan were ordered, sort “catcher” would be forked at each AMP, pipelined to scans; Ynet and PE would do merge of merges from AMPs
110Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Aggregates, Updates
Aggregate of Scan:
• Scans produce local sub-aggregates
• Hash sub-aggregates to Ynet
• Each AMP “catches” its sub-aggregate hash buckets
• Consolidate sub-aggregates.
• PE pulls aggregates from AMPs via Ynet.
• Note: fully scaleable design
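The aggregate steps above form a two-phase aggregation, sketched here with SUM as the aggregate. The function names, the CRC32-based owner function, and the in-memory lists standing in for AMPs and the Ynet are all assumptions for illustration.

```python
import zlib
from collections import Counter


def owner(group, n_amps: int) -> int:
    """AMP that catches this group's sub-aggregates (hypothetical hash)."""
    return zlib.crc32(str(group).encode("utf-8")) % n_amps


def local_subaggregate(rows):
    """Phase 1: each AMP's scan sums its own rows per group."""
    sub = Counter()
    for group, value in rows:
        sub[group] += value
    return sub


def catch_and_consolidate(sub_aggregates, n_amps: int):
    """Phase 2: sub-aggregates are hashed by group; each AMP consolidates what it catches."""
    caught = [Counter() for _ in range(n_amps)]
    for sub in sub_aggregates:
        for group, value in sub.items():
            caught[owner(group, n_amps)][group] += value
    return caught


def pull(caught):
    """The PE pulls the final aggregates; each group lives on exactly one AMP."""
    result = {}
    for amp in caught:
        result.update(amp)
    return result


amp_rows = [[("NY", 1), ("CA", 2)], [("NY", 3)], [("TX", 4), ("CA", 5)]]
totals = pull(catch_and_consolidate([local_subaggregate(r) for r in amp_rows], n_amps=3))
print(totals)
```

No single node sees all the rows or all the groups, which is why the slide can call the design fully scaleable.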
Insert / Update / Delete at an AMP node generates insert / update / delete messages to
• unique secondary indices
• fallback bucket of base table.
Messages saved in spool if node is down
111Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Query Execution: Joins
Great emphasis on Joins.
Includes small-table large-table optimization: cheapest triple, then cheapest in triple.
If equi-partitioned, do locally
If not equi-partitioned,
• May replicate small table to large partition (Ynet shines)
• May repartition one if other is already partitioned on join
• May repartition both (in parallel)
Join algorithm within node is
• Product
• Nested
• Sort-merge
• Hash bit map of secondary indices, intersected.
112Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Utilities
Bulk Data Load, Fast Data Load, Multi-load:
• Blast 32KB of data to an AMP
• Multiple sessions by multiple clients can drive 200x parallel
• Double buffer
• AMP unpacks, and puts “upsert” onto Ynet
• One record can generate multiple upserts (transaction -> inventory, store-sales, ...)
• Catcher on Ynet grabs relevant “upserts” to temp file.
• Sorts and then batches inserts (survives restarts).
• Online and restartable.
• Customers cite this as Teradata strength.
Fast Export (similar to bulk data load)
113Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Utilities II
Backup / Restore:
• Rarely needed because of fallback.
• Cluster is unit of recovery
• Backup is online, Restore is offline
Reorganize:
• Rarely needed, add disk is just restart
Add node:
• Rehash all buckets that go to that node (Ynet has old and new bucket map)
• Fully parallel and fault tolerant, takes minutes
114Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Port To UNIX
New design (3700 series) described in VLDB 93
Ported to UNIX platforms (3600 AP, PE, AMP)
Moved Teradata to Software Ynet on SMPs
Based on Bullet-Proof UNIX with TOS layer atop:
• message system
• communications stacks
• raw disk & virtual processors
• virtual partitions (buckets go to virtual partitions)
• removes many TOS limits
Result is 10x to 60x faster than an AMP
Compiled expression evaluation (gives 50x speedup on scans)
116Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
UNIX/SMP Port of Teradata
op                        rows        seconds   k r/s   MB/s
scan                      50000000    737       67.8    11.0
copy                      5000000     1136      4.4     0.7
aggregate                 50000000    788       63.5    10.3
Join 50x2M (clustered)    52000000    768       67.7    11.0
Join 5x5 (unclustered)    10000000    237       42.2    6.8
Join 50Mx.1K              50000100    1916      26.1    4.2
Times to process a Teradata Test DB on an 8 Pentium 3650.
These numbers are 10 to 150x better than a single AMP:
• Compiled expression handling
• more memory
117Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Teradata Good Things
Scaleable to large (multi-terabyte) databases
Available TODAY!
It is VERY real: in production in many large sites
Robust and complete set of utilities
Automatic management.
Integrates with the IBM mainframe OLTP world
Heterogeneous data support is good for a data warehouse
118Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Tandem
Message-based OS (Guardian):
(1) location transparency
(2) fault isolation (failover to other nodes).
Expand software: 255 Systems WAN
Classic shared-nothing system (like Teradata except applications run inside DB machine).
[Figure: 4 node System; 8 x 1MB/S; 30MB/S; 224 PROCESSORS]
1-16 MIPS R4400 cpus, dual port controllers, dual 30MB/s LAN
1974-1985: Encompass: Fault-tolerant Distributed OLTP
1986: NonStopSQL: First distributed and high-performance SQL (200 tps)
1989: Parallel NonStopSQL: Parallel query optimizer/executor
1994: Parallel and Online SQL (utilities, DDL, recovery, ....)
1995: Moving to ServerNet: shared disk model
119Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Tandem Data Layout
Each table or index range partitioned to a set of disks (anywhere in network)
Index is B-tree per partition; clustering index is B+ tree
Table fragments are files (extent based).
Descriptors for all local files live in local catalog (node autonomy)
Tables can be distributed in network (lan or wan)
Duplexed disks and disk processes for failover
[Figure: File = {parts}; labels: Partition, Block; Extents may be added]
120Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
123Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Parallel Operators
Initially just inserted rivers between sequential operators
Parallel query optimizer
Created executors at all clustering nodes, or at all nodes, repartitioned via hash to them
Gave parallel select, insert, update, delete, join, sort, aggregates,...
Correlated subqueries are blocking
Got linear speedup/scaleup on Wisconsin.
Marketing never noticed, product slept from 1989-1993
Developers added:
• Hash Join
• aggregates in disk process
• SQL92 features
• parallel utilities
• online everything
• converted to MIPSco
• fixed bugs
124Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Join Strategies
• Nested loop
• Sort merge
• Both can work off index-only access
• Replicate small to all partitions (when one small)
• Small-table Cartesian product large-table optimization
• Now hybrid-hash join
– uses many small buckets
– tuned to memory demand
– tuned to sequential disk performance
– no bitmaps because (1) parallel hash (2) equijoins usually do not benefit
When both large and unclustered (rare case): N+M scanners, 16 catchers: sortmerge or hybrid hash
125Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Administration (Parallel & Online everything)
All utilities are online (claim to reduce outages by 40%):
• Add table, column,...
• Add index: builds index from stale copy, uses log for catchup; in final minute, gets lock, completes index.
• Reorg B-tree while it is accessed
• Add / split / merge / reorg partition
• Backup
• Recover page, partition, file.
• Add, alter logs, disks, processors, ...
You need this: Terabyte operations take a long time!
Parallel Utilities:
• load (M to N)
• index build (M scanners, N inserters, in background)
• recovery:
126Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Benchmarks
No official DSS benchmark reports
Unofficial results: 1 to 16 R4400 class processors, 64MB each (Himalayas)
Parallel Recovery: (V7.1) @ restart, one log scanner, multiple redoers
Beta in 1993, Ship 6/94.
More Parallel (create table): V7.2, 6/95
Shared disk implementation ported to most platforms
Parallel SELECT (no parallel INSERT, UPDATE, DELETE, DDL) except for sub-selects inside these verbs.
129Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Oracle Data Layout
[Figure: labels: Table or Index, Segment, Block, Extents (may be added), Table Space = File Set]
Homogeneous:
• one table (index) per segment
• extents picked from a TableSpace
Files may be raw disk
Segments are B-trees or heaps.
Data -> disk map is automatic
No range / hash / round-robin partitioning
ROWID can be used as scan partitioning on base tables.
Guiding principle: If it’s not organized, it can’t get disorganized, and doesn’t need to be reorganized.
130Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Oracle Parallel Query Product Concept
Convert serial SELECT plan to parallel plan
• If Table scan or HINT then consider parallel plan
• Table has default degree of parallelism (explicitly set)
• Overridden by system limits and hints.
• Use max degree of all participating tables.
• Intermediate results are hash partitioned
• Nested Loop Join and Merge Join
User hints can (must?) specify join order, join strategy, index, degree of parallelism,...
[Figure: Client -> Query Coordinator -> DB (multi-process & thread)]
131Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Query Planning
Query Coordinator starts with Oracle Cost-Based plan
If plan requests Table scan or HINT then consider parallel plan
• Table has default degree of parallelism (explicitly set)
• Overridden by system limits and hints.
• Use max degree of all participating tables.
Shared disk makes temp space allocation easy
Planner picks degree of parallelism and river partitioning.
Proud of their OR optimization.
132Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Query Execution
Coordinator does extra work to
• merge the outputs of several sorts (subsorts pushed to servers)
• aggregate the outputs of several aggregates (aggregates pushed to servers)
Parallel function invocation is potentially a big win.
SELECT COUNT ( f(a,b,c,...)) FROM T;
Invokes function f on each element of T, 100x parallel.
133Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Join Strategies
Oracle has (1) Nested Loop Join (2) Merge Join
Replicate inner to outer partition automatic in shared disk (looks like partition outer).
Has small-table large-table optimization (Cartesian product join)
User hints can specify join order, join strategy, index, degree of parallelism,...
134Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Transactions & Recovery
Transactions and transaction save points (linear nest).
149Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Configurator
Fully graphical design tool
Given ER model and dataflow model of the application:
• workload characteristics
• response time requirements
• hardware components
(heavy into circles and arrows)
150Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Administrator
Made HUGE investments in this area.
Truly industry leading graphical tools make MPP configuration “doable”.
GUI interface to manage:
• startup / shutdown of cluster
• backup / restore / manage logs
• configure (install, add nodes, configure and tune servers)
• manage / consolidate system event logs
• system stored procedures (global operations) (e.g. aggregate statistics from local to global cat)
• monitor SQL Server events
151Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Data Layout
Pure shared nothing
Navigator partitions data among SQL servers
• map to a subset of the servers
• range partition or hash partition.
Secondary indices are partitioned with base table
No unique secondary indices
Only shorthand views, no protection views
Schema server stores global data definition for all nodes.
Each partition server has
• schema for its partition
• data for its partition
• log for its partition
152Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Sybase SQL Server Backgrounder
• Recently became SQL89 compliant (cursors, nulls, etc)
• Stored procedures, multi-threaded, internationalized
• B*-tree centric (clustering index is B+tree)
• Use nested loops, sort-merge join (sort is index build).
• Page locking, 2K disk IO, ... other little-endian design decisions.
• Respectable TPC-C results (AIX RS/6000).
• UNIX raw disks or files are base (also on OS/2, NetWare,...).
• table->disk mapping:
CREATE DATABASE name ON {device...} LOG ON {device...}
SP_ADDSEGMENT segment, device
CREATE TABLE name(cols) [ON segment]
Microsoft has a copy of the code, deep ported to NT
153Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Navigator Extension Mechanisms
Navigator extended Sybase TDS by
• Adding stored procedures to do things
• Extending the syntax (e.g. see data placement syntax below)
Sybase TDS and OpenServer design are great for this
All front ends “based on OpenServer and threads”
154Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Process Structure - Pure Shared Nothing
[Figure: Clients -> Control (1/node) -> SQL and Split servers; GUI Navigator Manager; DBA server = system manager, monitor & SQL optimizer; schema server = catalogs database in a SQL server]
DBA Server does everything:
• SQL compilation
• System management
• Catalog management
• SQL server restart (in 2nd node)
• DBA fallback detects deadlock, does DBA takeover on fail
Control server at each node manages SQL servers there (security, request caching, 2PC, final merge / aggregate, ... parallel stored procedures (SIMD))
Split server manages re-partitioning of data
SQL Server is unit of query parallelism (one per cpu per node)
155Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Simple Request Processing
[Figure: Client -> Control (1/node) -> SQL and Split servers; DBA server; schema server]
Client connects to Navigator (a Control Server) using standard Sybase TDS protocol.
SQL request flows to DBA server, which compiles it and sends
• stored procedures (plans) to all control servers
• plans to all relevant SQL servers
Control server executes plan: passes to SQL server, returns results.
Plan cached on second call, DBA server not invoked.
Good for OLTP
156Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Parallel Request Processing
[Figure: Client -> several Control servers, each with SQL and Split servers; DBA server; schema server]
If query involves multiple nodes, then command sent to each one (diagram shows secondary index lookup)
Query sent to SQL servers that may have relevant data.
If data needs to be redistributed or aggregated, split servers issue queries and inserts (that is their only role)
Split servers have no persistent storage.
157Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Data Manipulation
SQL server is unit of parallelism
"Parallelized EVERYTHING in the T-SQL language"
Includes SIMD execution of T-SQL procedures, plus N-M data move operations.
Two-level optimization:
• DBA Server has optimizer (BIG investment, all new code, NOT the infamous Sybase optimizer)
• Each SQL server has Sybase optimizer
• If extreme skew, different servers have different plans
• DBA optimizer shares code with SQL server (so they do not play chess with one another).
Very proud of their optimizer.
158Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Query Execution
Classic Selinger cost-based optimizer.
• SELECT, UPDATE, DELETE N-to-M parallel
• Bulk and async INSERT interface.
• N-M parallel sort
• Aggregate (hash/sort)
• Select and join can do index-only access if data is there.
• Eliminate correlated subqueries (convert to join) (Ganski & Wong, SIGMOD 87, extended)
• Join: nested-loop, sort-merge, index only
• Sybase often dynamically builds index to support nested loop (fake sort-merge)
• Typically left-deep sequence of binary joins.
159Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Join and Partition Strategy
Partition strategies
• If already partitioned on join, then no splitting
• Else move subset of T1 to T2 partitions,
• or replicate T1 to all T2 partitions,
• or repartition both T1 and T2 to width of home nodes or target.
No hash join, but all (re)partitioning is range or hash based.
Not aggressive parallelism/pipelining: 2 ops at a time.
Pipeline to disk via split server (not local to disk and then split).
Split servers fake subtables for SQL engines.
Top level aggregates merged by control, others done by split.
160Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Utilities
Bulk data load (N-M) async calls
GUI manages
Backup all SQL servers in parallel
Reorg via CREATE TABLE <new> , INSERT INTO <new> SELECT * FROM <old>
Utilities are mostly offline (as per Sybase)
Nice EXPLAIN utility
161Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Futures
Hash join within split servers
Shared memory optimizations
Full support for unique secondary indices Full trigger support (cross-server triggers)
Full security and view support.
162Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Benchmarks
Preliminary: 8x8 3600 - Ynet.
• node: 8 x (50MHz 486, 256k local cache), 512MB main memory, 2 x 10 disk arrays @ 2GB, 4 MB/s per disk, 6 x Sybase servers
• Scaleup & speedup tests of 1, 4, and 8 nodes.
• Numbers (except loading) reported as ratios of elapsed times
Reference Account: Chase Manhattan Bank
• 14x8 P5 ATT 3600 cluster (112 processors)
• 56 SQL servers, 10GB each = 560 GB
• 100x faster than DB2/MVS (minutes vs days)
Linearity is great.
163Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Navigator Good Things
Concern for lifecycle: design, install, manage, operate, use
Good optimization techniques
Fully parallel, including stored procedures!
Scaleup and Speedup are near linear.
164Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Sybase IQ
Sybase bought Expressway
Expressway evolved from Model 204
bitmap technology: index duplicates with bitmap
compress bitmap.
Can give 10x or 100x speedup.
Can save space and IO bandwidth
Currently, two products (Sybase and IQ) not integrated
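The bitmap idea (one bitmap per distinct value, with predicates combined by bitwise operations) can be sketched with Python integers as bit vectors. This is a generic illustration, not Sybase IQ's or Model 204's implementation; the function names and sample columns are made up, and bitmap compression is not shown.

```python
def build_bitmap_index(column):
    """One bitmap per distinct value; bit i is set when row i holds that value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def row_ids(bitmap):
    """Expand a bitmap back into the list of matching row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


state = ["NY", "CA", "NY", "TX", "CA"]
plan = ["gold", "gold", "basic", "gold", "basic"]
by_state = build_bitmap_index(state)
by_plan = build_bitmap_index(plan)

ny_gold = row_ids(by_state["NY"] & by_plan["gold"])   # state = NY AND plan = gold
ca_or_tx = row_ids(by_state["CA"] | by_state["TX"])   # state = CA OR state = TX
print(ny_gold, ca_or_tx)
```

Columns with many duplicates need only a few bitmaps, and a multi-predicate query becomes a handful of AND/OR operations before any row is fetched.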
165Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
DB2
DB2/VM = SQL/DS: System R gone public
DB2/MVS (classic Parallel Sysplex, Parallel Query Server, ...)
• Parallel and async IO into one process (on mainframe)
• Parallel execution in next release (late next year?)
• MVS PQS now withdrawn?
DB2/AS400: Home grown
DB2-2-PE: OS2/DM grown large.
• First moved to AIX
• Being extended with parallelism
• Parallelism based on SP/2 -- shared nothing done right.
• Benchmarks today - Beta everywhere
DB2++: separate code path has OO extensions, good TPC-C
• Ported to HP/UX, Solaris, NT in beta
166Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
DB2/2 Data Layout
• DATABASE: a collection of nodes (up to 128 SP2s so far)
• NODEGROUP: a collection of logical nodes (a 4k hash map)
• LOGICAL NODE: A DB2 instance (segments, log, locks...)
• PHYSICAL NODE: A box.
• Logical Node: Segments of 4 k pages
– Segments allocated in units (64K default)
– Tables stripe across all segments
• Table created in NodeGroup:
– Hash (partition key) across all members of group
• Cluster has single system Image
[Figure: Nodes in Group 1 and Group 2, each holding Segments]
167Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
DB2/2 Query Execution
• Each node maintains pool of AIX server processes
• Query optimizer does query decomposition to node plans (like R* distributed query decomposition)
• Parallel Optimization is 1Ø (not like Wai Hong’s work)
• Sends sub-plans to nodes to be executed by servers
• Node binds plan to server process
• Intermediate results hashed
• Proud that Optimizer does not need hints.
• “Standard” join strategies (except no hash join).
168Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
DB2/2 Utilities
• 4 loaders:
– import
– raw-insert (fabricates raw blocks, no checks)
– insert
– bulk insert
• Reorganize hash map, add / drop nodes, add devices
– Table unavailable during these operations
• Online & Incremental backup
• Fault tolerance via HACMP
169Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
DB2/2 Performance: Good performance, Great Scaling
Wisconsin scaleups
• big = 4.8 M rec = 1 GB
• small = 1.2 M rec = 256MB
• scan rate ~12 kr/s/node
• raw load: 2.5 kr/s/node
see notes for more data
[Figure: Speedup vs Nodes, DB2/2 PE on SP2; x-axis: nodes (0 to 16); y-axis: speedup (0.0 to 25.0); series: Load, Scan, Agg, SMJ, NLJ, SMJ2, Index1, Index2, MJ]
170Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
DB2/2 Good Things
• Scaleable to 128 nodes (or more)
• From IBM
• Good performance
• Complete SQL (update, insert,...)
• Will converge with DB2/3 (OO and TPC-C stuff)
• Will be available off AIX someday
– (AIX is slow and SP2 is very expensive)
171Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
RedBrick
• Read-only (LOAD then SELECT only) Database system
– Load is incremental and sophisticated
• Precompute indices to make small-large joins run fast
– Indices use compression techniques.
– Only join via indices
• Many aggregate functions to make DSS reports easy
• Parallelism:
– Pipeline IO
– Typically a thread per processor (works on index partition)
– Piggyback many queries on one scan
– Parallel utilities (index in parallel, etc)
– SP2 implementation uses shared disk model.
172Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Summary
There is a LOT of activity (many products coming to market)
Query optimization is near the complexity barrier. Needs a new approach?
All have good speedup & scaleup if they can find a plan
Managing huge processor / disk / tape arrays is hard.
I am working on commoditizing these ideas:
• low $/record/sec (scaleup PC technology)
• low Admin $/node (automate, automate, automate,...)
• Continuous availability (online & fault tolerant)