CS 600.416 Transaction Processing Lecture 18 Parallelism
Motivation for Parallel Databases
• Extremely large data sets
– Special application needs: computer-aided design, World Wide Web
• Queries that have large data requirements
– Decision support systems, statistical analysis
• Inherent parallelism in data
– Set-oriented nature of relations
• Commoditization of parallel computers
– 2- and 4-way SMPs are commonplace
– Clustering software for multiple SMPs is freely available
– Weak point in the argument in light of mainframe OSes
Motivation for Parallel Databases
• Two major reasons for parallel databases, from the previous slide
• Large data sets, applications, and queries
– Because we need it
• Parallel computers and a feasible application domain
– Because we can
Motivation Reality Check
• We have always needed parallel DBs
– DBs have always stretched the capabilities of computer architectures
– Enterprises have always grown to a DB's capabilities
• Distribution cannot really solve the problem
– Replication and latency concerns
• As we learned from the paper last week
– Isolation problems
• Fault and performance isolation
• One big computer is more powerful than 2 equivalent small computers
– Parallel machines look like 1 big computer from the outside
Parallelism
Theoretically, executing a task T on an n-processor system should be n times faster than executing it on a single processor P1.
[Figure: task T divided into equal-sized subtasks T1, T2, …, Tn, one per processor in an n-processor system P1, P2, …, Pn]
Parallelism
• Hardware parallelism
– Parallelism "available" as a result of the existing resources
– E.g., multiprocessors, RAID arrays, etc.
• Software parallelism
– Parallelism that could be "discovered" in an application
– E.g., parallel algorithms, programming style, compiler optimization
Speedup, Efficiency, and Scaleup
• Definition:
– T(p,N) = time to solve a problem of size N on p processors
• Speedup:
– S(p,N) = T(1,N) / T(p,N)
– Compute the same problem with more processors in a shorter time
• Efficiency:
– E(p,N) = S(p,N) / p
• Scaleup:
– Sc(p,N) = N / n, where n satisfies T(1,n) = T(p,N)
– Compute a larger problem with more processors in the same time
• Problems:
– Is S(p,N) close to p, or far less? S(p,N) < p is sub-linear speedup
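A hypothetical worked example of these definitions (all timing values are invented for illustration):

```python
# Hypothetical timing table T(p, N): seconds to solve a problem of size N
# on p processors. All values are invented for illustration.
T = {
    (1, 1000): 100.0,
    (4, 1000): 30.0,    # same problem, 4 processors
    (4, 4000): 100.0,   # 4x the problem, 4 processors, same elapsed time
}

def speedup(p, N):
    """S(p, N) = T(1, N) / T(p, N)."""
    return T[(1, N)] / T[(p, N)]

def efficiency(p, N):
    """E(p, N) = S(p, N) / p."""
    return speedup(p, N) / p

# Scaleup: Sc = N / n where n satisfies T(1, n) = T(p, N).
# Here T(1, 1000) == T(4, 4000) == 100 s, so Sc = 4000 / 1000 = 4 (linear).
scaleup = 4000 / 1000

print(speedup(4, 1000))     # 100/30 ≈ 3.33: sub-linear speedup
print(efficiency(4, 1000))  # ≈ 0.83
print(scaleup)              # 4.0: linear scaleup
```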
Scale up
• Two kinds:
– Batch scaleup
• The size of the task increases
• E.g., the database grows, and a sequential scan is proportionately longer
– Transaction scaleup
• The rate of task submission increases
• Each task may still be short-lived
Scaleup, Speedup
[Figure: two plots. Left: speedup vs. resources, contrasting linear speedup with sublinear speedup. Right: scaleup (Ts/Tl) vs. problem size, contrasting linear scaleup with sublinear scaleup]
Factors against parallelism
• Startup
– With thousands of processes, startup costs become significant
• Interference
– Synchronization and communication costs
– Even 1% contention limits speedup to about 37x
• Skew
– Efficient load balancing is difficult
– At fine granularity, the variance can exceed the mean time to finish one parallel step
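The 37x figure can be reproduced with a simple contention model: assume each added process slows every process by 1%, so S(p) = p / 1.01^p. The model itself is an assumption here (the slides do not name one), but it peaks near p = 100 at roughly 37x:

```python
def speedup_with_contention(p, c=0.01):
    """Speedup on p processors when each added process slows every
    process by a fraction c (here 1%): S(p) = p / (1 + c)**p.
    This particular contention model is an illustrative assumption."""
    return p / (1 + c) ** p

# S(p) grows, peaks, then collapses: past the peak, adding processors hurts.
best_p = max(range(1, 1000), key=speedup_with_contention)
print(best_p)                                      # peak is near p = 100
print(round(speedup_with_contention(best_p), 1))   # ≈ 37x maximum speedup
```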
When is parallelism available?
• Good if:
– Operations access a significant amount of data, e.g., joins of large tables, bulk inserts, aggregation, copying, queries, etc.
– Symmetric multiprocessors
– Sufficient I/O bandwidth; under-utilized or intermittently used CPUs
• Bad if:
– Query executions/transactions are short-lived
– CPU, memory, and I/O resources are heavily utilized
• Software parallelism should utilize hardware parallelism
Parallel Architectures
• Stonebraker’s simple taxonomy for parallel architectures:
– Shared memory: processors share common memory
– Shared disk/clusters: processors share a common set of disks
– Shared nothing: processors share only the network
– Hierarchical: hybrid of the architectures above
Shared Memory
[Figure: processors P sharing a common memory M]
Processors share common memory
Common in SMP systems
Shared Nothing
• Pros
– Cost
• Uses inexpensive computers to build such a system
– Extensibility
• Promotes incremental growth
– Availability
• Redundancy can be introduced by replication of data
• Cons
– Complexity
• Distributed database concepts in a parallel setup
– Difficult to achieve load balancing
• Relies on software parallelism
Shared Disk
[Figure: processors P, each with its own memory M, all attached to a common set of disks]
Processors share a common set of disks
Common in clusters
Network-attached I/O protocols make this architecture more readily available
Shared Disk
• Features
– Shared disk access but exclusive memory access
– Global locking protocols are needed
• Pros
– Cost
• Lower, as standard I/O interconnects can be used
– Extensibility
• Interference is minimized by exclusive memory caches
– Availability
• A degree of fault tolerance in both the processor subsystem and the disks
• Cons
– Highly complex
– The shared disk is a potential bottleneck
Shared Nothing
[Figure: nodes, each with its own processor P, memory M, and disk, connected only by a network]
Network sharing only
Parallelism available without any special hardware support
Shared Memory
• Pros
– Fast processor-to-processor communication
• No software primitives required
– Simplicity
• Meta and control information shared by all
• Cons
– Cost
• Expensive interconnect
– Limited extensibility
• The shared memory soon becomes a bottleneck
• Limited to 10-20 processors
– Cache coherency
– Low availability
• Availability depends on the robustness of the memory
So far…
• Parallelism and its measures
• Problems with parallelism
• Parallel architectures
I/O and Databases
• What’s important about I/O
– Reminder: the performance measure for all DBs is the number of I/Os
– For the most part, it is the only thing that matters
• Why is I/O inherently parallel?
– Even a machine with 1 processor has multiple disks
– Placement of data on these disks greatly affects performance
• What does this tell us about parallel DBs?
– Parallelism is not necessarily about supercomputers; it occurs at many levels in computer systems
– Every system has some degree of parallelism, even in scheduling the different processing units within a CPU
I/O or Disk Parallelism
• Partition data onto multiple disks
– Most frequently horizontal partitioning
– Conduct I/O to all disks at the same time
• Techniques
– Round-robin: send the ith tuple to disk i mod n in an n-disk system
– Hash partitioning: send tuple t to disk f(t), where f is a uniformly distributed hash function
– Range partitioning: break tuples into contiguous ranges of keys; requires a key that can be linearly ordered
– Multi-dimensional partitioning strategies: used for spatial data, images, and other multi-dimensional sets; much recent work here
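The first three strategies can be sketched in a few lines of Python. The disk count, the range vector, and the choice of SHA-1 as the uniform hash are all illustrative assumptions:

```python
import hashlib

N_DISKS = 4  # illustrative disk count

def round_robin(i, tup):
    """Round-robin: send the i-th tuple to disk i mod n."""
    return i % N_DISKS

def hash_partition(tup):
    """Hash: send a tuple to disk f(key) for a uniformly distributed f."""
    digest = hashlib.sha1(str(tup[0]).encode()).hexdigest()
    return int(digest, 16) % N_DISKS

# Range: n-1 split points divide the key space into n contiguous ranges.
RANGE_VECTOR = [100, 200, 300]  # illustrative split points

def range_partition(tup):
    """Send a tuple to the first range whose upper bound exceeds its key."""
    for disk, bound in enumerate(RANGE_VECTOR):
        if tup[0] < bound:
            return disk
    return N_DISKS - 1  # key >= last split point

tuples = [(17, "a"), (150, "b"), (250, "c"), (999, "d")]
print([round_robin(i, t) for i, t in enumerate(tuples)])  # [0, 1, 2, 3]
print([range_partition(t) for t in tuples])               # [0, 1, 2, 3]
```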
Workloads
• Several important/expected workloads
– Scanning the entire relation
– Locating a tuple (identity query)
– Locating a set of tuples based on attribute value
• Range query, e.g., 100 < a < 200
• Find all people whose names start with A
– Note this is not an identity query
Range partitioning
• Partitioning requires a partitioning attribute A, usually the primary key
• A vector of n-1 split points partitions A into n ranges
– Vector {v0, v1, …, vn-2}
• Each tuple t goes into:
– Partition 0 if t[A] < v0
– Partition n-1 if t[A] >= vn-2
– Partition k if vk-1 <= t[A] < vk, for 1 <= k <= n-2
• Simple range partitioning: #disks = #partitions
• Combined with round-robin: #disks * k = #partitions
– Helps avoid variance concentrating in any one partition
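The assignment rule above, and one possible reading of the round-robin combination (each range spread over k disks), might be sketched as follows; the range vector and k are hypothetical values:

```python
V = [10, 20, 30]  # hypothetical range vector {v0, v1, v2}: n = 4 partitions
K = 2             # sub-disks per range when combining with round-robin

def range_of(a):
    """Partition 0 if a < v0; partition k if v_{k-1} <= a < v_k;
    partition n-1 if a >= v_{n-2}."""
    for k, bound in enumerate(V):
        if a < bound:
            return k
    return len(V)  # last partition

counters = {}  # per-range round-robin counters

def disk_of(a):
    """Combined scheme: n*k disks, dealing tuples round-robin within
    each range so no single disk absorbs a whole range's load."""
    r = range_of(a)
    c = counters.get(r, 0)
    counters[r] = c + 1
    return r * K + (c % K)

print([range_of(a) for a in (5, 10, 25, 99)])  # [0, 1, 2, 3]
print([disk_of(a) for a in (5, 5, 5)])         # alternates: [0, 1, 0]
```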
Some Practicalities
• Disk blocks are what we partition
– Block size is generally a tradeoff between I/O performance and space utilization
• Bigger blocks are better for performance: more data moved per I/O
• Bigger blocks fragment data, leading to poor space utilization
– Blocks are generally set to the page size
• Bigger than we would like
• Often lots of space is fragmented (> 50% in file systems)
• What is the problem with larger blocks?
– Small relations don’t get placed on as many disks, so less parallelism
• What is the problem with smaller blocks?
– Pages are what OSes read
– Performance suffers
• Some applications with known large data use larger block sizes
– Particularly scientific applications
Workloads Round Robin
• Ups
– Good for scans: sequential, parallel, and entirely load-balanced
– What about unfairness in the tail (if you always start on the same block)?
• Randomize the start block
• Use a next-block policy
• Downs
– Identity queries search n blocks (n/2 on average if the item exists and is a key; n blocks if it is not a key, or to establish that it is not found)
– Range queries search n blocks; there is no relationship between key value and placement
Workloads Hash Partition
• Ups
– Good for identity queries
• Isolates the query to a single disk
– Good for sequential scans
• Low variance in hashing: Ω(log t) for relations with cardinality t
• Essentially d (the number of disks) times speedup over a single-disk system (actually d / (1 + Ω(log t)))
• Downs
– Bad for range queries: search n blocks
– Bad for identity queries on non-partitioning attributes
• E.g., partition/hash on SS# and look up by last name
Workloads Range Partition
• Ups
– Good for identity queries
• Isolates the query to a single “data” disk/block
• Must generally read another block for the range information, which hash partitioning does not require
– Indices can be large
• Ambiguous
– Range queries
• Good performance when queries access few items
– Isolates queries to one or a few disks
– Allows other queries to run in parallel on the other disks
• Bad when accessing lots of data items
– Can localize traffic to a few disks, creating a hot spot
• Really, the good outweighs the bad here
Handling Skew
• Attribute value skew: many tuples clustered around the same (or nearly the same) value
– Occurs in range partitioning
– Imagine a relation with 2 values of an attribute and k disks
• Only two of the disks will be used
• Partition skew: load imbalance even when there is no attribute skew
– O(log t) for t tuples in hash partitioning: not a problem
– Arises from a poorly constructed range vector
Constructing a Range Vector
• A balanced range-partitioning vector can be constructed by:
– Sorting existing tuples: incurs I/O costs, and does not keep the partitioning balanced as new inserts arrive
– Using a B-tree: limits the occupancy of tuples in disk blocks, which ultimately limits I/O performance
– Statistics (histograms): keep counts of values in buckets, but this has problems with AV skew within buckets and with estimation
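The statistics approach can be sketched by taking quantiles of a sample of attribute values; this is a simplified stand-in for a real histogram, which would also have to cope with AV skew inside buckets:

```python
def balanced_range_vector(sample, n_partitions):
    """Pick n_partitions - 1 split points at sample quantiles, so each
    partition receives roughly the same number of sampled values."""
    s = sorted(sample)
    return [s[(i * len(s)) // n_partitions] for i in range(1, n_partitions)]

# Illustrative sample of attribute values.
sample = list(range(1000))
vec = balanced_range_vector(sample, 4)
print(vec)  # [250, 500, 750]: one quarter of the sample per partition
```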
Virtual Processor Technique
• Create many virtual processors and map ranges to virtual processors
• Assign the virtual processors to real processors
– This mitigates skew, because each real processor handles many virtual processors, whose combined load is likely to be close to the mean
• Allows a system to use a “poor” range partition without suffering from skew
– Generally, DBs use histograms together with VPs
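A small simulation of the idea (the load distribution and the virtual/real counts are invented for illustration): highly skewed loads on many virtual partitions even out when dealt round-robin onto a few real processors:

```python
import random

N_VIRTUAL = 64   # many virtual processors (partitions)
N_REAL = 4       # few real processors

random.seed(0)
# Skewed per-virtual-partition loads (exponential: high variance).
vload = [random.expovariate(1.0) for _ in range(N_VIRTUAL)]

# Assign virtual processors to real processors round-robin.
rload = [0.0] * N_REAL
for v, load in enumerate(vload):
    rload[v % N_REAL] += load

# Each real processor's load is a sum over 16 virtual partitions, so it
# lands near the mean even though individual partition loads vary widely.
spread = max(rload) / min(rload)
print(round(spread, 2))
```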
Lessons Learned
• Parallelism is important, even for single machines
• Disk-based parallelism is the most important kind of parallelism
– I/O is the bottleneck in databases
– Not entirely true anymore: networking is starting to be the bottleneck in distributed TP applications
• Know thy data