Top Banner
Multi pass algorithms
31

Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Jan 17, 2016

Download

Documents

Isabella Baker
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Multi pass algorithms

Page 2: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Nested-Loop joins• Tuple-Based Nested-loop Join Algorithm:

FOR each tuple s in S DO

FOR each tuple r in R DO

IF r and s join to make a tuple t THEN

output t

What’s the complexity?

Page 3: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Block-based nested loops• Assume B(S) ≤ B(R), and B(S) > M • Read M-1 blocks of S into main memory and compare to

all of R, block by block

FOR each chunk of M-1 blocks of S DO

FOR each block b of R DO

FOR each tuple t of b DO

find the tuples of S in memory that join with t

output the join of t with each of these tuples

Page 4: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Example• B(R) = 1000, B(S) = 500, M = 101

• Outer loop iterates 5 times• At each iteration we read M-1 (i.e. 100) blocks of S and all

of R (i.e. 1000) blocks.• Total time: 5*(100 + 1000) = 5500 I/O’s

• Question: What if we reversed the roles of R and S?• We would iterate 10 times, and in each we would read

100+500 blocks, for a total of 10*(100+500) = 6000 I/O’s.

• Compare with one-pass join, if it could be done!• We would need 1500 disk I/O’s if B(S) M-1

Page 5: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Analysis of blocks nested loops• Number of disk I/O’s:

[B(S)/(M-1)] * (M-1 + B(R))

or

B(S) + B(S)B(R)/(M-1)

or approximately B(S)*B(R)/M

Page 6: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Two-pass algorithms based on sorting• This special case of multi-pass algorithms is sufficient for

most of the relation sizes.

Main idea for unary operations on R • Suppose B(R) M (main memory size in blocks)

• First pass: – Read M blocks of R into Main Memory– Sort the content of Main Memory– Write the sorted result (sublist/run) into M blocks on disk.

• Second pass: create final result

Page 7: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Duplicate elimination using sorting• In the second phase (merging) we don’t sort but copy each

tuple just once.

• We can do that because the identical tuples will show up “at the same time,” i.e. they will be all the first ones at the buffers (for the sorted sublists).

• As usual, if one buffer gets empty we refill it.

Page 8: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Duplicate-Elimination using Sorting Example• Assume M=3, each buffer holds 2 records and relation R

consists of the following 17 tuples:

2, 5, 2, 1, 2, 2, 4, 5, 4, 3, 4, 2, 1, 5, 2, 1, 3

• After the first pass the following sorted sub-lists are created:

1, 2, 2, 2, 2, 5

2, 3, 4, 4, 4, 5

1, 1, 2, 3, 5

• In the second pass we dedicate a memory buffer to each sub-list.

Page 9: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Example (Cont’d)

Page 10: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Example (Cont’d)

Page 11: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Example (Cont’d)

Page 12: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Analysis of (R) • 2B(R) to create sorted sublists, B(R) to read each sublist in

phase 2. Total: 3B(R)

• How large can R be?– There can be no more than M sublists since we need one

buffer for each one. So, B(R)/M ≤ M, (B(R)/M is the number of sublists)

i.e. B(R) ≤ M2

• To compute (R) we need at least sqrt(B(R)) blocks of Main Memory.

Page 13: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Sort-based , , -Example: set union.

• Analysis: 3(B(R) + B(S)) disk I/O’s• Condition: B(R) + B(S) ≤ M2

• Similar algorithms for sort based intersection and difference (bag or set versions).

• Create sorted sublists of R and S• Use input buffers for sorted sublists of R and S, one buffer

per sublist.• Output each tuple once.

- We can do that since all the identical tuples appear “at the same time.”

Page 14: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Join• A problem for joins but not for the previous operators: The

number of joining tuples from the two relations can exceed what fits in memory.

• A solution? • Maximize the number of output buffers.• Minimize the number of sorted sublists (since we need a

buffer for each one of them).

Page 15: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Simple sort-based join

• For R(X,Y) S(Y,Z) with M buffers of memory:• Completely: sort R on Y, sort S on Y

Merge phase• Use 2 input buffers: 1 for R, 1 for S.• Pick tuple t with smallest Y value in the buffer for R• If t doesn’t match with the first tuple in the buffer for S, then

just remove t.• Otherwise, read all the tuples from R with the same Y value

as t and put them in the M-2 part of the memory. • When the input buffer for R is exhausted fill it again and

again.

• Then, read the tuples of S that match. For each one we produce the join of it with all the tuples of R in the M-2 part of the memory.

Page 16: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Example of sort join• B(R) = 1000, B(S) = 500, M= 101• To sort R, we need 4*B(R) I/O’s, same for S.

– Number of I/O’s = 4*(B(R) + B(S))• Doing the join in the merge phase:

– Number of I/O’s = B(R) + B(S) • Total disk I/O’s = 5*(B(R) + B(S)) = 7500

• Memory Requirement? • To be able to do the sort, we should have B(R) ≤ M2 and

B(S) ≤ M2

• Recall: for nested-loop join, we needed 5500 disk I/O’s, but the memory requirement was quadratic (it is linear, here), i.e., nested-loop join is not good for joining relations that are much larger than MM.

Page 17: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Potential problem ...R(X , Y)----------- x1 a x2 a … xn a

S(Y, Z)--------- a z1

a z2

... a zm

What if Size of n+1 tuples > M-1 andSize of m+1 tuples> M-1?

• If the tuples from R (or S) with the same value y of Y do not fit in M-1 buffers, then we use all M-1 buffers to do a nested-loop join on the tuples with Y-value y from both relations.

• Observe that we can “smoothly” continue with the nested loop join when we see that the R tuples with Y-value y do not fit in M-1 buffers.

Page 18: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

• Do we really need the fully sorted files?• Suppose we are not worried about many common Y values

Can We Improve on Sort Join?

R

S

Join?

sorted runs

Page 19: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

A more efficient sort-based join• Suppose we are not worried about many common Y values

• Create Y-sorted sublists of R and S• Bring first block of each sublist into a buffer (assuming we

have at most M sublists)• Find smallest Y-value from heads of buffers. Join with other

tuples in heads of buffers, use other possible buffers, if there are “many” tuples with the same Y values.

• Disk I/O: 3*(B(R) + B(S))• Requirement: B(R) + B(S) ≤ M2

Page 20: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Example• B(R) = 1000, B(S) = 500, M= 101

• Total of 15 sorted sublists• If too many tuples join on a value y, use the remaining 86

MM buffers for a one pass join on y

• Total cost: 3*(1000 + 500) = 4500 disk I/O’s• M2 =10201 > B(R) + B(S), so the requirement is satisfied

Page 21: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Summary of sort-based algorithms

Operators Approx. M required Disk I/O

, Sqrt( B ) 3B

, , - Sqrt( B(R) + B(S) ) 3(B(R)+B(S))

Sqrt( max( B(R),B(S) ) ) 5(B(R)+B(S))

Sqrt( B(R)+B(S) ) 3(B(R)+B(S))

Page 22: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Two-pass algorithms based on hashing

Main idea: • Instead of sorted sublists, create partitions, based on

hashing.• Second pass creates result from partitions using one pass

algorithms.

Page 23: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Creating partitions• Partitions (buckets) are created based on all attributes of the relation

except for grouping and join, where the partitions are based on the grouping and join-attributes respectively.

• Why bucketize? Tuples with “matching” values end up in the same bucket.

Initialize M-1 buckets using M-1 empty buffers;FOR each block b of relation R DO

read block b into the M-th buffer; FOR each tuple t in b DO

IF the buffer for bucket h(t) has no room for t THENcopy the buffer to disk;initialize a new empty block in that buffer;

copy t to the buffer for bucket h(t);ENDIF;

ENDFOR;FOR each bucket DO

IF the buffer for this bucket is not empty THENwrite the buffer to disk;

Page 24: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Hash-based duplicate elimination• Pass 1: create partitions by hashing on all attributes• Pass 2: for each partition, use the one-pass method for

duplicate elimination

• Cost: 3B(R) disk I/O’s

• Requirement: B(R) ≤ M*(M-1)

(B(R)/(M-1) is the approximate size of one bucket)

i.e. the req. is approximately B(R) ≤ M2

Page 25: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Hash-based grouping and aggregation

• Pass 1: create partitions by hashing on grouping attributes• Pass 2: for each partition, use one-pass method.

• Cost: 3B(R), Requirement: B(R) ≤ M2

• More exactly the requirement is:

MM

RB L )1(

))(((

Page 26: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Hash-based set union• Pass 1: create partitions R1,…,RM-1 of R, and S1,…,SM-1 of S

(with the same hash function)• Pass 2: for each pair Ri, Si compute Ri Si using the one-

pass method.

• Cost: 3(B(R) + B(S))• Requirement: min(B(R),B(S)) ≤ M2

• Similar algorithms for intersection and difference (set and bag versions)

Page 27: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Partition hash-join

• Pass 1: create partitions R1, ..,RM-1 of R, and S1, ..,SM-1 of S, based on the join attributes (the same hash function for both R and S)

• Pass 2: for each pair Ri, Si compute Ri Si using the one-pass method.

Page 28: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

• B(R) = 1000 blocks• B(S) = 500 blocks• Memory available = 101 blocks• R S on common attribute C• Use 100 buckets

– Read R– Hash– Write buckets

Example

...

...

10 blocks

100R

Same for S

Page 29: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

• Read one R bucket• Build memory hash table• Read corresponding S bucket block by block.

RS

...

R

Memory...

Cost• “Bucketize:”

– Read + write R– Read + write S

• Join– Read R– Read S

Total cost = 3*[1000+500] = 4500

In general:

Cost: 3(B(R) + B(S))

Req.: min(B(R),B(S)) ≤ M2

Page 30: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Summary of hash-based methods

Operators Approx. M required

Disk I/O

, Sqrt(B) 3B

, , - Sqrt(B(S)) 3(B(R)+B(S))

Sqrt(B(S)) 3(B(R)+B(S))

Page 31: Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.

Sort vs. Hash based algorithms• Hash-based algorithms have a size requirement that

depends only on the smaller of the two arguments rather than on the sum of the argument sizes, as for sort-based algorithms.

• Sort-based algorithms allow us to produce the result in sorted order and take advantage of that sort later. The result can be used in another sort-based algorithm later.

• Hash-based algorithms depend on the buckets being of nearly equal size. Well, what about a join with a very few values for the join attribute…