CS 542 Database Management Systems Query Optimization J Singh March 28, 2011
Jun 14, 2015
2© J Singh, 2011 2
Outline
• Convert SQL query to a parse tree
– Semantic checking: attributes, relation names, types
• Convert to a logical query plan (relational algebra expression)
– deal with subqueries
• Improve the logical query plan
– use algebraic transformations
– group together certain operators
– evaluate logical plan based on estimated size of relations
• Convert to a physical query plan
– search the space of physical plans
– choose order of operations
– complete the physical query plan
Desired Endpoint
Example Physical Query Plans
• σ x=1 AND y=2 AND z<5 (R)
– Filter(x=1 AND z<5) applied to the output of IndexScan(R, y=2)
• R ⋈ S ⋈ U
– two-pass hash-join (101 buffers) of TableScan(R) and TableScan(S)
– materialize the result
– two-pass hash-join (101 buffers) of the materialized result with TableScan(U)
Physical Plan Selection
Governed by disk I/O, which in turn is governed by
• The particular operation being performed
• Size of intermediate results, as derived last week (sec 16.4 of book)
• Physical Operator Implementation used
– e.g., one- or two-pass
• Operation ordering
– esp. Join ordering
• Operation output: materialized or pipelined
Index-based physical plans (p1)
• Selection example. What is the cost of σ a=v (R), assuming
– B(R) = 2,000
– T(R) = 100,000
– V(R, a) = 20
• Table scan (assuming R is clustered):
– B(R) = 2,000 I/Os
• Index-based selection:
– If index is clustering: B(R) / V(R, a) = 100 I/Os
– If index is unclustered: T(R) / V(R, a) = 5,000 I/Os
• For small V(R, a), table scan can be faster than an unclustered index
– Heuristics that pick indexed over not-indexed can lead you astray
– Determine the cost of both methods and let the algorithm decide
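A minimal sketch of the comparison above, assuming the three cost formulas from the slide. The function name selection_costs and the dictionary keys are illustrative, not part of the lecture.

```python
def selection_costs(B, T, V):
    """Estimated disk I/Os for an equality selection on attribute a of R.

    B = B(R) blocks, T = T(R) tuples, V = V(R, a) distinct values.
    """
    return {
        "table_scan": B,             # read every block of a clustered R
        "clustering_index": B / V,   # matching tuples sit on adjacent blocks
        "unclustered_index": T / V,  # roughly one I/O per matching tuple
    }


costs = selection_costs(B=2000, T=100_000, V=20)
cheapest = min(costs, key=costs.get)  # let the numbers decide, not a heuristic
print(costs, cheapest)
```

With these statistics the unclustered index is the most expensive option, which is the point of the last bullet.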
Index-based physical plans (p2)
• Example: Join if S has an index on the join attribute
• For each tuple in R, fetch corresponding tuple(s) from S
• Assume R is clustered. Cost:
– If index on S is clustering: B(R) + T(R) × B(S) / V(S, a)
– If index on S is unclustered: B(R) + T(R) × T(S) / V(S, a)
• Another case: when R is output of another Iterator. Cost:
– B(R) is accounted for in the iterator
– If index on S is clustering: T(R) × B(S) / V(S, a)
– If index on S is unclustered: T(R) × T(S) / V(S, a)
– If S is not indexed but fits in memory: B(S)
– A number of other cases
Index-based physical plans (p3)
• Index Based Join if both R and S have a sorted index (B+ tree) on the join attribute
• Then perform a merge join – called zig-zag join
• Cost: B(R) + B(S)
Grand Summary of Physical Plans (p1)
• Scans and Selects
Index: N = None, C = Clustering, NC = Non-clustered

Op      Plan         Index  Passes  Cost
Scan    Table Scan   N/A    1       B(R)
Select  Select Scan  N      1       B(R)
Select  Index-based  C      1       B(R) / V(R, a)
Select  Index-based  NC     1       T(R) / V(R, a)
Grand Summary of Physical Plans (p2)
• Joins
Index: N = None, C = Clustering, NC = Non-clustered
Relation fits in memory: F = Yes, NF = No

Plan              R    S      Pass  Cost
Loop Join         N    N, F   1     B(R) + B(S)
Nested Loop Join  N    N, NF  k     k*B(R) + B(S), k = # Clusters
Sort Join         N    N      2     3*(B(R) + B(S))
Hash Join         N    N      2     3*(B(R) + B(S))
Index-based       C    N, NF  2     B(R) / V(R, a) + 3*B(S)
Index-based       NC   N, NF  2     T(R) / V(R, a) + 3*B(S)
Index-based       C    N, F   1     B(R) / V(R, a) + B(S)
Index-based       C    N, NF  1     B(R) T(S) / V(R, a) + B(S)
…                 …    …      …     …
Index-based       C    C      1     B(R) / V(R, a) + B(S) / V(S, a)
Physical plans at non-leaf Operators (p1)
• What if the input of the operator is from another operator?
• For Select, cost = 0.
– Cost of pipelining is assumed to be zero
– The number of tuples emitted is reduced
• For Join, when R is from an operator and S from a table:
– B(R) is accounted for in the iterator
– If index on S is clustering: T(R) × B(S) / V(S, a)
– If index on S is unclustered: T(R) × T(S) / V(S, a)
– If S is not indexed but fits in memory: B(S)
– If S is not indexed and doesn’t fit: k*B(S) for k chunks
– If S is not indexed and doesn’t fit: 3*B(S) for sort- or hash-join
Physical plans at non-leaf Operators (p2)
• For Join, when R and S are both from operators, cost depends on whether the results are sorted by the Join attribute(s)
– If yes, we use the zig-zag algorithm and the cost is zero. Why?
– If either relation will fit in memory, the cost is zero. Why?
– At most, the cost is 2*(B(R) + B(S)). Why?
Example (787)
Product(pname, maker), Company(cname, city)

SELECT Product.pname
FROM Product, Company
WHERE Product.maker = Company.cname AND Company.city = 'Seattle'

• How do we execute this query?
• Logical Plan
– σ city='Seattle' applied to Company(cname, city)
– Join of Product(pname, maker) with the selected Company tuples on maker = cname
• Clustering Indices: Product.pname, Company.cname
• Unclustered Indices: Product.maker, Company.city
Example (787) Physical Plans
• Physical Plan 1
– Index-based selection: σ city='Seattle' on Company(cname, city)
– Index-based join with Product(pname, maker) on cname = maker
• Physical Plans 2a and 2b
– Index-scan of Company(cname, city), with σ city='Seattle'
– Product(pname, maker): scan and sort (2a) or index scan (2b)
– Merge-join on maker = cname
Evaluate (787) Physical Plans
• Physical Plan 1
– Tuples: T(σ city='Seattle' (Company)) = T(Company) / V(Company, city)
– Cost: T(σ city='Seattle' (Company)) × T(Product) / V(Product, maker)
– or, simplifying: T(Company) / V(Company, city) × T(Product) / V(Product, maker)
• Physical Plans 2a and 2b, Total Cost:
– 2a: 3B(Product) + B(Company)
– 2b: T(Product) + B(Company)
Final Evaluation
• Plan Costs:
– Plan 1: T(Company) / V(Company, city) × T(Product) / V(Product, maker)
– Plan 2a: B(Company) + 3B(Product)
– Plan 2b: B(Company) + T(Product)
• Which is better?
– It depends on the data
Example (787) Evaluation Results
• Common assumptions:
– T(Company) = 5,000, B(Company) = 500, M = 100
– T(Product) = 100,000, B(Product) = 1,000
– Assume V(Product, maker) ≈ T(Company)
• Case 1: V(Company, city) ≈ T(Company)
– V(Company, city) = 5,000
– Plan 1: 1 × 20 = 20
– Plan 2a: 3,500
– Plan 2b: 100,500
• Case 2: V(Company, city) << T(Company)
– V(Company, city) = 20
– Plan 1: 250 × 20 = 5,000
– Plan 2a: 3,500
– Plan 2b: 100,500
• Reference from previous page:
– Plan 1: T(Company) / V(Company, city) × T(Product) / V(Product, maker)
– Plan 2a: B(Company) + 3B(Product)
– Plan 2b: B(Company) + T(Product)
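The two cases can be reproduced with a few lines of arithmetic. This is a sketch under the slide's assumptions; the function name plan_costs is illustrative, and V(Product, maker) is set to exactly 5,000 where the slide only says it is approximately T(Company).

```python
def plan_costs(T_company, B_company, T_product, B_product, V_city, V_maker):
    """Estimated I/O cost of the three physical plans for the Product/Company query."""
    return {
        "plan_1": (T_company / V_city) * (T_product / V_maker),
        "plan_2a": B_company + 3 * B_product,
        "plan_2b": B_company + T_product,
    }


# Common assumptions from the slide; V_maker = 5000 is an assumed exact value.
common = dict(T_company=5000, B_company=500,
              T_product=100_000, B_product=1000, V_maker=5000)

case_1 = plan_costs(V_city=5000, **common)  # many distinct cities
case_2 = plan_costs(V_city=20, **common)    # few distinct cities

print(min(case_1, key=case_1.get), case_1)
print(min(case_2, key=case_2.get), case_2)
```

Only the statistic V(Company, city) changes between the two calls, yet the cheapest plan switches from Plan 1 to Plan 2a.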
Lessons
• Need to consider several physical plans
– even for one, simple logical plan
• No magic “best” plan: depends on the data
• In order to make the right choice
– need to have statistics over the data
– the B’s, the T’s, the V’s
Query Optimization
• Have a SQL query Q
• Create a plan P
• Find equivalent plans P = P’ = P’’ = …
• Choose the “cheapest”. HOW ??
Logical Query Plan
SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer = Q.name AND Q.city = 'seattle' AND Q.phone > '5430000'

• Plan (bottom to top)
– Join Purchase and Person on buyer = name
– σ city='seattle' AND phone>'5430000'
– π buyer
• In class: find a “better” plan P’
CS 542 Database Management Systems Query Optimization – Choosing the Order of Operations
J Singh March 28, 2011
Join Trees
• Recall that the following are equivalent:
– R ⋈ S ⋈ U
– R ⋈ (S ⋈ U)
– (R ⋈ S) ⋈ U
– S ⋈ (R ⋈ U)
• But they are not equivalent from an execution viewpoint.
– Considerable research has gone into picking the best order for Joins
Join Trees
• R1 ⋈ R2 ⋈ … ⋈ Rn
• Join tree: [figure: a join tree with leaves R3, R1, R2, R4]
• Definitions
– A plan = a join tree
– A partial plan = a subtree of a join tree
Left & Right Join Arguments
• The argument relations in joins determine the cost of the join
• In Physical Query Plans, the left argument of the join is
– Called the build relation
– Assumed to be smaller
– Stored in main-memory
• The right argument of the join is
– Called the probe relation
– Read a block at a time
– Its tuples are matched with those of build relation
• The join algorithms which distinguish between the arguments are:
– One-pass join
– Nested-loop join
– Index join
Types of Join Trees
• Left deep [figure: left-deep join tree over R1 … R5]
• Bushy [figure: bushy join tree over R1 … R5]
• Right deep [figure: right-deep join tree over R1 … R5]
Many different orders, very important to pick the right one
Optimization Algorithms
• Heuristic based
• Cost based
– Dynamic programming: System R
– Rule-based optimizations: DB2, SQL-Server
Dynamic Programming
• Problem Statement
– Given: a query R1 ⋈ R2 ⋈ … ⋈ Rn
– Assume we have a function cost() that gives us the cost of a join tree
– Find the best join tree for the query
• Idea: for each subset of {R1, …, Rn}, compute the best plan for that subset
• Algorithm: In increasing order of set cardinality, compute the cost for
– Step 1: for {R1}, {R2}, …, {Rn}
– Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn}
– …
– Step n: for {R1, …, Rn}
• It is a bottom-up strategy
• Skipping further details of the algorithm
– Read from book if interested
– Will not be on the exam
Dynamic Programming Algorithm
• When computing R1 ⋈ R2 ⋈ … ⋈ Rn,
• Best Plan (R1 ⋈ R2 ⋈ … ⋈ Rn) = min cost plan of
– Best Plan (R2 ⋈ R3 ⋈ … ⋈ Rn) ⋈ R1
– Best Plan (R1 ⋈ R3 ⋈ … ⋈ Rn) ⋈ R2
– …
– Best Plan (R1 ⋈ R2 ⋈ … ⋈ Rn-1) ⋈ Rn
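The recurrence above can be sketched as a bottom-up table over subsets. The slides leave cost() abstract, so this sketch assumes one possible cost model: the sum of the sizes of intermediate results, with sizes estimated from per-pair join selectivities. The function name best_left_deep_plan, the selectivity dictionary, and the example numbers are all illustrative assumptions.

```python
from itertools import combinations


def best_left_deep_plan(sizes, selectivity):
    """Return (cost, result_size, join_order) for the cheapest left-deep plan.

    sizes:       {relation name: tuple count}
    selectivity: {frozenset({a, b}): fraction of the cross product kept};
                 a missing pair means selectivity 1.0 (Cartesian product).
    Assumed cost model: sum of the sizes of all intermediate results.
    """
    names = sorted(sizes)
    # Step 1: single relations cost nothing to "join".
    best = {frozenset([r]): (0, sizes[r], (r,)) for r in names}
    # Steps 2..n: subsets in increasing order of cardinality.
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            subset = frozenset(combo)
            candidates = []
            for r in combo:                     # r is joined last
                rest = subset - {r}
                cost, size, order = best[rest]  # Best Plan(rest), already known
                sel = 1.0
                for other in rest:
                    sel *= selectivity.get(frozenset([r, other]), 1.0)
                # Best Plan(rest) is an intermediate result only if it is a join.
                new_cost = cost + (size if len(rest) > 1 else 0)
                candidates.append((new_cost, size * sizes[r] * sel, order + (r,)))
            best[subset] = min(candidates)
    return best[frozenset(names)]


sizes = {"R": 1000, "S": 100, "T": 10}
selectivity = {frozenset(["R", "S"]): 0.01, frozenset(["S", "T"]): 0.1}
cost, size, order = best_left_deep_plan(sizes, selectivity)
print(cost, order)
```

In this example the plan joins S and T first and adds R last, because S ⋈ T is the smallest intermediate result under the assumed statistics.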
Reducing the Search Space
• Left-deep trees vs Bushy trees
– Combinatoric explosion of the number of possible trees
• Computing the cost of all possible trees is not feasible
– For a 6-way Join, we can have
  • More than 30,000 bushy trees
  • 6!, or 720 left-deep trees
– Left-deep trees leave their result in memory, making it possible to pipeline efficiently
• Trees without Cartesian product
– Example: R(A,B) ⋈ S(B,C) ⋈ T(C,D)
– Plan: (R(A,B) ⋈ T(C,D)) ⋈ S(B,C) has a Cartesian product
– Most query optimizers will not consider it
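The counts quoted for the 6-way join can be checked directly. This sketch assumes the bushy count means every binary tree shape with n leaves (a Catalan number) combined with every assignment of relations to leaves; the function names are illustrative.

```python
from math import comb, factorial


def left_deep_trees(n):
    """One left-deep shape, n! ways to order the relations."""
    return factorial(n)


def all_join_trees(n):
    """Catalan(n-1) binary tree shapes with n leaves, times n! leaf labelings."""
    shapes = comb(2 * (n - 1), n - 1) // n
    return shapes * factorial(n)


for n in (3, 6, 10):
    print(n, left_deep_trees(n), all_join_trees(n))
```

For n = 6 there are 42 shapes, so 42 × 720 = 30,240 trees in total, against 720 left-deep ones.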
Outline
• Convert SQL query to a parse tree
– Semantic checking: attributes, relation names, types
• Convert to a logical query plan (relational algebra expression)
– deal with subqueries
• Improve the logical query plan
– use algebraic transformations
– group together certain operators
– evaluate logical plan based on estimated size of relations
• Convert to a physical query plan
– search the space of physical plans
– choose order of operations
– complete the physical query plan
• Three topics
– Choosing the physical implementations (e.g., select and join methods)
– Decisions regarding materialized vs pipelined
– Notation for physical query plans
Choosing a Selection Method
• Algorithm for each selection operator
1. Can we use an index created on an attribute?
– If yes, index-scan. (Otherwise table-scan)
2. After retrieving all condition-satisfied tuples in (1), filter them with the remaining selection conditions
• In other words,
– When computing σ c1 AND c2 AND … AND cn (R), we index-scan on ci, then filter the result on all other cj, where j ≠ i.
– The next 2 pages show an example where we examine several options and pick the best one
Selection Method Example (p1)
• Selection: σ x=1 AND y=2 AND z<5 (R)
– Where parameters of R are: T(R) = 5,000, B(R) = 200, V(R, x) = 100, V(R, y) = 500
– Relation R is clustered
– x and y have non-clustering indices
– z has a clustering index
Selection Method Example (p2)
Selection options:
1. Table-scan, then filter x, y, z. Cost is B(R) = 200 since R is clustered.
2. Use index on x = 1, then filter on y, z. Cost is 50 since T(R) / V(R, x) is (5000/100) = 50 tuples and the index on x is not clustering.
3. Use index on y = 2, then filter on x, z. Cost is 10 since T(R) / V(R, y) is (5000/500) = 10 tuples and the index on y is not clustering.
4. Index-scan on clustering index with z < 5, then filter x, y. Cost is about B(R)/3 = 67.
Therefore:
• First retrieve all tuples with y = 2 (option 3)
• Then filter for x and z
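The four options can be compared mechanically. A small sketch, assuming the slide's estimates; the option labels are illustrative, and the B(R)/3 figure for z < 5 is the slide's own rough estimate for an inequality.

```python
T, B = 5000, 200            # T(R), B(R)
V = {"x": 100, "y": 500}    # V(R, x), V(R, y)

options = {
    "1: table-scan, filter x, y, z": B,
    "2: index on x, filter y, z": T / V["x"],   # non-clustering index
    "3: index on y, filter x, z": T / V["y"],   # non-clustering index
    "4: clustering index on z, filter x, y": B / 3,
}

for name, cost in sorted(options.items(), key=lambda item: item[1]):
    print(f"{cost:6.1f}  {name}")

best = min(options, key=options.get)
```

The cheapest entry is option 3, matching the conclusion on the slide.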
Pipelining Versus Materialization
• Materialization
– Store the (intermediate) result of each operation on disk
• Pipelining
– Interleave the execution of several operations; the tuples produced by one operation are passed directly to the operation that uses them
– Store the (intermediate) result of each operation in a buffer, which is kept in main memory
• Prefer Pipelining where possible
– Sometimes not possible, as the following example shows
• Next few pages, a fully worked-out example
R⋈S⋈U Example (p1)
• Consider physical query plan for the expression (R(w, x) ⋈ S(x, y)) ⋈ U(y, z)
• Assumption
– R occupies 5,000 blocks, S and U each 10,000 blocks.
– The intermediate result R ⋈ S occupies k blocks for some k.
– Both joins will be implemented as hash-joins, either one-pass or two-pass depending on k
– There are 101 buffers available.
R⋈S⋈U Example (p2)
• When joining R ⋈ S, neither relation fits in buffers
• Need two-pass hash-join to partition R
– How many hash buckets for R?
– 100 at most
• The 2nd pass hash-join uses 51 buffers, leaving 50 buffers for joining result of R ⋈ S with U.
– Why 51?
R⋈S⋈U Example (p3)
• Case 1: Suppose k ≤ 49, so the result of R ⋈ S occupies at most 49 blocks.
• Steps
1. Pipeline in R ⋈ S into 49 buffers
2. Organize them for lookup as a hash table
3. Use one buffer left to read each block of U in turn
4. Execute the second join as one-pass join.
• The total number of I/O’s is 55,000
– 45,000 for two-pass hash join of R and S
– 10,000 to read U for one-pass hash join of (R ⋈ S) ⋈ U.
R⋈S⋈U Example (p4)
• Case 2: Suppose k > 49 but < 5,000. We can still pipeline, but need another strategy where intermediate results join with U in a 50-bucket, two-pass hash-join. Steps are:
1. Before starting on R ⋈ S, hash U into 50 buckets of 200 blocks each.
2. Perform the two-pass hash join of R and S using 51 buffers as in case 1, placing the results in the 50 remaining buffers to form 50 buckets for the join of R ⋈ S with U.
3. Finally, join R ⋈ S with U bucket by bucket.
• The number of disk I/O’s is:
– 20,000 to read U and write its tuples into buckets
– 45,000 for two-pass hash-join R ⋈ S
– k to write out the buckets of R ⋈ S
– k + 10,000 to read the buckets of R ⋈ S and U in the final join
• The total cost is 75,000 + 2k.
R⋈S⋈U Example (p5)
• Case 3: Suppose k > 5,000. We cannot perform a two-pass join in the 50 buffers available if the result of R ⋈ S is pipelined. We are forced to materialize the relation R ⋈ S.
• The number of disk I/O’s is:
– 45,000 for two-pass hash-join of R and S
– k to store R ⋈ S on disk
– 30,000 + 3k for two-pass join of U with R ⋈ S
• The total cost is 75,000 + 4k.
R⋈S⋈U Example (p6)
• In summary, costs of physical plan as function of R ⋈ S size:
– k ≤ 49: 55,000
– 49 < k < 5,000: 75,000 + 2k
– k > 5,000: 75,000 + 4k
• Pause and Reflect
– It’s all about the expected size of the intermediate result R ⋈ S
– What would have happened if
  • We guessed 45 but had 55? Guessed 55 but only had 45?
  • Guessed 4,500 but had 5,500? Guessed 5,500 but only had 4,500?
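The three cases from the worked example can be collected into one function of k, which makes the "Pause and Reflect" questions easy to explore. This is a sketch under the example's block counts; the function name total_io is illustrative, and treating k = 5,000 exactly as case 2 is an assumption, since the slides leave that boundary open.

```python
B_R, B_S, B_U = 5000, 10_000, 10_000


def total_io(k):
    """Disk I/Os for (R join S) join U when R join S occupies k blocks."""
    join_rs = 3 * (B_R + B_S)               # two-pass hash-join of R and S
    if k <= 49:                             # case 1: pipeline, one-pass second join
        return join_rs + B_U
    if k <= 5000:                           # case 2 (k = 5000 assumed to belong here)
        return join_rs + 2 * B_U + k + (k + B_U)
    return join_rs + k + (3 * B_U + 3 * k)  # case 3: materialize R join S


for k in (45, 55, 4500, 5500):
    print(k, total_io(k))
```

Going from k = 45 to k = 55 adds ten blocks to the intermediate result but about 20,000 I/Os to the plan, so a small estimation error near a boundary is costly.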
Notation for Physical Query Plans
• Several types of operators:
1. Operators for leaves
2. (Physical) operators for Selection
3. (Physical) Sort Operators
4. Other Relational-Algebra Operations
• In practice, each DBMS uses its own internal notation for physical query plans
PQP Notation
• Leaves: Replace a leaf in an LQP by
– TableScan(R): Read all blocks
– SortScan(R, L): Read in order according to L
– IndexScan(R, C): Scan R using the index on attribute A, retrieving tuples that satisfy condition C on A
– IndexScan(R, A): Scan R using the index on attribute A
• Selects: Replace a Select in an LQP by one of the leaf operators plus:
– Filter(D) for condition D
• Sorts: Replace a leaf-level sort as shown above. For other operations,
– Sort(L): Sort a relation that is not stored
• Other Operators: Operation- and algorithm-specific (e.g., Hash-Join)
– Also need to specify # passes, buffer sizes, etc.
We have Arrived at the Desired Endpoint
Example Physical Query Plans
• σ x=1 AND y=2 AND z<5 (R)
– Filter(x=1 AND z<5) applied to the output of IndexScan(R, y=2)
• R ⋈ S ⋈ U
– two-pass hash-join (101 buffers) of TableScan(R) and TableScan(S)
– materialize the result
– two-pass hash-join (101 buffers) of the materialized result with TableScan(U)
Optimization Issues and Proposals
• The “fuzz” in estimation of sizes
– Parametric Query Optimization
  • Specify alternatives to the execution engine so it may respond to conditions at runtime
– Multiple-query optimization
  • Take concurrent execution of several queries into account
• Combinatoric explosion of options when doing an n-way Join
– Becomes really expensive around n > 15
• Alternative optimizations have been proposed for special situations, but no general framework
– Rule-based optimizers
– Randomized plan generation
CS 542 Database Management Systems Distributed Query Execution
Source: Carsten Binnig, Univ of Zurich, 2006
J Singh March 28, 2011
Motivation
• Algorithms based on Semi-Joins have been proposed as techniques for query optimization
– They shine in Distributed and Parallel Databases
– Good opportunity to explore them in that context
• Semi-join by example:
• Semi-join formal definition:
Distributed / Parallel Join Processing
• Scenario: [figure: Node 1 holds Table A, Node 2 holds Table B]
• How to compute A ⋈ B?
– Table A resides on Node 1
– Table B resides on Node 2
Naïve approach (1)
• Idea: Use standard join and fetch table page-wise from remote node if necessary (send- and receive-operators)
• Example:
– Join is executed on node 2 using a Nested-Loop-Join
– Outer loop: Request page of table A from node 1 (remote)
– Inner loop: For each page iterate over table B and produce output
=> Random access of pages on node 1 (due to network delay)
[figure: node 2 requests page A1 of table A; node 1 sends it]
• Idea: Ship one table completely to the other node
• Example:
– Ship complete table A from node 1 to node 2
– Join table A and B locally on node 2
=> Avoid random page access on node 1
[figure: table A is shipped from node 1 to node 2]
Naïve Approach: Implications
• Problems:
– High cost for shipping data
– Network cost roughly the same as I/O cost for a hard disk (or even worse because of unpredictability of network delay)
– Shipping A roughly equivalent to a full table scan
• (Trivial) Optimizations:
– Always ship the smaller table to the other side
– If query contains a selection, apply selection before sending A
– Note: bigger table may become the smaller table (after selection)
Semi-join Approach (p1)
• Idea: Before shipping a table, reduce the data that is shipped to only those tuples that are relevant for the join
• Example: Join on A.id = B.id, where table A should be shipped to node 2
Semi-join Approach (p2)
• (1) Compute projection B.id of table B on node 2
• (2) Ship column B.id to node 1
Semi-join Approach (p3)
• (3) Execute semi-join of B.id and table A on A.id=B.id (to select only relevant tuples of table A => table A’)
• (4) Send result of semi-join (table A’) to node 2
Semi-join Approach (p4)
• (5) Join the shipped table A’ locally on node 2 with table B
• => Optimization of this approach: If node 1 holds a join index (e.g., type 1 with A.id -> {B.RID}) we can start with step (3)
Semi-join Approach Discussion
• This strategy works well if semi-join reduces size of the table that needs to be shipped
– Assume all rows of Table A are needed anyway => none of the rows of table A can be discarded
– Then this approach is more costly than shipping the entire table A in the first place!
• Consequence:
– Need to decide whether this method makes sense based on semi-join selectivity
– => Cost-based optimization must decide this
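Steps (1) through (5) of the semi-join approach can be simulated in a single process. This is only a sketch: tables are lists of dictionaries, "shipping" is just passing a value, and the function name semi_join_ship and the sample rows are illustrative assumptions.

```python
def semi_join_ship(table_a, table_b, key="id"):
    """Simulate the semi-join strategy; returns (A', A' joined with B)."""
    # (1)-(2) Node 2 projects B.id and ships the key column to node 1.
    b_keys = {row[key] for row in table_b}
    # (3) Node 1 keeps only the rows of A that have a partner in B (table A').
    a_reduced = [row for row in table_a if row[key] in b_keys]
    # (4) A' is shipped to node 2; (5) node 2 joins A' with B locally.
    b_by_key = {}
    for row in table_b:
        b_by_key.setdefault(row[key], []).append(row)
    joined = [{**a, **b} for a in a_reduced for b in b_by_key[a[key]]]
    return a_reduced, joined


table_a = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 3, "a": "z"}]
table_b = [{"id": 2, "b": "p"}, {"id": 2, "b": "q"}, {"id": 4, "b": "r"}]
a_reduced, joined = semi_join_ship(table_a, table_b)
print(len(table_a), "rows in A,", len(a_reduced), "rows shipped")
```

Here only one of three rows of A crosses the network. If every row of A had a partner in B, the key column would have been shipped for nothing, which is the selectivity concern raised above.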
Bloom-join Approach (p1)
• Algorithm same as semi-join approach
– Ship a bloom-filter instead of (foreign) key column
– Use bloom-filter technique to compress data
• Goal: only send a small bit list (as bit-vector) instead of all keys of the column, to reduce network I/O
• Problems:
– A superset of tuples that might join will be sent back (same problem as in bloom-filters for bitmap-indexes)
– => More tuples must be sent over network and thus net gain depends on good hash function
Bloom-join Approach (p2)
• (1) Compute bloom filter BL of size n for column B.id of table B on node 2 with n << |B.id| (e.g., by B.id % n)
• (2) Ship bloom filter BL to node 1
Bloom-join Approach (p3)
• (3) Probe bloom filter BL with tuples from table A to get a superset of possible join candidates (=> table A’)
• (4) Send result (table A’) to node 2 (table A’ might contain join candidates that do not have a partner in table B)
• (5) Join the shipped table A’ locally on node 2 with table B
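The bloom-join steps can be simulated the same way. This sketch assumes integer keys and the single hash function id % n suggested in step (1); real Bloom filters normally use several hash functions. The function name bloom_join and the sample data are illustrative.

```python
def bloom_join(table_a, table_b, n, key="id"):
    """Simulate the bloom-join strategy; returns (A' candidates, join result)."""
    # (1) Node 2 builds an n-bit filter BL over B.id using the hash id % n.
    bits = [False] * n
    for row in table_b:
        bits[row[key] % n] = True
    # (2) BL is shipped to node 1; (3) node 1 probes it with each row of A.
    candidates = [row for row in table_a if bits[row[key] % n]]
    # (4) A' is shipped to node 2; (5) the local join drops the false positives.
    b_by_key = {}
    for row in table_b:
        b_by_key.setdefault(row[key], []).append(row)
    joined = [{**a, **b} for a in candidates for b in b_by_key.get(a[key], [])]
    return candidates, joined


table_a = [{"id": i} for i in (1, 2, 3, 6, 8)]
table_b = [{"id": 2, "b": "p"}, {"id": 4, "b": "q"}]
candidates, joined = bloom_join(table_a, table_b, n=4)
print([row["id"] for row in candidates], [row["id"] for row in joined])
```

With n = 4, the ids 6 and 8 collide with 2 and 4, so they are shipped as false positives and only eliminated by the final join on node 2.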
Bloom-join Approach Discussion
• Communication cost much reduced
• But have to deal with false positives
• Widely used in NoSQL databases
Project Rubric
Each category has weight 10 and is scored 5, 3, 2, or 0.
• References and Sources
– 5: Used a variety of references, translated into own words, properly noted, to provide solid information to project
– 3: Used some references, but did not fully relate to topic
– 2: References were sparse, and verbatim
– 0: Little or no references
• Originality
– 5: Topic well researched beyond depth covered in class
– 3: Some originality, but much repetition of class coverage
– 2: Little originality, and info was not in own words
– 0: Not covered beyond level discussed in class and readily available with a Google Search
• Relevance to this course
– 5: Strong positive connection
– 3: Some relevance, but too much fluff
– 2: Little science, mostly fluff
– 0: All fluff
• Presentation Quality (Q&A well handled, within time limits)
– 5: Organized, easy to follow, good delivery, kept to time limit
– 3: Too cluttered, hard to understand
– 2: Too sparse, not enough depth
– 0: Confusing, partial or little relevance, answers not responsive to questions
• Other (Demo, Graphics, Written Report)
– 5: Relevant and understandable
– 3: Too complicated
– 2: Too basic
– 0: Does not relate well and poorly explained