Efficient Query Optimization for Distributed Join in Database Federation A Master’s Thesis Proposal by Di Wang Advisor: Prof. Murali Mani Dec 4, 2008


Feb 23, 2016

Page 1: Efficient Query Optimization for Distributed Join in Database Federation

Efficient Query Optimization for Distributed Join

in Database Federation

A Master’s Thesis Proposal by

Di Wang 

Advisor: Prof. Murali Mani

Dec 4, 2008

Page 2: Efficient Query Optimization for Distributed Join in Database Federation

Outline

Introduction – Query Optimization in Database Federations

Architecture and Problem Definition

Proposed Work

Schedule

Page 3: Efficient Query Optimization for Distributed Join in Database Federation

Introduction: Need for data integration
◦ Various systems -> full picture
◦ Mergers -> access both resources with a common interface
◦ Business partners -> combine data

Multiple Access Methods
Multiple Data Schemas

Page 4: Efficient Query Optimization for Distributed Join in Database Federation

Introduction to Database Federation

Database Federation is one approach to data integration
◦ Key performance advantage: efficiently combine data from multiple sources in a single statement
◦ The data sources are federated into a unified middleware, called the mediator.

Page 5: Efficient Query Optimization for Distributed Join in Database Federation

Key Components of Database Federation

[Figure: a Query passes through the Query Rewriter and then the Cost-Based Optimizer]

Query Rewriter
Research issues:
• containment algorithms for conjunctive queries
• schema mapping
• capability-based optimization

Cost-Based Optimizer
Cost-based optimization: closely related to the optimization techniques developed for distributed database systems

Page 6: Efficient Query Optimization for Distributed Join in Database Federation

The problem

Page 7: Efficient Query Optimization for Distributed Join in Database Federation

Things that make us unhappy

[Figure: the Optimizer sits above sites M1, M2 and M3 and chooses among three plans for joining M1.R1, M2.R2 and M3.R3]

Optimizer inputs:
Estimated conditions: available buffer sizes of sites; CPU utilization of sites; network traffic …
Statistics: physical designs …

Candidate plans (operator tree levels listed from top to bottom):
Plan 1: SortMerge on M1; NestLoop on M1, M3.R3; M1.R1, M2.R2
Plan 2: SortMerge on M2; NestLoop on M2, M3.R3; M1.R1, M2.R2
Plan 3: HashJoin on M3; SortMerge on M1, M3.R3; M1.R1, M2.R2

Run   CPU Utilization       Available Buffer           Chosen Plan   Optimal Plan
      M1    M2    M3        M1        M2        M3
1     25%   25%   25%       B(R1)     -         -      Plan 1        Plan 1
2     75%   10%   25%       > B(R1)   > B(R1)   -      Plan 1        Plan 2
3     50%   50%   15%       >         -         >      Plan 1        Plan 3

Need to take run-time conditions into account at optimization time.

Assume: B(R1) < B(R2) < B(R3), B(R1 join R2) < B(R3)

Page 8: Efficient Query Optimization for Distributed Join in Database Federation

Existing Solution - Parametric Query Optimization

Y. E. Ioannidis, et al. Parametric Query Optimization. VLDB 1992.

Key idea: identify several execution plans, each of which is optimal for a subset of all possible values of the run-time parameters.

E.g., two parameters:
Buffer size B = [2, 151]
Kind of index I = {no_index, clustered_Btree, non_clustered_BTree}

P, the set of possible vectors of parameter values: P = B × I (cross product), |P| = 150 * 3 = 450

The optimization problem: for every p ∈ P, find the plan s0 in the plan space S that satisfies the condition

c(s0, p) = min { c(s, p) : s ∈ S }

where c( ) is the cost function, which also takes the static parameters as input.
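The enumeration on this slide can be sketched in a few lines of Python. This is only an illustration of the problem statement, not the randomized algorithm from the paper: the three plans and their cost formulas are invented placeholders, and only the parameter space (B, I and their cross product P) comes from the slide.

```python
from itertools import product

# Parameter space from the slide: buffer size B = [2, 151] and index kind I.
BUFFER_SIZES = range(2, 152)
INDEX_KINDS = ("no_index", "clustered_Btree", "non_clustered_BTree")

# Hypothetical plan space S. The cost formulas are made up for illustration.
def cost_nested_loop(buffer_size, index_kind):
    base = 1000 if index_kind == "no_index" else 200
    return base + 5000 / buffer_size

def cost_sort_merge(buffer_size, index_kind):
    if index_kind == "clustered_Btree":
        return 400  # input is already sorted on the join attribute
    return 400 + 20000 / buffer_size

def cost_hash_join(buffer_size, index_kind):
    return 300 if buffer_size >= 100 else 3000

PLAN_SPACE = {
    "nested_loop": cost_nested_loop,
    "sort_merge": cost_sort_merge,
    "hash_join": cost_hash_join,
}

def parametric_optimize(plan_space, parameter_space):
    """For every parameter vector p, pick the plan s0 with minimal c(s0, p)."""
    best = {}
    for p in parameter_space:
        best[p] = min(plan_space, key=lambda s: plan_space[s](*p))
    return best

P = list(product(BUFFER_SIZES, INDEX_KINDS))
best_plans = parametric_optimize(PLAN_SPACE, P)
```

At run time the executor would look up `best_plans[(observed_buffer, observed_index)]` instead of re-optimizing. The exhaustive loop over all 450 vectors is what becomes impractical as the number of run-time parameters grows.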

Page 9: Efficient Query Optimization for Distributed Join in Database Federation

Existing Solution - Parametric Query Optimization (Cont.)

Efficient exploration algorithm – Randomized Algorithm

Justification for using parametric query optimization
[Figure: relative cost plotted against buffer size]

Problems with implementing it in a distributed database:
• Site selection + algebraic transformation + physical method selection
• Many more combinations of run-time parameters

Page 10: Efficient Query Optimization for Distributed Join in Database Federation

Existing Solution – Two-Phase Algorithm

W. Hong, et al. Optimization of Parallel Query Execution Plans in XPRS. PDIS,1991.

Developed for a parallel database based on a shared-memory multiprocessor

Phase 1: find the optimal sequential plan assuming the entire buffer pool is available

Phase 2: find the optimal parallelization of the optimal sequential plan, considering run-time available buffer size & # of free processors
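The two phases can be sketched as follows. This is a toy model under stated assumptions, not the XPRS implementation: the `SequentialPlan` fields, the 3x penalty for a plan that does not fit in its buffer, and the even split of buffer pages across workers are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SequentialPlan:
    name: str
    work: float          # hypothetical total work units
    buffer_needed: int   # pages the plan needs to run without spilling

def sequential_cost(plan, buffer_pages):
    # Assumed cost model: a plan that does not fit in the buffer costs 3x.
    penalty = 1.0 if plan.buffer_needed <= buffer_pages else 3.0
    return plan.work * penalty

def phase1(plans, total_buffer):
    """Phase 1: best sequential plan, assuming the entire buffer pool is free."""
    return min(plans, key=lambda plan: sequential_cost(plan, total_buffer))

def phase2(plan, available_buffer, free_processors):
    """Phase 2: best degree of parallelism for that single plan at run time."""
    best_degree, best_time = 1, sequential_cost(plan, available_buffer)
    for degree in range(2, free_processors + 1):
        per_worker_buffer = available_buffer // degree
        elapsed = sequential_cost(plan, per_worker_buffer) / degree
        if elapsed < best_time:
            best_degree, best_time = degree, elapsed
    return best_degree, best_time

plans = [SequentialPlan("hash", 100.0, 80), SequentialPlan("sortmerge", 150.0, 20)]
chosen = phase1(plans, total_buffer=100)
degree, elapsed = phase2(chosen, available_buffer=160, free_processors=4)
```

Only the plan chosen in Phase 1 is parallelized in Phase 2, which is why the run-time search stays small compared with optimizing over plans and parallelizations jointly.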

Page 11: Efficient Query Optimization for Distributed Join in Database Federation

Existing Solution – Two-Phase Algorithm (Cont.)

Benefits:
◦ Phase 1 has the same plan space as a System-R-style algorithm, but only one plan is explored in Phase 2
◦ Capable of dealing with parameters that are unknown at compile time

Problems for applying in database federations:
◦ Communication cost is not considered
◦ Exhaustive search in Phase 2 is still expensive for a large number of data sources

Page 12: Efficient Query Optimization for Distributed Join in Database Federation

Proposed Work

Page 13: Efficient Query Optimization for Distributed Join in Database Federation

Important Observation

Many national-scale or global-scale data federations are built on networks that consist of both broad LAN paths and narrow long-haul paths.

Many highly integrated systems have to access data through a large number of databases that belong to multiple different organizations.

Page 14: Efficient Query Optimization for Distributed Join in Database Federation

Cluster-and-Conquer

Consider all data resources in the database federation as a set of several clusters of sites.

Design two layers of mediators to schedule the query plan cooperatively:
◦ Global Mediator + Cluster Mediator

[Figure: a Global Mediator and the site clusters, labeled Cluster 1, Cluster 2 and Cluster 4 through Cluster 13]

Page 15: Efficient Query Optimization for Distributed Join in Database Federation

Architecture

• System-R-style algorithm
• Runs at compile time
• Considers all tables as being stored in a clustered fashion
• Decides inter-cluster operations

• Schedules the optimal plan found by the optimizer in a distributed and parallelized way
• Assigns each sub-plan to the corresponding cluster

• Considers run-time conditions and static physical designs
• Finds an intra-cluster optimal plan
• Every cluster mediator functions independently and potentially in parallel

Page 16: Efficient Query Optimization for Distributed Join in Database Federation

Cost Model and Optimization Goal

Cost Model

Optimization Goal
◦ To find the distributed join schedule plan with minimum cost.

Page 17: Efficient Query Optimization for Distributed Join in Database Federation

Problem Definition

Run-time parameters:
◦ Available buffer size
◦ CPU utilization

Parallelism:
◦ Partitioned parallelism
◦ Pipelined parallelism
◦ Independent parallelism

Reasons: partitioning the input data is often not feasible; in bushy plans it is common to have two operations that do not use each other’s output.
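To make the notion of independent parallelism concrete, the following sketch finds pairs of join operators in a plan tree where neither operator consumes the other's output, directly or indirectly. The tuple encoding of plans and the operator names J1, J2, J3 are assumptions chosen for this illustration.

```python
from itertools import combinations

def operators_below(plan):
    """Map each operator name to the set of operator names in its subtree.

    A plan is either a relation name (str) or a tuple (operator, left, right).
    """
    table = {}

    def visit(node):
        if isinstance(node, str):      # leaf: a base relation
            return set()
        name, left, right = node
        below = visit(left) | visit(right)
        table[name] = below
        return below | {name}

    visit(plan)
    return table

def independent_pairs(plan):
    """Pairs of operators where neither feeds the other (can run in parallel)."""
    below = operators_below(plan)
    return [(a, b) for a, b in combinations(sorted(below), 2)
            if a not in below[b] and b not in below[a]]

bushy = ("J3", ("J1", "R1", "R2"), ("J2", "R3", "R4"))
left_deep = ("J3", ("J2", ("J1", "R1", "R2"), "R3"), "R4")
```

In the bushy plan J1 and J2 are independent, while the left-deep plan has no independent pair because every join consumes the output of the one below it.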

Page 18: Efficient Query Optimization for Distributed Join in Database Federation

Optimization Algorithm

E.g.
SELECT *
FROM S1.t1, S2.t2, S5.t7, S1.t2, S6.t5, S2.t3
WHERE S1.t1.CustomerID = S2.t2.CustomerID
  AND S2.t2.SupplierID = S5.t7.SupplierID
  AND S5.t7.ItemID = S6.t5.ItemID
  AND S6.t5.Country = S1.t2.Country
  AND S1.t2.Year = S2.t3.Year

Global Mediator

Clustered view

Physical design info: B(R), T(R), V(R.attr), ……

Rule 1: only determine inter-cluster operations

Rule 2: plans that join two relations in distinct clusters are eliminated
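As a rough illustration of Rule 1, the sketch below splits the join predicates of the example query into intra-cluster joins (left to the cluster mediators) and inter-cluster joins (decided by the global mediator). The mapping of sites to clusters is not given on the slide; the assignment of S1 and S2 to one cluster and S5 and S6 to another is an assumption made only for this example.

```python
# Assumed site-to-cluster mapping (hypothetical, not from the slide).
SITE_TO_CLUSTER = {"S1": "C1", "S2": "C1", "S5": "C2", "S6": "C2"}

# Join predicates of the example query: (left table, right table, attribute).
JOINS = [
    ("S1.t1", "S2.t2", "CustomerID"),
    ("S2.t2", "S5.t7", "SupplierID"),
    ("S5.t7", "S6.t5", "ItemID"),
    ("S6.t5", "S1.t2", "Country"),
    ("S1.t2", "S2.t3", "Year"),
]

def site_of(table):
    """'S1.t1' -> 'S1'"""
    return table.split(".")[0]

def split_joins(joins, site_to_cluster):
    """Separate joins inside one cluster from joins that cross clusters."""
    intra = {}   # cluster -> joins handled by that cluster's mediator
    inter = []   # joins the global mediator must place
    for left, right, attr in joins:
        left_cluster = site_to_cluster[site_of(left)]
        right_cluster = site_to_cluster[site_of(right)]
        if left_cluster == right_cluster:
            intra.setdefault(left_cluster, []).append((left, right, attr))
        else:
            inter.append((left, right, attr))
    return intra, inter

intra_joins, inter_joins = split_joins(JOINS, SITE_TO_CLUSTER)
```

Under this assumed mapping the global mediator only has to order and place two joins, while the remaining three are delegated to the cluster mediators.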

Page 19: Efficient Query Optimization for Distributed Join in Database Federation

Optimization Algorithm (Cont.)

Cluster 1 Mediator

Sub-plan

Search space:
• Algebraic transformation
• Physical method selection – Available_buffer
• Site selection – CPU_utility (fine-grained operator scheduling)

Run-time conditions: Available_buffer(S1), CPU_utility(S1), ……

Physical design info: B(R), T(R), V(R.attr), ……
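A minimal sketch of how a cluster mediator might combine these inputs for a single join: the site is picked by lowest CPU utilization and the join method by available buffer. The thresholds are textbook-style rules of thumb (one-pass join when the smaller input fits in memory, two-pass sort-merge when B(R) + B(S) is at most the square of the buffer size) and are assumptions here, not the cost model of the proposal.

```python
def choose_site(cpu_utility):
    """Pick the site with the lowest current CPU utilization."""
    return min(cpu_utility, key=cpu_utility.get)

def choose_method(available_buffer, b_left, b_right):
    """Pick a join method from buffer pages and input sizes in blocks, B(R)."""
    if available_buffer > min(b_left, b_right):
        return "HashJoin"      # smaller input fits in memory: one pass
    if available_buffer ** 2 >= b_left + b_right:
        return "SortMerge"     # two-pass sort-based join is feasible
    return "NestLoop"          # fall back to block nested loops

def schedule_join(cpu_utility, available_buffer, b_left, b_right):
    """Return (site, method) for one join inside a cluster."""
    site = choose_site(cpu_utility)
    return site, choose_method(available_buffer[site], b_left, b_right)

# Hypothetical run-time snapshot for two sites of one cluster.
cpu = {"S1": 0.75, "S2": 0.10}
buffers = {"S1": 200, "S2": 30}
decision = schedule_join(cpu, buffers, b_left=40, b_right=500)
```

A fuller search would weigh CPU load against buffer availability jointly, since the least loaded site is not necessarily the one with the most free buffer, as the snapshot above shows.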

Page 20: Efficient Query Optimization for Distributed Join in Database Federation

Theoretical Analysis

In global mediator

In cluster mediator

Compared to related work

Page 21: Efficient Query Optimization for Distributed Join in Database Federation

Experiment Design

Page 22: Efficient Query Optimization for Distributed Join in Database Federation

That is what I want to do for my Master’s Thesis …

Thanks