Distributed Query Processing Donald Kossmann University of Heidelberg [email protected].

Distributed Query Processing

Donald Kossmann

University of Heidelberg

[email protected]

Agenda

• Query Processing 101
– centralized query processing
– distributed query processing

• Middleware
– SQL and XML data integration

• The Role of Web Services

Problem Statement

• Input: Query
– How many times has the moon circled around the earth in the last twenty years?

• Output: Answer
– 240!

• Objectives:
– response time, throughput, first answers, little IO, ...

• Centralized vs. Distributed Query Processing
– same problem
– but, different parameters and objectives

Query Processing 101

• Input: Declarative Query
– SQL, OQL, XQuery, ...

• Step 1: Translate Query into Algebra
– Tree of operators

• Step 2: Optimize Query (physical and logical)
– Tree of operators
– (Compilation)

• Step 3: Interpretation
– Query result

Algebra

– relational algebra for SQL very well understood
– algebra for OQL fairly well understood
– algebra for XQuery (work in progress)

SELECT A.d
FROM A, B
WHERE A.a = B.b AND A.c = 35

[Figure: operator tree, top to bottom: projection A.d; selection A.a = B.b, A.c = 35; cross product X; inputs A and B]

Query Optimization

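– „no brainers“ (e.g., push down cheap predicates)
– enumerate alternative plans, apply cost model
– use search heuristics to find cheapest plan

[Figure: unoptimized plan, top to bottom: projection A.d; selection A.a = B.b, A.c = 35; cross product X; inputs A and B]

[Figure: optimized plan, top to bottom: projection A.d; hashjoin; the hashjoin inputs are an index lookup on A.c and a projection B.b over B]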

Query Execution

– library of operators (hash join, merge join, ...)
– pipelining (iterator model)
– lazy evaluation
– exploit indexes and clustering in database

[Figure: the optimized plan annotated with the tuples produced by each operator]
index A.c: (John, 35, CS), (Mary, 35, EE)
B: (Edinburgh, CS, 5.0), (Edinburgh, AS, 6.0)
B.b: (CS), (AS)
hashjoin: (John, 35, CS)
A.d: John
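The iterator model and lazy evaluation named on this slide can be sketched with Python generators. This is a minimal illustration under assumptions, not the implementation of any particular system: the function names, the column positions, and the third row of A (which the slide does not show) are made up for the example. Each operator pulls rows from its child only when asked, so nothing runs until the result is consumed.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple

# Tables from the slide. Column order is an assumption: A = (d, c, a), B = (x, b, y).
A: List[Row] = [("John", 35, "CS"), ("Mary", 35, "EE"),
                ("Pete", 41, "CS")]  # hypothetical extra row, filtered out by A.c = 35
B: List[Row] = [("Edinburgh", "CS", 5.0), ("Edinburgh", "AS", 6.0)]


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Leaf operator: hands out stored rows one at a time."""
    for row in rows:
        yield row


def index_lookup(rows: Iterable[Row], column: int, value) -> Iterator[Row]:
    """Stand-in for an index scan; a real index would not read every row."""
    for row in rows:
        if row[column] == value:
            yield row


def project(child: Iterator[Row], columns: List[int]) -> Iterator[Row]:
    """Keep only the requested columns of each incoming row."""
    for row in child:
        yield tuple(row[c] for c in columns)


def hash_join(probe: Iterator[Row], build: Iterator[Row],
              probe_key: int, build_key: int) -> Iterator[Row]:
    """Build a hash table on one input, then stream the other input through it."""
    table = {}
    for row in build:
        table.setdefault(row[build_key], []).append(row)
    for row in probe:
        for match in table.get(row[probe_key], []):
            yield row + match


def build_plan() -> Iterator[Row]:
    a_side = index_lookup(A, column=1, value=35)   # index A.c
    b_side = project(scan(B), columns=[1])         # B.b over B
    joined = hash_join(a_side, b_side, probe_key=2, build_key=0)
    return project(joined, columns=[0])            # A.d


result = list(build_plan())  # rows are only produced here, on demand
print(result)
```

Building the plan only wires generators together; the work happens when `list` pulls from the root operator.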

Summary: Centralized Queries

• Basic SQL (SPJG, nesting) well understood

• Very good extensibility
– nearest neighbor search, spatial joins, time series, UDF, roll-up, cube, ...

• Current problems
– statistics, cost model for optimization
– physical database design expensive

• Trends
– interactiveness during execution
– approximate answers
– more and more functionality, powerful models (XML)

Distributed Query Processing 101

• Idea: This is just an extension of centralized query processing. (System R* et al. in the early 80s)

• What is different?
– extend physical algebra: send&receive operators
– resource vectors, network interconnect matrix
– caching and replication
– optimize for response time
– less predictability in cost model (adaptive algos)
– heterogeneity in data formats and data models

Distributed Query Plan

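[Figure: distributed plan, top to bottom: projection A.d; hashjoin; each hashjoin input arrives through a receive operator fed by a send operator; one send reads from the index lookup on A.c, the other from the projection B.b over B]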

Cost

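[Figure: the distributed plan annotated with the cost of each operator]
A.d: 1
hashjoin: 8
receive (A side): 1, receive (B side): 6
send (A side): 1, send (B side): 6
index A.c: 5
B.b: 2
B: 10

Total Cost = Sum of Cost of Ops

Cost = 40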

Response Time

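[Figure: the distributed plan annotated with (first tuple, last tuple) times for each operator]
A.d: 25, 33
hashjoin: 24, 32
receive (A side): 0, 7; receive (B side): 0, 24
send (A side): 0, 6; send (B side): 0, 18
index A.c: 0, 5
B.b: 0, 12
B: 0, 10

Total Cost = 40
first tuple = 25
last tuple = 33

first tuple = 0
last tuple = 10

independent, pipelined parallelism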
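The difference between total cost and response time can be checked with a few lines of Python. This is a sketch under a simplifying assumption that is not stated on the slides: the two branches run in parallel, and an operator delivers its last tuple `cost` time units after its slowest input has delivered its last tuple. The `Op` class and function names are made up for the example, and first-tuple times are not modelled. With the per-operator costs from the Cost slide, the model yields the total of 40 and the last-tuple time of 33 shown above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """One operator of a plan with its own cost and its input operators."""
    name: str
    cost: int
    children: List["Op"] = field(default_factory=list)


def total_cost(op: Op) -> int:
    """Total cost: sum of the cost of all operators in the plan."""
    return op.cost + sum(total_cost(child) for child in op.children)


def last_tuple_time(op: Op) -> int:
    """Assumed model: inputs run in parallel, the operator finishes
    `cost` units after the slowest input finishes."""
    slowest_input = max((last_tuple_time(child) for child in op.children), default=0)
    return slowest_input + op.cost


# Plan and costs as read from the slides.
a_branch = Op("receive", 1, [Op("send", 1, [Op("index A.c", 5)])])
b_branch = Op("receive", 6, [Op("send", 6, [Op("B.b", 2, [Op("B", 10)])])])
plan = Op("A.d", 1, [Op("hashjoin", 8, [a_branch, b_branch])])

print(total_cost(plan), last_tuple_time(plan))
```

The gap between the two numbers is the work done on the A side while the slower B side is still running, which is why optimizing for response time can pick a different plan than optimizing for total cost.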

Adaptive Algorithms

• Deal with unpredictable events at run time
– delays in arrival of data, burstiness of network
– autonomy of nodes, change in policies

• Example: double pipelined hash joins
– build hash table for both input streams
– read inputs in separate threads
– good for bursty arrival of data

• re-optimization at run time
– monitor execution of query
– adjust estimates of cost model
– re-optimize if delta is too large
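A double pipelined hash join can be sketched as follows. To keep the example deterministic, the sketch replaces the separate reader threads with a single stream of `(side, row)` arrival events; that substitution, the function name, and the key positions are assumptions made for illustration. Every arriving row is inserted into the hash table of its own side and immediately probed against the hash table of the other side, so results appear as soon as both matching rows have arrived, regardless of which input is delayed.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

Row = Tuple
Arrival = Tuple[str, Row]  # ("left" or "right", row)


def double_pipelined_hash_join(arrivals: Iterable[Arrival],
                               left_key: int,
                               right_key: int) -> Iterator[Tuple[Row, Row]]:
    """Join two inputs whose rows arrive interleaved in arbitrary order."""
    tables = {"left": defaultdict(list), "right": defaultdict(list)}
    keys = {"left": left_key, "right": right_key}
    for side, row in arrivals:
        other = "right" if side == "left" else "left"
        key = row[keys[side]]
        tables[side][key].append(row)              # build on own side
        for match in tables[other].get(key, []):   # probe the other side
            yield (row, match) if side == "left" else (match, row)


# Bursty arrival: the left input shows up first, the right input later.
arrivals = [
    ("left", ("John", 35, "CS")),
    ("left", ("Mary", 35, "EE")),
    ("right", ("CS",)),
    ("right", ("AS",)),
]
joined = list(double_pipelined_hash_join(arrivals, left_key=2, right_key=0))
print(joined)
```

The price for not blocking on either input is memory: both hash tables are kept for the whole join.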

Heterogeneity

• Use Wrappers to „hide“ heterogeneity

• Wrappers take care of data format, packaging

• Wrappers map from local to global schema

• Wrappers carry out caching
– connections, cursors, data, ...

• Wrappers map queries into local dialect

• Wrappers participate in query planning!!!
– define the subset of queries that can be handled
– give cost information, statistics
– „capability-based rewrite“ (HKWY, VLDB 1997)
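How a wrapper might take part in planning can be illustrated with a small sketch. It is not the capability-based rewrite algorithm cited above; it only shows the basic idea that a wrapper declares which predicates its source can evaluate, and the planner pushes those down and keeps the rest for the middleware. All names (`WrapperCapabilities`, `split_predicates`) and the predicate encoding are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Predicate = Tuple[str, str, object]  # (column, operator, constant)


@dataclass(frozen=True)
class WrapperCapabilities:
    """What a hypothetical wrapper tells the planner about its data source."""
    supported: FrozenSet[Tuple[str, str]]  # (column, operator) pairs the source can evaluate


def split_predicates(predicates: List[Predicate],
                     caps: WrapperCapabilities) -> Tuple[List[Predicate], List[Predicate]]:
    """Return (pushed down to the source, left for the middleware)."""
    pushed = [p for p in predicates if (p[0], p[1]) in caps.supported]
    residual = [p for p in predicates if (p[0], p[1]) not in caps.supported]
    return pushed, residual


# Assumed source: can only evaluate equality on column c.
caps = WrapperCapabilities(supported=frozenset({("c", "=")}))
query = [("c", "=", 35), ("d", "like", "J%")]
pushed, residual = split_predicates(query, caps)
print(pushed, residual)
```

A fuller version would also let the wrapper return cost estimates and statistics, as the slide lists, so the optimizer can compare alternative plans.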

Data Cleaning

• Are two objects the same?

• Is „D. A. Kossman“ the same as „Kossmann“?

• Is the object that was at Position x 10 min. ago the same as the object at Position y now?

• Approaches (combination of)
– statistical
– domain knowledge
– human inspection

• Very Expensive
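The statistical approach can be sketched with the standard library's `difflib.SequenceMatcher`. This is only an illustration: the last-name heuristic, the helper names, and the threshold of 0.9 are assumptions, and a real cleaning pipeline would combine such a score with domain knowledge and human inspection, as listed above.

```python
from difflib import SequenceMatcher


def last_name(name: str) -> str:
    """Crude normalisation: drop dots, keep the last token, lowercase it."""
    return name.replace(".", " ").split()[-1].lower()


def similarity(a: str, b: str) -> float:
    """Similarity of the two last names, between 0.0 and 1.0."""
    return SequenceMatcher(None, last_name(a), last_name(b)).ratio()


def probably_same_person(a: str, b: str, threshold: float = 0.9) -> bool:
    """Assumed decision rule: same person if the score reaches the threshold."""
    return similarity(a, b) >= threshold


print(probably_same_person("D. A. Kossman", "Kossmann"))
print(probably_same_person("Kossmann", "Codd"))
```

Comparing every pair of objects this way grows quadratically with the number of objects, which is one reason data cleaning is expensive.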

Summary

• „Theory“ very well understood
– extend traditional (centralized) query processing
– add some bells and whistles
– heterogeneity needs manual work and wrappers

• Problems in Practice
– cost model, statistics
– architectures are not fit for adaptivity, heterogeneity
– optimizers do not scale for 10,000s of sites
– autonomy of sites, systems not built for asynchronous communication
– data cleaning

Middleware

• Two kinds of middleware
– data warehouses
– virtual integration

• Data Warehouses
– good: query response times
– good: materializes results of data cleaning
– bad: high resource requirements in middleware
– bad: staleness of data

• Virtual Integration
– the opposite
– caching possible to improve response times

Virtual Integration

[Figure: a Query enters the Middleware (query decomposition, result composition), which sends a subquery through a wrapper to each of DB1 and DB2]

IBM Data Joiner

[Figure: a SQL Query enters Data Joiner, which sends a subquery through a wrapper to each of SQL DB1 and SQL DB2]

Adding XML

[Figure: a Query is answered by Middleware (SQL), which sends a subquery through a wrapper to each of DB1 and DB2; an XML Publishing component is added to this architecture]

XML Data Integration

[Figure: an XML Query enters Middleware (XML), which sends an XML query through a wrapper to each of DB1 and DB2]

XML Data Integration

• Example: BEA Liquid Data

• Advantage
– Availability of XML wrappers for all major databases

• Problems
– XML - SQL mapping is very difficult
– XML is not always the right language (e.g., decision support style queries)

Summary

• Middleware „looks“ like a homogeneous, centralized database
– location transparency
– data model transparency

• Middleware provides global schema
– data sources map local schemas to global schema

• Various kinds of middleware (SQL, OQL, XML)

• „Stacks“ of middleware possible

• Data Cleaning requires special attention

A Note on Web Services

• Idea: Encapsulate Data Source
– provide WSDL interface to access data
– works very well if query pattern is known

• Problem: Exploit Capability of Source
– WSDL limits capabilities of data source; good optimization requires „white box“
– example: access by id, access by name, full scan; should all combinations be listed in WSDL?

• Solution: WSDL for Query Planning
– Details ???
