Page 1: Distributed Query Processing Donald Kossmann University of Heidelberg kossmann@informatik.uni-heidelberg.de.

Distributed Query Processing

Donald Kossmann

University of Heidelberg

kossmann@informatik.uni-heidelberg.de

Agenda

• Query Processing 101
  – centralized query processing
  – distributed query processing

• Middleware
  – SQL and XML data integration

• The Role of Web Services

Problem Statement

• Input: Query
  How many times has the moon circled around the earth in the last twenty years?

• Output: Answer
  240!

• Objectives:
  – response time, throughput, first answers, little IO, ...

• Centralized vs. Distributed Query Processing
  – same problem
  – but different parameters and objectives

Query Processing 101

• Input: Declarative Query
  – SQL, OQL, XQuery, ...

• Step 1: Translate Query into Algebra
  – Tree of operators

• Step 2: Optimize Query (physical and logical)
  – Tree of operators
  – (Compilation)

• Step 3: Interpretation
  – Query result

Algebra

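– relational algebra for SQL very well understood
– algebra for OQL fairly well understood
– algebra for XQuery (work in progress)

SELECT A.d
FROM A, B
WHERE A.a = B.b AND A.c = 35

[Figure: operator tree for the query, top to bottom: projection A.d; selection A.a = B.b, A.c = 35; cross product X; inputs A and B]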

Query Optimization

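– „no brainers“ (e.g., push down cheap predicates)
– enumerate alternative plans, apply cost model
– use search heuristics to find cheapest plan

[Figure: the initial operator tree (projection A.d over selection A.a = B.b, A.c = 35 over cross product X of A and B) next to the optimized plan (projection A.d over a hashjoin whose inputs are index A.c and projection B.b over B)]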

Query Execution

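– library of operators (hash join, merge join, ...)
– pipelining (iterator model)
– lazy evaluation
– exploit indexes and clustering in database

[Figure: the optimized plan annotated with the tuples flowing through it. index A.c returns (John, 35, CS) and (Mary, 35, EE); B returns (Edinburgh, CS, 5.0) and (Edinburgh, AS, 6.0); projection B.b yields (CS) and (AS); the hashjoin outputs (John, 35, CS); projection A.d outputs John]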
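The iterator model on this slide can be sketched with Python generators. This is only an illustration, not the engine behind the slides: the function names, the in-memory "index" and the column positions (A rows as (d, c, a), B rows as (site, b, score)) are assumptions read off the example tuples.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple

# Assumed stand-ins for the stored data shown on the slide.
A_INDEX_ON_C: Dict[int, List[Row]] = {35: [("John", 35, "CS"), ("Mary", 35, "EE")]}
B_ROWS: List[Row] = [("Edinburgh", "CS", 5.0), ("Edinburgh", "AS", 6.0)]


def index_lookup(index: Dict[int, List[Row]], value: int) -> Iterator[Row]:
    """'index A.c': yield only the rows the index returns for the value."""
    yield from index.get(value, [])


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Full scan: yield rows one at a time."""
    yield from rows


def project(rows: Iterable[Row], columns: List[int]) -> Iterator[Row]:
    """Projection: keep the listed column positions of every row."""
    for row in rows:
        yield tuple(row[i] for i in columns)


def hash_join(probe: Iterable[Row], build: Iterable[Row],
              probe_key: int, build_key: int) -> Iterator[Row]:
    """Hash join: build a table on one input, then stream the other input."""
    table: Dict[object, List[Row]] = {}
    for row in build:  # build phase consumes its input completely
        table.setdefault(row[build_key], []).append(row)
    for row in probe:  # probe phase is pipelined
        for match in table.get(row[probe_key], []):
            yield row + match


def run_plan() -> List[Row]:
    """A.d over hashjoin(index A.c = 35, B.b over B)."""
    a_side = index_lookup(A_INDEX_ON_C, 35)
    b_side = project(scan(B_ROWS), [1])              # B.b
    joined = hash_join(a_side, b_side, probe_key=2, build_key=0)
    return list(project(joined, [0]))                # A.d


print(run_plan())
```

Nothing is computed until the top-level consumer asks for rows, which is the lazy evaluation mentioned in the list; only the build side of the join is read eagerly.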

Summary: Centralized Queries

• Basic SQL (SPJG, nesting) well understood
• Very good extensibility
  – nearest neighbor search, spatial joins, time series, UDF, roll-up, cube, ...

• Current problems
  – statistics, cost model for optimization
  – physical database design expensive

• Trends
  – interactiveness during execution
  – approximate answers
  – more and more functionality, powerful models (XML)

Distributed Query Processing 101

• Idea: This is just an extension of centralized query processing. (System R* et al. in the early 80s)

• What is different?
  – extend physical algebra: send&receive operators
  – resource vectors, network interconnect matrix
  – caching and replication
  – optimize for response time
  – less predictability in cost model (adaptive algos)
  – heterogeneity in data formats and data models

Distributed Query Plan

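[Figure: the optimized plan distributed across sites. Projection A.d over hashjoin; each hashjoin input arrives through a receive operator fed by a send operator; one send sits above index A.c, the other above projection B.b over B]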

Cost

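[Figure: the distributed plan annotated with per-operator costs: A.d 1; hashjoin 8; B.b 2; index A.c 5; B 10; receive 1 and 6; send 1 and 6]

Total Cost = Sum of Cost of Ops

Cost = 40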

Response Time

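[Figure: the distributed plan annotated with (first tuple, last tuple) times per operator: A.d 25, 33; hashjoin 24, 32; B.b 0, 12; index A.c 0, 5; B 0, 10; receive 0, 7 and 0, 24; send 0, 6 and 0, 18]

Whole plan: Total Cost = 40, first tuple = 25, last tuple = 33

Single operator (the values match the scan of B): first tuple = 0, last tuple = 10

independent, pipelined parallelism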
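The difference between total cost and response time can be checked with a small sketch. The assignment of costs to operators is an assumption reconstructed from the numbers on the two slides, and the timing model (an operator finishes once its slowest input has finished, plus its own cost) is a simplification that happens to reproduce the "last tuple" values; it is not presented as the author's cost model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    name: str
    cost: int
    children: List["Op"] = field(default_factory=list)


def total_cost(op: Op) -> int:
    """Total work: the sum of the costs of all operators in the plan."""
    return op.cost + sum(total_cost(child) for child in op.children)


def last_tuple_time(op: Op) -> int:
    """Response time when sibling subtrees run in parallel on separate sites:
    an operator finishes after its slowest input, plus its own cost."""
    slowest_input = max((last_tuple_time(c) for c in op.children), default=0)
    return slowest_input + op.cost


# Assumed mapping of the slide's numbers onto the distributed plan.
a_branch = Op("receive", 1, [Op("send", 1, [Op("index A.c", 5)])])
b_branch = Op("receive", 6, [Op("send", 6, [Op("B.b", 2, [Op("B", 10)])])])
plan = Op("A.d", 1, [Op("hashjoin", 8, [a_branch, b_branch])])

print(total_cost(plan))       # 40
print(last_tuple_time(plan))  # 33
```

The two branches below the join overlap in time, so the response time (33) is lower than the total cost (40) even though the same work is done.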

Adaptive Algorithms

• Deal with unpredictable events at run time
  – delays in arrival of data, burstiness of network
  – autonomy of nodes, change in policies

• Example: double pipelined hash joins
  – build hash table for both input streams
  – read inputs in separate threads
  – good for bursty arrival of data

• Re-optimization at run time
  – monitor execution of query
  – adjust estimates of cost model
  – re-optimize if delta is too large
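A minimal sketch of the double pipelined hash join idea follows. To keep it short and deterministic, the two reader threads from the slide are replaced by a single merged stream of arrival events, which is an assumption of this example; the function name and the event encoding ("L" or "R" plus a row) are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple
Event = Tuple[str, Row]  # ("L", row) or ("R", row), in arrival order


def double_pipelined_hash_join(
    arrivals: Iterable[Event], left_key: int, right_key: int
) -> Iterator[Tuple[Row, Row]]:
    """Insert every arriving row into its own side's hash table and
    immediately probe the other side's table, so neither input blocks."""
    left_table: Dict[object, List[Row]] = defaultdict(list)
    right_table: Dict[object, List[Row]] = defaultdict(list)
    for side, row in arrivals:
        if side == "L":
            key = row[left_key]
            left_table[key].append(row)
            for match in right_table.get(key, []):
                yield (row, match)
        else:
            key = row[right_key]
            right_table[key].append(row)
            for match in left_table.get(key, []):
                yield (match, row)


# Rows from the earlier example slides, interleaved as they might arrive.
arrivals = [
    ("L", ("John", 35, "CS")),
    ("R", ("CS",)),
    ("L", ("Mary", 35, "EE")),
    ("R", ("AS",)),
]
results = list(double_pipelined_hash_join(arrivals, left_key=2, right_key=0))
print(results)
```

Each match is emitted as soon as both of its rows have arrived, whichever side delivers first. That is why the operator copes well with bursty inputs, at the price of keeping two hash tables in memory.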

Heterogeneity

• Use Wrappers to „hide“ heterogeneity
• Wrappers take care of data format, packaging
• Wrappers map from local to global schema
• Wrappers carry out caching
  – connections, cursors, data, ...

• Wrappers map queries into local dialect
• Wrappers participate in query planning!!!
  – define the subset of queries that can be handled
  – give cost information, statistics
  – „capability-based rewrite“ (HKWY, VLDB 1997)
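How a wrapper might take part in planning can be illustrated with a toy example. This is not the capability-based rewrite algorithm cited above; it only shows the basic idea that a wrapper declares what it can evaluate and the planner keeps the rest in the middleware. All class and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Predicate:
    column: str
    op: str        # e.g. "=", "<", "LIKE"
    value: object


@dataclass(frozen=True)
class Wrapper:
    name: str
    supported_ops: FrozenSet[str]  # comparison operators the source can evaluate

    def can_handle(self, predicate: Predicate) -> bool:
        """Declare whether the underlying source can evaluate this predicate."""
        return predicate.op in self.supported_ops


def split_predicates(
    wrapper: Wrapper, predicates: List[Predicate]
) -> Tuple[List[Predicate], List[Predicate]]:
    """Push down what the source supports; keep the rest for the middleware."""
    pushed = [p for p in predicates if wrapper.can_handle(p)]
    residual = [p for p in predicates if not wrapper.can_handle(p)]
    return pushed, residual


# A source that only supports equality lookups (assumed for the example).
key_value_source = Wrapper("DB1", frozenset({"="}))
query_predicates = [Predicate("c", "=", 35), Predicate("d", "LIKE", "J%")]
pushed, residual = split_predicates(key_value_source, query_predicates)
print(pushed)
print(residual)
```

A real wrapper would also report cost estimates and statistics, as the list above notes; those are left out to keep the sketch small.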

Data Cleaning

• Are two objects the same?

• Is „D. A. Kossman“ the same as „Kossmann“?

• Is the object that was at Position x 10 min. ago the same as the object at Position y now?

• Approaches (combination of)
  – statistical
  – domain knowledge
  – human inspection

• Very Expensive
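As a rough illustration of combining a domain rule with a statistical measure, the sketch below compares names with `difflib.SequenceMatcher` from the Python standard library. The normalization rule (compare only the last token, lowercased) and the 0.9 threshold are assumptions made for this example; they are not taken from the slides.

```python
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Crude domain rule: keep only the last token (the surname), lowercased."""
    tokens = name.split()
    return tokens[-1].lower() if tokens else ""


def name_similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 of the two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def probably_same(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag a candidate match; borderline cases would still go to a human."""
    return name_similarity(a, b) >= threshold


print(probably_same("D. A. Kossman", "Kossmann"))
print(probably_same("D. A. Kossman", "Smith"))
```

Every pair of records has to be scored this way unless candidates are pre-grouped, which is one reason data cleaning is expensive.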

Summary

• „Theory“ very well understood
  – extend traditional (centralized) query processing
  – add some bells and whistles
  – heterogeneity needs manual work and wrappers

• Problems in Practice
  – cost model, statistics
  – architectures are not fit for adaptivity, heterogeneity
  – optimizers do not scale for 10,000s of sites
  – autonomy of sites, systems not built for asynchronous communication
  – data cleaning

Middleware

• Two kinds of middleware
  – data warehouses
  – virtual integration

• Data Warehouses
  – good: query response times
  – good: materializes results of data cleaning
  – bad: high resource requirements in middleware
  – bad: staleness of data

• Virtual Integration
  – the opposite
  – caching possible to improve response times

Virtual Integration

[Figure: a Query enters the Middleware (query decomposition, result composition), which sends a subquery through a wrapper to each of DB1 and DB2]

IBM Data Joiner

[Figure: an SQL Query enters Data Joiner, which sends a subquery through a wrapper to each of SQL DB1 and SQL DB2]

Adding XML

[Figure: a Query enters the Middleware (SQL), which sends a subquery through a wrapper to each of DB1 and DB2; an XML Publishing component is added to the architecture]

XML Data Integration

[Figure: an XML Query enters the Middleware (XML), which sends an XML query through a wrapper to each of DB1 and DB2]

XML Data Integration

• Example: BEA Liquid Data
• Advantage
  – Availability of XML wrappers for all major databases

• Problems
  – XML - SQL mapping is very difficult
  – XML is not always the right language (e.g., decision support style queries)

Summary

• Middleware „looks“ like a homogeneous, centralized database
  – location transparency
  – data model transparency

• Middleware provides global schema
  – data sources map local schemas to global schema

• Various kinds of middleware (SQL, OQL, XML)

• „Stacks“ of middleware possible

• Data Cleaning requires special attention

A Note on Web Services

• Idea: Encapsulate Data Source
  – provide WSDL interface to access data
  – works very well if query pattern is known

• Problem: Exploit Capability of Source
  – WSDL limits capabilities of data source; good optimization requires „white box“
  – example: access by id, access by name, full scan; should all combinations be listed in WSDL?

• Solution: WSDL for Query Planning
  – Details ???