GALT Workshop Oct 16, 2003 1 Languages for Distributed Information Val Tannen Computer and Information Science Department University of Pennsylvania Acknowledgements: • Philippa Gardner and Sergio Maffeis, Imperial College • Arnaud Sahuguet, Bell Labs (formerly at UPenn) • Zack Ives and Benjamin Pierce, UPenn
26
Embed
Languages for Distributed Information Val Tannen Computer and Information Science Department
Languages for Distributed Information Val Tannen Computer and Information Science Department University of Pennsylvania Acknowledgements: Philippa Gardner and Sergio Maffeis, Imperial College Arnaud Sahuguet, Bell Labs (formerly at UPenn) Zack Ives and Benjamin Pierce, UPenn. - PowerPoint PPT Presentation
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
GALT WorkshopOct 16, 2003
1
Languages for Distributed Information
Val Tannen
Computer and Information Science Department
University of Pennsylvania
Acknowledgements:
• Philippa Gardner and Sergio Maffeis, Imperial College• Arnaud Sahuguet, Bell Labs (formerly at UPenn)• Zack Ives and Benjamin Pierce, UPenn
GALT WorkshopOct 16, 2003
2
• Past: because it did not fit on one system; data
placement was a big topic.
•Present: independent data sources agree to collaborate
• Queries that require integration across multiple sources.
• Heterogeneity. Is it solved by XML? Results are mixed.
• Redundancy is what makes things really interesting!
Distributed information
GALT WorkshopOct 16, 2003
3
A query arrives… and demands satisfaction
query process
Arrives in any node. Refers to names. Names relate to data
in various nodes. For example:
node A: proj (join R (sel S))
Need to discover that R is at node A and S is at node B.
Need a plan to compute the result of
proj (join tableR@A (sel tableS@B)
and return the answer to the client that asked the query.
GALT WorkshopOct 16, 2003
4
One language for queries, views, and query plans
data languagequery language
(SQL, XQuery)
view language
query plan language
processes
GALT WorkshopOct 16, 2003
5
Motives and intentions
Existing view and plan languages are limited.
Capture distributed query processing in a high-level language.
(SQL vs. C++)
Easier to program self-tuning features.
Easier to generate automatically, eg., by optimizers.
Easier to model toward formal verification.
My ambition:) is to convince
(1) database people that process calculi are useful,
(2) process calculi people to look at some new problems.
GALT WorkshopOct 16, 2003
6
Concurrency and process calculi
Why process calculi?
Because queries arrive asynchronously and must be processed
concurrently within a node (and across nodes).
Composition primitives:
e1 | e2 parallel
e1 ; e2 sequential
e1 or e2 nondeterministic choice
e1, e2 expressions denoting processes
GALT WorkshopOct 16, 2003
7
Synchronization and communication
CCS notifL ; e1 | waitL ; e2 e1 | e2
pi-calculus
new c . ( send c v ; e1 | recv c x. e2(x) )
new c . e1 | e2(v)
Anticipation: the blocking operational semantics does not
capture well query execution. The old dataflow paradigm might
be better for that. But channels are essential.
blocking synchro
+ comm on channel c
GALT WorkshopOct 16, 2003
8
Capturing query plans
Original query may be “decomposed” into subordinate
queries running at different nodes. Recall
node A: proj (join R (sel S))
B
sel
A
proj join
SR
Client
plan:
Get from here… … to here
GALT WorkshopOct 16, 2003
9
Query plan “development”
node A: proj (join R (sel S))
> meta-step (discovery)>
node A: proj (join R@A (sel S@B))
> meta-step (discovery)>
node A: proj (join tableR (sel S@B))
> meta-step (optimization)>
node A: new c . proj (join tableR (split c (sel S@B)))
>operational semantics step>
node A: new c . proj (join tableR (recv c))
| send c (sel S@B)
GALT WorkshopOct 16, 2003
10
A new primitive: split
new c . e1(split c (e2))
new c . e1(recv c) | send c e2
(We are using a simplified form of recv since the value passed
through the channel is consumed in just one place.)
split is typically introduced by optimization steps.
Alternative (query shipping vs. data shipping):
node A: new c . proj (join tableR (sel (split c S@B)))
GALT WorkshopOct 16, 2003
11
Distribution and migration
Each node handles a “soup” of parallel processes:
( node A: e1 | e2 ) || ( node B: e3 | e4 | e5 )
Process migration:
( node A: e1 | migrate B e2) || ( node B: e3 )
( node A: e1 ) || ( node B: e3 | e2 )
Will be used to move subordinate queries.
GALT WorkshopOct 16, 2003
12
Query plan, continued
node A: new c . proj (join tableR (recv c))
| send c (sel S@B)
>meta-step (subordination)>
node A: new c . proj (join tableR (recv c))
| migrate B (send c (sel S@B))
>operational semantics step>
new c@A . ( node A: proj (join tableR (recv c))
|| node B: send c@A (sel S@B) )
Global channels vs. located channels.
We can get by with located channels.
GALT WorkshopOct 16, 2003
13
Distributed query evaluation
>meta-step (discovery)>
new c@A . ( node A: proj (join tableR (recv c))
|| node B: send c@A (sel tableS) )
Should not block send/recv on the entire table (all-push). Data
should be streamed.
All-pull (“iterators”) blocks recv/send on each tuple: not good.
Push-pull (queued) is most reasonable.
B
sel
A
proj join
SR
Client
GALT WorkshopOct 16, 2003
14
“Lay down the pipes … Turn on the faucets”
The approach of ubQL [Sahuguet&T., 2001]. Separate
• deployment phase: queries are split
channels are created
subordinate query processes migrate
carrying channel names with them
• execution phase: data is streamed through channels
processed according to subord. queries
• adaptive behavior: execution can be interrupted and reset
channels can be flushed and reset
alternative plans can be started
GALT WorkshopOct 16, 2003
15
Adaptive query processing
Another new primitive (like Cardelli’s “Web combinators”)
adapt stream AltPlan stream if adequate
AltPlan if drying up
redundancy
Example:
new c@A .
node A: proj (join tableR (adapt (recv c) AltPlan))
|| node B: send c@A (sel tableS)
where, eg., AltPlan new c’@A. split c’ (sel S@C)
GALT WorkshopOct 16, 2003
16
Distributed databases systems: 25+ years
A lot of useful techniques, especially optimizations to reduce
bandwidth. Eg., use of semijoins.
Essential limitation: need powerful, all-knowing node, to generate the query plans.
proj join
sel
B
A
proj SR
Client
These are also subordinatequeries. Not quite decomposition…
still easily expressible
GALT WorkshopOct 16, 2003
17
Where is the data? Discovery!
• Distributed Database Systems
centralized, complete, consistent knowledge.
• No clue, go out and search everywhere (Gnutella)!