Languages for Distributed Information Val Tannen Computer and Information Science Department

GALT Workshop, Oct 16, 2003

1

Languages for Distributed Information

Val Tannen

Computer and Information Science Department

University of Pennsylvania

Acknowledgements:

• Philippa Gardner and Sergio Maffeis, Imperial College

• Arnaud Sahuguet, Bell Labs (formerly at UPenn)

• Zack Ives and Benjamin Pierce, UPenn


2

Distributed information

• Past: data was distributed because it did not fit on one system; data placement was a big topic.

• Present: independent data sources agree to collaborate.

• Queries that require integration across multiple sources.

• Heterogeneity. Is it solved by XML? Results are mixed.

• Redundancy is what makes things really interesting!


3

A query arrives… and demands satisfaction

query process

Arrives in any node. Refers to names. Names relate to data

in various nodes. For example:

node A: proj (join R (sel S))

Need to discover that R is at node A and S is at node B.

Need a plan to compute the result of

proj (join tableR@A (sel tableS@B))

and return the answer to the client that asked the query.
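The discovery step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of queries, the catalog `CATALOG`, and the helpers `discover` and `show` are hypothetical names introduced here, not part of the talk.

```python
# Hypothetical catalog: which node stores the table behind each name.
CATALOG = {"R": "A", "S": "B"}

def discover(expr, catalog):
    """Replace every bare name in a query term by a located table reference."""
    if isinstance(expr, str):
        # A leaf is a name; annotate it with the node found in the catalog.
        return f"table{expr}@{catalog[expr]}"
    op, *args = expr
    return (op, *(discover(a, catalog) for a in args))

def show(expr):
    """Render a term in a fully parenthesized prefix notation."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    return "(" + " ".join([op] + [show(a) for a in args]) + ")"

# node A: proj (join R (sel S))
query = ("proj", ("join", "R", ("sel", "S")))
plan = discover(query, CATALOG)
print(show(plan))
```

A centralized dictionary is the simplest possible catalog; how such knowledge is obtained in a decentralized setting is exactly the discovery problem the talk returns to later.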


4

One language for queries, views, and query plans

data language

query language (SQL, XQuery)

view language

query plan language

processes


5

Motives and intentions

Existing view and plan languages are limited.

Capture distributed query processing in a high-level language.

(SQL vs. C++)

Easier to program self-tuning features.

Easier to generate automatically, e.g., by optimizers.

Easier to model toward formal verification.

My ambition :) is to convince

(1) database people that process calculi are useful,

(2) process calculi people to look at some new problems.


6

Concurrency and process calculi

Why process calculi?

Because queries arrive asynchronously and must be processed

concurrently within a node (and across nodes).

Composition primitives:

e1 | e2 parallel

e1 ; e2 sequential

e1 or e2 nondeterministic choice

e1, e2 expressions denoting processes


7

Synchronization and communication

CCS (blocking synchronization):

notif L ; e1 | wait L ; e2   -->   e1 | e2

pi-calculus (blocking synchronization + communication on channel c):

new c . ( send c v ; e1 | recv c x. e2(x) )   -->   new c . e1 | e2(v)

Anticipation: the blocking operational semantics does not capture query execution well. The old dataflow paradigm might be better for that. But channels are essential.
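To make the pi-calculus rule concrete, here is a minimal Python sketch in which two threads play the roles of the sender and the receiver. It assumes a `queue.Queue` stands in for the channel c, and it uses `join`/`task_done` to make the send block until the value has been taken; the function names and the value 42 are illustrative only.

```python
import queue
import threading

log = []

def sender(c, v):
    c.put(v)        # send c v
    c.join()        # block until the receiver has taken the value
    log.append("e1 runs after the send is matched")

def receiver(c):
    x = c.get()     # recv c x (blocks until a value is available)
    c.task_done()   # releases the blocked sender
    log.append(f"e2({x}) runs with the received value")

c = queue.Queue()   # new c
threads = [
    threading.Thread(target=sender, args=(c, 42)),
    threading.Thread(target=receiver, args=(c,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(log))
```

Neither continuation can run before the communication happens, which is the blocking behavior that the talk argues is a poor fit for streaming query execution.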


8

Capturing query plans

Original query may be “decomposed” into subordinate queries running at different nodes. Recall

[Figure: query plan. Node A runs proj and join over R, node B runs sel over S, and the result is returned to the Client.]

Get from here… … to here


9

Query plan “development”

node A: proj (join R (sel S))

--[ meta-step (discovery) ]-->

node A: proj (join R@A (sel S@B))

--[ meta-step (discovery) ]-->

node A: proj (join tableR (sel S@B))

--[ meta-step (optimization) ]-->

node A: new c . proj (join tableR (split c (sel S@B)))

--[ operational semantics step ]-->

node A: new c . proj (join tableR (recv c))
        | send c (sel S@B)


10

A new primitive: split

new c . e1(split c (e2))   -->   new c . e1(recv c) | send c e2

(We are using a simplified form of recv since the value passed

through the channel is consumed in just one place.)

split is typically introduced by optimization steps.

Alternative (query shipping vs. data shipping):

node A: new c . proj (join tableR (sel (split c S@B)))
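The split rule can be read as a rewrite on query terms. The following Python sketch applies it to both plans above; the tuple encoding and the name `unfold_split` are assumptions made for illustration.

```python
def unfold_split(term):
    """Return (context, senders): the term with every `split c e` replaced
    by `recv c`, plus the list of `send c e` processes to run in parallel."""
    if isinstance(term, str):
        return term, []
    op, *args = term
    if op == "split":
        chan, body = args
        # The body may itself contain splits; unfold it before sending.
        body, inner = unfold_split(body)
        return ("recv", chan), [("send", chan, body)] + inner
    new_args, senders = [], []
    for a in args:
        t, s = unfold_split(a)
        new_args.append(t)
        senders.extend(s)
    return (op, *new_args), senders

# proj (join tableR (split c (sel S@B))): the selection travels with the sender.
plan_1 = ("proj", ("join", "tableR", ("split", "c", ("sel", "S@B"))))
# proj (join tableR (sel (split c S@B))): only the table reference is sent.
plan_2 = ("proj", ("join", "tableR", ("sel", ("split", "c", "S@B"))))

ctx_1, send_1 = unfold_split(plan_1)
ctx_2, send_2 = unfold_split(plan_2)
print(ctx_1, send_1)
print(ctx_2, send_2)
```

The two plans differ only in where split is placed, which decides whether the selection ends up inside the sending process or stays with the receiver.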


11

Distribution and migration

Each node handles a “soup” of parallel processes:

( node A: e1 | e2 ) || ( node B: e3 | e4 | e5 )

Process migration:

( node A: e1 | migrate B e2 ) || ( node B: e3 )

-->

( node A: e1 ) || ( node B: e3 | e2 )
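A small Python sketch of the migration rule, assuming the distributed system is modeled as a dictionary from node names to lists of processes. The representation and the name `migrate_step` are hypothetical.

```python
def migrate_step(soup):
    """Apply the migration rule once to every `migrate` process in the soup.
    `soup` maps a node name to the list of processes running there."""
    result = {node: [] for node in soup}
    for node, procs in soup.items():
        for p in procs:
            if isinstance(p, tuple) and p[0] == "migrate":
                _, target, body = p
                # The body leaves this node and joins the target's soup.
                result.setdefault(target, []).append(body)
            else:
                result[node].append(p)
    return result

# ( node A: e1 | migrate B e2 ) || ( node B: e3 )
before = {"A": ["e1", ("migrate", "B", "e2")], "B": ["e3"]}
after = migrate_step(before)
print(after)
```

The order of processes inside a node's list carries no meaning here, matching the fact that parallel composition is order-insensitive.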

Will be used to move subordinate queries.


12

Query plan, continued

node A: new c . proj (join tableR (recv c))
        | send c (sel S@B)

--[ meta-step (subordination) ]-->

node A: new c . proj (join tableR (recv c))
        | migrate B (send c (sel S@B))

--[ operational semantics step ]-->

new c@A . ( node A: proj (join tableR (recv c))
         || node B: send c@A (sel S@B) )

Global channels vs. located channels.

We can get by with located channels.


13

Distributed query evaluation

--[ meta-step (discovery) ]-->

new c@A . ( node A: proj (join tableR (recv c))

|| node B: send c@A (sel tableS) )

Should not block send/recv on the entire table (all-push). Data

should be streamed.

All-pull (“iterators”) blocks recv/send on each tuple: not good.

Push-pull (queued) is most reasonable.
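The push-pull (queued) discipline can be sketched in Python with a bounded queue between a producer thread (node B) and a consumer (node A). The table contents, the buffer size of 4, and the `END` marker are assumptions of this sketch, not details from the talk.

```python
import queue
import threading

END = object()  # end-of-stream marker (an assumption of this sketch)

def send_stream(chan, tuples):
    """Producer at node B: pushes tuples, blocking only when the buffer is full."""
    for t in tuples:
        chan.put(t)
    chan.put(END)

def recv_stream(chan):
    """Consumer at node A: pulls tuples, blocking only when the buffer is empty."""
    while True:
        t = chan.get()
        if t is END:
            return
        yield t

table_s = [{"k": i, "v": i * i} for i in range(10)]
table_r = {k: f"r{k}" for k in (2, 3, 5)}

chan = queue.Queue(maxsize=4)                   # bounded buffer between nodes
selected = (t for t in table_s if t["v"] > 3)   # sel tableS
producer = threading.Thread(target=send_stream, args=(chan, selected))
producer.start()

# proj (join tableR (recv c)): join on k, then project to (r, v).
result = [(table_r[t["k"]], t["v"])
          for t in recv_stream(chan) if t["k"] in table_r]
producer.join()
print(result)
```

The sender never waits for the whole table to be consumed and the receiver never requests tuples one at a time; each side blocks only on a full or empty buffer.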

[Figure: query plan. Node A runs proj and join over R, node B runs sel over S, and the result is returned to the Client.]


14

“Lay down the pipes … Turn on the faucets”

The approach of ubQL [Sahuguet & T., 2001]. Separate:

• deployment phase: queries are split, channels are created, subordinate query processes migrate carrying channel names with them

• execution phase: data is streamed through channels and processed according to subordinate queries

• adaptive behavior: execution can be interrupted and reset, channels can be flushed and reset, alternative plans can be started


15

Adaptive query processing

Another new primitive (like Cardelli’s “Web combinators”)

adapt stream AltPlan  =  stream, if adequate
                         AltPlan, if drying up (redundancy)

Example:

new c@A .

node A: proj (join tableR (adapt (recv c) AltPlan))

|| node B: send c@A (sel tableS)

where, e.g., AltPlan = new c’@A . split c’ (sel S@C)
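One possible reading of adapt, sketched in Python: keep consuming the primary stream while tuples arrive, and switch to the alternative plan when nothing arrives within a timeout. Treating "drying up" as a timeout, the value 0.05 seconds, and the use of `None` as an end marker are all assumptions of this sketch; it also ignores how an alternative plan would avoid repeating tuples already delivered.

```python
import queue

def adapt(stream, alt_plan, timeout=0.05):
    """Yield tuples from `stream` (a queue) while they keep arriving;
    if none arrives within `timeout` seconds, switch to `alt_plan()`."""
    while True:
        try:
            item = stream.get(timeout=timeout)
        except queue.Empty:
            # The stream is drying up: start the alternative plan.
            yield from alt_plan()
            return
        if item is None:    # None marks a normal end of stream
            return
        yield item

c = queue.Queue()
c.put(("s", 1))
c.put(("s", 2))             # node B delivered two tuples, then stalled

def alt_plan():
    # Stands in for: new c'@A . split c' (sel S@C)
    yield from [("s", 3), ("s", 4)]

result = list(adapt(c, alt_plan))
print(result)
```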


16

Distributed database systems: 25+ years

A lot of useful techniques, especially optimizations to reduce bandwidth. E.g., use of semijoins.

Essential limitation: they need a powerful, all-knowing node to generate the query plans.
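As a reminder of why semijoins reduce bandwidth, here is a minimal Python sketch. It assumes R lives at node A and S at node B and simulates the two transfers with ordinary function-local values; the sample tables and the name `semijoin_plan` are made up for illustration.

```python
def semijoin_plan(r_at_a, s_at_b, key):
    """Join R (at node A) with S (at node B) using a semijoin:
    ship only R's join keys to B, then ship back only matching S tuples."""
    keys = {r[key] for r in r_at_a}                     # A -> B: join keys only
    s_reduced = [s for s in s_at_b if s[key] in keys]   # B -> A: matching tuples
    joined = [{**r, **s}
              for r in r_at_a for s in s_reduced if r[key] == s[key]]
    return joined, len(keys), len(s_reduced)

table_r = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
table_s = [{"id": i, "b": i * 10} for i in range(100)]

joined, keys_shipped, tuples_shipped = semijoin_plan(table_r, table_s, "id")
print(joined, keys_shipped, tuples_shipped)
```

Only 2 keys and 2 tuples cross the network in this example, instead of the 100 tuples of S.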

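[Figure: semijoin-style plan across nodes A and B, with operators proj, join, sel and an additional proj over S and R, returning the result to the Client.]

These are also subordinate queries. Not quite decomposition… still easily expressible.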


17

Where is the data? Discovery!

• Distributed Database Systems

centralized, complete, consistent knowledge.

• No clue, go out and search everywhere (Gnutella)!

• Keyword-node indexes (catalogs) in each node.

• Finally something clever: Distributed Hash Tables

• Layered organization using views

(simple version in Mediated Information Integration: Tsimmis,

Kleisli, Information Manifold, K2, Garlic, etc.)
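For the Distributed Hash Tables item in the list above, the core idea can be sketched with consistent hashing, one common building block of DHTs. This is not a specific DHT protocol and not necessarily the scheme the talk has in mind; the class `Ring`, the helper `h`, and the node names are assumptions.

```python
import bisect
import hashlib

def h(key):
    """Hash a string to a point on the ring (SHA-1, as a big integer)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)
        self.hashes = [p for p, _ in self.points]

    def lookup(self, name):
        """Return the node responsible for `name`: the first node clockwise."""
        i = bisect.bisect_left(self.hashes, h(name)) % len(self.points)
        return self.points[i][1]

nodes = ["A", "B", "C", "D"]
ring = Ring(nodes)
owner = ring.lookup("R")
print("catalog entry for R is kept at node", owner)
```

Any node can compute `lookup("R")` locally, so no node needs complete knowledge of where every name lives.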


18

Views can organize distributed data

node C:
    S1 = expr( localTables )
    S2 = expr( localTables )

node B:
    R = expr( localTables, S2@C )

node A:
    query = expr( R@B, S1@C )

V = expr( S1@C, S2@C )

composition-with-views

rewriting-with-views
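Composition-with-views amounts to unfolding view names into their definitions. A minimal Python sketch, assuming acyclic view definitions and using placeholder operator names (`exprQ`, `exprR`, and so on) for the unspecified expr bodies:

```python
# Hypothetical view definitions, keyed by located view name.
VIEWS = {
    "R@B": ("exprR", "localTablesB", "S2@C"),
    "S1@C": ("exprS1", "localTablesC"),
    "S2@C": ("exprS2", "localTablesC"),
}

def unfold(term, views):
    """Replace every view name by its definition, recursively.
    Assumes the view definitions are acyclic."""
    if isinstance(term, str):
        return unfold(views[term], views) if term in views else term
    op, *args = term
    return (op, *(unfold(a, views) for a in args))

# query = expr( R@B, S1@C )
query = ("exprQ", "R@B", "S1@C")
unfolded = unfold(query, VIEWS)
print(unfolded)
```

Rewriting-with-views goes in the opposite direction, replacing subexpressions by a view such as V that already computes them, and is not covered by this sketch.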


19

New kinds of views

Generalize split:

e1(spawn e2 e3)   -->   e1(e2) | e3

split c e  =  spawn (recv c) (send c e)

View definition that invokes a remote continuous query:

R = new cr. spawn (recv cr) (send cq cr)

Continuous query “installed” elsewhere:

recv* cq x. send x (sel tableT)

cr is a data stream channel

cq is a standard pi-calculus channel
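The interplay of the two channel kinds can be sketched in Python: cq carries channel names, while cr carries a stream of tuples. The table contents, the selection predicate, and the `END` and `None` markers are assumptions of this sketch; a genuinely continuous query would keep streaming rather than end each reply.

```python
import queue
import threading

END = object()              # end-of-stream marker (assumption)
table_t = [3, 8, 1, 9, 4]

def continuous_query(cq):
    """recv* cq x . send x (sel tableT): serve every request arriving on cq."""
    while True:
        cr = cq.get()       # receive a reply (data stream) channel
        if cr is None:      # shutdown signal, used only by this sketch
            return
        for t in table_t:
            if t > 3:       # sel
                cr.put(t)
        cr.put(END)

def view_R(cq):
    """R = new cr . spawn (recv cr) (send cq cr)"""
    cr = queue.Queue()      # new cr
    cq.put(cr)              # spawned process: send cq cr
    while (t := cr.get()) is not END:   # recv cr, consumed as a stream
        yield t

cq = queue.Queue()
server = threading.Thread(target=continuous_query, args=(cq,), daemon=True)
server.start()
first = list(view_R(cq))
second = list(view_R(cq))   # the replicated receiver serves a second request
cq.put(None)
server.join()
print(first, second)
```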


20

Some tentative conclusions

• Very useful process primitives: parallel and sequential

composition, migration, channels.

• Need a new behavior for certain channels: streaming data (at least at the language level; underneath we have bounded buffers/queues).

• Standard channel semantics is still useful for passing small values or channel names, e.g., for continuous queries.

• Some new primitives should be considered: split/spawn, adapt.


21

Where to?

Service calls, active (dynamic) data, done by Gardner and

Maffeis.

Clarify semantics of execution phase so we can apply

verification techniques for adaptive query plans.

In the presence of redundancy, building globally optimal plans

seems hopeless. How to define local optimality? Is there even a

concept of “good enough”?

Combine layers of views with distributed hashing at the level of

schema?


22


23

An interesting complication: services

Service calls in query:

to applications that provide additional processing

(that cannot be programmed in the query language).

E.g., analysis tools, scientific (BLAST) or financial.

Service calls in data (active data):

part of the data is to be retrieved only on demand

E.g., Active XML

Service definition, result computation, and result consumption

all in potentially different nodes.


24

Mixing query features with process features

In principle we can take any query language (SQL, OQL,

XQuery) and add the process primitives we saw.

Really? The query language had better be nicely “compositional” and “referentially transparent”. SQL: lots of special cases…

Even better is to also have a robust static type system.

Things can get complicated:

foreach r in Dept@A
collect foreach s in Emp@B
        where s.name = r.manager
        collect {s.age}
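For readers more used to comprehensions, the nested query above has roughly the following shape in Python. The sample data is invented, both tables are held locally, and reading `collect` as building a nested collection is an assumption; the sketch shows only the nesting, not the distribution across nodes A and B.

```python
# Invented sample data standing in for Dept@A and Emp@B.
dept_at_a = [
    {"name": "db", "manager": "ann"},
    {"name": "pl", "manager": "bob"},
]
emp_at_b = [
    {"name": "ann", "age": 41},
    {"name": "bob", "age": 35},
    {"name": "cy", "age": 29},
]

# foreach r in Dept@A collect foreach s in Emp@B
#   where s.name = r.manager collect {s.age}
nested = [
    [s["age"] for s in emp_at_b if s["name"] == r["manager"]]
    for r in dept_at_a
]
print(nested)
```

The inner loop ranges over a remote collection once per outer tuple, which is where the placement of split and migration starts to matter.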


25

Where is the relevant data? (2)

Basic algorithm can be iterated and made useful to databases

(Ives). But, what to do with irregular pre-existing data with

complex schema?


26

Some tentative conclusions (2)

• Services are modeled in the same framework with little

effort. Service call arguments can also be carried as part of

migrating processes.

• Did not show it, but subscriptions, continuous queries,

caching, proxying (query redirecting) can be approached too.

Useful feature from process calculi: replicated processes.

• Well-designed query languages will mix nicely with process

calculi primitives.

• Operational semantics steps and meta-steps must be

interleaved (this makes me uncomfortable!)
