Ariadne: Managing Fine-Grained Provenance on Data Streams Boris Glavic 1 Kyumars Sheykh Esmaili 2 Peter M. Fischer 3 Nesime Tatbul 4 Illinois Institute of Technology 1 DBGroup Nanyang Technological University 2 SANDS University of Freiburg 3 Web Science Intel Labs 4 Intel Science and Technology Center for Big Data DEBS 2013 - July 7, 2013 - Arlington, USA

DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"

Nov 22, 2014


Boris Glavic

Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only to address complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality such as revision processing or query debugging. This paper introduces a novel approach that uses operator instrumentation, i.e., modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network. In addition to applying this technique to compute provenance eagerly during query execution, we also study how to decouple provenance computation from query processing to reduce run-time overhead and avoid unnecessary provenance retrieval. This includes computing a concise superset of the provenance, which allows lazily replaying a query network and reconstructing its provenance, as well as lazy retrieval to avoid unnecessary reconstruction of provenance. We develop stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne, our provenance-aware extension of the Borealis DSMS, implements these techniques. Our experiments confirm that Ariadne manages provenance with minor overhead and clearly outperforms query rewrite, the current state-of-the-art.
Transcript
Page 1: DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"

Ariadne: Managing Fine-Grained Provenance on Data Streams

Boris Glavic (1), Kyumars Sheykh Esmaili (2), Peter M. Fischer (3), Nesime Tatbul (4)

(1) Illinois Institute of Technology, DBGroup
(2) Nanyang Technological University, SANDS
(3) University of Freiburg, Web Science
(4) Intel Labs, Intel Science and Technology Center for Big Data

DEBS 2013 - July 7, 2013 - Arlington, USA


Outline

1 Motivation

2 Reduced Eager Operator Instrumentation

3 Optimizations

4 Experiments

5 Conclusions


Fine-grained Provenance for Data Stream Management

Provenance

• Information about the origin and creation process of data
• Here: On which inputs does a given output depend?

Fine-grained Provenance

• At the granularity of tuples
• Fix a tuple t in an output or intermediate stream
• On which input tuples does it depend?

Example

[Figure: query network with input streams S1 and S2 and output stream Sout]

Slide 1 of 18 Glavic, Sheykh Esmaili, Fischer, Tatbul - Ariadne: Motivation


Why (Fine-Grained) Stream Provenance?

Use cases

• Ad hoc inspection
  • DSMS generates alarm event based on sensor readings
  • Understand why alarm was raised to act appropriately
• Stream query debugging
  • Trace back and forward erroneous data items
• Auditing
  • Proof of correct operation
  • No access policies violated


Challenges and Opportunities

• Online and infinite data arrival
  • Traditional methods that reconstruct provenance retroactively are not applicable
• Ordered data model
  • Order can be exploited for compressing provenance
• Window-based processing
  • Aggregation is a prevalent operation that has large provenance
• Low-latency requirements
  • Provenance overhead should not violate latency requirements
• Non-determinism
  • Sources: load shedding, windowing on system time
  • Provenance computation cannot assume reproducibility of results


Related Work

Database Provenance

• Large body of work on provenance models
• Query rewrite-based approaches

Workflow Provenance

• Many systems support provenance tracking
• Mostly coarse-grained provenance

Data Stream Provenance

• Coarse-grained provenance
  • Not detailed enough for most use cases
• Fine-grained provenance
  • Stream debugging: One-step (one operator at a time)
  • Approximation techniques (strong assumptions)


How to generate provenance?

Inversion

• Infer provenance from query outputs

• Usually tracing back one operator at a time

• No storage overhead

• Only works for trivial operators


Annotation Propagation

• Annotate data with provenance

• Propagate and manipulate annotations during query processing

Propagation - Query rewrite

• Rewrite query network to propagate annotations

• Use original DSMS operators

• Easy to implement, applicable to all DSMS∗

• Often results in complex and inefficient networks

• Only applicable to deterministic networks


Propagation - Operator instrumentation

• Modify DSMS operators to propagate provenance information

• More efficient than query rewrite

• Tolerates certain types of non-determinism

• Keeps structure of the original query network intact

• Requires modification of the engine (DSMS source code needed)


Plan of attack

1) Provenance model

• Annotated tuples

• Annotated streams

2) Provenance generation and querying

• Operator instrumentation

• Buffer inputs + reconstruction for querying

• Implementation in Ariadne (extension of Borealis)

3) Optimizations

• Compression

• Laziness

• Decoupling provenance computation from query processing


Overview

Operator Instrumentation

• For each operator of the system create new versions
  • Provenance generator (PG): Generates provenance annotations
  • Provenance propagator (PP): Propagates provenance annotations
• Instrument query (or parts thereof) to generate provenance
  • Replace operators with annotating versions
  • Only in the part of the network for which we want to track provenance

Reduced Eager

• Eager
  • Propagate provenance annotations during query processing
• Reduced Eager
  • Propagate sets of tuple-IDs (TIDs) as annotations
  • Temporarily store inputs of relevant streams for restoring full tuples in provenance


Querying Provenance

• Translate provenance annotations into regular streaming data

• For one output tuple t annotated with provenance {t1, . . . , tn}
  • Output n tuples (t, t1), (t, t2), . . . , (t, tn)
• New operator (p-join) joins buffered input data with provenance annotations

• Use operators of the DSMS for querying

[Figure: query network from input streams S1 and S2 to output stream Sout. In the instrumented network the operators are replaced by PG and PP versions; temporary input storage buffers the inputs, and a p-join (⋈) reconstructs the provenance P.]

Stream Provenance Model

Provenance Model

• Provenance is modelled as a set of contributing tuples
  • From input or intermediate streams
• Provenance set P(t, I)
  • Tuple t in intermediate or result stream
  • Set of upstream streams I
• Which tuples belong to provenance?
  • Declarative definition that models provenance for single operators
  • Transitivity


Example

• Query over stream of temperature readings
  • Filter out outliers (temperature above threshold)
  • Compute average temperature over every consecutive window of two readings
• P(3:2, {2}) = {2:2, 2:3}
  • 3:2 generated by aggregating values from 2:2 and 2:3
• P(3:2, {1}) = {1:2, 1:4}
  • 2:2 and 2:3 derived from 1:2 and 1:4

[Figure: stream 1 → σ (filter) → stream 2 → α (aggregation) → stream 3]


Provenance Annotated Streams

• Each tuple in a provenance annotated stream (PAS) is annotated with its provenance set
  • according to a set of input streams I
• PAS P(O, I)
  • Provenance annotated stream for stream O according to set of upstream streams I


Example

• Query over stream of temperature readings
  • Filter out outliers (temperature above threshold)
  • Compute average temperature over every consecutive window of two readings
• Provenance annotated stream (PAS) P(3, {1})

[Figure: stream 1 → σ (filter) → stream 2 → α (aggregation) → stream 3]


Implementing Instrumented Operators

Approach

• Three instrumented versions for each operator

• Provenance generators: Initialize provenance annotations

• Provenance propagators: Propagate provenance annotations

• Provenance dropper: Drop provenance from input


Provenance Generator

• Computes one-step provenance for an operator’s output based on its input

• Attach input TIDs as provenance

• For windowing operators merge TIDs from window

• Before instrumentation: I → op → O

• After instrumentation: I → opPG → P(O, I )
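A minimal sketch of what provenance generators might look like, using the temperature example from the slides. The function names, the tuple layout `(tid, time, temp)`, and the outlier threshold of 350 are assumptions made for illustration; they are not taken from the Ariadne code base.

```python
def filter_pg(stream_id, inputs, predicate):
    """Hypothetical PG version of a filter: each output is annotated with the TID of its input."""
    out = []
    for tid, time, temp in inputs:
        if predicate(temp):
            new_tid = f"{stream_id}:{len(out) + 1}"
            out.append((new_tid, time, temp, frozenset({tid})))
    return out


def avg_pg(stream_id, inputs, size, slide=1):
    """Hypothetical PG version of a count-based window average: merge the TIDs of the window."""
    out = []
    starts = range(0, len(inputs) - size + 1, slide)
    for n, start in enumerate(starts, start=1):
        window = inputs[start:start + size]
        avg = sum(temp for _tid, _time, temp in window) / size
        prov = frozenset(tid for tid, _time, _temp in window)
        out.append((f"{stream_id}:{n}", avg, prov))
    return out


stream1 = [("1:1", 0, 102), ("1:2", 5, 105), ("1:3", 10, 399), ("1:4", 15, 85)]
stream2 = filter_pg(2, stream1, lambda temp: temp <= 350)  # threshold is an assumption
```

Here `stream2` is the annotated stream P(2, {1}): the outlier 1:3 is dropped and every surviving tuple carries the single input TID it was derived from.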


Provenance Propagator

• Compute output provenance set by combining provenance sets from the input

• Before instrumentation: P(I , I ′)→ op → O

• After instrumentation: P(I , I ′)→ opPP → P(O, I ′)

Provenance Dropper

• Removes provenance annotations

• Instrumentation: P(I , I ′)→ opPD → O
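The propagator and dropper can be sketched in the same style. The sketch below assumes annotated tuples of the form `(tid, time, temp, prov)`; `avg_pp` and `drop_provenance` are hypothetical names, not Ariadne operators.

```python
def avg_pp(stream_id, annotated, size, slide=1):
    """Hypothetical PP version of a window average: union the provenance sets of the window."""
    out = []
    starts = range(0, len(annotated) - size + 1, slide)
    for n, start in enumerate(starts, start=1):
        window = annotated[start:start + size]
        avg = sum(temp for _tid, _time, temp, _prov in window) / size
        prov = frozenset().union(*(p for _tid, _time, _temp, p in window))
        out.append((f"{stream_id}:{n}", avg, prov))
    return out


def drop_provenance(annotated):
    """Hypothetical PD step: strip the trailing provenance set from every tuple."""
    return [t[:-1] for t in annotated]


stream2 = [
    ("2:1", 0, 102, frozenset({"1:1"})),
    ("2:2", 5, 105, frozenset({"1:2"})),
    ("2:3", 15, 85, frozenset({"1:4"})),
]
stream3 = avg_pp(3, stream2, size=2)
```

Because the inputs already carry provenance with respect to stream 1, the output `stream3` is annotated with respect to stream 1 as well, which is the P(I, I′) → P(O, I′) behaviour stated on the slide.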


Instrumenting a Network

• User provides
  • query network q
  • stream O
  • set of streams I upstream from O
• Output is a network that computes P(O, I)

InstrumentNetwork(q, O, I)
  for all op on paths between I and O
    if op is connected to I
      replace op with PG version
    else
      replace op with PP version
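The InstrumentNetwork procedure can be sketched as a backward walk from O that stops at the streams in I. The network encoding (a dict from operator name to its input streams and output stream) is an assumption for illustration, and the sketch assumes an acyclic network in which every stream has at most one producer.

```python
def instrument_network(ops, O, I):
    """Return {operator: 'PG' | 'PP'} for all operators on paths between I and O.

    ops maps an operator name to (list of input streams, output stream).
    """
    I = set(I)
    producer = {out: name for name, (_ins, out) in ops.items()}

    def reaches(stream):
        # True if the stream is in I or is derived from a stream in I.
        if stream in I:
            return True
        op = producer.get(stream)
        return op is not None and any(reaches(s) for s in ops[op][0])

    versions = {}
    stack, visited = [O], set()
    while stack:
        stream = stack.pop()
        if stream in I or stream in visited or stream not in producer:
            continue
        visited.add(stream)
        op = producer[stream]
        ins = ops[op][0]
        if not any(reaches(s) for s in ins):
            continue  # operator is not on a path between I and O
        versions[op] = "PG" if any(s in I for s in ins) else "PP"
        stack.extend(ins)
    return versions


network = {"filter": (["1"], "2"), "avg": (["2"], "3")}
plan = instrument_network(network, "3", ["1"])
```

For the two-operator temperature network, asking for P(3, {1}) marks the filter as PG and the aggregation as PP, while asking for P(3, {2}) instruments only the aggregation, as a PG.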


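Example

• Network that computes P(3, {1}): σ runs as PG, α runs as PP

Stream 1 (input)
TID   time  temperature
1:1   0     102
1:2   5     105
1:3   10    399
1:4   15    85

Stream 3 (output, annotated)
TID   avg temp  provenance
3:1   103.5     {1:1, 1:2}
3:2   95        {1:2, 1:4}

Example

• Network that computes P(2, {1}): σ runs as PG, α runs as PD

Stream 1 (input)
TID   time  temperature
1:1   0     102
1:2   5     105
1:3   10    399
1:4   15    85

Stream 2 (intermediate, annotated)
TID   time  temperature  provenance
2:1   0     102          {1:1}
2:2   5     105          {1:2}
2:3   15    85           {1:4}

Stream 3 (output, provenance dropped)
TID   avg temp
3:1   103.5
3:2   95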


Ariadne Implementation

Based on Borealis

• Set of standard DSMS operators

• Operators connected through queues

• Fixed-length tuples

Changes to Borealis code

• Implement annotations

• Implement operator instrumentation

• Implement p-join and input buffering


Querying Provenance

Approach

1 Translate provenance annotations into regular stream data to query using the host language


P-join

• Joins a PAS P(O, I ) with buffered input tuples from I

• Combine output t with each tuple in its provenance P(t, I )
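A minimal sketch of the p-join idea, assuming the buffered input is available as a dict from TID to tuple and that a PAS tuple has the shape `(tid, value, prov)`. The name `p_join` mirrors the slide, but the signature and the handling of evicted tuples are assumptions.

```python
def p_join(pas, buffer):
    """Pair each annotated output tuple with every buffered input tuple in its provenance."""
    for tid, value, prov in pas:
        for in_tid in sorted(prov):
            if in_tid in buffer:  # tuples already evicted from the buffer cannot be restored
                yield (tid, value) + buffer[in_tid]


buffer = {
    "1:1": ("1:1", 0, 102),
    "1:2": ("1:2", 5, 105),
    "1:3": ("1:3", 10, 399),
    "1:4": ("1:4", 15, 85),
}
pas = [("3:2", 95.0, frozenset({"1:2", "1:4"}))]
rows = list(p_join(pas, buffer))
```

The result is ordinary stream data, one row per (output, contributing input) pair, so it can be filtered or aggregated with the regular operators of the DSMS.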


Example

• Tuple 3:2 with P(3:2, {1}) = {1:2, 1:4}

[Figure: instrumented network (σ as PG, α as PP) over the temperature streams 1, 2 and 3 with their annotated tuples; a p-join (⋈) joins the annotated output stream 3 with C1]


Rationale

Compression

• Reduce load on queues by using concise provenance representation

• Reduce memory imprint

Lazy Retrieval

• Avoid reconstructing unneeded provenance

Replay-Lazy

• Avoid generating unneeded provenance

• Decouple provenance computation from regular stream processing


Replay-Lazy

Idea

• Avoid provenance computation if provenance is not needed

• Propagate concise representation of a super-set of the provenance

• If provenance is requested replay super-set through provenance generating network


Covering Intervals

• Interval:
  • Minimal TID in provenance
  • Maximal TID in provenance
• Constant size super-set of provenance (piggy-back!)
• Provenance: {1, 2, 4, 5, 10, 16, 65}
  • Covering interval: [1, 65]
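The covering interval is simply the minimum and maximum TID of a provenance set, which can be sketched as follows. The helper names are hypothetical, and TIDs are assumed to be integers that follow stream order.

```python
def covering_interval(prov):
    """Smallest interval [lo, hi] that contains every TID of the provenance set."""
    return (min(prov), max(prov))


def merge_intervals(intervals):
    """Combine the covering intervals of several inputs, e.g. the tuples of one window."""
    intervals = list(intervals)
    return (min(lo for lo, _hi in intervals), max(hi for _lo, hi in intervals))


interval = covering_interval({1, 2, 4, 5, 10, 16, 65})
```

The interval always has two endpoints no matter how large the provenance set is, which is what makes it cheap to piggy-back on regular tuples. The price is that it is a superset: [1, 65] also covers TIDs such as 3 that are not in the provenance.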


Enabling Replay

• Covering interval operators:
  • CG corresponds to PG
  • CP corresponds to PP
• C-join operator ⊗
  • For each input get covering interval
  • Retrieve all tuples from covering interval from connection point
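A sketch of the retrieval step of a c-join, assuming the connection point is held as a list of `(tid, payload)` pairs sorted by integer TID. The function name and the in-memory layout are assumptions; the actual operator works against Borealis connection points.

```python
from bisect import bisect_left, bisect_right


def c_join(interval, connection_point):
    """Fetch all buffered tuples whose TID lies inside the covering interval [lo, hi]."""
    lo, hi = interval
    tids = [tid for tid, _payload in connection_point]
    return connection_point[bisect_left(tids, lo):bisect_right(tids, hi)]


connection_point = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
replay_input = c_join((2, 4), connection_point)
```

The fetched tuples are the input that would be replayed through the provenance generating copy of the network.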


Approach

1 Instrument original network to produce covering interval

2 Filter out results and c-join

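3 Feed result into provenance generating copy of the network

[Figure: three stages between S1 and Sout: the Covering Interval Generating Network (operators labeled CG and CP), the stage "Filter Provenance and Fetch Tuples from Input", and the Provenance Generating Network (operators labeled PG, PP and PD), which outputs the provenance P]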


Completion Time - Basic Network

• Send large batch of 100,000 tuples

• Measure completion time
  • Without Retrieval/With Retrieval
  • No Provenance/Instrumentation/Rewrite
• Using the Basic Network
  • Vary Aggregation Window Size
  • Slide fixed to 1
  • Increase window size → increase amount of provenance

[Figure: four network diagrams built from σ and α operators: Basic Query Network, Rewritten Version, Instrumented Version (PG and PP operators followed by a p-join), and Replay-Lazy Version (CG and CP operators, c-join, PG and PP operators, p-join)]


[Figure: Completion Time (sec, axis 0 to 45) versus Window Size (10, 20, 40, 60, 80, 100) for No Provenance, Instrumentation, Replay-Lazy, and Rewrite. Legend: No Provenance, Instrumentation (Generation), Instrumentation (Retrieval), Replay-Lazy (Covering Interval), Replay-Lazy (Retrieval), Rewrite.]


Conclusions

Reduced-Eager Operator Instrumentation

• Novel provenance propagation method for DSMS
• Replace operators in a query network with provenance-aware (instrumented) versions
• Keeps original network structure intact
• Deals well with non-determinism
• Flexible:
  • Single- and multi-hop provenance
  • Instrument parts of a network


Optimizations

• Replay-Lazy
  • Propagate only covering intervals - replay through provenance generating copy of network
  • Reduce cost for low retrieval rates
  • Option to decouple provenance computation
• Lazy-Retrieval
  • Push filters through p-joins
  • Avoids provenance reconstruction
• Compression
  • Interval
  • Delta
  • Dictionary


Future Work

Optimizing Temporary Input Storage

• Exploiting additional knowledge in purging temporary input storage
  • Static query network analysis
  • Self-optimizing purging strategies
  • Compression and Approximations

Integration with Distributed Storage and Scaling-out

• Using distributed write-optimized storage for provenance?

Order-aware Provenance Model

• Integrate order into the provenance model
• “Tuple t is after u in the result, because t’s provenance is before u’s provenance in the input.”


Questions?

Boris Glavic
IIT
[email protected]

Kyumars Sheykh Esmaili
Nanyang University
[email protected]

Peter M. Fischer
University of Freiburg, Web Science
peter.fischer@cs.uni-freiburg.de

Nesime Tatbul
Intel Labs, Intel Science and Technology Center for Big Data
[email protected]

http://cs.iit.edu/~dbgroup/research/ariadne.php


Outline

6 Implementation

7 Additional Experiments

8 Optimizations


How to Implement Annotations?

Alternatives for sending annotations

• Borealis models streams as queues of fixed-length tuples with header and payload

1 Variable-length tuples: Negates optimizations based on fixed-length
2 Split provenance sets into fixed-length tuples
  • Send provenance set after each output tuple
  • First tuple has small header to tell downstream operators how many provenance tuples will follow
  • Additional provenance tuples only store payload (TIDs)
3 New information channels: Complex interactions with normal query processing
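Alternative 2 can be illustrated with a small sketch that cuts a provenance set into fixed-length payloads and reports how many provenance tuples follow. The function names, the `payload_slots` parameter, and the use of `None` as padding are assumptions for illustration; the actual Borealis tuple layout is not shown in the slides.

```python
def split_provenance(tids, payload_slots):
    """Return (number of provenance tuples, list of fixed-length TID payloads)."""
    tids = sorted(tids)
    chunks = [tids[i:i + payload_slots] for i in range(0, len(tids), payload_slots)]
    # Pad the last chunk so that every provenance tuple has the same length.
    padded = [chunk + [None] * (payload_slots - len(chunk)) for chunk in chunks]
    return len(padded), padded


def read_provenance(count, tuples):
    """Rebuild the provenance set from the next `count` provenance tuples."""
    return {tid for chunk in tuples[:count] for tid in chunk if tid is not None}


count, payloads = split_provenance({1, 2, 4, 5, 10}, payload_slots=2)
```

A downstream operator would read `count` from the header of the first tuple and then consume exactly that many payload-only tuples before treating the next tuple as regular data.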


Implementing Instrumented Operators

Approach

• Factor out common functionality
  • Serializing/deserializing provenance to/from queues
  • Caching provenance of windows
  • Merging provenance sets
• Implemented in Provenance Wrapper
  • Operators access queues through the wrapper
  • Reduces code changes to each operator
• LOC
  • Provenance wrapper: ~8000 LOC
  • Instrumented operators: aggregation (largest) 200 LOC


Temporary Input Storage

Connection Points

• Borealis feature for storing tuples from a stream
• Time-based or count-based eviction strategies
• Content can be joined with streams

[Figure: instrumented network (PG and PP operators) from S1 and S2 to Sout, with temporary input storage feeding a p-join (⋈) that reconstructs the provenance P]


Varying Window Size

• Send large batch of 100,000 tuples

• Measure completion time
  • No Retrieval
  • No Provenance/Single/Optimized/Covering Interval
• Using the Basic Network
  • Vary Window Size
  • Slide fixed to 1

[Figure: four network diagrams built from σ and α operators: Basic Query Network, Rewritten Version, Instrumented Version (PG and PP operators followed by a p-join), and Replay-Lazy Version (CG and CP operators, c-join, PG and PP operators, p-join)]


[Figure: Completion Time (sec, axis 0 to 100) versus Window Size (50, 100, 200, 500, 1000, 2000) for No Provenance, Single, Optimized, and Covering Interval]


Vary Retrieval Frequency

• Send large batch of 100,000 tuples

• Measure completion time
  • Instrumentation with Retrieval
  • Replay-Lazy with Retrieval
• Using a sequence of aggregations
  • Vary Retrieval Frequency: selectivity of filter before p-join
  • Slide fixed to 1, Window size fixed to 100

[Figure: network diagrams of the Instrumented Version and the Replay-Lazy version]


[Figure: Completion Time / Completion Time without Provenance (axis 1 to 25) versus Retrieval Frequency (%) (0.05 to 100) for Instrumentation with Retrieval (Optimized) and Replay-Lazy with Retrieval (Optimized)]


Latency

• In Ariadne inputs are sent in batches with a fixed number of tuples (parameter)
• Measure latency
  • Using the Basic Network
• Vary the load
  • Vary batch size and fix the delay between consecutive batches


[Figure: Latency (ms, axis 0 to 7) versus Batch Size (10, 25, 50, 75, 100) for No Provenance, Single, Optimized, and Covering Intervals]


Nested Aggregations

• Sequence of aggregation operators

• Measure Completion time

• Vary Number of aggregation operators

Method                      Number of Aggregations
                            1      2      3      4
No Provenance               3.1    3.9    4.8    5.7
Instr.       Generation     3.9    7.4    14.7   48.6
             Retrieval      3.0    12.9   103.0  2047.0
Replay-Lazy  Cov. Inter.    3.1    4.4    5.2    6.3
             Retrieval      5.2    14.7   91.1   2224.0
Rewrite                     7.2    625.0  crash  crash


Compression

Rationale

• Reduce amount of data that is shipped between operators
• Requirements
  • Fast compression and decompression
  • Operations such as merging sets on compressed data

Interval Compression

• Input: {1, 2, 4, 5, 6, 7, 9}
• Output: {[1−2], [4−7], [9−9]}

Dictionary Compression

• Input: {1, 2, 4, 5}
• Output: LZ77({1, 2, 4, 5})
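Interval compression can be sketched as collapsing runs of consecutive TIDs into closed intervals. The helper names are hypothetical and TIDs are assumed to be integers; for the input {1, 2, 4, 5, 6, 7, 9} the sketch yields the intervals [1-2], [4-7] and [9-9].

```python
def compress_intervals(tids):
    """Collapse a set of integer TIDs into a sorted list of closed intervals (lo, hi)."""
    intervals = []
    for tid in sorted(tids):
        if intervals and tid == intervals[-1][1] + 1:
            intervals[-1][1] = tid  # extend the current run
        else:
            intervals.append([tid, tid])  # start a new run
    return [tuple(interval) for interval in intervals]


def decompress_intervals(intervals):
    """Expand closed intervals back into the set of TIDs."""
    return {tid for lo, hi in intervals for tid in range(lo, hi + 1)}


compressed = compress_intervals({1, 2, 4, 5, 6, 7, 9})
```

This representation suits window-based operators with a small slide, since the provenance of a window tends to be one or a few contiguous runs of TIDs.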


Delta Compression

• Express provenance as delta to down-stream provenance
  • Once in a while send full provenance
  • Express provenance as delta to previous full provenance
• Input: {1, 2, 4, 7, 9} ← {4, 7, 9, 10} ← {7, 9, 10, 12, 15, 19}
  • Output: {1, 2, 4, 7, 9} ← 3−{10} ← 2−{10, 12, 15, 19}
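One reading of the notation k−{...} in the example is "keep the last k TIDs of the previous full provenance, then append the listed TIDs". The sketch below implements that interpretation, which is an assumption and may differ from the encoding used in Ariadne; provenance is represented as sorted lists and the function names are hypothetical.

```python
def delta_encode(full, current):
    """Encode `current` as (k, new): the last k TIDs of `full` followed by `new`."""
    best = 0
    for k in range(1, min(len(full), len(current)) + 1):
        if full[-k:] == current[:k]:
            best = k
    return best, current[best:]


def delta_decode(full, delta):
    """Rebuild the provenance from the previous full provenance and a delta."""
    k, new = delta
    return (full[-k:] if k else []) + new


full = [1, 2, 4, 7, 9]
first = delta_encode(full, [4, 7, 9, 10])
second = delta_encode(full, [7, 9, 10, 12, 15, 19])
```

Both deltas are expressed against the same full provenance, so a receiver only needs to remember the most recent full set to decode any later delta.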

Heuristic Adaptive Compression

• No one fits all

• Combine compression methods

• Heuristic rules for when to apply which method


Retrieval-Lazy

Approach

• User query that filters out unneeded provenance

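• Avoid expensive provenance reconstruction if provenance is not needed
• If possible filter before reconstruction (p-join)
  • Push filters through p-joins

[Figure: instrumented network (PG and PP operators) with a filter and a p-join (⋈), illustrating the filter being pushed through the p-join]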
