
Lineage Tracing in Data Warehouses Yingwei Cui Stanford University Database Group.

Dec 21, 2015

Page 1: Lineage Tracing in Data Warehouses Yingwei Cui Stanford University Database Group.

Lineage Tracing in Data Warehouses

Yingwei Cui

Stanford University Database Group

Page 2: Motivation: Data Warehousing

[Figure: a data warehouse integrates Source 1, Source 2, and Source 3, which hold the Students, Enrollments, and Courses tables.]

Warehouse view "Lucrative Fields":
Databases  $8800K
Theory     $320K
Networks   $800K

User reaction to the row (Databases, $8800K): "Wow?!"

Page 3

[Figure: a lineage tracer connects the data warehouse to Source 1, Source 2, and Source 3. It traces the "Lucrative Fields" row (Databases, $8800K) back to the source rows that produced it. User reaction: "Oh, I see..."]

Warehouse view "Lucrative Fields": (Databases, $8800K), (Theory, $320K), (Networks, $800K)

Courses: (CS145, Databases), (CS154, Theory), (CS244, Networks), (CS245, Databases), …
Enrollments: (CS145, Ted), (CS154, Joe), (CS244, Bob), (CS145, Ann), (CS245, Jane), …
Students: (Bob, MS, $1K), (Jane, Web, $5K), (Ann, BS, $1K), (Joe, BS, $1K), (Ted, Web, $5K), …

Page 4: The Data Lineage Problem

Data warehouses integrate data from multiple sources for analysis and mining

Data lineage: given a data item o in the warehouse, which data items in the sources were used to derive o?

Sometimes called “drill-through” in industry

Page 5: Challenges

Warehouse of relational views over relational sources
– What is a good formal definition for lineage?
– How do we trace data lineage for arbitrary views?
– How do we make it efficient?

Warehouse defined by graph of data transformations
– No fixed, well-defined relational operators
– Large transformation sequences and graphs

Page 6: Contributions

Thesis contributions
– Basics of lineage tracing for relational views [TODS’00]
– Lineage tracing system prototype [ICDE’00 demo]
– Performance study and optimizations [ICDE’00, DMDW’00]
– Lineage tracing for general data transformations [VLDB’01]
– View update for deletions using data lineage [TechReport’01]

Other contributions (joint with others)
– Data warehousing performance issue [VLDB’00]
– Data management for wireless networks [Infocom’98, Globecom’97]

Page 7: Outline of Talk

Part 1: Lineage tracing for relational views

Part 2: Lineage tracing for general data transformations

Part 3: View update for deletions using data lineage (time permitting)

Page 8: Part 1: Lineage Tracing for Relational Views

Declarative definition of data lineage

Lineage tracing algorithms

Using auxiliary views for efficient lineage tracing

Experimental results (small sample)

Page 9: Views We Consider

Relational algebra

Arbitrary use of aggregation

Set semantics

Also in thesis
– Set operators
– Bag semantics

[Figure: view V defined over base tables R, S, and T.]

Page 10: Simple Lineage Example

$V = \alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S))$

R (X, Y): (3, a), (8, b)
S (Y, Z): (a, 2), (b, 0), (b, 9), (b, 6)
$T = R \bowtie S$ (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 9), (8, b, 6)
$U = \sigma_{X>Z}(T)$ (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 6)
V (Y, sum): (a, 2), (b, 6)

Lineage of (b, 6) in V:
– in U: (8, b, 0), (8, b, 6)
– in T: (8, b, 0), (8, b, 6)
– in R: (8, b)
– in S: (b, 0), (b, 6)

Page 11: Lineage for Relational Operators

Unary relational operators

[Figure: operator op applied to input R; output tuple t is derived from the subset R* of R.]

Lineage of t according to op is the maximal subset $R^* \subseteq R$ such that
(1) $op(R^*) = \{t\}$
(2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$
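The definition above can be checked directly on small inputs. The following is a minimal brute-force sketch, not the tracing algorithm presented later in the talk: it enumerates subsets of R from largest to smallest and returns the first one that satisfies conditions (1) and (2). The function names and the sample rows are illustrative assumptions; the rows match the simple lineage example.

```python
from itertools import combinations

def lineage(op, relation, t):
    """Largest subset R* of relation with op(R*) == {t} and op({r}) non-empty for every r in R*."""
    rows = sorted(relation)
    for size in range(len(rows), 0, -1):            # try the largest subsets first
        for subset in combinations(rows, size):
            candidate = set(subset)
            if op(candidate) == {t} and all(op({r}) for r in candidate):
                return candidate
    return set()

def select_x_gt_z(rows):
    """Selection X > Z over (X, Y, Z) rows."""
    return {(x, y, z) for (x, y, z) in rows if x > z}

def sum_z_by_y(rows):
    """Aggregation: group by Y, sum Z, over (X, Y, Z) rows."""
    totals = {}
    for (_, y, z) in rows:
        totals[y] = totals.get(y, 0) + z
    return set(totals.items())

joined = {(3, "a", 2), (8, "b", 0), (8, "b", 9), (8, "b", 6)}
selected = select_x_gt_z(joined)

selection_lineage = lineage(select_x_gt_z, joined, (8, "b", 6))
aggregation_lineage = lineage(sum_z_by_y, selected, ("b", 6))
print(selection_lineage)
print(aggregation_lineage)
```

Condition (2) is what excludes (8, b, 9) from the selection lineage: the pair {(8, b, 9), (8, b, 6)} satisfies condition (1), but (8, b, 9) alone produces nothing. The enumeration is exponential in |R| and is meant only to illustrate the definition.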

Page 12: Lineage for Relational Operators

Example 1: $op = \sigma_{X>Z}$

R (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 9), (8, b, 6)
op(R) (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 6)

Lineage of t = (8, b, 6): R* = {(8, b, 6)}

Page 13: Lineage for Relational Operators

Example 2: $op = \alpha_{Y,\,\mathrm{sum}(Z)}$

R (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 6)
op(R) (Y, sum): (a, 2), (b, 6)

Lineage of t = (b, 6): R* = {(8, b, 0), (8, b, 6)}

The lineage is the maximal subset satisfying the definition, so (8, b, 0) is included even though it adds 0 to the sum.

Page 14: Lineage for Relational Operators

N-ary relational operators (e.g., $\bowtie$)

Lineage of t according to op is the maximal subsets $R_i^* \subseteq R_i$ for $i = 1..n$ such that
(1) $op(R_1^*, \dots, R_n^*) = \{t\}$
(2) $\forall t_i^* \in R_i^*: op(R_1, \dots, \{t_i^*\}, \dots, R_n) \neq \emptyset$

[Figure: operator op applied to inputs R1 and R2, with lineage subsets R1* and R2*.]

Page 15: Lineage for Relational Views

Lineage of a tuple set is union of lineage of each tuple in the set

Lineage for views is defined recursively

[Figure: view V is defined by operators op1 and op2 over intermediate result U and base tables R1 and R2; tuple t in V is traced to U* and from there to R1* and R2*.]

Lineage of t is R1*, R2*

Page 16: Lineage Tracing

Convert view into a segmented normal form
– Each segment is an ASPJ expression over E1 … En

Generate one tracing query for each segment

Apply tracing queries recursively
– # non-top aggregations + 1

Lineage result is unaffected by normalization and segment-level tracing

Page 17: Tracing Query for One Segment

$V = \alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S))$

R (X, Y): (3, a), (8, b)
S (Y, Z): (a, 2), (b, 0), (b, 9), (b, 6)
V (Y, sum): (a, 2), (b, 6)

Tracing query for t = (b, 6):
$TQ = \mathrm{Split}_{R,S}(\sigma_{X>Z \wedge Y=b}(R \bowtie S))$

$\sigma_{X>Z \wedge Y=b}(R \bowtie S)$ (X, Y, Z): (8, b, 0), (8, b, 6)

R* = {(8, b)}, S* = {(b, 0), (b, 6)}
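The tracing query for this segment can be sketched in a few lines. This is an illustrative sketch, assuming the Split operator projects the qualifying joined rows back onto the schema of each base table; the function names are hypothetical and the data is taken from the slide.

```python
def join_r_s(r_rows, s_rows):
    """Natural join of R(X, Y) and S(Y, Z) on the shared attribute Y."""
    return {(x, y, z) for (x, y) in r_rows for (y2, z) in s_rows if y == y2}

def tracing_query(r_rows, s_rows, group_value):
    """Split over R and S of the joined rows with X > Z and Y = group_value."""
    qualifying = {
        (x, y, z)
        for (x, y, z) in join_r_s(r_rows, s_rows)
        if x > z and y == group_value
    }
    r_star = {(x, y) for (x, y, _) in qualifying}   # project onto R's schema
    s_star = {(y, z) for (_, y, z) in qualifying}   # project onto S's schema
    return r_star, s_star

R = {(3, "a"), (8, "b")}
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}

r_star, s_star = tracing_query(R, S, "b")
print(r_star, s_star)
```

The row (b, 9) of S is not part of the lineage because the joined row (8, b, 9) fails the selection X > Z.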

Page 18: Recursive Tracing Procedure

$V = \alpha_{W,\,\mathrm{avg}(\mathrm{sum})}(\alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S)) \bowtie T)$

R (X, Y): (3, a), (8, b)
S (Y, Z): (a, 2), (b, 0), (b, 9), (b, 6)
T (Y, W): (a, p), (b, p), (b, q)
U (Y, sum): (a, 2), (b, 6)
V (W, avg): (p, 4), (q, 6)

Tracing t = (q, 6):
$TQ_1 = \mathrm{Split}_{U,T}(\sigma_{W=q}(U \bowtie T))$ returns U* = {(b, 6)}, T* = {(b, q)}
$TQ_2 = \mathrm{Split}_{R,S}(\sigma_{X>Z \wedge Y=b}(R \bowtie S))$ returns R* = {(8, b)}, S* = {(b, 0), (b, 6)}

R* = {(8, b)}, S* = {(b, 0), (b, 6)}, T* = {(b, q)}

Page 19: Making It Efficient

Source accesses are usually expensive or impossible

Need some intermediate results for lineage tracing

Store auxiliary views at the warehouse
– Reduce or eliminate source accesses
– Reduce recomputation of intermediate results

Page 20: Auxiliary Views

There are many possible auxiliary views

For single-segment views
– Identified 10 possible auxiliary view schemes
– Studied performance tradeoffs

For arbitrary views
– Hard optimization problem
– Exhaustive and heuristic algorithms
– Performance study

Page 21: Auxiliary Views: Performance Tradeoffs

+ Always improve lineage tracing

– Must be maintained when sources change

+ Can also help with maintenance of original user views

Page 22: Auxiliary View Schemes for Single-Segment Views

Parameters:
- 3-way SPJ view
- sources: 10MB each
- disk: 1Mbps
- network: 50kbps
- 1000 operations
- q/u ratio = 4

Measurements:
- tracing time
- maintenance time

Page 23: Auxiliary View Selection Algorithms for Arbitrary Views

Page 24: Part 2: Transformation Graphs

Lineage definition

Tracing algorithms

Combining transformations for lineage tracing

Experimental results (tiny sample)

[Figure: a graph of transformations T1 to T6 loads the data warehouse from Source 1, Source 2, and Source 3.]

Page 25: Transformation Example

Order:
id  cust  date      prod-list
1   A     2/8/99    1(10),2(10)
2   C     4/5/99    2(5),3(10)
3   D     6/1/99    1(20),2(10)
4   B     8/6/99    1(10),3(5)
5   D     10/8/99   1(5),3(10)
6   B     12/1/99   2(10),3(10)

Product:
id  name  price  valid
1   imac  1200   10/1/98-
2   vaio  2400   6/1/98-9/1/99
2   vaio  1800   9/2/99-
3   palm  500    2/1/98-7/1/98
3   palm  400    7/2/98-9/1/99
3   palm  300    9/2/99-

SalesJump:
name  avg3  Q4
palm  2K    6K

[Figure: transformations T1 to T7 derive SalesJump from Order and Product. The transformations are labeled split, selection, “join”, pivot, projection, selection, and projection.]

Lineage of the SalesJump row (palm, 2K, 6K):
– Product: (3, palm, 400, 7/2/98-9/1/99), (3, palm, 300, 9/2/99-)
– Order: (2, C, 4/5/99, 2(5),3(10)), (4, B, 8/6/99, 1(10),3(5)), (5, D, 10/8/99, 1(5),3(10)), (6, B, 12/1/99, 2(10),3(10))

Page 26: Lineage for General Transformations

A transformation can be an arbitrary program

[Figure: a transformation T whose implementation could be any of the following.]
select … from … where …
main(int argc, char** argv) {…}
sed “s/string1/string2/g” …

– One extreme: relational operators
– Another extreme: we know nothing about T
– Middle ground: based on transformation properties

Page 27: Transformation Properties

Transformation classes

Additional properties
– Transformation subclasses
– Schema information
– Provided inverse or tracing procedure

Page 28: Transformation Classes

dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$
– $T^*(o) = \{i \mid o \in T(\{i\})\}$

Page 29: Dispatcher Example

T1 maps Order to O1.

Order:
id  cust  date      prod-list
1   A     2/8/99    1(10),2(10)
2   C     4/5/99    2(5),3(10)
3   D     6/1/99    1(20),2(10)
4   B     8/6/99    1(10),3(5)
5   D     10/8/99   1(5),3(10)
6   B     12/1/99   2(10),3(10)

O1:
id  cust  date      pid  quant
1   A     2/8/99    1    10
1   A     2/8/99    2    10
:   :     :
5   D     10/8/99   1    5
5   D     10/8/99   3    10
6   B     12/1/99   2    10
6   B     12/1/99   3    10

Lineage of the O1 row (5, D, 10/8/99, 3, 10): the Order row (5, D, 10/8/99, 1(5),3(10))
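The dispatcher definition translates almost literally into code. The sketch below assumes T1 splits each Order row into one row per prod-list entry, as the tables suggest; `split_order` and `trace_dispatcher` are hypothetical names, and the tracing function simply applies T to each input item on its own.

```python
def split_order(order_rows):
    """Assumed T1: emit one (id, cust, date, pid, quant) row per prod-list entry."""
    out = set()
    for (oid, cust, date, prod_list) in order_rows:
        for entry in prod_list.split(","):
            pid, quant = entry.rstrip(")").split("(")
            out.add((oid, cust, date, int(pid), int(quant)))
    return out

def trace_dispatcher(transform, input_rows, o):
    """Lineage of o for a dispatcher: every input item i with o in T({i})."""
    return {i for i in input_rows if o in transform({i})}

ORDER = {
    (1, "A", "2/8/99", "1(10),2(10)"),
    (2, "C", "4/5/99", "2(5),3(10)"),
    (3, "D", "6/1/99", "1(20),2(10)"),
    (4, "B", "8/6/99", "1(10),3(5)"),
    (5, "D", "10/8/99", "1(5),3(10)"),
    (6, "B", "12/1/99", "2(10),3(10)"),
}

traced = trace_dispatcher(split_order, ORDER, (5, "D", "10/8/99", 3, 10))
print(traced)
```

The procedure calls T once per input item, which matches the O(|I|) cost listed for dispatchers in the tracing procedures table.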

Page 30: Transformation Classes

dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$
– $T^*(o) = \{i \mid o \in T(\{i\})\}$

aggregator: $\forall I$ and $T(I) = \{o_1 \dots o_n\}$: $\exists$ unique partition $I_1..I_n$ of $I$ s.t. $T(I_k) = \{o_k\}$
– $T^*(o_k) = I_k$

Page 31: Aggregator Example

T4 maps O3 to O4.

O3:
oid  name  date      price  quant
1    imac  2/8/99    1200   10
1    vaio  2/8/99    2400   10
2    vaio  4/5/99    2400   5
2    palm  4/5/99    400    10
3    imac  6/1/99    1200   20
3    vaio  6/1/99    2400   10
4    imac  8/6/99    1200   10
4    palm  8/6/99    400    5
5    imac  10/8/99   1200   5
5    palm  10/8/99   300    10
6    vaio  12/1/99   1800   10
6    palm  12/1/99   300    10

O4:
name  Q1   Q2   Q3   Q4
imac  12K  24K  12K  6K
vaio  24K  12K  24K  18K
palm  0K   4K   2K   6K

Lineage of the O4 row (palm, 0K, 4K, 2K, 6K): the O3 rows (2, palm, 4/5/99, 400, 10), (4, palm, 8/6/99, 400, 5), (5, palm, 10/8/99, 300, 10), (6, palm, 12/1/99, 300, 10)

Page 32: Transformation Classes

dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$
– $T^*(o) = \{i \mid o \in T(\{i\})\}$

aggregator: $\forall I$ and $T(I) = \{o_1 \dots o_n\}$: $\exists$ unique partition $I_1..I_n$ of $I$ s.t. $T(I_k) = \{o_k\}$
– $T^*(o_k) = I_k$

black-box: all others
– $T^*(o) = I$

Page 33: Transformation Classes

Most transformations are dispatchers, aggregators, or their compositions

A transformation can be both dispatcher and aggregator
– Lineage definitions are equivalent

Transformations can be relational operators
– Lineage definitions same as relational definitions

Page 34: Transformation Properties

Transformation classes

Additional properties
– Transformation subclasses
– Schema information
– Provided inverse or tracing procedure

Page 35: Transformation Subclasses

Permit more efficient lineage tracing

Filter is a special dispatcher
– Each input data item produces itself or nothing

Context-free aggregator
– Whether two input data items are in the same partition is independent of other items

Key-preserving aggregator
– Any subset of an input partition always produces the same output key

Page 36: Tracing Example: Aggregators

Consider T(I) = {o1…on}

Tracing the lineage of o for aggregator
– Partition input I into I1…In such that T(Ik) = {ok}
– Return Ik such that T(Ik) = {o}

Tracing the lineage of o for context-free aggregator
– Partition input I into I1…In such that |T(Ik)| = 1
– Return Ik such that T(Ik) = {o}
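For a context-free aggregator, the partition can be built from pairwise tests, because whether two items share a partition does not depend on the rest of the input. The sketch below is one possible reading of that procedure, not necessarily the TraceCF procedure from the thesis: an item joins an existing partition when T applied to it and that partition's first member yields a single output. The aggregator `total_sales_by_name` and the sample rows are assumptions for illustration.

```python
def total_sales_by_name(rows):
    """Assumed context-free aggregator: total price * quant per product name."""
    totals = {}
    for (_, name, _, price, quant) in rows:
        totals[name] = totals.get(name, 0) + price * quant
    return set(totals.items())

def trace_context_free(transform, input_rows, o):
    """Partition the input with pairwise calls to T, then return the partition that produces o."""
    partitions = []
    for item in input_rows:
        for part in partitions:
            # Same partition if T maps the pair to exactly one output item.
            if len(transform({part[0], item})) == 1:
                part.append(item)
                break
        else:
            partitions.append([item])
    for part in partitions:
        if transform(set(part)) == {o}:
            return set(part)
    return set()

SALES = [
    (1, "imac", "2/8/99", 1200, 10),
    (2, "palm", "4/5/99", 400, 10),
    (3, "imac", "6/1/99", 1200, 20),
    (4, "palm", "8/6/99", 400, 5),
    (5, "palm", "10/8/99", 300, 10),
    (6, "palm", "12/1/99", 300, 10),
]

palm_lineage = trace_context_free(total_sales_by_name, SALES, ("palm", 12000))
print(palm_lineage)
```

In the worst case every item is compared against every partition, so the number of calls to T grows quadratically with |I|, in contrast to the exhaustive partition search needed for a general aggregator.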

Page 37: Schema Information

Input schema A=(A1…An) and key Akey

Output schema B=(B1…Bn) and key Bkey

Schema mappings: $f(A) \rightarrow B$ and $A \leftarrow g(B)$

Transformations with special schema mappings
– Forward key-map: $f(A) \rightarrow B_{key}$
– Backward key-map: $A_{key} \leftarrow g(B)$
– Backward total-map: $A \leftarrow g(B)$

Page 38: Tracing Example: Forward Key-Maps

T4 maps O3 to O4.

O3:
oid  name  date      price  quant
1    imac  2/8/99    1200   10
1    vaio  2/8/99    2400   10
2    vaio  4/5/99    2400   5
2    palm  4/5/99    400    10
3    imac  6/1/99    1200   20
3    vaio  6/1/99    2400   10
4    imac  8/6/99    1200   10
4    palm  8/6/99    400    5
5    imac  10/8/99   1200   5
5    palm  10/8/99   300    10
6    vaio  12/1/99   1800   10
6    palm  12/1/99   300    10

O4:
name  Q1   Q2   Q3   Q4
imac  12K  24K  12K  6K
vaio  24K  12K  24K  18K
palm  0K   4K   2K   6K

Lineage of the O4 row (palm, 0K, 4K, 2K, 6K): the O3 rows (2, palm, 4/5/99, 400, 10), (4, palm, 8/6/99, 400, 5), (5, palm, 10/8/99, 300, 10), (6, palm, 12/1/99, 300, 10)
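With a forward key-map, tracing needs no calls to T: one scan of the input selects the items whose mapped attributes equal the output key. The sketch below assumes that f is the identity on the name attribute and that name is the key of O4; the function name `trace_forward_key_map` is hypothetical.

```python
def trace_forward_key_map(input_rows, forward_map, output_key):
    """Return the input items i with forward_map(i) equal to the key of the traced output item."""
    return [row for row in input_rows if forward_map(row) == output_key]

# O3 rows as (oid, name, date, price, quant), copied from the slide.
O3 = [
    (1, "imac", "2/8/99", 1200, 10), (1, "vaio", "2/8/99", 2400, 10),
    (2, "vaio", "4/5/99", 2400, 5), (2, "palm", "4/5/99", 400, 10),
    (3, "imac", "6/1/99", 1200, 20), (3, "vaio", "6/1/99", 2400, 10),
    (4, "imac", "8/6/99", 1200, 10), (4, "palm", "8/6/99", 400, 5),
    (5, "imac", "10/8/99", 1200, 5), (5, "palm", "10/8/99", 300, 10),
    (6, "vaio", "12/1/99", 1800, 10), (6, "palm", "12/1/99", 300, 10),
]

def name_of(row):
    """Assumed forward key-map f: the name attribute of an O3 row."""
    return row[1]

palm_rows = trace_forward_key_map(O3, name_of, "palm")
print(palm_rows)
```

Because only the key of the output row is used, the quarterly totals in O4 never have to be recomputed during tracing.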

Page 39: Other Properties

Provided Tracing Procedure

Provided Transformation Inverse $T^{-1}$
– If T is an aggregator, then o’s lineage is $T^{-1}(\{o\})$
– Not always true for dispatchers or black-boxes

Page 40: Tracing Procedures

Property                 Procedure   # T Calls   # Accesses
dispatcher               TraceDS     O(|I|)      O(|I|)
aggregator               TraceAG     O(2^|I|)    O(2^|I|)
black-box                return I;   0           O(|I|)
filter                   return o;   0           0
context-free aggr.       TraceCF     O(|I|^2)    O(|I|^2)
key-preserving aggr.     TraceKP     O(|I|)      O(|I|)
forward key-map          TraceFM     0           O(|I|)
backward key-map         TraceBM     0           O(|I|)
backward total-map       TraceTM     0           0
provided tracing-proc.   provided    ?           ?

Page 41: Property Hierarchy

[Figure: hierarchy of transformation properties rooted at ANY. Nodes: provided tracing-proc. or inverse, black-box, aggregator, dispatcher, context-free aggr., key-preserving aggr., filter, forward key-map, backward key-map, total-map.]

Page 42: Summary of Our Approach for One Transformation

Properties are provided with transformations
– Specified by the transformation author
– Declared in prepackaged transformations
– Derived using recent techniques [Clio01, RB01]

The best property of a transformation is selected based on the hierarchy

The tracing procedure using the best property is called at tracing time

Indexing techniques

Page 43: Transformation Sequences

Naive algorithm traces backwards one transformation at a time
– Need all intermediate results
– Poor performance for long sequences

[Figure: sequence of transformations T1, T2, T3, …, Tn from input I to output O.]
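The naive algorithm can be sketched as a loop that walks the sequence from the last transformation to the first, tracing every item of the current lineage set against the stored input of that stage. The sketch assumes all stages are dispatchers and that every intermediate result was saved at load time; the two example transformations and all function names are hypothetical.

```python
def dispatcher_tracer(transform):
    """Build a tracing function for a dispatcher: the inputs i with o in T({i})."""
    return lambda stage_input, o: {i for i in stage_input if o in transform({i})}

def trace_sequence(tracers, stage_inputs, o):
    """Trace o backwards; stage_inputs[k] is the stored input of the k-th transformation."""
    current = {o}
    for tracer, stage_input in zip(reversed(tracers), reversed(stage_inputs)):
        traced = set()
        for item in current:
            traced |= tracer(stage_input, item)   # lineage of a set is the union over its items
        current = traced
    return current

def explode(rows):
    """Hypothetical T1: (id, 'a,b') becomes (id, 'a') and (id, 'b')."""
    return {(rid, letter) for (rid, letters) in rows for letter in letters.split(",")}

def drop_b(rows):
    """Hypothetical T2: a filter that removes rows whose letter is 'b'."""
    return {(rid, letter) for (rid, letter) in rows if letter != "b"}

source = {(1, "a,b"), (2, "b,c")}
intermediate = explode(source)            # must be kept for tracing
output = drop_b(intermediate)

result = trace_sequence(
    [dispatcher_tracer(explode), dispatcher_tracer(drop_b)],
    [source, intermediate],
    (2, "c"),
)
print(result)
```

Every stage needs its own stored input, which is the storage and performance cost that motivates combining transformations before tracing.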

Page 44: Transformation Sequences

[Figure: the sequence T1, T2, T3, …, Tn from input I to output O is rewritten so that a combined transformation T’ replaces the first transformations, followed by the remaining ones up to Tn.]

Combine transformations and trace as one
– Reduces number of intermediate results
– By combining judiciously: reduces tracing cost, doesn’t lose accuracy

Page 45: Overall Approach

Algorithm for deriving properties of $T = T_1 \circ T_2$ from properties of $T_1$ and $T_2$

Coarse-grained cost metric for a tracing sequence based on transformation properties

Greedy algorithm

Page 46: Example of Greedy Algorithm

[Figure: step-by-step greedy combination of transformations T4, T5, T6, and T7. Property labels with cost levels shown in the figure: fkmap(2), btmap(1), filter(1), bkmap(2), blkbox(5). Combined transformations shown: T4’ with fkmap(2) and T6’ with bkmap(2).]

Page 47: Multiple-Input Example

T3 maps O1 and O2 to O3, and is a dispatcher with respect to each input.

O1:
id  cust  date      pid  quant
1   A     2/8/99    1    10
1   A     2/8/99    2    10
:   :     :
5   D     10/8/99   1    5
5   D     10/8/99   3    10
6   B     12/1/99   2    10
6   B     12/1/99   3    10

O2:
id  name  price  valid
1   imac  1200   10/1/98-
2   vaio  2400   6/1/98-9/1/99
2   vaio  1800   9/2/99-
3   palm  400    7/2/98-9/1/99
3   palm  300    9/2/99-

O3:
oid  name  date      price  quant
1    imac  2/8/99    1200   10
1    vaio  2/8/99    2400   10
:    :     :
5    imac  10/8/99   1200   5
5    palm  10/8/99   300    10
6    vaio  12/1/99   1800   10
6    palm  12/1/99   300    10

Lineage of the O3 row (5, palm, 10/8/99, 300, 10):
– O1: (5, D, 10/8/99, 3, 10)
– O2: (3, palm, 300, 9/2/99-)

Page 48: Transformation Graphs

[Figure: transformation graph from inputs I1 and I2 to output O.]

Definition time
– Specify properties of each transformation in graph

Page 49: Transformation Graphs

Definition time
– Specify properties of each transformation in graph
– Consider each path as a transformation sequence
– Combine transformations in each sequence

[Figure: transformation graph from inputs I1 and I2 to output O.]

Page 50: Transformation Graphs

Definition time

Load time
– Save intermediate results and build indices as desired

Tracing time
– Trace lineage through each sequence
– Combine results

[Figure: transformation graph from inputs I1 and I2 to output O.]

Page 51: Example Revisited

[Figure: the transformation graph from the earlier example (Order, Product, T1 to T7, SalesJump), annotated with the properties dispatcher, filter, fkmap, bkmap, and btmap, shown before and after combining transformations.]

Page 52: Experimental Results

Transformation graph based on a complex TPC-D query (Q12)

Page 53: Part 3: View Update Using Data Lineage

View update: translating updates on views to updates on base tables

Obvious connection to lineage in case of view deletions

Fresh approach with improved results

Page 54: View Update Translations: Valid and Exact

[Figure: view V containing tuple t, defined over base tables R1, R2, …, Rn.]

Page 55: View Update Translations: Valid and Exact

[Figure: view V containing tuple t, defined over base tables R1, R2, …, Rn.]

Page 56: View Update Translations: Valid and Exact

[Figure: view V containing tuple t, defined over base tables R1, R2, …, Rn.]

Page 57: Our Algorithm

Uses lineage to:
– Find an exact translation whenever one exists (in linear time for many cases)
– Find a “good” translation when no exact translation exists

Fully automatic

Previous approaches
– Don’t always find an exact translation
– Often require user input
– Consider restricted classes of views

Page 58: Related Work

Schema-level lineage tracing (annotation-based) [BB99, HQGW93, RS98]

Drill-down or drill-through on data cubes [Gray95]

“Weak inverse” for transformations [WS97]

Warehouse load resumption [LGMW00]

Data cleaning [GFSS+01]

View update [DB82, Mas84, Kel85]

Page 59: Conclusions

Data lineage problem in two scenarios
– Warehouse defined by relational views
– Warehouse defined by general data transformations

For both scenarios, we provide:
– Formal lineage definition
– Lineage tracing algorithms
– Optimization techniques
– System prototype and performance study

Use lineage for the view update problem

Page 60: Some Open Problems

Lineage of “missing” view or base tuples

Deriving transformation properties

Combining with annotation-based approach

View update
– Translation ambiguity
– Base table constraints
– Multiple interacting views


Page 62: Lineage Applications

On-line analytical processing (OLAP)

Scientific databases

Sensory and monitoring systems

Data cleaning

Warehouse resumption

Data security

View update

Page 63: Lineage Tracing

Convert view definition into a segmented normal form

Generate one tracing query for each ASPJ segment

Apply tracing queries top-down through view definition

Lineage result is unaffected by normalization

[Figure: a view definition with nodes V and W over base tables R, S, and T, shown twice.]

Page 64: Tracing Example

$V = \alpha_{X,\,\mathrm{avg}(Z)}(\sigma_{K1<K2}(R \bowtie S))$

$TQ = \mathrm{Split}_{R,S}(\sigma_{K1<K2 \wedge X=b}(R \bowtie S))$

V (X, avg): (a, 4), (b, 6)

[Figure: base tables R and S with columns K1, K2, X, Y, and Z; the tracing query returns the rows of R and S in the lineage of the view tuple (b, 6).]

Page 65: Split Lineage Tables (SLT)

V (X, avg): (a, 4), (b, 6)

[Figure: base tables R and S and view V from the tracing example, together with auxiliary tables R' and S' produced by Split; the rows belonging to the view tuple (b, 6) are highlighted.]

Page 66: Base Table Projections (BP)

V (X, avg): (a, 4), (b, 6)

[Figure: base tables R and S and view V from the tracing example, together with auxiliary views R’ and S’ that store projections of the base tables; the rows belonging to the view tuple (b, 6) are highlighted.]

Page 67: Context-Free Aggregator Example

T4 maps O3 (oid, name, date, price, quant) to O4 (name, Q1, Q2, Q3, Q4).

O4:
name  Q1   Q2   Q3   Q4
imac  12K  24K  12K  6K
vaio  24K  12K  24K  18K
palm  0K   4K   2K   6K

[Figure: the O3 rows are grouped step by step into partitions of imac, vaio, and palm rows.]

Lineage of the O4 row (palm, 0K, 4K, 2K, 6K): the O3 rows (2, palm, 4/5/99, 400, 10), (4, palm, 8/6/99, 400, 5), (5, palm, 10/8/99, 300, 10), (6, palm, 12/1/99, 300, 10)

Page 68: Tracing Example 1

Tracing procedure for context-free aggregators
– Partition input I into I1…In such that |T(Ik)| = 1;
– Return Ik s.t. T(Ik) = {o};

Page 69: Lineage Equivalence

Lineages of equivalent SPJ views are equivalent

Not for ASPJ views

Input (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 6)
View $\alpha_{Y,\,\mathrm{sum}(Z)}$ (Y, sum): (a, 2), (b, 6)

Lineage of (b, 6): (8, b, 0), (8, b, 6)

Page 70: Lineage Equivalence

Lineages of equivalent SPJ views are equivalent

Not for ASPJ views

Input (X, Y, Z): (3, a, 2), (8, b, 0), (8, b, 6)
View $\alpha_{Y,\,\mathrm{sum}(Z)}$ applied after a selection (labeled “B=0” on the slide) (Y, sum): (a, 2), (b, 6)

Lineage of (b, 6): (8, b, 6)

Page 71: Non-Context-Free Example

Page 72: Non-Context-Free Example

Page 73: Indices Help!

Conventional index
– On input key Akey for a backward key-map with $A_{key} \leftarrow g(B)$

Functional index
– On f(A) for a forward key-map with $f(A) \rightarrow B_{key}$
– On T(A) for a dispatcher

Lineage index
– Mapping the key of each output data item o to the keys of input data items in o’s lineage
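A lineage index can be built once at load time and then answers tracing requests with a lookup. The sketch below assumes a dispatcher, so that each input item can be pushed through T on its own; the transformation `explode` and the key functions are hypothetical stand-ins.

```python
from collections import defaultdict

def explode(order_rows):
    """Hypothetical dispatcher: one (id, pid, quant) row per prod-list entry."""
    out = set()
    for (oid, prod_list) in order_rows:
        for entry in prod_list.split(","):
            pid, quant = entry.rstrip(")").split("(")
            out.add((oid, int(pid), int(quant)))
    return out

def build_lineage_index(transform, input_rows, input_key, output_key):
    """Map the key of each output item to the keys of the input items in its lineage."""
    index = defaultdict(set)
    for row in input_rows:
        for out in transform({row}):        # valid for dispatchers: T works item by item
            index[output_key(out)].add(input_key(row))
    return index

orders = {(1, "1(10),2(10)"), (5, "1(5),3(10)"), (6, "2(10),3(10)")}

index = build_lineage_index(
    explode,
    orders,
    input_key=lambda row: row[0],             # order id
    output_key=lambda out: (out[0], out[1]),  # (order id, product id)
)
print(index[(5, 3)])
```

The index trades storage and load-time work for tracing that no longer calls T or scans the input.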

Page 74: Experimental Results

Tracing through an “SP” transformation over TPC-D table PartSupp

Page 75: Tracing Through Sequences

Tracing cost estimation
– Divide properties into 5 groups
– T’s cost level depends on the group of its best property
– Associate a sequence with N[1..5] where N[k] records the number of transformations with cost level k

Greedy algorithm
– Pick a combination that results in the lowest N

Page 76: Lineage Annotation (Appendix)

[Figure: data items 1 to 4 flow through transformations T1 and T2 and carry lineage annotations such as {1}, {1,2}, {2,4}, {4}, and {1,2,4}; T1* and T2* denote the corresponding tracing steps.]

Page 77: Multiple Inputs and Outputs

Define properties for each input and output

Trace lineage for each input/output pair using single-input single-output tracing procedures

[Figure: transformation T with inputs I1, I2, ..., Im and outputs O1, O2, ..., On.]

Page 78: View Update

Deletions on SPJ view $\rightarrow$ deletions on base database

View tuple deletion request –t and base tuple deletion $\Delta D$

$\Delta D$ is a translation for –t if $\{t\} \subseteq \Delta V = V(D) - V(D - \Delta D)$

Side-effect $E = \Delta V - \{t\}$; $\Delta D$ is exact if $E = \emptyset$

[Figure: deleting $\Delta D$ from base database D gives D’, and view V changes to V’ by $\Delta V$.]

Page 79: Relationships to Data Lineage

[Figure: view tuple t over base tables R1, R2, …, Rn, with projection on A and selection on C.]

For an SPJ view:

$t_i \in R_i$ belongs to t’s lineage $R_i^*$ iff
$\{t\} \subseteq \pi_A(\sigma_C(R_1 \bowtie \dots \bowtie \{t_i\} \bowtie \dots \bowtie R_n))$

$t_i$ belongs to t’s exclusive lineage $R_i^{**}$ iff
$\{t\} = \pi_A(\sigma_C(R_1 \bowtie \dots \bowtie \{t_i\} \bowtie \dots \bowtie R_n))$

Intuition: $t_i$ contributes only to t

Page 80: The Problem

View update

View update for deletions

[Figure: a deletion turns view V into V’; the question is which base deletion turns D into D’. View tuple t is defined over base tables R1, R2, …, Rn, with projection on A and selection on C.]

Page 81: Relationships to Data Lineage

Deleting a lineage branch Ri* of t is always a translation for –t

[Figure: view tuple t over base tables R1, R2, …, Rn, with projection on A and selection on C.]

Page 82: Relationships to Data Lineage

Deleting a lineage branch Ri* of t is always a translation for –t

Deleting any subset of t’s exclusive lineage D** never causes side-effect

[Figure: view tuple t over base tables R1, R2, …, Rn, with projection on A and selection on C.]

Page 83: Relationships to Data Lineage

Deleting a lineage branch Ri* of t is always a translation for –t

Deleting any subset of t’s exclusive lineage D** never causes side-effect

If –t has an exact translation $\Delta D$, it must also have an exact translation within t’s lineage

[Figure: view tuple t over base tables R1, R2, …, Rn, with projection on A and selection on C.]

Page 84: Translating View Tuple Deletions

DELETE(t, V, D)
    compute lineage D* and exclusive lineage D**;
    IF D** is a translation THEN RETURN;
    IF $\exists i$ s.t. Ri* causes no side-effect THEN RETURN;
    FOR each subset $\Delta D$ of D* DO
        IF $\Delta D$ is not a translation THEN prune all subsets of $\Delta D$;
        ELSE IF $\Delta D$ causes a side-effect THEN prune all supersets of $\Delta D$;
        ELSE RETURN;
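The search at the core of the procedure can be illustrated with a simplified sketch. It is not the DELETE algorithm itself: it omits the exclusive-lineage shortcut, the lineage-branch check, and the pruning, and simply enumerates subsets of t's lineage from smallest to largest until the recomputed view equals V – {t}. The view definition, the relation schemas, and all function names are assumptions for illustration.

```python
from itertools import combinations

def view(r_rows, s_rows):
    """Illustrative SPJ view: project X from the join of R(X, Y) and S(Y, Z), set semantics."""
    return {x for (x, y) in r_rows for (y2, _) in s_rows if y == y2}

def lineage(r_rows, s_rows, t):
    """Base tuples that take part in at least one derivation of view tuple t."""
    r_star = {(x, y) for (x, y) in r_rows
              if x == t and any(y == y2 for (y2, _) in s_rows)}
    s_star = {(y2, z) for (y2, z) in s_rows
              if any(y == y2 for (_, y) in r_star)}
    return r_star, s_star

def exact_translation(r_rows, s_rows, t):
    """Smallest set of lineage tuples whose deletion removes t and nothing else, or None."""
    target = view(r_rows, s_rows) - {t}
    r_star, s_star = lineage(r_rows, s_rows, t)
    candidates = [("R", row) for row in sorted(r_star)]
    candidates += [("S", row) for row in sorted(s_star)]
    for size in range(1, len(candidates) + 1):
        for delta in combinations(candidates, size):
            new_r = r_rows - {row for (name, row) in delta if name == "R"}
            new_s = s_rows - {row for (name, row) in delta if name == "S"}
            if view(new_r, new_s) == target:   # t is gone and there is no side-effect
                return set(delta)
    return None

R = {(1, "a"), (2, "a")}
S = {("a", 10)}

translation = exact_translation(R, S, 1)
print(translation)
```

In this example, deleting ("a", 10) from S would also remove view tuple 2, which is a side-effect, so the search settles on deleting (1, "a") from R. Restricting the candidates to t's lineage follows the observation that an exact translation, when one exists, can be found within the lineage.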

Page 85: Detailed Computations

Is $\Delta D$ a translation for –t?
if $t \notin \pi_A(\sigma_C((R_1^* - \Delta R_1) \bowtie \dots \bowtie (R_n^* - \Delta R_n)))$
then $\Delta D$ is a translation

Does $\Delta D$ cause side-effect?
$E \subseteq \bigcup_{i=1..n} \pi_A(\sigma_C(R_1 \bowtie \dots \bowtie \Delta R_i \bowtie \dots \bowtie R_n)) - \{t\}$
if $E \subseteq \pi_A(\sigma_C((R_1 - \Delta R_1) \bowtie \dots \bowtie (R_n - \Delta R_n)))$
then $\Delta D$ is exact

Further pruning by sizes