Lineage Tracing for General Data Warehouse Transformations
Yingwei Cui and Jennifer Widom
Computer Science Department, Stanford University
Presentation by Aaron St.Clair
Jan 17, 2016
Outline
• What is lineage tracing?
• Why is tracing lineage data important?
• How can we find lineage data?
• Performance results
Data Warehouses
• Integrate data from multiple sources
• Data undergoes a series of transformations
• Transformations vary in complexity
[Figure: Data Source 1, Data Source 2, Data Source N… → Transformation → Summarized Data]
Lineage Tracing
• Identifying the specific data items in the sources from which a given data item in the warehouse was derived
• Allows
 – In-depth data analysis
 – Data mining
 – Authorization management
 – View update
 – Efficient warehouse recovery
An Example
Selects items whose last-quarter sales are more than twice the average of the last three quarters' sales
Lineage Granularity
• Coarse-grained
 – Schema-level, attribute mapping
• Fine-grained
 – Set of source data items
Existing Work
• Mostly coarse-grained lineage
• Existing methods for fine-grained lineage
 – Extra annotation
 – Developer-defined weak inverses
 – Statistical estimation
• Can't handle complex procedural transformations
Tracing Lineage - Definitions
• Data set – set of data items without duplicates
• Transformation – any procedure that takes data sets as input and produces data sets as output
 – Stable (no spurious output)
 – Deterministic (under some conditions)
• Lineage of a data item – set of input data items that contribute to that item
Determining Contributions
• Need to find relevant data items
 – Easy for simple relational operators
 – Difficult for procedural transformations
• Select positives vs. aggregation and sum
Lineage Tracing
• Use of hierarchical model
 – Transformation classes
 – Schema mappings
 – Defined inverses
Transformation Classes
• A transformation's class determines the procedure used to find lineage
• For a dispatcher:
 – Apply the transformation T to each input item i in turn
 – If T({i}) contains an item of the output set being traced, add i to the lineage of that output set
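The dispatcher procedure can be sketched in Python. This is a minimal sketch, assuming a transformation is a function from a list of hashable items to a list of items; the names `trace_dispatcher` and `keep_positives` are illustrative and do not come from the paper.

```python
def trace_dispatcher(transform, inputs, traced_outputs):
    """Return the input items whose individual outputs hit the traced set.

    Assumes `transform` is a dispatcher: each input item is processed
    independently of the others, so it can be applied one item at a time.
    """
    traced = set(traced_outputs)
    lineage = []
    for item in inputs:
        # Apply the transformation to the singleton input and check overlap.
        if traced & set(transform([item])):
            lineage.append(item)
    return lineage


def keep_positives(items):
    """Hypothetical dispatcher: pass through only the positive numbers."""
    return [x for x in items if x > 0]


source = [-1, 2, 3]
lineage = trace_dispatcher(keep_positives, source, traced_outputs=[2])
```

Here `lineage` holds only the input item 2. The cost is one call to the transformation per input item, which is why the cheaper schema-based methods are preferred when they apply.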
Schema Mappings
• Defined schema for input and output of a transformation
• Backward key-maps
 – Akey ← g(B)
 – T1
• Forward key-maps
 – f(A) → Bkey
 – T4
• Backward total-maps
 – A ← g(B)
 – T5
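As an illustration of how a schema mapping replaces re-running the transformation, here is a minimal Python sketch of tracing with a forward key-map. It assumes the lineage of an output item is the set of input items whose f(A) value equals that item's Bkey value; the function name, the tuple layout, and the sales data are hypothetical.

```python
def trace_forward_key_map(inputs, f, output_key, output_item):
    """Return the input items whose f(A) value matches the output item's key.

    `f` computes f(A) for an input item; `output_key` extracts Bkey from an
    output item. Both are supplied by whoever declares the mapping.
    """
    target = output_key(output_item)
    return [item for item in inputs if f(item) == target]


# Hypothetical transformation: total sales per product.
sales = [("pen", 3), ("ink", 5), ("pen", 4)]
totals = [("pen", 7), ("ink", 5)]

pen_lineage = trace_forward_key_map(
    sales,
    f=lambda row: row[0],           # f(A): product name of an input row
    output_key=lambda row: row[0],  # Bkey: product name of an output row
    output_item=totals[0],
)
```

In this sketch `pen_lineage` contains the two "pen" rows. The transformation itself is never called, only the declared mapping.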
Provided Inverses/Tracing Procedures
• Best case: someone has defined a function mapping each output item to the input items it was derived from
• Nothing is known about the efficiency of that function
Property Hierarchy
Finding Lineage
• Recursively apply algorithms based on the transformation type until we reach top level
Optimizations
• Indexing the input data set improves performance
• A functional index built using the schema optimizes queries of the form F(i) = v
• Store auxiliary or intermediate views in the warehouse
 – Reduce their number by composing transformations
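A functional index for queries of the form F(i) = v can be sketched as a hash map from F(i) to the matching items. This is an illustrative in-memory sketch, not the warehouse implementation; `build_functional_index` and the sample rows are assumed names and data.

```python
from collections import defaultdict


def build_functional_index(inputs, F):
    """Group input items by F(item) so that F(i) = v becomes a lookup."""
    index = defaultdict(list)
    for item in inputs:
        index[F(item)].append(item)
    return index


sales = [("pen", 3), ("ink", 5), ("pen", 4)]
by_product = build_functional_index(sales, F=lambda row: row[0])

# Answer F(i) = "pen" without scanning the whole input set.
matches = by_product.get("pen", [])
```

The index is built once in a single pass over the input, and each later trace that needs F(i) = v avoids a full scan.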
Transformation Graphs
• Create a tracing sequence for each path from input to output in the graph
• Combine the results of each sequence
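The two steps can be sketched in Python as follows. The sketch assumes that each transformation on a path comes with a tracing function and a stored copy of its input (for example an intermediate view kept in the warehouse), and that results from different paths are combined by union. All names and data here are hypothetical.

```python
def trace_sequence(steps, traced_outputs):
    """Trace backward through one path.

    `steps` lists (tracer, stored_input) pairs ordered from source to
    warehouse. Each tracer(stored_input, traced) returns the items of
    stored_input that contribute to the traced items.
    """
    traced = list(traced_outputs)
    for tracer, stored_input in reversed(steps):
        traced = tracer(stored_input, traced)
    return traced


def trace_graph(paths, traced_outputs):
    """Combine the per-path results, dropping duplicates but keeping order."""
    combined = []
    for steps in paths:
        for item in trace_sequence(steps, traced_outputs):
            if item not in combined:
                combined.append(item)
    return combined


def trace_filter(rows, traced):
    # A filter passes rows through unchanged, so lineage is the row itself.
    return [row for row in rows if row in traced]


def trace_sum(rows, traced):
    # Per-product sum: lineage is every row with a traced product name.
    products = {out[0] for out in traced}
    return [row for row in rows if row[0] in products]


source = [("pen", 3), ("pen", -1), ("ink", 5), ("pen", 4)]
positives = [row for row in source if row[1] > 0]  # stored intermediate view
path = [(trace_filter, source), (trace_sum, positives)]

result = trace_graph([path], [("pen", 7)])
```

Tracing the total ("pen", 7) first recovers the two positive "pen" rows from the intermediate view, then maps them back to the same two rows in the source.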
Performance
• 1GB warehouse
• Schema mapping better than transformation class-specific algorithms
• Indexing helps
• Combining attributes reduces trace time
Questions?