Utility of Considering Multiple Alternative Rectifications in Data Cleaning
Preet Inder Singh Rihan, Master's Thesis
Committee Members: Dr. Subbarao Kambhampati (Chair), Dr. Huan Liu, Dr. Hasan Davulcu
Feb 24, 2016
Importance of Data Cleaning
Data is one of the most valuable resources
◦ Crucial to numerous decision-making and analysis processes

The high volume, variety and velocity of data make it difficult to obtain data in its cleanest form
Sources and Types of Noise
A few reasons for the noise present in data:
◦ Imperfect sensing devices / information extractors
◦ Heterogeneity in data from multiple sources
◦ Errors in data entry, misspellings, etc.

Data that suffers from quality issues is called dirty data
Example of dirty data:

Make  | Model   | Cartype | Condition | Drivetrain
TSX   | Acura   |         | Used      | FWD
Honda | Corolla | Sedan   | New       | FWD
Honda | Civic   | Sdna    | Used      | FWD
Current Techniques

Types of Problems:
◦ De-duplication
◦ Inconsistencies
◦ Schema noise

Industry Solutions:
◦ Outlier detection

Academic Approaches:
◦ Conditional functional dependencies
◦ BayesWipe
Common Themes in Current Data Cleaning Techniques

Most techniques consider multiple rectifications, but pick the most likely rectification deterministically, using:
◦ fixed rules
◦ domain experts
Example

Dirty Database:
TID | Make   | Model   | Cartype | Condition | Drivetrain
1   | Honda  | Civic   | Sedan   | Used      | FWD
2   | Honda  | Corolla | Sedan   | New       | FWD   <- dirty tuple
3   | Honda  | Civic   | Sedan   | Used      | FWD
4   | Honda  | Civic   | Sedan   | Used      | FWD
5   | Toyota | Corolla | Sedan   | New       | FWD

True tuple for TID 2: Toyota | Corolla | Sedan | New | FWD

Candidate rectifications for tuple 2:
◦ Rectification 1: Honda | Civic | Sedan | New | FWD (probability 0.6)
◦ Rectification 2: Toyota | Corolla | Sedan | New | FWD (probability 0.4)

Deterministic clean output:
TID | Make   | Model   | Cartype | Condition | Drivetrain
1   | Honda  | Civic   | Sedan   | Used      | FWD
2   | Honda  | Civic   | Sedan   | New       | FWD
3   | Honda  | Civic   | Sedan   | Used      | FWD
4   | Honda  | Civic   | Sedan   | Used      | FWD
5   | Toyota | Corolla | Sedan   | New       | FWD

The deterministic choice keeps only Rectification 1 for tuple 2, missing the true tuple.
Data Cleaning Approaches: Problems

It is hard to obtain perfect fixed rules or domain knowledge

Partially correct rules/knowledge may discard the true rectification
◦ Results in information loss
◦ Irrecoverable once the original data is decoupled from the cleaned outcome
Alternative Approach: Considering Multiple Alternative Candidates After Data Cleaning

Keep multiple alternative rectifications in a probabilistic database

Advantages:
◦ Prevents information loss
◦ Generates query results with higher recall
Alternative Approach: Potential Challenges

Keeping multiple alternative rectifications of a dirty tuple poses some challenges:
◦ Query results with many irrelevant answers, i.e., low precision
◦ Query processing over probabilistic data
◦ Size of the probabilistic data
Problem Statement
To investigate the trade-offs of keeping multiple alternative rectifications of a dirty data instance versus a single, deterministically selected clean rectification
Agenda
◦ Motivation
◦ Deterministic and probabilistic clean outcomes
◦ Investigation strategy
◦ Optimization technique
◦ Experiments and results
Background System

For investigation, BayesWipe [1] is used:
◦ End-to-end probabilistic data cleaning system
◦ Cleans structured data
◦ Handles data quality issues due to:
  ◦ Inconsistency
  ◦ Incompleteness
  ◦ Substitutions

[1] Y. Hu, S. De, Y. Chen, and S. Kambhampati. Bayesian data cleaning for web data. arXiv preprint arXiv:1204.3677, 2012.
BayesWipe

For every tuple T in the dirty database:
◦ A set of rectifications T* is generated
◦ Every T* has a probability value P(T*|T)
◦ P(T*|T) is the system's confidence that T* is the true tuple
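BayesWipe's actual source and error models are learned from data; the sketch below only illustrates the Bayes-rule structure behind P(T*|T), with hypothetical priors and a toy per-attribute error model standing in for the learned ones.

```python
# Sketch of scoring candidate rectifications T* for a dirty tuple T.
# The priors and the per-attribute error model are illustrative
# assumptions, not BayesWipe's learned models.

def likelihood(dirty, candidate, edit_prob=0.1):
    """P(T|T*): each attribute is kept with prob 1-edit_prob, corrupted with edit_prob."""
    p = 1.0
    for d, c in zip(dirty, candidate):
        p *= (1 - edit_prob) if d == c else edit_prob
    return p

def rectification_posteriors(dirty, candidates, priors):
    """Return {candidate: P(T*|T)} via Bayes' rule, normalized over candidates."""
    scores = {c: priors[c] * likelihood(dirty, c) for c in candidates}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

dirty = ("Honda", "Corolla", "Sedan")
candidates = [("Honda", "Civic", "Sedan"), ("Toyota", "Corolla", "Sedan")]
priors = {candidates[0]: 0.6, candidates[1]: 0.4}

posteriors = rectification_posteriors(dirty, candidates, priors)
# Here each candidate differs from the dirty tuple in exactly one attribute,
# so the likelihoods are equal and the posteriors reduce to the priors.
```
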
BayesWipe's Clean Outcomes

BayesWipe is used to produce outcomes in two modes:
◦ BayesWipe-DET: only the most likely rectification
◦ BayesWipe-PDB: all rectifications with their associated probabilities

[Figure: BayesWipe maps a dirty tuple T to rectifications T*1, T*2, ..., T*n; BayesWipe-DET keeps only the most likely T*, while BayesWipe-PDB keeps all of them]
Example: BayesWipe-DET and BayesWipe-PDB

Dirty Database:
TID | Make   | Model   | Cartype | Condition | Drivetrain
1   | Honda  | Civik   | Sedan   | Used      | FWD
2   | Honda  | Corolla | Sedan   | New       | FWD
3   | Honda  | Civic   | Sedan   | Used      | FWD
4   | Toyota | Corolla | Sedan   | New       | FWD

BayesWipe-DET output:
TID | Make   | Model   | Cartype | Condition | Drivetrain
1   | Honda  | Civic   | Sedan   | Used      | FWD
2   | Honda  | Civic   | Sedan   | New       | FWD
3   | Honda  | Civic   | Sedan   | Used      | FWD
4   | Toyota | Corolla | Sedan   | New       | FWD

BayesWipe-PDB output:
TID | Make   | Model   | Cartype | Condition | Drivetrain | Probability
1   | Honda  | Civic   | Sedan   | Used      | FWD        | 0.9
1   | Honda  | Civik   | Sedan   | New       | FWD        | 0.1
2   | Toyota | Corolla | Sedan   | New       | FWD        | 0.2
2   | Toyota | Corolla | Sedan   | Used      | FWD        | 0.2
2   | Honda  | Civic   | Sedan   | Used      | FWD        | 0.6
3   | Honda  | Civik   | Sedan   | New       | FWD        | 0.05
3   | Honda  | Civic   | Sedan   | Used      | FWD        | 0.95
4   | Toyota | Corolla | Sedan   | New       | FWD        | 0.9
4   | Honda  | Corolla | Sedan   | New       | FWD        | 0.1
BayesWipe-PDB Type

Types of probabilistic databases:
◦ Tuple-independent
◦ Block-Independent Disjoint (BID)
◦ C-table

BayesWipe-PDB type:
◦ Block-Independent Disjoint (BID)

Block-Independent Disjoint probabilistic database (example):
TID | Make   | Model   | Cartype | Condition | Drivetrain | Probability
12  | Honda  | Civic   | Sedan   | Used      | FWD        | 0.85
12  | Honda  | Corolla | Sedan   | New       | FWD        | 0.10
12  | Honda  | Civic   | Sdfkshf | Used      | FWD        | 0.05
210 | Toyota | Corolla | Sedan   | New       | FWD        | 0.9
210 | Honda  | Corolla | Sedan   | New       | FWD        | 0.1
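The BID model can be made concrete with a small sketch (the representation and data are illustrative, not BayesWipe's storage schema): each block, keyed by tuple id, holds mutually exclusive alternatives, and distinct blocks are independent of one another.

```python
# Sketch of a Block-Independent Disjoint (BID) database: each block (keyed
# by TID) holds mutually exclusive alternative tuples whose probabilities
# sum to at most 1; distinct blocks are probabilistically independent.

bid_db = {
    12: [
        (("Honda", "Civic", "Sedan", "Used", "FWD"), 0.85),
        (("Honda", "Corolla", "Sedan", "New", "FWD"), 0.10),
        (("Honda", "Civic", "Sdfkshf", "Used", "FWD"), 0.05),
    ],
    210: [
        (("Toyota", "Corolla", "Sedan", "New", "FWD"), 0.9),
        (("Honda", "Corolla", "Sedan", "New", "FWD"), 0.1),
    ],
}

def check_bid(db, tol=1e-9):
    """Disjointness constraint: alternatives in a block sum to at most 1."""
    return all(sum(p for _, p in alts) <= 1 + tol for alts in db.values())
```
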
BayesWipe-PDB Storage

BayesWipe-PDB is stored in a relational database:
◦ SQL Server

Query processing engine:
◦ Mystiq [2], a prototype probabilistic database management system

[2] Boulos, Jihad, Nilesh Dalvi, Bhushan Mandhani, Shobhit Mathur, Chris Re, and Dan Suciu. "MYSTIQ: a system for finding more answers by using probabilities." In SIGMOD, pp. 891-893. ACM, 2005.
Investigation Strategy

Criteria to compare BayesWipe-PDB and BayesWipe-DET:
◦ Accuracy of query results
◦ Scalability
Accuracy of Query Results

To check whether BayesWipe-PDB improves query results, query results from BayesWipe-PDB and BayesWipe-DET are compared using:
◦ Precision of query results
◦ Recall of query results
◦ Total increase in true positives over multiple queries
◦ Total increase in false positives over multiple queries
Query Results: BayesWipe-PDB and BayesWipe-DET

BayesWipe-DET query results:
◦ Set of deterministic tuples

BayesWipe-PDB query results:
◦ Set of probabilistic tuples
◦ Multiple rectifications of a tuple

Applying the selection σ_{make = 'Honda'} to each clean outcome from the previous example:

Deterministic query results:
TID | Make  | Model | Cartype | Condition | Drivetrain
1   | Honda | Civic | Sedan   | Used      | FWD
2   | Honda | Civic | Sedan   | New       | FWD
3   | Honda | Civic | Sedan   | Used      | FWD

Probabilistic query results:
TID | Make  | Model   | Cartype | Condition | Drivetrain | Probability
1   | Honda | Civic   | Sedan   | Used      | FWD        | 0.9
2   | Honda | Civic   | Sedan   | Used      | FWD        | 0.6
3   | Honda | Civik   | Sedan   | New       | FWD        | 0.05
3   | Honda | Civic   | Sedan   | Used      | FWD        | 0.95
4   | Honda | Corolla | Sedan   | New       | FWD        | 0.1
Accuracy of Query Results: Evaluation Challenges

Precision/recall calculation is not straightforward. Evaluation challenges:
◦ Defining the accuracy/relevance of a resultant tuple
◦ Defining precision/recall for query results from BayesWipe-PDB
Defining Relevance/Accuracy of a Resultant Tuple

Ground Truth Results:
TID | Make   | Model   | Cartype | Condition | Drivetrain
1   | Honda  | Civic   | Sedan   | Used      | FWD
2   | Honda  | Corolla | Sedan   | New       | FWD
3   | Honda  | Civic   | Sedan   | Used      | FWD
4   | Toyota | Corolla | Sedan   | New       | FWD

Observed Results:
TID | Make   | Model   | Cartype | Condition | Drivetrain
1   | Honda  | Civic   | Sedan   | Used      | FWD
2   | Honda  | Corolla | Sedan   | New       | FWD
3   | Honda  | Civic   | Sedan   | Used      | AWD
4   | Toyota | Corolla | Sedan   | New       | FWD

Query results for the same query:
◦ Ground truth result: 3 | Honda | Civic | Sedan | Used | FWD
◦ Observed result: 3 | Honda | Civic | Sedan | Used | AWD
Relevance of Resulting Tuples

In this precision and recall computation:
◦ Relevance is defined only by tuple IDs
◦ A probabilistic or deterministic tuple from the query results is relevant if its tuple ID appears in the query results from the ground truth
Precision/Recall of Probabilistic Query Results

Precision and recall are not directly defined for probabilistic query results.

True precision/recall for probabilistic query results:
◦ Calculated over all possible worlds*
◦ Overall precision/recall is the weighted sum of the precision/recall over all possible worlds
◦ However, the number of possible worlds is exponential

* A possible world is a state of a probabilistic database in which each random variable in the PDB has been assigned one of its possible values
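For a tiny instance, the possible-worlds definition can be computed exactly by brute force. A sketch under illustrative assumptions: each block's probabilities sum to 1, each block is reduced to a flag saying whether its chosen alternative satisfies the query, and relevance is judged by tuple id against the ground-truth result set, as defined above.

```python
from itertools import product

# Exact expected precision/recall over all possible worlds of a tiny BID PDB.
# blocks: {tuple_id: [(satisfies_query, probability), ...]}, with each
# block's probabilities summing to 1; truth_ids: ground-truth result set.

def expected_precision_recall(blocks, truth_ids):
    tids = list(blocks)
    exp_p = exp_r = 0.0
    for choice in product(*(blocks[t] for t in tids)):  # one pick per block
        weight = 1.0
        for _, p in choice:
            weight *= p                                 # blocks are independent
        result = {t for t, (sat, _) in zip(tids, choice) if sat}
        tp = len(result & truth_ids)
        if result:                                      # precision undefined on empty results
            exp_p += weight * tp / len(result)
        exp_r += weight * tp / len(truth_ids)
    return exp_p, exp_r

# Illustrative instance: tuple 1 satisfies the query with probability 0.9,
# tuple 2 with probability 0.6; the ground truth says both are answers.
blocks = {
    1: [(True, 0.9), (False, 0.1)],
    2: [(True, 0.6), (False, 0.4)],
}
exp_p, exp_r = expected_precision_recall(blocks, truth_ids={1, 2})
```

With n blocks this loop visits every one of the exponentially many worlds, which is exactly why the thesis resorts to approximations on real data.
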
Approximate Precision/Recall of Probabilistic Results

Two ways to approximate the precision/recall calculation:
◦ Consider partial belongingness of tuples, weighting each tuple t by its probability P(t)
◦ Use a pass threshold to classify probabilistic tuples as query results or not
  ◦ Precision and recall are then calculated using the standard formulas
  ◦ This is the traditional way [3] to handle uncertain results

[3] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. VLDB, pages 586-597. VLDB Endowment, 2002.
Precision/Recall Approximation Using a Pass Threshold

A fixed pass threshold is applied to the total probability of each tuple:
◦ Total probability = aggregated probabilities of all rectifications of a probabilistic tuple
◦ Total probability >= threshold: kept in the query results
◦ Total probability < threshold: rejected

Probabilistic query results:
TID | Make  | Model   | Cartype | Condition | Drivetrain | Probability
1   | Honda | Civic   | Sedan   | Used      | FWD        | 0.9
2   | Honda | Civic   | Sedan   | Used      | FWD        | 0.6
3   | Honda | Civik   | Sedan   | New       | FWD        | 0.05
3   | Honda | Civic   | Sedan   | Used      | FWD        | 0.95
4   | Honda | Corolla | Sedan   | New       | FWD        | 0.1

Aggregated probabilities per probabilistic tuple:
TID | Probability
1   | 0.9
2   | 0.6
3   | 1
4   | 0.1

Determinized results with threshold θ = 0.2:
TID
1
2
3
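The thresholding step can be sketched directly: aggregate the probabilities of all rectifications sharing a tuple id, then keep the ids whose total reaches the threshold. The (tid, probability) pairs below mirror the example tables on this slide.

```python
from collections import defaultdict

# Determinize probabilistic query results with a pass threshold: sum the
# probabilities of all rectifications of each tuple id, then keep the ids
# whose total probability reaches the threshold.

def determinize(prob_results, threshold):
    totals = defaultdict(float)
    for tid, prob in prob_results:
        totals[tid] += prob
    return sorted(tid for tid, total in totals.items() if total >= threshold)

# (tid, probability) pairs from the probabilistic query results above.
prob_results = [(1, 0.9), (2, 0.6), (3, 0.05), (3, 0.95), (4, 0.1)]

determinize(prob_results, threshold=0.2)   # tuple ids 1, 2, 3 survive; 4 is rejected
```
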
Scalability

To check whether BayesWipe-PDB is scalable, two comparisons are performed:
◦ Size of BayesWipe-PDB vs. size of BayesWipe-DET
◦ Query processing time over BayesWipe-PDB vs. query processing time over BayesWipe-DET
Potential Issues with BayesWipe-PDB

The number of rectifications with low probabilities increases as the data size increases.

Potential issues:
◦ Query results with very low precision
  ◦ One way to control this is a well-chosen pass threshold
◦ Scalability issues
  ◦ High physical space
  ◦ High query processing time
Optimization Technique

Reason for the potential issues:
◦ Too many irrelevant rectifications in BayesWipe-PDB

Pre-Pruning, an optimization technique:
◦ Prevents irrelevant rectifications from being stored in BayesWipe-PDB
◦ Checks every rectification T* of tuple T
◦ Stores T* in BayesWipe-PDB only if T* passes the pre-pruning algorithm
Pre-Pruned BayesWipe-PDB

The probabilistic database stores multiple candidate clean versions T* after pre-pruning.

[Figure: dirty data → BayesWipe → multiple alternatives (T*) → pruning → BayesWipe-PDB]
Pre-Pruning Algorithm

The pre-pruning algorithm considers every candidate clean version T* and its associated probability P(T*|T):
◦ P(T*|T) below α: rejected
◦ P(T*|T) between α and β: further investigated
◦ P(T*|T) above β: accepted
Pre-Pruning Algorithm (continued)

In the case of rare, legitimate tuples, it is possible that both T and T* have low probabilities. For this, the priors of tuple T and tuple T* are considered:

P[T] = (number of times tuple T occurs in the dirty database) / (number of tuples in the dirty database)

If the candidate passes this check, T* is kept as a clean version of the tuple T.

The values of α, β and γ are set to 0.009, 0.5 and 5 respectively.
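The three-way decision above can be sketched as follows. The α/β band structure comes from the slides; the γ-based check used inside the "further investigated" band is a hypothetical reconstruction (the slides do not spell out the exact rule), comparing the candidate's prior against the dirty tuple's prior.

```python
# Sketch of the pre-pruning decision for a candidate rectification T*.
# ALPHA/BETA band per the slides; the GAMMA rule below is an assumed
# reconstruction, not BayesWipe's published algorithm.

ALPHA, BETA, GAMMA = 0.009, 0.5, 5

def prior(tuple_counts, t, n_tuples):
    """P[T]: occurrences of tuple t in the dirty database / total tuples."""
    return tuple_counts.get(t, 0) / n_tuples

def keep_candidate(p_tstar_given_t, t, tstar, tuple_counts, n_tuples):
    if p_tstar_given_t < ALPHA:
        return False                  # rejected outright
    if p_tstar_given_t > BETA:
        return True                   # accepted outright
    # Further investigated (hypothetical rule): keep T* only if its prior
    # is at least GAMMA times the dirty tuple's prior, i.e. T* is a much
    # more common tuple than T in the dirty database.
    return prior(tuple_counts, tstar, n_tuples) >= GAMMA * prior(tuple_counts, t, n_tuples)
```
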
Experimental Setup

Experiments are performed on a used-car dataset crawled from Google Base.

Dataset sizes vary from 1,000 tuples to 30,000 tuples.

Synthetic noise is introduced at random into the clean dataset:
◦ The noise level varies from 1% to 20%

Random queries were selected to compare the quality of the query results extracted from BayesWipe-DET and BayesWipe-PDB.
Experimental Results

Findings from the comparison of BayesWipe-PDB and BayesWipe-DET on:
◦ Accuracy of query results (precision and recall)
◦ Scalability (size and query processing time)

The effect of the optimization technique on BayesWipe-PDB
BayesWipe-PDB vs. BayesWipe-DET: Recall and Precision

[Figure: per-query recall and precision for BayesWipe-PDB and BayesWipe-DET over the queries make = acura, model = outlander sports, cartype = sedan, make = bmw & condition = used, model = jetta, model = cooper s, model = h3, model = mini, and their average. Data size = 2500, noise = 10%, threshold = 0.1]
BayesWipe-PDB vs. BayesWipe-DET: Effect of Threshold

[Figure: average BayesWipe-PDB precision and recall as the pass threshold varies from 0 to 0.8; dataset size = 30000. Reference lines: BayesWipe-DET recall = 0.82, BayesWipe-DET precision = 0.997]
Accuracy of Query Results from Multiple Random Queries

Averaging the precision and recall values of multiple random queries does not give a good picture, so non-normalized metrics are compared over 100 random queries:
◦ Total increase in the number of true positives generated
  ◦ True Positives(BayesWipe-PDB) - True Positives(BayesWipe-DET)
◦ Total increase in the number of false positives generated
  ◦ False Positives(BayesWipe-PDB) - False Positives(BayesWipe-DET)
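The two non-normalized metrics are simple set computations over per-query result sets. A minimal sketch; the tuple-id sets below are illustrative, not the thesis's actual query outputs.

```python
# Non-normalized comparison over a batch of queries: total increase in true
# positives and false positives of BayesWipe-PDB over BayesWipe-DET, with
# relevance judged by tuple id against the ground-truth result set.

def tp_fp(observed, truth):
    return len(observed & truth), len(observed - truth)

def total_gains(queries):
    """queries: list of (pdb_result, det_result, ground_truth) id-sets."""
    tp_gain = fp_gain = 0
    for pdb, det, truth in queries:
        tp_p, fp_p = tp_fp(pdb, truth)
        tp_d, fp_d = tp_fp(det, truth)
        tp_gain += tp_p - tp_d
        fp_gain += fp_p - fp_d
    return tp_gain, fp_gain

queries = [
    ({1, 2, 3, 4}, {1, 2, 3}, {1, 2, 3, 4}),   # PDB recovers tuple 4
    ({1, 5}, {1}, {1}),                        # PDB adds a false positive
]
total_gains(queries)   # (1, 1): one extra true positive, one extra false positive
```
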
BayesWipe-PDB vs. BayesWipe-DET: True Positives and False Positives Gain

[Figure: increase in true positives and increase in false positives of BayesWipe-PDB over BayesWipe-DET (0 to 600) as the noise percentage varies over 1, 2, 5, 10, 15 and 20. Data size = 30000, threshold = 0.1]
BayesWipe-PDB vs. BayesWipe-DET: Size of Database

[Figure: number of tuples in the BayesWipe-PDB and BayesWipe-DET databases (0 to 400,000) as the noise percentage varies over 1, 2, 5, 10, 15 and 20. Data size = 30000]
BayesWipe-PDB vs. BayesWipe-DET: Average Query Processing Time

[Figure: query processing time in ms (0 to 80) for BayesWipe-PDB and BayesWipe-DET as the noise percentage varies over 1, 2, 5, 10, 15 and 20. Data size = 30000]
Optimized BayesWipe-PDB vs. BayesWipe-DET: Recall and Precision

[Figure: per-query recall and precision for BayesWipe-PDB, optimized BayesWipe-PDB, and BayesWipe-DET over the queries model = 'continental gt', model = 'equinox', make = 'isuzu', make = 'ford', make = 'hyundai', model = '525', model = '550 gran turismo', model = 'enclave']
Optimized BayesWipe-PDB: True Positives and False Positives Gain

[Figure: increase in true positives and in false positives (0 to 600), with and without optimization, as the noise percentage varies over 1, 2, 5, 10, 15 and 20. Data size = 30000, threshold = 0.1]
Optimized BayesWipe-PDB: Database Size Comparison

[Figure: number of tuples (0 to 400,000) in the BayesWipe-DET, BayesWipe-PDB, and pre-pruned BayesWipe-PDB databases as the noise percentage varies over 1, 2, 5, 10, 15 and 20. Data size = 30000]
Optimized BayesWipe-PDB: Average Query Processing Time

[Figure: query processing time in ms (0 to 80) for BayesWipe-DET, BayesWipe-PDB, and pre-pruned BayesWipe-PDB as the noise percentage varies over 1, 2, 5, 10, 15 and 20. Data size = 30000]
Results Summary

                        | Query processing time | Database size      | Precision                                                  | Recall
BayesWipe-DET           | Low                   | Same as dirty data | High                                                       | Low
BayesWipe-PDB           | Very high             | Very large         | Increases with the threshold value                         | Decreases with the threshold value
Optimized BayesWipe-PDB | High                  | Large              | Higher than BayesWipe-DET's in most cases (65% of queries) | Higher than or equal to BayesWipe-DET's
Conclusion

I studied the utility of considering multiple alternative rectifications in data cleaning by comparing BayesWipe-PDB and BayesWipe-DET.

BayesWipe-PDB always has better recall for query results, at the cost of precision.

BayesWipe-PDB also requires larger physical space and higher query processing time.

The optimization technique provides a way to minimize the precision cost and the scalability issues.