COLUMN MATCHING: Machine Learning Systems for Data Integration
Melody Penning
Information and Data Quality Conference
November 4-7, 2013, Little Rock, AR
The Problem
• 4,003 Tables with 92,161 Columns
• 1,347 have Primary Key Constraints
• 1,527 Columns are Foreign Keys
• Column usage may have changed over time…
Melody Penning 11/2013 2
-http://it.wikipedia.org/wiki/Teraminx
Data Integration
• Data Integration is Hard
• Systems Reasons
• Logical Reasons
• Social and Administrative Reasons
--Doan, Halevy and Ives, Principles of Data Integration
Data Integration Architecture
• Spectrum
• Virtual Integration
• Warehousing
[Figure: four data sources (RDBMS, RDBMS, HTML, XML), each read through its own Wrapper / Extractor, all feeding a single Mediated Schema or Warehouse]
--Doan, Halevy and Ives, Principles of Data Integration
Current Solutions
• Expert Opinion
• Metadata-based Matchers
• Instance-based Matchers
• Semantic Matchers
--Bellahsene, Bonifati and Rahm, Schema Matching and Mapping
Most Frequent Column Names
[Figure: bar chart, count axis from 0 to 1400. Column names shown: ACTIVITY_DATE, DESC, TERM_CODE, TERM_CODE_EFF, PROGRAM, SEQNO, COLL_CODE, AREA, DEPT_CODE, SEQ_NO, SBGI_CODE, TEXT, USER, CONNECTOR_MAX, ACTN_CODE]
Which Pieces Fit Together?
-http://en.wikipedia.org/wiki/Jigsaw_puzzle
The Goal
• Efficiently Identify Columns that Match
• Gather Convincing Evidence
• Test the Conclusions
• Clues
• Database Metadata
• Naming Scheme
• Column Contents
Database Metadata
Column                 Description
DATA_TYPE              Datatype of the column
DATA_LENGTH            Length of the column (in bytes)
DATA_PRECISION         Decimal precision for NUMBER datatype; binary precision for
                       FLOAT datatype, null for all other datatypes
DATA_SCALE             Digits to right of decimal point in a number
NULLABLE               Specifies whether a column allows NULLs. Value is N if there
                       is a NOT NULL constraint on the column or if the column is
                       part of a PRIMARY KEY. The constraint should be in an
                       ENABLE VALIDATE state.
DEFAULT_LENGTH         Length of default value for the column
NUM_DISTINCT           Number of distinct values
LOW_VALUE              Low value in the column
HIGH_VALUE             High value in the column
DENSITY                Density of the column
CHAR_LENGTH            Displays the length of the column in characters. This value
                       only applies to the following datatypes:
NUM_BUCKETS            Number of buckets in the histogram for the column
GLOBAL_STATS           For partitioned tables, indicates whether column statistics
                       were collected for the table as a whole (YES) or were
                       estimated from statistics on underlying partitions and
                       subpartitions (NO)
SAMPLE_SIZE            Sample size used in analyzing this column
CHAR_COL_DECL_LENGTH   Length
LAST_ANALYZED          Date on which this column was most recently analyzed
CHAR_USED              B | C. B indicates that the column uses BYTE length
                       semantics. C indicates that the column uses CHAR length
                       semantics. NULL indicates the datatype is not a CHAR type.
AVG_COL_LEN            Average length of the column (in bytes)
HISTOGRAM              Indicates existence/type of histogram
NUM_NULLS              Number of nulls in the column
Machine Learning or Statistics?
• Data Modeling vs. Algorithmic Modeling
-Leo Breiman, Statistical Modeling: The Two Cultures
-Bill Howe, University of Washington; Data Science on Coursera.org
Learning to Predict
• Supervised Learning
• Labeled examples are available
• Ex: Classification
• Semi-supervised Learning
• Some examples are labeled
• Some are not
• Unsupervised Learning
• Labeled examples are not available
• Ex: Clustering
-Han, Kamber and Pei, Data Mining Concepts and Techniques
• Active Learning
• User interaction
Classifiers
• Predict Class: Binary {0,1} or Multiclass {0,1,2,3...}
• Classification Problem Examples
• Loans: Safe or Risky
• Tumor: Malignant or Benign
• Customers: Will Click or Will not Click
• Classification Algorithms Examples
• Logistic Regression
• Rule Based
• Decision Tree Induction
• Naïve Bayes
Logistic Regression
• Inputs: Column Metadata
• Known and Unknown Key Constraints
• Training Set
• Columns with Known Key Constraints Labeled ‘1’
• All Others Labeled ‘0’
• Outputs are Probabilities and Classification Level
Metadata Columns…   Training Label   FROM   INTO   IP_1     IP_0     XP_1     XP_0     LEVEL   phat     lcl      ucl
xxxx…               0                0      1      0.6678   0.3322   0.6686   0.3314   1       0.6678   0.6363   0.6979
xxxx…               0                0      1      0.7092   0.2908   0.7098   0.2902   1       0.7092   0.6821   0.7349
xxxx…               0                0      1      0.5978   0.4022   0.5984   0.4016   1       0.5978   0.5667   0.6281
xxxx…               0                0      1      0.5241   0.4759   0.5247   0.4753   1       0.5241   0.4896   0.5583
xxxx…               0                0      1      0.7359   0.2641   0.7366   0.2634   1       0.7359   0.7079   0.7621
xxxx…               0                0      1      0.6499   0.3501   0.6506   0.3494   1       0.6499   0.6178   0.6806
xxxx…               0                0      1      0.5181   0.4819   0.5187   0.4813   1       0.5181   0.4829   0.5531
Example Output From SAS Proc Logistic
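The labeling scheme on this slide (known key columns labeled 1, all others labeled 0, probabilities out) can be sketched in a few lines of Python. This is only an illustration, not the SAS Proc Logistic model from the talk: the two features (a not-nullable flag and a distinct-value ratio), the toy rows, and the function names are all assumptions, and the fit uses plain batch gradient descent.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, x):
    """Predicted probability of label 1; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit intercept and coefficients by batch gradient descent on log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict_proba(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Hypothetical metadata features per column: [not_nullable, distinct_ratio].
X = [[1, 1.00], [1, 0.98], [1, 0.95], [0, 0.10],
     [0, 0.30], [1, 0.20], [0, 0.05], [0, 0.90]]
# Training label: 1 = column has a known key constraint, 0 = all others.
y = [1, 1, 1, 0, 0, 0, 0, 0]

weights = fit_logistic(X, y)
phat = [predict_proba(weights, x) for x in X]      # like the phat column
level = [1 if p >= 0.5 else 0 for p in phat]       # classify at the 0.5 level
```

Real column metadata would need encoding and scaling before it could be used this way, which is where most of the preparation effort would go.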
Combining Match Predictions
• “The More The Merrier”
• Ensemble Methods Improve Prediction
• Match Score Aggregation Methods
• Average, Minimum or Maximum
• Weighted-sum
• Rule Based
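The aggregation methods listed above can be written as one small function. This is a minimal sketch; the function name, the example scores, and the weights are hypothetical and do not come from the talk.

```python
def aggregate_scores(scores, method="average", weights=None):
    """Combine match scores from several matchers into one score."""
    if not scores:
        raise ValueError("at least one score is required")
    if method == "average":
        return sum(scores) / len(scores)
    if method == "minimum":
        return min(scores)
    if method == "maximum":
        return max(scores)
    if method == "weighted_sum":
        if weights is None or len(weights) != len(scores):
            raise ValueError("weighted_sum needs one weight per score")
        return sum(w * s for w, s in zip(weights, scores))
    raise ValueError("unknown method: " + method)


# Hypothetical scores from three matchers for one pair of columns.
scores = [0.8, 0.6, 0.4]
average = aggregate_scores(scores, "average")
lowest = aggregate_scores(scores, "minimum")
highest = aggregate_scores(scores, "maximum")
weighted = aggregate_scores(scores, "weighted_sum", weights=[0.5, 0.3, 0.2])
```

Minimum is the most conservative choice, since every matcher must agree; maximum is the most permissive. A rule-based combination would replace the arithmetic with explicit conditions on the individual scores.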
-Bell, Koren and Volinsky, All Together Now: A Perspective on the Netflix Prize
Benefiting from Errors
• Semi-Supervised Classification
• Classification of Key Constraint Columns using Metadata
• Logistic Regression
• Requires Lots of Data Preparation
• Followed by Semantic Matching
• Columns Grouped by Name
• Instance-based Matching
• Overlapping Data Counted
Logistic Regression Results
• Classifier Performance
• Matching Candidates come from the False Positives
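Selecting the false positives can be sketched as a simple filter: keep the columns whose training label is 0 but whose predicted probability reaches the classification level. The row layout, the column names, and the probabilities below are hypothetical and serve only to illustrate the idea.

```python
def matching_candidates(rows, level=0.5):
    """Return columns labeled 0 in training but classified as 1 (false positives)."""
    return [r["column"] for r in rows
            if r["label"] == 0 and r["phat"] >= level]


# Hypothetical classifier output; names and probabilities are made up.
rows = [
    {"column": "COL_A", "label": 1, "phat": 0.91},  # true positive
    {"column": "COL_B", "label": 0, "phat": 0.67},  # false positive -> candidate
    {"column": "COL_C", "label": 0, "phat": 0.12},  # true negative
    {"column": "COL_D", "label": 1, "phat": 0.35},  # false negative
]
candidates = matching_candidates(rows)
```

A column flagged this way looks like a key according to its metadata even though no constraint is declared on it, which is why it is worth testing as a match.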
Classification Table
             Correct             Incorrect           Percentages
Prob Level   Event   Non-Event   Event   Non-Event   Correct   Sensitivity   Specificity   False POS   False NEG
0.500        845     7153        459     2252        74.7      27.3          94.0          35.2        23.9
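The percentages in the classification table follow from its four counts. The sketch below recomputes them. The mapping of counts to true/false positives and negatives and the formulas are the ones that reproduce the numbers on the slide; the function and key names are assumptions.

```python
def classification_summary(tp, tn, fp, fn):
    """Percentages as reported in the classification table.

    tp: correct events       tn: correct non-events
    fp: incorrect events     fn: incorrect non-events
    """
    total = tp + tn + fp + fn
    return {
        "correct": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),   # share of actual events found
        "specificity": 100.0 * tn / (tn + fp),   # share of actual non-events found
        "false_pos": 100.0 * fp / (tp + fp),     # share of predicted events that are wrong
        "false_neg": 100.0 * fn / (tn + fn),     # share of predicted non-events that are wrong
    }


summary = classification_summary(tp=845, tn=7153, fp=459, fn=2252)
rounded = {name: round(value, 1) for name, value in summary.items()}
```

The 459 incorrect events are the false positives that become matching candidates.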
Semantic Matching
• Column Names of Two Groups:
• Candidates – Indicated by Logistic Regression
• Non-Candidates – Randomly Sampled from Data Columns
• Columns with Count Greater than One
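Grouping columns by name and keeping the names that occur more than once can be sketched as follows. The sketch assumes each qualified name is a table name, an underscore, then the column name, as in the examples later in this talk (for instance SMBARUL_AREA); that parsing rule and the exact-match grouping are assumptions, not the talk's stated method.

```python
from collections import defaultdict


def group_by_column_name(qualified_names):
    """Group TABLE_COLUMN names by the part after the first underscore.

    Only names shared by more than one column are kept.
    """
    groups = defaultdict(list)
    for qualified in qualified_names:
        _table, _sep, name = qualified.partition("_")
        groups[name].append(qualified)
    return {name: members for name, members in groups.items()
            if len(members) > 1}


names = ["SMBARUL_AREA", "SMBATRK_AREA", "SMRDORJ_CRN",
         "SSRMEET_CRN", "SHRIDEN_CITY", "GWTERRS_ID"]
groups = group_by_column_name(names)
```

Exact matching would miss near-identical names such as BEGIN_DATE and BEGIN_DTE, so a looser comparison may be needed in practice.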
[Figure: bar chart titled "Frequency of Column Name", frequency axis from 0 to 30. The axis labels are column names, truncated in extraction; legible examples include BLDG_CODE, DEPT_CODE, TERM_CODE, CRN and AREA]
Column Overlap Testing
• The Candidates vs. The Non-Candidates
\[
\text{Column Overlap Score} = \frac{\text{Overlap Count}}{\text{Minimum Individual Column Count}}
\]
Table 1               Table 1 Count   Table 2               Table 2 Count   Combined Count   Score
GWTERRS_ID            25,674          SRTIDEN_ID            174,994         10               0.000389499
SARADAP_MAJR_CODE_1   138             SGBSTDN_MAJR_CODE_1   169             132              0.956521739
SCBCRSE_CRSE_NUMB     2,413           SCRSCHD_CRSE_NUMB     2,301           2,301            1
SHBDIPL_NAME          25,512          SWRAPFE_NAME          3,111           53               0.017036323
SHRASES_BEGIN_DATE    645             SARPSES_BEGIN_DTE     7,775           0                0
SHRIDEN_CITY          342             SPREMRG_CITY          1,918           254              0.742690058
SMBARUL_AREA          397             SMBATRK_AREA          389             248              0.637532134
SMRDORJ_CRN           14,305          SSRMEET_CRN           12,899          10,385           0.805101171
SMRDOUS_AREA          362             SMRSACA_AREA          259             257              0.992277992
SPREMRG_CITY          1,918           SWVSBGI_CITY          13,523          111              0.057872784
Examples from Overlap Scores
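The overlap score can be sketched in Python as the number of shared values divided by the smaller column's count. Treating the counts as counts of distinct values is an assumption, since the slides do not say how duplicates were handled; the function names are also hypothetical.

```python
def column_overlap_score(values_a, values_b):
    """Overlap count divided by the smaller column's distinct-value count."""
    distinct_a, distinct_b = set(values_a), set(values_b)
    smaller = min(len(distinct_a), len(distinct_b))
    if smaller == 0:
        return 0.0  # an empty column cannot overlap with anything
    return len(distinct_a & distinct_b) / smaller


def score_from_counts(count_a, count_b, overlap_count):
    """Same score, computed from counts that are already known."""
    smaller = min(count_a, count_b)
    return overlap_count / smaller if smaller else 0.0


# Counts taken from the SARADAP_MAJR_CODE_1 / SGBSTDN_MAJR_CODE_1 row.
majr_score = score_from_counts(138, 169, 132)

# Toy values: two of the three distinct values in the smaller column are shared.
toy_score = column_overlap_score(["A", "B", "C"], ["B", "C", "D", "E"])
```

Dividing by the smaller count means a small lookup column fully contained in a large one still scores 1, as in the SCBCRSE_CRSE_NUMB row.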
Candidate vs. Non-Candidate Distributions
Conclusion
• Lots of Columns Appear to Match but Don’t Always…
• Which Ones Really Do Match?
• Some Solutions
• Expert Opinion
• Metadata-based Matchers
• Instance-based Matchers
• Semantic Matchers
• Making the Solutions Work Together
• Logistic Regression to Limit the Candidates
• Data Overlap Comparison of the Resulting Candidate Sets
• How Well Does it Work?
• Compare Overlap Counts from the Candidates with Randomly Chosen Columns
Problems / Wish List
• Too Many Steps / Too Many Tools
• Need Streamlining and Automation
• Would like to Take Advantage of
• Visualization
• Human-Computer Interaction
Thank You!
References
• Bell, R. M., Koren, Y., & Volinsky, C. (2010). All Together Now: A Perspective on the Netflix Prize. Chance, Vol. 23, No. 1, 24–29.
• Bellahsene, Z., Bonifati, A., & Rahm, E. (2011). Schema Matching and Mapping. Heidelberg: Springer.
• Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, Vol. 16, No. 3, 199–231.
• Doan, A., Halevy, A., & Ives, Z. (2012). Principles of Data Integration. Waltham: Elsevier Inc.
• Han, J., Kamber, M., & Pei, J. (2012). Data Mining Concepts and Techniques. Waltham: Morgan Kaufmann.
• Hosmer, D. P. (2013). Logistic Regression Modeling. Stowe: School of Public Health and Health Sciences, University of Massachusetts.
• Tamhane, A. C., & Dunlop, D. D. (1999). Statistics and Data Analysis: From Elementary to Intermediate. Pearson.