
COLUMN MATCHING Machine Learning Systems for Data Integration

Melody Penning

[email protected]

Information and Data Quality Conference

November 4-7, 2013, Little Rock, AR

The Problem

• 4,003 Tables with 92,161 Columns

• 1,347 have Primary Key Constraints

• 1,527 Columns are Foreign Keys

• Column usage may have changed over time…

Melody Penning 11/2013 2

-http://it.wikipedia.org/wiki/Teraminx

Data Integration

• Data Integration is Hard

• Systems Reasons

• Logical Reasons

• Social and Administrative Reasons


--Doan, Halevy and Ives, Principles of Data Integration

Data Integration Architecture

• Spectrum

• Virtual Integration

• Warehousing


[Figure: Four sources (RDBMS, RDBMS, HTML, XML), each behind a Wrapper / Extractor, feed a Mediated Schema or Warehouse.]

--Doan, Halevy and Ives, Principles of Data Integration

Current Solutions

• Expert Opinion

• Metadata-based Matchers

• Instance-based Matchers

• Semantic Matchers


--Bellahsene, Bonifati and Rahm, Schema Matching and Mapping

[Figure: Most Frequent Column Names. Bar chart with a frequency axis from 0 to 1400. Columns shown: ACTIVITY_DATE, DESC, TERM_CODE, TERM_CODE_EFF, PROGRAM, SEQNO, COLL_CODE, AREA, DEPT_CODE, SEQ_NO, SBGI_CODE, TEXT, USER, CONNECTOR_MAX, ACTN_CODE.]

Which Pieces Fit Together?

-http://en.wikipedia.org/wiki/Jigsaw_puzzle

The Goal

• Efficiently Identify Columns that Match

• Gather Convincing Evidence

• Test the Conclusions

• Clues

• Database Metadata

• Naming Scheme

• Column Contents


Database Metadata


Column                Description
DATA_TYPE             Datatype of the column
DATA_LENGTH           Length of the column (in bytes)
DATA_PRECISION        Decimal precision for NUMBER datatype; binary precision for FLOAT datatype, null for all other datatypes
DATA_SCALE            Digits to right of decimal point in a number
NULLABLE              Specifies whether a column allows NULLs. Value is N if there is a NOT NULL constraint on the column or if the column is part of a PRIMARY KEY. The constraint should be in an ENABLE VALIDATE state.
DEFAULT_LENGTH        Length of default value for the column
NUM_DISTINCT          Number of distinct values
LOW_VALUE             Low value in the column
HIGH_VALUE            High value in the column
DENSITY               Density of the column
NUM_NULLS             Number of nulls in the column
NUM_BUCKETS           Number of buckets in the histogram for the column
HISTOGRAM             Indicates existence/type of histogram
LAST_ANALYZED         Date on which this column was most recently analyzed
SAMPLE_SIZE           Sample size used in analyzing this column
GLOBAL_STATS          For partitioned tables, indicates whether column statistics were collected for the table as a whole (YES) or were estimated from statistics on underlying partitions and subpartitions (NO)
AVG_COL_LEN           Average length of the column (in bytes)
CHAR_LENGTH           Displays the length of the column in characters. This value only applies to the following datatypes:
CHAR_COL_DECL_LENGTH  Length
CHAR_USED             B | C. B indicates that the column uses BYTE length semantics. C indicates that the column uses CHAR length semantics. NULL indicates the datatype is not a CHAR type.
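
To use metadata like this as classifier input, each column's metadata row has to be turned into numbers first. The sketch below shows one possible encoding. The choice of fields, the log scaling, and the example row are illustrative assumptions, not the encoding used in the talk.

```python
import math


def metadata_features(row):
    """Turn one column-metadata row (a dict) into a numeric feature vector.

    The selection of fields and the scaling are assumptions for illustration.
    """
    data_type = row.get("DATA_TYPE", "")
    return [
        1.0 if data_type == "NUMBER" else 0.0,             # numeric column?
        1.0 if data_type in ("CHAR", "VARCHAR2") else 0.0,  # character column?
        float(row.get("DATA_LENGTH") or 0),                # length in bytes
        1.0 if row.get("NULLABLE") == "N" else 0.0,        # NOT NULL or part of a PK
        math.log1p(row.get("NUM_DISTINCT") or 0),          # damp very large counts
        math.log1p(row.get("NUM_NULLS") or 0),
    ]


# Hypothetical metadata row; the values are made up.
example_row = {
    "DATA_TYPE": "VARCHAR2",
    "DATA_LENGTH": 9,
    "NULLABLE": "N",
    "NUM_DISTINCT": 25674,
    "NUM_NULLS": 0,
}
features = metadata_features(example_row)
```

Missing statistics (for example, a column that was never analyzed) are treated as zero here; a real pipeline would need to decide how to handle them.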

Machine Learning or Statistics?

• Data Modeling vs. Algorithmic Modeling


-Leo Breiman, Statistical Modeling: The Two Cultures

-Bill Howe, University of Washington; Data Science on Coursera.org

Learning to Predict

• Supervised Learning

• Labeled examples are available

• Ex: Classification

• Semi-supervised Learning

• Some examples are labeled

• Some are not

• Unsupervised Learning

• Labeled examples are not available

• Ex: Clustering


-Han, Kamber and Pei, Data Mining Concepts and Techniques

• Active Learning

• User interaction

Classifiers

• Predict Class: Binary {0,1} or Multiclass {0,1,2,3...}

• Classification Problem Examples

• Loans: Safe or Risky

• Tumor: Malignant or Benign

• Customers: Will Click or Will not Click

• Classification Algorithms Examples

• Logistic Regression

• Rule Based

• Decision Tree Induction

• Naïve Bayes


Logistic Regression

• Inputs: Column Metadata

• Known and Unknown Key Constraints

• Training Set

• Columns with Known Key Constraints Labeled ‘1’

• All Others Labeled ‘0’

• Outputs are Probabilities and Classification Level


Metadata

Metadata Columns…  Training Label  FROM  INTO  IP_1    IP_0    XP_1    XP_0    LEVEL  phat    lcl     ucl
xxxx…              0               0     1     0.6678  0.3322  0.6686  0.3314  1      0.6678  0.6363  0.6979
xxxx…              0               0     1     0.7092  0.2908  0.7098  0.2902  1      0.7092  0.6821  0.7349
xxxx…              0               0     1     0.5978  0.4022  0.5984  0.4016  1      0.5978  0.5667  0.6281
xxxx…              0               0     1     0.5241  0.4759  0.5247  0.4753  1      0.5241  0.4896  0.5583
xxxx…              0               0     1     0.7359  0.2641  0.7366  0.2634  1      0.7359  0.7079  0.7621
xxxx…              0               0     1     0.6499  0.3501  0.6506  0.3494  1      0.6499  0.6178  0.6806
xxxx…              0               0     1     0.5181  0.4819  0.5187  0.4813  1      0.5181  0.4829  0.5531

Example Output From SAS Proc Logistic
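
The talk used SAS Proc Logistic. As a rough illustration of the same idea, the sketch below fits a logistic regression by plain gradient descent in Python. The two features, the toy training rows, the learning rate, and the 0.5 cutoff are all assumptions made for the example, not values from the study.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    """Probability that a column is a key column; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Toy training set (made up). Features: [not_nullable, distinct_ratio].
# Label 1 = column has a known key constraint, 0 = all others.
X = [[1, 0.95], [1, 0.90], [1, 0.80], [0, 0.10], [0, 0.20], [1, 0.15], [0, 0.85]]
y = [1, 1, 1, 0, 0, 0, 0]

w = fit_logistic(X, y)
probabilities = [predict_proba(w, xi) for xi in X]
levels = [1 if p >= 0.5 else 0 for p in probabilities]  # classification level
```

The `probabilities` and `levels` lists play the role of the predicted probability and classification level columns in the SAS output above.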

Combining Match Predictions

• “The More The Merrier”

• Ensemble Methods Improve Prediction

• Match Score Aggregation Methods

• Average, Minimum or Maximum

• Weighted-sum

• Rule Based


-Bell, Koren and Volinsky, All Together Now: A Perspective on the Netflix Prize
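
The aggregation methods listed above can be sketched in a few lines of Python. The matcher scores and weights below are hypothetical, and the rule-based option is left out because the slides do not state which rules were used.

```python
def aggregate(scores, method="average", weights=None):
    """Combine match scores from several matchers into one score."""
    if not scores:
        raise ValueError("at least one score is required")
    if method == "average":
        return sum(scores) / len(scores)
    if method == "minimum":
        return min(scores)
    if method == "maximum":
        return max(scores)
    if method == "weighted_sum":
        if weights is None or len(weights) != len(scores):
            raise ValueError("weighted_sum needs one weight per score")
        return sum(w * s for w, s in zip(weights, scores))
    raise ValueError("unknown method: " + method)


# Hypothetical scores from a metadata, a name, and an instance-based matcher.
scores = [0.7, 0.9, 0.2]
combined = {
    "average": aggregate(scores, "average"),
    "minimum": aggregate(scores, "minimum"),
    "maximum": aggregate(scores, "maximum"),
    "weighted_sum": aggregate(scores, "weighted_sum", weights=[0.5, 0.3, 0.2]),
}
```

Minimum is the most conservative choice (every matcher must agree), maximum the most permissive, and the weighted sum lets a more trusted matcher count for more.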

Benefiting from Errors

• Semi-Supervised Classification

• Classification of Key Constraint Columns using Metadata

• Logistic Regression

• Requires Lots of Data Preparation

• Followed by Semantic Matching

• Columns Grouped by Name

• Instance-based Matching

• Overlapping Data Counted


Logistic Regression Results

• Classifier Performance

• Matching Candidates come from the False Positives


Classification Table

             Correct            Incorrect          Percentages
Prob Level   Event  Non-Event   Event  Non-Event   Correct  Sensitivity  Specificity  False POS  False NEG
0.500        845    7153        459    2252        74.7     27.3         94.0         35.2       23.9
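
The percentages in the classification table follow from the four counts. The sketch below recomputes them. The formulas are the ones that reproduce the slide's figures: False POS is taken as the share of predicted events that are actually non-events, and False NEG as the share of predicted non-events that are actually events.

```python
def classification_percentages(tp, tn, fp, fn):
    """Summary percentages from correct/incorrect event and non-event counts."""
    total = tp + tn + fp + fn
    return {
        "correct": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),  # events found among actual events
        "specificity": 100.0 * tn / (tn + fp),  # non-events found among actual non-events
        "false_pos": 100.0 * fp / (tp + fp),    # wrong among predicted events
        "false_neg": 100.0 * fn / (tn + fn),    # wrong among predicted non-events
    }


# Counts from the classification table at probability level 0.500.
stats = classification_percentages(tp=845, tn=7153, fp=459, fn=2252)
rounded = {name: round(value, 1) for name, value in stats.items()}
```

The 459 incorrectly predicted events are the false positives from which the matching candidates are drawn.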

Semantic Matching

• Column Names of Two Groups:

• Candidates – Indicated by Logistic Regression

• Non-Candidates – Randomly Sampled from Data Columns

• Columns with Count Greater than One
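
Grouping columns by name and keeping only names that occur more than once can be sketched as follows. The sketch assumes that the text before the first underscore is a table prefix and the rest is the column's own name; that rule is an assumption inferred from the names shown in this deck, not a documented convention.

```python
from collections import Counter


def short_name(column):
    """Drop the assumed table prefix (text before the first underscore)."""
    prefix, sep, rest = column.partition("_")
    return rest if sep else column


# Column names taken from the overlap examples in this deck.
columns = [
    "SMBARUL_AREA", "SMBATRK_AREA", "SMRDOUS_AREA",
    "SMRDORJ_CRN", "SSRMEET_CRN",
    "SHBDIPL_NAME",
]

counts = Counter(short_name(c) for c in columns)
repeated = {name: n for name, n in counts.items() if n > 1}
```

Exact name grouping like this would miss near-matches such as BEGIN_DATE and BEGIN_DTE; handling those would need a fuzzier comparison.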


[Figure: Frequency of Column Name. Bar chart with a frequency axis from 0 to 30 and one bar per column name. Names on the horizontal axis include BLDG_CODE, COLL_CODE, DEPT_CODE, SUBJ_CODE, TERM_CODE, CRSE_NUMB, PROGRAM, CRN and AREA; several labels are truncated in the source.]

Column Overlap Testing

• The Candidates vs. The Non-Candidates

\[
\text{Column Overlap Score} = \frac{\text{Overlap Count}}{\text{Minimum Individual Column Count}}
\]

Table 1              Table 1 Count  Table 2              Table 2 Count  Combined (Overlap) Count  Score
GWTERRS_ID           25,674         SRTIDEN_ID           174,994        10                        0.000389499
SARADAP_MAJR_CODE_1  138            SGBSTDN_MAJR_CODE_1  169            132                       0.956521739
SCBCRSE_CRSE_NUMB    2,413          SCRSCHD_CRSE_NUMB    2,301          2,301                     1
SHBDIPL_NAME         25,512         SWRAPFE_NAME         3,111          53                        0.017036323
SHRASES_BEGIN_DATE   645            SARPSES_BEGIN_DTE    7,775          0                         0
SHRIDEN_CITY         342            SPREMRG_CITY         1,918          254                       0.742690058
SMBARUL_AREA         397            SMBATRK_AREA         389            248                       0.637532134
SMRDORJ_CRN          14,305         SSRMEET_CRN          12,899         10,385                    0.805101171
SMRDOUS_AREA         362            SMRSACA_AREA         259            257                       0.992277992
SPREMRG_CITY         1,918          SWVSBGI_CITY         13,523         111                       0.057872784

Examples from Overlap Scores
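
The overlap score can be computed directly from two columns' values. In the sketch below, the counts are taken to be counts of distinct values. That reading is an assumption, although it is consistent with the example rows (for instance, 132 / min(138, 169) = 0.9565...).

```python
def score_from_counts(overlap_count, count_a, count_b):
    """Overlap count divided by the smaller of the two column counts."""
    smaller = min(count_a, count_b)
    return overlap_count / smaller if smaller else 0.0


def column_overlap_score(values_a, values_b):
    """Score two columns by their shared distinct values (assumed reading)."""
    a, b = set(values_a), set(values_b)
    return score_from_counts(len(a & b), len(a), len(b))


# Row from the examples: SARADAP_MAJR_CODE_1 vs SGBSTDN_MAJR_CODE_1.
majr_score = score_from_counts(132, 138, 169)

# Toy columns (made up): two of the three values in the smaller column are shared.
toy_score = column_overlap_score(["A", "B", "C"], ["B", "C", "D", "E"])
```

Dividing by the smaller count means a small lookup column fully contained in a large one still scores 1, as in the SCBCRSE_CRSE_NUMB row.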


Candidate vs. Non-Candidate Distributions


Conclusion

• Lots of Columns Appear to Match but Don’t Always…

• Which Ones Really Do Match?

• Some Solutions

• Expert Opinion

• Metadata-based Matchers

• Instance-based Matchers

• Semantic Matchers

• Making the Solutions Work Together

• Logistic Regression to Limit the Candidates

• Data Overlap Comparison of the Resulting Candidate Sets

• How Well Does it Work?

• Compare Overlap Counts from the Candidates with Randomly Chosen Columns


Problems / Wish List

• Too Many Steps / Too Many Tools

• Need Streamlining and Automation

• Would like to Take Advantage of

• Visualization

• Human-Computer Interaction


Thank You!


References

• Bell, R. M., Koren, Y., & Volinsky, C. (2010). All together now: A perspective on the Netflix Prize. Chance, Volume 23, Issue 1, p24-29.

• Bellahsene, Z., Bonifati, A., & Rahm, E. (2011). Schema Matching and Mapping. Heidelberg: Springer.

• Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, Vol. 16, No. 3, 199–231.

• Doan, A., Halevy, A., & Ives, Z. (2012). Principles of Data Integration. Waltham: Elsevier Inc.

• Han, J., Kamber, M., & Pei, J. (2012). Data Mining Concepts and Techniques. Waltham: Morgan Kaufmann.

• Hosmer, D. P. (2013). Logistic Regression Modeling. Stowe: School of Public Health and Health Sciences, University of Massachusetts.

• Tamhane, A. C., & Dunlop, D. D. (1999). Statistics and Data Analysis: From Elementary to Intermediate. Pearson.

