OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data Linkage
Ben-Gurion University of the Negev, Faculty of Engineering Sciences, Department of Information Systems Engineering
Ma'ayan Gafny, Asaf Shabtai, Lior Rokach, Yuval Elovici

Apr 01, 2015


Alvin Halsted
Transcript
Page 1

OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data Linkage

Ben-Gurion University of the Negev
Faculty of Engineering Sciences
Department of Information Systems Engineering

Ma'ayan Gafny, Asaf Shabtai, Lior Rokach, Yuval Elovici

Page 2

Definitions

Page 3

Definitions

$T_A$ - a given table A
$T_B$ - a given table B (our goal is to link records from table $T_A$ with one or more records from $T_B$)
$|T_A|$ - the number of records in $T_A$
$|T_B|$ - the number of records in $T_B$
$A$ - the set of attributes of table $T_A$, where $a_i$ is the i-th attribute
$|A|$ - the number of attributes in $T_A$
$B$ - the set of attributes of table $T_B$, where $b_i$ is the i-th attribute
$|B|$ - the number of attributes in $T_B$
$r^{(a)} \in T_A$ - a record from table $T_A$
$r^{(b)} \in T_B$ - a record from table $T_B$
$T_A \times T_B$ - the table generated by applying the Cartesian product of $T_A$ and $T_B$
$r = (r^{(a)}, r^{(b)}) \in T_A \times T_B$ - a record of $T_A \times T_B$
$T_{AB} \subseteq T_A \times T_B$ - the set of matching records
$\overline{T_{AB}} \subseteq T_A \times T_B$ - the set of non-matching records
$d$ - a node in the OCCT model
$A_d \subseteq A$ - the subset of attributes of $T_A$ that were already selected as splitting attributes in the path from the root of the tree to node $d$
$T_{AB}(d) \subseteq T_{AB}$ - the subset of matching instances at node $d$ of the OCCT tree
$\mathit{Split}_a(T_{AB}(d)) = T_{AB}(d)(a)$ - the splitting of $T_{AB}(d)$ into $n$ subsets according to attribute $a$ such that $\forall i = 1..n:\ T_{AB}(d_i)(a) = \{ r \in T_{AB}(d) \mid a = v_i \}$
$\sigma_p(T_{AB}(d))$ - the selection operator, used to select the records in $T_{AB}(d)$ that satisfy the given predicate $p$ (in this case $p$ is $a = v_i$)
$\pi_A(T_{AB}(d))$ - the projection operator, used to select the subset of attributes in $T_{AB}(d)$ that appear in the attribute collection $A$
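The three operators defined on this slide (selection, projection and the split by an attribute) can be illustrated with a minimal Python sketch. The toy records, the attribute names and the function names `select`, `project` and `split` are assumptions made for illustration; they do not come from the slides.

```python
from collections import defaultdict

# Hypothetical matching records at a node d; attribute names are illustrative.
T_AB_d = [
    {"a1": "v1", "a2": "x", "b1": 10},
    {"a1": "v1", "a2": "y", "b1": 20},
    {"a1": "v2", "a2": "x", "b1": 30},
]

def select(records, predicate):
    """Selection: keep the records that satisfy predicate p."""
    return [r for r in records if predicate(r)]

def project(records, attributes):
    """Projection: keep only the listed attributes of every record."""
    return [{name: r[name] for name in attributes} for r in records]

def split(records, attribute):
    """Split_a: partition the records by the value of attribute a."""
    subsets = defaultdict(list)
    for r in records:
        subsets[r[attribute]].append(r)
    return dict(subsets)

subsets = split(T_AB_d, "a1")                           # one subset per value v_i
v1_records = select(T_AB_d, lambda r: r["a1"] == "v1")  # predicate p: a1 = v1
b_side = project(T_AB_d, ["b1"])                        # attributes of table B only
```

In this sketch each subset produced by `split` equals the selection with the predicate a = v_i, which mirrors the set-builder form in the definition.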

Page 4

Definitions

TA: columns a1, a2, a3, a4, …, an
TB: columns b1, b2, b3, b4, …, bm

A = {a1, a2, a3, …, an}, |A| = n
|TA| = number of records in TA
r(a) = a record from TA

B = {b1, b2, b3, …, bm}, |B| = m
|TB| = number of records in TB
r(b) = a record from TB

Page 5

Definitions

TA: columns a1, a2, a3, a4, …, an
TB: columns b1, b2, b3, b4, …, bm

TA x TB: columns a1, a2, a3, a4, …, an, b1, b2, b3, b4, …, bm

r = (r(a), r(b))

Page 6

Definitions

TA x TB: columns a1, a2, a3, a4, …, an, b1, b2, b3, b4, …, bm, Target

[Figure: each row of TA x TB carries a Target value. The four rows labeled "match" form TAB; the four rows labeled "no-match" form the set of non-matching records (TAB with an overbar).]
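The partition of TA x TB into matching and non-matching records can be sketched in a few lines of Python. The toy tables and the `is_match` rule are hypothetical and serve only to label the example pairs; the slides do not say how the Target column is obtained.

```python
from itertools import product

# Hypothetical toy tables; attribute names and values are illustrative only.
T_A = [{"a1": "x", "a2": 1}, {"a1": "y", "a2": 2}]
T_B = [{"b1": "x"}, {"b1": "y"}, {"b1": "z"}]

def is_match(r_a, r_b):
    # Assumed labeling rule, used only to fill the Target column of the toy data.
    return r_a["a1"] == r_b["b1"]

# T_A x T_B: every pair r = (r(a), r(b)).
cartesian = list(product(T_A, T_B))

# T_AB holds the matching pairs, T_AB_bar the non-matching ones.
T_AB = [(r_a, r_b) for r_a, r_b in cartesian if is_match(r_a, r_b)]
T_AB_bar = [(r_a, r_b) for r_a, r_b in cartesian if not is_match(r_a, r_b)]
```

The two lists are disjoint and together cover the whole Cartesian product, as in the figure.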

Page 7

Definitions

[Figure: the same TA x TB table as on page 6, with the Target column marking the "match" rows (TAB) and the "no-match" rows (TAB with an overbar).]

Page 8

Definitions

[Figure: node d is split on attribute a. The branch a = v1 leads to child d1, whose records (columns a1, a2, …, an, b1, …, bm) all have a = v1; the branch a = v2 leads to child d2, whose records all have a = v2.]

Page 9

Definitions

[Figure: a tree with nodes d1, d2, d3, d4, d5, annotated with Ad2 = {a1} and Ad4 = {a1, a2}.]

Ad ⊆ A - the subset of attributes of TA that were already selected as splitting attributes in the path from the root of the tree to node d.
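As a small illustration of Ad, the sketch below walks from a node up to the root and collects the splitting attributes on the way. The tree shape is an assumption reconstructed from the two annotations (d1 as root splitting on a1, d2 splitting on a2); the original figure is not available, and the function name `used_attributes` is hypothetical.

```python
# Assumed tree shape: d1 is the root and splits on a1; d2 splits on a2.
parent = {"d2": "d1", "d3": "d1", "d4": "d2", "d5": "d2"}
split_attribute = {"d1": "a1", "d2": "a2"}

def used_attributes(node):
    """Return A_d: the attributes used for splitting on the path from the root to node."""
    attrs = set()
    while node in parent:
        node = parent[node]               # move one step towards the root
        attrs.add(split_attribute[node])  # attribute tested at that ancestor
    return attrs

A_d2 = used_attributes("d2")
A_d4 = used_attributes("d4")
```

Under this assumed shape the results agree with the slide: Ad2 = {a1} and Ad4 = {a1, a2}, while the root itself has an empty set.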

Page 10

Running Examples

Page 11

The data set

Customer Type  Customer City  Request Location  Request Day Of Week  Request Part Of Day  Request ID
private        Berlin         Berlin            Friday               Afternoon            1
private        Hamburg        Hamburg           Wednesday            Afternoon            2
business       Berlin         Berlin            Wednesday            Morning              3
private        Berlin         Berlin            Wednesday            Morning              4
private        Berlin         Berlin            Saturday             Afternoon            5
private        Berlin         Berlin            Thursday             Morning              6
private        Berlin         Berlin            Friday               Afternoon            7
business       Berlin         Berlin            Saturday             Afternoon            8
private        Berlin         Berlin            Saturday             Afternoon            9
business       Hamburg        Hamburg           Friday               Afternoon            10
business       Hamburg        Hamburg           Monday               Afternoon            11
private        Hamburg        Hamburg           Saturday             Afternoon            12
private        Berlin         Berlin            Monday               Afternoon            13
private        Bonn           Berlin            Monday               Afternoon            14
private        Berlin         Berlin            Monday               Afternoon            15
private        Bonn           Bonn              Saturday             Morning              16
private        Hamburg        Hamburg           Saturday             Morning              17
private        Hamburg        Hamburg           Saturday             Morning              18
private        Hamburg        Hamburg           Friday               Afternoon            19

Page 12

The data set - cont.

Customer Type  Customer City  Request Location  Request Day Of Week  Request Part Of Day  Request ID
private        Bonn           Hamburg           Friday               Afternoon            20
private        Berlin         Hamburg           Friday               Morning              21
business       Berlin         Berlin            Friday               Morning              22
private        Berlin         Berlin            Friday               Morning              23
private        Berlin         Berlin            Wednesday            Afternoon            24
private        Berlin         Berlin            Thursday             Afternoon            25
business       Berlin         Berlin            Thursday             Afternoon            26
business       Bonn           Bonn              Monday               Afternoon            27
private        Hamburg        Bonn              Monday               Afternoon            28
business       Berlin         Bonn              Monday               Afternoon            29
business       Bonn           Bonn              Wednesday            Afternoon            30
private        Bonn           Bonn              Friday               Afternoon            31
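The branch weights W_i that appear on the later CGJ slides appear to be the fraction of the 31 requests that take each attribute value. The following sketch recomputes them from the Request Location and Request Day Of Week columns of the data set above; the variable and function names are illustrative, not the authors'.

```python
from collections import Counter
from fractions import Fraction

# Request Location and Request Day Of Week, in Request ID order (1..31).
locations = (
    "Berlin Hamburg Berlin Berlin Berlin Berlin Berlin Berlin Berlin "
    "Hamburg Hamburg Hamburg Berlin Berlin Berlin Bonn Hamburg Hamburg "
    "Hamburg Hamburg Hamburg Berlin Berlin Berlin Berlin Berlin "
    "Bonn Bonn Bonn Bonn Bonn"
).split()
days = (
    "Friday Wednesday Wednesday Wednesday Saturday Thursday Friday "
    "Saturday Saturday Friday Monday Saturday Monday Monday Monday "
    "Saturday Saturday Saturday Friday Friday Friday Friday Friday "
    "Wednesday Thursday Thursday Monday Monday Monday Wednesday Friday"
).split()

def weights(values):
    """Fraction of the records that take each attribute value."""
    counts = Counter(values)
    return {value: Fraction(n, len(values)) for value, n in counts.items()}

location_weights = weights(locations)
day_weights = weights(days)
```

For Request Location this gives 16/31, 9/31 and 6/31 for Berlin, Hamburg and Bonn, matching the weights shown on the CGJ slide for that attribute.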

Page 13

Coarse Grained Jaccard

Page 14

Coarse Grained Jaccard - Splitting the root of the tree

Three candidates for split:
• Request location
• Request day of week
• Request part of day

Page 15

CGJ - Splitting the root of the tree

Candidate attribute: Request location. Node d is tested once per value:

reqLocation = Berlin vs. reqLocation != Berlin: W1 = 16/31, Score1 = 1/23
reqLocation = Hamburg vs. reqLocation != Hamburg: W2 = 9/31, Score2 = 2/23
reqLocation = Bonn vs. reqLocation != Bonn: W3 = 6/31, Score3 = 1/23

Score(Split_reqLocation) = W1 * Score1 + W2 * Score2 + W3 * Score3 = 0.0561
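Written out, the weighted sum on this slide is as follows. The summation form is inferred from the * and + operators between the weights and scores on the slide; the symbols follow the slide's own labels.

```latex
\[
\mathrm{Score}(\mathit{Split}_{reqLocation})
  = \sum_{i=1}^{3} W_i \cdot \mathit{Score}_i
  = \frac{16}{31}\cdot\frac{1}{23} + \frac{9}{31}\cdot\frac{2}{23} + \frac{6}{31}\cdot\frac{1}{23}
  = \frac{40}{713} \approx 0.0561
\]
```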

Page 16

CGJ - Splitting the root of the tree

Candidate attribute: Request day of week. Node d is tested once per value:

dayOfWeek = Monday vs. dayOfWeek != Monday: W1 = 7/31, Score1 = 3/15
dayOfWeek = Wednesday vs. dayOfWeek != Wednesday: W2 = 5/31, Score2 = 5/15
dayOfWeek = Thursday vs. dayOfWeek != Thursday: W3 = 3/31, Score3 = 3/15
dayOfWeek = Friday vs. dayOfWeek != Friday: W4 = 9/31, Score4 = 5/15
dayOfWeek = Saturday vs. dayOfWeek != Saturday: W5 = 7/31, Score5 = 3/15

(The slide labels the fifth test "Friday" a second time; the weight 7/31 corresponds to the seven Saturday requests in the data set.)

Score(Split_dayOfWeek) = W1 * Score1 + W2 * Score2 + W3 * Score3 + W4 * Score4 + W5 * Score5 = 0.260

Page 17

CGJ - Splitting the root of the tree

Candidate attribute: Request part of day.

partOfDay = Afternoon vs. partOfDay = Morning: Score1 = 4/23

Score(Split_partOfDay) = 0.173

Page 18

Coarse Grained Jaccard - Splitting the root of the tree

Three candidates for split:
• Request location: 0.0561
• Request day of week: 0.260
• Request part of day: 0.173

The split in the root
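The three candidate scores can be reproduced from the per-branch weights and scores shown on the preceding CGJ slides. This is a minimal sketch under two assumptions: the score of a split is the weighted sum of its branch scores, and the binary part-of-day split contributes its single score with weight 1. The slides do not state the selection rule on this page, so the sketch only ranks the candidates from the lowest to the highest score.

```python
from fractions import Fraction as F

# (W_i, Score_i) pairs copied from the CGJ slides.
candidates = {
    "reqLocation": [(F(16, 31), F(1, 23)), (F(9, 31), F(2, 23)), (F(6, 31), F(1, 23))],
    "dayOfWeek": [
        (F(7, 31), F(3, 15)), (F(5, 31), F(5, 15)), (F(3, 31), F(3, 15)),
        (F(9, 31), F(5, 15)), (F(7, 31), F(3, 15)),
    ],
    # Assumption: a binary split has one score, so its weight is taken as 1.
    "partOfDay": [(F(1), F(4, 23))],
}

def split_score(pairs):
    """Weighted sum of the branch scores of one candidate split."""
    return sum(w * s for w, s in pairs)

scores = {name: split_score(pairs) for name, pairs in candidates.items()}
ranking = sorted(scores, key=scores.get)  # lowest score first

for name in ranking:
    print(f"{name}: {scores[name]} ~ {float(scores[name]):.4f}")
```

The exact values are 40/713 for Request location and 121/465 for Request day of week, which agree with the slide's 0.0561 and 0.260. The part-of-day value 4/23 is about 0.1739, which the slide shows as 0.173.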

Page 19

Fine Grained Jaccard

Page 20

Fine Grained Jaccard - Splitting the root of the tree

[Figure: root node d split into two branches, Req. Location = Berlin and Req. Location != Berlin.]

Page 21

Least Probable Intersections

Page 22

LPI - Splitting the root of the tree

[Figure: root node d split into two branches, Req. Location = Berlin and Req. Location != Berlin.]

Page 23

[Table: the 31 requests of the data set from pages 11 and 12 (Customer Type, Customer City, Request Location, Request Day Of Week, Request Part Of Day), partitioned into the branches Req. Location = Berlin and Req. Location != Berlin.]

Page 24

LPI - Splitting the root of the tree

[Figure: root node d split into two branches, Req. Location = Berlin and Req. Location != Berlin.]

Page 25

Maximum Likelihood Estimation

Page 26

MLE - Splitting the root of the tree

[Figure: a tree rooted at Request Location with three branches: Berlin, Bonn, Hamburg. Each branch leads to a leaf holding the attributes Cust. City and Cust. Type.]

p(Cust. City | Cust. Type)    p(Cust. Type | Cust. City)
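The two conditional probabilities named on this slide can be estimated from record counts. The sketch below does this for the 16 requests of the data set whose Request Location is Berlin, using plain relative frequencies. This is a generic maximum likelihood estimate from counts; the slides do not show how OCCT builds or uses these models, and the function name `conditional` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# (Customer Type, Customer City) for the 16 requests with Request Location = Berlin.
berlin_leaf = (
    [("private", "Berlin")] * 11
    + [("private", "Bonn")]
    + [("business", "Berlin")] * 4
)

def conditional(pairs, given_index):
    """Estimate p(target | given) by relative frequency.

    given_index selects the conditioning attribute (0 = type, 1 = city);
    the result maps (target value, given value) to the estimated probability.
    """
    target_index = 1 - given_index
    joint = Counter(pairs)
    marginal = Counter(pair[given_index] for pair in pairs)
    return {
        (pair[target_index], pair[given_index]): Fraction(n, marginal[pair[given_index]])
        for pair, n in joint.items()
    }

p_city_given_type = conditional(berlin_leaf, given_index=0)
p_type_given_city = conditional(berlin_leaf, given_index=1)
```

For example, in this leaf the estimate of p(Cust. City = Berlin | Cust. Type = private) is 11/12 and the estimate of p(Cust. Type = business | Cust. City = Berlin) is 4/15.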