1 Fully Automatic Cross-Associations Deepayan Chakrabarti (CMU) Spiros Papadimitriou (CMU) Dharmendra Modha (IBM) Christos Faloutsos (CMU and IBM)

Dec 21, 2015


Fully Automatic Cross-Associations

Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)


Problem Definition

[Figure: a Customers versus Products matrix, rearranged into Customer Groups versus Product Groups.]

Simultaneously group customers and products, or documents and words, or users and preferences …


Problem Definition

Desiderata:

1. Simultaneously discover row and column groups

2. Fully Automatic: No “magic numbers”

3. Scalable to large matrices


Closely Related Work

Information-Theoretic Co-clustering [Dhillon+/2003]: the number of row and column groups must be specified

Desiderata:

Simultaneously discover row and column groups

Fully Automatic: No “magic numbers”

Scalable to large graphs


Other Related Work

K-means and variants [Pelleg+/2000, Hamerly+/2003]: Do not cluster rows and cols simultaneously

“Frequent itemsets” [Agrawal+/1994]: User must specify “support”

Information Retrieval [Deerwester+/1990, Hofmann/1999]: Choosing the number of “concepts”

Graph Partitioning [Karypis+/1998]: Number of partitions; measure of imbalance between clusters


What makes a cross-association “good”?

[Figure: two alternative cross-associations of the same matrix, side by side, each showing row groups versus column groups.]

Why is this better?

Good Clustering

1. Similar nodes are grouped together

2. As few groups as necessary

A few, homogeneous blocks

implies

Good Compression


Main Idea

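Good Compression implies Good Clustering

[Figure: a binary matrix partitioned into row groups and column groups; block i holds n_i^1 ones and n_i^0 zeros.]

$$p_i^1 = \frac{n_i^1}{n_i^1 + n_i^0}$$

$$\underbrace{\sum_i (n_i^1 + n_i^0)\, H(p_i^1)}_{\text{Code Cost}} \;+\; \underbrace{\sum_i \text{Cost of describing } n_i^1,\ n_i^0 \text{ and groups}}_{\text{Description Cost}}$$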
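The code cost term can be sketched in a few lines. This is only an illustration under stated assumptions: the matrix is a list of 0/1 rows, and `row_group` and `col_group` (names chosen here, not taken from the talk) map each row and column to a group index. The description cost is left out.

```python
import math

def binary_entropy(p):
    """Shannon entropy, in bits, of a 0/1 source with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def code_cost(matrix, row_group, col_group, k, l):
    """Sum over blocks of (n1 + n0) * H(n1 / (n1 + n0)), in bits."""
    ones = [[0] * l for _ in range(k)]   # n1 per block
    size = [[0] * l for _ in range(k)]   # n1 + n0 per block
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_group[r]][col_group[c]] += value
            size[row_group[r]][col_group[c]] += 1
    total = 0.0
    for a in range(k):
        for b in range(l):
            if size[a][b]:
                total += size[a][b] * binary_entropy(ones[a][b] / size[a][b])
    return total

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# One big mixed block versus four homogeneous blocks.
one_block = code_cost(matrix, [0, 0, 0, 0], [0, 0, 0, 0], 1, 1)
two_by_two = code_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
print(one_block, two_by_two)
```

Homogeneous blocks have zero entropy, so the grouping that matches the block structure costs nothing to code, while the single mixed block costs one bit per cell.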


Examples

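$$\text{Total Encoding Cost} = \underbrace{\sum_i (n_i^1 + n_i^0)\, H(p_i^1)}_{\text{Code Cost}} \;+\; \underbrace{\sum_i \text{Cost of describing } n_i^1,\ n_i^0 \text{ and groups}}_{\text{Description Cost}}$$

One row group, one column group: Code Cost high, Description Cost low

m row groups, n column groups (every row and every column in its own group): Code Cost low, Description Cost high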
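The two extremes can be checked numerically. The sketch below is hedged: `toy_description_cost` is a deliberately crude stand-in invented for this illustration (bits to label each row and column with its group, plus bits to state the number of ones per block) and is not the description cost formula from the talk. Only the direction of the trade-off is meant to carry over.

```python
import math

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def block_counts(matrix, row_group, col_group, k, l):
    """Return (ones, size) per block for the given grouping."""
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_group[r]][col_group[c]] += value
            size[row_group[r]][col_group[c]] += 1
    return ones, size

def code_cost(ones, size):
    """Sum over non-empty blocks of size * H(fraction of ones)."""
    return sum(s * binary_entropy(o / s)
               for ones_row, size_row in zip(ones, size)
               for o, s in zip(ones_row, size_row) if s)

def toy_description_cost(size, m, n, k, l):
    # ASSUMPTION: crude proxy, not the talk's formula.
    label_bits = m * math.log2(k) + n * math.log2(l)
    count_bits = sum(math.log2(s + 1)
                     for size_row in size for s in size_row if s)
    return label_bits + count_bits

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
m, n = 4, 4

# Extreme 1: one row group, one column group.
ones, size = block_counts(matrix, [0] * m, [0] * n, 1, 1)
code_one = code_cost(ones, size)
desc_one = toy_description_cost(size, m, n, 1, 1)

# Extreme 2: every row and every column in its own group.
ones, size = block_counts(matrix, list(range(m)), list(range(n)), m, n)
code_all = code_cost(ones, size)
desc_all = toy_description_cost(size, m, n, m, n)

print(code_one, desc_one)
print(code_all, desc_all)
```

With a single block the code cost dominates; with one group per row and column the code cost drops to zero but the description grows, which is the trade-off the total encoding cost balances.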


What makes a cross-association “good”?

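[Figure: two alternative cross-associations of the same matrix, side by side, each showing row groups versus column groups.]

Why is this better?

$$\text{Total Encoding Cost} = \underbrace{\sum_i (n_i^1 + n_i^0)\, H(p_i^1)}_{\text{Code Cost}} \;+\; \underbrace{\sum_i \text{Cost of describing } n_i^1,\ n_i^0 \text{ and groups}}_{\text{Description Cost}}$$

Code Cost: low. Description Cost: low.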


Algorithms

[Figure: the matrix at successive steps of the search: k=1, l=2; k=2, l=2; k=2, l=3; k=3, l=3; k=3, l=4; k=4, l=4; k=4, l=5; ending with k = 5 row groups and l = 5 col groups.]


Algorithms

1. Start with initial matrix

2. Find good groups for fixed k and l

3. Choose better values for k and l

4. Repeat steps 2 and 3 to lower the encoding cost

5. Final cross-association (in the example, k = 5, l = 5)


Fixed k and l

[Figure: matrix with row groups and column groups.]

Shuffles:

for each row:
    shuffle it to the row group which minimizes the code cost


Ditto for column shuffles

… and repeat …
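One pass of row shuffles might look like the sketch below. It rests on assumptions not stated on the slides: block densities are computed once from the current grouping, each row is then sent to the row group whose densities code it in the fewest bits, and densities are clamped by a small `eps` to avoid log(0). All function and variable names are illustrative.

```python
import math

def block_densities(matrix, row_group, col_group, k, l):
    """Fraction of ones in each (row group, column group) block."""
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_group[r]][col_group[c]] += value
            size[row_group[r]][col_group[c]] += 1
    return [[ones[a][b] / size[a][b] if size[a][b] else 0.5
             for b in range(l)] for a in range(k)]

def row_bits(row, col_group, densities, l, eps=1e-9):
    """Bits to code one row using one row group's block densities."""
    n1 = [0] * l
    n0 = [0] * l
    for c, value in enumerate(row):
        if value:
            n1[col_group[c]] += 1
        else:
            n0[col_group[c]] += 1
    bits = 0.0
    for b in range(l):
        p = min(max(densities[b], eps), 1 - eps)  # clamp to avoid log(0)
        bits -= n1[b] * math.log2(p) + n0[b] * math.log2(1 - p)
    return bits

def shuffle_rows(matrix, row_group, col_group, k, l):
    """Reassign every row to the row group that codes it most cheaply."""
    dens = block_densities(matrix, row_group, col_group, k, l)
    return [min(range(k), key=lambda a: row_bits(row, col_group, dens[a], l))
            for row in matrix]

matrix = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
col_group = [0, 0, 1, 1]
row_group = [0, 0, 0, 1]          # deliberately poor starting point

new_row_group = shuffle_rows(matrix, row_group, col_group, 2, 2)

# Column shuffles: the same routine on the transposed matrix, roles swapped.
transposed = [list(col) for col in zip(*matrix)]
new_col_group = shuffle_rows(transposed, col_group, new_row_group, 2, 2)
print(new_row_group, new_col_group)
```

In this sketch a full iteration alternates the row pass and the column pass until the grouping stops changing.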


Choosing k and l

[Figure: example matrix labelled k = 5, l = 5.]

Split:

1. Find the row group R with the maximum entropy per row

2. Choose the rows in R whose removal reduces the entropy per row in R

3. Send these rows to the new row group, and set k=k+1


Similar for column groups too.
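The three split steps can be sketched for row groups as follows. Assumptions made here, beyond what the slides state: each row is tested independently against the unchanged group R, and if every row of R would move (or none would), the grouping is returned untouched. Names such as `split_row_group` are illustrative.

```python
import math

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def entropy_per_row(matrix, rows, col_group, l):
    """Code cost of the blocks formed by `rows`, divided by len(rows)."""
    if not rows:
        return 0.0
    ones = [0] * l
    size = [0] * l
    for r in rows:
        for c, value in enumerate(matrix[r]):
            ones[col_group[c]] += value
            size[col_group[c]] += 1
    bits = sum(size[b] * binary_entropy(ones[b] / size[b])
               for b in range(l) if size[b])
    return bits / len(rows)

def split_row_group(matrix, row_group, col_group, k, l):
    members = {a: [r for r, g in enumerate(row_group) if g == a]
               for a in range(k)}
    # Step 1: the row group R with the maximum entropy per row.
    worst = max(range(k),
                key=lambda a: entropy_per_row(matrix, members[a], col_group, l))
    base = entropy_per_row(matrix, members[worst], col_group, l)
    # Step 2: rows whose removal lowers the entropy per row of R.
    movers = [r for r in members[worst]
              if entropy_per_row(matrix,
                                 [x for x in members[worst] if x != r],
                                 col_group, l) < base]
    # ASSUMPTION: do not split if nothing, or everything, would move.
    if not movers or len(movers) == len(members[worst]):
        return list(row_group), k
    # Step 3: send these rows to a new group and set k = k + 1.
    new_groups = list(row_group)
    for r in movers:
        new_groups[r] = k
    return new_groups, k + 1

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
new_groups, new_k = split_row_group(matrix, [0, 0, 0, 0], [0, 0, 1, 1], 1, 2)
print(new_groups, new_k)
```

Only the last row lowers the group's entropy per row when removed, so it alone is sent to the new group. Column splits would apply the same routine to the transposed matrix.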

Algorithms

1. Start with initial matrix

2. Find good groups for fixed k and l (Shuffles)

3. Choose better values for k and l (Splits)

4. Repeat steps 2 and 3 to lower the encoding cost

5. Final cross-association (in the example, k = 5, l = 5)

Experiments

“Customer-Product” graph with Zipfian sizes, no noise: k = 5 row groups, l = 5 col groups

Experiments

“Quasi block-diagonal” graph with Zipfian sizes, noise=10%: k = 6 row groups, l = 8 col groups

Experiments

“White Noise” graph: we find the existing spurious patterns: k = 2 row groups, l = 3 col groups

Experiments

“CLASSIC”

• 3,893 documents

• 4,303 words

• 176,347 “dots”

Combination of 3 sources:

• MEDLINE (medical)

• CISI (info. retrieval)

• CRANFIELD (aerodynamics)

[Figure: the CLASSIC matrix, Documents versus Words.]

Experiments

“CLASSIC” graph of documents & words: k=15, l=19

[Figure: the cross-associated CLASSIC matrix, Documents versus Words, with document groups labelled MEDLINE (medical), CISI (Information Retrieval) and CRANFIELD (aerodynamics).]

Word groups highlighted in the figure:

• MEDLINE (medical): insipidus, alveolar, aortic, death, prognosis, intravenous

• MEDLINE (medical): blood, disease, clinical, cell, tissue, patient

• CISI (Information Retrieval): providing, studying, records, development, students, rules

• CISI (Information Retrieval): abstract, notation, works, construct, bibliographies

• CRANFIELD (aerodynamics): shape, nasa, leading, assumed, thin

• Another word group: paint, examination, fall, raise, leave, based

Experiments

“GRANTS”

• 13,297 documents

• 5,298 words

• 805,063 “dots”

[Figure: the GRANTS matrix, NSF Grant Proposals versus Words in abstract.]

Experiments

“GRANTS” graph of documents & words: k=41, l=28

[Figure: the cross-associated GRANTS matrix, NSF Grant Proposals versus Words in abstract.]

The Cross-Associations refer to topics:

• Genetics

• Physics

• Mathematics

• …

Experiments

“Who-trusts-whom” graph from epinions.com: k=18, l=16

[Figure: the cross-associated matrix, Epinions.com user versus Epinions.com user.]

Experiments

[Figure: Time (secs) versus Number of “dots”, with separate curves for Splits and Shuffles.]

Linear in the number of “dots”: Scalable


Conclusions

Desiderata:

Simultaneously discover row and column groups

Fully Automatic: No “magic numbers”

Scalable to large matrices

Cross-Associations ≠ Co-clustering!

Information-theoretic co-clustering:

1. Lossy Compression.

2. Approximates the original matrix, while trying to minimize KL-divergence.

3. The number of row and column groups must be given by the user.

Cross-Associations:

1. Lossless Compression.

2. Always provides complete information about the matrix, for any number of row and column groups.

3. The number of row and column groups is chosen automatically using the MDL principle.

Experiments

“Clickstream” graph of users and websites: k=15, l=13

[Figure: the cross-associated matrix, Users versus Webpages.]

Fixed k and l

1. Start with initial matrix

2. Find good groups for fixed k and l (swaps)

3. Choose better values for k and l

4. Repeat steps 2 and 3 to lower the encoding cost

5. Final cross-associations (in the example, k = 5, l = 5)

Experiments

“Caveman” graph with Zipfian cave sizes, no noise: k = 5 row groups, l = 5 col groups

Aim

Given any binary matrix, a “good” cross-association will have low cost.

But how can we find such a cross-association?

[Figure: example matrix with k = 5 row groups and l = 5 col groups.]

Main Idea

Good Compression implies Better Clustering

$$\text{Total Encoding Cost} = \underbrace{\sum_i \text{size}_i \, H(p_i)}_{\text{Code Cost}} \;+\; \underbrace{\text{Cost of describing cross-associations}}_{\text{Description Cost}}$$

Minimize the total cost

Main Idea

Good Compression implies Better Clustering

How well does a cross-association compress the matrix?

• Encode the matrix in a lossless fashion

• Compute the encoding cost

• Low encoding cost implies good compression, which implies good clustering