Julia Stoyanovich "Making interval-based clustering rank-aware"

Julia Stoyanovich (University of Pennsylvania)

joint work with

Sihem Amer-Yahia (Qatar Foundation) and Tova Milo (Tel Aviv University)

Making Interval-Based Clustering Rank-Aware

Яндекс 23.08.2011

Research Directions

• Representation of Large Complex Datasets

– Symmetric relationships [VLDB 2004]

– Faceted databases [VLDB 2005, Internet Archaeology 2007]

– Schema polynomials [EDBT 2008]

– Probabilistic databases [ICDE 2011]

– Scientific workflows with provenance [CIDR 2011, ICDT 2011]

• Information Discovery in Large Complex Datasets

– Search and ranking in social context [VLDB 2008, AAAI-SIP 2008, SIGMOD 2008]

– Ranked data exploration in semantic context [ICDE 2010, SIGMOD 2011]

– Rank-aware clustering [CIKM 2009, EDBT 2011]

– Exploring repositories of scientific workflows [WANDS 2010, AMW 2011]

– Exploring repositories of functional genomics experiments [submitted]

– Estimating susceptibility to genetic disorders [Bioinformatics 2007]


Applications and Prototypes

• The Faceted Query Engine applied to archaeology

• Biological data management

– MutaGeneSys – estimating individual genetic disease susceptibility

– AnnotCompute – exploring repositories of microarray experiments

– SkylineSearch – semantic ranking and result visualization for PubMed

– myExperiment topics – exploring repositories of scientific workflows

• “Shopping and dating”

– Yahoo! Garçon – a collaborative tagging recommender system

– Yahoo! FindLove – rank-aware clustering for dating data


Ranked Exploration of Structured Datasets

Dating service user Mike

• Find matches

– age: [18,40]

– education: at least some college

– income: > $50,000 / year

• Rank by income from higher to lower

• Problems

– too many results

– results are homogeneous at top ranks, due to correlations among attributes!

– correlations may be complex, depend on the selection criteria and on the ranking function

MBA, 40 years old, makes $150K

MBA, 40 years old, makes $150K

MBA, 40 years old, makes $150K

MBA, 40 years old, makes $150K

… 999 matches

PhD, 36 years old, makes $100K

… 9999 matches

BS, 27 years old, makes $80K

An Example from Yahoo! Personals

[Figure: % of women by age; series: edu > BS, income > $50K]

Observe that

1. % of women with income > $50K increases with age

2. % of women with post-graduate education increases until age 29, then plateaus

There is a clear positive correlation between

1. age and income, for all ages

2. education and income, at least until age 29

Correlations are local

Goal: Find Clusters that Correlate with Ranking

age: 26-37; edu: PhD; income: 100-130K

age: 33-40; income: 125-150K

age: 18-25; edu: BS, MS; income: 50-75K

edu: MS; income: 50-75K

age: 26-30; income: 75-110K

Roadmap

• Introduction

➞Rank-aware clustering

– The formalism

– The BARAC algorithm

• Experimental evaluation

– Effectiveness

– Efficiency

• Conclusion


What Is Subspace Clustering?

Parsons et al., SIGKDD Explorations 6(1), 2004

Why Do We Need Subspace Clustering?

Parsons et al., SIGKDD Explorations 6(1), 2004

How Do We Find Subspace Clusters?

• Finds clusters in multiple, possibly overlapping, subspaces

– Dimensionality reduction per cluster

– Lower-dimensional clusters are easier to identify and their descriptions are more palatable to the users

– Example: “age 20-25” and “edu = BS” and “income 25K-50K”

• Two main approaches

– Top-down: start with full dimensionality and refine

– Bottom-up: start with dense units in 1D, combine to find higher-dimensional clusters

• Issues

– What is a cluster? – need a measure of quality

– How do we find clusters? – need a search strategy
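The bottom-up approach above can be illustrated with a minimal density-based sketch in the spirit of grid-based subspace clustering. It is not rank-aware and is not the algorithm proposed in this talk; the function names, the fixed interval widths, and the sample points are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def bucket(value, width):
    """Map a numeric value to the half-open grid interval [lo, lo + width)."""
    lo = (value // width) * width
    return (lo, lo + width)

def dense_units(points, widths, min_density):
    """Find dense 1-D units, then dense 2-D units built only from dense 1-D units."""
    counts_1d = Counter()
    for p in points:
        for attr, w in widths.items():
            counts_1d[(attr, bucket(p[attr], w))] += 1
    dense_1d = {u for u, c in counts_1d.items() if c >= min_density}

    counts_2d = Counter()
    for p in points:
        # Keep only this point's units that are already dense in 1-D.
        units = sorted((attr, bucket(p[attr], w)) for attr, w in widths.items())
        units = [u for u in units if u in dense_1d]
        for pair in combinations(units, 2):
            counts_2d[pair] += 1
    dense_2d = {u for u, c in counts_2d.items() if c >= min_density}
    return dense_1d, dense_2d

# Hypothetical data: age in years, income in thousands of dollars.
POINTS = [
    {"age": 22, "income": 60},
    {"age": 23, "income": 65},
    {"age": 24, "income": 120},
    {"age": 38, "income": 140},
]
DENSE_1D, DENSE_2D = dense_units(POINTS, {"age": 5, "income": 25}, min_density=2)
```

Here the only dense 2-D unit combines age [20, 25) with income [50, 75), because two points fall into both intervals.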


Problem Statement

• User specifies a conjunction of filtering conditions, e.g., Q: age [20,40], edu Bachelors

• User specifies a ranking function, e.g., linear combination

R: [income,], [age,]

We do not restrict the set of ranking functions, but assume that ranking is derived from, or correlates with, attribute values

Given a query Q and a ranking function R, find rank-aware clusters in subspaces of the dataset. Clusters are subspaces that:

• have sufficient rank-aware quality

• are tight

• are maximal
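As a rough illustration of the inputs in the problem statement, the sketch below applies a conjunction of filtering predicates and a linear ranking function to produce a top-N list. The profile data, the `EDU_LEVEL` ordering, and the reading of the education filter as "at least Bachelors" are assumptions made for the example; the comparison operator did not survive on the slide.

```python
from typing import Callable, Dict, List, Sequence

Profile = Dict[str, object]

# Hypothetical ordering of education levels, used to evaluate "at least" filters.
EDU_LEVEL = {"HS": 0, "BS": 1, "MS": 2, "MBA": 2, "PhD": 3}

def linear_score(weights: Dict[str, float]) -> Callable[[Profile], float]:
    """Build a ranking function that is a linear combination of numeric attributes."""
    def score(p: Profile) -> float:
        return sum(w * float(p[attr]) for attr, w in weights.items())
    return score

def top_n(profiles: Sequence[Profile],
          predicates: Sequence[Callable[[Profile], bool]],
          score: Callable[[Profile], float],
          n: int) -> List[Profile]:
    """Keep profiles satisfying every predicate, then return the n best by score."""
    matches = [p for p in profiles if all(pred(p) for pred in predicates)]
    return sorted(matches, key=score, reverse=True)[:n]

# Hypothetical profiles; income is in thousands of dollars.
PROFILES = [
    {"id": "m1", "age": 27, "edu": "BS", "income": 80},
    {"id": "m2", "age": 36, "edu": "PhD", "income": 100},
    {"id": "m3", "age": 40, "edu": "MBA", "income": 150},
    {"id": "m4", "age": 45, "edu": "PhD", "income": 200},  # fails the age filter
    {"id": "m5", "age": 30, "edu": "HS", "income": 90},    # fails the edu filter
]

# Q: age in [20, 40] and (assumed) education of at least a Bachelors degree.
Q = [
    lambda p: 20 <= p["age"] <= 40,
    lambda p: EDU_LEVEL[p["edu"]] >= EDU_LEVEL["BS"],
]

BY_INCOME = linear_score({"income": 1.0})
TOP_2 = [p["id"] for p in top_n(PROFILES, Q, BY_INCOME, 2)]
```

Any callable that maps a profile to a number can be passed as `score`, which matches the slide's point that the set of ranking functions is not restricted.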

BARAC: Bottom-up Algorithm for Rank-Aware Clustering

• BuildGrid

– split each dimension into intervals

– compute top-N for each interval

• Merge (ensures tightness)

– merge neighboring intervals using rank-aware locality (interval dominance)

• Join (ensures maximality and rank-aware quality)

– build K-dimensional clusters from compatible (K-1)-dimensional clusters using rank-aware clustering quality
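The BuildGrid step can be sketched for a single numeric dimension as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the interval boundaries are supplied by the caller, intervals are half-open, and the item ages are invented for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Item = Dict[str, float]

def build_grid(items: Sequence[Item],
               attr: str,
               edges: Sequence[float],
               score: Callable[[Item], float],
               n: int) -> Dict[Tuple[float, float], List[Item]]:
    """Split one dimension into intervals [edges[i], edges[i+1]) and keep each interval's top-n."""
    grid = {}
    for lo, hi in zip(edges, edges[1:]):
        members = [it for it in items if lo <= it[attr] < hi]
        grid[(lo, hi)] = sorted(members, key=score, reverse=True)[:n]
    return grid

# Hypothetical items: ids are encoded as numbers, income is in thousands.
ITEMS = [
    {"id": 1, "age": 27, "income": 99},
    {"id": 3, "age": 28, "income": 90},
    {"id": 7, "age": 26, "income": 75},
    {"id": 9, "age": 29, "income": 65},
    {"id": 6, "age": 31, "income": 125},
    {"id": 8, "age": 32, "income": 110},
    {"id": 10, "age": 33, "income": 100},
    {"id": 2, "age": 30, "income": 95},
]

GRID = build_grid(ITEMS, "age", [25, 30, 35], score=lambda it: it["income"], n=3)
TOP_IDS = {interval: [it["id"] for it in top] for interval, top in GRID.items()}
```

With integer ages, the half-open interval [25, 30) corresponds to the closed range 25-29 used on the slides.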

Avoiding Match Homogeneity at Top Ranks

age: 25-40; income: 75-150K

Cluster descriptions must accurately describe the top-N items

Tightness will give us this property

MBA, 40 years old, makes $150K

MBA, 40 years old, makes $150K

MBA, 40 years old, makes $150K

MBA, 40 years old, makes $150K

… 999 matches

PhD, 36 years old, makes $100K

… 9999 matches

BS, 27 years old, makes $80K

age: 40; income: 150K

Ranked Intervals and Interval Dominance

• Ranked intervals: description, contents (items), top-N

– I1: age [25,30], I2: edu = MBA

• Interval dominance is a rank-aware measure of locality, defined

– over 2 consecutive intervals on the same attribute

– for a ranking function R, integer N, and dominance threshold θ_dom ∈ (0.5, 1]

Example (top-10 lists): I1: age [20,24], I2: age [25,29], I1 + I2: age [20,29]

R1: age (asc): I2 <_{10,1} I1

R2: 0.3 inc + 0.7 edu (desc): I1 <_{10,0.8} I2

R3: religious services (asc): I1 <>_{10,0.5} I2

I1 dominates I2 if


Property 1: Tightness

age: 30-39; edu: PhD

age: 35-39; edu: PhD

I1: age [30,34], I2: age [35,39], I1 + I2: age [30,39]

If I1 dominates I2, then add I1 and I2 to the search space; else add I1, I2, and I1 + I2 to the search space.
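The merge rule can be sketched as below. The exact dominance condition is not recoverable from the slides, so the sketch assumes one plausible reading: I1 dominates I2 when at least a fraction θ_dom of the top-N of I1 + I2 comes from I1. Checking dominance in both directions, and all function names and sample items, are likewise assumptions.

```python
from typing import Callable, Dict, List

Item = Dict[str, float]

def top_n(items: List[Item], score: Callable[[Item], float], n: int) -> List[Item]:
    """Return the n highest-scoring items."""
    return sorted(items, key=score, reverse=True)[:n]

def dominance_fraction(i1: List[Item], i2: List[Item],
                       score: Callable[[Item], float], n: int) -> float:
    """Fraction of the top-n of I1 + I2 that is contributed by I1."""
    merged_top = top_n(i1 + i2, score, n)
    if not merged_top:
        return 0.0
    ids_1 = {it["id"] for it in i1}
    return sum(1 for it in merged_top if it["id"] in ids_1) / len(merged_top)

def dominates(i1, i2, score, n, theta_dom) -> bool:
    """Assumed definition: I1 contributes at least theta_dom of the merged top-n."""
    return dominance_fraction(i1, i2, score, n) >= theta_dom

def merge_step(i1, i2, score, n, theta_dom) -> List[List[Item]]:
    """Return the interval contents to add to the search space."""
    if dominates(i1, i2, score, n, theta_dom) or dominates(i2, i1, score, n, theta_dom):
        return [i1, i2]            # merged interval would not be tight
    return [i1, i2, i1 + i2]       # no dominance: also consider I1 + I2

# Hypothetical items: I1 holds ages 20-24, I2 holds ages 25-29; "s" is an arbitrary score.
I1 = [{"id": k, "age": 20 + k % 5, "s": 2 * k} for k in range(10)]
I2 = [{"id": 100 + k, "age": 25 + k % 5, "s": 2 * k + 1} for k in range(10)]

by_age_asc = lambda it: -it["age"]   # ascending age expressed as a descending score
by_s_desc = lambda it: it["s"]
```

Ranking by ascending age puts the whole top-10 in I1, so only I1 and I2 are kept. Ranking by `s` splits the top-10 evenly, so the merged interval is added as well.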

36 years old 38 years old

Choose Best from Among Comparable

R: [income,]

age: 33-40; income: 126-150K  ? >  age: 33-40; income: 70-100K

age: 33-40; income: 125-150K  ≠ ?  age: 26-30; income: 75-110K

Rank-aware clustering quality will give us this property

R :[income,]

Ranked Subspaces and Clusters

A ranked subspace S : {I1, …, Im} is a set of ranked intervals over distinct attributes, e.g., S: { age [25,30] , edu = MBA }

• interpreted as a conjunction of predicates over dataset D

• dimensionality = number of intervals

Goal: find subspaces that have sufficient rank-aware clustering quality


All rank-aware clustering quality measures

– compare the top-N list of a ranked subspace to the top-N lists of its constituent ranked intervals

– are defined for a ranking function R, an integer N, and a quality threshold θ_Q ∈ (0.5, 1]

Property 2: Rank-Aware Clustering Quality

R: income, N = 3, θ_Q = 2/3

Ranked intervals:

age: 25-29             m1 99K, m3 90K, m7 75K, m9 65K
age: 30-34             m6 125K, m8 110K, m10 100K, m2 95K, m4 85K, m5 85K
edu: BS                m1 99K, m2 95K, m3 90K, m4 85K

Ranked subspaces:

age: 25-29; edu: BS    m1 99K, m3 90K
age: 30-34; edu: BS    m2 95K, m4 85K
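The example on this slide can be worked through with a small set-based overlap computation. This is a sketch under assumptions: it divides the overlap by N and takes the minimum over the constituent intervals, neither of which is stated in the slides, and the function names are hypothetical. It only computes the overlap fractions and leaves the comparison against θ_Q to the caller.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple

Scored = Tuple[str, int]  # (item id, income in thousands)

def top_n_ids(items: Sequence[Scored], n: int) -> List[str]:
    """Ids of the n highest-scoring items."""
    return [item_id for item_id, _ in sorted(items, key=lambda t: t[1], reverse=True)[:n]]

def q_top_n(subspace: Sequence[Scored],
            intervals: Sequence[Sequence[Scored]],
            n: int) -> Fraction:
    """Smallest share of an interval's top-n that also appears in the subspace's top-n."""
    sub_top = set(top_n_ids(subspace, n))
    return min(Fraction(len(sub_top & set(top_n_ids(iv, n))), n) for iv in intervals)

# Data copied from the slide.
AGE_25_29 = [("m1", 99), ("m3", 90), ("m7", 75), ("m9", 65)]
AGE_30_34 = [("m6", 125), ("m8", 110), ("m10", 100), ("m2", 95), ("m4", 85), ("m5", 85)]
EDU_BS = [("m1", 99), ("m2", 95), ("m3", 90), ("m4", 85)]
AGE_25_29_BS = [("m1", 99), ("m3", 90)]
AGE_30_34_BS = [("m2", 95), ("m4", 85)]

Q_YOUNG = q_top_n(AGE_25_29_BS, [AGE_25_29, EDU_BS], n=3)   # overlap 2/3 with each interval
Q_OLDER = q_top_n(AGE_30_34_BS, [AGE_30_34, EDU_BS], n=3)   # no overlap with age 30-34
```

Under this reading, {age: 25-29; edu: BS} shares two of the top-3 items with each of its intervals, while {age: 30-34; edu: BS} shares none of the top-3 items of age: 30-34.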

Rank-Aware Clustering Quality Measures

• QtopN : subspace contains more than a fraction θ_Q of the items from the top-N of its intervals

– Considers top-N lists as sets

• QSCORE : subspace contains more than a fraction θ_Q of the high-scoring items from the top-N of its intervals

– Based on the sums of scores of top-N items

• QSCORE & RANK : subspace contains more than a fraction θ_Q of the high-scoring, high-ranking items from the top-N of its intervals

– Based on NDCG, incorporates both scores and ranks

• Clustering quality measures must exhibit downward closure

– Quality of a subspace is no higher than the quality of its included subspaces

– Holds trivially for density-based measures, due to set properties

– Also holds for our measures, details omitted here


Property 3: Maximality

Avoid producing redundant clusters

age: 25-40; edu: PhD

edu: PhD; income: 100-130K

age: 25-40; income: 100-130K

age: 25-40; edu: PhD; income: 100-130K

Maximality will give us this property

comes for free with bottom-up subspace clustering

BARAC Recap

• BuildGrid

– split each dimension into intervals

– compute top-N for each interval

• Merge (ensures tightness)

– merge neighboring intervals using rank-aware locality (interval dominance)

• Join (ensures maximality and rank-aware quality)

– build K-dimensional clusters from compatible (K-1)-dimensional clusters using rank-aware clustering quality
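The Join step resembles Apriori-style candidate generation, which the following sketch illustrates. The compatibility test (two clusters sharing all but one interval, on distinct attributes) and the pluggable `quality_ok` predicate are assumptions for illustration; the slides do not spell out either.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Set, Tuple

Interval = Tuple[str, str]        # (attribute, interval label)
Subspace = FrozenSet[Interval]

def join(clusters: Set[Subspace],
         quality_ok: Callable[[Subspace], bool]) -> Set[Subspace]:
    """Build K-dimensional clusters from compatible (K-1)-dimensional clusters."""
    result = set()
    for a, b in combinations(clusters, 2):
        cand = a | b
        if len(cand) != len(a) + 1:
            continue  # the two clusters must differ in exactly one interval
        if len({attr for attr, _ in cand}) != len(cand):
            continue  # intervals must be over distinct attributes
        # Downward closure: every (K-1)-dimensional projection must already be a cluster.
        if all(cand - {iv} in clusters for iv in cand) and quality_ok(cand):
            result.add(cand)
    return result

def maximal(clusters: Set[Subspace]) -> Set[Subspace]:
    """Drop clusters that are strictly contained in a higher-dimensional cluster."""
    return {c for c in clusters if not any(c < d for d in clusters)}

# Intervals taken from the maximality slide; the quality check is stubbed out.
ONE_D = {
    frozenset({("age", "25-40")}),
    frozenset({("edu", "PhD")}),
    frozenset({("income", "100-130K")}),
}
accept_all = lambda subspace: True
TWO_D = join(ONE_D, accept_all)
THREE_D = join(TWO_D, accept_all)
```

Applying `maximal` to the union of all levels keeps only the 3-dimensional cluster, which mirrors the slide on avoiding redundant clusters.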

Complexity of BARAC

• Polynomial in input size, exponential in the number of attributes

• Exponential dependency is unavoidable!

– Even counting distinct maximal frequent itemsets is #P-complete

• Example

– 1 item for each combination of attribute values

– each item has an arbitrary distinct score

– find rank-aware clusters with QtopN, N = 1

– there is 1 cluster per item, so an exponential number of clusters!

• But lower in practice

– correlations are local

– clustering quality requires 50% overlap at top-N
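The size of the worst-case construction above can be checked with a short sketch that generates one item per combination of attribute values. It only counts items; it does not run any clustering. The attribute names and domains are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence

def one_item_per_cell(domains: Dict[str, Sequence[str]]) -> List[Dict[str, str]]:
    """Create exactly one item for every combination of attribute values."""
    names = list(domains)
    return [dict(zip(names, combo)) for combo in product(*(domains[n] for n in names))]

def count_cells(domains: Dict[str, Sequence[str]]) -> int:
    """Number of full-dimensional cells: the product of the domain sizes."""
    count = 1
    for values in domains.values():
        count *= len(values)
    return count

# Hypothetical domains: ten binary attributes give 2**10 items.
BINARY = {f"a{i}": ["x", "y"] for i in range(10)}
ITEMS = one_item_per_cell(BINARY)
```

Each added binary attribute doubles the number of items, which is the exponential growth in the number of attributes that the slide refers to.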


Roadmap

• Introduction

• Rank-aware clustering

– The formalism


➞Experimental evaluation

– Effectiveness

– Efficiency

• Conclusion


Experimental Dataset: Yahoo! Personals

• Data and users

– 5 weeks, 454 users, 861 searches

– 19 filtering attributes, 17 clustering attributes, 6 ranking attributes

– Filtering on attributes, user-specified

– Filtering on geo location (only for effectiveness evaluation)

– QtopN clustering quality metric

• Ranking function: weighted sum

– sum of normalized per-attribute distances from best attribute value from among matches

– attributes: age, height, body type, education, income, religious services

– personalized by user: choice of attributes, sort order, normalization
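A ranking function of this shape might be sketched as follows. The normalization by the attribute's value range among the matches, the encoding of sort order as "asc" or "desc", and the sample matches are assumptions; the slides do not give the exact formula.

```python
from typing import Dict, List, Tuple

Match = Dict[str, float]
Prefs = Dict[str, Tuple[float, str]]   # attribute -> (weight, "asc" or "desc")

def rank_matches(matches: List[Match], prefs: Prefs) -> List[Match]:
    """Order matches by a weighted sum of normalized distances from the best value."""
    stats = {}
    for attr, (_, order) in prefs.items():
        values = [m[attr] for m in matches]
        lo, hi = min(values), max(values)
        best = lo if order == "asc" else hi       # best value among the matches
        stats[attr] = (best, (hi - lo) or 1.0)    # avoid dividing by zero

    def distance(m: Match) -> float:
        return sum(weight * abs(m[attr] - stats[attr][0]) / stats[attr][1]
                   for attr, (weight, _) in prefs.items())

    return sorted(matches, key=distance)          # smaller distance ranks higher

# Hypothetical matches; income is in thousands of dollars.
MATCHES = [
    {"id": 1, "age": 40, "income": 150},
    {"id": 2, "age": 36, "income": 100},
    {"id": 3, "age": 27, "income": 80},
]

BY_INCOME = [m["id"] for m in rank_matches(MATCHES, {"income": (1.0, "desc")})]
BY_AGE = [m["id"] for m in rank_matches(MATCHES, {"age": (1.0, "asc")})]
```

Per-user personalization then amounts to passing a different `prefs` dictionary, with the user's choice of attributes, weights, and sort orders.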


Evaluation of Effectiveness: User Study

                     presentation: list    presentation: groups
content: top-100     top list              top groups
content: BARAC       BARAC list            BARAC groups

Effectiveness Metrics and Results

• Users may fave matches and / or groups

– When a group is faved, all matches in that group are faved

• A productive search has at least 1 faved match/group

treatment      % prod. searches    num. faves per search    num. faves per prod. search
top list       17                  0.84                     5.05
top group      14                  0.87                     7.33 / 1.17 groups
BARAC list     15                  0.74                     4.93
BARAC group    20                  1.55                     12.38 / 1.91 groups

Evaluation of Efficiency

• Summary of results: BARAC is scalable

– runtimes of BuildGrid and Join dominate performance

– runtime of Merge is negligible

• All reported results are over the complete set of female profiles in Yahoo! Personals, without any location-based filtering!





[Figure: runtime of BuildGrid; x-axis: # items (0 to 500,000); y-axis: runtime of BuildGrid (ms), 0 to 8,000]





[Figure: runtime of Join; x-axis: # clustering dimensions (2 to 17); y-axis: runtime of Join (ms), 0 to 3,500]

Performance of Join

[Figure: runtime of Join (ms), 0 to 600, vs. quality threshold (0.5 to 1); one series per dimensionality, 3D to 9D]

* results for 100 Yahoo! Personals users on the full Y!P dataset.

Performance of Join

[Figure: runtime of Join (ms), 0 to 1,000, vs. dominance threshold (0.5 to 1); one series per dimensionality, 3D to 9D]

Roadmap

• Introduction

• Rank-aware clustering

– The formalism


• Experimental evaluation

– Effectiveness

– Efficiency

➞Conclusion


Rank-Aware Clustering: Recap

• Formalized rank-aware clustering, a novel data exploration paradigm

• Developed a rank-aware measure of locality and a family of rank-aware clustering quality measures

• Proposed BARAC: a bottom-up algorithm for rank-aware clustering

• Presented an experimental evaluation on Yahoo! Personals (also restaurants in Yahoo! Local)

• Effectiveness

• Efficiency

age: 33-40; inc: 126-150K

age: 26-30; inc: 75-110K

age: 18-25; edu: BS, MS; inc: 50-75K

[Figure: runtime of BuildGrid (ms) vs. # items]

Related Work

• Subspace clustering

– CLIQUE [Agrawal et al, 1998], ENCLUS [Cheng et al, 1999]

– Improvements [Nagesh, 1999], [Liu et al, 2000], [Chang and Jin, 2002]

• Ranking of structured data

– Many answers, empty answer problems [Chaudhuri et al, 2004], [Agrawal et al, 2003]

– Rank-aware attribute selection [Das et al, 2006]

• Integrating ranking with clustering

– Mixture model, mutual reinforcement between ranking and clustering, for heterogeneous information networks, e.g., DBLP [Sun et al, 2009]

• Diversification

– Web search [Agichtein et al, 2007], [Anagnostopoulos et al, 2005], [Kummamuru et al, 2004], …

– Database queries [Chen and Li, 2007], [Vee et al, 2008]

– Recommendation [Boim et al, 2011], [Yu et al, 2009]


Future Work: Choosing a Clustering Quality Measure

[Figure: score (0 to 12) vs. rank (0 to 100); series: attribute-rank, geo-rank]

Thank you!


Take 1: Density-Based Clustering

age: 18-25, 26-30, 31-35, 36-40

income: 50-75K, 76-100K, 101-125K, 126-150K

min density = 2

Take 1: Density-Based Clustering

age: 36-40; income: 50-75K

age: 31-35; income: 76-100K

age: 18-30; income: 101-150K

age: 36-40; income: 101-150K

age: 18-30; income: 50-75K

min density = 2

Take 2: A Lower Threshold?

age: 18-25, 26-30, 31-35, 36-40

income: 50-75K, 76-100K, 101-125K, 126-150K

min density = 1

Take 2: A Lower Threshold?

age: 18-40

income: 50-150K

density > 0

age: 18-40; income: 50-150K


Performance of BARAC

[Figure: percentage (0% to 100%) by runtime bucket (<30 sec, <20 sec, <15 sec, <10 sec, <5 sec, <1 sec); series: BuildGrid, Join, Total]
