Julia Stoyanovich (University of Pennsylvania) joint work with Sihem Amer-Yahia (Qatar Foundation) and Tova Milo (Tel Aviv University) Making Interval-Based Clustering Rank-Aware Яндекс 23.08.2011
Jul 13, 2015
Research Directions
• Representation of Large Complex Datasets
– Symmetric relationships [VLDB 2004]
– Faceted databases [VLDB 2005, Internet Archaeology 2007]
– Schema polynomials [EDBT 2008]
– Probabilistic databases [ICDE 2011]
– Scientific workflows with provenance [CIDR 2011, ICDT 2011]
• Information Discovery in Large Complex Datasets
– Search and ranking in social context [VLDB 2008, AAAI-SIP 2008, SIGMOD 2008]
– Ranked data exploration in semantic context [ICDE 2010, SIGMOD 2011]
– Rank-aware clustering [CIKM 2009, EDBT 2011]
– Exploring repositories of scientific workflows [WANDS 2010, AMW 2011]
– Exploring repositories of functional genomics experiments [submitted]
– Estimating susceptibility to genetic disorders [Bioinformatics 2007]
Applications and Prototypes
• The Faceted Query Engine applied to archaeology
• Biological data management
– MutaGeneSys – estimating individual genetic disease susceptibility
– AnnotCompute – exploring repositories of microarray experiments
– SkylineSearch – semantic ranking and result visualization for PubMed
– myExperiment topics – exploring repositories of scientific workflows
• “Shopping and dating”
– Yahoo! Garçon – a collaborative tagging recommender system
– Yahoo! FindLove – rank-aware clustering for dating data
Ranked Exploration of Structured Datasets
Dating service user Mike
• Find matches
– age: [18,40]
– education: at least some college
– income: > $50,000 / year
• Rank by income from higher to lower
• Problems
– too many results
– results are homogeneous at top ranks, due to correlations among attributes!
– correlations may be complex, and depend on the selection criteria and on the ranking function
MBA, 40 years old
makes $150K
MBA, 40 years old
makes $150K
MBA, 40 years old
makes $150K
MBA, 40 years old
makes $150K
… 999 matches
PhD, 36 years old
makes $100K
… 9999 matches
BS, 27 years old
makes $80K
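A minimal sketch of Mike's filter-then-rank query, to make the homogeneity problem concrete. The `Profile` class, the `EDU_LEVELS` ordering, and the sample profiles are assumptions for illustration, not part of the original system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    name: str
    age: int
    edu: str      # one of EDU_LEVELS
    income: int   # dollars per year

# Hypothetical ordering of education levels, lowest to highest.
EDU_LEVELS = ["HS", "some college", "BS", "MS", "MBA", "PhD"]

def find_matches(profiles, min_age=18, max_age=40,
                 min_edu="some college", min_income=50_000):
    """Apply the conjunctive filter, then rank by income from higher to lower."""
    floor = EDU_LEVELS.index(min_edu)
    matches = [p for p in profiles
               if min_age <= p.age <= max_age
               and EDU_LEVELS.index(p.edu) >= floor
               and p.income > min_income]
    return sorted(matches, key=lambda p: p.income, reverse=True)

# Made-up data that mirrors the result list on the slide.
profiles = (
    [Profile(f"mba{i}", 40, "MBA", 150_000) for i in range(4)]
    + [Profile("phd", 36, "PhD", 100_000),
       Profile("bs", 27, "BS", 80_000),
       Profile("low", 30, "BS", 40_000)]
)
ranked = find_matches(profiles)
top4 = ranked[:4]   # all four are 40-year-old MBAs making $150K
```

Because income correlates with age and education, the top of the ranked list is filled with near-identical profiles, which is the problem rank-aware clustering targets.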
An Example from Yahoo! Personals
[Figure: % of women by age, two series: edu > BS and income > $50K]
Observe that
1. % of women with income > $50K increases with age
2. % of women with post-graduate education increases until age 29, then plateaus
There is a clear positive correlation between
1. age and income, for all ages
2. education and income, at least until age 29
Correlations are local
Goal: Find Clusters that Correlate with Ranking
• age: 26-37; edu: PhD; income: 100-130K
• age: 33-40; income: 125-150K
• age: 18-25; edu: BS, MS; income: 50-75K
• edu: MS; income: 50-75K
• age: 26-30; income: 75-110K
Roadmap
• Introduction
➞Rank-aware clustering
– The formalism
– The BARAC algorithm
• Experimental evaluation
– Effectiveness
– Efficiency
• Conclusion
What Is Subspace Clustering?
Parsons et al., SIGKDD Explorations 6(1), 2006
Why Do We Need Subspace Clustering?
Parsons et al., SIGKDD Explorations 6(1), 2006
How Do We Find Subspace Clusters?
• Finds clusters in multiple, possibly overlapping, subspaces
– Dimensionality reduction per cluster
– Lower-dimensional clusters are easier to identify and their descriptions are more palatable to the users
– Example: “age 20-25” and “edu = BS” and “income 25K-50K”
• Two main approaches
– Top-down: start with full dimensionality and refine
– Bottom-up: start with dense units in 1D,
combine to find higher-dimensional clusters
• Issues
– What is a cluster? – need a measure of quality
– How do we find clusters? – need a search strategy
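The first step of the bottom-up approach can be sketched as follows: bin one dimension and keep the bins that are dense enough. This is a generic density-based illustration, not BARAC itself; the function name `dense_units_1d`, the equal-width binning, and the sample ages are assumptions.

```python
from collections import Counter

def dense_units_1d(values, width, min_density):
    """Return {(lo, hi): count} for equal-width bins holding >= min_density values."""
    counts = Counter((v // width) * width for v in values)
    return {(lo, lo + width - 1): c
            for lo, c in sorted(counts.items()) if c >= min_density}

ages = [21, 22, 24, 27, 33, 36, 38, 39]   # made-up data
units = dense_units_1d(ages, width=5, min_density=2)
# Bins [20,24] and [35,39] hold three values each; [25,29] and [30,34] hold one.
```

Higher-dimensional candidates are then formed only from combinations of such dense units, which is what keeps the search space manageable.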
Problem Statement
• User specifies a conjunction of filtering conditions, e.g., Q: age [20, 40], edu Bachelors
• User specifies a ranking function, e.g., a linear combination, R: [income, ], [age, ]
We do not restrict the set of ranking functions, but assume that ranking is derived from, or correlates with, attribute values
Given a query Q and a ranking function R, find rank-aware clusters
in subspaces of the dataset. Clusters are subspaces that:
• have sufficient rank-aware quality
• are tight
• are maximal
BARAC: Bottom-up Algorithm for Rank-Aware Clustering
• BuildGrid
– split each dimension into intervals
– compute top-N for each interval
• Merge (ensures tightness)
– merge neighboring intervals using rank-aware locality (interval dominance)
• Join (ensures maximality and rank-aware quality)
– build K-dimensional clusters from compatible (K-1)-dimensional clusters using rank-aware clustering quality
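A rough sketch of the BuildGrid step for a single numeric dimension, assuming equal-width intervals and a score function supplied by the caller. The function name `build_grid`, the dictionary-based items, and the exact ages are illustrative assumptions; the slides do not say how intervals are chosen.

```python
import heapq
from collections import defaultdict

def build_grid(items, attr, width, score, n):
    """Split one dimension into equal-width intervals and keep each interval's top-n."""
    buckets = defaultdict(list)
    for item in items:
        lo = (item[attr] // width) * width
        buckets[(lo, lo + width - 1)].append(item)
    return {interval: heapq.nlargest(n, members, key=score)
            for interval, members in sorted(buckets.items())}

# Made-up items; incomes are in thousands of dollars.
items = [
    {"id": "m1", "age": 26, "income": 99},
    {"id": "m3", "age": 28, "income": 90},
    {"id": "m7", "age": 25, "income": 75},
    {"id": "m9", "age": 29, "income": 65},
    {"id": "m6", "age": 31, "income": 125},
    {"id": "m8", "age": 33, "income": 110},
]
grid = build_grid(items, "age", width=5, score=lambda it: it["income"], n=3)
```

Each grid cell carries its own top-N list, which is the input that the Merge and Join steps compare.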
Avoiding Match Homogeneity at Top Ranks
age: 25-40
income: 75-150K
Cluster descriptions must accurately describe the top-N items
Tightness will give us this property
MBA, 40 years old
makes $150K
MBA, 40 years old
makes $150K
MBA, 40 years old
makes $150K
MBA, 40 years old
makes $150K
… 999 matches
PhD, 36 years old
makes $100K
… 9999 matches
BS, 27 years old
makes $80K
age: 40
income: 150K
Ranked Intervals and Interval Dominance
• Ranked intervals: description, contents (items), top-N
– I1: age [25,30], I2: edu = MBA
• Interval dominance is a rank-aware measure of locality, defined
– over 2 consecutive intervals on the same attribute
– for a ranking function R, integer N, and dominance threshold θdom ∈ (0.5, 1]
I1: age [20,24], I2: age [25,29], I1 + I2: age [20,29] (top-10)
– R1: age (asc): I2 <_{10,1} I1
– R2: 0.3 inc + 0.7 edu (desc): I1 <_{10,0.8} I2
– R3: rel serv (asc): I1 <>_{10,0.5} I2
I1 dominates I2 if
Property 1: Tightness
age: 30-39
edu: PhD
age: 35-39
edu: PhD
I1 : age [30,34] I2 : age [35,39] I1 + I2 : age [30,39]
If I1 dominates I2, add I1 and I2 to the search space; otherwise add I1, I2, and I1 + I2 to the search space.
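The merge rule can be sketched as below. The dominance test used here is an assumption, since the slide's formula did not survive extraction: one interval is taken to dominate its neighbor when at least θdom of the top-N of their union comes from it. Checking dominance in both directions is also an assumption, as are the function names and the sample data.

```python
import heapq

def dominates(i1, i2, n, theta_dom):
    """Assumed test: i1 dominates i2 if >= theta_dom of the top-n of i1 + i2 comes from i1."""
    top = heapq.nlargest(n, i1 + i2, key=lambda item: item[1])
    if not top:
        return False
    members = set(i1)
    return sum(item in members for item in top) / len(top) >= theta_dom

def merge_step(i1, i2, n, theta_dom):
    """Intervals that enter the search space for two neighboring intervals."""
    if dominates(i1, i2, n, theta_dom) or dominates(i2, i1, n, theta_dom):
        return [i1, i2]            # one side owns the top ranks: keep only the tight intervals
    return [i1, i2, i1 + i2]       # top ranks are shared: also consider the merged interval

# Items are (id, score) pairs; ids and scores are made up.
age_30_34 = [("a", 100), ("b", 95)]
age_35_39 = [("c", 130), ("d", 125), ("e", 120)]
skewed = merge_step(age_30_34, age_35_39, n=3, theta_dom=0.8)

mixed_lo = [("f", 130), ("g", 100)]
mixed_hi = [("h", 125), ("i", 95)]
shared = merge_step(mixed_lo, mixed_hi, n=4, theta_dom=0.8)
```

In the skewed case the merged interval age [30,39] is never generated, so any cluster description built on top of it stays tight around the items that actually rank highest.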
Choose Best from Among Comparable
R: [income, ]
36 years old ? 38 years old
age: 33-40, income: 126-150K > age: 33-40, income: 70-100K
age: 33-40, income: 125-150K ≠ ? age: 26-30, income: 75-110K
Rank-aware clustering quality will give us this property
R :[income,]
Ranked Subspaces and Clusters
A ranked subspace S : {I1, …, Im} is a set of ranked intervals over distinct
attributes, e.g., S: { age [25,30] , edu = MBA }
• interpreted as a conjunction of predicates over dataset D
• dimensionality = number of intervals
Goal: find subspaces that have sufficient rank-aware clustering quality
All rank-aware clustering quality measures
– compare the top-N list of a ranked subspace to the top-N lists of its constituent ranked intervals
– are defined for a ranking function R, an integer N, and a quality threshold θQ ∈ (0.5, 1]
Property 2: Rank-Aware Clustering Quality
age: 25-29
m1 99K
m3 90K
m7 75K
m9 65K
edu: BS
m1 99K
m2 95K
m3 90K
m4 85K
age: 25-29
edu: BS
m1 99K
m3 90K
age: 30-34
edu: BS
m2 95K
m4 85K
R: income, N = 3, θQ = 2/3
age: 30-34
m6 125K
m8 110K
m10 100K
m2 95K
m4 85K
m5 85K
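The example on this slide can be checked with a small sketch of a set-based quality measure. The aggregation is an assumption: quality is taken as the smallest fraction, over the constituent intervals, of the interval's top-N that also appears in the subspace's top-N, and the threshold is treated as inclusive. The function name `q_topn` is hypothetical.

```python
def q_topn(subspace_top, interval_tops, n):
    """Assumed set-based quality: min over intervals of |top-n(subspace) & top-n(interval)| / n."""
    sub = set(subspace_top[:n])
    return min(len(sub & set(top[:n])) / n for top in interval_tops)

# Ranked lists from the slide, best income first.
age_25_29 = ["m1", "m3", "m7", "m9"]
age_30_34 = ["m6", "m8", "m10", "m2", "m4", "m5"]
edu_bs = ["m1", "m2", "m3", "m4"]

N, THETA_Q = 3, 2 / 3
q_young = q_topn(["m1", "m3"], [age_25_29, edu_bs], N)   # {age: 25-29, edu: BS}
q_older = q_topn(["m2", "m4"], [age_30_34, edu_bs], N)   # {age: 30-34, edu: BS}

keep_young = q_young >= THETA_Q
keep_older = q_older >= THETA_Q
```

Under these assumptions {age: 25-29, edu: BS} keeps two of the three top items of each of its intervals and passes, while {age: 30-34, edu: BS} shares nothing with the top-3 of age: 30-34 and is rejected.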
Rank-Aware Clustering Quality Measures
• QtopN: subspace contains > θQ items from the top-N of its intervals
– Considers top-N lists as sets
• QSCORE: subspace contains > θQ high-scoring items from the top-N of its intervals
– Based on the sums of scores of top-N items
• QSCORE&RANK: subspace contains > θQ high-scoring, high-ranking items from the top-N of its intervals
– Based on NDCG, incorporates both scores and ranks
• Clustering quality measures must exhibit downward closure
– Quality of a subspace is no higher than the quality of its included subspaces
– Holds trivially for density-based measures, due to set properties
– Also holds for our measures, details omitted here
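One way a score-based measure could look, as a hedged sketch: for a single interval, compute the share of the interval's top-N score mass that the subspace retains. The exact definition of QSCORE is not given on the slide, so the formula, the function name `q_score`, and the reuse of the incomes from the earlier example are assumptions.

```python
def q_score(subspace_ids, interval_top, n):
    """Assumed score-based quality for one interval: share of its top-n score mass kept by the subspace."""
    top = interval_top[:n]
    total = sum(score for _, score in top)
    kept = sum(score for item, score in top if item in subspace_ids)
    return kept / total if total else 0.0

# (id, income in $K) pairs for the interval age: 25-29.
age_25_29 = [("m1", 99), ("m3", 90), ("m7", 75), ("m9", 65)]
quality = q_score({"m1", "m3"}, age_25_29, n=3)   # (99 + 90) / (99 + 90 + 75)
```

Compared with the set-based measure, losing a low-scoring item from the top-N costs less here than losing a high-scoring one.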
Property 3: Maximality
Maximality will give us this property
comes for free with bottom-up subspace clustering
age: 25-40
edu: PhD
edu: PhD
income: 100-130K
age: 25-40
income: 100-130K
age: 25-40
edu: PhD
income: 100-130K
Avoid producing redundant clusters
BARAC Recap
• BuildGrid
– split each dimension into intervals
– compute top-N for each interval
• Merge (ensures tightness)
– merge neighboring intervals using rank-aware locality (interval dominance)
• Join (ensures maximality and rank-aware quality)
– build K-dimensional clusters from compatible (K-1)-dimensional clusters using rank-aware clustering quality
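A sketch of the Join step in the usual bottom-up (Apriori-like) style. Subspaces are represented as frozensets of (attribute, interval) pairs, two (K-1)-dimensional clusters are treated as compatible when their union has K intervals over K distinct attributes, and quality is supplied as a callable. These representation choices and the function name `join` are assumptions.

```python
from itertools import combinations

def join(clusters, quality, theta_q):
    """Build K-dimensional clusters from pairs of compatible (K-1)-dimensional clusters."""
    out = set()
    for a, b in combinations(clusters, 2):
        cand = a | b
        attrs = {attr for attr, _ in cand}
        # Compatible: exactly one new interval, and no attribute appears twice.
        if len(cand) == len(a) + 1 and len(attrs) == len(cand) and quality(cand) >= theta_q:
            out.add(cand)
    return out

one_dim = [
    frozenset({("age", "25-29")}),
    frozenset({("age", "30-34")}),
    frozenset({("edu", "BS")}),
]
# Quality values taken from the earlier worked example (N = 3).
known_quality = {
    frozenset({("age", "25-29"), ("edu", "BS")}): 2 / 3,
    frozenset({("age", "30-34"), ("edu", "BS")}): 0.0,
}
two_dim = join(one_dim, lambda s: known_quality.get(s, 0.0), theta_q=2 / 3)
```

Because quality is downward closed, a candidate that fails the threshold never needs to be extended further, which is what lets the search prune early.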
Complexity of BARAC
• Polynomial in input size, exponential in the number of attributes
• Exponential dependency is unavoidable!
– Even counting distinct maximal frequent itemsets is #P-complete
• Example
– 1 item for each combination of attribute values
– each item has an arbitrary distinct score
– find rank-aware clusters with QtopN, N = 1
– there is 1 cluster per item, so an exponential number of clusters!
• But lower in practice
– correlations are local
– clustering quality requires 50% overlap at top-N
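The worst-case example can be counted directly. The sketch below builds one item per combination of attribute values and reports how many items, and therefore how many clusters under the slide's argument, result; the function name and the choice of binary attributes are assumptions.

```python
from itertools import product

def worst_case_clusters(domains):
    """One item per combination of attribute values; the slide's example has 1 cluster per item."""
    items = list(product(*domains))
    return len(items)

# Two values per attribute, for 2, 4, and 8 attributes.
counts = [worst_case_clusters([range(2)] * d) for d in (2, 4, 8)]
```

With k values per attribute and d attributes the count is k to the power d, so doubling the number of attributes squares the number of clusters.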
Roadmap
• Introduction
• Rank-aware clustering
– The formalism
– The BARAC algorithm
➞Experimental evaluation
– Effectiveness
– Efficiency
• Conclusion
Experimental Dataset: Yahoo! Personals
• Data and users
– 5 weeks, 454 users, 861 searches
– 19 filtering attributes, 17 clustering attributes, 6 ranking attributes
– Filtering on attributes, user-specified
– Filtering on geo location (only for effectiveness evaluation)
– QtopN clustering quality metric
• Ranking function: weighted sum
– sum of normalized per-attribute distances from best attribute value from among matches
– attributes: age, height, body type, education, income, religious services
– personalized by user: choice of attributes, sort order, normalization
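A minimal sketch of such a ranking function, assuming numeric attributes and normalization by the observed range among the matches. The function name `rank_matches`, the `prefs` format, the weights, and the sample matches are all assumptions; the slide does not specify the normalization.

```python
def rank_matches(matches, prefs):
    """prefs maps attribute -> (weight, 'asc' or 'desc'); lower total distance ranks first."""
    def distance(m):
        total = 0.0
        for attr, (weight, order) in prefs.items():
            values = [x[attr] for x in matches]
            best = min(values) if order == "asc" else max(values)
            spread = max(values) - min(values)
            # Normalize by the observed range so each attribute contributes a value in [0, 1].
            total += weight * (abs(m[attr] - best) / spread if spread else 0.0)
        return total
    return sorted(matches, key=distance)

# Made-up matches; income in $K.
matches = [
    {"id": "a", "age": 40, "income": 150},
    {"id": "b", "age": 36, "income": 100},
    {"id": "c", "age": 27, "income": 80},
]
prefs = {"income": (0.7, "desc"), "age": (0.3, "asc")}   # hypothetical user preferences
order = [m["id"] for m in rank_matches(matches, prefs)]
```

Changing the weights or sort orders in `prefs` corresponds to the per-user personalization mentioned on the slide.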
Evaluation of Effectiveness: User Study
content \ presentation   list         groups
top-100                  top list     top groups
BARAC                    BARAC list   BARAC groups
Effectiveness Metrics and Results
• Users may fave matches and / or groups
– When a group is faved, all matches in that group are faved
• A productive search has at least 1 faved match/group
treatment     % productive searches   faves per search   faves per productive search
top list      17                      0.84               5.05
top group     14                      0.87               7.33 / 1.17 groups
BARAC list    15                      0.74               4.93
BARAC group   20                      1.55               12.38 / 1.91 groups
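The three metrics can be computed from a per-search fave count as sketched below. The input format (one integer per search) and the function name `effectiveness` are assumptions, and the sample counts are made up rather than taken from the study.

```python
def effectiveness(searches):
    """searches: list of fave counts, one entry per search."""
    productive = [f for f in searches if f > 0]
    return {
        "pct_productive": 100 * len(productive) / len(searches),
        "faves_per_search": sum(searches) / len(searches),
        "faves_per_productive": sum(productive) / len(productive) if productive else 0.0,
    }

stats = effectiveness([0, 0, 3, 0, 5])   # 2 of 5 searches are productive
```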
Evaluation of Efficiency
• Summary of results: BARAC is scalable
– runtimes of BuildGrid and Join dominate performance
– runtime of Merge is negligible
• All reported results are over the complete set of female profiles
in Yahoo! Personals, without any location-based filtering!
[Figure: runtime of BuildGrid (ms) vs. # items; x-axis 0 to 500,000 items, y-axis 0 to 8,000 ms]
[Figure: runtime of Join (ms) vs. # clustering dimensions; x-axis 2 to 17 dimensions, y-axis 0 to 3,500 ms]
Performance of Join
[Figure: runtime of Join (ms) vs. quality threshold; x-axis 0.5 to 1, y-axis 0 to 600 ms, one series per dimensionality from 3D to 9D]
* results for 100 Yahoo! Personals users on the full Y!P dataset.
Performance of Join
[Figure: runtime of Join (ms) vs. dominance threshold; x-axis 0.5 to 1, y-axis 0 to 1,000 ms, one series per dimensionality from 3D to 9D]
* results for 100 Yahoo! Personals users on the full Y!P dataset.
Roadmap
• Introduction
• Rank-aware clustering
– The formalism
– The BARAC algorithm
• Experimental evaluation
– Effectiveness
– Efficiency
➞Conclusion
Rank-Aware Clustering: Recap
• Formalized rank-aware clustering, a novel data exploration paradigm
• Developed a rank-aware measure of locality and a family of rank-aware clustering quality measures
• Proposed BARAC: a bottom-up algorithm for rank-aware clustering
• Presented an experimental evaluation on Yahoo! Personals (also restaurants in Yahoo! Local)
– Effectiveness
– Efficiency
[Figures: example rank-aware clusters (age: 33-40, inc: 126-150K; age: 26-30, inc: 75-110K; age: 18-25, edu: BS, MS, inc: 50-75K) and runtime of BuildGrid (ms) vs. # items]
Related Work
• Subspace clustering
– CLIQUE [Agrawal et al, 1998], ENCLUS [Cheng et al, 1999]
– Improvements [Nagesh, 1999], [Liu et al, 2000], [Chang and Jin, 2002]
• Ranking of structured data
– Many answers, empty answer problems [Chaudhuri et al, 2004], [Agrawal et al, 2003]
– Rank-aware attribute selection [Das et al, 2006]
• Integrating ranking with clustering
– Mixture model, mutual reinforcement between ranking and clustering, for heterogeneous information networks, e.g., DBLP [Sun et al, 2009]
• Diversification
– Web search [Agichtein et al, 2007], [Anagnostopoulos et al, 2005], [Kummamuru et al, 2004], …
– Database queries [Chen and Li, 2007], [Vee et al, 2008]
– Recommendation [Boim et al, 2011], [Yu et al, 2009]
Future Work: Choosing a Clustering Quality Measure
[Figure: score (0 to 12) vs. rank (0 to 100), two series: attribute-rank and geo-rank]
Thank you!
Take 1: Density-Based Clustering
age: 18-25 age: 26-30 age: 31-35 age: 36-40
income: 50-75K income: 101-125K income: 126-150K income: 76-100K
min density = 2
Take 1: Density-Based Clustering
• age: 36-40; income: 50-75K
• age: 31-35; income: 76-100K
• age: 18-30; income: 101-150K
• age: 36-40; income: 101-150K
• age: 18-30; income: 50-75K
min density = 2
Take 2: A Lower Threshold?
age: 18-25 age: 26-30 age: 31-35 age: 36-40
income: 50-75K income: 101-125K income: 126-150K income: 76-100K
min density = 1
Take 2: A Lower Threshold?
age: 18-40
income: 50-150K
density > 0
age: 18-40; income: 50-150K
Performance of BARAC
[Figure: bar chart, 0% to 100%, over runtime buckets <30 sec, <20 sec, <15 sec, <10 sec, <5 sec, <1 sec; series: BuildGrid, Join, Total]
* results for 100 Yahoo! Personals users on the full Y!P dataset.