+Efficient Network-Aware Search in Collaborative Tagging
Sihem Amer-Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich
Presented by: Ashish Chawla, CSE 6339, Spring 2009
+Overview
Opportunity: explore keyword search in a setting where query results are determined by the opinions of the network of taggers related to a seeker, i.e., incorporate social behavior into query processing.
Network-aware search: results are determined by the opinion of the seeker's network.
Existing top-k strategies are too space-intensive because scores depend on the seeker's network.
The paper investigates clustering seekers based on the behavior of their networks.
del.icio.us datasets were used for the experiments.
2
+Introduction
What is Network-Aware Search?
Examples: Flickr, YouTube, del.icio.us, photo tagging on Facebook
Users contribute content, annotate items (photos, videos, URLs, ...) with tags, and form social networks (friends/family, interest-based).
They need help discovering relevant content.
What is the relevance of an item?
3
+What is Network-Aware Search?
4
+Claims
Define network-aware search.
Adapt top-k algorithms to network-aware search, using score upper-bounds and the EXACT strategy.
Refine score upper-bounds based on the user's network and tagging behavior.
5
+Data Model
Example Tagged triples: (Roger, i1, music), (Roger, i3, music), (Roger, i5, sports), ..., (Hugo, i1, music), (Hugo, i22, music), ..., (Minnie, i2, sports), ..., (Linda, i2, football), (Linda, i28, news), ...
Tagged(user u, item i, tag t)
Link(user u, user v): directed edge
Taggers = π_u Tagged; Seekers = π_u Link
Network(u) = { v | Link(u, v) }: for a seeker u ∈ Seekers, Network(u) is the set of u's neighbors
6
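The relations above can be sketched directly in code. The data below is a toy sample in the spirit of the slide, not the paper's dataset:

```python
# Tagged(user, item, tag) and Link(u, v) represented as plain sets of tuples.
Tagged = {
    ("Roger", "i1", "music"), ("Roger", "i3", "music"), ("Roger", "i5", "sports"),
    ("Hugo", "i1", "music"), ("Minnie", "i2", "sports"), ("Linda", "i2", "football"),
}
Link = {("Alice", "Roger"), ("Alice", "Hugo"), ("Bob", "Linda")}

Taggers = {u for (u, _, _) in Tagged}   # projection of Tagged on the user column
Seekers = {u for (u, _) in Link}        # projection of Link on the source column

def network(u):
    """Network(u) = { v | Link(u, v) }."""
    return {v for (src, v) in Link if src == u}
```

For example, `network("Alice")` is `{"Roger", "Hugo"}`.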
+What are Scores?
A query is a set of tags Q = {t1, t2, ..., tn}; example: fashion, www, sports, artificial intelligence.
For a seeker u, a tag t, and an item i (score per tag):
score(i, u, t) = f(|Network(u) ∩ {v | Tagged(v, i, t)}|)
Overall score of the query:
score(i, u, Q) = g(score(i, u, t1), score(i, u, t2), ..., score(i, u, tn))
f and g are monotone; here f = COUNT and g = SUM.
7
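With f = COUNT and g = SUM, the two score definitions are a few lines of Python. The data and names below are illustrative, not from the paper:

```python
def score_per_tag(i, u, t, tagged, link):
    """score(i, u, t) = |Network(u) ∩ {v | Tagged(v, i, t)}|  (f = COUNT)."""
    network_u = {v for (s, v) in link if s == u}
    taggers_of = {v for (v, item, tag) in tagged if item == i and tag == t}
    return len(network_u & taggers_of)

def overall_score(i, u, query, tagged, link):
    """score(i, u, Q) = sum of the per-tag scores  (g = SUM)."""
    return sum(score_per_tag(i, u, t, tagged, link) for t in query)

# toy data
tagged = {("Roger", "i1", "music"), ("Hugo", "i1", "music"),
          ("Linda", "i1", "music"), ("Roger", "i5", "sports")}
link = {("Alice", "Roger"), ("Alice", "Hugo")}
# Alice's network holds two of the three taggers of (i1, music):
# overall_score("i1", "Alice", ["music"], tagged, link) -> 2
```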
+Problem Statement
Given a user query Q = t1 ... tn and a number k, efficiently determine the top k items, i.e., the k items with the highest overall score.
8
+Standard Top-k Processing
Q = {t1,t2,…,tn}
Inverted lists per tag, IL1, IL2, ..., ILn, each sorted on score
score(i) = g(score(i, IL1), score(i, IL2), ..., score(i, ILn))
Intuition: high-scoring items are close to the top of most lists.
Fagin-style processing: NRA (No Random Access)
access all lists sequentially in parallel
maintain a heap sorted on partial scores
stop when the partial score of the kth item > best-case score of unseen/incomplete items
9
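As a concrete sketch, here is a didactic version of NRA over three per-tag lists; the item ids and scores mirror the running example on the next slides, and this is not the paper's implementation:

```python
def nra(lists, k):
    """NRA sketch. `lists` holds one [(item, score), ...] per tag,
    each sorted by score descending. Returns the top-k item ids."""
    n = len(lists)
    partial = {}   # item -> sum of scores seen so far (worst-case score)
    seen_in = {}   # item -> indices of lists the item has been seen in
    for depth in range(max(len(l) for l in lists)):
        # one sorted access per list, in parallel
        for j, lst in enumerate(lists):
            if depth < len(lst):
                item, s = lst[depth]
                partial[item] = partial.get(item, 0.0) + s
                seen_in.setdefault(item, set()).add(j)
        # current cursor score of each list bounds its unseen entries
        cursor = [lst[min(depth, len(lst) - 1)][1] for lst in lists]

        def best_case(i):
            return partial[i] + sum(cursor[j] for j in range(n)
                                    if j not in seen_in[i])

        topk = sorted(partial, key=partial.get, reverse=True)[:k]
        min_topk = partial[topk[-1]] if len(topk) == k else float("-inf")
        threshold = sum(cursor)  # best case for completely unseen items
        contenders = [i for i in partial
                      if i not in topk and best_case(i) > min_topk]
        if threshold <= min_topk and not contenders:
            return topk
    return sorted(partial, key=partial.get, reverse=True)[:k]

lists = [
    [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (21, 0.2), (91, 0.1)],
    [(17, 0.6), (38, 0.6), (14, 0.6), (5, 0.6), (83, 0.5), (21, 0.3)],
    [(83, 0.9), (17, 0.7), (61, 0.3), (81, 0.2), (65, 0.1), (10, 0.1), (44, 0.1)],
]
# nra(lists, 2) -> [83, 17], after five rounds, as in the slides that follow
```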
+NRA
List 1: item25 0.6 | item78 0.5 | item83 0.4 | item17 0.3 | item21 0.2 | item91 0.1
List 2: item17 0.6 | item38 0.6 | item14 0.6 | item5 0.6 | item83 0.5 | item21 0.3
List 3: item83 0.9 | item17 0.7 | item61 0.3 | item81 0.2 | item65 0.1 | item10 0.1 | item44 0.1
Round 1 candidates [worst score, best score]: item83 [0.9, 2.1], item17 [0.6, 2.1], item25 [0.6, 2.1]
Min top-2 score: 0.6
Threshold (max of unseen tuples): 0.6 + 0.6 + 0.9 = 2.1
Pruning candidates: min top-2 < best score of candidate
Stopping condition: threshold < min top-2?
10
+NRA
(inverted lists as before)
Round 2 candidates [worst score, best score]: item17 [1.3, 1.8], item83 [0.9, 2.0], item25 [0.6, 1.9], item38 [0.6, 1.8], item78 [0.5, 1.8]
Min top-2 score: 0.9
Threshold (max of unseen tuples): 1.8
Pruning candidates: min top-2 < best score of candidate
Stopping condition: threshold < min top-2?
11
+NRA
(inverted lists as before)
Round 3 candidates [worst score, best score]: item83 [1.3, 1.9], item17 [1.3, 1.9], item25 [0.6, 1.5], item78 [0.5, 1.4]
Min top-2 score: 1.3
Threshold (max of unseen tuples): 1.3
Pruning candidates: min top-2 < best score of candidate
Stopping condition: threshold < min top-2?
No more new items can get into the top-2, but extra candidates are left in the queue.
12
+NRA
(inverted lists as before)
Round 4 candidates: item17 1.6 (exact), item83 [1.3, 1.9], item25 [0.6, 1.4]
Min top-2 score: 1.3
Threshold (max of unseen tuples): 1.1
Pruning candidates: min top-2 < best score of candidate
Stopping condition: threshold < min top-2?
No more new items can get into the top-2, but extra candidates are left in the queue.
13
+NRA
(inverted lists as before)
Round 5 candidates: item83 1.8 (exact), item17 1.6 (exact)
Min top-2 score: 1.6
Threshold (max of unseen tuples): 0.8
Pruning candidates: min top-2 < best score of candidate. Threshold < min top-2 and no candidates remain, so we stop.
14
+
NRA performs only sorted accesses (SA), no random accesses.
A random access (RA) looks up the actual (final) score of an item, which is often very useful.
Problems with NRA: high bookkeeping overhead, and for high values of k the gain in access cost is not significant.
NRA15
+TA
Lists sorted by score:
List 1: item25 0.6 | item78 0.5 | item83 0.4 | item17 0.3 | item21 0.2 | item91 0.1
List 2: item17 0.6 | item38 0.6 | item14 0.6 | item5 0.6 | item83 0.5 | item21 0.3
List 3: item83 0.9 | item17 0.7 | item61 0.3 | item81 0.2 | item65 0.1 | item10 0.1 | item44 0.1
Each round reads one entry (a1, a2, a3) from every list via sorted access.
16
+TA Algorithm: round 1
(lists sorted by score, as on the previous slide)
Read one item from every list via sorted access; resolve its exact score with random accesses to the other lists.
Candidates: item83 1.8, item17 1.6, item25 0.6
min top-2 score: 1.6
maximum score for unseen items: 2.1
17
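The TA rounds above can be sketched as follows. This is a didactic version, not the paper's code; it assumes the threshold condition is met before any list is exhausted:

```python
def ta(lists, k):
    """TA sketch: sorted access to all lists in parallel, plus random
    accesses to resolve each newly seen item's exact score (g = SUM)."""
    index = [dict(lst) for lst in lists]   # random-access lookup per list
    exact = {}                             # item -> exact aggregate score
    for depth in range(min(len(l) for l in lists)):
        threshold = 0.0
        for j, lst in enumerate(lists):
            item, s = lst[depth]
            threshold += s   # best possible score of any unseen item
            if item not in exact:
                # random access: fetch this item's score in every list
                exact[item] = sum(idx.get(item, 0.0) for idx in index)
        topk = sorted(exact.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(topk) == k and topk[-1][1] >= threshold:
            return topk
    return sorted(exact.items(), key=lambda kv: kv[1], reverse=True)[:k]

lists = [
    [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (21, 0.2), (91, 0.1)],
    [(17, 0.6), (38, 0.6), (14, 0.6), (5, 0.6), (83, 0.5), (21, 0.3)],
    [(83, 0.9), (17, 0.7), (61, 0.3), (81, 0.2), (65, 0.1), (10, 0.1), (44, 0.1)],
]
```

On this data TA stops after three rounds, when the threshold (1.3) drops below the min top-2 score (1.6).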
+Computing Exact Scores: Naïve
[Figure: per-(seeker, tag) inverted lists, e.g. for tag = photos and tag = music for seekers Jane and Ann; each list stores items ordered by that seeker's exact score.]
Typical approach: maintain a single inverted list per (seeker, tag), with items ordered by score.
+ can use standard top-k algorithms
-- high space overhead
18
+Computing Score Upper-Bounds
A space-saving strategy: maintain entries of the form (item, itemTaggers), where itemTaggers are all taggers who tagged the item with the tag.
Every item is stored at most once per tag.
Q: what score do we store with each entry? We store the maximum score the item can have across all possible seekers.
This is the Global Upper-Bound (GUB) strategy.
Limitation: the time to dynamically compute exact scores at query time.
19
+Score Upper-Bounds
Global Upper-Bound (GUB): 1 list per tag
tag = music, one list for all seekers
[Figure: inverted list of (item, taggers, upper-bound) entries, e.g. items i6, i1, i2, i3, i5, i4, i9, i7, i8 with tagger sets (Miguel, Kath, Sam, Peter, Jane, Mary, ...) and their upper-bounds.]
Q: what score do we store with each entry? The maximum score the item can have across all possible seekers.
+ low space overhead
-- item upper-bounds, and even list order(!), may differ from EXACT for most users
-- exact scores must be computed dynamically at query time
How do we do top-k processing with score upper-bounds?
20
+Top-k with Score Upper-Bounds
gNRA, "generalized no random access":
access all lists sequentially in parallel
maintain a heap with partial exact scores
stop when the partial exact score of the kth item > highest possible score from unseen/incomplete items (computed using current list upper-bounds)
21
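A minimal sketch of gNRA over upper-bound lists. `exact_tag_score(item, j)` is an assumed helper that recomputes the seeker's exact per-tag score on the fly (e.g. from the stored tagger sets); the paper's actual implementation may differ:

```python
def gnra(ub_lists, exact_tag_score, k):
    """gNRA sketch. `ub_lists[j]` holds (item, upper_bound) pairs sorted by
    bound descending. Returns the top-k item ids."""
    n = len(ub_lists)
    partial, seen_in = {}, {}
    for depth in range(max(len(l) for l in ub_lists)):
        for j, lst in enumerate(ub_lists):
            if depth < len(lst):
                item = lst[depth][0]
                if j not in seen_in.setdefault(item, set()):
                    seen_in[item].add(j)
                    # accumulate the exact (not upper-bound) per-tag score
                    partial[item] = partial.get(item, 0.0) + exact_tag_score(item, j)
        # current upper-bound cursor of each list
        bound = [lst[min(depth, len(lst) - 1)][1] for lst in ub_lists]

        def best_case(i):  # partial exact score + bounds of the unseen lists
            return partial[i] + sum(bound[j] for j in range(n)
                                    if j not in seen_in[i])

        topk = sorted(partial, key=partial.get, reverse=True)[:k]
        if len(topk) == k and partial[topk[-1]] >= sum(bound) and all(
                best_case(i) <= partial[topk[-1]]
                for i in partial if i not in topk):
            return topk
    return sorted(partial, key=partial.get, reverse=True)[:k]
```

The only change relative to plain NRA is that list entries carry upper-bounds while the heap holds exact partial scores, so the stopping test mixes the two.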
+gNRA – NRA Generalization
22
+gTA – TA Generalization
23
+Performance of Global Upper-Bound (GUB) and EXACT
Space overhead: total number of entries in all inverted lists
Query processing time: number of cursor moves

                     GUB        EXACT
space (IL entries)   74K        63M
time (cursor moves)  479-18K    13-189

GUB is the space baseline; EXACT is the time baseline.
24
+Clustering and Query-Processing
We want to reduce the distance between the score upper-bound and the exact score: the greater the distance, the more processing may be required.
Core idea: cluster users into groups and compute an upper-bound per group.
Intuition: group users whose behavior is similar.
25
+Clustering Seekers
Cluster the seekers based on similarity in their scores (because the score of an item depends on the network).
Form an inverted list IL(t,C) for every tag t and cluster C, the score of an item being the maximum score over all seekers in the cluster.
Query processing for Q = t1 ... tn and seeker u: first find the cluster C(u), then perform top-k aggregation over the lists for t1, ..., tn and C(u).
Global Upper-Bound (GUB) is the special case where all seekers fall into the same cluster.
26
+Clustering Seekers
assign each seeker to a cluster
compute an inverted list per cluster: ub(i,t,C) = max_{u ∈ C} |Network(u) ∩ {v | Tagged(v,i,t)}|
+ tighter bounds; item order usually closer to EXACT order than in Global Upper-Bound
-- space overhead still high (a trade-off)
27
Example of clusters:
[Figure: three inverted lists of (item, taggers, upper-bound) entries over items chanel, puma, gucci, adidas, diesel, versace, nike, prada: one Global Upper-Bound list over all seekers, one for cluster C1 (seekers Bob & Alice), and one for cluster C2 (seekers Sam & Miguel). The cluster lists are shorter and carry tighter bounds.]
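The per-cluster bound ub(i,t,C) from the previous slide can be computed directly from the relations; a sketch with illustrative toy data:

```python
def cluster_upper_bound(item, tag, cluster, tagged, link):
    """ub(i,t,C) = max over seekers u in C of |Network(u) ∩ {v | Tagged(v,i,t)}|."""
    taggers_of = {v for (v, i2, t2) in tagged if i2 == item and t2 == tag}
    best = 0
    for u in cluster:
        network_u = {v for (s, v) in link if s == u}
        best = max(best, len(network_u & taggers_of))
    return best

# toy data: Bob's network contains both taggers of (puma, fashion)
tagged = {("Miguel", "puma", "fashion"), ("Sam", "puma", "fashion"),
          ("Kath", "gucci", "fashion")}
link = {("Bob", "Miguel"), ("Bob", "Sam"), ("Alice", "Miguel")}
# cluster_upper_bound("puma", "fashion", {"Alice", "Bob"}, tagged, link) -> 2
```

With cluster = all seekers, the same function yields the Global Upper-Bound.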
+How do we cluster seekers?
Finding a clustering that minimizes the worst-case or average computation time of top-k algorithms is NP-hard.
Proofs are by reduction from the independent task scheduling problem and the minimum sum of squares problem.
The authors present heuristics that use a form of Normalized Discounted Cumulative Gain (NDCG), a measure of the quality of a clustered list for a given seeker and keyword.
The metric compares the ideal order (by exact score) in inverted lists with the actual order (by score upper-bound).
28
+NDCG - Example
i  docID  log2(i)  Rank  Rank/log2(i)  Ideal rank  Ideal/log2(i)
1  D1     0.00     3     (no discount) 3           (no discount)
2  D2     1.00     2     2.00          3           3.00
3  D3     1.58     3     1.89          2           1.26
4  D4     2.00     0     0.00          2           1.00
5  D5     2.32     1     0.43          1           0.43
6  D6     2.58     2     0.77          0           0.00

DCG = 3 + 2.00 + 1.89 + 0.00 + 0.43 + 0.77 = 8.10
Ideal DCG = 3 + 3.00 + 1.26 + 1.00 + 0.43 + 0.00 = 8.69
NDCG = 8.10 / 8.69 = 0.93
29
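The table's arithmetic can be checked with a few lines, using the same discounting convention (position 1 undiscounted, position i >= 2 divided by log2(i)):

```python
from math import log2

def dcg(rels):
    """DCG as in the table: rels[0] + sum of rels[i-1] / log2(i) for i >= 2."""
    return rels[0] + sum(r / log2(i) for i, r in enumerate(rels[1:], start=2))

ranking = [3, 2, 3, 0, 1, 2]            # relevance grades in retrieved order
ideal = sorted(ranking, reverse=True)   # [3, 3, 2, 2, 1, 0]
ndcg = dcg(ranking) / dcg(ideal)        # about 0.93, matching the table
```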
+Clustering Taggers
For each tag t, we partition the taggers into separate clusters.
We form an inverted list per cluster; an item i in the list for cluster C gets the score max_{u ∈ Seekers} |Network(u) ∩ C ∩ {v | Tagged(v,i,t)}|
How to cluster taggers? Build a graph whose nodes are the taggers, with an edge between v1 and v2 iff:
|Items(v1,t) ∩ Items(v2,t)| ≥ threshold
30
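Building the tagger overlap graph described above is straightforward; a sketch with illustrative user and item ids (how the graph is then partitioned into clusters is a separate step):

```python
def tagger_graph(tag, tagged, threshold):
    """Nodes: taggers of `tag`. Edge (v1, v2) iff the taggers' item sets
    for `tag` overlap in at least `threshold` items."""
    items = {}
    for (v, i, t) in tagged:
        if t == tag:
            items.setdefault(v, set()).add(i)
    nodes = sorted(items)
    edges = {(v1, v2)
             for a, v1 in enumerate(nodes) for v2 in nodes[a + 1:]
             if len(items[v1] & items[v2]) >= threshold}
    return nodes, edges

# toy data: u1 and u2 share two 'music' items, u3 shares none
tagged = {("u1", "i1", "music"), ("u1", "i2", "music"),
          ("u2", "i1", "music"), ("u2", "i2", "music"),
          ("u3", "i9", "music"), ("u1", "i9", "photos")}
```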
+Clustering Seekers: Metrics
Space: Global Upper-Bound has the lowest overhead. ASC and NCT achieve an order of magnitude improvement in space overhead over EXACT.
Time: both gNRA and gTA outperform Global Upper-Bound. ASC outperforms NCT on both sequential and total accesses in all cases for gTA, and in all cases except one for gNRA.
Inverted lists are shorter, and the score upper-bound order is similar to the exact score order for many users.
Average % improvement over Global Upper-Bound: Normalized Cut (NCT) 38-72%, Ratio Association (ASC) 67-87%.
31
+Clustering Seekers
Cluster-Seekers improves query execution time over GUB by at least an order of magnitude, for all queries and all users
32
+Clustering Taggers
Space: overhead is significantly lower than that of EXACT and of Cluster-Seekers.
Time: best case, all taggers relevant to a seeker reside in a single cluster; worst case, all taggers reside in separate clusters.
Idea: cluster taggers based on overlap in tagging.
assign each tagger to a cluster
compute cluster upper-bounds: ub(i,t,C) = max_{u ∈ Seekers} |Network(u) ∩ C ∩ {v | Tagged(v,i,t)}|
33
+Clustering Taggers
34
+Conclusion and Next Steps
Cluster-Taggers worked best for seekers whose network fell into at most 3 * #tags clusters; for other seekers, query execution time degraded due to the number of inverted lists that had to be processed.
For the former seekers, Cluster-Taggers outperformed Cluster-Seekers in all cases, and outperforms Global Upper-Bound by 94-97% in all cases.
Extended traditional top-k algorithms to network-aware search.
Achieved a balance between time and space consumption.
35