Transcript
1. A Publish/Subscribe Model for Top-k Matching Over Continuous
Data-streams Author: Y.S. Horawalavithana 10002103 Supervisor: Dr.
D.N. Ranasinghe
2. Outline: Motivation; Research Problem; Re-cap proposal defense!;
Design & Architecture; Related Work; Contribution; Scoring
Algorithm; Query Personalization; Events Novelty; Relevancy +
Freshness; MAXDIVREL Diversity; Dual-Indexing mechanism; To Do
List
3. Motivation The Big Filter
4. General Publish/Subscribe Model
5. Traditional Pub/Sub Matching
6. Drawbacks in Boolean Matching: Traditional publish/subscribe
(diagram: Publish, Subscribe, Notify). Bob wants updates about
smartphones. He wants to be notified about products from Verizon
and AT&T. Ideally, though, Bob wants notifications about Verizon
products only when there are not enough notifications from
AT&T.
7. Drawbacks in Boolean Matching (Contd.): Subscriptions and
matching publications are all treated as equally important.
Publications are delivered to Bob whenever a subscription is
satisfied. As a result: Bob may be either overloaded with
publications or receive too few over time; different matching
publications cannot be compared with respect to Bob's subscriptions
because no ranking function is defined; and partial matching between
subscriptions and publications is not supported.
8. Top-k Publish/Subscribe: Expressive, stateful query processing
systems that overcome the drawbacks identified in traditional pub/sub
systems. A user-defined parameter k restricts the number of delivered
publications. Pub/sub matching? Top-k pub/sub scoring or ranking.
Pub/sub indexing? Indexing to support personalized subscriptions and
indexing to support continuous retrieval of top-k publications.
10. Research Goal: How can the information overload problem be
alleviated using the publish/subscribe communication paradigm,
augmented by different scoring mechanisms over continuous
information streams?
11. Research Problem: 1. How to define an efficient scoring
algorithm that integrates query-independent and query-dependent
score metrics (relevance, freshness and diversity)?
2. How to adapt the indexing data structures used in
state-of-the-art publish/subscribe systems to support top-k
matching queries under a) a large subscription volume, b) a high
event rate (velocity) and c) a variety of subscribable
attributes?
12. Scope
14. Centralized Top-k Publish/Subscribe
15. Why not client-centered top-k matching with a traditional
pub/sub layer on top? From the subscriber's point of view: we
support partial matching between subscriptions and publications;
we support personalized subscriptions; we address the overlapping
interests of many subscribers; we can experiment with system
resiliency (retrieving top-k results based on domain knowledge);
and we can hold a large subscription space with a variety of
attributes through an efficient in-memory indexing mechanism. From
the publisher's point of view: top-k results depend on the order of
incoming matched publications.
21. Comparison: Subscription (Contd.). Typical pub/sub: a
publication is simply matched whenever there is a satisfied
subscription. Top-k pub/sub: a publication is scored against the
space of satisfied subscriptions. [Figure: example predicates
Item = Smartphone and Carrier = AT&T, shown for both models.]
22. Comparison: Subscription. Typical pub/sub: all subscriptions
are treated equally; there are no personalized subscriptions. Top-k
pub/sub: subscribers can express that some events are more important
than others by ranking subscriptions; can assign a degree of interest
over the subscription space; can limit redundancy by avoiding results
with overlapping content (e.g. "AT&T Smartphone" is included in
"Smartphone"); and can make rare events visible.
23. How to assign preferences over subscriptions? Quantitative
approach: assign an interest score to each subscription. Qualitative
approach: specify the relative interest between two subscriptions.
[Figure: the predicates Item = Smartphone and Carrier = AT&T,
annotated with the scores 0.7, 0.5 and 0.9 (quantitative) and
compared with > and < (qualitative).]
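The quantitative approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each subscription is taken to be a conjunction of attribute = value predicates with an interest weight in [0, 1], and a publication's score is assumed to be the highest weight among the subscriptions it satisfies. The aggregation rule, the function names and the sample weights are hypothetical and are not the scoring defined in the thesis.

```python
import heapq

def satisfies(publication, predicates):
    """True if the publication matches every attribute = value predicate."""
    return all(publication.get(attr) == value for attr, value in predicates.items())

def score(publication, subscriptions):
    """Highest interest weight among satisfied subscriptions (0.0 if none).
    Taking the maximum is an assumption made for this sketch."""
    weights = [w for predicates, w in subscriptions if satisfies(publication, predicates)]
    return max(weights, default=0.0)

def top_k(publications, subscriptions, k):
    """Return up to k (publication, score) pairs, best score first."""
    scored = [(score(p, subscriptions), i) for i, p in enumerate(publications)]
    matched = [(s, i) for s, i in scored if s > 0.0]
    best = heapq.nlargest(k, matched, key=lambda pair: pair[0])
    return [(publications[i], s) for s, i in best]

# Bob's subscriptions with hypothetical interest weights.
bob = [
    ({"item": "smartphone"}, 0.5),
    ({"item": "smartphone", "carrier": "AT&T"}, 0.9),
    ({"item": "smartphone", "carrier": "Verizon"}, 0.7),
]
stream = [
    {"item": "smartphone", "carrier": "Verizon"},
    {"item": "smartphone", "carrier": "AT&T"},
    {"item": "tablet", "carrier": "AT&T"},
    {"item": "smartphone", "carrier": "Sprint"},
]
result = top_k(stream, bob, k=2)
```

With k = 2, the AT&T smartphone ranks first and the Verizon one second, while the tablet is never delivered because it satisfies no subscription.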
24. Personalized subscriptions. Ordering models: Explicit Global
Ordering; Explicit Local Ordering; Explicit Local + Implicit Global
Ordering. Preference types: Subscription Preferences; Attribute
Preferences; Attribute-Subscription Preferences. [Figure examples:
Carrier = AT&T, OS = Android (0.9) > Carrier = Verizon, OS = iOS
(0.7); Carrier = AT&T > Carrier = Verizon; OS = iOS < OS = Android;
Carrier = AT&T (0.6), OS = Android (0.3), Carrier = Verizon (0.2),
OS = iOS (0.5); Carrier = AT&T (0.3), OS = iOS (0.7), Brand = Apple
(0.4).]
25. We Propose: Relating Attributes. a) Subscription covering, b)
subscription merging, c) relating attributes. [Figure: three panels
showing subscriptions S1, S2 and S3 over the axes attribute1 and
attribute2.]
26. Relating Attributes: Demonstration. Let's assume that Bob
would like to be notified about products related to the following
personalized queries:
32. Subscription Indexing: Matching a publication against the
users' personalized subscription space can become a performance
bottleneck. Indexing has been studied extensively in the pub/sub
community, so we don't re-invent the wheel: we extend an existing
indexing mechanism to apply our personalized subscription model.
33. Decision Making. opIndex: dynamically adapts to the variety of
attributes; two-space partitioning (attribute and operator); can
support a wide range of operators (e.g. regular expressions);
performs better as the subscription space becomes larger in terms of
index construction time, memory cost and query processing time.
k-Index, BE* Index: can't deal with the variety of attributes;
three-space partitioning (subscription size, attribute and value);
support only a small set of operators; are outperformed by opIndex.
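To make the idea of partitioning by attribute and then by operator concrete, here is a simplified Python sketch. It is not the opIndex implementation: the class name, the three supported operators and the counting-based matching step are assumptions chosen only to illustrate a two-level inverted index over subscription predicates.

```python
import operator
from collections import defaultdict

# Operators supported by this sketch (an assumption, easily extended).
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

class TwoLevelIndex:
    """Inverted index: attribute -> operator -> [(operand, subscription id)]."""

    def __init__(self):
        self.partitions = defaultdict(lambda: defaultdict(list))
        self.sizes = {}  # subscription id -> number of predicates

    def add(self, sub_id, predicates):
        """predicates: iterable of (attribute, operator symbol, operand)."""
        predicates = list(predicates)
        self.sizes[sub_id] = len(predicates)
        for attr, op, operand in predicates:
            self.partitions[attr][op].append((operand, sub_id))

    def match(self, publication):
        """Return ids of subscriptions whose predicates are all satisfied."""
        hits = defaultdict(int)
        for attr, value in publication.items():
            if attr not in self.partitions:
                continue  # no subscription mentions this attribute
            for op, entries in self.partitions[attr].items():
                test = OPS[op]
                for operand, sub_id in entries:
                    if test(value, operand):
                        hits[sub_id] += 1
        # A subscription matches when every one of its predicates was hit.
        return {s for s, n in hits.items() if n == self.sizes[s]}

index = TwoLevelIndex()
index.add("s1", [("item", "=", "smartphone")])
index.add("s2", [("item", "=", "smartphone"), ("price", "<", 500)])
index.add("s3", [("carrier", "=", "AT&T"), ("price", ">", 600)])
matched = index.match({"item": "smartphone", "price": 450, "carrier": "AT&T"})
```

Only the partitions for attributes present in the publication are visited, which is the point of partitioning by attribute first. The per-subscription hit counts could also serve as a starting point for partial matching, although that extension is not shown here.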
35. Events Novelty. Motivation: a popular news pub/sub system
like Google News maintains publications from the last 30 days, but
most of the time produces top-k results from the last day or two.
Novelty matters most in top-k computation. Demonstration: using a
time policy to compute top-k results.
36. When to compute top-k results? Our matching model deals with
a continuous data stream. It is impossible to filter an unbounded
stream, so we need a time policy to compute top-k results per
subscription: I. Continuous, II. Periodic, III. Sliding windows.
37. Sliding Window Top-k Computation: Compute top-k results
based on the publications within moving windows (time or events).
[Figure: example with w = 2, showing publications P1 to P9 on a
timeline marked T, 2T, 3T, 4T and 5T, with P1, P2 and P4 listed
separately.]
38. Remark: Sliding Window. More adaptive than continuous and
periodic: when w = 1 it acts as continuous; when w = T it acts as
periodic. But here w is flexible: we can dynamically change w based
on the event arrival rate, which can handle streams whose arrivals
do not follow a Poisson distribution. Without loss of generality,
our model is based on sliding event windows. But what happens when
the event window becomes larger?
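A sliding event window can be sketched as a bounded queue of scored publications from which the top-k is read after each arrival. The class and method names below are hypothetical, and recomputing the top-k over the whole window on every event is a simplification that only suits small windows.

```python
from collections import deque
import heapq

class SlidingWindowTopK:
    """Keep the last w scored publications and report the k best of them."""

    def __init__(self, w, k):
        self.k = k
        self.window = deque(maxlen=w)  # the oldest event drops out automatically

    def push(self, pub_id, score):
        """Add one event, then return the current top-k as (pub_id, score)."""
        self.window.append((pub_id, score))
        return heapq.nlargest(self.k, self.window, key=lambda e: e[1])

    def resize(self, w):
        """Change the window size, keeping only the most recent events."""
        self.window = deque(self.window, maxlen=w)

topk = SlidingWindowTopK(w=3, k=2)
for pub_id, s in [("P1", 0.9), ("P2", 0.4), ("P3", 0.6), ("P4", 0.5)]:
    latest = topk.push(pub_id, s)
```

After P4 arrives, P1 has left the window even though it had the highest score, so the top-2 is P3 followed by P4. Calling `resize` is one way to model a window size w that follows the event arrival rate.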
39. Freshness: Time Decaying. Problem: older publications may
prevent newer publications from entering the top-k results. Solution:
lease or expire publications using a time decay function. We combine
freshness with the relevancy score.
40. Time Decaying Function: We use forward decay to compute
the publication age, so we don't have to recompute the decay score in
each window.
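Forward decay measures a publication's age forward from a fixed landmark time rather than backward from the current time. A minimal sketch follows, assuming an exponential function g, a landmark of 0 and a decay rate of 0.1; all three are illustrative choices and not values from the thesis. The weight of a publication with timestamp t_i at query time t is g(t_i - L) / g(t - L).

```python
import math

LANDMARK = 0.0  # assumed landmark time L, no later than any publication
ALPHA = 0.1     # assumed decay rate

def g(offset):
    """Monotone non-decreasing function of the offset from the landmark."""
    return math.exp(ALPHA * offset)

def static_weight(timestamp):
    """Numerator g(t_i - L): computed once, when the publication arrives."""
    return g(timestamp - LANDMARK)

def decayed_weight(timestamp, now):
    """Forward-decayed weight in (0, 1] at query time `now`."""
    return static_weight(timestamp) / g(now - LANDMARK)

def rank_by_freshness(publications):
    """publications: list of (pub_id, timestamp); freshest first.
    The denominator is the same for every item at query time, so
    ordering by the static numerator alone is enough."""
    return sorted(publications, key=lambda p: static_weight(p[1]), reverse=True)

ranking = rank_by_freshness([("P1", 2.0), ("P2", 9.0), ("P3", 5.0)])
```

Because the numerator never changes after arrival, ranking within a window needs no per-window recomputation of each publication's decay score.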
42. Relevancy Decaying Function
44. Event Diversity: In top-k publish/subscribe, getting diverse
results within the top-k publications plays a major role. As an
example, Bob would like to be notified about smartphones with
carrier = AT&T and brand = HTC. Without the notion of diversity,
the delivered top-k publications may be very similar to each other.
Even though the received publications are personalized, Bob may
consider such a system ineffective.
45. Define Diversity: Taxonomy. Result diversification:
dissimilarity, coverage, novelty. Discrete or continuous domain.
46. Dissimilarity: Choosing to deliver items that are dissimilar
to each other. P-dispersion problem: selecting k items out of n such
that the average pairwise distance between the selected items is
maximized; NP-hard. k-diversity problem: is based on the p-dispersion
problem; relies on heuristics to solve large instances of the
problem.
47. K-diversity problem: Let P be the set of matching
publications, |P| = n. Given a distance metric d that expresses the
dissimilarity between publication points, find the diverse subset
S* of P such that
$$S^{*} = \arg\max_{S \subseteq P,\ |S| = k} f(S, d),$$
where f(S, d) measures the diversity of S under d.
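Since the underlying problem is NP-hard, a heuristic is needed in practice. The sketch below shows one common greedy heuristic for the sum-of-distances variant: seed with the farthest pair, then repeatedly add the item with the largest total distance to the items already chosen. It is an illustration only and is not the MAXDIVREL algorithm; the function name, the 2-D points and the Euclidean metric are assumptions.

```python
import math
from itertools import combinations

def greedy_k_diverse(points, k, dist):
    """Greedy heuristic for picking k mutually distant items from points."""
    if k <= 0 or not points:
        return []
    if k == 1 or len(points) == 1:
        return [points[0]]
    # Seed with the two items that are farthest apart.
    first, second = max(
        combinations(range(len(points)), 2),
        key=lambda pair: dist(points[pair[0]], points[pair[1]]),
    )
    chosen = [first, second]
    remaining = set(range(len(points))) - set(chosen)
    while len(chosen) < k and remaining:
        # Add the item with the largest summed distance to the chosen set.
        best = max(
            sorted(remaining),
            key=lambda i: sum(dist(points[i], points[j]) for j in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return [points[i] for i in chosen]

# Hypothetical publication points in a 2-D feature space.
pubs = [(0, 0), (0, 1), (10, 0), (4, 9), (9, 1)]
diverse = greedy_k_diverse(pubs, k=3, dist=math.dist)
```

The farthest-pair seeding costs O(n^2) distance evaluations, which is acceptable for a window of matched publications but would need an index or sampling for larger sets.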
48. MAXDIVREL Diversity: Addresses the continuous k-diversity
problem.
49. Not to reinvent the wheel: Most diversity definitions are
aligned with the p-dispersion problem. Here, we combine diversity
and relevancy in a mono-objective formulation that is no longer
based on p-dispersion.
50. Beyond Diversity & Relevance: We select a diverse set which
increases the "global" importance of a selected publication and
reduces the "global" importance of a non-selected publication. We
define the problem in a static version, the MAXDIVREL k-diversity
problem, and in a continuous version, the MAXDIVREL continuous
k-diversity problem.
51. Demonstration: MAXDIVREL
52. Definition: MAXDIVREL (static version)
53. MAXDIVREL k-diversity problem: Can be mapped to the top-k
representative query problem in graph databases, which is NP-hard.
It is a specialized version of the set cover problem. Can prove!
54. MAXDIVREL k-diversity Algorithm: Greedy
55. MAXDIVREL Continuous k-diversity problem: Continuity
requirements. Durability: an item selected as diverse in one window
may still have the chance to be selected in the next window if it has
not expired and the other valid items in the next window fail to
compete with it. Order: the publication stream follows chronological
order. We avoid selecting an item j as diverse later when we have
already selected an item i that is not older than j.
56. Definition: MAXDIVREL (continuous version)
58. MAXDIVREL continuous k-diversity problem: Apply the MAXDIVREL
k-diversity greedy algorithm in each window. Time complexity is an
issue when re-calculating the neighborhood. We propose an incremental
MAXDIVREL algorithm: calculate the neighborhood at the next window
using the neighborhood already calculated at the current window.
Index the publications at each window and combine this with
subscription indexing: the dual-indexing mechanism!
60. To Do List: Implementation. Indexing based on an inverted index
(why an inverted index?). Centralized; will try cloud-based. Using a
message broker system, e.g. RabbitMQ, ZeroMQ, ActiveMQ (why
RabbitMQ?).
61. To Do List: Evaluation. Multiple directions. Zipf property:
using synthetic and real data sets (e.g. a Zipf distribution tool,
eBay, AOL query logs). Algorithm efficiency: experiment with the
volume of subscriptions, the variety of publications, and the arrival
rate of publications (e.g. dynamic sliding window model), using the
POIKILO evaluation tool. Dual-indexing performance and scalability:
experiment with index construction time at each window, memory cost,
and query processing time (e.g. neighborhood calculation).
62. Thank You! Your review will be Golden! Welcome to read the
design chapters!