[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams

1. A Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams Author: Y.S. Horawalavithana 10002103 Supervisor: Dr. D.N. Ranasinghe

2. Outline Motivation Research Problem Re-cap proposal defense! Design & Architecture Related Work Contribution Scoring Algorithm Query Personalization Events Novelty Relevancy + Freshness MAXDIVREL Diversity Dual-Indexing mechanism To Do List

3. Motivation The Big Filter

4. General Publish/Subscribe Model

5. Traditional Pub/Sub Matching

6. Drawbacks in Boolean Matching. [Figure: traditional publish/subscribe with Publish, Subscribe, and Notify steps.] Bob wants updates about smartphones. He prefers to be notified about products from Verizon and AT&T. Ideally, though, Bob wants notifications about Verizon products only if there are not enough notifications from AT&T.

7. Drawbacks in Boolean Matching (Contd.) Subscriptions and matching publications are all considered equally important. Publications are delivered to Bob whenever a subscription is satisfied, so Bob may be either overloaded with publications or receive too few over time. Different matching publications cannot be compared with respect to Bob's subscriptions, because no ranking functions are defined. Partial matching between subscriptions and publications is not supported.

8. Top-k Publish/Subscribe. Expressive, stateful query processing systems that overcome the drawbacks identified in traditional pub/sub systems. A user-defined parameter k restricts the number of delivered publications. Pub/Sub matching? Top-k pub/sub scoring or ranking. Pub/Sub indexing? Indexing to support personalized subscriptions, and indexing to support continuous top-k publication retrieval.

10. Research Goal: How can the information overload problem be alleviated using the publish/subscribe communication paradigm, augmented by different scoring mechanisms over continuous information streams?

11. Research Problem. 1. How to define an efficient scoring algorithm that integrates query-independent and query-dependent score metrics (relevance, freshness, and diversity)? 2. How to adapt the indexing data structures used in state-of-the-art publish/subscribe systems so that they support top-k matching queries under a) a large subscription volume, b) a high event rate (velocity), and c) a variety of subscribable attributes?

12. Scope

14. Centralized Top-k Publish/Subscribe

15. Why not client-centered top-k matching with a traditional pub/sub layer on top? From the subscriber's point of view: we support partial matching between subscriptions and publications; we support personalized subscriptions; we address the overlapping interests of many subscribers; we can experiment with system resiliency by retrieving top-k results based on domain knowledge; and we can hold a large subscription space with a variety of attributes through an efficient in-memory indexing mechanism. From the publisher's point of view: results depend on the order of incoming matched publications.

17. [Architecture diagram. Components shown: Publication Stream; Matching (Relevance Matching, Subscription Indexing); Subscription Store holding Personalized Subscriptions; Publication Store holding Publications with Relevance Scores (Publication Indexing, Expire); Continuous Top-k Diversity over a Sliding Window (Relevancy, Dissimilarity); Event Delivery; Top-k Notification Store holding Notifications (Expire).]

19. k-index (Whang2009); BE*Tree-index (Sadoghi2012); gridIndex (Pripuzic2012); opIndex (Zhang2014); MAXMIN Diversity / Cover Tree (Drosou2014); Pref_pub/sub (Drosou2009); Top-k/w pub/sub (Pripuzic2012); Forward_Decay (Cormode2009); Binary_Decisions (Campailla2001); Publication_Aging (Shraer2013); Pref_pub/sub with diversity (Pitoura2009); DisC_diversity (Drosou2012); Top-k representative Queries (Ranu2014)

21. Comparison: Subscription (Contd.) Typical Pub/Sub: a publication is simply matched whenever there is a satisfied subscription. Top-k Pub/Sub: a publication is scored against the space of satisfied subscriptions. [Figure: example subscriptions {Item = Smartphone} and {Item = Smartphone, Carrier = AT&T}, shown for both models.]

22. Comparison: Subscription. Typical Pub/Sub: all subscriptions are considered equal; there are no personalized subscriptions. Top-k Pub/Sub: subscribers can express that some events are more important than others by ranking subscriptions; subscribers can have a degree of interest over the subscription space; redundancy is limited by avoiding results with overlapping content (e.g. "AT&T Smartphone" is included in "Smartphone"); rare events are made visible.

23. How to assign preference over subscriptions? Quantitative approach: assign an interest value to each subscription. Qualitative approach: specify the relative interest between two subscriptions. [Figure: subscriptions {Item = Smartphone} and {Item = Smartphone, Carrier = AT&T}, annotated with interest values 0.7, 0.5, 0.9 for the quantitative approach and with > / < orderings for the qualitative approach.]
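A minimal sketch of the quantitative approach combined with partial matching. The scoring rule (fraction of satisfied predicates scaled by the interest value), the function name `partial_match_score`, and the interest numbers are illustrative assumptions, not the scoring function defined in the thesis.

```python
from typing import Dict

Subscription = Dict[str, str]  # attribute -> required value (equality only)
Publication = Dict[str, str]   # attribute -> published value


def partial_match_score(sub: Subscription, interest: float, pub: Publication) -> float:
    """Score one publication against one subscription.

    Assumed rule: the fraction of the subscription's predicates that the
    publication satisfies, scaled by the subscriber's interest in [0, 1].
    """
    if not sub:
        return 0.0
    satisfied = sum(1 for attr, value in sub.items() if pub.get(attr) == value)
    return interest * satisfied / len(sub)


# Bob's subscriptions with hypothetical interest values.
subscriptions = [
    ({"Item": "Smartphone"}, 0.5),
    ({"Item": "Smartphone", "Carrier": "AT&T"}, 0.9),
]

# A Verizon smartphone satisfies the second subscription only partially.
pub = {"Item": "Smartphone", "Carrier": "Verizon"}
scores = [partial_match_score(s, w, pub) for s, w in subscriptions]
```

Under Boolean matching the second subscription would simply not match; here it still contributes a reduced score, which is what makes ranking across subscriptions possible.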

24. Personalized subscriptions. Subscription preferences (explicit global ordering), attribute preferences (explicit local ordering), and attribute-subscription preferences (explicit local + implicit global ordering). [Figure examples: {Carrier = AT&T, OS = Android} 0.9 and {Carrier = Verizon, OS = iOS} 0.7, related by >; attribute orderings over Carrier = AT&T, Carrier = Verizon and over OS = iOS, OS = Android, marked with > and <; weighted attributes: Carrier = AT&T (0.6), OS = Android (0.3), Carrier = Verizon (0.2), OS = iOS (0.5), Carrier = AT&T (0.3), OS = iOS (0.7), Brand = Apple (0.4).]

25. We Propose: Relating Attributes. a) Subscription covering b) Subscription merging c) Relating attributes. [Figure: panels over axes attribute1 and attribute2, with subscriptions S1, S2, S3.]

26. Relating Attributes: Demonstration. Let's assume that Bob would like to be notified about products related to the following personalized queries:

27. Relating Attributes: Demonstration. [Figure labels: Brand = HTC (0.3), Storage 32GB (0.6), 2, Carrier = Verizon (0.5), Storage 32GB (0.2), 2.5, Carrier = AT&T (0.4), Storage 16 (0.7), 1.75, Brand = HTC (0.3), 1.3, 2.3]

28. [Figure labels: 2, Carrier = Verizon, Storage 32GB, 2.5, Carrier = AT&T, Storage 16, 1.75, Brand = HTC, 1.3, 2.3]

29. Relating Attributes: Demonstration A seller pushes a product

30. [Figure labels: 2, Carrier = Verizon, Storage 32GB, 2.5, Carrier = AT&T, Storage 16, 1.75, Brand = HTC, 1.3, 2.3]

31. Relevancy Score

32. Subscription Indexing. Matching a publication against the users' personalized subscription space can become a performance bottleneck. This has been studied extensively in the pub/sub community, so we don't re-invent the wheel: we extend an existing indexing mechanism to apply our personalized subscription model.

33. Decision Making. opIndex: dynamically adapts to the variety of attributes; two-space partitioning (attribute and operator); can support a wide range of operators (e.g. regular expressions); performs better as the subscription space grows, in index construction time, memory cost, and query processing time. k-Index, BE*Tree-index: cannot deal with the variety of attributes; three-space partitioning (subscription size, attribute, and value); support only a small set of operators; are outperformed by opIndex.

35. Events Novelty. Motivation: a popular news pub/sub system such as Google News maintains publications from the last 30 days, but most of the time produces top-k results from the last day or two. This is most important in top-k computation. Demonstration: using a time policy to compute top-k results.

36. When to compute top-k results? Our matching model deals with a continuous data stream. It is impossible to filter an unbounded stream, so we need a time policy to compute top-k results per subscription: I. Continuous II. Periodic III. Sliding windows

37. Sliding Window Top-k Computation. Compute top-k results based on the publications within moving windows (of time or events), e.g. w = 2. [Figure: publications P1 to P9 on a timeline marked T, 2T, 3T, 4T, 5T, with P1, P2, P4 called out.]

38. Remark: Sliding Window. More adaptive than continuous and periodic: when w = 1 it acts as continuous; when w = T it acts as periodic. But here w is flexible: we can change w dynamically based on the event arrival rate, and can therefore address streams that do not follow a Poisson distribution. Without loss of generality, our model is based on sliding event windows. But what happens when the event window becomes larger?
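A sliding event window can be sketched as below. This is a minimal illustration assuming one subscription, precomputed scores, and a fixed window of the last w events; the class name `SlidingTopK` and the sample scores are hypothetical. It recomputes the top-k on every arrival, which is exactly the cost that grows with the window size.

```python
import heapq
from collections import deque
from typing import Deque, List, Tuple

Scored = Tuple[float, str]  # (score, publication id)


class SlidingTopK:
    """Top-k over the most recent `window` events for one subscription."""

    def __init__(self, k: int, window: int) -> None:
        self.k = k
        # deque(maxlen=...) silently drops the oldest event once it is full.
        self.window: Deque[Scored] = deque(maxlen=window)

    def push(self, pub_id: str, score: float) -> List[str]:
        """Add one event and return the current top-k ids, best first."""
        self.window.append((score, pub_id))
        best = heapq.nlargest(self.k, self.window)
        return [pid for _, pid in best]


topk = SlidingTopK(k=2, window=3)
stream = [("P1", 0.4), ("P2", 0.9), ("P3", 0.1), ("P4", 0.5), ("P5", 0.3)]
results = [topk.push(pid, score) for pid, score in stream]
# After P4 arrives, P1 has left the window, so the top-2 is P2, P4.
```

With window=1 every arrival is its own result set, which mirrors the continuous case described in the remark.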

39. Freshness: Time Decaying. Problem: older publications may prevent newer publications from entering the top-k results. Solution: lease or expire publications using a time decay function. We combine freshness with the relevancy score.

40. Time Decaying Function. We use forward decay to compute the publication age, so we don't have to recompute the decay score in each window.
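A minimal sketch of forward decay with an exponential function g(n) = exp(alpha * n). The landmark `LANDMARK`, the rate `ALPHA`, and the function names are assumptions for illustration; the slides do not state which g or parameters the thesis uses. The decayed weight of an item that arrived at time t_i, evaluated at time t, is g(t_i - L) / g(t - L). The numerator is fixed at arrival, and the denominator is the same for every item at a given t, so the relative order of items never has to be recomputed.

```python
import math

LANDMARK = 0.0  # assumed landmark time L, no later than any arrival
ALPHA = 0.1     # assumed decay rate


def g(n: float) -> float:
    """Exponential forward-decay function (one possible choice of g)."""
    return math.exp(ALPHA * n)


def static_weight(arrival: float) -> float:
    """Computed once when the publication arrives; never changes afterwards."""
    return g(arrival - LANDMARK)


def decayed_weight(arrival: float, now: float) -> float:
    """Forward-decayed weight at time `now`: g(t_i - L) / g(t - L)."""
    return static_weight(arrival) / g(now - LANDMARK)


arrivals = {"old": 2.0, "new": 8.0}
static = {pid: static_weight(t) for pid, t in arrivals.items()}

# Ordering by the static weight equals ordering by the decayed weight at any time.
order_static = sorted(arrivals, key=static.get, reverse=True)
order_at_10 = sorted(arrivals, key=lambda p: decayed_weight(arrivals[p], 10.0), reverse=True)
```

For very long streams exp(ALPHA * n) can grow large, so a practical system would move the landmark forward from time to time.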

42. Relevancy Decaying Function

44. Event Diversity. In top-k publish/subscribe, getting diverse results within the top-k publications plays a major role. As an example, Bob would like to be notified about smartphones with carrier = AT&T and brand = HTC. Without a notion of diversity, the delivered top-k publications may be very similar to each other. Even though the received publications are personalized, Bob may perceive such a system as ineffective.

45. Define Diversity: Taxonomy. Result diversification: dissimilarity, coverage, novelty. Discrete or continuous domain.

46. Dissimilarity. Choose to deliver items that are dissimilar to each other. P-dispersion problem: select k items out of n such that the average pairwise distance between the selected items is maximized; NP-hard. k-diversity problem: based on the p-dispersion problem; relies on heuristics to solve large instances.

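47. K-diversity problem. Let P be the set of matching publications, |P| = n, and let d be a distance metric expressing the dissimilarity between publication points. The problem is to find the diverse subset $S^{*}$ of P such that $S^{*} = \arg\max_{S \subseteq P,\ |S| = k} f(S, d)$, where $f$ measures the diversity of $S$ under $d$.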
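Because the k-diversity problem is NP-hard, large instances are solved heuristically. The sketch below is a generic greedy heuristic for the sum-of-distances form of p-dispersion, assuming Euclidean distance over 2-D points; it is not the MAXDIVREL greedy algorithm, and the name `greedy_k_diverse` and the sample points are hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def greedy_k_diverse(items: Sequence[T], k: int, d: Callable[[T, T], float]) -> List[T]:
    """Greedily pick k items with a large total pairwise distance."""
    if k <= 0 or not items:
        return []
    if k == 1 or len(items) == 1:
        return [items[0]]  # any single item is trivially diverse
    # Seed with the two items that are farthest apart.
    a, b = max(combinations(range(len(items)), 2),
               key=lambda p: d(items[p[0]], items[p[1]]))
    chosen = [a, b]
    remaining = set(range(len(items))) - {a, b}
    while len(chosen) < k and remaining:
        # Add the item with the largest summed distance to those already chosen.
        nxt = max(sorted(remaining),
                  key=lambda i: sum(d(items[i], items[j]) for j in chosen))
        chosen.append(nxt)
        remaining.remove(nxt)
    return [items[i] for i in chosen]


points = [(0, 0), (1, 1), (10, 0), (0, 10), (9, 1)]
diverse = greedy_k_diverse(points, 3, math.dist)
```

On these points the heuristic seeds with (10, 0) and (0, 10) and then adds (0, 0), skipping the near-duplicates (1, 1) and (9, 1). A greedy choice like this gives no optimality guarantee in general.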

48. MAXDIVREL Diversity. Addresses the continuous k-diversity problem.

49. Not to reinvent the wheel. Most diversity definitions are aligned with the p-dispersion problem. Here, we combine diversity and relevancy in a mono-objective formulation that is no longer based on p-dispersion.

50. Beyond Diversity & Relevance. We select a diverse set which increases the "global" importance of each selected publication and reduces the "global" importance of each non-selected publication. We define the static version of the problem (MAXDIVREL k-diversity problem) and the continuous version (MAXDIVREL continuous k-diversity problem).

51. Demonstration: MAXDIVREL

52. Definition: MAXDIVREL (static version)

53. MAXDIVREL k-diversity problem. Can be mapped to the top-k representative query problem in graph databases, which is NP-hard. A specialized version of the set cover problem. Can prove!

54. MAXDIVREL k-diversity Algorithm: Greedy

55. MAXDIVREL Continuous k-diversity problem. Continuity requirements. Durability: an item selected as diverse in one window may still be selected in the next window if it has not expired and the other valid items in the next window fail to compete with it. Order: the publication stream follows chronological order; we avoid selecting an item j as diverse later when we have already selected an item i that is not older than j.

56. Definition: MAXDIVREL (continuous version)

58. MAXDIVREL continuous k-diversity problem. Apply the MAXDIVREL k-diversity greedy algorithm in each window. Time complexity is a concern when re-calculating the neighborhood. We propose an incremental MAXDIVREL algorithm: calculate the neighborhood at the next window using the neighborhood already calculated at the current window. Indexing publications at each window, combined with subscription indexing: a dual-indexing mechanism!

60. To Do List: Implementation. Indexing based on an inverted index (why an inverted index?). Centralized; will try cloud-based. Using a message broker system, e.g. RabbitMQ, ZeroMQ, ActiveMQ (why RabbitMQ?).
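One way an inverted index can serve subscription matching is sketched below. It assumes equality predicates only and counts satisfied predicates per subscription, so both full and partial matches fall out of one pass over the publication. The class `InvertedIndex` and its methods are hypothetical, not the planned implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Predicate = Tuple[str, str]  # (attribute, value), equality predicates only


class InvertedIndex:
    """Maps each predicate to the ids of the subscriptions that contain it."""

    def __init__(self) -> None:
        self.postings: Dict[Predicate, Set[str]] = defaultdict(set)
        self.sizes: Dict[str, int] = {}  # subscription id -> number of predicates

    def add(self, sub_id: str, predicates: Dict[str, str]) -> None:
        self.sizes[sub_id] = len(predicates)
        for pred in predicates.items():
            self.postings[pred].add(sub_id)

    def match_counts(self, publication: Dict[str, str]) -> Dict[str, int]:
        """Number of satisfied predicates per subscription (partial matching)."""
        counts: Dict[str, int] = defaultdict(int)
        for pred in publication.items():
            for sub_id in self.postings.get(pred, ()):
                counts[sub_id] += 1
        return dict(counts)

    def full_matches(self, publication: Dict[str, str]) -> List[str]:
        """Subscriptions whose predicates are all satisfied (Boolean matching)."""
        counts = self.match_counts(publication)
        return sorted(s for s, c in counts.items() if c == self.sizes[s])


index = InvertedIndex()
index.add("s1", {"Item": "Smartphone"})
index.add("s2", {"Item": "Smartphone", "Carrier": "AT&T"})

pub = {"Item": "Smartphone", "Carrier": "Verizon"}
counts = index.match_counts(pub)
matched = index.full_matches(pub)
```

Only the posting lists for the publication's own attribute-value pairs are visited, so the cost depends on the publication size and list lengths rather than on the total number of subscriptions.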

61. To Do List: Evaluation. Multiple directions. Zipf property: using synthetic and real data sets (e.g. Zipf distribution tool, eBay, AOL query logs). Algorithm efficiency: experiment with the volume of subscriptions, the variety of publications, and the arrival rate of publications (e.g. dynamic sliding window model), using the POIKILO evaluation tool. Dual-indexing performance and scalability: experiment with index construction time at each window, memory cost, and query processing time (e.g. neighborhood calculation).

62. Thank You! Your review will be Golden! Welcome to read the design chapters!
