Offsite presentation original

Soheila Dehghanzadeh

Feb 23, 2017
Transcript
Page 1: Offsite presentation original

Soheila Dehghanzadeh

Page 2: Offsite presentation original

Agenda

• Introduction to Trade-offs in Integration Systems

•Requirements and Research Questions

•Contributions

•Conclusions and Future Work

Page 3: Offsite presentation original

Introduction

•What is data integration?

• “Combining data from different distributed sources” [1].

•Why is it important?

• Most queries require integrating data from various sources.

•Why is it challenging?

• Sources are autonomous and distributed.

• Distributing a query among sources to provide the response has performance, scalability, and availability problems.

• Caching solves the above problems but leads to inconsistencies.

• Maintaining the cache increases latency.

[1] https://en.wikipedia.org/wiki/Data_integration

Page 4: Offsite presentation original

The latency/consistency trade-off

[Figure: the latency/consistency trade-off plot — axes: consistency (low → high) vs. latency (low → high). The ideal case sits at low latency and high consistency; data warehouses at low latency but low consistency; mediator systems at high consistency but high latency.]

Page 5: Offsite presentation original

Data integration

•Data integration approaches

• Data warehouse (DW)
  • Low latency
  • Low consistency

[Figure: the trade-off plot with the data warehouse placed at low latency, low consistency.]

Page 6: Offsite presentation original

Data warehouse

Low latency, low consistency

Page 7: Offsite presentation original

Data Market: Lowest latency with a consistency threshold

Minimize cost (financial and latency) as long as consistency stays above a threshold.

Client: “Find me emails of ‘The North Face’ customers.”
Data market: “My existing data can provide you a response with 60% freshness.”
Client: “No, I want the fastest response with at least 80% freshness.”
Data market: “To provide 80% freshness you need to wait 30 sec and pay $60.”
Client: “Ok.”
Data market: “Here is the response.”
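The negotiation above amounts to constrained optimization: pick the cheapest way to answer, subject to a freshness floor. A minimal Python sketch of that selection, with hypothetical plan data and an assumed linear cost model (not the thesis implementation):

```python
# Hypothetical sketch of the data market's plan selection: among candidate
# answering plans, pick the cheapest (in combined latency + monetary cost)
# whose estimated freshness meets the client's threshold.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    freshness: float   # estimated fraction of up-to-date tuples, 0..1
    latency_s: float   # seconds to deliver the response
    price: float       # monetary cost in dollars

def cheapest_plan(plans, min_freshness, latency_weight=1.0):
    """Return the admissible plan with the lowest combined cost."""
    admissible = [p for p in plans if p.freshness >= min_freshness]
    if not admissible:
        return None  # no plan can reach the requested freshness
    return min(admissible, key=lambda p: latency_weight * p.latency_s + p.price)

plans = [
    Plan("cached-only", freshness=0.60, latency_s=0.5, price=0.0),
    Plan("partial-refresh", freshness=0.80, latency_s=30.0, price=60.0),
    Plan("full-refresh", freshness=1.00, latency_s=120.0, price=200.0),
]
print(cheapest_plan(plans, min_freshness=0.80).name)  # partial-refresh
```

With a 60% floor the free cached answer wins; raising the floor to 80% forces the 30-second, $60 plan, mirroring the dialogue on the slide.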

Page 8: Offsite presentation original

Research Question 1

How to optimally maintain data when consistency is constrained and latency must be minimized?

Page 9: Offsite presentation original

Summary of contribution 1

•A method to estimate the response freshness using the existing data (JIST2014, ISWC2014).

• Extend summarization techniques to trace freshness.
  • Indexing, histograms, and QTree
• Use the summary to estimate the response freshness.

•Evaluation

• We estimated the freshness of a query response with a 6% error rate.

•Future work

• Use more advanced summarizations to lower the error rate.

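The summary-based estimation idea can be sketched as a small histogram that tracks fresh vs. total cached triples per key bucket and aggregates the buckets a query touches. The class name, bucket layout, and API below are illustrative assumptions, not the JIST2014/ISWC2014 structures:

```python
# Illustrative sketch (not the thesis code): a one-dimensional histogram
# summary over source keys. Each bucket counts cached triples and how many
# of them are still fresh; query freshness is estimated from the buckets
# the query's key range touches.
class FreshnessHistogram:
    def __init__(self, key_space, n_buckets):
        self.width = key_space / n_buckets
        self.fresh = [0] * n_buckets
        self.total = [0] * n_buckets

    def _bucket(self, key):
        return min(int(key / self.width), len(self.fresh) - 1)

    def add(self, key, is_fresh=True):
        b = self._bucket(key)
        self.total[b] += 1
        self.fresh[b] += 1 if is_fresh else 0

    def invalidate(self, key):
        """A source update makes one cached triple in this bucket stale."""
        b = self._bucket(key)
        if self.fresh[b] > 0:
            self.fresh[b] -= 1

    def estimate(self, lo, hi):
        """Estimated freshness of a query touching keys in [lo, hi)."""
        bs = range(self._bucket(lo), self._bucket(hi - 1) + 1)
        total = sum(self.total[b] for b in bs)
        return sum(self.fresh[b] for b in bs) / total if total else 1.0
```

The estimate is only as good as the bucket granularity, which matches the slide's future-work note: finer summarizations (e.g., a QTree over multidimensional keys) lower the error rate.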

Page 10: Offsite presentation original

Data integration

•Data integration approaches

• Data warehouse (DW)
  • Low latency
  • Low consistency

• Mediator systems (MS)
  • High latency
  • High consistency

[Figure: the trade-off plot with the data warehouse at low latency/low consistency and mediator systems at high latency/high consistency.]

Page 11: Offsite presentation original

Mediator System

High latency, high consistency

Page 12: Offsite presentation original

Mediator system: Highest consistency with a latency threshold

[Figure: a Join operator combines the RDF Stream Generator's stream with background data exposed through a SPARQL endpoint.]

Page 13: Offsite presentation original

Mediator system: Highest consistency with a latency threshold

[Figure: the same architecture with a Local View inserted between the Join operator and the background data (SPARQL endpoint).]

Page 14: Offsite presentation original

Mediator system: Highest consistency with a latency threshold

[Figure: the Join now reads background data through the Local View, which a Maintenance Process keeps up to date. Freshness decreases over time, and refreshing exposes a cost/quality trade-off.]

Page 15: Offsite presentation original

Research Question 2

How to optimally maintain data when latency is constrained and consistency must be maximized?

Page 16: Offsite presentation original

Summary of contribution 2

• A maintenance process to maximize consistency with respect to a latency constraint (WWW2015, ICWE2015).
  • Query driven: maintain cache entries that are involved in the current evaluation
  • Freshness driven: maintain cache entries that
    • are stale
    • change less frequently
    • affect future evaluations

• Evaluation
  • The proposed approach outperforms a set of baseline policies.

• This work has already been followed up:
  • Queries with FILTER clauses (ICWE2016)
  • Queries with complex join patterns (ISWC2016)
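The freshness-driven idea above can be sketched as a ranking problem: with a fixed refresh budget per window, prefer stale entries that will be used often and, once refreshed, stay fresh longer. The scoring formula below is an illustrative assumption, not the WWW2015/ICWE2015 formulation:

```python
# Hedged sketch of freshness-driven maintenance: rank stale cache entries
# by expected future use divided by change rate, then refresh the top ones
# within the per-window budget.
def rank_for_refresh(entries, budget):
    """entries: list of dicts with keys 'stale', 'change_rate',
    'future_hits' (expected uses in upcoming windows).
    Returns the entries chosen for refresh this window."""
    candidates = [e for e in entries if e["stale"]]
    # Prefer entries used often whose refresh pays off for more windows
    # (a low change rate means the refreshed value stays fresh longer).
    candidates.sort(key=lambda e: e["future_hits"] / (1.0 + e["change_rate"]),
                    reverse=True)
    return candidates[:budget]

entries = [
    {"id": "a", "stale": True,  "change_rate": 0.1, "future_hits": 5},
    {"id": "b", "stale": True,  "change_rate": 2.0, "future_hits": 5},
    {"id": "c", "stale": False, "change_rate": 0.1, "future_hits": 9},
]
print([e["id"] for e in rank_for_refresh(entries, budget=1)])  # ['a']
```

Entry "c" is skipped because it is already fresh; between the stale entries, "a" wins since its slow change rate makes the refresh last longer.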

Page 17: Offsite presentation original

Data integration

•Data integration approaches

• Data warehouse (DW)
  • Low latency
  • Low consistency

• Mediator systems (MS)
  • High latency
  • High consistency

•Integration in a real system

[Figure: the trade-off plot with the data warehouse at low latency/low consistency and mediator systems at high latency/high consistency.]

Page 18: Offsite presentation original

Contributing the proposed policies to CSPARQL

• So far we assumed all the data required to provide the response exists in the local cache but needs to be maintained.

• What if the required data does not fit in the local cache?

[Figure: cache entries flow between a SERVICE provider and the local cache.]

Page 19: Offsite presentation original

Research Question 3

How to take the space constraint into account while optimizing data integration with regard to latency or consistency constraints?

Page 20: Offsite presentation original

Summary of contribution 3

• An extension of the maintenance policy (contribution 2) to take into account both latency and space constraints.
  • Fetching policies to cope with cache incompleteness
  • A freshness-based cache replacement policy
  • An implementation in CSPARQL

• Evaluation
  • The proposed replacement policy outperforms state-of-the-art replacement policies.

• Future work
  • Investigating more complex queries (e.g., with multiple SERVICE clauses, complex join patterns)
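A freshness-based replacement policy differs from LRU in what it evicts: not the least recently used entry, but the one whose cached copy is most likely already stale. The sketch below is an assumption for illustration, not the thesis implementation:

```python
# Illustrative freshness-based replacement: when the cache is full, evict
# the entry with the highest expected staleness, i.e. change rate times
# the time elapsed since its last refresh.
import time

class FreshnessCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}  # key -> (value, change_rate, refreshed_at)

    def _staleness(self, key, now):
        _, change_rate, refreshed_at = self.data[key]
        # Expected number of source updates since the last refresh.
        return change_rate * (now - refreshed_at)

    def put(self, key, value, change_rate, now=None):
        now = time.monotonic() if now is None else now
        if key not in self.data and len(self.data) >= self.capacity:
            victim = max(self.data, key=lambda k: self._staleness(k, now))
            del self.data[victim]
        self.data[key] = (value, change_rate, now)

    def get(self, key):
        entry = self.data.get(key)
        return entry[0] if entry else None
```

Under this policy, a rarely used but slowly changing entry can outlive a popular but volatile one, which is exactly the behavior that pays off when stale answers are the cost being minimized.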

Page 21: Offsite presentation original

Conclusions

• An ideal integration engine (low latency and high consistency) is not possible because these two dimensions are in a trade-off.

• Contributions:
  • Optimizing response latency with a consistency threshold, studied in the context of the Data Marketplace.
  • A maintenance policy to optimize response consistency with a latency threshold, in the context of knowledge-based event processing.
  • Introduction of space constraints to integrate my approach in CSPARQL.

• Integration must be optimized according to application requirements to tune the consistency/latency trade-off.

[Figure: the trade-off plot summarizing all positions — the ideal case, data warehouses (low latency/low consistency), and mediator systems (high latency/high consistency).]

Page 22: Offsite presentation original

Slide 22

Page 23: Offsite presentation original

[Figure: two mirrored architectures integrating a data stream with data sources through a cache whose freshness decreases over time. Left: the maintenance process refreshes the cache based on the latency constraint of the query (critical latency). Right: it refreshes based on the consistency constraint of the query (critical consistency).]

1. Maintaining the cache based on the latency constraint of the query (Event Detection)
2. Maintaining the cache based on the consistency constraint of the query (Data Market)

[email protected] — Unit for Reasoning and Querying

Page 24: Offsite presentation original

Mediator system: Highest consistency with a latency threshold

Query: find Twitter users that have been mentioned more than 5 times in the last minute and are followed by more than 1000 users.

[Figure: a stream processor consumes the Twitter mention stream ("#X is super hero", "#X won the gold medal", "#Y broke the world record", "Well done to #Z, #Y, #X", …) and joins it with the Twitter Follower API, whose counts change over time (#X: 1007, later 998 followers; #Y: 2000 followers; #Z: 500, later 600 followers). The windowed result:

User  Mentioned  Followed by
#X    7          1007
#Y    6          2000
]
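The slide's example boils down to a windowed aggregate joined with cached background data. A minimal Python sketch (an assumed simplification of the mediator setting, using the slide's thresholds and data):

```python
# Count mentions per user over the current window, join with cached
# follower counts, and keep users with > 5 mentions and > 1000 followers.
from collections import Counter

def windowed_result(mentions, followers, min_mentions=5, min_followers=1000):
    """mentions: list of user tags seen in the current window.
    followers: dict user -> cached follower count (may be stale)."""
    counts = Counter(mentions)
    return {
        user: (n, followers.get(user, 0))
        for user, n in counts.items()
        if n > min_mentions and followers.get(user, 0) > min_followers
    }

window = ["#X"] * 7 + ["#Y"] * 6 + ["#Z"] * 5
followers = {"#X": 1007, "#Y": 2000, "#Z": 500}
print(windowed_result(window, followers))
# {'#X': (7, 1007), '#Y': (6, 2000)}
```

Note the consistency hazard the deck is about: if #X's cached count had already dropped to 998 when the window fired, #X would wrongly stay in (or fall out of) the result depending on when the cache was last refreshed.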

Page 25: Offsite presentation original

Contributing the proposed policies to CSPARQL

Requirements:
• A local cache R
• Fetch SERVICE results from R
• Maintain R
• ESPER external time

The modified engine is available on GitHub.

[Figure: timestamped entries flow between the SERVICE provider and the local cache.]

Page 26: Offsite presentation original

Workloads with significant improvements under the proposed policy

• We hypothesize that WSJ-WBM is more beneficial if:
  • Hypothesis 1: the BKG data changes more slowly
  • Hypothesis 2: the BKG data changes with more diversity in change rate
  • Hypothesis 3: there is a negative correlation between the streaming rate and the change rate
  • Hypothesis 4: the total number of possible events (i.e., the caching space) is larger

• The time overhead of WSJ-WBM is negligible.

Page 27: Offsite presentation original

Experiment setup

• A data generator to produce various workloads with:
  • Various change-rate distributions within an interval (random or normal distribution)
  • Various streaming rates: element arrivals follow a Poisson process with various rate parameters λ
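The streaming-rate knob can be sketched in a few lines: in a Poisson arrival process with rate λ, inter-arrival gaps are exponentially distributed with mean 1/λ. The function below is an illustrative stand-in for the generator, not the actual experiment code:

```python
# Generate event timestamps for a Poisson arrival process with rate lam
# (events per second): inter-arrival gaps ~ Exponential(lam).
import random

def arrival_times(n, lam, seed=None):
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(lam)  # mean gap = 1/lam seconds
        times.append(t)
    return times

# Higher lambda -> denser stream over the same number of events.
fast = arrival_times(1000, lam=10.0, seed=42)
slow = arrival_times(1000, lam=1.0, seed=42)
```

Varying λ per workload, and correlating it (positively or negatively) with the background change rates, is what lets the experiments probe Hypothesis 3.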

Page 28: Offsite presentation original

Hypothesis 1: the BKG data changes more slowly.

Page 29: Offsite presentation original

Hypothesis 2: the BKG data changes with more diversity in change rate.

Page 30: Offsite presentation original

Hypothesis 3: a negative correlation between the streaming rate and the change rate.

Page 31: Offsite presentation original

Hypothesis 4: the total number of possible events (i.e., the caching space) is larger.

Page 32: Offsite presentation original

The time overhead of WSJ-WBM is negligible.

[Figure: time-overhead comparison for local vs. remote background data.]

Page 33: Offsite presentation original

Combining RDF Streams and Remotely Stored Background Data

• We move to an approximate setting: we introduce a local view that stores part of the data involved in query processing, and we update part of it to capture the dynamicity.

Page 34: Offsite presentation original

A query-driven maintenance process

• Query form: SELECT * WHERE { WINDOW(S, ω, β) P_W . SERVICE(BKG) P_S }

[Figure: the query-driven maintenance pipeline over the Local View, connecting the WINDOW clause, the JOIN, a Proposer, a Ranker, and a Maintainer with the SERVICE clause in four numbered steps. Candidate policies named in the figure include RND, LRU, WBM, WSJ, GNR, and FRP.]
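The proposer/ranker/maintainer decomposition named on the slide can be sketched as one loop per window. The concrete proposer and ranker below (join-relevant entries, ranked by staleness) are illustrative stand-ins, not the WSJ/WBM policies themselves:

```python
# Hedged sketch of one maintenance round over the Local View.
def maintain(local_view, window_keys, budget, fetch):
    """local_view: dict key -> {'value': ..., 'staleness': float}
    window_keys: keys appearing in the current window (join candidates)
    budget: max number of remote fetches this window
    fetch: callable key -> fresh value from the SERVICE endpoint"""
    # Proposer: only entries that can join with the current window matter.
    proposed = [k for k in window_keys if k in local_view]
    # Ranker: refresh the stalest proposed entries first.
    ranked = sorted(proposed, key=lambda k: local_view[k]["staleness"],
                    reverse=True)
    # Maintainer: spend the fetch budget on the top-ranked entries.
    for k in ranked[:budget]:
        local_view[k] = {"value": fetch(k), "staleness": 0.0}
    return ranked[:budget]
```

Separating the three roles is the design point: swapping the ranker (LRU vs. a freshness-based score) changes the policy without touching the join or the SERVICE plumbing.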

Page 35: Offsite presentation original

Evaluation
