Agenda
1. SIGMOD 2010 Paper: Optimizing Content Freshness of Relations Extracted From the Web Using Keyword Search
2. Cloud-based Parallel DBMS
3. Mining Workflows for Data Integration Patterns
4. Energy Efficient Complex Event Processing
Lipyeow Lim, 9/9/2010

Optimizing Content Freshness of Relations Extracted From the Web Using Keyword Search
Mohan Yang (Shanghai Jiao Tong University), Haixun Wang (Microsoft Research Asia), Lipyeow Lim (UHM), Min Wang (HP Labs China)

Motivating Application
Management at a prominent research institute wanted to analyze the impact of the publications of its researchers ...
The Linking Open Data effort might address this issue ...

But ...
Such APIs don’t exist (yet?)
And ...
I need those citation counts by next week!

Problem Statement
Local database periodically synchronizes its data subset with the data source
Data source supports keyword query API only
Extract relations from the top k results (i.e., first few result pages) to update local database
At each synchronization, find a set of queries that will maximize the “content freshness” of the local database.
◦ Only relevant keywords are used in the queries
◦ Keywords cover the local relation
◦ Number of queries should be minimized
◦ Result size should be minimized
NP-hard by reduction from Set Cover

Picking the Right Queries ...
The simple algorithm is fine, we just need to pick the right queries ...
◦ Not all tuples are equal: some don’t get updated at all, some are updated all the time
◦ Some updates are too small to be significant

Loop
    Q = set of keyword queries
    Foreach q in Q
        Send q to Google Scholar
        Scrape the first few pages into tuples
        Update local relation using scraped tuples
    Sleep for t seconds
End Loop
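
One pass of the simple loop above could look like the following minimal sketch. The slides name no API, so `search` is a hypothetical callable standing in for "send q and scrape the first few pages", and `fake_search` with its data is invented purely for illustration. The outer loop and the sleep are left out so the example terminates.

```python
from typing import Callable, Dict, Iterable, List


def sync_once(
    queries: Iterable[str],
    search: Callable[[str], List[dict]],  # hypothetical: query -> scraped tuples
    local: Dict[str, dict],               # local relation, keyed by tuple id
    key: str = "id",
) -> int:
    """Run one pass over the queries; return how many local tuples changed."""
    changed = 0
    for q in queries:
        for scraped in search(q):
            k = scraped[key]
            # Refresh only tuples the local relation already tracks.
            if k in local and local[k] != scraped:
                local[k] = scraped
                changed += 1
    return changed


def fake_search(q: str) -> List[dict]:
    """Stand-in for the remote source (made-up data, no network access)."""
    server = [
        {"id": "p1", "title": "keyword search freshness", "citations": 12},
        {"id": "p2", "title": "parallel dbms", "citations": 7},
    ]
    return [t for t in server if q in t["title"]]


local_db = {"p1": {"id": "p1", "title": "keyword search freshness", "citations": 10}}
n_changed = sync_once(["freshness"], fake_search, local_db)
```

Passing the search function as a parameter keeps the sketch independent of any particular source and makes it easy to test offline.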

Greedy Probes Algorithm
What should the greedy heuristic do?
◦ Local coverage: a good query will get results to update as much of the local relation as possible
◦ Server coverage: a good query should retrieve as few results from the server as possible
◦ A good query updates the most critical portion of the local relation to maximize “content freshness”

Q = empty set of queries
NotCovered = set L of local tuples
While not stopping condition do
    K = Find all keywords associated with NotCovered
    Pick q from PowerSet(K) using heuristic equation
    Add q to Q
    Remove tuples associated with q from NotCovered
End While

Stopping condition: could be based on the size of Q or the coverage of L
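
A minimal sketch of the greedy loop above, under stated assumptions: a query is a set of keywords, a local tuple counts as "associated with q" when it contains every keyword of q, the power set is capped at `max_query_size` keywords, and the heuristic is passed in as a `score` function. All of these names are illustrative and do not come from the paper.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Set

Query = FrozenSet[str]


def greedy_probes(
    local: Dict[str, Set[str]],                 # tuple id -> keywords of that tuple
    score: Callable[[Query, Set[str]], float],  # heuristic: (query, covered ids) -> value
    max_queries: int,
    max_query_size: int = 2,                    # assumed cap on the power set
) -> List[Query]:
    queries: List[Query] = []
    not_covered = set(local)
    # Stopping condition: query budget used up or every tuple covered.
    while not_covered and len(queries) < max_queries:
        keywords = sorted(set().union(*(local[t] for t in not_covered)))
        best, best_cov, best_val = None, set(), float("-inf")
        for size in range(1, max_query_size + 1):
            for combo in combinations(keywords, size):
                q = frozenset(combo)
                covered = {t for t in not_covered if q <= local[t]}
                if covered and score(q, covered) > best_val:
                    best, best_cov, best_val = q, covered, score(q, covered)
        if best is None:  # remaining tuples have no keywords to query on
            break
        queries.append(best)
        not_covered -= best_cov
    return queries


local = {
    "t1": {"web", "search"},
    "t2": {"web", "freshness"},
    "t3": {"cloud", "dbms"},
}
# Simplest possible heuristic: number of still-uncovered tuples a query reaches.
plan = greedy_probes(local, lambda q, covered: len(covered), max_queries=5)
```

With this toy input the first pick is the query on "web" (two tuples), followed by a single-keyword query that reaches the remaining tuple.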

Content Freshness
Weighted tuple dissimilarity
◦ Some tuples are more important to update
◦ Weighted tuple dissimilarity = w(local) * d(local, server)

Content Freshness (take 2)
Estimate the server value of citation using an update model based on
◦ Current local value of citation
◦ Volatility of the particular citation field
◦ Time elapsed since last sync
$$D(L, S) = \frac{1}{|L|} \sum_{l \in L} w(l) \, F(l, t)$$
F(l,t) estimates the dissimilarity between the local tuple and the server tuple at time t assuming an update model
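
The estimate D(L, S) can be illustrated with a small sketch. The slides do not give the form of F(l, t); the version below assumes a Poisson update process and takes F to be the probability that the server tuple changed at least once since the last sync. That choice, and every name in the code, is an assumption made for illustration only.

```python
import math
from typing import Dict, Tuple


def stale_probability(rate: float, elapsed: float) -> float:
    """Assumed F(l, t): P(at least one update in `elapsed`) for a Poisson process."""
    return 1.0 - math.exp(-rate * elapsed)


def estimated_dissimilarity(tuples: Dict[str, Tuple[float, float, float]]) -> float:
    """D(L, S) = (1/|L|) * sum over l in L of w(l) * F(l, t).

    Each value is (weight, update_rate, time_since_last_sync).
    """
    if not tuples:
        return 0.0
    total = sum(w * stale_probability(rate, t) for w, rate, t in tuples.values())
    return total / len(tuples)


local = {
    "p1": (1.0, 0.0, 30.0),         # never updated, contributes 0
    "p2": (2.0, 0.1, 0.0),          # just synced, contributes 0
    "p3": (1.0, math.log(2), 1.0),  # rate * t = ln 2, so F = 0.5
}
d = estimated_dissimilarity(local)
```

Only the third tuple contributes here, so d is 0.5 divided by the three local tuples.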

Greedy Heuristic
Query efficiency
$$q^{*} = \arg\max_{q \in P(K)} \frac{|\mathrm{LocalCoverage}(q)|}{|\mathrm{ServerCoverage}(q)|}$$
To give higher priority to “unfresh” tuples, we weight the local coverage with the freshness
$$q^{*} = \arg\max_{q \in P(K)} \frac{\sum_{l \in \mathrm{LocalCoverage}(q)} w(l) \, F(l, t)}{|\mathrm{ServerCoverage}(q)|}$$
Catch: local DB does not know server coverage!
◦ Estimate server coverage using statistical methods
◦ Estimate server coverage using another sample data source (e.g., DBLP)
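
The two scoring rules on this slide can be compared on a toy example. In this sketch the server coverage sizes are simply given as numbers, standing in for whatever estimate is used; the candidate queries, weights, and staleness values are made up for illustration.

```python
from typing import Dict, Set


def query_efficiency(local_cov: Set[str], server_cov_size: int) -> float:
    """|LocalCoverage(q)| / |ServerCoverage(q)|; server_cov_size must be positive."""
    return len(local_cov) / server_cov_size


def weighted_efficiency(
    local_cov: Set[str],
    server_cov_size: int,
    weight: Dict[str, float],     # w(l)
    staleness: Dict[str, float],  # F(l, t), estimated elsewhere
) -> float:
    """Sum of w(l) * F(l, t) over LocalCoverage(q), divided by |ServerCoverage(q)|."""
    return sum(weight[l] * staleness[l] for l in local_cov) / server_cov_size


# Hypothetical candidates: query -> local tuples it covers, plus estimated server sizes.
candidates = {"web": {"t1", "t2"}, "cloud": {"t3"}}
server_size = {"web": 100, "cloud": 10}
weight = {"t1": 1.0, "t2": 1.0, "t3": 1.0}
staleness = {"t1": 0.1, "t2": 0.1, "t3": 0.5}

best = max(
    candidates,
    key=lambda q: weighted_efficiency(candidates[q], server_size[q], weight, staleness),
)
```

The "cloud" query wins under both rules in this example because it returns far fewer server results; the freshness weighting widens the gap since its one local tuple is the most likely to be stale.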

Experiments
Data sets:
◦ Synthetic data
◦ Paper citations (this presentation)
◦ DVD online store
Approximate Powerset(K) with all keyword pairs
Result Extraction
◦ Method 1: scan through all result pages
◦ Method 2: scan only the first result page

Content Freshness
◦ Synthetic citation data based on known statistics
◦ A Poisson-based update model is used to estimate freshness
◦ 10 queries are sent at each sync
◦ Naive 1 & 2 send simple ID-based queries

Optimizations
K can be large: approximate K using the most frequent k keywords
Power Set is exponential: approximate the power set using subsets up to size m = 2 or 3

Q = empty set of queries
NotCovered = set L of local tuples
While not stopping condition do
    K = Find all keywords associated with NotCovered
    Pick q from PowerSet(K) using heuristic equation
    Add q to Q
    Remove tuples associated with q from NotCovered
End While
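
Both approximations can be sketched in a few lines: keep the k most frequent keywords, then enumerate subsets of size at most m. The function name and the alphabetical tie-breaking are choices made for this illustration, not details from the slides.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def candidate_queries(
    tuple_keywords: Dict[str, Set[str]],  # tuple id -> keywords of that tuple
    top_k: int,                           # size of the approximated K
    max_size: int,                        # m: largest subset size to enumerate
) -> List[FrozenSet[str]]:
    counts = Counter(kw for kws in tuple_keywords.values() for kw in kws)
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    frequent = [kw for kw, _ in ranked[:top_k]]
    return [
        frozenset(combo)
        for size in range(1, max_size + 1)
        for combo in combinations(frequent, size)
    ]


tuple_keywords = {
    "t1": {"web", "search"},
    "t2": {"web", "freshness"},
    "t3": {"web", "search", "cloud"},
}
cands = candidate_queries(tuple_keywords, top_k=3, max_size=2)
```

With k keywords and m = 2 the candidate set has k + k(k-1)/2 queries instead of 2^k, which is what makes the inner search of the greedy loop tractable.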

Coverage Ratio
Coverage ratio is the fraction of local tuples covered by a set of queries
Result Extraction
◦ Method 1: scan through all result pages
◦ Method 2: scan only the first result page
Top 5 frequent keywords from title & Method 1
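
As a rough illustration of the coverage ratio, the sketch below assumes a local tuple is covered when it contains every keyword of at least one query. In the experiments coverage depends on what the server actually returns under Method 1 or Method 2, so this keyword-containment test is a simplifying assumption.

```python
from typing import Dict, FrozenSet, Iterable, Set


def coverage_ratio(
    tuple_keywords: Dict[str, Set[str]],  # tuple id -> keywords of that tuple
    queries: Iterable[FrozenSet[str]],
) -> float:
    """Fraction of local tuples matched by at least one query."""
    if not tuple_keywords:
        return 0.0
    qs = list(queries)
    covered = sum(
        1 for kws in tuple_keywords.values() if any(q <= kws for q in qs)
    )
    return covered / len(tuple_keywords)


tuples = {
    "t1": {"web", "search"},
    "t2": {"web", "freshness"},
    "t3": {"cloud", "dbms"},
}
ratio = coverage_ratio(tuples, [frozenset({"web"})])
```

Here one query on "web" covers two of the three local tuples.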

Conclusion
Introduced the problem of maintaining a local relation extracted from a web source via keyword queries
Problem is NP-hard, so we designed a greedy heuristic-based algorithm
Tried one heuristic; results show some potential
Still room for more work: journal paper
Cloud-based Parallel DBMSs
Cloud Computing
Shared-Nothing Parallel Database Architecture
[Figure: four physical servers, each with its own CPU, memory, and disk, connected by a network]

Logical Parallel DBMS Architecture
[Figure: components shown are the network, a DBMS with the catalog DB, two nodes each with a parallel DB layer over a DBMS holding data fragments, and the query and results flow]

Horizontal Fragmentation: Range Partition
sid  sname  rating  age