A(k)-index : Exploiting Local Similarity to Index Paths in Graph Data Raghav Kaushik (UW) Pradeep Shenoy (UWash) Philip Bohannon (Bell Labs) Ehud Gudes (BGU)

Jan 21, 2016


A(k)-index: Exploiting Local Similarity to Index Paths in Graph Data

Raghav Kaushik (UW), Pradeep Shenoy (UWash), Philip Bohannon (Bell Labs), Ehud Gudes (BGU)

Outline

Problem statement
Prior work and limitations
Background
A(k)-index
Query Evaluation
Preliminary experiments
Update
Conclusions

Data Model

Rooted, node-labeled graph with unique root; root has unique label

Nodes - objects
Arcs - object-subobject relationship
In XML context:
  Index tag structure
  No distinction between elements and attributes
  No distinction between tree and idref arcs
  Order ignored

Problem Statement

Practical indexing schemes for large graph data (like XML data) (100K - 1M nodes)
  Size ~10% of database size
  Efficient construction and update
  Tunable to a workload

Queries of the form R x, where R is a regular path expression

Schemaless data

Flavor of Approach

Different from traditional value indices

Structural summaries for indexing paths

Both data and index are rooted graphs

Example: Dataguide

Index Graph

Structural summary

Associate a set of data nodes with each index node, called its extent

Preserve data paths in index graph

Example index graph

[Figure: a data graph with nodes 0 to 6 beside its index graph, whose index nodes have extents {0}, {1}, {2}, {3,4} and {5,6}.]

Index Graph (cont’d)

Can be constructed from any partition

Node for every equivalence class C

Edge from C to C’ if there exists an edge v -> v’ with v in C and v’ in C’

Preserves data paths, no false drops

Our structures are all index graphs
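
The construction above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the dict-based partition encoding, and the 7-node example graph (including its edges) are assumptions, not taken from the talk.

```python
from collections import defaultdict

def build_index_graph(edges, partition):
    """Build an index graph from a partition of the data nodes.

    edges: iterable of (v, w) data-graph edges.
    partition: dict mapping each data node to its equivalence-class id.
    Returns (extents, index_edges).
    """
    # One index node per equivalence class; its extent is the class itself.
    extents = defaultdict(set)
    for node, cls in partition.items():
        extents[cls].add(node)
    # Edge C -> C' whenever some data edge v -> v' has v in C and v' in C'.
    index_edges = {(partition[v], partition[w]) for v, w in edges}
    return dict(extents), index_edges

# Hypothetical data graph; the edges are an assumption for illustration.
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
partition = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
extents, index_edges = build_index_graph(edges, partition)
print(sorted(extents["c"]))   # [3, 4]
print(sorted(index_edges))
```

Because every data edge is mapped to an index edge, every data path has a corresponding index path, which is why no answer is missed whatever partition is chosen.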

Prior Schemes

Dataguide [Goldman, Widom 1997]
  Deterministic automaton corresponding to data graph
  Each set of data nodes that can be distinguished by a path query is summarized by a single node in the index
  Can be exponential in size!

Prior Schemes (cont’d)

1-index [Milo, Suciu 1999]
  NFA rather than DFA (smaller)
  Split graph nodes into equivalence classes based on incoming paths from the root
  Computing best split is PSPACE-complete
  Go for refinements (approximations):
    similarity
    bisimilarity

Limitations of Prior Work

Size
  Dataguide sizes subject to exponential blow-up
  1-index size can be big too!
Update
  No known update algorithm for 1-index
Designed to answer queries involving arbitrarily complex paths, but...
  such paths may never show up in queries

Local Similarity

[Figure: example data graph rooted at ROOT, with node labels metro, cultural, business, neighborhoods, museum, museum, hotel, nearby, nhd., attr., cult.]

Main Contributions

New family of approximate index structures

Applicable to Approximate Schema Statistics

Query evaluation using approximate indexes

Preliminary performance study

Update algorithms

Approximate Indexes

Motivation:
  Smaller
  More efficient query processing
  Limited update cost - maintain local information
Approximate dataguide [Goldman et al.]
  path merging, object matching, etc.
  no formal basis (but different goal)
  no study of effect on query processing

Outline

Problem statement
Prior work and limitations
Background
A(k)-index
Query Evaluation
Preliminary experiments
Update
Conclusions

Graph Bisimulation

A bisimulation is a symmetric relation R between nodes

If A1 R A2 then A1 and A2 have the same labels and ...

Graph Bisimulation (cont’d)

[Figure: nodes A1 and A2 related by R; for each edge (B1, A1) there is an edge (B2, A2) with B1 R B2, and vice-versa!]

Bisimilarity

Two nodes a and b are bisimilar if they are related in some bisimulation

1-index is index graph constructed from bisimulation partition

Simulation partition: similar

Bisimulation on example

[Figure: the example data graph rooted at ROOT (labels metro, cultural, business, neighborhoods, museum, museum, hotel, nearby, nhd., attr., cult.) with its bisimulation partition marked.]

k-bisimulation

Nodes A1 and A2 are 0-bisimilar iff they have the same label

A1 and A2 are k-bisimilar iff they are (k-1)-bisimilar and
  for every edge (B1, A1) there exists an edge (B2, A2) such that B1 and B2 are (k-1)-bisimilar, and vice versa
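
As a rough sketch of this recursive definition, the partition can be computed by k rounds of splitting, where each round groups nodes by their previous block together with the set of their parents' previous blocks. The function names, the graph encoding, and the example labels and edges below are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

def k_bisimulation(labels, edges, k):
    """Return a dict node -> block id for the k-bisimulation partition.

    labels: dict node -> label; edges: iterable of (parent, child) pairs.
    """
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
    # 0-bisimilar: same label, so the label itself serves as the block id.
    block = dict(labels)
    for _ in range(k):
        # Signature: previous block plus the set of the parents' previous blocks.
        sig = {v: (block[v], frozenset(block[p] for p in parents[v]))
               for v in labels}
        ids = {}
        refined = {v: ids.setdefault(sig[v], len(ids)) for v in labels}
        if len(ids) == len(set(block.values())):
            break  # nothing was split, so further rounds change nothing
        block = refined
    return block

def as_blocks(block):
    """Group nodes by block id (helper for inspecting the result)."""
    groups = defaultdict(set)
    for v, b in block.items():
        groups[b].add(v)
    return {frozenset(g) for g in groups.values()}

# Hypothetical labeled graph; labels and edges are assumptions.
labels = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
for k in range(3):
    print(k, sorted(sorted(b) for b in as_blocks(k_bisimulation(labels, edges, k))))
```

On this example, nodes 3 and 4 share a block for k = 0 but are split for k = 1, while nodes 5 and 6 stay together until k = 2.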

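Example for k-bisimulation

[Figure: a data graph with nodes 0 to 6, shown with its 0-bisimulation index graph (extents {0}, {1}, {2}, {3,4}, {5,6}) and its 1-bisimulation index graph (extents {0}, {1}, {2}, {3}, {4}, {5,6}).]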

A(2) for example

[Figure: the example data graph rooted at ROOT (labels metro, cultural, business, neighborhoods, museum, museum, hotel, nearby, nhd., attr., cult.) with its A(2) partition marked.]

Properties

If a and b are bisimilar, the sets of incoming paths into them are the same

If a and b are k-similar or k-bisimilar, their sets of incoming paths of length <= k are the same

If k-bisim = (k+1)-bisim, then k-bisim = bisim

Size: never larger than the bisimulation index

Query Evaluation

Only queries studied are regular path queries of the form R x

Query Evaluation Approach:
  Create automaton for regexp query
  Run automaton on the index graph
  Result is union of extents belonging to index nodes accepted by automaton
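
A minimal sketch of this evaluation strategy follows. It assumes the query automaton is given as an NFA over node labels, that the automaton starts at the index root and consumes the label of each node it steps into, and that the index graph is stored in plain dicts; all names and the example data are hypothetical.

```python
from collections import deque

def evaluate_on_index(root, succ, label, extent, start, delta, accepting):
    """Run a query automaton over an index graph; union the accepted extents.

    succ: index node -> iterable of child index nodes
    label: index node -> label; extent: index node -> set of data nodes
    delta: (state, label) -> set of next states (an NFA over labels)
    """
    seen = {(root, start)}
    queue = deque(seen)
    result = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            result |= extent[node]          # whole extent is returned
        for child in succ.get(node, ()):
            for nxt in delta.get((state, label[child]), ()):
                if (child, nxt) not in seen:
                    seen.add((child, nxt))
                    queue.append((child, nxt))
    return result

# Hypothetical index graph in which nodes 3,4 and 5,6 have been merged.
root = "r"
succ = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"]}
label = {n: n for n in "rabcd"}
extent = {"r": {0}, "a": {1}, "b": {2}, "c": {3, 4}, "d": {5, 6}}
# Automaton for the path query a.c.d (states 0..3, state 3 accepting).
delta = {(0, "a"): {1}, (1, "c"): {2}, (2, "d"): {3}}
result = evaluate_on_index(root, succ, label, extent, 0, delta, {3})
print(sorted(result))   # [5, 6]
```

The traversal touches at most one (index node, state) pair per combination, so its cost is bounded by the index size times the automaton size rather than by the data size.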

Example Query Evaluation

[Figure: a query automaton graph beside an index graph with extents {0}, {1}, {2}, {3,4} and {5,6}.]

Approximate Indexes

Caveat: False positives possible
Approach: verify each node on data graph by running reverse automaton
  Prohibitive cost?
  Then why use approx. indices?
  In fact, frequently more efficient than data graph or precise index
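
The verification step might look like the sketch below, which walks backwards from a candidate data node along parent edges while running the automaton's transitions in reverse. It assumes the same conventions as a forward run that starts at the root and consumes the label of each node it enters; the function name, the NFA encoding, and the example graph are assumptions.

```python
def validate(candidate, root, parents, label, start, delta, accepting):
    """Check on the data graph that some root-to-candidate path is accepted."""
    # Reverse transitions: (next_state, label) -> set of previous states.
    rev = {}
    for (state, lab), targets in delta.items():
        for t in targets:
            rev.setdefault((t, lab), set()).add(state)
    stack = [(candidate, s) for s in accepting]
    seen = set(stack)
    while stack:
        node, state = stack.pop()
        if node == root and state == start:
            return True                      # reached the root in the start state
        # Undo the transition that consumed this node's label, then move up.
        for prev in rev.get((state, label[node]), ()):
            for p in parents.get(node, ()):
                if (p, prev) not in seen:
                    seen.add((p, prev))
                    stack.append((p, prev))
    return False

# Hypothetical data graph: only node 5 is reachable via the path a.c.d.
parents = {1: {0}, 2: {0}, 3: {1}, 4: {2}, 5: {3}, 6: {4}}
label = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
delta = {(0, "a"): {1}, (1, "c"): {2}, (2, "d"): {3}}
candidates = {5, 6}   # e.g. the extent returned by a coarse index
confirmed = sorted(v for v in candidates
                   if validate(v, 0, parents, label, 0, delta, {3}))
print(confirmed)      # [5]
```

Here node 6 is a false positive of the coarse index: it shares an extent with node 5 but is only reachable through a b-labeled node, so the reverse run rejects it.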

Improving Validation

First cut: Keep track of accepting-path-length
  for accepted nodes with path length <= k, verification not required
Second step: Share traversals among verification calls
  mark node-state pairs on a successful verification path as accept
  similar marking for failed path

Improving Validation (cont’d)

Third Step: Avoid needless verification

Example: For _*.R queries, no need to verify all the way up to the root

Generalize the above!

Outline

Problem statement
Prior work and limitations
Background
A(k)-index
Query Evaluation
Preliminary experiments
Update
Conclusions

Preliminary Experiments

Data used: Internet Movie Database (http://www.imdb.com)
  250,000 movies & TV shows
  460,000 actors, etc.
  XML version = ~1GB

We used subsets of this database ranging from 200 - 2000 movies

Whole database --> future work!

Preliminary Experiments

Second source: Open Directory Project (http://www.dmoz.org)
  Entire source available in RDF format
  Subsets: entire subtree under a topic (say, shopping)

Storage Model

Results independent of any particular storage model

In-memory rooted graph

Performance metrics are abstract
  Cost = total number of nodes visited (graph + index)

Bisimulation Sizes

IMDB #Nodes: 190,000
ODP #Nodes: 143,000

[Figure: index size as percent of data graph size (0 to 40) versus the k parameter of the index (0 to 16), with one curve for the A(k)-index on IMDB data and one for the A(k)-index on ODP data.]

Query Evaluation Plans

1. Forward eval

2. Backward eval (assume a label index)

[Figure: node visits (log scale) per workload (IMDB Short, IMDB Long, ODP Short) for four plans: 1-index (fwd), 1-index (back), G (fwd), G (back).]

Short Queries - IMDB

[Figure: cost as a fraction of 1-index cost (0 to 2) versus the k parameter of the index (0 to 15), split into validation cost and index cost.]

Long Queries - IMDB

[Figure: cost as a fraction of 1-index cost (0 to 2) versus the k parameter of the index (0 to 15), split into validation cost and index cost.]

Queries beginning with _*

[Figure: cost as a fraction of 1-index cost (0 to 2) versus the k parameter of the index (0 to 15), split into validation cost and index cost.]

Queries containing _*

[Figure: cost as a fraction of 1-index cost (0 to 2.5) versus the k parameter of the index (0 to 15), split into validation cost and index cost.]

Approximate Answers

[Figure: result size as a fraction of the correct result size (0 to 3) versus the k parameter of the index (0 to 15), split into false positives, results to validate, and guaranteed results.]

A(k)-index Update

Edge added from u to v

A(0)-index -> no change except possible addition of edge

A(1)-index -> index node containing v may change
  determined by set of labels in v’s parents
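
The A(1) case can be illustrated with a small sketch in which each index node is keyed by a node's own label together with the set of its parents' labels. The class name, its fields, and the example graph are hypothetical; index edges are left out and would also have to be refreshed after an update.

```python
from collections import defaultdict

class A1Index:
    """Toy A(1) partition: class = (own label, set of parents' labels)."""

    def __init__(self, labels, edges):
        self.labels = dict(labels)
        self.parents = defaultdict(set)
        for u, v in edges:
            self.parents[v].add(u)
        self.extent = defaultdict(set)
        for v in self.labels:
            self.extent[self._signature(v)].add(v)

    def _signature(self, v):
        return (self.labels[v],
                frozenset(self.labels[p] for p in self.parents[v]))

    def add_edge(self, u, v):
        """Add data edge u -> v; only v's index node can change."""
        old = self._signature(v)
        self.parents[v].add(u)
        new = self._signature(v)
        if new != old:
            self.extent[old].discard(v)
            if not self.extent[old]:
                del self.extent[old]        # drop an emptied index node
            self.extent[new].add(v)
        return old, new

# Hypothetical data graph; labels and edges are assumptions.
labels = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
idx = A1Index(labels, edges)
old, new = idx.add_edge(2, 3)               # node 3 gains a b-labeled parent
print(old, new)
```

Nodes 5 and 6 keep their index node after the update, because their A(1) class depends only on their parents' labels, which did not change.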

A(k)-index Update (cont’d)

A(k)-index
  only nodes to be considered are those at distance < k from v
Maintain tree of splits
Work iteratively:
  find new A(1) position of v
  find new A(2) positions of v and its children
  …

Updating the 1-index

One way is generalization of A(k) update

R - any binary relation on the nodes that is reflexive and transitively closed.

A refinement of R is any subset that is reflexive and transitively closed

Refinement

B - bisimulation relation
B’ - any refinement of B
B(G) - index graph built using B
B’(G) - index graph built using B’

Theorem

Theorem: B(B’(G)) = B(G)

Intuition:
  Similar nodes behave similarly
  So, fuse them together!

Lazy Update

Basic Idea:
  G -> G’, and meanwhile B(G) -> B(G’)
  Instead, “relax” the graph B(G) to B’(G’)

How?
  A “stable” partitioning of G is either B(G) or its refinement.
  Propagate graph update on B(G) by splitting nodes until stable.
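
One way to picture "splitting until stable" is the sketch below: starting from the old partition, nodes are repeatedly regrouped by their current block plus the set of their parents' blocks in the updated graph, until no block splits. This whole-graph loop is a simplification chosen for clarity (the talk's propagation is presumably local to the update), and the function name and example data are assumptions.

```python
def split_until_stable(block, parents):
    """Refine a partition until it is stable on the updated graph.

    block: dict node -> block id (the partition before the update)
    parents: dict node -> set of parent nodes in the updated graph G'
    """
    block = dict(block)
    while True:
        # Nodes stay together only if their parents hit the same blocks.
        sig = {v: (block[v], frozenset(block[p] for p in parents.get(v, ())))
               for v in block}
        ids = {}
        refined = {v: ids.setdefault(sig[v], len(ids)) for v in block}
        if len(ids) == len(set(block.values())):
            return block                    # no block was split: stable
        block = refined

# Hypothetical example: before the update, {1,2}, {3,4} and {5,6} are blocks.
old_partition = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}
# Parent sets after adding the edge 0 -> 3 to the data graph.
parents = {1: {0}, 2: {0}, 3: {0, 1}, 4: {2}, 5: {3}, 6: {4}}
stable = split_until_stable(old_partition, parents)
print(stable[1] == stable[2], stable[3] == stable[4], stable[5] == stable[6])
# True False False
```

Since the loop only ever splits blocks, the result refines the starting partition; it is stable but not necessarily the coarsest stable partition of the updated graph.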

Lazy Update Performance

[Figure: percent increase in index size (0 to 6) versus number of edges added (0 to 500), comparing the propagated index with the accurate index.]

Conclusions

Novel approximate index structures and validation techniques

Experiments demonstrate k-bisimulation index is
  Efficiently constructed
  Effective for query answering

Future Work

Handle more query types
  Branching queries
  Queries with selection

Annotating A(k) with statistics for query optimization

Storage

Application of update algorithms to triggers