ESWC 2009, June 2009 Ranking Approximate Answers to Semantic Web Queries Carlos Hurtado 1 , Alex Poulovassilis 2 , Peter Wood 2 1 University Adolfo Ibanez, Chile 2 Birkbeck, University of London

Mar 28, 2015


Michelle Flood

ESWC 2009, June 2009

Ranking Approximate Answers to Semantic Web Queries

Carlos Hurtado (1), Alex Poulovassilis (2), Peter Wood (2)

(1) University Adolfo Ibanez, Chile
(2) Birkbeck, University of London


Outline of the talk

1. Motivation
2. Overview of our approach
3. Single-conjunct queries – exact semantics
4. Approximate semantics
5. Multi-conjunct queries
6. Conclusions and future work


1. Motivation

• Volumes of semi-structured data available on the web
• In particular, increase in the amount of RDF data, e.g. in the form of linked data
• Volumes and heterogeneity of such data necessitate support for users’ querying by approximate answering techniques:
  o users’ queries do not have to match exactly the data structures being queried
  o answers to queries are returned in ranked order, in increasing “distance” from the original query


2. Overview of our approach

We consider general semi-structured data, modelled as a graph structure, e.g. RDF linked data is one kind of data that can be represented this way

Our model is a directed graph G = (V,E) where
• each node in V is labelled with a constant (so ‘blank’ nodes cannot be represented)
• each edge e in E is labelled with a label l(e) from a finite alphabet ∑

Our query language is that of conjunctive regular path queries:

Z1, ..., Zm ← (X1, R1, Y1), ..., (Xn, Rn, Yn)

where the Xi, Yi are variables or constants, the Ri are regular expressions over ∑ and the Zi are drawn from the Xi and Yi


Example 1 – RDF graph of a transport network


“Find cities from which we can travel to city u5 using only airplanes as well as to city u6 using only trains or busses”:

?X ← (?X, (airplane)+, u5), (?X, (train|bus)+, u6)


Answer:
• First conjunct generates bindings u1, u4 for ?X
• Second conjunct generates bindings u1, u2, u4 for ?X
• Hence answer is u1, u4


Approximate answers

We are interested in using weighted regular transducers to capture query approximations since, from results by Grahne and Thomo 2001, we know that single-conjunct queries with a weighted regular transducer applied can be evaluated incrementally in polynomial time

Incremental evaluation allows answers to be returned to the user in ranked order

In this paper, we extend this approach to include also symbol inversion; and we show that multiple conjunct queries can also be evaluated in polynomial time, using an algorithm from Ilyas, Aref, Elmagarmid 2004 for computing top-k join queries


Weighted regular transducers

A weighted regular transducer is a Finite State Automaton in which the transitions are labelled with triples rather than single symbols:
• a transition from state s to state t labelled (a,i,b) means that if the transducer is in state s then it can move to state t on input a with cost i while outputting b
• in our context, such a transition is interpreted as stating that symbol a in a query can match label b of an edge in the graph with cost i


Approximate regular expression matching

In the paper, for simplicity we mainly focus on approximate regular expression matching, which can be specified using weighted regular transducers (Grahne, Thomo 2001)

The edit operations we allow are:
• insertions, deletions and substitutions of symbols
• inversion of symbols (i.e. edge reversal)
• transposition of adjacent symbols

We envisage the user being able to specify which edit operations should be undertaken by the system when answering a particular query, or in a particular application

The user could also specify the cost associated with applying each edit operation (in the paper we assume a cost of 1 for all of them)


Example 2 – transport network data


“Find cities reachable from Santiago by non-stop flights”, posed by a user who has little knowledge of the structure of the data:

?X ← (Santiago, airplane, ?X)


The query as posed returns no answers:

?X ← (Santiago, airplane, ?X)

However, the query can be relaxed, by an insertion of name, to:

?X ← (Santiago, airplane . name, ?X)

And further relaxed, by an insertion of name-, to:

?X ← (Santiago, name- . airplane . name, ?X)

This generates bindings of Temuco, Chillan for ?X

These answers can be regarded as having distance 2 from the original query:
• two insertions to the original query
• each at an assumed cost of 1


3. Single-conjunct queries

A single-conjunct query, Q, is of the form Z1, Z2 ← (X, R, Y)

A semipath p in graph G is a sequence of the form

v1, l1, v2, l2, …, vn, ln, vn+1

where for each vi, vi+1 there is an edge vi → vi+1 labelled li or an edge vi+1 → vi labelled li- in G

Semipath p conforms to regular expression R if l1 … ln is in the language denoted by R
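
The two definitions above can be illustrated with a minimal sketch. The graph fragment, the helper names (inverse, is_semipath, conforms) and the encoding of a label sequence as a ";"-terminated string matched with Python's re module are all assumptions made for illustration; they are not taken from the paper.

```python
import re

def inverse(label):
    # "name" <-> "name-": a trailing "-" marks traversal against the edge direction
    return label[:-1] if label.endswith("-") else label + "-"

def is_semipath(edges, path):
    """path = [v1, l1, v2, ..., ln, vn+1]; edges is a set of (source, label, target)."""
    nodes, labels = path[0::2], path[1::2]
    return all(
        (u, l, v) in edges or (v, inverse(l), u) in edges
        for u, l, v in zip(nodes, labels, nodes[1:])
    )

def conforms(path, pattern):
    """True if the label sequence l1 ... ln is in the language of `pattern`.
    Assumption: each label is encoded as 'label;' so ordinary regex syntax applies."""
    word = "".join(label + ";" for label in path[1::2])
    return re.fullmatch(pattern, word) is not None

# Hypothetical fragment of a transport graph
edges = {("u1", "name", "Santiago"), ("u1", "airplane", "u4"), ("u4", "name", "Temuco")}
path = ["Santiago", "name-", "u1", "airplane", "u4", "name", "Temuco"]
```

Here path is a semipath because its first step follows the name edge from u1 to Santiago in reverse, and it conforms to name- . airplane . name but not to (airplane)+.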


Exact Semantics

Given a single-conjunct query Q, Z1, Z2 ← (X, R, Y)

Let θ be a matching from {X, Y} to the nodes of graph G, that maps each constant to itself

The exact answer of Q on G is the set of tuples θ(Z1, Z2) such that there is a semipath from θ(X) to θ(Y) which conforms to R


4. Approximate Semantics

The edit distance from a semipath p to a semipath q is the minimum cost of any sequence of edit operations which transforms the sequence of edge labels of p to the sequence of edge labels of q
• We recall that the edit operations we allow are insertions, deletions, substitutions and inversions of symbols, and transposition of adjacent symbols
• We envisage the user being able to specify which edit operations should be applied by the system when answering a particular query, or in a particular application
• The user could also specify the cost associated with applying each edit operation (in the paper we assume a cost of 1 for all of them)
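
As a rough illustration of this edit distance on label sequences, the sketch below uses the restricted Damerau-Levenshtein recurrence (each symbol takes part in at most one transposition) with unit costs. It is an assumption of this sketch that an inversion of a into a- can be treated as a unit-cost substitution, which holds only while all operations cost 1; the function name is hypothetical.

```python
def edit_distance(p, q):
    """Unit-cost distance between two label sequences (lists of strings),
    allowing insertion, deletion, substitution and adjacent transposition."""
    n, m = len(p), len(q)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all of p[:i]
    for j in range(m + 1):
        d[0][j] = j                      # insert all of q[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = 0 if p[i - 1] == q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + same)  # match or substitution
            if i > 1 and j > 1 and p[i - 1] == q[j - 2] and p[i - 2] == q[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]

# Example 2: two insertions turn "airplane" into "name- . airplane . name"
relaxed = edit_distance(["airplane"], ["name-", "airplane", "name"])
```

With these inputs relaxed is 2, matching the distance given for the Temuco and Chillan answers in Example 2.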


Approximate Semantics

The distance of a semipath p to a regular expression R, dist(p,R), is the minimum edit distance from p to any semipath that conforms to R

Given graph G, query Q and matching θ, the tuple θ(Z1, Z2) has distance dist(p,R) to Q, where p is a semipath from θ(X) to θ(Y) which has the minimum distance to R of any semipath from θ(X) to θ(Y) in G

• note, if p conforms to R, then θ(Z1, Z2) has distance 0 to Q

The approximate top-k answer of Q on G is a list containing the k tuples θ(Z1, Z2) with minimum distance to Q, ranked in order of increasing distance to Q

The approximate answer of Q on G is a list containing all the tuples at any distance to Q, ranked in order of increasing distance to Q (a maximum of O(|E|^2) tuples).


Evaluation – naive

1. Construct approximate automaton M at distance d = |R|+|E| using a standard construction from approximate string matching
• note, |R|+|E| is the maximum distance required to obtain all tuples in the approximate answer (Lemma 1)
• M consists of d copies of MR, the NFA that recognises L(R)
• Each copy MR^j, where 0 ≤ j ≤ d, represents states at distance j from MR
• The only initial state in M is the initial state of MR^0
• The final state of each MR^j becomes a final state in M
• Each sub-automaton MR^j is connected to MR^(j+1) by transitions representing the selected edit operations, and their costs (assumed 1 for simplicity in the paper)
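
The layered construction can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's construction: MR is given as a set of (state, symbol, state) transitions without ε-moves, only insertion, deletion and substitution are generated (inversion and transposition are omitted), every edit costs 1, and None stands for an ε-move. The function name and the tuple layout are hypothetical.

```python
def approximate_automaton(mr_transitions, alphabet, d):
    """Return the transitions of M as tuples ((s, j), symbol, cost, (t, k)),
    where (s, j) is state s in copy j of MR and symbol None denotes an epsilon move."""
    states = {s for s, _, _ in mr_transitions} | {t for _, _, t in mr_transitions}
    m = set()
    for j in range(d + 1):
        for s, a, t in mr_transitions:
            m.add(((s, j), a, 0, (t, j)))              # exact move inside copy j
        if j == d:
            continue                                   # no edits beyond distance d
        for s, a, t in mr_transitions:
            m.add(((s, j), None, 1, (t, j + 1)))       # deletion of query symbol a
            for b in alphabet - {a}:
                m.add(((s, j), b, 1, (t, j + 1)))      # substitution of a by b
        for s in states:
            for b in alphabet:
                m.add(((s, j), b, 1, (s, j + 1)))      # insertion of b
    return m

# NFA for (airplane)+ with states s0 (initial) and sf (final)
MR = {("s0", "airplane", "sf"), ("sf", "airplane", "sf")}
M = approximate_automaton(MR, {"airplane", "name"}, 2)
```

Every cost-1 transition moves from copy j to copy j+1, so the copy index of a state equals the total edit cost spent to reach it.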


Evaluation – naive

2. Form the product automaton H = M x G, viewing each node in the input graph G=(V,E) as both an initial and a final state

3a. If Q is of the form (n,Y) ← (n,R,Y) for some node n of G, then perform a uniform cost traversal of graph H, starting from node (s0^0, n), where s0^0 is the initial state of MR^0

We keep a list of visited nodes of H, so no node is visited twice. Whenever a node (sf^j, m) is encountered (where sf^j is the final state of some MR^j), we output m.

The distance of m to Q is given by the total cost of the path from (s0^0, n) to (sf^j, m) in the traversal tree.


Evaluation – naive

3b. If Q is of the form (X,Y) ← (X,R,Y), it can be evaluated by answering the query (n,Y) ← (n,R,Y) for each node n of G

Lemma 2 of the paper states that the time to compute the approximate answer is polynomial in |V|, |E| and |R|


Evaluation – incremental

The edges of graph H = M x G can be computed incrementally, avoiding pre-computation and materialisation of the entire H:

For any state si and node n of G, succ(si ,n) outputs the set of transitions which would be the successors of (si, n) in H

succ calls nextStates(MR,s,c) to return the set of states in MR reachable from state si on reading input c – this input is obtained from
• the edges in G adjacent to n – for normal traversal, edge reversal and symbol insertion,
• symbols in ∑ – for symbol deletion, and
• edges in G adjacent to n, plus a further hop of edge traversals in G – for transpositions


Evaluation – incremental

Incremental evaluation proceeds by:

• Constructing the NFA MR for R

• Initialising to empty the set visitedR of triples (v,n,s) stating that node n in G was visited in state s starting from node v

• Initialising a priority queue QR with quadruples of the form (v,v, s0,0) for each node v in G (unless X=n in the query, in which case only (n,n, s0,0) is enqueued)

  o the fourth argument is the current distance, d
  o initially, d = 0
  o subsequently, quadruples are added to QR in order of increasing d

• Repeatedly calling the function getNext (X,R,Y) to return the next answer tuple for the conjunct (X,R,Y), in ranked order


Evaluation – incremental

getNext (X,R,Y): while QR is non-empty, this:

• de-queues a tuple (v,n,s,d) from QR where d is the distance associated with visiting node n in state s of MR having started from node v

• adds (v,n,s) to visitedR

• if s is a final state then getNext returns triple (v,n,d)

• otherwise, succ(s,n) is called, returning the set of transitions (c,w) and states (s’,m) which are the successors of (s,n) in H

• those states (s’,m) such that (v,m,s’) is already in visitedR are ignored

• for all other states, (v,m,s’,d+w) is added to QR
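
The loop above can be sketched as a Python generator for a query with a constant source node. Several points are assumptions of this sketch rather than statements from the paper: the graph is a hypothetical fragment chosen to be consistent with the trace in Example 4; only the edits allowed in that example (insertion of name or name-, inversion of airplane) are implemented; the source node v is left out of the queue entries because there is only one; and final states are expanded after being reported, so that longer semipaths are still explored.

```python
import heapq

def inverse(label):
    return label[:-1] if label.endswith("-") else label + "-"

def ranked_answers(edges, nfa, start, finals, source, insertable, invertible):
    """Yield (node, distance) in order of increasing distance.
    nfa maps (state, symbol) to a set of successor states of MR."""
    adj = {}
    for u, a, v in edges:                       # semipaths may use edges both ways
        adj.setdefault(u, []).append((a, v))
        adj.setdefault(v, []).append((inverse(a), u))

    def succ(s, n):
        for label, m in adj.get(n, []):
            for t in nfa.get((s, label), ()):
                yield 0, t, m                   # normal traversal
            if label in insertable:
                yield 1, s, m                   # insertion: stay in state s
            if inverse(label) in invertible:
                for t in nfa.get((s, inverse(label)), ()):
                    yield 1, t, m               # inversion of a query symbol

    queue = [(0, source, start)]                # entries are (d, n, s)
    visited, answered = set(), set()
    while queue:
        d, n, s = heapq.heappop(queue)
        if (n, s) in visited:
            continue
        visited.add((n, s))
        if s in finals and n not in answered:
            answered.add(n)
            yield n, d
        for w, t, m in succ(s, n):
            if (m, t) not in visited:
                heapq.heappush(queue, (d + w, m, t))

edges = [("u1", "name", "Santiago"), ("u1", "airplane", "u4"), ("u1", "airplane", "u7"),
         ("u4", "name", "Temuco"), ("u7", "name", "Chillan")]
nfa = {("s0", "airplane"): {"sf"}, ("sf", "airplane"): {"sf"}}   # (airplane)+
answers = list(ranked_answers(edges, nfa, "s0", {"sf"}, "Santiago",
                              insertable={"name", "name-"}, invertible={"airplane"}))
```

On this assumed graph the first two answers are u4 and u7 at distance 1, as in the trace of Example 4, followed by Temuco, Chillan and u1 at distance 2. Because the generator is lazy, a caller that only needs the top-k answers can stop after k items.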


Example 4 – transport network data

Suppose that the only query edits allowable are insertion of name or name-, and inversion of airplane.

“Find cities reachable from Santiago by plane”:

?Y ← (Santiago, (airplane)+, ?Y)


Enqueue (Santiago, Santiago, s0, 0)

This is de-queued, and succ(s0, Santiago) is called, which returns transition (name-, 1) and state (s0^1, u1)

(Santiago, u1, s0^1, 1) is enqueued

(Santiago, u1, s0^1, 1) is de-queued, and succ(s0^1, u1) is called; this returns transition (airplane, 0) and state (sf^1, u4), and transition (airplane, 0) and state (sf^1, u7)

(Santiago, u4, sf^1, 1) and (Santiago, u7, sf^1, 1) are enqueued

These are successively de-queued, resulting in (Santiago, u4, 1) and (Santiago, u7, 1) being successively returned by getNext

Computation continues in this way, until all answer tuples have been returned

?Y ← (Santiago, (airplane)+, ?Y)

Page 26: ESWC 2009, June 2009 Ranking Approximate Answers to Semantic Web Queries Carlos Hurtado 1, Alex Poulovassilis 2, Peter Wood 2 1 University Adolfo Ibanez,
Page 27: ESWC 2009, June 2009 Ranking Approximate Answers to Semantic Web Queries Carlos Hurtado 1, Alex Poulovassilis 2, Peter Wood 2 1 University Adolfo Ibanez,
Page 28: ESWC 2009, June 2009 Ranking Approximate Answers to Semantic Web Queries Carlos Hurtado 1, Alex Poulovassilis 2, Peter Wood 2 1 University Adolfo Ibanez,
Page 29: ESWC 2009, June 2009 Ranking Approximate Answers to Semantic Web Queries Carlos Hurtado 1, Alex Poulovassilis 2, Peter Wood 2 1 University Adolfo Ibanez,
Page 30: ESWC 2009, June 2009 Ranking Approximate Answers to Semantic Web Queries Carlos Hurtado 1, Alex Poulovassilis 2, Peter Wood 2 1 University Adolfo Ibanez,
Page 31: ESWC 2009, June 2009 Ranking Approximate Answers to Semantic Web Queries Carlos Hurtado 1, Alex Poulovassilis 2, Peter Wood 2 1 University Adolfo Ibanez,

5. Multi-conjunct queries

For a general conjunctive regular path query Q: Z1, ..., Zm ← (X1, R1, Y1), ..., (Xn, Rn, Yn)

Given a matching θ from variables to the nodes of graph G, the tuple θ(Z1, ..., Zm) has distance

dist(p1,R1) + ... + dist(pn,Rn)

to Q, where each pi is a semipath from θ(Xi) to θ(Yi) which has the minimum distance to Ri of any semipath from θ(Xi) to θ(Yi)

The approximate top-k answer of Q on G is a list containing the k tuples θ(Z1, ...,Zm) with minimum distance to Q, ranked in order of increasing distance to Q

The approximate answer of Q on G is a list containing all the tuples at any distance to Q, ranked in order of increasing distance to Q


Multi-conjunct queries

To ensure polynomial time evaluation, we require that the conjuncts of Q are acyclic

This implies the existence of a join tree induced by the conjuncts of Q

We use the hash ripple join algorithm of Ilyas, Aref, Elmagarmid 2004 to incrementally evaluate Q

For each conjunct (Xi ,Ri ,Yi) of Q, we use our incremental evaluation algorithm for single-conjunct queries to compute a relation ri containing triples (n,m,d) where d is the minimum distance to Ri of any semipath from node n to node m in G


Multi-conjunct query evaluation

Construct the evaluation tree E of Q

Initialise data structures by recursively calling the procedure open, starting at the root of E:
• for each node of E that is a join operator, hash tables are built for its left and right subtrees (LN and RN), its “threshold” value is set to 0, and an (initially empty) priority queue is allocated for the node
• for each node of E that is a conjunct (X,R,Y), the same initialisations as earlier are performed:
  o construct the NFA MR for R
  o set visitedR to empty and d to 0
  o initialise the priority queue QR


Multi-conjunct query evaluation

Incremental evaluation proceeds by calling a function getNext with the root of E

If its argument is a conjunct, getNext is as discussed earlier for single-conjunct queries

If its argument is a join operator, getNext chooses (by some heuristic) one of the two join operands, I, from which to retrieve a tuple, by recursively invoking getNext

Itop is set to the distance value of the first retrieved tuple from I, and Ibottom is updated with the distance value of the most recently retrieved tuple from I

The “threshold” value of the current node is

min(LNtop + RNbottom, RNtop + LNbottom)

which is the lowest possible distance for any join tuple yet to be computed


Multi-conjunct query evaluation – join operator

The current tuple, t, retrieved from I is inserted into I’s hash table, and the other hash table is probed with t to find possible join combinations with t

For each such tuple s and join tuple u, the distance of u from Q is set to the sum of the distances of t and s from Q, and u is added to the node’s priority queue

This process of generating and enqueueing join tuples repeats while the priority queue remains empty, or the distance value of the first item on the priority queue is greater than the current threshold value of the node

Finally, getNext returns the first item on the priority queue
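
The join operator described above can be sketched for two ranked inputs joined on a single variable. This is a simplified illustration, not the algorithm of Ilyas, Aref, Elmagarmid 2004 as published: the inputs are assumed to be lists of (binding, distance) pairs already sorted by distance, the "heuristic" for choosing an operand is plain alternation, and an exhausted input has its bottom value set to infinity. The input data and all names are hypothetical.

```python
import heapq
import math

def rank_join(left, right):
    """Yield (binding, total_distance) in order of increasing total distance."""
    inputs = [iter(left), iter(right)]
    tables = [{}, {}]            # hash table per operand: binding -> distances seen
    top = [0, 0]                 # distance of the first tuple retrieved per operand
    bottom = [0, 0]              # distance of the most recent tuple per operand
    started, done = [False, False], [False, False]
    results = []                 # priority queue of join tuples
    turn = 0
    while True:
        threshold = min(top[0] + bottom[1], top[1] + bottom[0])
        if results and results[0][0] <= threshold:
            dist, key = heapq.heappop(results)
            yield key, dist      # no future join tuple can be closer
            continue
        if all(done):
            return
        i = turn if not done[turn] else 1 - turn
        turn = 1 - i
        item = next(inputs[i], None)
        if item is None:
            done[i], bottom[i] = True, math.inf
            continue
        key, dist = item
        if not started[i]:
            top[i], started[i] = dist, True
        bottom[i] = dist
        for other in tables[1 - i].get(key, []):   # probe the other hash table
            heapq.heappush(results, (dist + other, key))
        tables[i].setdefault(key, []).append(dist)

# Hypothetical ranked bindings for ?X from two conjuncts
r1 = [("u1", 0), ("u4", 0), ("u2", 1)]
r2 = [("u1", 0), ("u2", 0), ("u4", 0), ("u3", 2)]
joined = list(rank_join(r1, r2))
```

Here u1 and u4 are returned at distance 0 and u2 at distance 1; u3 has no partner in r1 and is never returned. A join tuple is released only once its distance is at or below the threshold, which is what keeps the output in ranked order.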


6. Conclusions and future work

The paper has explored the use of weighted regular transducers and conjunctive regular path queries in a framework for approximate querying of graph-structured data

For single-conjunct queries we have shown how approximate answers can be computed in polynomial time in the size of the query and the graph

We have also shown how answers can be computed incrementally and returned in ranked order

We have generalised the treatment to multi-conjunct queries, showing that incremental computation can still be achieved in polynomial time provided the queries are acyclic


Conclusions and future work

There are several directions of future work:
• Implementation of our algorithms (ongoing), determination of their practical utility and efficiency, development and empirical evaluation of optimisations

• Application in case studies e.g. RDF linked data arising in a variety of domains

• Design of end-user tools for approximate querying of semi-structured data – so that users can specify their query approximation requirements

• Extending the expressiveness of our query language, to allow path variables and predicates on paths


Acknowledgements

Many thanks go to Petra Selmer for her implementation of the incremental evaluation algorithm, and the screenshots.


Corrections

Section 2.3 should state that there are O(|R|) transitions between successive sub-automata for transpositions (because only adjacent symbols can be transposed)

Lemma 1(i) should therefore state that M has size O(d (|R| + |∑||R| + |R|))

Examples 3 and 4 return one more answer at distance 2 than shown, namely (sf^2, u1), which is reachable from (sf^1, u4) by a transition (airplane-, 1) (and also from (sf^1, u7) by a similar transition)


Corrections (cont’d)

There is also a mistake in our calculations in Lemma 2 of the paper and the correct expression is O(|V| |E|^3 |R|):

If we assume that ∑ contains only labels appearing on edges in G, then the size of the approximation automaton M of R at distance |R|+|E| is O(|E|^2 |R|), from Lemma 1.

The size of H = M x G is O(|E|^3 |R|), since we can discard disconnected nodes from H.

Computing the approximate answer in the worst case requires |V| traversals of H, each at cost equal to the size of H, i.e. a cost of O(|V| |E|^3 |R|).