1/28
Efficient Top-k Queries for XML Information Retrieval
Gerhard Weikum
http://www.mpi-sb.mpg.de/~weikum/
Joint work with Ralf Schenkel and Martin Theobald
Max-Planck-Gesellschaft
2/28
A Few Challenging Queries (on Web / Deep Web / Intranet / Personal Info)
Which drama has a scene in which a woman makes a prophecy to a Scottish nobleman that he will become king?
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
Who was the woman from Paris that I met at the PC meeting where Alon Halevy was PC Chair?
Which gene expression data from Barrett tissue in the esophagus exhibit high levels of gene A01g?
Are there any published theorems that are equivalent to or subsume my latest mathematical conjecture?
3/28
XML-IR Example (1)

Query:
Select P, C, R From Index
Where Professor As P
And P = „Saarbruecken“
And P // Course = „IR“ As C
And P // Research = „XML“ As R

Data (sketch of the XML tree):
Professor
  Name: Gerhard Weikum
  Address: City: SB, Country: Germany
  Teaching: Course (Title: IR, Description: Information retrieval ..., Syllabus ..., Book, Article ...)
  Research: Project (Title: Intelligent Search of XML Data, Sponsor: German Science Foundation, ...)
4/28
XML-IR Example (2)

Exact query:
Select P, C, R From Index
Where Professor As P
And P = „Saarbruecken“
And P // Course = „IR“ As C
And P // Research = „XML“ As R

Relaxed query (with ~ similarity conditions):
Select P, C, R From Index
Where ~Professor As P
And P = „~Saarbruecken“
And P // ~Course = „~IR“ As C
And P // ~Research = „~XML“ As R

Data (sketches of two XML trees):
Professor
  Name: Gerhard Weikum
  Address: City: SB, Country: Germany
  Teaching: Course (Title: IR, Description: Information retrieval ..., Syllabus ..., Book ...)
  Research: Project (Title: Intelligent Search of XML Data, Sponsor: German Science Foundation), Article ...

Lecturer
  Name: Ralf Schenkel
  Address: Max-Planck Institute for CS, Germany
  Interests: Semistructured Data, IR
  Teaching: Seminar (Title: Statistical Language Models, Contents: Ranked Search ..., Literature: Book ...)
5/28
XML-IR: History and Related Work (timeline ca. 1995 - 2005)

IR on structured docs (SGML):
• HyperStorM (GMD Darmstadt), HySpirit (U Dortmund)

Web query languages:
• Lorel (Stanford U), Araneus (U Roma), W3QS (Technion Haifa), WHIRL (CMU)

XML query languages:
• XML-QL (AT&T Labs), XPath 1.0 (W3C)
• XQuery (W3C), XPath 2.0 (W3C)
• TeXQuery (AT&T Labs), FleXPath (AT&T Labs)

IR on XML:
• XIRQL (U Dortmund), XXL (U Saarland / MPI), ELIXIR (U Dublin)
• XRank (Cornell U), JuruXML (IBM Haifa), ApproXQL (U Berlin / U Munich)
• PowerDB-IR (ETH Zurich), Timber (U Michigan), XSearch (Hebrew U)
• Compass (U Saarland / MPI), INEX benchmark
• Commercial software (MarkLogic, Verity?, Oracle?, Google?, ...)
6/28
XML-IR Concepts
Where clause: conjunction of restricted path expressions with binding of variables
Select P, C, R From Index
Where ~Professor As P
And P = „Saarbruecken“
And P // ~Course = „Information Retrieval“ As C
And P // ~Research = „~XML“ As R
Elementary conditions on names and contents
„Semantic“ similarity conditions on names and contents (e.g., ~Research = „~XML“)
Relevance scoring based on tf*idf similarity of contents, ontological similarity of names,
and aggregation of local scores into global scores
Query result (exact matching):
• query is a path/tree/graph pattern
• results are isomorphic paths/subtrees/subgraphs of the data graph
Query result (with relaxation):
• query is a pattern with relaxable conditions
• results are approximate matches to the query with similarity scores
applicable to both XML and HTML data graphs
7/28
Ontologies/Thesauri: Example WordNet
woman, adult female – (an adult female person)
  => amazon, virago – (a large strong and aggressive woman)
  => donna – (an Italian woman of rank)
  => geisha, geisha girl – (...)
  => lady – (a polite name for any woman)
  ...
  => wife – (a married woman, a man's partner in marriage)
  => witch – (a being, usually female, imagined to have special powers derived from the devil)
8/28
Ontology Graph
An ontology graph is a directed graph with concepts (and their descriptions) as nodes and semantic relationships as edges (e.g., hypernyms).
[Figure: ontology graph fragment around „woman“; nodes include human, lady, witch, nanny, fairy, Mary Poppins, Lady Di, body, heart, character, personality; typed, weighted edges such as syn (1.0), hyper (0.9), hypo (0.77 / 0.42 / 0.35 / 0.3), part (0.8 / 0.3), mero (0.5), instance (0.61 / 0.2 / 0.1)]
Weighted edges capture the strength of relationships → key for identifying closely related concepts
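To make the weighted edges concrete: one natural way to score how closely two concepts are related (an assumption here, the slide does not prescribe a formula) is the maximum over all connecting paths of the product of edge weights, computable with a Dijkstra-style best-first search. The toy graph below loosely pairs up the nodes and weights from the figure; the exact edge assignment is assumed.

```python
import heapq

def ontology_sim(graph, src, dst):
    """Best-path similarity between two concepts: the maximum over all
    paths of the product of edge weights in (0, 1]. Implemented as a
    best-first search with a max-heap (negated similarities)."""
    best = {src: 1.0}
    heap = [(-1.0, src)]
    while heap:
        neg_sim, node = heapq.heappop(heap)
        sim = -neg_sim
        if node == dst:
            return sim
        if sim < best.get(node, 0.0):
            continue  # stale heap entry
        for nbr, w in graph.get(node, []):
            cand = sim * w
            if cand > best.get(nbr, 0.0):
                best[nbr] = cand
                heapq.heappush(heap, (-cand, nbr))
    return 0.0  # no connecting path

# Hypothetical edge assignment using weights from the figure
onto = {
    "woman": [("lady", 0.77), ("witch", 0.3), ("human", 0.9)],
    "lady":  [("Lady Di", 0.61)],
    "witch": [],
}
```

With this graph, sim(woman, Lady Di) = 0.77 * 0.61 via the lady node, illustrating how strong hyponym/instance chains keep concepts close while weak edges (hypo 0.3) push them apart.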
9/28
Query Expansion
Threshold-based query expansion:
substitute ~w by (c1 | ... | ck) with all ci for which sim(w, ci) exceeds a threshold θ
„Old hat“ in IR; highly disputed because of the danger of topic dilution
Approach to careful expansion:
• determine phrases from the query or from the best initial query results (e.g., by forming 3-grams and looking up ontology/thesaurus entries)
• if uniquely mapped to one concept, then expand with synonyms and weighted hyponyms
Problem: choice of the threshold → see top-k QP
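A minimal sketch of the threshold-based expansion described above. The thesaurus entries and similarity values are toy data, borrowed from the meta-index example later in the talk; `expand`, `THESAURUS`, and `SIMS` are illustrative names, not from the slides.

```python
# Toy similarity table (values as in the later meta-index example)
SIMS = {("performance", "throughput"): 0.6,
        ("performance", "delay"): 0.25,
        ("performance", "response time"): 0.7}

THESAURUS = {"performance": ["throughput", "delay", "response time"]}

def sim(w, c):
    return SIMS.get((w, c), 0.0)

def expand(term, thesaurus, sim, theta):
    """Threshold-based expansion: replace ~term by the disjunction of
    all related concepts c with sim(term, c) >= theta."""
    cands = [c for c in thesaurus.get(term, []) if sim(term, c) >= theta]
    return [term] + cands
```

Note how the result is sensitive to theta: lowering it pulls in weakly related terms like „delay“ (sim 0.25), which is exactly the topic-dilution risk the slide warns about.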
10/28
Outline
• Motivation & Preliminaries
• Prob-k: Efficient Approximate Top-k
• TopX: Top-k for XML
11/28
Top-k Query Processing with Scoring
Naive join&sort QP algorithm:
[Figure: B+ tree on terms over inverted index lists with (DocId, s = tf*idf) entries sorted by DocId, e.g. for „algorithm“: 11: 0.6, 17: 0.3, 44: 0.4, ...; for „performance“: 12: 0.5, 14: 0.4, ...; for „z-transform“: 17: 0.1, 28: 0.7, ...]
Given: query q = t1 t2 ... tz with z (conjunctive) keywords and a
similarity scoring function score(q,d) for docs d ∈ D,
e.g. score(q,d) = aggr{si(d)} (such as Σi∈q si(d))
Find: the top-k results w.r.t. score(q,d)

Google: > 10 mio. terms, > 8 bio. docs, > 4 TB index
q: algorithm performance z-transform
top-k( σ[term=t1](index) ⋈DocId σ[term=t2](index) ⋈DocId ... ⋈DocId σ[term=tz](index) order by s desc )
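The naive join&sort plan above can be sketched in a few lines: fully scan every per-term index list, join on DocId (conjunctive query, so a doc must occur in all lists), sum the tf*idf scores, sort, and cut off at k. The three example lists are the ones used on the following slides.

```python
from functools import reduce

def naive_topk(index_lists, k):
    """Naive join&sort query processing: materialize everything,
    then sort by the aggregated score (sum) and keep the best k."""
    maps = [dict(lst) for lst in index_lists]                 # DocId -> score
    common = reduce(lambda a, b: a & b, (set(m) for m in maps))
    scored = [(d, sum(m[d] for m in maps)) for d in common]
    scored.sort(key=lambda x: -x[1])
    return scored[:k]

# Example index lists from the TA slide (score-sorted there; order is irrelevant here)
L1 = [("f", 0.5), ("b", 0.4), ("c", 0.35), ("a", 0.3), ("h", 0.1), ("d", 0.1)]
L2 = [("a", 0.55), ("b", 0.2), ("f", 0.2), ("g", 0.2), ("c", 0.1)]
L3 = [("h", 0.35), ("d", 0.35), ("b", 0.2), ("a", 0.1), ("c", 0.05), ("f", 0.05)]
```

The point of the rest of the talk is that this plan reads every posting; TA-style algorithms reach the same top-k after scanning only prefixes of the lists.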
12/28
TA (Fagin’01; Güntzer/Kießling/Balke; Nepal et al.)
scan all lists Li (i = 1..m) in parallel:
  consider dj at position posi in Li;
  highi := si(dj);
  if dj ∉ top-k then {
    look up sν(dj) in all lists Lν with ν ≠ i;  // random access
    compute s(dj) := aggr{sν(dj) | ν = 1..m};
    if s(dj) > min score among top-k then
      add dj to top-k and remove the min-score d from top-k;
  };
  if min score among top-k ≥ aggr{highν | ν = 1..m} then exit;
Example (m = 3, aggr: sum, k = 2), index lists sorted by score:
L1: f: 0.5, b: 0.4, c: 0.35, a: 0.3, h: 0.1, d: 0.1
L2: a: 0.55, b: 0.2, f: 0.2, g: 0.2, c: 0.1
L3: h: 0.35, d: 0.35, b: 0.2, a: 0.1, c: 0.05, f: 0.05
top-k: a: 0.95, b: 0.8 (candidate f: 0.75)
but random accesses are expensive! → TA-sorted, Prob-sorted
applicable to XML data: course = „~Internet“ and ~topic = „performance“
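The TA variant above can be sketched as follows: round-robin sorted accesses, a random access into every other list whenever a new doc surfaces, and termination once the k-th score reaches the sum of the current high marks. The example lists and the resulting top-2 {a: 0.95, b: 0.8} match the slide.

```python
def ta(lists, k):
    """Fagin-style TA sketch. lists: score-sorted [(doc, score), ...]
    per dimension; missing entries count as 0 (sum aggregation)."""
    lookup = [dict(l) for l in lists]        # for random accesses
    m = len(lists)
    topk, seen = {}, set()
    for pos in range(max(len(l) for l in lists)):
        highs = []
        for i, l in enumerate(lists):
            if pos >= len(l):
                highs.append(0.0)            # list exhausted: nothing left
                continue
            doc, s = l[pos]
            highs.append(s)                  # current high mark of list i
            if doc not in seen:
                seen.add(doc)
                # random accesses: fetch the doc's score in every list
                topk[doc] = sum(lookup[j].get(doc, 0.0) for j in range(m))
                if len(topk) > k:
                    topk.pop(min(topk, key=topk.get))
        if len(topk) == k and min(topk.values()) >= sum(highs):
            break                            # threshold test
    return sorted(topk.items(), key=lambda x: -x[1])

lists = [
    [("f", 0.5), ("b", 0.4), ("c", 0.35), ("a", 0.3), ("h", 0.1), ("d", 0.1)],
    [("a", 0.55), ("b", 0.2), ("f", 0.2), ("g", 0.2), ("c", 0.1)],
    [("h", 0.35), ("d", 0.35), ("b", 0.2), ("a", 0.1), ("c", 0.05), ("f", 0.05)],
]
```

On this input the scan stops after three positions per list, well before the lists are exhausted, which is exactly the saving TA buys over join&sort.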
13/28
TA-Sorted (aka. NRA)
scan index lists in parallel:
  consider dj at position posi in Li;
  E(dj) := E(dj) ∪ {i};
  highi := si(q,dj);
  bestscore(dj) := aggr{x1, ..., xm} with xi := si(q,dj) for i ∈ E(dj), highi for i ∉ E(dj);
  worstscore(dj) := aggr{x1, ..., xm} with xi := si(q,dj) for i ∈ E(dj), 0 for i ∉ E(dj);
  top-k := k docs with largest worstscore;
  if min worstscore among top-k ≥ max bestscore{d | d not in top-k} then exit;
Example (m = 3, aggr: sum, k = 2) with the same index lists:
L1: f: 0.5, b: 0.4, c: 0.35, a: 0.3, h: 0.1, d: 0.1
L2: a: 0.55, b: 0.2, f: 0.2, g: 0.2, c: 0.1
L3: h: 0.35, d: 0.35, b: 0.2, a: 0.1, c: 0.05, f: 0.05
top-k: a: 0.95, b: 0.8
candidates (worstscore + unknown rest, bounded by worstscore + missing highs), e.g.:
f: 0.7 + ? ≤ 0.7 + 0.1, h: 0.45 + ? ≤ 0.45 + 0.2, c: 0.35 + ? ≤ 0.35 + 0.3, d: 0.35 + ? ≤ 0.35 + 0.3, g: 0.2 + ? ≤ 0.2 + 0.4
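The sorted-access-only variant can be sketched like this: maintain [worstscore, bestscore] intervals per candidate and stop once the k-th worstscore dominates every candidate's bestscore (and the bound for never-seen docs).

```python
def nra(lists, k):
    """NRA / TA-sorted sketch: no random accesses; bestscores use the
    current high marks of the lists a doc has not been seen in yet."""
    m = len(lists)
    worst, seen = {}, {}                      # doc -> partial sum / set of lists
    for pos in range(max(len(l) for l in lists)):
        highs = [l[pos][1] if pos < len(l) else 0.0 for l in lists]
        for i, l in enumerate(lists):
            if pos < len(l):
                doc, s = l[pos]
                worst[doc] = worst.get(doc, 0.0) + s
                seen.setdefault(doc, set()).add(i)
        def best(doc):
            return worst[doc] + sum(h for j, h in enumerate(highs)
                                    if j not in seen[doc])
        ranked = sorted(worst, key=lambda d: -worst[d])
        topk, rest = ranked[:k], ranked[k:]
        if len(topk) == k:
            mink = worst[topk[-1]]
            # also cover docs never seen so far: their bestscore is sum(highs)
            if sum(highs) <= mink and all(best(d) <= mink for d in rest):
                break
    return [(d, worst[d]) for d in topk]

lists = [
    [("f", 0.5), ("b", 0.4), ("c", 0.35), ("a", 0.3), ("h", 0.1), ("d", 0.1)],
    [("a", 0.55), ("b", 0.2), ("f", 0.2), ("g", 0.2), ("c", 0.1)],
    [("h", 0.35), ("d", 0.35), ("b", 0.2), ("a", 0.1), ("c", 0.05), ("f", 0.05)],
]
```

The returned worstscores of the final top-k are complete scores here; candidates like f stay in the queue until their bestscore drops below the min-k, which is the slow convergence the next slide discusses.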
14/28
Evolution of a Candidate’s Score
[Figure: bestscore(d) and worstscore(d) of a candidate as a function of scan depth, converging toward the final score; min-k = the minimal worstscore in the current top-k]
Worst- and best-scores only slowly converge to the final score:
• add d to the top-k result if worstscore(d) > min-k
• drop d from the candidate queue only if bestscore(d) < min-k, otherwise keep it
→ overly conservative threshold & long sequential index scans
Approximate top-k: „What is the probability that d qualifies for the top-k?“
TA family of algorithms based on the invariant (with sum as aggr):
worstscore(d) = Σi∈E(d) si(d) ≤ s(d) ≤ Σi∈E(d) si(d) + Σi∉E(d) highi = bestscore(d)
15/28
Top-k Queries with Probabilistic Guarantees
TA family of algorithms based on the invariant (with sum as aggr):
worstscore(d) = Σi∈E(d) si(d) ≤ s(d) ≤ Σi∈E(d) si(d) + Σi∉E(d) highi = bestscore(d)

Relaxed into a probabilistic invariant (δ = current min-k threshold):
p(d) := P[ s(d) > δ ] = P[ Σi∈E(d) si(d) + Σi∉E(d) Si > δ ]
      = P[ Σi∉E(d) Si > δ - Σi∈E(d) si(d) ] =: P[ Σi∉E(d) Si > δ' ]
where the RV Si has some (postulated and/or estimated) distribution in the interval (0, highi]
L1: f: 0.5, b: 0.4, c: 0.35, a: 0.3, h: 0.1, d: 0.1  (S1)
L2: a: 0.55, b: 0.2, f: 0.2, g: 0.2, c: 0.1  (S2)
L3: h: 0.35, d: 0.35, b: 0.2, a: 0.1, c: 0.05, f: 0.05  (S3)
• Discard candidates with p(d) ≤ ε (for a small error threshold ε)
• Exit the index scan when the candidate list is empty
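The slides evaluate p(d) analytically via convolutions and tail bounds; as a quick sanity-check sketch (my stand-in, not the talk's method), the same probability can be estimated by Monte Carlo under the postulated uniform distribution of each unknown Si on (0, highi]:

```python
import random

def prob_qualifies(worstscore, highs_missing, delta, trials=20000, seed=7):
    """Monte-Carlo estimate of p(d) = P[worstscore + sum of unknown
    per-list scores > delta], assuming Si ~ Uniform(0, high_i) for the
    lists i not in E(d). A sketch for intuition, not the real evaluator."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = worstscore + sum(rng.uniform(0.0, h) for h in highs_missing)
        if s > delta:
            hits += 1
    return hits / trials
```

E.g. a candidate with worstscore 0 and one missing list with high = 1.0 qualifies for a threshold of 0.5 with probability about 0.5; raising delta monotonically shrinks p(d), which is what makes the ε-cutoff meaningful.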
16/28
Probabilistic Threshold Test
Evaluating P[ Σi∉E(d) Si > δ' ], with independent or with correlated Si's:
• postulating a uniform or Zipf score distribution in [0, highi]:
  • compute the convolution using LSTs
  • use Chernoff-Hoeffding tail bounds or generalized bounds for correlated dimensions (Siegel 1995)
• fitting a Poisson distribution (or Poisson mixture) over equidistant values vj, e.g. P[Si = vj] = e^(-λi) λi^(j-1) / (j-1)!  → easy and exact convolution
• distribution approximated by histograms:
  • precomputed for each dimension
  • dynamic convolution at query-execution time
Engineering-wise, histograms work best!
[Figure: for a candidate doc d with 2 ∉ E(d) and 3 ∉ E(d): score densities f2(x) on (0, high2] and f3(x) on (0, high3], and their convolution Convolution(f2(x), f3(x)) compared against the threshold δ(d)]
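The histogram route that the slide reports working best can be sketched directly: convolve the per-dimension histograms (probability mass over equidistant buckets) and read the tail mass above the threshold. Bucket boundaries and the tail rule below are simplifying assumptions of this sketch.

```python
def convolve(h1, h2):
    """Discrete convolution of two score histograms: bucket i of the
    result collects the mass of all index pairs summing to i."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q
    return out

def tail(hist, bucket_width, threshold):
    """Approximate P[S > threshold]: sum the mass of every bucket whose
    upper edge exceeds the threshold (buckets start at 0)."""
    return sum(p for i, p in enumerate(hist)
               if (i + 1) * bucket_width > threshold)
```

For two uniform two-bucket histograms [0.5, 0.5] (bucket width 0.25), the convolution is the triangle [0.25, 0.5, 0.25], and the tail above δ' = 0.3 is 0.75; repeating the convolution folds in one missing dimension at a time, which is the "dynamic convolution at query-execution time" of the slide.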
17/28
Prob-sorted Algorithm (Smart Variant)
Prob-sorted (RebuildPeriod r, QueueBound b):
...
scan all lists Li (i = 1..m) in parallel:
  ... same code as TA-sorted ...
  // queue management
  for all priority queues q for which d is relevant do
    insert d into q with priority bestscore(d);
  // periodic clean-up
  if step-number mod r = 0 then
    if strategy = Smart then  // rebuild; single bounded queue
      for all queue elements e in q do
        update bestscore(e) with the current high_i values;
      rebuild the bounded queue with the best b elements;
      if prob[top(q) can qualify for top-k] < ε then exit;
  if all queues are empty then exit;
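The bounded-queue bookkeeping of the Smart variant can be sketched as a small class (names and the eviction policy are illustrative assumptions): keep at most b candidates ordered by bestscore, and periodically re-rank them with fresh high marks before truncating.

```python
import heapq

class BoundedCandidateQueue:
    """Sketch of Prob-sorted's bounded candidate queue: at most `bound`
    entries, keyed by bestscore; rebuild() refreshes priorities and
    truncates to the best `bound` candidates."""

    def __init__(self, bound):
        self.bound = bound
        self.items = {}                       # doc -> bestscore

    def insert(self, doc, bestscore):
        self.items[doc] = bestscore
        if len(self.items) > self.bound:      # evict the weakest candidate
            self.items.pop(min(self.items, key=self.items.get))

    def rebuild(self, fresh_bestscore):
        """Periodic clean-up: recompute bestscores (they shrink as the
        high_i marks decrease) and keep only the best `bound` entries."""
        rescored = {d: fresh_bestscore(d) for d in self.items}
        keep = heapq.nlargest(self.bound, rescored, key=rescored.get)
        self.items = {d: rescored[d] for d in keep}

    def top(self):
        return max(self.items, key=self.items.get) if self.items else None
```

The bound b (400 in the experiments that follow) is what caps the "max queue size" rows of the result tables.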
18/28
Performance Results for .Gov Queries
on the .GOV corpus from the TREC-12 Web track: 1.25 Mio. docs (html, pdf, etc.)
50 keyword queries, e.g.:
• „Lewis Clark expedition“
• „juvenile delinquency“
• „legalization Marihuana“
• „air bag safety reducing injuries death facts“

                   TA-sorted    Prob-sorted (smart)
#sorted accesses   2,263,652    527,980
elapsed time [s]   148.7        15.9
max queue size     10,849       400
relative recall    1            0.69
rank distance      0            39.5
score error        0            0.031
speedup by a factor of 10 at high precision/recall (relative to TA-sorted);
aggressive queue management even yields a factor of 100 at 30-50% precision/recall
19/28
.Gov Expanded Queries
on the .GOV corpus with query expansion based on WordNet synonyms; 50 keyword queries, e.g.:
• „juvenile delinquency youth minor crime law jurisdiction offense prevention“
• „legalization marijuana cannabis drug soft leaves plant smoked chewed euphoric abuse substance possession control pot grass dope weed smoke“

                   TA-sorted     Prob-sorted (smart)
#sorted accesses   22,403,490    18,287,636
elapsed time [s]   7908          1066
max queue size     70,896        400
relative recall    1             0.88
rank distance      0             14.5
score error        0             0.035
20/28
Performance Results for IMDB Queries
on the IMDB corpus (Web site: Internet Movie Database): 375,000 movies, 1.2 Mio. persons (html/xml)
20 structured/text queries with Dice-coefficient-based similarities of the categorical attributes Genre and Actor, e.g.:
• Genre: {Western}, Actor: {John Wayne, Katherine Hepburn}, Description: {sheriff, marshall}
• Genre: {Thriller}, Actor: {Arnold Schwarzenegger}, Description: {robot}

                   TA-sorted    Prob-sorted (smart)
#sorted accesses   1,003,650    403,981
elapsed time [s]   201.9        12.7
max queue size     12,628       400
relative recall    1            0.75
rank distance      0            126.7
score error        0            0.25
21/28
Handling Ontology-Based Query Expansions

Consider the expandable query „algorithm and ~performance“ with score
Σi∈q max j∈onto(i) { sim(i,j) * sj(d) }

[Figure: B+ tree index on terms with (DocId, score) index lists for „algorithm“ and „performance“, plus an ontology / meta-index entry for „performance“ (response time: 0.7, throughput: 0.6, queueing: 0.3, delay: 0.25, ...) pointing to the score-sorted index lists of the expansion terms, e.g. „response time“ (37: 0.9, 44: 0.8, 22: 0.7, 23: 0.6, ...) and „throughput“ (92: 0.9, 67: 0.9, 52: 0.9, 44: 0.8, ...)]

Dynamic query expansion with incremental on-demand merging of the additional index lists:
+ much more efficient than threshold-based expansion
+ no threshold tuning
+ no topic drift
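The incremental on-demand merge can be sketched as a lazy k-way merge: each expansion list carries its similarity weight sim(i,j), and a heap yields (doc, sim * score) entries in globally descending order, so the top-k engine pulls exactly as many entries as it needs. List contents below reuse the figure's example values.

```python
import heapq

def incremental_merge(lists):
    """Lazily merge weighted index lists. lists: [(weight, [(doc, score),
    ...]), ...] with each inner list sorted by score descending. Yields
    (doc, weight * score) in globally non-increasing order."""
    heap = []
    for j, (w, entries) in enumerate(lists):
        if entries:
            doc, s = entries[0]
            heapq.heappush(heap, (-w * s, j, 0, doc))
    while heap:
        neg, j, pos, doc = heapq.heappop(heap)
        yield doc, -neg
        w, entries = lists[j]
        if pos + 1 < len(entries):            # advance list j lazily
            nd, ns = entries[pos + 1]
            heapq.heappush(heap, (-w * ns, j, pos + 1, nd))

# Expansion lists for ~performance (weights and entries from the figure)
exp_lists = [
    (1.0, [(12, 0.9), (14, 0.8)]),            # „performance“ itself
    (0.7, [(37, 0.9), (44, 0.8)]),            # „response time“, sim 0.7
    (0.6, [(92, 0.9), (67, 0.9)]),            # „throughput“, sim 0.6
]
```

Because entries arrive in descending weighted-score order, the consuming top-k algorithm never materializes a low-similarity expansion list any deeper than the threshold test requires.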
22/28
Outline
• Motivation & Preliminaries
• Prob-k: Efficient Approximate Top-k
• TopX: Top-k for XML
23/28
Top-k Search on XML
A TA-style algorithm should also handle:
• tag-value conditions such as title = „algorithm“ and ~topic = „~performance“, using standard indexes
• exact path conditions of the form /book//literature, using path indexes (pre/post index, HOPI, etc.)
Example query (NEXI, XPath & IR):
//book[ about(.//„Information Retrieval“ „XML“)
        and .//affiliation[about(.//„Stanford“)]
        and .//reference[about(.//„Page Rank“)] ]//publisher//country
Handling arbitrary combinations of tag-term conditions and path conditions → TopX Algorithm
24/28
Problems Addressed by TopX
Problems:
1) content conditions (CC) on both tags and terms
2) scores for elements or subtrees, docs as results
3) score aggregation not necessarily monotonic
4) test path conditions (PC), but avoid random accesses
Solutions:
0) disk space is cheap, disk I/O is not!
1) build index lists for each tag-term pair
2) block-fetch all elements of the same doc in descending order of MaxScore(e) = max{Score(e') | e' ∈ doc(e)}
3) precompute and store scores for entire subtrees
4a) test PCs on candidates in memory
4b) postpone the evaluation of the remaining PCs until after the threshold test
Example query (NEXI, XPath & IR):
//book[ about(.//„Information Retrieval“ „XML“)
        and .//affiliation[about(.//„Stanford“)]
        and .//reference[about(.//„Page Rank“)] ]//publisher//country
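Testing path conditions in memory (solution 4a) rests on the pre/post encoding stored in the index table: element e' is a descendant of e iff pre(e) < pre(e') and post(e') < post(e), so containment reduces to two integer comparisons.

```python
def is_descendant(anc, desc):
    """Structural containment via pre/post order numbers.
    Nodes are (pre, post) pairs as stored in the TopX index table."""
    pre_a, post_a = anc
    pre_d, post_d = desc
    return pre_a < pre_d and post_d < post_a
```

Using the (pre, post) values from the example index table on the later slide: in d1, element e4 = (4, 1) lies inside e2 = (2, 4), while e8 = (8, 5) does not, since its post number exceeds e2's.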
25/28
Simplified Scoring Model for TopX
[Figure: three example documents d1, d2, d3 as trees of (pre-order id : tag) nodes with term content at the leaves; d1 has nodes 1:R, 2:A, 3:X, 4:B, 5:C, 6:B, 7:X, 8:B, 9:C and contents „aaccab“, „bbb“, „ccc“, „xy“; d2 has nodes 1:A, 2:X, 3:B, 4:B, 5:C, 6:B, 7:C and contents „ccc“, „abb“, „abc“; d3 has nodes 1:Z, 2:B, 3:X, 4:C, 5:A, 6:B, 7:C, 8:X, 9:B, 10:A, 11:C, 12:C and contents „aaaabb“, „acc“, „bb“, „aabbc“, „xyz“]
Restricted to tree data (disregarding links), using only tag-term conditions for scoring:
score of doc d (or of the subtree rooted at d.n) for a query q with tag-term conditions A1[a1], ..., Am[am] matched by nodes n1, ..., nm
score = Σi=1..m relevance(ni, ai) * specificity(Ai[ai]) / compactness(ni)

Example:
score = Σi=1..m  tf(ai, subtree(ni)) * inf(ai, Ai)  /  Σt∈content(subtree(ni)) tf(t, ni)
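A tiny worked sketch of the aggregation above. The original formula is only partially recoverable from the slide, so both the normalization in `relevance` and the numeric specificity/compactness values below are illustrative assumptions, not the talk's exact definitions.

```python
def relevance(term, subtree_terms):
    """One plausible reading of the relevance component: term frequency
    of a_i within the subtree, normalized by the subtree's total term
    count (assumption; the slide's formula is garbled)."""
    if not subtree_terms:
        return 0.0
    return subtree_terms.count(term) / len(subtree_terms)

def node_score(conditions):
    """Aggregate per-condition scores: sum over matched nodes n_i of
    relevance * specificity / compactness."""
    return sum(rel * spec / comp for rel, spec, comp in conditions)
```

For instance, node 2:A of d1 with content „aaccab“ gives relevance("a", ...) = 3/6 = 0.5 for the condition A[a]; plugging in hypothetical specificity 1.0 and compactness 2.0 contributes 0.25 to the document score.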
26/28
TopX Pseudocode
based on an index table L (Tag, Term, MaxScore, DocId, Score, ElemId, Pre, Post)
decompose query: content conditions (CC) & path conditions (PC);
for each index list Li (extracted from L by tag and/or term) do:
  block-scan next elements from the same doc d;
  E(d) := E(d) ∪ {i};
  for each CC l ∈ E(d) – {i} do:
    for each element pair (e, e') ∈ elems(d,Li) × elems(d,Ll) do:
      test PC(i,l) connecting e and e' using pre & post of e, e';
    delete e ∈ elems(d,Li) if there is no e' such that (e,e') satisfies PC(i,l);
    delete e' ∈ elems(d,Ll) if there is no e such that (e,e') satisfies PC(i,l);
  bestscore(d) := Σj∈E(d) max{Score(e) | e ∈ elems(Lj)} + Σj∉E(d) highj;
  worstscore(d) := 0;
  if E(d) is complete then worstscore(d) := bestscore(d);
  ...  // proceed with standard top-k algorithm
  test remaining PCs on d and drop d if not satisfied;
27/28
TopX Example Data
Example query: //A[ .//„a“ & .//B[.//„b“] & .//C[.//„c“] ]

Pre-computed index table (Tag, Term, MaxScore, DocId, Score, ElemId, Pre, Post)
with an appropriate B+ tree index on (Tag, Term, MaxScore, DocId, Score, ElemId)
[Figure: the same example documents d1, d2, d3 as on slide 25]
block-scans: (A, a, d3, ...), (B, b, d1, ...), (C, c, d2, ...), (A, a, d1, ...), (B, b, d3, ...), (C, c, d3, ...), (A, a, d2, ...), (B, b, d2, ...), (C, c, d1, ...)

Tag  Term  MaxScore  DocId  Score  ElemId  Pre  Post
A    a     1         d3     1      e5      5    2
A    a     1         d3     1/4    e10     10   9
A    a     1/2       d1     1/2    e2      2    4
A    a     2/9       d2     2/9    e1      1    7
B    b     1         d1     1      e8      8    5
B    b     1         d1     1/2    e4      4    1
B    b     1         d1     3/7    e6      6    8
B    b     1         d3     1      e9      9    7
B    b     1         d3     1/3    e2      2    4
B    b     2/3       d2     2/3    e4      4    1
B    b     2/3       d2     1/3    e3      3    3
B    b     2/3       d2     1/3    e6      6    6
C    c     1         d2     1      e5      5    2
C    c     1         d2     1/3    e7      7    5
C    c     2/3       d3     2/3    e7      7    5
C    c     2/3       d3     1/5    e11     11   8
C    c     3/5       d1     3/5    e9      9    6
C    c     3/5       d1     1/2    e5      5    2
28/28
Experimental Results: INEX Benchmark
on IEEE-CS journal and conference articles: 12,000 XML docs with 12 Mio. elements, 7.9 GB for all indexes
20 CO queries, e.g.: „XML editors or parsers“
20 CAS queries, e.g.: //article[ .//bibl[about(.//„QBIC“)] and .//p[about(.//„image retrieval“)] ]

CO:
                   join&sort   TopX (ε=0.0)   TopX (ε=0.1)
#sorted accesses   472,227     70,674         5,534
#random accesses               226            206
elapsed time [s]               35.6           3.2
relative recall    1           1              0.85

CAS:
                   join&sort   TopX (ε=0.0)   TopX (ε=0.1)
#sorted accesses   1,077,302   280,249        163,084
#random accesses               2,022          1,931
elapsed time [s]               118.1          21.1
relative recall    1           1              0.94
29/28
INEX with Query Expansion
20 CO queries, e.g.: „XML editors or parsers“
20 CAS queries, e.g.: //article[ .//bibl[about(.//„QBIC“)] and .//p[about(.//„image retrieval“)] ]

CO:
                   static expansion (θ=0.8, ε=0.1)   incremental merge (ε=0.1)
#sorted accesses   471,128                           533,385
#random accesses   4,228                             212
elapsed time [s]   319.6                             33.3
max #terms         16                                5 - 16

CAS:
                   static expansion (θ=0.8, ε=0.1)   incremental merge (ε=0.1)
#sorted accesses   814,123                           1,271,379
#random accesses   12,755                            648
elapsed time [s]   429.8                             103.0
max #terms         11                                7 - 11
30/28
Conclusion: Ongoing and Future Work
Observation:
Approximations with statistical guarantees are key to obtaining Web-scale efficiency
(e.g., TREC'04 Terabyte benchmark: ca. 25 Mio. docs, ca. 700,000 terms, 5-50 terms per query)
Challenges:
• Generalize TopX to arbitrary graphs
• Efficient consideration of correlated dimensions
• Integrated support for all kinds of XML similarity search: content & ontological similarity, structural similarity
• Scheduling of index-scan steps and few random accesses
• Integration of the top-k operator into the physical algebra and query optimizer of an XML engine