PhD Defense, May 16th 2006, Martin Theobald, Max Planck Institute for Informatics

TopX

Efficient and Versatile Top-k Query Processing for Text, Structured, and Semistructured Data


VLDB ‘05

[Figure: example XML document tree with article, sec, par, bib, item, inproc, and title elements. Text snippets shown in the tree: title “Current Approaches to XML Data Management”; “Native XML Data Bases.”; “Native XML data base systems can store schemaless data ... ”; “Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files … ”; “XML queries with an expressive power similar to that of Datalog …”; “XML-QL: A Query Language for XML.”; “Proc. Query Languages Workshop, W3C,1998.”]

//article[.//bib[about(.//item, “W3C”)] ]//sec[about(.//, “XML retrieval”)] //par[about(.//, “native XML databases”)]

An XML-IR Scenario (INEX IEEE) …

[Figure: second example document tree with article, sec, par, bib, item, url, and title elements. Text snippets shown in the tree: title “The XML Files”; title “The Ontology Game”; title “The Dirty Little Secret”; “What does XML add for retrieval? It adds formal ways …”; “Sophisticated technologies developed by smart people.”; “There, I've said it - the "O" word. If anyone is thinking along ontology lines, I would like to break some old news …”; “w3c.org/xml”; url “XML”]

RANKING

VAGUENESS

PRUNING

Outline

Data & relevance scoring model

Database schema & indexing

TopX query processing

Index access scheduling & probabilistic candidate pruning

Dynamic query relaxation & expansion

Experiments & conclusions


Data Model

XML tree model
  Pre/postorder labels for all tags and merged tag-term pairs
  XPath Accelerator [Grust, Sigmod ’02]

Redundant full-content text nodes
  Full-content term frequencies ftf(t_i, e)

<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>

Pre/postorder labels and redundant full-content text per element:

element        pre  post  full content
article        1    6     “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
article/title  2    1     “xml data manage”
abs            3    2     “xml manage system vary wide expressive power”
sec            4    5     “native xml data base native xml data base system store schemaless data”
sec/title      5    3     “native xml data base”
sec/par        6    4     “native xml data base system store schemaless data”

ftf(“xml”, article1) = 4
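The labeling and the redundant full-content counts can be reproduced with a short sketch. This is only an illustration under stated assumptions: the helper names `tokens` and `label_elements` are hypothetical, and the tokenizer is a plain lowercase word split without the stemming and stopword removal visible on the slide (which does not affect the count for “xml”).

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>"""

def tokens(text):
    # Plain lowercase word split; the slide additionally stems and drops stopwords.
    return re.findall(r"[a-z0-9]+", (text or "").lower())

def label_elements(xml_text):
    """Return one (tag, pre, post, full_content_counter) tuple per element."""
    rows, counter = [], {"pre": 0, "post": 0}

    def visit(elem):
        counter["pre"] += 1                      # preorder rank: assigned on entry
        row = [elem.tag, counter["pre"], None, Counter(tokens(elem.text))]
        rows.append(row)
        for child in elem:
            row[3].update(visit(child))          # descendant text is repeated in the ancestor
            row[3].update(tokens(child.tail))    # text between child elements
        counter["post"] += 1                     # postorder rank: assigned on exit
        row[2] = counter["post"]
        return row[3]

    visit(ET.fromstring(xml_text))
    return [tuple(r) for r in rows]

rows = label_elements(DOC)
for tag, pre, post, content in rows:
    print(tag, pre, post, content["xml"])
```

Each element carries the term counts of its whole subtree, which is the redundancy the slide refers to: a term in a `par` is counted again in its `sec` and in the `article`.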

Full-Content Scoring Model

Extended Okapi-BM25 probabilistic model for XML with

element-specific parameterization [VLDB ’05 & INEX ’05]

Basic scoring idea within IR-style family of TF*IDF ranking functions

Individual element statistics:

tag      N          avg. length  k1    b
article  12,223     2,903        10.5  0.75
sec      96,709     413          10.5  0.75
par      1,024,907  32           10.5  0.75
fig      109,230    13           10.5  0.75

Additional static score mass c for relaxable structural conditions

and non-conjunctive (“andish”) XPath evaluations

bib[“transactions”] vs. par[“transactions”]
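The scoring formula itself is not part of the extracted slide text. As a rough sketch only, the function below uses the common Okapi BM25 shape with per-tag statistics N and average length as in the table above; the parameter `ef` (number of elements with this tag that contain the term) and the exact form of the inverse-frequency part are assumptions, not taken from the slide.

```python
import math

def element_score(ftf, ef, n_tag, length, avg_length, k1=10.5, b=0.75):
    """BM25-style score of one element for one (tag, term) pair.

    ftf        full-content term frequency of the term in the element
    ef         assumed: number of elements with this tag containing the term
    n_tag      number of elements with this tag (column N)
    length     full-content length of the element
    avg_length average full-content length for this tag
    """
    # Length normalization: long elements need more occurrences for the same score.
    K = k1 * ((1.0 - b) + b * length / avg_length)
    tf_part = (k1 + 1.0) * ftf / (K + ftf)
    idf_part = math.log((n_tag - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part

# Hypothetical example for a sec element of average length (N and avg. length from the table).
s_short = element_score(ftf=4, ef=1000, n_tag=96709, length=413, avg_length=413)
s_long = element_score(ftf=4, ef=1000, n_tag=96709, length=826, avg_length=413)
s_rare = element_score(ftf=1, ef=1000, n_tag=96709, length=413, avg_length=413)
```

Because the statistics are kept per tag, the same term can score differently under different tags, which is the point of the bib[“transactions”] vs. par[“transactions”] contrast.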


Inverted Block-Index for Content & Structure

sec[“xml”]

title[“native”] par[“retrieval”]

Combined inverted index over merged tag-term pairs (on redundant element full-contents)

Sequential block-scans
  Group elements in descending order of (maxscore, docid) per list
  Block-scan all elements per doc for a given (tag, term) key

Stored as inverted files or database tables

(two B+-tree indexes over full range of attributes)

sec[“xml”]

eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

title[“native”]

eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    14   10    0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

par[“retrieval”]

eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

Access modes shown in the figure: Random Access (RA) and Sorted Access (SA)
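The block ordering can be sketched as follows. This is a minimal illustration, not the actual index builder: the function name `build_block_list` is hypothetical, and the order of elements inside a document block (by descending score) is an assumption. Document blocks are sorted by descending (maxscore, docid).

```python
def build_block_list(postings):
    """Order postings of one (tag, term) key into per-document blocks.

    postings: iterable of (eid, docid, score, pre, post).
    Returns rows (eid, docid, score, pre, post, max_score).
    """
    by_doc = {}
    for eid, docid, score, pre, post in postings:
        by_doc.setdefault(docid, []).append((eid, docid, score, pre, post))

    # Blocks in descending order of (maxscore, docid).
    blocks = sorted(
        by_doc.items(),
        key=lambda kv: (-max(e[2] for e in kv[1]), -kv[0]),
    )

    rows = []
    for docid, elems in blocks:
        max_score = max(e[2] for e in elems)
        # Assumed: inside a block, elements are listed by descending score.
        for e in sorted(elems, key=lambda e: -e[2]):
            rows.append(e + (max_score,))
    return rows

# The sec and par postings from the slide, deliberately shuffled.
sec = build_block_list([(84, 3, 0.1, 1, 12), (9, 2, 0.5, 10, 8),
                        (171, 5, 0.85, 1, 20), (46, 2, 0.9, 2, 15)])
par = build_block_list([(96, 4, 0.75, 6, 4), (3, 1, 1.0, 1, 21),
                        (182, 5, 0.75, 3, 7), (28, 2, 0.8, 8, 14)])
```

A sequential scan over such a list delivers all elements of a document together, so one sorted access step can process a whole document block.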

Navigational Index

sec

eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12

Additional element directory
  Random accesses on B+-tree index using (docid, tag) as key
  Carefully scheduled probes

Schema-oblivious indexing & querying
  Non-schematic, heterogeneous data sources (no DTD required)
  Supports full NEXI syntax
  Supports all 13 XPath axes (+level)
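With pre/postorder labels stored in every index entry, a structural test between two elements needs no tree traversal. The sketch below shows the standard descendant test on such labels; the function name `is_descendant` and the tuple layout are hypothetical, while the label values are the doc 2 entries from the lists above.

```python
def is_descendant(desc, anc):
    """desc, anc: (docid, pre, post). True if desc lies inside anc's subtree."""
    same_doc = desc[0] == anc[0]
    # A descendant starts after its ancestor (larger pre)
    # and ends before it (smaller post).
    return same_doc and desc[1] > anc[1] and desc[2] < anc[2]

sec_46 = (2, 2, 15)    # sec element, eid 46
sec_9 = (2, 10, 8)     # sec element, eid 9
title_51 = (2, 4, 12)  # title element, eid 51
par_28 = (2, 8, 14)    # par element, eid 28
```

Here `title_51` and `par_28` both fall inside `sec_46` but not inside `sec_9`, so only eid 46 can be joined with them under a descendant axis.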

[Figure: the sec navigational index is probed by Random Access (RA) with static score mass C=1.0, while the title[“native”] and par[“retrieval”] inverted lists are read by Sorted Access (SA).]


TopX Query Processor

Adapt Threshold Algorithm (TA) paradigm [Fagin et al., PODS ‘01]

  Focus on inexpensive SA & postpone expensive RA (NRA & CA)
  Keep intermediate top-k & enqueue partially evaluated candidates

Lower/upper score guarantees for each candidate d
  Remember set of evaluated query dimensions E(d)

  $worstscore(d) = \sum_{i \in E(d)} score(t_i, e_d)$
  $bestscore(d) = worstscore(d) + \sum_{i \notin E(d)} high_i$

Early min-k threshold termination
  Return current top-k iff no candidate d outside the current top-k has bestscore(d) > min-k

TopX core engine [VLDB ’04]
  SA batching & efficient queue management
  Multi-threaded SA & query processing
  Probabilistic cost model for RA scheduling
  Probabilistic candidate pruning for approximate top-k results

XML engine [VLDB ’05]
  Efficiently deals with uncertainty in the structure & content (“andish XPath”)
  Controlled amount of RA (unique among current XML-top-k engines)
  Dynamically switch between document & element granularity
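The score bounds and the threshold test can be sketched in a few lines. This is a simplified illustration, not the TopX engine: the names `score_bounds` and `can_terminate`, the dictionary layout, and the small tolerance `EPS` (added only to absorb floating-point rounding) are assumptions.

```python
EPS = 1e-9  # tolerance for floating-point sums; not part of the algorithm itself

def score_bounds(seen, high):
    """seen: {dimension: score} for evaluated dimensions E(d);
    high: {dimension: current upper bound at the list's scan position}."""
    worst = sum(seen.values())
    best = worst + sum(h for dim, h in high.items() if dim not in seen)
    return worst, best

def can_terminate(candidates, high, k):
    """True if neither a queued candidate nor an unseen document can still enter the top-k."""
    bounds = {d: score_bounds(seen, high) for d, seen in candidates.items()}
    ranked = sorted(bounds, key=lambda d: -bounds[d][0])
    if len(ranked) < k:
        return False
    min_k = bounds[ranked[k - 1]][0]          # worstscore at rank k
    unseen_best = sum(high.values())          # bound for a document not seen in any list
    if unseen_best > min_k + EPS:
        return False
    return all(bounds[d][1] <= min_k + EPS for d in ranked[k:])

# Element 46 after the first sorted access on sec["xml"].
first = score_bounds({"sec": 0.9}, {"sec": 0.9, "title": 1.0, "par": 1.0})

early = can_terminate(
    {"doc2": {"sec": 0.9, "par": 0.8}, "e3": {"par": 1.0}},
    {"sec": 0.85, "title": 0.8, "par": 0.8}, k=2)
late = can_terminate(
    {"doc2": {"sec": 0.9, "title": 0.5, "par": 0.8},
     "doc5": {"sec": 0.85, "par": 0.75}, "e3": {"par": 1.0}},
    {"sec": 0.1, "title": 0.5, "par": 0.75}, k=2)
```

The bounds only tighten as scanning proceeds: worstscore grows when a new dimension is evaluated, and bestscore shrinks as the `high` values drop.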

TopX Query Processing By Example (NRA)

Top-2 query over sec[“xml”], title[“native”], par[“retrieval”], reading one document block per list in round-robin order (sorted accesses only). Initially high = 1.0 for every list and min-2 = 0.0; the pseudo-doc (worst=0.0) stands for all unseen documents.

step  list             block read                       candidate after the step           pseudo-doc best  min-2
1     sec[“xml”]       doc2: 46 (0.9), 9 (0.5)          46: worst=0.9, best=2.9            2.9              0.5
                                                        9: worst=0.5, best=2.5
2     title[“native”]  doc17: 216 (0.9)                 216: worst=0.9, best=2.8           2.8              0.9
3     par[“retrieval”] doc1: 3 (1.0)                    3: worst=1.0, best=2.8             2.8              0.9
4     sec[“xml”]       doc5: 171 (0.85)                 171: worst=0.85, best=2.75         2.75             0.9
5     title[“native”]  doc3: 72 (0.8)                   72: worst=0.8, best=2.65           2.65             0.9
6     par[“retrieval”] doc2: 28 (0.8)                   {46, 28}: worst=1.7, best=2.5      2.45             1.0
7     sec[“xml”]       doc3: 84 (0.1)                   84: worst=0.1, best=0.9            1.7              1.0
8     title[“native”]  doc2: 51 (0.5)                   {46, 28, 51}: worst=2.2, best=2.2  1.4              1.0
9     par[“retrieval”] doc5: 182 (0.75)                 {171, 182}: worst=1.6              1.35             1.6

Top-2 results: doc2 with elements {46, 28, 51}, worst=2.2; doc5 with elements {171, 182}, worst=1.6

Candidate queue at the end, as shown on the slide: 3 (worst=1.0, best=1.6), 216 (worst=0.9, best=1.0), 9 (worst=0.5, best=0.5)

[Figure: element blocks of one document, each entry given as score [pre, post]: 1.0 [1, 419]; 1.0 [169, 348]; 1.0 [351, 389]; 1.0 [392, 395]; 0.49 [174, 324]; 0.24 [354, 353]; 0.21 [169, 348]; 0.18 [357, 359]; 0.16 [351, 389]; 0.16 [65, 64]; 0.14 [347, 343]; 0.13 [166, 164]; 0.12 [354, 353]; 0.11 [37, 46]; 0.11 [351, 389]; 0.07 [389, 388]; 0.06 [354, 353]; 0.04 [375, 378]; 0.02 [372, 371]]

“Andish” XPath over Element Blocks

Incremental & non-conjunctive XPath evaluations using
  Hash joins on the content conditions
  Staircase joins [Grust, VLDB ‘03] on the structure

Tight & accurate [worstscore(d), bestscore(d)] bounds for early pruning (ensuring monotonic updates)

Virtual support elements for navigation

[Figure: query tree with structural nodes article, bib, sec and content conditions item=w3c, sec=xml, sec=retrieve, par=native, par=xml, par=database. Content conditions are read by SA, structural nodes are resolved by RA. Structural matches carry the static score C, shown for C=1.0 (e.g. 1.0 [398, 418]) and C=0.2 (0.2 [1, 419], 0.2 [169, 348], 0.2 [351, 389], 0.2 [392, 395], 0.2 [398, 418]); virtual support elements appear as 0.0 [*, *]. Scores are propagated with getSubtreeScore() and getParentScore(). worstscore(d) values shown: 0.14, 0.63, 1.18, 3.69, 1.38.]


Random Access Scheduling – Minimal Probing

MinProbe: Schedule RAs only for the most promising candidates

Extending “Expensive Predicates & Minimal Probing” [Chang&Hwang, SIGMOD ‘02]

Schedule batch of RAs on d only iff

  $worstscore(d) + r_d \cdot c > \text{min-}k$

  worstscore(d): evaluated content & structure-related score
  r_d · c: unresolved, static structural score mass
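A minimal sketch of the MinProbe test follows. It assumes that r_d counts the structural conditions of d that are still unresolved and that each contributes the same static score c; the function names and the example numbers are hypothetical.

```python
def should_probe(worstscore, unresolved, c, min_k):
    """MinProbe test: random accesses on d pay off only if resolving all
    remaining structural conditions could lift d above the rank-k worstscore."""
    return worstscore + unresolved * c > min_k

def candidates_to_probe(queue, c, min_k):
    """queue: {candidate: (worstscore, unresolved structural conditions)}."""
    return [d for d, (worst, r) in queue.items() if should_probe(worst, r, c, min_k)]

# Hypothetical queue: content scores already evaluated, structure still open.
queue = {"d1": (0.69, 3), "d2": (0.30, 1), "d3": (1.10, 2)}
with_c_1 = candidates_to_probe(queue, c=1.0, min_k=2.0)
with_c_02 = candidates_to_probe(queue, c=0.2, min_k=2.0)
```

With a small static score c, structural matches add little, so fewer candidates qualify for random accesses than with c = 1.0.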

[Figure: the query tree (article, bib, sec; item=w3c, sec=xml, sec=retrieve, par=native, par=xml, par=database) with content scores read by SA (0.49 [174, 324]; 0.24 [354, 353]; 0.16 [351, 389]; 0.12 [354, 353]; 0.11 [351, 389]; 0.06 [354, 353]) and structural matches resolved by RA (1.0 [1, 419]; 1.0 [169, 348]; 1.0 [398, 418]). min-k is the rank-k worstscore.]

Cost-based Scheduling (CA) – Ben Probing

Goal: Minimize overall execution cost #SA + cR/cS · #RA
  Access costs on d are wasted if d does not make it into the final top-k (considering both structural selectivities & content scores)

Probabilistic cost model comparing different types of Expected Wasted Costs
  EWC-RAs(d) of looking up d in the remaining structure
  EWC-RAc(d) of looking up d in the remaining content
  EWC-SA(d) of not seeing d in the next batch of b SAs

BenProbe: Schedule batch of RAs on d iff

  EWC-RAs|c(d) · cR/cS < EWC-SA(d)

  Bounds the ratio between #RA and #SA
  Schedule RAs late & last
  Schedule RAs in ascending order of EWC-RAs|c(d)
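The cost measure and the BenProbe decision can be sketched as below. How the expected wasted costs themselves are estimated is not covered here; they are passed in as plain numbers, and all names and example values are hypothetical.

```python
def execution_cost(n_sa, n_ra, cr_over_cs):
    """Overall cost in units of one sorted access: #SA + cR/cS * #RA."""
    return n_sa + cr_over_cs * n_ra

def schedule_random_accesses(ewc_ra, ewc_sa, cr_over_cs):
    """ewc_ra: {candidate: expected wasted cost of random accesses on it};
    ewc_sa: {candidate: expected wasted cost of waiting for the next SA batch}.
    Returns the candidates to probe, cheapest expected waste first."""
    chosen = [d for d in ewc_ra if ewc_ra[d] * cr_over_cs < ewc_sa[d]]
    return sorted(chosen, key=lambda d: ewc_ra[d])

# Hypothetical estimates for three queued candidates.
ewc_ra = {"d1": 0.4, "d2": 0.1, "d3": 0.9}
ewc_sa = {"d1": 30.0, "d2": 30.0, "d3": 30.0}
plan = schedule_random_accesses(ewc_ra, ewc_sa, cr_over_cs=50.0)
cost = execution_cost(n_sa=100, n_ra=10, cr_over_cs=5.0)
```

A larger cost ratio cR/cS makes the test harder to pass, which is how the rule keeps the number of random accesses in proportion to the sorted accesses.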

Selectivity Estimator [VLDB ’05]

Split the query into a set of basic, characteristic XML patterns: twigs, paths & tag-term pairs

Example query: //sec[//figure=“java”] [//par=“xml”] [//bib=“vldb”]

pattern               selectivity
//sec[//figure]//par  p1 = 0.682
//sec[//figure]//bib  p2 = 0.001
//sec[//par]//bib     p3 = 0.002
//sec//figure         p4 = 0.688
//sec//par            p5 = 0.968
//sec//bib            p6 = 0.002
//bib=“vldb”          p7 = 0.023
//par=“xml”           p8 = 0.067
//figure=“java”       p9 = 0.011

(patterns and selectivities paired in the order they appear on the slide)

[Figure: query tree sec with children figure=“java”, par=“xml”, bib=“vldb”; selectivities obtained by sampling.]

Consider structural selectivities of unresolved & non-redundant patterns Y

conjunctive:
PS [d satisfies all structural conditions Y] =

“andish”:
PS [d satisfies a subset Y’ of structural conditions Y] =

Consider binary correlations between structural patterns and/or tag-term pairs (data sampling, query logs, etc.)
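The right-hand sides of the two probabilities are not part of the extracted slide text. As a hedged sketch, one natural reading treats the patterns as independent, which gives simple products; this ignores the binary correlations the slide mentions. The function names are hypothetical, and the pairing of pattern and selectivity assumes the slide lists both in the same order.

```python
from itertools import combinations
from math import prod

def p_all(selectivities):
    """Conjunctive case under independence: every pattern must match."""
    return prod(selectivities.values())

def p_exactly(selectivities, satisfied):
    """'Andish' case under independence: exactly the patterns in `satisfied` match."""
    return prod(p if name in satisfied else 1.0 - p
                for name, p in selectivities.items())

# Path patterns of the example query with their (assumed) selectivities.
paths = {"//sec//figure": 0.688, "//sec//par": 0.968, "//sec//bib": 0.002}

conjunctive = p_all(paths)
only_fig_and_par = p_exactly(paths, {"//sec//figure", "//sec//par"})

# Sanity check: the probabilities of all subsets add up to 1.
total = sum(p_exactly(paths, set(subset))
            for r in range(len(paths) + 1)
            for subset in combinations(paths, r))
```

The rare //sec//bib pattern dominates the conjunctive estimate, while the relaxed reading still leaves a high probability that the two frequent patterns match.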

Score Predictor [VLDB ’04]

Consider score distributions of the content-related inverted lists

[Figure: score histograms f1 over [0, high1] and f2 over [0, high2] for the title[“native”] and par[“retrieval”] lists, and their convolution over [0, 2] with δ(d) marked on the axis.]

PC [d gets in the final top-k] =

Convolutions of score histograms (assuming independence)
Closed-form convolutions, e.g., truncated Poisson
Moment-generating functions & Chernoff-Hoeffding bounds
Combined score predictor & selectivity estimator

Probabilistic candidate pruning: Drop d from the candidate queue iff

  PC [d gets in the final top-k] < ε

(with probabilistic guarantees for relative precision & recall)
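A small sketch of histogram-based pruning follows. It makes several simplifying assumptions that are not from the slide: each list's remaining scores are summarized by an equi-width histogram with bucket probabilities, every bucket is treated as a point mass at its lower edge, and the score a candidate still needs is taken as min-k minus its worstscore. All function names are hypothetical.

```python
def convolve(h1, h2):
    """Distribution of the sum of two independent bucketed scores."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, a in enumerate(h1):
        for j, b in enumerate(h2):
            out[i + j] += a * b
    return out

def prob_exceeds(histograms, delta, width):
    """P[sum of the unevaluated scores > delta]; bucket i stands for score i * width."""
    dist = [1.0]
    for h in histograms:
        dist = convolve(dist, h)
    return sum(p for i, p in enumerate(dist) if i * width > delta)

def should_prune(histograms, worstscore, min_k, width, eps):
    """Drop the candidate if it is unlikely to still close the gap to min-k."""
    return prob_exceeds(histograms, min_k - worstscore, width) < eps

# Two unevaluated lists; each remaining score is 0.0 or 0.5 with equal probability.
h = [0.5, 0.5]
p = prob_exceeds([h, h], delta=0.75, width=0.5)
pruned = should_prune([h, h], worstscore=0.25, min_k=1.0, width=0.5, eps=0.3)
kept = should_prune([h, h], worstscore=0.25, min_k=1.0, width=0.5, eps=0.1)
```

With ε = 0 no candidate is ever dropped by this test, so the result is exact; larger ε trades result quality for fewer accesses.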


Dynamic and Self-tuning Query Expansion [SIGIR ’05]

Incrementally merge inverted lists for a set of active expansions exp(t1)..exp(tm) in descending order of scores s(ti, d)

Max-score aggregation for fending off topic drifts

Dynamically expand set of active expansions only when beneficial for finding the final top-k results

Specialized expansion operators
  Incremental Merge operator
  Nested Top-k operator (phrase matching)
Supports text, structured records & XML
Boolean (but ranked) retrieval mode

[Figure: TREC Robust topic no. 363, Top-k (transport, tunnel, ~disaster). Inverted lists read by SA, in slide order: disaster: d42, d11, d92, ..., d21; accident: d78, d10, d11, ..., d1; fire: d37, d42, d32, ..., d87; transport: d66, d93, d95, ..., d101; tunnel: d95, d17, d11, ..., d99. The Incr. Merge operator for ~disaster combines the disaster, accident, and fire lists into d42, d11, d92, d37, …]
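The incremental merge with max-score aggregation can be sketched with a lazy k-way merge. This is only an illustration: the scores are invented, only the document ids follow the slide, and the self-tuning decision of when to open a further expansion list is not modeled.

```python
import heapq

def incremental_merge(expansion_lists):
    """expansion_lists: {term: [(docid, score), ...]}, each sorted by descending score.
    Yields (docid, score) in descending score order, once per document."""
    def negated(postings):
        # heapq.merge expects ascending input, so scores are negated.
        for docid, score in postings:
            yield (-score, docid)

    seen = set()
    for neg_score, docid in heapq.merge(*(negated(p) for p in expansion_lists.values())):
        if docid in seen:
            continue          # a better score for this document was already emitted
        seen.add(docid)
        yield docid, -neg_score

# Expansions of ~disaster; scores are hypothetical.
lists = {
    "disaster": [("d42", 0.9), ("d11", 0.6), ("d92", 0.5)],
    "accident": [("d78", 0.4), ("d10", 0.3), ("d11", 0.2)],
    "fire": [("d37", 0.45), ("d42", 0.4), ("d32", 0.1)],
}
merged = list(incremental_merge(lists))
```

Because the merged stream is in descending order, the first occurrence of a document already carries its maximum score over all expansions; a document that matches many expansion terms does not accumulate score, which is what limits topic drift.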


Data Collections & Competitors

INEX ‘04 Ad-hoc Track setting
  IEEE collection with 12,223 docs & 12M elements in 534 MB XML data
  46 NEXI queries with official relevance judgments and a strict quantization
  e.g., //article[.//bib=“QBIC” and .//par=“image retrieval”]

TREC ‘04 Robust Track setting
  Aquaint news collection with 528,155 docs in 1,904 MB text data
  50 “hard” queries from TREC Robust Track ‘04 with official relevance judgments
  e.g., “transportation tunnel disasters” or “Hubble telescope achievements”

Competitors for XML setup
  DBMS-style Join&Sort
    Using index full scans on the TopX index (Holistic Twig Joins)
  StructIndex [Kaushik et al, Sigmod ’04]
    Top-k with separate indexes for content & structure
    DataGuide-like structural index
    Eager RAs (Fagin’s TA)
  StructIndex+
    Extent chaining technique for DataGuide-based extent identifiers (skip scans on the content index)

INEX: TopX vs. Join&Sort & StructIndex

[Figure: #SA + #RA (millions, 0 to 12) versus k (1, 5, 10, 50, 100, 500, 1,000) for Join&Sort, StructIndex+, StructIndex, BenProbe, and MinProbe.]

46 NEXI Queries

method           k      epsilon  # SA       # RA       CPU sec
Join&Sort        10     n/a      9,122,318  0          12.01
StructIndex      10     n/a      761,970    3,25,068   17.02
StructIndex+     10     n/a      77,482     5,074,384  80.02
TopX – BenProbe  10     0.0      723,169    84,424     3.22
TopX – MinProbe  10     0.0      635,507    64,807     1.38
TopX – BenProbe  1,000  0.0      882,929    1,902,427  16.10

Result quality columns, printed once per k on the slide: k=10: P@k 0.34, MAP@k 0.09, rel. Prec 1.00; k=1,000: P@k 0.03, MAP@k 0.17, rel. Prec 1.00

INEX: TopX with Probabilistic Pruning

46 NEXI Queries, TopX – MinProbe

k   epsilon  # SA     # RA    CPU sec  P@k   MAP@k  rel. Prec
10  0.00     635,507  64,807  1.38     0.34  0.09   1.00
10  0.25     392,395  56,952  2.31     0.34  0.08   0.77
10  0.50     231,109  48,963  0.92     0.31  0.08   0.65
10  0.75     102,118  42,174  0.46     0.33  0.08   0.51
10  1.00     36,936   35,327  0.46     0.30  0.07   0.38

[Figure: rel. Prec, P@10, and MAP (0.0 to 1.0) versus ε (0.0 to 1.0).]

[Figure: #SA + #RA (0 to 800,000) versus ε (0 to 1.0) for TopX – MinProbe.]

TREC Robust: Dynamic vs. Static Query Expansion

Careful WordNet expansions using automatic Word Sense Disambiguation & phrase detection [WebDB ’03 & PKDD ’05] with (m < 118)
MinProbe RA scheduling for phrase matching (auxiliary term-offset table)
Incremental Merge + Nested Top-k (mtop < 22) vs. Static Expansions (mtop < 118)

[Figure: rel. Prec, P@10, and MAP (0.0 to 1.0) versus ε (0.0 to 1.0), each for Incr. Merge and Static Expansion.]

[Figure: #SA and #RA (millions, 0 to 20) versus ε (0.0 to 1.0), each for Incr. Merge and Static Expansion.]

50 Keyword + Phrase Queries

Conclusions

Efficient and versatile TopX query processor
  Extensible framework for XML-IR & full-text search
  Very good precision/runtime ratio for probabilistic candidate pruning
  Self-tuning solution for robust query expansions & IR-style vague search
  Combined SA and RA scheduling close to lower bound for CA access cost [Submitted for VLDB ’06]

Scalability
  Optimized for query processing IO
  Exploits cheap disk space for redundant index structures (constant redundancy factor of 4-5 for INEX IEEE)
  Extensive TREC Terabyte runs with 25,000,000 text documents (426 GB)

INEX 2006
  New Wikipedia XML collection with 660,000 documents & 120,000,000 elements (~6 GB raw XML)
  Official host for the Topic Development and Interactive Track (69 groups registered worldwide)
  TopX WebService available (SOAP connector)

That’s it. Thank you!

TREC Terabyte: Comparison of Scheduling Strategies

Thanks to Holger Bast & Deb Majumdar!
