TopX: Efficient and Versatile Top-k Query Processing for Text, Structured, and Semistructured Data. PhD Defense, May 16th, 2006. Martin Theobald, Max Planck Institute for Informatics. VLDB ‘05
An XML-IR Scenario (INEX IEEE) …

Example query:
//article[.//bib[about(.//item, “W3C”)] ]//sec[about(.//, “XML retrieval”)] //par[about(.//, “native XML databases”)]

[Figure: two example article trees with sec, par, title, bib, item, inproc, and url elements.
The first article has the title “Current Approaches to XML Data Management”. Its text snippets include “Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files …”, “XML queries with an expressive power similar to that of Datalog …”, “Native XML data base systems can store schemaless data ...”, “Native XML Data Bases.”, “XML-QL: A Query Language for XML.”, and “Proc. Query Languages Workshop, W3C, 1998.”
The second article has the titles “The XML Files”, “The Ontology Game”, and “The Dirty Little Secret”. Its text snippets include “What does XML add for retrieval? It adds formal ways …”, “Sophisticated technologies developed by smart people.”, “There, I've said it - the "O" word. If anyone is thinking along ontology lines, I would like to break some old news …”, “w3c.org/xml”, and “XML”.]

Callouts: RANKING, VAGUENESS, PRUNING
Outline
Data & relevance scoring model
Database schema & indexing
TopX query processing
Index access scheduling & probabilistic candidate pruning
Dynamic query relaxation & expansion
Experiments & conclusions
Data Model
XML tree model
  Pre/postorder labels for all tags and merged tag-term pairs
  XPath Accelerator [Grust, SIGMOD ’02]
Redundant full-content text nodes
  Full-content term frequencies ftf(ti,e)

Example document:
<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>

Full-content text per element:
article: “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
article/title: “xml data manage”
article/abs: “xml manage system vary wide expressive power”
article/sec: “native xml data base native xml data base system store schemaless data”
article/sec/title: “native xml data base”
article/sec/par: “native xml data base system store schemaless data”
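The labeling and full-content idea on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names label and ftf are made up here, numbering starts at 1, and tokenization is plain lowercasing without the stemming and stopword removal visible in the slide's example strings.

```python
import xml.etree.ElementTree as ET

DOC = """<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>"""


def tokenize(text):
    # Simplification: lowercase and split only (no stemming, no stopwords).
    return (text or "").lower().replace(".", " ").split()


def label(root):
    """Return (tag, pre, post, full_content_terms) for every element."""
    rows = []
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1
        # Full content = own text plus the full content of all descendants.
        terms = tokenize(elem.text)
        row = [elem.tag, counters["pre"], None, terms]
        rows.append(row)
        for child in elem:
            terms.extend(visit(child))
            terms.extend(tokenize(child.tail))
        counters["post"] += 1
        row[2] = counters["post"]
        return list(terms)

    visit(root)
    return [tuple(r) for r in rows]


def ftf(term, terms):
    """Full-content term frequency of a term in one element."""
    return terms.count(term)


rows = label(ET.fromstring(DOC))
for tag, pre, post, terms in rows:
    print(tag, pre, post, ftf("xml", terms))
```

Because each element repeats the text of its descendants, the article element sees "xml" four times while the sec element sees it twice, which is the redundancy the slide refers to.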
Basic scoring idea within IR-style family of TF*IDF ranking functions

Individual element statistics:
tag      N          avg. length  k1    b
article  12,223     2,903        10.5  0.75
sec      96,709     413          10.5  0.75
par      1,024,907  32           10.5  0.75
fig      109,230    13           10.5  0.75

Additional static score mass c for relaxable structural conditions and non-conjunctive (“andish”) XPath evaluations

bib[“transactions”] vs. par[“transactions”]
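The scoring formula itself did not survive in this transcript. As a rough illustration of how per-tag statistics such as N, average length, k1, and b could enter a TF*IDF-style score, here is a sketch that assumes the standard Okapi BM25 form applied per tag. The function name element_score, the element frequency ef, and the sample inputs are hypothetical and not taken from the slides.

```python
import math

# Per-tag statistics from the slide: (N, avg. length, k1, b).
TAG_STATS = {
    "article": (12_223, 2_903, 10.5, 0.75),
    "sec": (96_709, 413, 10.5, 0.75),
    "par": (1_024_907, 32, 10.5, 0.75),
    "fig": (109_230, 13, 10.5, 0.75),
}


def element_score(tag, ftf, ef, length):
    """BM25-style score of one element for one tag-term pair (assumed form).

    ftf:    full-content term frequency of the term in the element
    ef:     number of elements with this tag containing the term (assumed input)
    length: full-content length of the element in terms
    """
    n, avg_len, k1, b = TAG_STATS[tag]
    # Length normalization relative to the average length for this tag.
    k = k1 * ((1 - b) + b * length / avg_len)
    tf_part = (k1 + 1) * ftf / (k + ftf)
    idf_part = math.log((n - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part


# The same term statistics score differently under different tags.
print(element_score("par", ftf=2, ef=1000, length=32))
print(element_score("sec", ftf=2, ef=1000, length=32))
```

The point of the sketch is the tag-specific statistics: the same term in the same amount of text is weighted differently depending on the tag it is matched under, which is the contrast the slide draws with bib[“transactions”] vs. par[“transactions”].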
Inverted Block-Index for Content & Structure
Combined inverted index over merged tag-term pairs (on redundant element full-contents)
Sequential block-scans
  Group elements in descending order of (maxscore, docid) per list
  Block-scan all elements per doc for a given (tag, term) key
Stored as inverted files or database tables (two B+-tree indexes over full range of attributes)
Sorted Access (SA), Random Access (RA)

sec[“xml”]
eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

title[“native”]
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    14   10    0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

par[“retrieval”]
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75
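The block ordering described on this slide can be reproduced on the sec[“xml”] entries. The sketch below is an illustration only: build_block_list is a hypothetical name, and keeping the elements inside a block in pre-order is an assumption that happens to match the table.

```python
from collections import defaultdict

# (eid, docid, score, pre, post) entries for sec["xml"], taken from the slide.
ENTRIES = [
    (84, 3, 0.1, 1, 12),
    (9, 2, 0.5, 10, 8),
    (171, 5, 0.85, 1, 20),
    (46, 2, 0.9, 2, 15),
]


def build_block_list(entries):
    """Order one (tag, term) list into per-document blocks."""
    by_doc = defaultdict(list)
    for entry in entries:
        by_doc[entry[1]].append(entry)
    blocks = []
    for docid, elems in by_doc.items():
        max_score = max(e[2] for e in elems)
        # Assumption: elements inside a block stay in document (pre) order.
        elems.sort(key=lambda e: e[3])
        blocks.append((max_score, docid, elems))
    # Descending (maxscore, docid), as stated on the slide.
    blocks.sort(key=lambda blk: (blk[0], blk[1]), reverse=True)
    # Append the block's max-score to every row, like the last table column.
    return [e + (max_score,) for max_score, _, elems in blocks for e in elems]


index_list = build_block_list(ENTRIES)
for row in index_list:
    print(row)
```

A sorted access then reads whole document blocks in this order, so element 9 is delivered together with element 46 even though its own score is only 0.5.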
Navigational Index
Query conditions: sec, title[“native”], par[“retrieval”]

sec (navigational index), C=1.0
eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12

title[“native”] and par[“retrieval”]: scored index lists as above, read via Sorted Access (SA) and Random Access (RA)

Additional element directory
  Random accesses on B+-tree index using (docid, tag) as key
  Carefully scheduled probes
Schema-oblivious indexing & querying
  Non-schematic, heterogeneous data sources (no DTD required)
  Supports full NEXI syntax
  Supports all 13 XPath axes (+level)
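The pre/post columns make structural tests cheap once the entries of one document are at hand. A minimal sketch of the descendant test follows, using values copied from the tables. It assumes that the second and third scored tables belong to title[“native”] and par[“retrieval”] respectively, and is_descendant is a hypothetical helper name.

```python
def is_descendant(candidate, ancestor):
    """True if candidate lies below ancestor in the same document.

    Both arguments are (docid, pre, post) triples.
    """
    c_doc, c_pre, c_post = candidate
    a_doc, a_pre, a_post = ancestor
    return c_doc == a_doc and a_pre < c_pre and c_post < a_post


# (docid, pre, post) values copied from the tables on the slides.
sec_46 = (2, 2, 15)
sec_9 = (2, 10, 8)
title_51 = (2, 4, 12)
par_28 = (2, 8, 14)

print(is_descendant(title_51, sec_46))  # True: 2 < 4 and 12 < 15
print(is_descendant(par_28, sec_46))    # True: 2 < 8 and 14 < 15
print(is_descendant(par_28, sec_9))     # False: pre 8 does not follow pre 10
```

Under that assumption, document 2 contains a sec element (eid 46) with both a matching title and a matching par below it, while the second sec element (eid 9) has neither.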
TopX Query Processor
Adapt Threshold Algorithm (TA) paradigm [Fagin et al., PODS ‘01]
  Focus on inexpensive SA & postpone expensive RA (NRA & CA)
  Keep intermediate top-k & enqueue partially evaluated candidates
Lower/Upper score guarantees for each candidate d
  Remember set of evaluated query dimensions E(d)
Early min-k threshold termination
  Return current top-k, iff …
TopX core engine [VLDB ’04]
  SA batching & efficient queue management
  Multi-threaded SA & query processing
  Probabilistic cost model for RA scheduling
  Probabilistic candidate pruning for approximate top-k results
XML engine [VLDB ’05]
  Efficiently deals with uncertainty in the structure & content (“andish XPath”)
  Controlled amount of RA (unique among current XML-top-k engines)
  Dynamically switch between document & element granularity
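To make the lower/upper bound bookkeeping and the min-k threshold test concrete, here is a generic NRA-style sketch in the spirit of Fagin's family of algorithms. It is not the TopX engine: it works on plain document-level lists, uses sorted access only, and omits batching, RA scheduling, and XML structure. The name nra_top_k and the sample lists are hypothetical.

```python
import heapq


def nra_top_k(lists, k):
    """Round-robin sorted access over score-descending lists of (doc, score).

    Returns the top-k documents with their lower score bounds.
    """
    m = len(lists)
    seen = {}  # doc -> {list index: score}; the keys play the role of E(d)
    high = [lst[0][1] if lst else 0.0 for lst in lists]
    depth = 0
    while any(depth < len(lst) for lst in lists):
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score  # no later entry in list i can score higher
            else:
                high[i] = 0.0    # list exhausted
        depth += 1

        # Lower bound: scores seen so far. Upper bound: add high[i] for
        # every query dimension not yet evaluated for this candidate.
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {
            d: worst[d] + sum(high[i] for i in range(m) if i not in s)
            for d, s in seen.items()
        }
        top = heapq.nlargest(k, worst, key=worst.get)
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        rest_best = max((best[d] for d in seen if d not in top), default=0.0)
        unseen_best = sum(high)
        # Stop once no other candidate and no unseen document can beat min-k.
        if max(rest_best, unseen_best) <= min_k:
            return [(d, worst[d]) for d in top]

    worst = {d: sum(s.values()) for d, s in seen.items()}
    top = heapq.nlargest(k, worst, key=worst.get)
    return [(d, worst[d]) for d in top]


lists = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)],
    [("d2", 0.7), ("d1", 0.5), ("d3", 0.2)],
]
print(nra_top_k(lists, k=1))
```

In this example the loop stops after two rounds of sorted accesses without ever reading d3, because neither d1's upper bound nor any unseen document can exceed d2's lower bound.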
Cost-based Scheduling (CA) – Ben Probing

Goal: Minimize overall execution cost #SA + cR/cS · #RA
  Access costs on d are wasted, if d does not make it into the final top-k (considering both structural selectivities & content scores)
Probabilistic cost model comparing different types of Expected Wasted Costs
  EWC-RAs(d) of looking up d in the remaining structure
  EWC-RAc(d) of looking up d in the remaining content
  EWC-SA(d) of not seeing d in the next batch of b SAs
BenProbe: Schedule batch of RAs on d, iff
  #EWC-RAs|c(d) · cR/cS < #EWC-SA
  Bounds the ratio between #RA and #SA
  Schedule RAs late & last
  Schedule RAs in asc. order of EWC-RAs|c(d)
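The BenProbe rule reduces to a comparison per candidate once the expected wasted costs have been estimated. The sketch below shows only that final decision step. How the EWC values are estimated is not covered, treating EWC-SA as a single value per batch is an assumption, and the function names and sample numbers are hypothetical.

```python
def total_cost(num_sa, num_ra, cost_ratio):
    """Overall execution cost #SA + (cR/cS) * #RA, in units of one SA."""
    return num_sa + cost_ratio * num_ra


def schedule_random_accesses(candidates, ewc_sa, cost_ratio):
    """Pick the candidates worth probing now, cheapest expected waste first.

    candidates: dict mapping a candidate id to its estimated EWC-RA value
    ewc_sa:     estimated EWC-SA for the next batch of sorted accesses
    cost_ratio: cR/cS
    """
    chosen = [
        d for d, ewc_ra in candidates.items()
        if ewc_ra * cost_ratio < ewc_sa
    ]
    # Probe in ascending order of EWC-RA.
    return sorted(chosen, key=candidates.get)


# Hypothetical estimates for three candidates.
estimates = {"a": 0.02, "b": 0.5, "c": 0.01}
print(schedule_random_accesses(estimates, ewc_sa=10, cost_ratio=100))
print(total_cost(num_sa=1000, num_ra=5, cost_ratio=100))
```

With a cost ratio of 100, candidate b is not probed because its expected wasted random-access cost (50) exceeds the expected wasted cost of continuing with sorted accesses (10).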
Split the query into a set of basic, characteristic XML patterns:
Data Collections & Competitors
INEX ‘04 Ad-hoc Track setting
  IEEE collection with 12,223 docs & 12M elements in 534 MB XML data
  46 NEXI queries with official relevance judgments and a strict quantization
  e.g., //article[.//bib=“QBIC” and .//par=“image retrieval”]
TREC ‘04 Robust Track setting
  AQUAINT news collection with 528,155 docs in 1,904 MB text data
  50 “hard” queries from TREC Robust Track ‘04 with official relevance judgments
  e.g., “transportation tunnel disasters” or “Hubble telescope achievements”
Competitors for XML setup
  DBMS-style Join&Sort
    Using index full scans on the TopX index (Holistic Twig Joins)
  StructIndex [Kaushik et al., SIGMOD ’04]
    Top-k with separate indexes for content & structure
    DataGuide-like structural index
    Eager RAs (Fagin’s TA)
  StructIndex+
    Extent chaining technique for DataGuide-based extent identifiers
Careful WordNet expansions using automatic Word Sense Disambiguation & phrase detection [WebDB ’03 & PKDD ’05] with (m<118)
MinProbe RA scheduling for phrase matching (auxiliary term-offset table)
Incremental Merge + Nested Top-k (mtop< 22) vs. Static Expansions (mtop< 118)
[Figure: rel. Prec, P@10, and MAP (y-axis 0.0 to 1.0) as a function of ε (0.0 to 1.0), Incr. Merge vs. Static Expansion]
[Figure: #SA and #RA in millions (y-axis 0 to 20) as a function of ε (0.0 to 1.0), Incr. Merge vs. Static Expansion]
50 Keyword + Phrase Queries
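For readers unfamiliar with the plotted measures, the following sketch shows how they are commonly computed. P@10 and average precision follow their standard definitions (MAP is the mean of average precision over all queries). The relative-precision function assumes it means the overlap between an approximate top-k and the exact top-k, which the slide does not spell out. All function names and sample data are hypothetical.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the first k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant result."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def relative_precision(approx_top_k, exact_top_k):
    """Assumed definition: share of the exact top-k found by the approximation."""
    return len(set(approx_top_k) & set(exact_top_k)) / len(exact_top_k)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_k(ranked, relevant, k=2))
print(average_precision(ranked, relevant))
print(relative_precision(["a", "b"], ["a", "c"]))
```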
Conclusions
Efficient and versatile TopX query processor
  Extensible framework for XML-IR & full-text search
  Very good precision/runtime ratio for probabilistic candidate pruning
  Self-tuning solution for robust query expansions & IR-style vague search
  Combined SA and RA scheduling close to lower bound for CA access cost [Submitted for VLDB ’06]
Scalability
  Optimized for query processing IO
  Exploits cheap disk space for redundant index structures (constant redundancy factor of 4-5 for INEX IEEE)
  Extensive TREC Terabyte runs with 25,000,000 text documents (426 GB)
INEX 2006
  New Wikipedia XML collection with 660,000 documents & 120,000,000 elements (~ 6 GB raw XML)
  Official host for the Topic Development and Interactive Track (69 groups registered worldwide)
  TopX WebService available (SOAP connector)
That’s it. Thank you!
TREC Terabyte: Comparison of Scheduling Strategies