ESTER: Efficient Search on Text, Entities, and Relations

Holger Bast
Max-Planck-Institut für Informatik, Saarbrücken, Germany

joint work with Alexandru Chitea, Fabian Suchanek, Ingmar Weber

Talk at SIGIR’07 in Amsterdam, July 26th
ESTER
It’s about: Fast Semantic Search
Keyword Search vs. Semantic Search
Keyword search
– Query: john lennon
– Answer: documents containing the words john and lennon
Semantic search
– Query: musician
– Answer: documents containing an instance of musician
Combined search
– Query: beatles musician
– Answer: documents containing the word beatles and an instance of musician
Useful by itself or as a component of a QA system
Semantic Search: Challenges + Our System
1. Entity recognition
– approach 1: let users annotate (semantic web)
– approach 2: annotate (semi-)automatically
– our system: uses Wikipedia links + learns from them
2. Query Processing
– build a space-efficient index
– which enables fast query answers
– our system: as compact and fast as a standard full-text engine
3. User Interface
– easy to use
– yet powerful query capabilities
– our system: standard interface with interactive suggestions
(Item 2, query processing, is the focus of the paper and of this talk.)
In the Rest of this Talk …
Efficiency
– three simple ideas (which all fail)
– our approach (which works)
Queries supported
– essentially all SPARQL queries, and
– seamless integration with ordinary full-text search
Experiments
– efficiency (great)
– quality (not so great yet)
Conclusions
– lots of interesting + challenging open problems
Efficiency: Simple Idea 1
Add “semantic tags” to the document
– e.g., add the special word tag:musician before every occurrence of a musician in a document
Problem 1: Index blowup
– e.g., John Lennon is a: Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, … (28 classes)
Problem 2: Limited querying capabilities
– e.g., could not produce a list of musicians that occur in documents that also contain the word beatles
– in particular, could not do all SPARQL queries (more on that later)
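To make the blowup concrete, here is a minimal sketch (in Python; the class table is a tiny made-up stand-in, not the paper's code) of what prepending a tag:<class> word for every class of every entity occurrence does to the token count:

# Minimal sketch of "Simple Idea 1": prepend a tag:<class> word for every
# class of every recognized entity occurrence. The class table below is a
# tiny made-up stand-in; John Lennon really has 28 classes in the ontology.
classes_of = {
    "john_lennon": ["musician", "singer", "composer", "artist", "person"],
}

def tag_document(tokens, entity_at):
    """entity_at maps token positions to recognized entities."""
    out = []
    for i, token in enumerate(tokens):
        for cls in classes_of.get(entity_at.get(i, ""), []):
            out.append("tag:" + cls)          # one extra word per class
        out.append(token)
    return out

doc = "john lennon of the beatles".split()
print(tag_document(doc, {0: "john_lennon"}))
# 5 words become 10 -> the index blows up by the number of classes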
Efficiency: Simple Idea 2
Query Expansion
– e.g., replace query word musician by disjunction
musician:aaron_copland OR … OR musician:zarah_leander
(7,593 musicians in Wikipedia)
Problem: Inefficient query processing
– one intersection per element of the disjunction needed
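For illustration, a sketch of why the expansion is slow: the posting list of beatles must be intersected once per disjunct, i.e. 7,593 times for musician. The index contents and function names here are illustrative assumptions:

# Sketch of "Simple Idea 2": evaluate "beatles musician" by expanding
# musician into one artificial word per musician entity. Posting lists
# are sorted doc-id lists; the index contents are made up.
INDEX = {
    "beatles": [3, 7, 12],
    "musician:john_lennon": [7, 9],
    "musician:aaron_copland": [5],
}

def intersect(a, b):
    """Standard sorted-list intersection."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def expanded_query(word, entity_words):
    # One intersection per element of the disjunction -- for musician
    # that is 7,593 intersections, which is what makes this idea slow.
    hits = set()
    for ew in entity_words:
        hits.update(intersect(INDEX[word], INDEX.get(ew, [])))
    return sorted(hits)

print(expanded_query("beatles", ["musician:john_lennon",
                                 "musician:aaron_copland"]))  # [7]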
Efficiency: Simple Idea 3
Use a database
– map semantic queries to SQL queries on suitably constructed tables
– that’s what the Artificial-Intelligence / Semantic-Web people usually do
Problem: Inefficient + Lack of control
– building a search engine on top of an off-the-shelf database is orders of magnitude slower or uses orders of magnitude more space, or both
– very limited control regarding efficiency aspects
Efficiency: Our Approach
Two basic operations
– prefix search of a special kind [will be explained by example]
– join [will be explained by example]
An index data structure
– which supports these two operations efficiently
Artificial words in the documents
– such that a large class of semantic queries reduces to a combination of (few of) these operations
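A minimal sketch of the two operations, assuming a flat list of (doc, position, word) postings; the real index (HYB, described below) stores these compressed per word range, and the signatures here are illustrative:

# Minimal sketch of the two basic operations. Postings are (doc, pos,
# word) triples; the flat list and the signatures are illustrative, the
# real index is the HYB structure described later in the talk.
POSTINGS = sorted([
    (7, 3, "beatles"), (7, 4, "entity:john_lennon"),
    (9, 0, "entity:john_lennon"), (9, 1, "relation:is_a"),
    (9, 2, "class:musician"),
])

def prefix_search(postings, prefix, docs=None):
    """Prefix search of a special kind: all occurrences of words matching
    prefix, optionally restricted to a given document set."""
    return [(d, p, w) for (d, p, w) in postings
            if w.startswith(prefix) and (docs is None or d in docs)]

def join(completions_a, completions_b):
    """Join two completion lists on the matched word (the entity)."""
    words_b = {w for (_, _, w) in completions_b}
    return sorted({w for (_, _, w) in completions_a if w in words_b})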
Processing the query “beatles musician”

Document “Gitanes”:
… legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …

Document “John Lennon” (numbers are word positions):
0 entity:john_lennon  1 relation:is_a  2 class:musician  2 class:singer  …

Two prefix queries:
– beatles entity:*  →  entity:john_lennon, entity:1964, entity:liverpool, etc.
– entity:* . relation:is_a . class:musician  →  entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.
One join of the two completion lists  →  entity:john_lennon, etc.
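Continuing the prefix_search/join sketch above, the figure's query could be evaluated roughly like this (the co-occurrence and is_a checks are simplified to document-level checks):

# Query 1: entities co-occurring with "beatles".
beatles_docs = {d for (d, _, _) in prefix_search(POSTINGS, "beatles")}
q1 = prefix_search(POSTINGS, "entity:", docs=beatles_docs)

# Query 2: entities that are musicians, via the ontology documents.
musician_docs = {d for (d, _, _) in prefix_search(POSTINGS, "class:musician")}
q2 = prefix_search(POSTINGS, "entity:", docs=musician_docs)

print(join(q1, q2))  # ['entity:john_lennon']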
Processing the query “beatles musician”

Problem: entity:* has a huge number of occurrences
– ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences
– prefix search is efficient only for up to ≈ 1% (explanation follows)

Solution: frontier classes
– classes at an “appropriate” level in the hierarchy
– e.g.: artist, believer, worker, vegetable, animal, …
Processing the query “beatles musician” (with frontier classes)

Document “Gitanes”:
… legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …

Document “John Lennon” (numbers are word positions):
0 artist:john_lennon  0 believer:john_lennon  1 relation:is_a  2 class:musician  …

First figure out: musician → artist (easy)
Two prefix queries:
– beatles artist:*  →  artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
– artist:* . relation:is_a . class:musician  →  artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
One join  →  artist:john_lennon, etc.
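A sketch of the rewriting step, assuming a precomputed class-to-frontier-class mapping (the mapping below is made up; ESTER derives it from the class hierarchy):

# Sketch of the rewriting step ("first figure out: musician -> artist").
# The mapping is illustrative; in ESTER it comes from the hierarchy, so
# that every class lies below exactly one frontier class.
frontier_of = {"musician": "artist", "singer": "artist",
               "pacifist": "believer"}

def rewrite(keyword, cls):
    """Turn a combined query into the two prefix queries to execute."""
    f = frontier_of[cls]                          # e.g. musician -> artist
    return (f"{keyword} {f}:*",                   # entities near the keyword
            f"{f}:* . relation:is_a . class:{cls}")  # entities of the class

print(rewrite("beatles", "musician"))
# ('beatles artist:*', 'artist:* . relation:is_a . class:musician')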
The HYB Index [Bast/Weber, SIGIR’06]

Maintains lists for word ranges (not words), e.g. one list for the range abl-abt:

abl-abt   Doc.    12      83      83      187       …
          Pos.    5       14      124     88        …
          Score   0.5     0.2     0.7     0.4       …
          Word    able    ablaze  abroad  abnormal  …

… and, with artificial words, one list for the range person:*:

person:*  Doc.    17    23    72    72    …
          Pos.    12    3     55    59    …
          Score   0.1   0.5   0.3   0.5   …
          Word    person:john_lennon  person:ringo_starr  person:graham_greene  person:john_lennon  …

Provably efficient
– no more space than an inverted index (on the same data)
– each query = scan of a moderate number of (compressed) items

Extremely versatile
– can do all kinds of things an inverted index cannot do (efficiently)
– autocompletion, faceted search, query expansion, error correction, select and join, …
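A toy sketch of the HYB idea, with uncompressed blocks and made-up block boundaries (real HYB compresses each field stream and chooses block sizes so that the bounds above hold):

# Toy sketch of a HYB-style index: one list per *word range* (block),
# each item carrying (doc, pos, score, word). Real HYB stores the four
# fields as separate compressed streams; the boundaries here are made up.
BLOCKS = {
    ("abl", "abt"): [
        (12, 5, 0.5, "able"), (83, 14, 0.2, "ablaze"),
        (83, 124, 0.7, "abroad"), (187, 88, 0.4, "abnormal"),
    ],
    ("person:", "person:zz"): [
        (17, 12, 0.1, "person:john_lennon"),
        (23, 3, 0.5, "person:ringo_starr"),
        (72, 55, 0.3, "person:graham_greene"),
        (72, 59, 0.5, "person:john_lennon"),
    ],
}

def prefix_query(prefix):
    """Scan the few blocks whose word range can contain the prefix and
    filter inside them -- one scan instead of one list per word."""
    hi_bound = prefix + "\uffff"             # upper end of the prefix range
    hits = []
    for (lo, hi), items in BLOCKS.items():
        if hi >= prefix and lo <= hi_bound:  # block range overlaps prefix range
            hits += [it for it in items if it[3].startswith(prefix)]
    return hits

print(prefix_query("person:"))  # all indexed occurrences of person:* words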
SPARQL = SPARQL Protocol And RDF Query Language (yes, it’s recursive)
Queries we can handle
We prove the following theorem:
– Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations
SELECT ?who WHERE {
  ?who is_a Musician .
  ?who born_in_year ?when .
  John_Lennon born_in_year ?when
}
(musicians born in the same year as John Lennon)

ESTER achieves seamless integration with full-text search
– SPARQL has no means for dealing with full-text search
– XQuery can handle full-text search, but is not really suitable for semantic search
more about supported queries in the paper
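A schematic sketch of the reduction behind the theorem: each edge (triple pattern) of the query graph costs one prefix operation plus at most one join with the candidates collected so far, giving at most 2m operations. The traversal below is a simplification for illustration, not the paper's exact construction:

# Schematic sketch of the "at most 2m operations" reduction. Variable
# handling is heavily simplified; the two callables stand in for the
# index operations sketched earlier in the talk.
def answer_graph_query(edges, prefix_op, join_op):
    """edges: triple patterns like ('?who', 'is_a', 'Musician').
    prefix_op(edge) -> completion list for the edge's variable.
    join_op(a, b)   -> completions present in both lists."""
    candidates, ops = None, 0
    for edge in edges:
        completions = prefix_op(edge)                      # 1 prefix operation
        ops += 1
        if candidates is None:
            candidates = completions
        else:
            candidates = join_op(candidates, completions)  # at most 1 join
            ops += 1
    assert ops <= 2 * len(edges)                           # the theorem's bound
    return candidates

query = [("?who", "is_a", "Musician"),
         ("?who", "born_in_year", "?when"),
         ("John_Lennon", "born_in_year", "?when")]
# e.g.: answer_graph_query(query, my_prefix_op, my_join_op)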
Experiments: Corpus, Ontology, Index
Corpus: English Wikipedia (xml dump from Nov. 2006)
≈ 8 GB raw xml
≈ 2.8 million documents
≈ 1 billion words
Ontology: YAGO (Suchanek/Kasneci/Weikum, WWW’07)
≈ 2.5 million facts
derived from a clever combination of Wikipedia + WordNet (entities from Wikipedia, taxonomy from WordNet)
Our Index
≈ 1.5 billion words (original + artificial)
≈ 3.3 GB total index size; ontology-only is a mere 100 MB
Note: our system works for an arbitrary corpus + ontology
Experiments: Efficiency — What Baseline?
SPARQL engines
– can’t do text search
– and slow for ontology-only too (on Wikipedia: seconds)
XQuery engines
– extremely slow for text search (on Wikipedia: minutes)
– and slow for ontology-only too (on Wikipedia: seconds)
Other prototypes which do semantic + full-text search
– efficiency is hardly considered
– e.g., the system of Castells/Fernandez/Vallet (TKDE’07)
“… average informally observed response time on a standard professional desktop computer [of] below 30 seconds [on 145,316 documents and an ontology with 465,848 facts] …”
– our system: ~100ms, 2.8 million documents, 2.5 million facts
Experiments: Efficiency — Stress Test 1
Compare to ontology-only system
– the YAGO engine from WWW’07
– Onto Simple: when was [person] born [1000 queries]
– Onto Advanced: list all people from [profession] [1000 queries]
– Onto Hard: when did people die who were born in the same year as [person] [1000 queries]
Note: comparison very unfair (for our system)
                 Our system (100 MB index)    Onto-Only (4 GB index)
                 avg.     max.                avg.     max.
Onto Simple      2 ms     5 ms                3 ms     20 ms
Onto Advanced    9 ms     31 ms               3 ms     794 ms
Onto Hard        64 ms    208 ms              78 ms    550 ms
Experiments: Efficiency — Stress Test 2
Compare to text-only search engine
– state-of-the-art system from SIGIR’06
– Onto+Text Easy: counties in [US state] [50 queries]
– Onto+Text Hard: computer scientists [nationality] [50 queries]
– Full-text query: e.g. german computer scientists (note: hardly finds relevant documents)

                  Our system          Full-Text Only
                  avg.      max.      avg.     max.
Onto+Text Easy    224 ms    772 ms    90 ms    498 ms
Onto+Text Hard    279 ms    502 ms    44 ms    85 ms

Note: comparison extremely unfair (for our system)
Experiments: Quality — Entity Recognition
Use Wikipedia links as hints
– “… following [[John Lennon | Lennon]] and Paul McCartney, two of the Beatles, …”
– “… The southern terminus is located south of the town of [[Lennon, Michigan | Lennon]] …”
Learn other links
– use words in neighborhood as features
Accuracy
             all words   2 senses   3 senses   ≥4 senses
             93.4%       88.2%      84.4%      80.3%
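A minimal sketch of the learning setup described above: each Wikipedia link yields a labeled training example (the words in the anchor's neighborhood as features, the link target as label). The overlap scoring below is an illustrative stand-in, not necessarily the paper's classifier:

from collections import Counter

# Minimal sketch of learning disambiguation from Wikipedia links:
# [[Target | anchor]] -> (label = Target, features = context words).
training = [
    ("john_lennon", "following lennon and paul mccartney two of the beatles"),
    ("lennon_michigan", "southern terminus south of the town of lennon"),
]

profiles = {}  # entity -> bag of context words
for entity, context in training:
    profiles.setdefault(entity, Counter()).update(context.split())

def disambiguate(context):
    """Pick the entity whose learned context profile overlaps most."""
    words = Counter(context.split())
    return max(profiles, key=lambda e: sum((profiles[e] & words).values()))

print(disambiguate("lennon of the beatles smoked gitanes"))  # john_lennon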
Experiments: Quality — Relevance
2 Query Sets
– People associated with [american university] [100 queries]
– Counties of [american state] [50 queries]
Ground truth
– Wikipedia has corresponding lists
e.g., List of Carnegie Mellon University People
Precision and Recall
             precision@10   recall
PEOPLE       37.3%          89.7%
COUNTIES     66.5%          97.8%
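The measurements correspond to the usual definitions; a small sketch with made-up toy data:

# Sketch of the evaluation against a Wikipedia ground-truth list, e.g.
# "List of Carnegie Mellon University People". All data here is made up.
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are in the ground truth."""
    return sum(1 for r in ranked[:k] if r in relevant) / k

def recall(ranked, relevant):
    """Fraction of the ground truth that appears in the results."""
    return sum(1 for r in set(ranked) if r in relevant) / len(relevant)

ground_truth = {"herbert_simon", "allen_newell", "alan_perlis"}
ranked = ["herbert_simon", "some_false_hit", "allen_newell"]
print(precision_at_k(ranked, ground_truth), recall(ranked, ground_truth))
# 0.2 and 0.666...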
Conclusions
Semantic Retrieval System ESTER
– fast and scalable via reduction to prefix search and join
– can handle all basic SPARQL queries
– seamless integration with full-text search
– standard user interface with (semantic) suggestions
Lots of interesting and challenging problems
– simultaneous ranking of entities and documents
– proper snippet generation and highlighting
– search result quality
– …

Dank je wel! (Thank you!)
Context-Sensitive Prefix-Search
Compute completions of last query word
– which together with the previous part of the query would lead to a hit
– [DEMO: show a live example]
Extremely useful
– autocompletion search
– faceted search
– error correction, synonym search, …
– category search
  for example, add place:amsterdam; then the query place:* finds all instances of a place
(formal definition in the paper)
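A toy sketch of context-sensitive prefix search over a word-to-document-set index (made-up data): only completions of the last query word that co-occur with the rest of the query are suggested:

# Toy sketch of context-sensitive prefix search: suggest only those
# completions of the last query word that lead to a hit together with
# the previous part of the query. Index contents are made up.
INDEX = {
    "beatles": {1, 2},
    "musician:john_lennon": {1, 3},
    "musician:paul_mccartney": {2},
    "musician:johann_sebastian_bach": {4},
}

def completions(query_words, last_prefix):
    """Completions of last_prefix that lead to a hit with query_words."""
    hits = set.intersection(*(INDEX[w] for w in query_words))
    return sorted(w for w, docs in INDEX.items()
                  if w.startswith(last_prefix) and docs & hits)

print(completions(["beatles"], "musician:"))
# ['musician:john_lennon', 'musician:paul_mccartney'] -- Bach is filtered out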
Isn’t the last idea enough for semantic search?
DEMO
Do the following queries [live or recorded]
– beatles
– beatles musi
– beatles musicia
– beatles musician:john_lennon (or beatles entity:john_lennon)
Processing the query “beatles musician”

Document “Liverpool” [one of many documents mentioning John Lennon]:
… in honor of the late Beatle entity:john_lennon …

Document “John Lennon” (numbers are word positions):
0 entity:john_lennon  1 r:is_a  2 class:musician  2 class:singer  …

Queries: beatles entity:*  and  “entity:* r:is_a class:musician”

Problem: entity:* has a huge number of occurrences
– ≈ 200 million for Wikipedia = 20% of all occurrences
– prefix search efficient only for up to ≈ 1%

Solution: frontier set
– classes high up in the hierarchy
– e.g.: person, animal, substance, abstraction, …
Processing the query “beatles musician” (with frontier set)

Document “Liverpool” [one of many documents mentioning John Lennon]:
… in honour of the late Beatle person:john_lennon …

Document “John Lennon” (numbers are word positions):
0 person:john_lennon  1 r:is_a  2 class:musician  2 class:singer  …

Two prefix queries:
– beatles person:*  →  person:john_lennon, person:the_queen, person:pete_best, etc.
– “person:* r:is_a class:musician”  →  person:wolfgang_amadeus_mozart, person:johann_sebastian_bach, person:john_lennon, etc.
One join  →  person:john_lennon, etc.
Our Solution, Version 1

Combination of Prefix Search + Join
– Query 1: beatles entity:*  →  entities co-occurring with beatles
– Query 2: musician entity:*  →  entities which are musicians
– Join the completions from 1 & 2  →  musicians co-occurring with beatles

Some document about Albert Einstein:
… entity:einstein …

Document “Albert Einstein”:
entity:albert_einstein  scientist  vegetarian  intellectual  …

But: unspecific prefixes (entity:*) are hard
Our Solution, Version 2

Combination of Prefix Search + Join
– Query 1: translate:singer:*  →  tells us that a singer is a musician
– Query 2: beatles musician:*  →  musicians co-occurring with beatles
– Query 3: singer musician:*  →  musicians which are singers
– Join the completions from 2 & 3  →  singers co-occurring with beatles

Some document mentioning John Lennon:
… musician:john_lennon xyz:john_lennon …

Document “John Lennon”:
musician:john_lennon  xyz:john_lennon  …

[Special Doc]
TRANSLATE:singer:musician
John Lennon at the Royal Variety Show in 1963, in the presence of members of the British royalty:
"Those of you in the cheaper seats can clap your hands. The rest of you, if you'll just rattle your jewellery."