Roi Blanco ([email protected]) Large-Scale Semantic Search http://labs.yahoo.com/Yahoo_Labs_Barcelona
Semantic Search
• Gain insights/value over your data
  – Aggregate
  – Search
• Adding an “understanding” layer to the stages of a search engine
  – Typically very hard, limited success, slow, no clear benefits or application …
  – Boils down to generating structure over unstructured text
• Currently, (more or less) confined within “entity-search”
  – Identifying (or extracting) real-world concepts in free text, with types
  – Although that shouldn’t be the end!
• Borrows from different fields (IR, SW, NLP, DB)
  – Large scale = only the efficient/reliable parts
Search is really fast, without necessarily being intelligent
Why Semantic Search? Part I
• Improvements in IR are harder and harder to come by
  – Machine learning using hundreds of features
    • Text-based features for matching
    • Graph-based features provide authority
  – Heavy investment in computational power, e.g. real-time indexing and instant search
• Remaining challenges are not computational, but in modeling user cognition
  – Need a deeper understanding of the query, the content and/or the world at large
  – Could Watson explain why the answer is Toronto?
Ambiguity
What it’s like to be a machine?
Roi Blanco
What it’s like to be a machine?
✜Θ♬♬ţğ
✜Θ♬♬ţğ √∞ §®ÇĤĪ✜★♬☐✓✓ţğ★✜
✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫Γ≠=⅚ ©§ ★✓♪ΒΓΕ℠
✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ⏎⌥°¶§ ΥΦΦΦ ✗✕☐
Poorly solved information needs
• Multiple interpretations
  – paris hilton
• Long tail queries
  – george bush (and I mean the beer brewer in Arizona)
• Multimedia search
  – paris hilton sexy
• Imprecise or overly precise searches
  – jim hendler
  – pictures of strong adventures people
• Searches for descriptions
  – countries in africa
  – 34 year old computer scientist living in barcelona
  – reliable digital camera under 300 dollars
Many of these queries would not be asked by users, who have learned over time what search technology can and cannot do.
Use cases in web search
Top-1 entity with structured data
Related entities
Structured data extracted from HTML
Semantics at every step of the IR process
[Figure: the IR engine and the Web, with the stages document processing, indexing, ranking θ(q,d), query interpretation and result presentation. Example query: q=“bla” * 3]
Usability
Sometimes we also fail at using the technology
Annotated documents
Barack Obama visited Tokyo this Monday as part of an extended Asian trip.
He is expected to deliver a speech at the ASEAN conference next Tuesday
20 May 2009
28 May 2009
oakland as bradd pitt movie moneyball trailer movies.yahoo.com oakland as wikipedia.org
Semantic annotations help to generalize…
Sports team
Movie
Actor
… and understand user needs
moneyball trailer
what the user wants to do with it
Movie
Object of the query
“A child of five would understand this. Send someone to fetch a child of five.”
Groucho Marx
Is NLU that complex?
Applications
• Enhanced search
  – Better query understanding
  – Better ranking (tail/hard queries)
  – Better results presentation
  – Use heavy types, dependencies + WSD
    • Advisable to use models that minimize overfitting (Blanco & Boldi. Extending BM25 with multiple query operators. SIGIR 2012)
• Recommender systems
  – Structured data helps cross-domain recommendation
• Diversity in search/recommendations
• Crazy prototypes!
  – From Q&A to mining/retrieving heavily annotated information
    • Even predictions about the future! (Matthews et al., 2010. Searching over time in the NYT. HCIR 2010)
    • Or systems that return entity-grained answers
Other applications
• Frequent pattern mining over queries
  – PrefixSpan algorithm (movies)
• Types as items
  – Film queries are more common than Actor queries
• Attributes as items
  – Trailers and DVD are most commonly searched for new movie releases
  – Cast and quote queries are most common for older movies
• Abandonment
  – ML model to predict when users abandon a site in favor of the competition
    • Combination of attributes, types for past two queries
    • Tree ensemble ~ set of positive/negative patterns
L. Hollink, P. Mika and R. Blanco. Web Usage Mining with Semantic Analysis. WWW 2013
How does correlator work?
Monty Python
Inverted Index (sentence/doc level)
Forward Index (entity level)
Flying Circus, John Cleese, Brian
Parallel Indexes
• Standard index contains only tokens
• Parallel indices contain annotations on the tokens; the annotation indices must be aligned with the main token index
• For example: given the sentence “New York has great pizza” where New York has been annotated as a LOCATION
  – Token index has five entries (“new”, “york”, “has”, “great”, “pizza”)
  – The annotation index has five entries (“LOC”, “LOC”, “O”, “O”, “O”)
  – Can optionally encode BIO format (e.g. LOC-B, LOC-I)
• To search for the New York location entity, we search for: “token:New ^ entity:LOC token:York ^ entity:LOC”
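The alignment between the token index and the annotation index can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the engine's actual data structure; the function names `build_parallel_index` and `find_entity` and the `(start, end, label)` span format are assumptions made for the example.

```python
from typing import List, Tuple

def build_parallel_index(tokens: List[str],
                         spans: List[Tuple[int, int, str]],
                         bio: bool = False):
    """Return (token_index, annotation_index), aligned position by position.

    `spans` are hypothetical (start, end_exclusive, label) annotations.
    """
    token_index = [t.lower() for t in tokens]
    annotation_index = ["O"] * len(tokens)  # "O" marks unannotated positions
    for start, end, label in spans:
        for pos in range(start, end):
            if bio:
                # BIO variant: first position gets -B, the rest -I
                annotation_index[pos] = f"{label}-{'B' if pos == start else 'I'}"
            else:
                annotation_index[pos] = label
    return token_index, annotation_index

def find_entity(token_index, annotation_index, words, label):
    """Start positions where `words` occur consecutively and every position carries `label`."""
    words = [w.lower() for w in words]
    n = len(words)
    hits = []
    for i in range(len(token_index) - n + 1):
        tokens_match = token_index[i:i + n] == words
        labels_match = all(a.split("-")[0] == label for a in annotation_index[i:i + n])
        if tokens_match and labels_match:
            hits.append(i)
    return hits

tokens, annotations = build_parallel_index(
    "New York has great pizza".split(), [(0, 2, "LOC")]
)
```

Because both lists share positions, the `^` operator reduces to checking the token and the label at the same offset.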
Parallel Indices (II)
Doc #3: The last time Peter exercised was in the XXth century.
Doc #5: Hope claims that in 1994 she run to Peter Town.
Token index:
Peter D3:4, D5:9
Town D5:10
Hope D5:1
1994 D5:5
…
Annotation index:
WSJ:PERSON D3:4, D5:1
WSJ:CITY D5:9, D5:10
WNS:V_DATE D5:5
Possible queries:
“Peter AND run”
“Peter AND WNS:N_DATE”
“(WSJ:CITY ^ *) AND run”
“(WSJ:PERSON ^ Hope) AND run”
(Bracketing can also be dealt with)
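The queries over Doc #3 and Doc #5 can be reproduced with a small positional index. The sketch below is an assumption-laden illustration: postings are plain Python sets of (doc, position) pairs, and the helper names `aligned`, `docs_of` and `AND` are made up for the example.

```python
from collections import defaultdict

# The two example documents, with trailing punctuation dropped.
docs = {
    3: "The last time Peter exercised was in the XXth century",
    5: "Hope claims that in 1994 she run to Peter Town",
}
# Annotations from the example, keyed by (doc, 1-based position).
annotations = {
    (3, 4): "WSJ:PERSON",
    (5, 1): "WSJ:PERSON",
    (5, 9): "WSJ:CITY",
    (5, 10): "WSJ:CITY",
    (5, 5): "WNS:V_DATE",
}

token_index = defaultdict(set)
annotation_index = defaultdict(set)
for doc_id, text in docs.items():
    for pos, tok in enumerate(text.split(), start=1):
        token_index[tok].add((doc_id, pos))
for (doc_id, pos), label in annotations.items():
    annotation_index[label].add((doc_id, pos))

def aligned(label, token=None):
    """The ^ operator: postings where annotation and token share (doc, position).

    token=None plays the role of the * wildcard.
    """
    postings = annotation_index[label]
    return postings if token is None else postings & token_index[token]

def docs_of(postings):
    """Project (doc, position) postings down to document ids."""
    return {doc_id for doc_id, _ in postings}

def AND(a, b):
    """Document-level conjunction of two posting sets."""
    return docs_of(a) & docs_of(b)
```

With this data, "(WSJ:PERSON ^ Peter) AND run" matches nothing, because the only Peter tagged as a person is in Doc #3, which does not contain "run".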
Resource Description Framework (RDF)
• Each resource (thing, entity) is identified by a URI
  – Globally unique identifiers
  – Locators of information
• Data is broken down into individual facts
  – Triples of (subject, predicate, object)
• A set of triples (an RDF graph) is published together in an RDF document
[Figure: an RDF document containing the triples (example:roi, name, “Roi Blanco”) and (example:roi, type, foaf:Person)]
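The two facts about example:roi can be written down as plain (subject, predicate, object) tuples. This is only a sketch of the data model; prefixed names are kept as strings, and the helper `describe` is a hypothetical name, not part of any RDF library.

```python
from collections import defaultdict

# Triples from the example; a list of triples stands in for an RDF graph.
triples = [
    ("example:roi", "name", "Roi Blanco"),
    ("example:roi", "type", "foaf:Person"),
]

def describe(graph, subject):
    """Collect all facts about one subject as {predicate: [objects]}."""
    facts = defaultdict(list)
    for s, p, o in graph:
        if s == subject:
            facts[p].append(o)
    return dict(facts)

roi = describe(triples, "example:roi")
```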
Linked Data: interlinked RDF
[Figure: linked RDF documents (Roi’s homepage, Yahoo, the Friend-of-a-Friend ontology) with the nodes example:roi, example:roi2, example:peter, “Roi Blanco” and foaf:Person, connected by name, type, sameAs and worksWith edges]
Information access in the Semantic Web
• Database-style indexing of RDF data
  – Triple stores
  – Structural queries (SPARQL)
  – No ranking
  – Evaluation focused on efficiency
• IR-style indexing of RDF data
  – Search engines
  – Keyword queries
  – Ranking
  – Evaluation focused on effectiveness
• Combined methods
  – Keyword matching and limited join processing
Search over RDF data
• Unstructured or hybrid search over RDF data
  – Supporting end-users
    • Users who cannot express their need in SPARQL
  – Dealing with large-scale data
    • Giving up query expressivity for scale
  – Dealing with heterogeneity
    • Users who are unaware of the schema of the data
    • No single schema to the data
  – Example: 2.6m classes and 33k properties in Billion Triples 2009
• Entity search
  – Queries where the user is looking for a single entity named or described in the query
  – e.g. kaz vaporizer, hospice of cincinnati, mst3000
Conclusions
• Large-scale semantic search should become a commodity soon
  – Plenty of open source tools for extraction, linking
  – (soon) and indexing, ranking semantic information
• Research challenges ahead
  – Making all the pieces fit together
  – Using more fine-grained structured information (think of context, location, device)
Architecture overview
[Figure: documents flow through parallel map and reduce tasks into a single index]
1. Download, uncompress, convert (if needed)
2. Sort quads by subject
3. Compute Minimal Perfect Hash (MPH)
4. Each mapper reads part of the collection
5. Each reducer builds an index for a subset of the vocabulary
6. Optionally, we also build an archive (forward-index)
7. The sub-indices are merged into a single index
8. Serving and Ranking
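The "sort quads by subject" step can be illustrated on a toy scale: once quads are sorted, all statements about one subject are adjacent and can be read as a single unit. The sample quads and the function name `group_by_subject` are hypothetical, and an in-memory sort stands in for what would be a distributed sort at this data size.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical quads: (subject, predicate, object, context).
quads = [
    ("example:roi", "name", "Roi Blanco", "ctx:1"),
    ("example:peter", "type", "foaf:Person", "ctx:2"),
    ("example:roi", "type", "foaf:Person", "ctx:1"),
]

def group_by_subject(quads):
    """Sort quads by subject and yield (subject, [(predicate, object, context), ...])."""
    by_subject = sorted(quads, key=itemgetter(0))  # stable: keeps input order within a subject
    for subject, group in groupby(by_subject, key=itemgetter(0)):
        yield subject, [(p, o, c) for _, p, o, c in group]

documents = dict(group_by_subject(quads))
```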
RDF indexing using MapReduce
• Text indexing using MapReduce
  – Map: parse input into (term, doc) pairs
    • Pre-processing such as stemming, blacklisting
    • To support phrase queries values are (doc, position) pairs
  – Reduce: collect all values for the same key: (term, {doc1, doc2…}), output posting-list
    • Secondary sort to pre-sort document ids before iteration
• RDF indexing using MapReduce
  – Document is all triples with a given subject
    • Variations: index also RDF molecules, triples where the URI is an object
  – Index terms in property-values
    • Keys are (field, term) pairs
    • Variation: distinguish values for the same property
  – Index terms in the subject URI
    • Variation: index also terms in object URIs
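The text-indexing half of this slide can be simulated in a single process. The sketch below mimics the map, shuffle and reduce phases with ordinary Python functions; it is not Hadoop code. The blacklist contents and the function names are assumptions, stemming is omitted, and a plain `sorted` call stands in for the secondary sort.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of"}  # hypothetical blacklist

def map_phase(doc_id, text):
    """Emit (term, (doc, position)) pairs; positions support phrase queries."""
    for pos, token in enumerate(text.lower().split()):
        if token not in STOPWORDS:
            yield token, (doc_id, pos)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, values):
    """Build one posting list; sorting here stands in for the secondary sort."""
    return term, sorted(values)

def build_index(collection):
    pairs = [kv for doc_id, text in collection.items()
             for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(term, values)
                for term, values in shuffle(pairs).items())

index = build_index({2: "the flying circus", 1: "flying circus of monty python"})
```

For the RDF case, the same skeleton applies with the key changed from `term` to a `(field, term)` pair and the document defined as all triples sharing a subject.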
Horizontal index structure
• One field per position
  – one for object (token), one for predicates (property), optionally one for context
• For each term, store the property on the same position in the property index
  – Positions are required even without phrase queries
• Query engine needs to support fields and the alignment operator
✓ Dictionary is number of unique terms + number of properties
✓ Occurrences is number of tokens * 2
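A minimal sketch of the horizontal layout, assuming each entity is given as a list of (property, value text) pairs. The sample entity, the `affiliation` property and the function names are hypothetical; the point is only that the token field and the property field share positions, so a term can be matched against a property by intersecting postings.

```python
from collections import defaultdict

# Hypothetical entity descriptions: {subject: [(property, value_text), ...]}.
entities = {
    "example:roi": [("name", "Roi Blanco"), ("affiliation", "Yahoo")],
}

def build_horizontal(entities):
    """Build two aligned fields, each mapping a key to (subject, position) postings."""
    token_field = defaultdict(list)
    property_field = defaultdict(list)
    for subject, pairs in entities.items():
        pos = 0
        for prop, value in pairs:
            for term in value.lower().split():
                token_field[term].append((subject, pos))
                property_field[prop].append((subject, pos))  # same position as the token
                pos += 1
    return token_field, property_field

def match(token_field, property_field, term, prop):
    """Alignment operator: subjects where `term` sits at a position tagged with `prop`."""
    hits = set(token_field.get(term, [])) & set(property_field.get(prop, []))
    return {subject for subject, _ in hits}

token_field, property_field = build_horizontal(entities)
dictionary_size = len(token_field) + len(property_field)
occurrences = sum(len(p) for p in token_field.values()) + \
              sum(len(p) for p in property_field.values())
```

In this toy case the dictionary holds three terms plus two properties, and the occurrence count is twice the number of tokens, matching the two ✓ items above.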
Vertical index structure
• One field (index) per property
• Positions are not required
• Query engine needs to support fields
✓ Dictionary is number of unique terms
✓ Occurrences is number of tokens
✗ Number of fields is a problem for merging, query performance
• In experiments we index the N most common properties
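For comparison, a sketch of the vertical layout: one small inverted index per property, restricted to the N most common properties. The sample data and the name `build_vertical` are hypothetical, and dropping the remaining properties entirely is an assumption of this sketch; the slides do not say what happens to them.

```python
from collections import Counter, defaultdict

# Hypothetical entity descriptions: {subject: [(property, value_text), ...]}.
entities = {
    "example:roi": [("name", "Roi Blanco"), ("affiliation", "Yahoo")],
    "example:peter": [("name", "Peter")],
}

def build_vertical(entities, n_fields):
    """Return {property: {term: set(subjects)}} for the n_fields most common properties."""
    counts = Counter(prop for pairs in entities.values() for prop, _ in pairs)
    kept = {prop for prop, _ in counts.most_common(n_fields)}
    fields = {prop: defaultdict(set) for prop in kept}
    for subject, pairs in entities.items():
        for prop, value in pairs:
            if prop in kept:  # assumption: rarer properties are simply skipped
                for term in value.lower().split():
                    fields[prop][term].add(subject)  # no positions needed
    return fields

fields = build_vertical(entities, n_fields=1)
```

A field-restricted query such as "blanco in name" becomes a single lookup in `fields["name"]`, with no alignment step.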
Big data = data
• Modern data-sets comprise a mixture of structured and non-structured data
  – Text, news, blogs
  – Microformats, RDF
  – Images
  – Video
  – Social media (a mixture too)
• Transform unstructured data into structured data
  – Entity extraction, disambiguation
• Provide value over the data
  – Aggregation (BI)
  – Search
• Scalable semantic search
  – Power next-generation search, recommendation, analytics etc.
  – Improvements linear with resources
  – Lightweight processes, powering interactive real-time experiences
Efficiency improvements
• r-vertical (reduced-vertical) index
  – One field per weight vs. one field per property
  – More efficient for keyword queries but loses the ability to restrict per field
  – Example: three weight levels
• Pre-computation of alignments
  – Additional term-to-field index
  – Used to quickly determine which fields contain a term (in any document)
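Both ideas can be sketched together: properties are collapsed into three weight levels, and a term-to-field lookup records which fields contain each term. The assignment of properties to levels, the level names and the sample data are all hypothetical; the slides only state that three weight levels were used.

```python
from collections import defaultdict

# Hypothetical mapping of properties to three weight levels.
WEIGHT_LEVEL = {"name": "high", "label": "high", "affiliation": "medium"}
DEFAULT_LEVEL = "low"

entities = {
    "example:roi": [("name", "Roi Blanco"), ("affiliation", "Yahoo"),
                    ("comment", "works on search")],
}

def build_r_vertical(entities):
    """One field per weight level instead of one field per property."""
    fields = {level: defaultdict(set) for level in ("high", "medium", "low")}
    for subject, pairs in entities.items():
        for prop, value in pairs:
            level = WEIGHT_LEVEL.get(prop, DEFAULT_LEVEL)
            for term in value.lower().split():
                fields[level][term].add(subject)
    return fields

def term_to_field(fields):
    """Pre-computed lookup: for each term, the fields that contain it in any document."""
    lookup = defaultdict(set)
    for name, postings in fields.items():
        for term in postings:
            lookup[term].add(name)
    return lookup

fields = build_r_vertical(entities)
lookup = term_to_field(fields)
```

At query time the lookup lets the engine skip fields that cannot match a term, at the cost of no longer being able to restrict a query to a single property.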
Indexing efficiency
• Billion Triples 2009 dataset
  – 249 GB in uncompressed N-Quad
  – 114 million URIs and 274 million triples with datatype properties
  – 2.9B / 1.4B occurrences (horiz/vert)
• Selected 300 most frequent datatype properties for vertical indexing
• Resulting index is 9-10 GB in size
• Horizontal and vertical indexing using Hadoop
  – Scale is only limited by number of machines
  – Number of reducers is a trade-off between speed and number of sub-indices to be merged
Run-time efficiency
• Measured average execution time (including ranking)
  – Using 150k queries that lead to a click on Wikipedia
  – Avg. length 2.2 tokens
  – Baseline is plain text indexing with BM25
• Results
  – Some cost for field-based retrieval compared to plain text indexing
  – AND is always faster than OR
    • Except in horizontal, where alignment time dominates
  – r-vertical significantly improves execution time in OR mode
              AND mode   OR mode
plain text    46 ms      80 ms
horizontal    819 ms     847 ms
vertical      97 ms      780 ms
r-vertical    78 ms      152 ms
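To make the relative costs easier to read, the timings in the table can be turned into ratios against the plain-text baseline. The numbers are copied from the table; the variable names are made up for this sketch, and the ratios are simple arithmetic, not additional measurements.

```python
# Average execution times in milliseconds, as (AND mode, OR mode).
timings_ms = {
    "plain text": (46, 80),
    "horizontal": (819, 847),
    "vertical": (97, 780),
    "r-vertical": (78, 152),
}

baseline_and, baseline_or = timings_ms["plain text"]

# Slowdown factor of each index layout relative to plain text indexing.
slowdown = {
    name: (t_and / baseline_and, t_or / baseline_or)
    for name, (t_and, t_or) in timings_ms.items()
}

# How much faster r-vertical is than vertical in OR mode.
or_speedup = timings_ms["vertical"][1] / timings_ms["r-vertical"][1]

for name, (s_and, s_or) in slowdown.items():
    print(f"{name:<12} AND x{s_and:.1f}  OR x{s_or:.1f}")
```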
Efficient element retrieval
• Goal
  – Given an ad-hoc query, return a list of documents and annotations ranked according to their relevance to the query
• Simple Solution
  – For each document that matches the query, retrieve its annotations and return the ones with the highest counts
• Problems
  – If there are many documents in the result set this will take too long: too many disk seeks, too much data to search through
  – What if counting isn’t the best method for ranking elements?
• Solution
  – Special compressed data structures designed specifically for annotation retrieval
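The "simple solution" can be written as a short counting loop over a forward index. This sketch uses an in-memory dict as the forward index and invented sample data for a "Monty Python" query; it illustrates the baseline that the slide argues is too slow at scale, not the compressed structures used in practice.

```python
from collections import Counter

# Hypothetical forward index: document id -> annotations found in that document.
forward_index = {
    1: ["Flying Circus", "John Cleese"],
    2: ["John Cleese", "Brian"],
    3: ["John Cleese", "Flying Circus"],
}

def top_annotations(matching_docs, forward_index, k=3):
    """Naive element retrieval: count annotations over all matching documents."""
    counts = Counter()
    for doc_id in matching_docs:
        # One lookup per matching document; on disk this is one seek each.
        counts.update(forward_index.get(doc_id, []))
    return counts.most_common(k)

top = top_annotations([1, 2, 3], forward_index, k=2)
```

The cost grows with the size of the result set, which is the first problem listed above.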
Forward Index
• Access metadata and document contents
  – Length, terms, annotations
• Compressed (in memory) forward indexes
  – Gamma, Delta, Nibble, Zeta codes (power laws)
• Retrieving and scoring annotations
  – Sort terms by frequency
• Random access using an extra compressed pointer list (Elias-Fano)
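Of the codes listed, Elias gamma is the simplest to show. The sketch below encodes positive integers as bit strings and applies it to a sorted list of identifiers stored as gaps; small numbers get short codes, which is why assigning small ids to frequent terms pays off. Using Python strings of "0"/"1" is purely illustrative, and the function names are assumptions; a real index would pack bits.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of bits."""
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    binary = bin(n)[2:]
    # (len - 1) zeros announce the length, then the binary digits follow
    return "0" * (len(binary) - 1) + binary

def gamma_decode_all(bits: str) -> list:
    """Decode a concatenation of gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

def encode_sorted_ids(ids):
    """Store a strictly increasing list of positive ids as gamma-coded gaps."""
    gaps = [b - a for a, b in zip([0] + ids[:-1], ids)]
    return "".join(gamma_encode(g) for g in gaps)

def decode_sorted_ids(bits):
    """Inverse of encode_sorted_ids: prefix-sum the decoded gaps."""
    ids, total = [], 0
    for gap in gamma_decode_all(bits):
        total += gap
        ids.append(total)
    return ids

encoded = encode_sorted_ids([3, 4, 9])
```

Sequential decoding like this gives no random access, which is the role of the extra pointer list mentioned in the last bullet.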