A Real-Time Search Engine with Lucene and S4 Yahoo! S4 applied to Information Retrieval 2/5/2011 Michaël Figuière
May 26, 2015
Speaker
Michaël Figuière
@mfiguiere
blog.xebia.fr
Search Engines, NoSQL, Distributed Architectures
Our case study
A Search Engine to keep track of activities within an enterprise
The Problem
A Search Engine
Search: MyCustomer
Document: Non Disclosure Agreement, 12 days ago
... MyCustomer agrees not to disclose any part of ...
Document: 2010 Sales Report, 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call, 2 days ago
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E
Indexing Pipeline
Document -> Text Extractor (Tika) -> Analyzer -> Search Index
Phone Call -> Analyzer -> Search Index
Analyzers and Search Index: Lucene
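To make the analyzer and search index steps concrete, here is a minimal sketch in plain Python. It is a toy model, not the Lucene API: the `analyze` function and the `ToyIndex` class are hypothetical names, and the analyzer only lowercases and splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def analyze(text: str) -> List[str]:
    """Toy analyzer: lowercase, then keep runs of letters and digits as terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """In-memory inverted index: term -> ids of the documents containing it."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query: str) -> Set[str]:
        """Return ids of documents containing every query term (AND semantics)."""
        terms = analyze(query)
        if not terms:
            return set()
        hits = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            hits &= self.postings.get(term, set())
        return hits


index = ToyIndex()
index.add("nda", "MyCustomer agrees not to disclose any part of ...")
index.add("call", "Customer: MyCustomer. Invoice not received for order #2354E")
print(sorted(index.search("MyCustomer")))          # ['call', 'nda']
print(sorted(index.search("invoice mycustomer")))  # ['call']
```

The same analyzer is applied at indexing time and at query time, which is why "MyCustomer" and "mycustomer" match the same documents.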
A more complex Search Engine
Search: MyCustomer
Document: 2010 Sales Report, 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call, 2 days ago
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E
Categories: Sales, Legal, Accounting
Indexing Pipeline
Document -> Text Extractor (Tika) -> Classifier (Mahout) -> Analyzer -> Search Index
Phone Call -> Classifier (Mahout) -> Analyzer -> Search Index
Analyzers and Search Index: Lucene
More complex ...
• Entity Recognition: recognizes an entity however it is written
• Language Recognition: to index each language separately
• Fetching linked URLs: enriches document context by also indexing linked URLs
• ...
A Real-Time Search Engine
Search: MyCustomer
Document: 2010 Sales Report, 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call, 3 seconds ago
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E
Categories: Sales, Legal, Accounting
Indexing Pipeline
Document -> Text Extractor -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index
Phone Call -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index
Near Real-Time Search: since Lucene 2.9
But...
What if processing takes one second per document on a single box?
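A quick back-of-the-envelope calculation shows what one second per document means in practice. The function name is hypothetical, and the multi-box figure assumes perfectly parallel, independent processing, which is an idealization.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_docs_per_day(seconds_per_doc: float, boxes: int = 1) -> int:
    """Upper bound on daily throughput, assuming boxes work fully in parallel."""
    return int(boxes * SECONDS_PER_DAY / seconds_per_doc)


print(max_docs_per_day(1.0))      # 86400
print(max_docs_per_day(1.0, 10))  # 864000
```

A single box tops out at 86,400 documents per day, so any sustained flow above one document per second needs more boxes.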
Let’s distribute it
Server 1, Server 2, Server 3, ..., Server N: each one runs Pre-Processing and hosts a Search Index
Processing logic and index structure distributed together
That’s a problem...
• Processing and index storage may have different scaling needs, depending on the search traffic, the processing overhead, ...
• Scaling an index storage up and down is long and complex, whereas stateless processing is simple to scale up/down
• Expensive pre-processing may make searches slower, and indexing in real-time shouldn’t make searches slower!
Let’s move it to Hadoop
Document -> Text Extractor -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index
Phone Call -> Some Pre-Processing -> Analyzer -> Near Real-Time Search Index
The processing runs on Hadoop MapReduce
But...
• Hadoop can only deal with chunks of data: data must be available somewhere on HDFS
• An unbounded stream of data can’t fit into Hadoop MapReduce: Hadoop is designed and optimized for batch processing
• Manually bounding the stream won’t be efficient: it will result in a lot of regular and inefficient batches
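The cost of manually bounding the stream can be sketched with a little arithmetic. In this toy model an event only becomes searchable once its batch window has closed and the batch job has finished. The 60-second window and 30-second job overhead are made-up numbers for illustration, not measurements.

```python
import math
from typing import List


def batch_latency(arrival: float, window: float, job_overhead: float) -> float:
    """Seconds between an event's arrival and the end of the batch job that processes it."""
    batch_end = (math.floor(arrival / window) + 1) * window
    return batch_end - arrival + job_overhead


arrivals: List[float] = [0.5, 12.0, 59.0]  # hypothetical arrival times, in seconds
latencies = [batch_latency(t, window=60.0, job_overhead=30.0) for t in arrivals]
print(latencies)  # [89.5, 78.0, 31.0]
```

Shrinking the window reduces the wait but multiplies the number of jobs, each paying the same fixed overhead, which is the inefficiency the last bullet points at.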
S4
• A distributed, fault-tolerant, stream processing system
• Elastic
• Project started in November 2010, still experimental, but things are moving fast!
Based on ZooKeeper
Where does S4 come from?
• Open Source project created by Yahoo!
• Initially built for relevant ad selection and clever positioning on webpages, but designed to be generic enough
Processing Element
Input Events -> Processing Element -> Output Events
Your business logic goes here
Processing Node
A Processing Node hosts Processing Element 1, Processing Element 2, ..., Processing Element N
S4 Cluster
Events Stream -> Processing Node 1, Processing Node 2, ..., Processing Node N
Cluster Management: ZooKeeper
Programming model
PhoneCallPE accepts events with: Type=PhoneCall, KeyTuple: Id=15497
Input event
EventType: PhoneCall
KeyTuple: «Id=15497»
Value: <serialized object>
Output event
EventType: EnrichedPhoneCall
KeyTuple: «Id=15497»
Value: <serialized object>
A new ProcessingElement instance is created for each value of «Id»
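The keyed programming model can be sketched in plain Python. This is a toy model, not the S4 API: `Event`, `PhoneCallPE`, and `Dispatcher` are hypothetical classes that only illustrate the rule that each distinct «Id» value gets its own processing element instance.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Event:
    event_type: str
    key: str      # value of the keyed attribute, here the phone call Id
    value: dict   # stands in for the serialized object


class PhoneCallPE:
    """Toy processing element holding state for a single key value."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.seen = 0

    def process(self, event: Event) -> Event:
        self.seen += 1
        enriched = dict(event.value, events_seen_for_key=self.seen)
        return Event("EnrichedPhoneCall", event.key, enriched)


class Dispatcher:
    """Routes PhoneCall events to one PhoneCallPE instance per key value."""

    def __init__(self) -> None:
        self.instances: Dict[str, PhoneCallPE] = {}

    def dispatch(self, event: Event) -> Optional[Event]:
        if event.event_type != "PhoneCall":
            return None  # this PE does not accept other event types
        if event.key not in self.instances:
            self.instances[event.key] = PhoneCallPE(event.key)
        return self.instances[event.key].process(event)


dispatcher = Dispatcher()
first = dispatcher.dispatch(Event("PhoneCall", "15497", {"duration_min": 13}))
second = dispatcher.dispatch(Event("PhoneCall", "15497", {"duration_min": 2}))
other = dispatcher.dispatch(Event("PhoneCall", "15498", {"duration_min": 5}))
print(len(dispatcher.instances))  # 2
```

Because state lives in the per-key instance, two events with the same Id see each other's effects while events with different Ids stay isolated.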
An indexing pipeline with S4
ReRoutingPE -> TextExtractionPE (several instances) -> ReRoutingPE -> ClassificationPE (several instances) -> MergingPE
• First ReRoutingPE: handles incoming events and load-balances them according to partitioning
• Second ReRoutingPE: handles result events and load-balances them between Processing Nodes
• MergingPE: handles final result events and pushes them to the Indexer
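The flow through this pipeline can be sketched as a chain of plain Python functions. This is a single-process toy, not S4 code: the routing uses a CRC32 hash as an assumed partitioning scheme, the text extractor only collapses whitespace, and the classifier uses hypothetical keyword rules instead of a trained model.

```python
import zlib
from typing import Dict, List


def route(key: str, nodes: int) -> int:
    """ReRoutingPE stand-in: stable hash partitioning of a key over `nodes` nodes."""
    return zlib.crc32(key.encode("utf-8")) % nodes


def extract_text(event: Dict) -> Dict:
    """TextExtractionPE stand-in: only collapses whitespace."""
    return {**event, "text": " ".join(event["raw"].split())}


def classify(event: Dict) -> Dict:
    """ClassificationPE stand-in: hypothetical keyword rules, not a trained model."""
    text = event["text"].lower()
    if "invoice" in text:
        category = "Accounting"
    elif "agreement" in text:
        category = "Legal"
    else:
        category = "Sales"
    return {**event, "category": category}


def run_pipeline(events: List[Dict], nodes: int = 3) -> List[Dict]:
    indexed: List[Dict] = []  # MergingPE stand-in: what would be pushed to the Indexer
    for event in events:
        extraction_node = route(event["id"], nodes)      # first ReRoutingPE
        extracted = extract_text(event)
        classification_node = route(event["id"], nodes)  # second ReRoutingPE
        classified = classify(extracted)
        indexed.append({**classified,
                        "extraction_node": extraction_node,
                        "classification_node": classification_node})
    return indexed


docs = run_pipeline([
    {"id": "15497", "raw": "Invoice   not received for order #2354E"},
    {"id": "15498", "raw": "Non Disclosure Agreement"},
])
print([d["category"] for d in docs])  # ['Accounting', 'Legal']
```

In a real cluster each stage would run on whichever node the routing step selects; here the node index is only recorded to show that the same key always lands on the same partition.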
Some drawbacks
• The system is lossy: events may be lost when nodes are overloaded or during failure
• A workaround is to increase the incoming queue of nodes, but still, events may be lost during failure
• Still experimental, but very promising
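The effect of the queue-size workaround can be illustrated with a small simulation. This is a toy model of a bounded incoming queue that drops events when full. `LossyQueue` and `simulate` are hypothetical names and the burst sizes are made up. It does not reproduce S4's actual queueing code.

```python
from collections import deque


class LossyQueue:
    """Bounded queue that drops new events when full instead of blocking."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: deque = deque()
        self.dropped = 0

    def offer(self, event) -> bool:
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(event)
        return True

    def poll(self):
        return self.items.popleft() if self.items else None


def simulate(capacity: int, burst: int, drain_per_tick: int, ticks: int) -> int:
    """Each tick: `burst` events arrive at once, then up to `drain_per_tick` are processed."""
    queue = LossyQueue(capacity)
    for _ in range(ticks):
        for i in range(burst):
            queue.offer(i)
        for _ in range(drain_per_tick):
            queue.poll()
    return queue.dropped


print(simulate(capacity=5, burst=10, drain_per_tick=10, ticks=3))   # 15
print(simulate(capacity=10, burst=10, drain_per_tick=10, ticks=3))  # 0
```

A larger queue absorbs bursts, but if events keep arriving faster than they are processed the queue eventually fills again, and nothing in the queue survives a node failure.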
More: Real-Time Inverted Search
Search: MyCustomer
20 new results...
Document: 2010 Sales Report, 1 month ago
... MyCustomer: 12 M€ with 3 deals ...
Phone Call, 3 seconds ago
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E
Categories: Sales, Legal, Accounting
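The slide does not spell out the mechanism. Assuming "inverted search" means registering queries up front and matching each incoming document against them (so an open results page can announce new results), a minimal sketch could look like this. `InvertedSearch` is a hypothetical class and whitespace tokenization is a simplification.

```python
from collections import defaultdict
from typing import Dict, Set


class InvertedSearch:
    """Stores queries and finds which ones a newly arrived document satisfies."""

    def __init__(self) -> None:
        self.queries: Dict[str, Set[str]] = {}
        self.term_to_queries: Dict[str, Set[str]] = defaultdict(set)

    def register(self, query_id: str, query: str) -> None:
        terms = set(query.lower().split())
        self.queries[query_id] = terms
        for term in terms:
            self.term_to_queries[term].add(query_id)

    def match(self, document: str) -> Set[str]:
        """Return ids of registered queries whose terms all appear in the document."""
        doc_terms = set(document.lower().split())
        candidates: Set[str] = set()
        for term in doc_terms:
            candidates |= self.term_to_queries.get(term, set())
        return {q for q in candidates if self.queries[q] <= doc_terms}


searches = InvertedSearch()
searches.register("q1", "MyCustomer")
searches.register("q2", "MyCustomer contract")
print(searches.match("Phone call from MyCustomer about invoice"))  # {'q1'}
```

Indexing the queries by term means each new document is only checked against queries sharing at least one term with it, rather than against every registered query.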
Summary
• S4 is a nice processing system for real-time search
• Not only for indexing-time, also for query-time! As S4 ensures low latency, query-time processing is possible
• A promising roadmap: better failure handling, client API in major languages, initial processing with Hadoop, ...
Questions / Answers
@mfiguiere
blog.xebia.fr