FOSDEM (feb 2011) - A real-time search engine with Lucene and S4

A Real-Time Search Engine with Lucene and S4

Yahoo! S4 applied to Information Retrieval

2/5/2011 Michaël Figuière

Speaker

Michaël Figuière

@mfiguiere

blog.xebia.fr

Search Engines

NoSQL

Distributed Architectures

Our case study

A Search Engine to keep track of activities within an enterprise

The Problem

A Search Engine

Search

A Search Engine

SearchMyCustomer

A Search Engine

SearchMyCustomer

Non Disclosure Agreement (12 days ago)
... MyCustomer agrees not to disclose any part of ...

2010 Sales Report (1 month ago)
... MyCustomer: 12 M€ with 3 deals ...

Phone Call (2 days ago)
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E

Document

Document

Phone Call

Indexing Pipeline

Text Extractor

Lucene

PDF

PhoneCall

Analyzer

Analyzer

Search Index

Tika
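The pipeline on this slide (text extraction, analysis, search index) can be sketched in a few lines. This is only a toy illustration in plain Python, not Tika or Lucene code: `extract_text`, `analyze` and `SearchIndex` are hypothetical stand-ins for the roles those components play.

```python
import re
from collections import defaultdict

def extract_text(source):
    """Stand-in for the text extraction step (the role Tika plays on the slide)."""
    if source["kind"] == "phone_call":
        # Structured record: flatten the fields into one searchable string.
        return f"{source['customer']} {source['description']}"
    # Assume the text of a document has already been pulled out of its PDF.
    return source["text"]

def analyze(text):
    """Stand-in for an analyzer: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

class SearchIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, source):
        for term in analyze(extract_text(source)):
            self.postings[term].add(doc_id)

    def search(self, query):
        terms = analyze(query)
        if not terms:
            return set()
        # A document matches only if it contains every query term.
        return set.intersection(*(self.postings.get(t, set()) for t in terms))

index = SearchIndex()
index.add("call-1", {"kind": "phone_call", "customer": "MyCustomer",
                     "description": "Invoice not received for order #2354E"})
index.add("doc-1", {"kind": "document",
                    "text": "2010 Sales Report. MyCustomer: 12 M€ with 3 deals"})
hits = index.search("MyCustomer")
```

Both the phone call and the report come back for "MyCustomer", which mirrors the mixed result list shown on the earlier slides.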

A more complex Search Engine

SearchMyCustomer


Phone Call (2 days ago)
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E

Document

Phone Call

Sales / Legal / Accounting

Indexing Pipeline

Text Extractor

Lucene

PDF

PhoneCall

Analyzer

Analyzer

Search Index

Tika

Classifier

Classifier

Mahout

More complex ...

• Entity Recognition

• Language Recognition

• Fetching linked URLs

• ...

Recognizes an entity however it is written

To index each language separately

Enhances document context by also indexing linked URLs

A Real-Time Search Engine

SearchMyCustomer


Phone Call (3 seconds ago)
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E

Document

Phone Call


A Real-Time Search Engine

SearchMyCustomer



Document

Phone Call


Indexing Pipeline

Text Extractor

PDF

PhoneCall

Analyzer

Analyzer

Near Real-Time Search Index

Some Pre-Processing

Since Lucene 2.9

Some Pre-Processing

But...

Text Extractor

PDF

PhoneCall

Analyzer

Analyzer


Some Pre-Processing

What if it takes one second per document on a single box?

Some Pre-Processing
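To make the question concrete, here is a small back-of-the-envelope calculation. The one second per document comes from the slide; the incoming rate of 50 documents per second is a purely hypothetical figure chosen for illustration, and the estimate assumes the work parallelizes perfectly.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def docs_per_day(seconds_per_doc, servers=1):
    """Maximum documents processed per day, assuming perfect parallelism."""
    return servers * SECONDS_PER_DAY / seconds_per_doc

def servers_needed(docs_per_second, seconds_per_doc):
    """Minimum number of boxes needed to keep up with the incoming rate."""
    return math.ceil(docs_per_second * seconds_per_doc)

single_box = docs_per_day(1.0)      # capacity of one box at 1 s/document
needed = servers_needed(50, 1.0)    # 50 docs/s is a hypothetical load
```

A single box tops out at 86,400 documents per day, and any sustained rate above one document per second makes the backlog grow without bound, which is what motivates distributing the pre-processing.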

Server 1

Let’s distribute it

Pre-Processing

Search Index

Server 2

Pre-Processing

Search Index

Server 3

Pre-Processing

Search Index

Server N

Processing logic and index structure distributed together

That’s a problem...

• Processing and index storage may have different scaling needs

• Scaling index storage up or down is slow and complex

• Expensive pre-processing may make searches slower

Depending on the search traffic, the processing overhead, ...

Whereas stateless processing is simple to scale up/down

And indexing in real-time shouldn’t make searches slower!

Let’s move it to Hadoop

Text Extractor

PDF

PhoneCall

Analyzer

Analyzer


Some Pre-Processing

Hadoop MapReduce

Some Pre-Processing

But...

• Hadoop can only deal with chunks of data

• An unbounded stream of data doesn’t fit into Hadoop MapReduce

• Manually bounding the stream won’t be efficient

Data must be available somewhere on HDFS

Hadoop is designed and optimized for batch processing

It would result in a lot of frequent, inefficient batches
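The cost of manually bounding the stream can be illustrated with a toy model. Everything here is an assumption made for the sake of the example: events are grouped into fixed time windows, each window is processed by one batch job, and `job_overhead` is a hypothetical fixed startup cost per job (30 seconds below).

```python
def batch_delays(arrival_times, window, job_overhead):
    """Delay before each event is processed when the stream is cut into windows."""
    delays = []
    for t in arrival_times:
        # The event waits for its window to close, then for the job to start.
        window_end = (int(t // window) + 1) * window
        delays.append(window_end - t + job_overhead)
    return delays

arrivals = [0, 10, 25, 59]  # arrival times in seconds, hypothetical

# One batch per minute: most events wait a long time.
slow = batch_delays(arrivals, window=60, job_overhead=30)

# One batch every 5 seconds: waiting shrinks, but the fixed job overhead
# now dominates and is paid twelve times more often.
fast = batch_delays(arrivals, window=5, job_overhead=30)
```

In this model no window size brings the delay below the per-job overhead, which is one way to read the slide's point about frequent, inefficient batches.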

S4

S4

• A distributed, fault-tolerant, stream processing system

• Elastic

• Project started in November 2010, still experimental

Based on Zookeeper

But things are moving fast!

Where does S4 come from?

• Open Source project created by Yahoo!

• Initially built for relevant ad selection and clever positioning on webpages


But designed to be generic enough


Processing Element

Processing Element

Events Input

Events Output

Your business logic goes here

Processing Node

Processing Node

Processing Element 1

Processing Element 2

Processing Element N

Processing Node 1

S4 Cluster

Processing Node 2

Zookeeper

Events Stream

Cluster Management

Processing Node N
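One common way to spread a keyed event stream over N processing nodes is to hash the key. The sketch below shows that idea only; the slides do not describe the function S4 actually uses, so `node_for` and the CRC32 choice are assumptions made for illustration.

```python
import zlib

def node_for(key, node_count):
    """Map an event key to a node index with a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % node_count

node_count = 3
keys = ["Id=15497", "Id=15498", "Id=20001"]  # hypothetical event keys
assignments = {k: node_for(k, node_count) for k in keys}
```

A stable hash matters here: every event with the same key must land on the same node so that it reaches the same Processing Element instance.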

Programming model

PhoneCallPE

Accepts events with:

Type=PhoneCall

KeyTuple: Id=15497

Event

Type: PhoneCall

KeyTuple: «Id=15497»

Value: <serialized object>

Event

Type: EnrichedPhoneCall

KeyTuple: «Id=15497»

Value: <serialized object>

A new ProcessingElement instance is created for each value of «Id»
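The "one instance per key value" rule can be sketched as follows. This is plain Python, not the S4 API: `Dispatcher`, the event dictionaries and the `events_seen` counter are hypothetical names used only to show the dispatch behaviour.

```python
class PhoneCallPE:
    """Toy processing element: one instance exists per distinct key value."""
    accepted_type = "PhoneCall"

    def __init__(self, key):
        self.key = key
        self.seen = 0  # per-key state lives in the instance

    def process(self, event):
        self.seen += 1
        # Emit a new event of another type that keeps the same key.
        return {"type": "EnrichedPhoneCall", "key": self.key,
                "value": {**event["value"], "events_seen": self.seen}}

class Dispatcher:
    """Routes events to PE instances, creating one lazily for each new key."""

    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.instances = {}

    def dispatch(self, event):
        if event["type"] != self.pe_class.accepted_type:
            return None  # this PE does not accept that event type
        key = event["key"]
        if key not in self.instances:
            self.instances[key] = self.pe_class(key)
        return self.instances[key].process(event)

d = Dispatcher(PhoneCallPE)
out1 = d.dispatch({"type": "PhoneCall", "key": "Id=15497", "value": {"duration_min": 13}})
out2 = d.dispatch({"type": "PhoneCall", "key": "Id=15497", "value": {"duration_min": 2}})
out3 = d.dispatch({"type": "PhoneCall", "key": "Id=15498", "value": {"duration_min": 7}})
ignored = d.dispatch({"type": "Document", "key": "Id=1", "value": {}})
```

Two keys produce two instances, and the second event for «Id=15497» sees the state left by the first one.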

An indexing pipeline with S4

ReRoutingPE

TextExtractionPE

TextExtractionPE

ReRoutingPE

ClassificationPE

ClassificationPE

MergingPE

Handles incoming events and load-balances them according to partitioning


Handles result events and load-balances between Processing Nodes

ReRoutingPE


ReRoutingPE


MergingPE


Handles final result events and pushes them to the Indexer

ReRoutingPE


ReRoutingPE


MergingPE
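The chain of PEs shown on these slides can be sketched as below. This is a simplified illustration in plain Python rather than S4 code: round-robin stands in for the partitioning-based load balancing, a keyword rule stands in for the classifier, and a list stands in for the Indexer. All of these are assumptions.

```python
import itertools

class ReRoutingPE:
    """Spreads events over a pool of downstream workers (round-robin here)."""
    def __init__(self, workers):
        self._cycle = itertools.cycle(workers)

    def process(self, event):
        return next(self._cycle).process(event)

class TextExtractionPE:
    def __init__(self, name, downstream):
        self.name, self.downstream = name, downstream

    def process(self, event):
        # Placeholder extraction: the raw payload is assumed to be text already.
        enriched = dict(event, text=event["raw"].strip(), extracted_by=self.name)
        return self.downstream.process(enriched)

class ClassificationPE:
    def __init__(self, name, downstream):
        self.name, self.downstream = name, downstream

    def process(self, event):
        # Placeholder rule standing in for a trained classifier.
        category = "Accounting" if "invoice" in event["text"].lower() else "Sales"
        return self.downstream.process(
            dict(event, category=category, classified_by=self.name))

class MergingPE:
    """Collects final events and pushes them to the indexer (a list here)."""
    def __init__(self, indexer):
        self.indexer = indexer

    def process(self, event):
        self.indexer.append(event)
        return event

indexed = []
merging = MergingPE(indexed)
second_router = ReRoutingPE([ClassificationPE("c1", merging),
                             ClassificationPE("c2", merging)])
entry = ReRoutingPE([TextExtractionPE("t1", second_router),
                     TextExtractionPE("t2", second_router)])

entry.process({"id": 1, "raw": " Invoice not received for order #2354E "})
entry.process({"id": 2, "raw": "2010 Sales Report"})
```

Because every stage is stateless apart from the router's position, adding a worker to either pool is just one more entry in the list passed to `ReRoutingPE`.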

Some drawbacks

• The system is lossy

• A workaround is to increase the size of the nodes' incoming queues

• Still experimental

Events may be lost when nodes are overloaded or during failure

But still, events may be lost during failure

But very promising
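The overload case and the queue-size workaround can be illustrated with a bounded queue. This is a toy model, not S4's implementation: `LossyQueue` is hypothetical, and dropping the newest event when the queue is full is an assumption about the drop policy.

```python
from collections import deque

class LossyQueue:
    """Bounded incoming queue that drops new events once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0

    def offer(self, event):
        if len(self.items) >= self.capacity:
            self.dropped += 1  # overloaded: the event is lost
            return False
        self.items.append(event)
        return True

def simulate(capacity, burst):
    """Push a burst of events with no consumer running; count the losses."""
    queue = LossyQueue(capacity)
    for event in range(burst):
        queue.offer(event)
    return queue.dropped

small = simulate(capacity=10, burst=25)
large = simulate(capacity=100, burst=25)
```

A larger queue absorbs the burst without loss, but whatever sits in the queue is still gone if the node fails, which matches the second caveat on the slide.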

More: Real-Time Inverted Search

SearchMyCustomer



Document

Phone Call

20 new results...
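One reading of "inverted search" is that queries are stored and each incoming document is matched against them, so that an open results page can be told about new hits. The slides do not detail the mechanism, so the sketch below is an interpretation; `InvertedSearch` and its whitespace tokenization are hypothetical.

```python
class InvertedSearch:
    """Stores queries and reports which ones a new document satisfies."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, query):
        self.queries[query_id] = set(query.lower().split())

    def match(self, document_text):
        terms = set(document_text.lower().split())
        # A stored query matches when all of its terms appear in the document.
        return sorted(qid for qid, required in self.queries.items()
                      if required <= terms)

searches = InvertedSearch()
searches.register("q1", "MyCustomer")
searches.register("q2", "invoice order")
matched = searches.match("MyCustomer invoice not received for order")
unmatched = searches.match("sales report")
```

Each match could then increment the "new results" counter of the corresponding open search.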


Summary

• S4 is a nice processing system for real-time search

• Not only at indexing time, but also at query time!

• A promising roadmap...

As S4 ensures low latency, query-time processing is possible

Better failure handling, client API in major languages, initial processing with Hadoop, ...


Questions / Answers

?

@mfiguiere

blog.xebia.fr
