A Real-Time Search Engine with Lucene and S4 Yahoo! S4 applied to Information Retrieval 2/5/2011 Michaël Figuière
Page 1: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

A Real-Time Search Engine with Lucene and S4

Yahoo! S4 applied to Information Retrieval

2/5/2011 Michaël Figuière

Page 2: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Speaker

Michaël Figuière

@mfiguiere

blog.xebia.fr

Search Engines, NoSQL, Distributed Architectures

Page 3: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Our case study

A Search Engine to keep track of activities within an enterprise

Page 4: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

The Problem

Page 5: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

A Search Engine

Search

Page 6: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

A Search Engine

Search: MyCustomer

Page 7: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

A Search Engine

Search: MyCustomer

Document: Non Disclosure Agreement (12 days ago)
... MyCustomer agrees not to disclose any part of ...

Document: 2010 Sales Report (1 month ago)
... MyCustomer: 12 M€ with 3 deals ...

Phone Call (2 days ago)
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E

Page 8: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Indexing Pipeline

[Diagram: PDF (via a Text Extractor, Tika) and PhoneCall each pass through an Analyzer into the Lucene Search Index]
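
The stages on this slide can be sketched in a few lines. This is a toy stand-in, not Lucene or Tika: `extract_text`, `analyze` and `InvertedIndex` are hypothetical names, the extractor only decodes UTF-8, and the analyzer only lowercases and splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def extract_text(raw: bytes) -> str:
    """Stand-in for a text extractor such as Tika: here, plain UTF-8 decoding."""
    return raw.decode("utf-8", errors="replace")


def analyze(text: str) -> list:
    """Stand-in for an analyzer: lowercase, then keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy search index mapping each token to the ids of documents containing it."""

    def __init__(self) -> None:
        self.postings = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def search(self, query: str) -> set:
        """Return ids of documents containing every token of the query."""
        tokens = analyze(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = InvertedIndex()
# A PDF goes through text extraction first; a phone call record is already text.
index.add("pdf-1", extract_text(b"2010 Sales Report: MyCustomer, 3 deals"))
index.add("call-1", "Customer: MyCustomer. Invoice not received for order #2354E")
print(sorted(index.search("MyCustomer")))
```

Both sample documents mention the customer, so the query returns both ids; a real deployment would delegate these steps to Tika and Lucene.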

Page 9: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

A more complex Search Engine

Search: MyCustomer

Document: 2010 Sales Report (1 month ago)
... MyCustomer: 12 M€ with 3 deals ...

Phone Call (2 days ago)
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E

Sales, Legal, Accounting

Page 10: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Indexing Pipeline

[Diagram: PDF (via a Text Extractor, Tika) and PhoneCall each pass through a Classifier (Mahout) and an Analyzer into the Lucene Search Index]

Page 11: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

More complex ...

• Entity Recognition
  Recognizes an entity however it is written

• Language Recognition
  To index each language separately

• Fetching linked URLs
  Enhances document context by also indexing linked URLs

• ...

Page 12: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

A Real-Time Search Engine

Search: MyCustomer

Document: 2010 Sales Report (1 month ago)
... MyCustomer: 12 M€ with 3 deals ...

Phone Call (3 seconds ago)
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E

Sales, Legal, Accounting

Page 13: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

A Real-Time Search Engine

Search: MyCustomer

Document: 2010 Sales Report (1 month ago)
... MyCustomer: 12 M€ with 3 deals ...

Phone Call (3 seconds ago)
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E

Sales, Legal, Accounting

Page 14: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Indexing Pipeline

[Diagram: PDF (via a Text Extractor) and PhoneCall each pass through Some Pre-Processing and an Analyzer into a Near Real-Time Search Index (since Lucene 2.9)]

Page 15: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

But...

[Diagram: PDF (via a Text Extractor) and PhoneCall each pass through Some Pre-Processing and an Analyzer into a Near Real-Time Search Index]

What if it takes one second per document on a single box?
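
To make the question concrete, here is a back-of-the-envelope calculation. The one second per document figure comes from the slide; the arrival rate of 20 documents per second is an assumed, illustrative number, and perfect parallelism across boxes is also an assumption.

```python
import math

SECONDS_PER_DOCUMENT = 1.0   # processing cost quoted on the slide
ARRIVAL_RATE = 20.0          # documents per second; assumed for illustration


def max_throughput(boxes: int, seconds_per_document: float) -> float:
    """Documents per second that `boxes` machines sustain, assuming perfect parallelism."""
    return boxes / seconds_per_document


def boxes_needed(arrival_rate: float, seconds_per_document: float) -> int:
    """Smallest number of machines whose combined throughput covers the arrival rate."""
    return math.ceil(arrival_rate * seconds_per_document)


# Documents a single box can process in one day (86,400 seconds).
print(max_throughput(1, SECONDS_PER_DOCUMENT) * 86_400)

# Boxes required to keep up with the assumed arrival rate.
print(boxes_needed(ARRIVAL_RATE, SECONDS_PER_DOCUMENT))
```

If documents arrive faster than one per second, a single box falls behind indefinitely, which is why the following slides look at distributing the work.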

Page 16: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Let’s distribute it

[Diagram: Server 1, Server 2, Server 3, ..., Server N, each hosting Pre-Processing and a Search Index]

Processing logic and index structure distributed together

Page 17: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

That’s a problem...

• Processing and index storage may have different scaling needs
  Depending on the search traffic, the processing overhead, ...

• Scaling index storage up and down is slow and complex
  Whereas stateless processing is simple to scale up/down

• Expensive pre-processing may make searches slower
  And indexing in real-time shouldn’t make searches slower!

Page 18: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Let’s move it to Hadoop

[Diagram: PDF (via a Text Extractor) and PhoneCall each pass through Some Pre-Processing and an Analyzer into a Near Real-Time Search Index; the processing is labeled Hadoop MapReduce]

Page 19: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

But...

• Hadoop can only deal with chunks of data
  Data must be available somewhere on HDFS

• An unbounded stream of data can’t fit into Hadoop MapReduce
  Hadoop is designed and optimized for batch processing

• Manually bounding the stream won’t be efficient
  It will result in lots of regular, inefficient batches

Page 20: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

S4

Page 21: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

S4

• A distributed, fault-tolerant, stream processing system

• Elastic
  Based on ZooKeeper

• Project started in November 2010, still experimental
  But things are moving fast!

Page 22: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Where does S4 come from ?

• Open Source project created by Yahoo!

• Initially built for relevant ad selection and clever positioning on webpages
  But designed to be generic enough

Page 23: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Processing Element

[Diagram: Events Input → Processing Element → Events Output]

Your business logic goes here

Page 24: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Processing Node

[Diagram: a Processing Node containing Processing Element 1, Processing Element 2, ..., Processing Element N]

Page 25: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

S4 Cluster

[Diagram: Processing Node 1, Processing Node 2, ..., Processing Node N consume an Events Stream; ZooKeeper provides Cluster Management]

Page 26: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Programming model

PhoneCallPE
Accepts events with: Type=PhoneCall, KeyTuple: Id=15497

Event (input)
  EventType: PhoneCall
  KeyTuple: «Id=15497»
  Value: <serialized object>

Event (output)
  Type: EnrichedPhoneCall
  KeyTuple: «Id=15497»
  Value: <serialized object>

A new ProcessingElement instance is created for each value of «Id»
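
The keyed dispatch described on this slide can be sketched as follows. This is not the S4 API: `Event`, `PhoneCallPE` and `Dispatcher` are hypothetical names, the payload is a plain string rather than a serialized object, and everything runs in a single process for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    type: str
    key: Tuple[str, Any]   # e.g. ("Id", 15497)
    value: Any             # payload; a serialized object in a real system


class PhoneCallPE:
    """Hypothetical PE: enriches PhoneCall events for one value of the key."""

    def __init__(self, key: Tuple[str, Any]) -> None:
        self.key = key
        self.seen = 0   # per-key state lives in the PE instance

    def process(self, event: Event) -> Event:
        self.seen += 1
        enriched = {"call": event.value, "events_seen_for_key": self.seen}
        return Event("EnrichedPhoneCall", event.key, enriched)


class Dispatcher:
    """Creates one PE instance per distinct key value, on first use."""

    def __init__(self, accepted_type: str,
                 factory: Callable[[Tuple[str, Any]], PhoneCallPE]) -> None:
        self.accepted_type = accepted_type
        self.factory = factory
        self.instances: Dict[Tuple[str, Any], PhoneCallPE] = {}

    def dispatch(self, event: Event) -> List[Event]:
        if event.type != self.accepted_type:
            return []   # events of other types are ignored by this PE
        pe = self.instances.get(event.key)
        if pe is None:
            pe = self.instances[event.key] = self.factory(event.key)
        return [pe.process(event)]


dispatcher = Dispatcher("PhoneCall", PhoneCallPE)
out = dispatcher.dispatch(Event("PhoneCall", ("Id", 15497), "Invoice not received"))
dispatcher.dispatch(Event("PhoneCall", ("Id", 15497), "Follow-up call"))
dispatcher.dispatch(Event("PhoneCall", ("Id", 20001), "New order"))
print(out[0].type, len(dispatcher.instances))
```

Three events with two distinct `Id` values produce two PE instances, and each instance keeps its own state for its key.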

Page 27: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

An indexing pipeline with S4

[Diagram: ReRoutingPE → TextExtractionPE (x2) → ReRoutingPE → ClassificationPE (x2) → MergingPE]

First ReRoutingPE: handles incoming events and load-balances them according to partitioning

Page 28: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

An indexing pipeline with S4

[Diagram: ReRoutingPE → TextExtractionPE (x2) → ReRoutingPE → ClassificationPE (x2) → MergingPE]

Second ReRoutingPE: handles result events and load-balances them between Processing Nodes

Page 29: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

An indexing pipeline with S4

[Diagram: ReRoutingPE → TextExtractionPE (x2) → ReRoutingPE → ClassificationPE (x2) → MergingPE]

MergingPE: handles final result events and pushes them to the Indexer
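
The load-balancing role of the ReRoutingPE steps can be sketched as hash partitioning on the event key. This is an assumption about how such routing might work, not S4's actual implementation; `route` and the node names are hypothetical.

```python
import zlib
from collections import Counter
from typing import List


def route(key: str, nodes: List[str]) -> str:
    """Pick a node with a stable hash, so equal keys always land on the same node."""
    # zlib.crc32 is used because the built-in hash() of str varies between processes.
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


nodes = ["node-1", "node-2", "node-3"]   # hypothetical Processing Node names

# The same key is always routed to the same node.
assert route("Id=15497", nodes) == route("Id=15497", nodes)

# Many different keys are spread over the available nodes.
load = Counter(route(f"Id={i}", nodes) for i in range(3000))
print(dict(load))
```

With modulo hashing, changing the number of nodes remaps most keys; consistent hashing is a common alternative when nodes come and go.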

Page 30: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Some drawbacks

• The system is lossy
  Events may be lost when nodes are overloaded or during failure

• A workaround is to increase the size of the nodes’ incoming queues
  But still, events may be lost during failure

• Still experimental
  But very promising
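
The trade-off behind the queue workaround can be illustrated with a bounded queue that drops events when full. This is a single-process sketch, not S4 code; `LossyInbox`, `simulate` and the capacity values are hypothetical.

```python
from collections import deque
from typing import Any, Deque, Optional


class LossyInbox:
    """Bounded incoming queue: events arriving while it is full are dropped and counted."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue: Deque[Any] = deque()
        self.dropped = 0

    def offer(self, event: Any) -> bool:
        if len(self.queue) >= self.capacity:
            self.dropped += 1   # overloaded: the event is lost
            return False
        self.queue.append(event)
        return True

    def poll(self) -> Optional[Any]:
        return self.queue.popleft() if self.queue else None


def simulate(capacity: int, burst: int) -> int:
    """Send a burst of events before the node consumes any; return how many were lost."""
    inbox = LossyInbox(capacity)
    for i in range(burst):
        inbox.offer(i)
    return inbox.dropped


print(simulate(capacity=100, burst=250))   # small queue: part of the burst is lost
print(simulate(capacity=500, burst=250))   # larger queue absorbs the same burst
```

A larger queue only absorbs bursts. Whatever is still queued on a node is lost if that node fails, which matches the second caveat on the slide.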

Page 31: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

More: Real-Time Inverted Search

Search: MyCustomer

Document: 2010 Sales Report (1 month ago)
... MyCustomer: 12 M€ with 3 deals ...

Phone Call (3 seconds ago)
Customer: MyCustomer, Time: 9:55am, Duration: 13min
Description: Invoice not received for order #2354E

20 new results...

Sales, Legal, Accounting
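
Inverted search stores the queries and matches each incoming document against them, so a user watching a query can be told about new results as they arrive. A minimal sketch, assuming queries where every term must match; `QueryRegistry` and the tokenization are hypothetical, and this is not how Lucene or S4 implement it.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and keep its alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class QueryRegistry:
    """Stores standing queries and matches each incoming document against them."""

    def __init__(self) -> None:
        self.queries: Dict[str, Set[str]] = {}   # query id -> required terms
        self.new_results: Dict[str, int] = {}    # query id -> matches since registration

    def register(self, query_id: str, query: str) -> None:
        self.queries[query_id] = tokens(query)
        self.new_results[query_id] = 0

    def on_document(self, text: str) -> List[str]:
        """Return the ids of the standing queries that the new document satisfies."""
        doc_terms = tokens(text)
        matched = [qid for qid, terms in self.queries.items()
                   if terms and terms <= doc_terms]
        for qid in matched:
            self.new_results[qid] += 1
        return matched


registry = QueryRegistry()
registry.register("q1", "MyCustomer")
registry.on_document("Phone Call. Customer: MyCustomer. Invoice not received")
registry.on_document("2010 Sales Report for AnotherCustomer")
print(registry.new_results["q1"], "new results...")
```

Only the first document contains the watched term, so the counter for `q1` is incremented once.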

Page 32: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Summary

• S4 is a nice processing system for real-time search

• Not only for indexing-time, also for query-time!
  As S4 ensures low latency, query-time processing is possible

• A promising roadmap...
  Better failure handling, client API in major languages, initial processing with Hadoop, ...

Page 33: FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4

Questions / Answers

@mfiguiere

blog.xebia.fr