Search engines in the industry

a use case

Different interests

● researchers / engineers look for high precision and recall

● editors / writers are concerned about matching of queries and results

● marketers want to change / adapt results

Designing a search engine

● functional requirements
  ○ search
    ■ keywords, boolean retrieval, natural language
  ○ indexing
    ■ data sources
    ■ data types
  ○ administration
    ■ manage scoring / boosting functions
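
To make the "boolean retrieval" requirement concrete, here is a minimal sketch of an in-memory inverted index with AND / OR queries. It is an illustration only, not how Solr or Lucene implement it; the function names and the whitespace tokenization are assumptions made for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization (assumption)
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Boolean AND: ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


def search_or(index, query):
    """Boolean OR: ids of documents that contain at least one query term."""
    terms = query.lower().split()
    return set().union(*(index.get(term, set()) for term in terms))
```

A real engine adds analysis (stemming, stop words), scoring and persistent storage on top of this basic structure.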

Designing a search engine

● architectural requirements
  ○ resiliency
  ○ scalability
  ○ no downtime
  ○ work with existing infrastructure
  ○ platforms
  ○ migrating from legacy systems
  ○ talk to other systems

Designing a search engine

● performance requirements
  ○ search
    ■ queries per second
    ■ time per search request
  ○ index
    ■ documents per second
    ■ time per indexing request
  ○ SLA?

Designing a search engine

● search engine performance requirements
  ○ recall percentiles threshold
  ○ precision percentiles threshold
  ○ minimize empty results
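
As a rough sketch of how these quality requirements could be measured: precision and recall for a single query against a hand-labelled set of relevant documents, plus the share of queries that return nothing. The function names and the idea of a labelled relevance set are assumptions for illustration; the slides do not say how the metrics were computed.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the engine
    relevant:  document ids judged relevant (hypothetical hand-labelled set)
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def empty_result_rate(results_per_query):
    """Fraction of queries whose result list is empty."""
    if not results_per_query:
        return 0.0
    empty = sum(1 for results in results_per_query if not results)
    return empty / len(results_per_query)
```

Collecting these values over a sample of real queries gives the distribution against which percentile thresholds can be set.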

Data

● often mostly unknown
  ○ published vs unpublished / to be written documents

● almost always unmanageable
  ○ cannot decide when
    ■ it’ll be ready
    ■ it’ll have to be indexed
    ■ it’ll have to be searchable

● heterogeneous
  ○ different writers, languages, topics, styles, etc.

Process

Project

● ~50M heterogeneous documents
● Migrating from an old commercial solution to Apache Solr
● Google-like search
● Targeted search for different types of content

Advanced capabilities

● Smart understanding of queries
● Smart suggestion of queries
● Suggestion of similar / important contents
● Automatic classification of contents

Responsibilities

● architecture analysis and design
  ○ scaling under high load

● continuous definition of algorithms for indexing and searching

● system maintenance

Skills required

● basics of information retrieval
● a bit of distributed systems
● some natural language processing
● some machine learning

Architecture analysis and design

● Shape up a prototype architecture
  ○ separate machines for indexing and search
  ○ multiple load balanced machines for searching
  ○ define indexing and search algorithms

● Evaluate architecture
  ○ stress tests (performance)
  ○ quality tests (accuracy)

● Iterate

Architecture analysis and design

● analyze existing documents
  ○ avg size
  ○ language
  ○ topics, style, etc.

● analyze existing query logs
  ○ avg response time
  ○ avg length (how much does it take to specify a query?)
  ○ avg queries per second
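
The query log figures above can be computed with a few lines of code. This is a minimal sketch assuming each log entry has already been parsed into a `(timestamp_seconds, query_text, response_time_ms)` tuple; that record layout and the function name are hypothetical, and query length is measured here in whitespace-separated terms, which is only one possible reading of "avg length".

```python
from statistics import mean


def summarize_query_log(entries):
    """Summarize parsed query log entries.

    entries: list of (timestamp_seconds, query_text, response_time_ms) tuples
             (hypothetical record layout, assumed for this sketch).
    """
    if not entries:
        raise ValueError("query log is empty")
    timestamps = [ts for ts, _, _ in entries]
    duration = max(timestamps) - min(timestamps)
    return {
        "avg_response_ms": mean(ms for _, _, ms in entries),
        "avg_query_terms": mean(len(query.split()) for _, query, _ in entries),
        # Undefined when all entries share the same timestamp.
        "avg_qps": len(entries) / duration if duration > 0 else None,
    }
```

Averages hide peaks, so in practice it is worth looking at the same figures per time window as well.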

Most time spent on

● testing how documents get indexed
● testing how user queries get transformed into platform-specific queries
● tweaking indexing algorithms
● tweaking search algorithms
● tweaking ranking
● platform optimization for scalability

Challenges

● Architecture constraints
● Performance
● Diverging stakeholder concerns
● Dynamically scaling search

Sample architecture constraint #1

● Data storage has to be on NFS
● Lucene is IO intensive
● NFS makes it slower
● Concurrent reads and writes make it error-prone

Sample architecture constraint #2

● Change search engine
● Systems talking to the SE need to switch API
● Only in the long run
● In the short run an adapter layer mapping the old APIs onto the new APIs has to be developed
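
An adapter layer of this kind can be sketched as a function that translates a legacy request into Solr query parameters. The legacy request shape (`text`, `section`, `page`, `page_size`) and the `section` field are invented for illustration, since the slides do not describe the old API; `q`, `fq`, `start` and `rows` are standard Solr query parameters.

```python
def legacy_to_solr_params(legacy_request):
    """Translate a hypothetical legacy search request into Solr query parameters.

    legacy_request: dict with a required "text" key and optional
    "section", "page" (1-based) and "page_size" keys (assumed format).
    """
    page = legacy_request.get("page", 1)
    page_size = legacy_request.get("page_size", 10)
    params = {
        "q": legacy_request["text"],      # main query string
        "start": (page - 1) * page_size,  # offset of the first result
        "rows": page_size,                # number of results to return
    }
    section = legacy_request.get("section")
    if section:
        # Filter query on an assumed "section" field in the Solr schema.
        params["fq"] = f'section:"{section}"'
    return params
```

Callers keep sending the old request format while the translation happens in one place, which is convenient in the short run but adds a conversion step to every request.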

Indexing performance

● Most of the indexing time is spent converting data from the old indexing format to the new indexing format

● The adapter layer between old and new API becomes the bottleneck

● Time to switch to the new API natively

Diverging concerns

● Article authors check that the search engine handles their writing exactly as they expect, wanting perfect recall and precision
  ○ so a lot of time is spent on adjusting ranking

● Marketers want to be able to override ranking and promote something they want to sell
  ○ ranking algorithm gets bypassed

● Need flexible algorithms
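
One way to reconcile both concerns is to keep the ranking algorithm untouched and apply promotions as a separate, bounded step on top of it. The sketch below is an assumption about how that could look, not the approach taken in the project; the function name and the `max_promoted` cap are hypothetical.

```python
def apply_promotions(ranked_ids, promoted_ids, max_promoted=2):
    """Pin a limited number of promoted documents above the organic ranking.

    ranked_ids:   document ids in the order produced by the ranking algorithm
    promoted_ids: document ids marketers want on top, in priority order
    max_promoted: cap on pinned slots so organic results stay visible (assumed policy)
    """
    pinned = list(promoted_ids)[:max_promoted]
    pinned_set = set(pinned)
    # Keep the organic order, dropping documents that are already pinned.
    organic = [doc_id for doc_id in ranked_ids if doc_id not in pinned_set]
    return pinned + organic
```

Because the organic order is preserved below the pinned slots, ranking quality can still be evaluated independently of marketing overrides.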

Scale dynamically

● Search engine must not break even under high peaks of load

● Such peaks are often unpredictable
● Need a fast way to add more computing power

Takeaways

● small iterations (no waterfalls!)
  ○ analyze portion of data / queries
  ○ change search / index algorithms
  ○ test, involve stakeholders
  ○ forces ability to reindex quickly

● look at data (documents, query logs)