Search engines in the industry

a use case

Different interests

● researchers / engineers look for high precision and recall

● editors / writers are concerned about matching of queries and results

● marketers want to change / adapt results

Designing a search engine

● functional requirements
  ○ search
    ■ keywords, boolean retrieval, natural language
  ○ indexing
    ■ data sources
    ■ data types
  ○ administration
    ■ manage scoring / boosting functions

● architectural requirements
  ○ resiliency
  ○ scalability
  ○ no downtime
  ○ work with existing infrastructure
  ○ platforms
  ○ migrating from legacy systems
  ○ talk to other systems

● performance requirements
  ○ search
    ■ queries per second
    ■ time per search request
  ○ index
    ■ documents per second
    ■ time per indexing request
  ○ SLA?

● search engine performance requirements
  ○ recall percentiles threshold
  ○ precision percentiles threshold
  ○ minimize empty results
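The quality requirements above can be made measurable with a small evaluation routine. What follows is a minimal sketch, assuming relevance judgments are available as a set of document ids per query; the function names `precision_recall` and `empty_result_rate` are hypothetical and not part of any search platform.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: list of document ids returned by the engine
    relevant:  set of document ids judged relevant (assumed to be known)
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def empty_result_rate(results_per_query):
    """Fraction of queries that returned no documents at all."""
    if not results_per_query:
        return 0.0
    empty = sum(1 for docs in results_per_query if not docs)
    return empty / len(results_per_query)
```

Running both functions over a sample of logged queries gives per-query values from which percentile thresholds can then be checked.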

● documents are often mostly unknown
  ○ published vs unpublished / to be written documents

● documents are almost always unmanageable
  ○ cannot decide when
    ■ a document will be ready
    ■ it will have to be indexed
    ■ it will have to be searchable

● documents are heterogeneous
  ○ different writers, languages, topics, styles, etc.

Process

Project

● ~50M heterogeneous documents
● Migrating from old commercial solution to Apache Solr
● Google-like search
● Targeted search for different types of contents

Advanced capabilities

● Smart understanding of queries
● Smart suggestion of queries
● Suggestion of similar / important contents
● Automatic classification of contents

Responsibilities

● architecture analysis and design
  ○ scaling under high load

● continuous definition of algorithms for indexing and searching

● system maintenance

Skills required

● basics of information retrieval
● a bit of distributed systems
● some natural language processing
● some machine learning

Architecture analysis and design

● Shape up a prototype architecture
  ○ separate machines for indexing and search
  ○ multiple load balanced machines for searching
  ○ define indexing and search algorithms

● Evaluate architecture
  ○ stress tests (performance)
  ○ quality tests (accuracy)

● Iterate
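The stress-test part of the evaluation step can be summarized with throughput and latency percentiles. The sketch below assumes the latencies of all requests in one test run have already been collected in milliseconds; `percentile` uses the nearest-rank method, and both function names are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (0 < p <= 100) of a non-empty list."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize_stress_test(latencies_ms, duration_s):
    """Summarize one run: throughput plus latency percentiles."""
    return {
        "qps": len(latencies_ms) / duration_s,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "max_ms": max(latencies_ms),
    }
```

Comparing these summaries across iterations shows whether a change to the architecture actually helped.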

Architecture analysis and design

● analyze existing documents
  ○ avg size
  ○ language
  ○ topics, style, etc.

● analyze existing query logs
  ○ avg response time
  ○ avg length (how many terms does it take to specify a query?)
  ○ avg queries per second
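The query log statistics listed above can be computed in a few lines. This is a sketch under a stated assumption: each log entry is a `(timestamp_s, query, response_ms)` tuple, which is a hypothetical format chosen for illustration, and query length is approximated by whitespace-separated terms.

```python
def analyze_query_log(entries):
    """entries: list of (timestamp_s, query, response_ms) tuples (assumed format)."""
    if not entries:
        raise ValueError("empty query log")
    timestamps = [t for t, _, _ in entries]
    span_s = max(timestamps) - min(timestamps)
    count = len(entries)
    return {
        "avg_response_ms": sum(ms for _, _, ms in entries) / count,
        "avg_query_terms": sum(len(q.split()) for _, q, _ in entries) / count,
        # Undefined when all entries share one timestamp.
        "avg_qps": count / span_s if span_s > 0 else None,
    }
```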

Most time spent on

● testing how documents get indexed
● testing how user queries get transformed into platform-specific queries
● tweaking indexing algorithms
● tweaking search algorithms
● tweaking ranking
● platform optimization for scalability

Challenges

● Architecture constraints
● Performance
● Diverging stakeholder concerns
● Dynamically scaling search

Sample architecture constraint #1

● Data storage has to be on NFS
● Lucene is IO intensive
● NFS makes it slower
● Concurrent reads and writes make it error prone

Sample architecture constraint #2

● Change search engine
● Systems talking to the search engine (SE) need to switch API, but only in the long run
● In the short run, an adapter layer exposing the old APIs on top of the new APIs has to be developed
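An adapter layer of this kind boils down to translating each legacy request into the new platform's query. The sketch below is illustrative only: the legacy request shape (`keywords`, `page`, `page_size`) is a hypothetical stand-in for the old commercial API, while `q`, `start` and `rows` are standard Solr query parameters.

```python
def legacy_to_solr_params(legacy_request):
    """Translate a (hypothetical) legacy request dict into Solr query params."""
    keywords = legacy_request.get("keywords", [])
    page = legacy_request.get("page", 1)            # legacy pages assumed 1-based
    page_size = legacy_request.get("page_size", 10)
    return {
        # "*:*" matches all documents; real keywords would also need escaping.
        "q": " AND ".join(keywords) if keywords else "*:*",
        "start": (page - 1) * page_size,             # Solr offsets are 0-based
        "rows": page_size,
    }
```

Because every request passes through such a translation, the layer adds work on each call, which is why it is meant as a short-term measure.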

Indexing performance

● Most of the indexing time is spent converting data from the old indexing format to the new indexing format

● The adapter layer between old and new API becomes the bottleneck

● Time to switch to the new API natively

Diverging concerns

● Article authors check that the search engine handles their writings exactly as they expect, wanting perfect recall and precision
  ○ so a lot of time is spent on adjusting ranking

● Marketers want to be able to override ranking and promote something they want to sell
  ○ the ranking algorithm gets bypassed

● Need flexible algorithms
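One way to keep the ranking algorithm intact while still serving marketers is to apply promotions as a separate step after ranking. This is a minimal sketch of that idea, not the approach used in the project; `apply_promotions` and its parameters are hypothetical names.

```python
def apply_promotions(ranked_ids, promoted_ids, limit=10):
    """Place promoted documents first, then the organic ranking, without duplicates."""
    seen = set()
    merged = []
    for doc_id in list(promoted_ids) + list(ranked_ids):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged[:limit]
```

Keeping promotions outside the scoring function means authors' relevance tuning and marketers' overrides can change independently.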

Scale dynamically

● The search engine must not break, even under high peaks of load

● Such peaks are often unpredictable

● Need a fast way to add more computing capacity
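A rough capacity estimate helps decide how many search machines to add when a peak arrives. The sketch below is back-of-the-envelope arithmetic under assumptions: per-node throughput is taken from stress tests, load is spread evenly, and the 30% default headroom is an arbitrary illustrative value.

```python
import math


def search_nodes_needed(peak_qps, qps_per_node, headroom=0.3):
    """Estimate search nodes for a peak load, with spare capacity on top."""
    if qps_per_node <= 0:
        raise ValueError("qps_per_node must be positive")
    return max(1, math.ceil(peak_qps * (1 + headroom) / qps_per_node))
```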

Takeaways

● small iterations (no waterfalls!)
  ○ analyze a portion of data / queries
  ○ change search / index algorithms
  ○ test, involve stakeholders
  ○ this requires the ability to reindex quickly

● look at data (documents, query logs)
