Distributed Search - Solutions and Comparison
Ngọc Bùi
Page 1: Distributed search   solutions and comparison

Distributed Search - Solutions and Comparison

Ngọc Bùi

Page 2: Distributed search   solutions and comparison

Facts

FB:
750 million active users.
3 billion photos uploaded each month. A record 750 million photos were uploaded to FB over New Year's weekend.
14 million videos uploaded each month.
More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month.
Terabytes of log data daily.

HOW TO FIND A NEEDLE IN THAT HUGE HAYSTACK?

Page 3: Distributed search   solutions and comparison

Centralized Search – PROBLEM?

Lucene is great: a high-performance, full-featured search library.
Incremental indexing.
Boolean Query, Fuzzy Query, Range Query, Multi Phrase Query, Wildcard Query, etc.

It's great, BUT:
Slow if the index is very big.
The index can grow bigger than a single HDD.
No load balancing.
No failover.

Page 4: Distributed search   solutions and comparison

GOAL

Reliable index serving - by failover (master and nodes)

Scalable for traffic and index size by adding nodes.

Distributed TF-IDF.
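The distributed TF-IDF goal can be illustrated with a small sketch. It assumes each shard reports its document count and per-term document frequencies, and that a coordinator merges them into one global IDF so that scores from different shards are comparable. The function names and the smoothed IDF formula are assumptions made for illustration; they are not taken from the slides or from Lucene's scorer.

```python
import math
from collections import Counter

def shard_stats(docs):
    """Return (number of docs, document frequency per term) for one shard."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return len(docs), df

def global_idf(stats):
    """Merge per-shard statistics into cluster-wide IDF values."""
    total_docs = sum(n for n, _ in stats)
    total_df = Counter()
    for _, df in stats:
        total_df.update(df)
    # Smoothed IDF; the exact formula is an assumption for this sketch.
    return {t: math.log((1 + total_docs) / (1 + d)) + 1
            for t, d in total_df.items()}

def score(query, text, idf):
    """TF-IDF score of one document using the shared, global IDF."""
    tf = Counter(text.lower().split())
    return sum(tf[t] * idf.get(t, 0.0) for t in query.lower().split())

# Two hypothetical shards, each holding two tiny documents.
shards = [
    ["lucene search library", "lucene index"],
    ["hbase random access", "search cluster"],
]
idf = global_idf([shard_stats(s) for s in shards])
```

Without the merge step, each shard would compute IDF from its own documents only, and the same term could receive different weights on different nodes.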

Page 5: Distributed search   solutions and comparison

Solution:

Documents are indexed in parallel on different machines in a cluster. When a user issues a search, the query is dispatched to multiple machines in parallel.

Choices:
Katta
Elastic Search
HbaseDirectory (our choice)
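The general idea on this slide, parallel indexing plus a query that runs on every machine at once, can be sketched as follows. Threads stand in for machines, and the `Shard` class, the modulo partitioning by document id and the top-k merge are all assumptions chosen for illustration; none of the three listed systems is being reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

class Shard:
    """One node's slice of the index: a tiny in-memory inverted index."""
    def __init__(self):
        self.postings = {}  # term -> {doc_id: term frequency}

    def index(self, doc_id, text):
        for term in text.lower().split():
            row = self.postings.setdefault(term, {})
            row[doc_id] = row.get(doc_id, 0) + 1

    def search(self, term):
        row = self.postings.get(term.lower(), {})
        return [(tf, doc_id) for doc_id, tf in row.items()]

def index_parallel(shards, docs):
    """Partition documents by id and build every shard in parallel."""
    groups = [[] for _ in shards]
    for doc_id, text in docs.items():
        groups[doc_id % len(shards)].append((doc_id, text))

    def build(pair):
        shard, group = pair
        for doc_id, text in group:
            shard.index(doc_id, text)

    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        list(pool.map(build, zip(shards, groups)))

def search_parallel(shards, term, k=3):
    """Send the query to all shards at once and merge the partial hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda s: s.search(term), shards))
    hits = [hit for part in partial for hit in part]
    return [doc_id for _, doc_id in heapq.nlargest(k, hits)]

docs = {1: "lucene search", 2: "search search engine",
        3: "hbase store", 4: "distributed search"}
shards = [Shard(), Shard()]
index_parallel(shards, docs)
```

Calling `search_parallel(shards, "search")` ranks document 2 first because it contains the term twice; ties are broken by document id only because of how the tuples compare, which is an artifact of the sketch.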

Page 6: Distributed search   solutions and comparison

Katta

Katta is a distributed application running on many commodity hardware servers.

An index for Katta is a folder with a set of subfolders. Those subfolders are called index shards.

ZooKeeper, a distributed configuration and locking system, is used for master-node communication.
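The "index is a folder, shards are its subfolders" layout can be sketched in a few lines. The round-robin placement below is purely an assumption for illustration; in Katta the master decides the actual placement and coordinates it through ZooKeeper, and `list_shards` and `assign_shards` are hypothetical helper names.

```python
import os
import tempfile

def list_shards(index_dir):
    """Every subfolder of the index folder is treated as one shard."""
    return sorted(e.name for e in os.scandir(index_dir) if e.is_dir())

def assign_shards(shards, nodes, replication=2):
    """Round-robin placement (assumed policy): each shard goes to
    `replication` distinct nodes, which requires replication <= len(nodes)."""
    plan = {node: [] for node in nodes}
    for i, shard in enumerate(shards):
        for r in range(replication):
            plan[nodes[(i + r) % len(nodes)]].append(shard)
    return plan

# Build a throwaway index folder with three shard subfolders.
with tempfile.TemporaryDirectory() as index_dir:
    for name in ("shard-0", "shard-1", "shard-2"):
        os.mkdir(os.path.join(index_dir, name))
    shards = list_shards(index_dir)

plan = assign_shards(shards, ["node-a", "node-b", "node-c"])
```

With a replication factor of 2, losing any single node still leaves one copy of every shard available.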

Page 7: Distributed search   solutions and comparison
Page 8: Distributed search   solutions and comparison

Pros and Cons

Pros:
Shards are copied and distributed automatically to the slave nodes.
Supports distributing queries and aggregating results.

Cons:
No indexing support.
Incremental index updates are hard.
Resharding is too expensive.

Page 9: Distributed search   solutions and comparison

Elastic Search (www.elasticsearch.org)

Elastic Search is an open source, distributed, RESTful search engine built on top of Lucene.
Automatic shard allocation.
Automatic index sharding and index updates.
HTTP network interface for indexing, searching and administration through a purely RESTful API.
Schema free.
Integrates well with Hadoop/MapReduce.
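To make the RESTful interface concrete, the sketch below builds (but does not send) an indexing request and a search request with Python's standard library. The host, the index name `slides` and the helper names are assumptions. The URL layout follows recent Elasticsearch releases, which use `_doc`; releases from the time of these slides used a type name in that position instead.

```python
import json
from urllib.request import Request

# Assumed local node; 9200 is the default HTTP port.
BASE = "http://localhost:9200"

def index_request(index, doc_id, doc):
    """Build a PUT request that stores a JSON document under an explicit id."""
    return Request(
        f"{BASE}/{index}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

def search_request(index, field, text):
    """Build a POST request that runs a full-text match query on one field."""
    body = {"query": {"match": {field: text}}}
    return Request(
        f"{BASE}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

put = index_request("slides", 1, {"title": "Distributed Search"})
query = search_request("slides", "title", "distributed")
```

Against a running node, either request could be sent with `urllib.request.urlopen(put)`. No schema has to be declared before the first document is indexed, which is what "schema free" refers to.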

Page 10: Distributed search   solutions and comparison
Page 11: Distributed search   solutions and comparison

Behind Elastic

Page 12: Distributed search   solutions and comparison

Automatic shard allocation

There is no need for a load balancer in Elasticsearch. Each node can receive a request, and if it cannot handle the request itself, it automatically delegates it to the appropriate node(s).

If you want to scale out search, you can simply add more shards and more replicas per shard.
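The delegation behaviour described above can be modelled with a toy cluster. The sketch assumes a document's shard is chosen as a hash of its id modulo the number of primary shards, which matches the routing rule Elasticsearch documents, but it uses CRC32 as a stand-in hash, and the `Cluster` class and its copy placement are hypothetical.

```python
import zlib

class Cluster:
    def __init__(self, nodes, num_shards, replicas):
        self.nodes = nodes
        self.num_shards = num_shards
        # copies[shard] lists the nodes holding the primary and its replicas.
        # Assumed placement; needs replicas + 1 <= len(nodes) to stay distinct.
        self.copies = {
            s: [nodes[(s + r) % len(nodes)] for r in range(replicas + 1)]
            for s in range(num_shards)
        }
        self.next_copy = {s: 0 for s in range(num_shards)}

    def shard_for(self, doc_id):
        """Route by a stable hash of the id (CRC32 is a stand-in hash)."""
        return zlib.crc32(doc_id.encode("utf-8")) % self.num_shards

    def handle_get(self, receiving_node, doc_id):
        """Return the node that ends up serving a get-by-id request."""
        shard = self.shard_for(doc_id)
        holders = self.copies[shard]
        if receiving_node in holders:
            return receiving_node  # the receiving node has a copy: serve locally
        # Otherwise delegate, rotating over the copies to spread the load.
        i = self.next_copy[shard]
        self.next_copy[shard] = (i + 1) % len(holders)
        return holders[i]

cluster = Cluster(["n1", "n2", "n3", "n4"], num_shards=2, replicas=1)
```

In this model a client can send `handle_get` to any node, including one that holds no copy of the shard, and adding replicas increases the number of nodes that can answer for the same shard.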

Page 13: Distributed search   solutions and comparison

HbaseDirectory – What?

Directory

Page 14: Distributed search   solutions and comparison

HbaseDirectory – What?

Indexing Phase
Searching Phase

Directory

Page 15: Distributed search   solutions and comparison

HbaseDirectory – What?

Is Directory distributed? No, but it is not impossible.
How to distribute it? Use Directory on top of a distributed storage system.
HDFS: slow.
HBase: our choice, since it is optimized for random access, which suits the way a Lucene index is accessed.

Hbase Directory: treat HBase as a logical "Directory".
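A rough sketch of what treating a key-value store as a logical "Directory" could look like: index files are cut into fixed-size blocks keyed by file name and block number, so a read at an arbitrary offset touches only the blocks it needs. This is an assumption about one possible design, not the author's implementation; a plain dict stands in for an HBase table, and the class and method names are hypothetical rather than Lucene's `Directory` API.

```python
class KeyValueDirectory:
    """Directory-like file store on top of a key-value table.
    A plain dict stands in for an HBase table in this sketch."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # (file name, block number) -> bytes
        self.lengths = {}  # file name -> total length in bytes

    def write_file(self, name, data):
        """Store a file as consecutive fixed-size blocks."""
        for offset in range(0, len(data), self.block_size):
            key = (name, offset // self.block_size)
            self.blocks[key] = data[offset:offset + self.block_size]
        self.lengths[name] = len(data)

    def read(self, name, position, length):
        """Random-access read that only touches the blocks it needs."""
        end = min(position + length, self.lengths[name])
        out = bytearray()
        while position < end:
            block_no, start = divmod(position, self.block_size)
            block = self.blocks[(name, block_no)]
            chunk = block[start:start + (end - position)]
            out += chunk
            position += len(chunk)
        return bytes(out)

    def list_all(self):
        """Names of all stored files, like a directory listing."""
        return sorted(self.lengths)

directory = KeyValueDirectory(block_size=4)
directory.write_file("_0.tis", b"0123456789")
```

The tiny block size is only there to make the block boundaries visible; a real store would use much larger blocks.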

Page 16: Distributed search   solutions and comparison

Two Modes

Hbase Directory: lazy mode
Keep Lucene's index file structures and port them to HBase.
Only rewrite 2 classes: FSDirectory & RAMDirectory (Directory interface).

Hbase Directory: active mode
Redesign the index structure to utilize HBase's strengths.
Rewrite: the 2 above + IndexReader & IndexWriter.
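The slides do not describe the redesigned index structure, so the following is only a hypothetical example of a layout that plays to a row-oriented store: one row per term, one column per document id, and the term frequency as the cell value. A nested dict stands in for the HBase table, and `TermRowIndex` is an invented name.

```python
from collections import defaultdict

class TermRowIndex:
    """Inverted index stored one row per term (hypothetical layout).
    Row key = term, column qualifier = document id, cell = term frequency."""

    def __init__(self):
        self.table = defaultdict(dict)

    def add_document(self, doc_id, text):
        """Indexing touches only the rows of the terms in this document."""
        for term in text.lower().split():
            row = self.table[term]
            row[doc_id] = row.get(doc_id, 0) + 1

    def search(self, term):
        """A term lookup is a single row read; rank by term frequency."""
        row = self.table.get(term.lower(), {})
        return sorted(row, key=lambda d: (-row[d], d))

index = TermRowIndex()
index.add_document("d1", "hbase lucene hbase")
index.add_document("d2", "hbase")
```

Compared with lazy mode, a layout like this would let a new document be added by writing a few cells instead of producing and merging whole index files, which is one reason such a redesign also requires replacing the reader and writer.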

Page 17: Distributed search   solutions and comparison

Lucene index flow – Hbase flow

Page 18: Distributed search   solutions and comparison

Performance & Conclusion

Refer to the Excel file.
HbaseDirectory: active mode is the correct choice.
Improvement needed.