OCTOBER 11-14, 2016 • BOSTON, MA
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart

Apr 16, 2017

LucidWorks
Transcript
Page 1

OCTOBER 11-14, 2016 • BOSTON, MA

Page 2

Near Real Time Indexing
Building Real Time Search Index For E-Commerce

Umesh Prasad
Tech Lead @ Flipkart

Thejus V M
Data Architect @ Flipkart

Page 3

Agenda

•  Search @ Flipkart
•  Need for Real Time Search
•  SolrCloud Solution
•  Our approach
•  Q & A

Page 4
Page 5
Page 6

Traffic @ Flipkart

•  Peak Traffic
   –  ~800K active users
   –  ~160K requests per second

•  Search Traffic
   –  ~40K searches per second (Service)
   –  ~10K searches per second (Solr)

•  Latency
   –  Median: 11 ms
   –  99th percentile: 1.1 seconds

Page 7

Search @ Flipkart

•  Catalogue
   –  ~50 main categories
   –  ~5000 sub-categories
   –  ~231 million documents
   –  ~90 million SKUs
   –  ~160 million listings

•  E-commerce Marketplace
   –  ~100K Sellers
   –  Local Sellers
   –  Regional Availability
   –  Logistics Constraints

Page 8

E-commerce Search

•  Heavy usage of drill down filters
•  Heavy usage of faceting
•  Only top results matter
•  Results grouped/collapsed by products
•  Serviceability and delivery experience MATTERS

Page 9

Agenda

•  Search @ Flipkart
•  Need for Real Time Search
•  SolrCloud Solution
•  Our approach
•  Q & A

Page 10

Sorry, Stock Over !!?

Page 11

Damn !! Is Offer Over ??

Page 12

What !! All Steal Deals Gone ??

Page 13

Product / Listing: Important Attributes

[Diagram labels: Product aka SKU, Listings, and the services shown around them]

•  Catalogue Service
•  Seller Rating Service
•  Promise Service
•  Availability Service
•  Offer Service
•  Pricing Service

Page 14

Summary: Lucene Document

•  Product/SKU (Parent Document)
   –  Listing (Child Document)
•  Query: Mostly SKU Attributes (Free Text)
•  Filters: SKU + Listing Attributes (Drill Down)
•  Ranking: SKU + Listing Attributes (Explicit/Relevance)
•  Index Time Join aka Block Join (Best Performance)
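The parent/child layout above can be sketched as follows. This is a minimal illustration in plain Python, not Lucene code. It assumes the usual block join convention that the listings of a product are indexed contiguously, with the product as the last document of the block. The sample listings, the seller names, and the helper names `parent_of` and `products_with_listing_under` are hypothetical.

```python
# Illustrative model of an index-time (block) join. Not the Lucene API.
# Assumption: each block is stored as [listing, listing, ..., product],
# i.e. children first and the parent last, with consecutive doc IDs.

docs = [
    {"type": "listing", "seller": "S1", "price": 45000},  # doc 0
    {"type": "listing", "seller": "S2", "price": 44000},  # doc 1
    {"type": "product", "sku": "ProductA"},               # doc 2 (parent)
    {"type": "listing", "seller": "S3", "price": 23000},  # doc 3
    {"type": "product", "sku": "ProductB"},               # doc 4 (parent)
]

# One flag per doc ID marking which documents are parents.
parent_bits = [d["type"] == "product" for d in docs]


def parent_of(child_doc_id):
    """Return the doc ID of the first parent at or after child_doc_id."""
    doc_id = child_doc_id
    while not parent_bits[doc_id]:
        doc_id += 1
    return doc_id


def products_with_listing_under(max_price):
    """Match listings on price, then join each hit up to its product."""
    hits = []
    for doc_id, doc in enumerate(docs):
        if doc["type"] == "listing" and doc["price"] <= max_price:
            sku = docs[parent_of(doc_id)]["sku"]
            if sku not in hits:
                hits.append(sku)
    return hits
```

Because the join is resolved by position in the index, no lookup table is needed at query time. The cost is that a block can only be written as a whole.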

   

Page 15

Out Of Stock, but Why Show?

Index has Stale Availability Data

234K Products

Page 16

Challenge 1: High Update Rates

                    updates / sec          updates / hr
                    normal     peak
text / catalogue    ~10        ~100        ~100K
pricing             ~100       ~1K         ~10 million
availability        ~100       ~10K        ~10 million
offer               ~100       ~10K        ~10 million
seller rating       ~10        ~1K         ~1 million
signal 6            ~10        ~100        ~1 million
signal 7            ~100       ~10K        ~10 million
signal 8            ~100       ~10K        ~10 million

Page 17

Challenge 2: Micro Services

[Diagram labels: Change Propagation]

•  Sources: Catalogue, Pricing, Availability, Offers, ...
•  Updates Stream 1, Updates Stream 2, Updates Stream 3
•  Ingestion pipeline
•  Document Builder
•  Documents {L1, L2 … P1}
•  Solr/Lucene

●  Lucene doesn’t support Partial Updates
●  Update = Delete + Add
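A toy model of "Update = Delete + Add" is sketched below. It is plain Python, not Lucene. The class `ToyIndex` and its methods are hypothetical, and the document names P1, L1, L2 follow the slide. The sketch assumes an append-only store in which a change to one listing marks the old block as deleted and re-adds the whole block.

```python
# Toy append-only index: documents are never modified in place.
# All names here are illustrative; this is not the Lucene API.

class ToyIndex:
    def __init__(self):
        self.docs = []        # doc ID -> document (append-only)
        self.deleted = set()  # doc IDs marked as deleted

    def add_block(self, product, listings):
        """Append the listings followed by their product; return the new doc IDs."""
        start = len(self.docs)
        self.docs.extend(listings + [product])
        return list(range(start, len(self.docs)))

    def update_listing(self, block_ids, listing_id, **changes):
        """Change one listing by deleting the old block and re-adding all of it."""
        block = [dict(self.docs[i]) for i in block_ids]
        for doc in block:
            if doc.get("id") == listing_id:
                doc.update(changes)
        self.deleted.update(block_ids)                 # delete ...
        return self.add_block(block[-1], block[:-1])   # ... then add

    def live_docs(self):
        return [d for i, d in enumerate(self.docs) if i not in self.deleted]


index = ToyIndex()
ids = index.add_block({"id": "P1"}, [{"id": "L1", "price": 100},
                                     {"id": "L2", "price": 120}])
new_ids = index.update_listing(ids, "L1", price=90)
```

In this model a single price change turns three live documents into three deleted ones plus three new ones, which is the churn the following slides refer to.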

Page 18

Agenda

•  Search @ Flipkart
•  Need for Real Time Search
•  SolrCloud Solution
•  Our approach
•  Q & A

Page 19

SolrCloud for NRT

[Diagram labels]

•  Ingestion pipeline: Batch of documents
•  Shard Leader: Update Log (For Document Versioning), Auto commit, Soft Commit, Forward to Replica
•  Shard Replica (six shown), each with Re-open searcher

Page 20

SolrCloud Evaluation

•  Update = Delete + Add
   –  Block Join Index ⇒ Update Whole Block (Product + Listings)
•  Updated Document gets streamed to all replicas in sync
   –  Reduces indexing throughput
•  Soft commit is Not Free
   –  Soft commit ⇒ In Memory Segment
   –  Lots of Merges
   –  Huge document churn / deletes
   –  All caches still need to be re-generated
   –  Filter Cache miss especially hurts performance

Page 21

Agenda

•  Search @ Flipkart
•  Need for Real Time Index
•  SolrCloud Solution
•  Our approach
•  Q & A

Page 22

Lucene Index

[Diagram: three example documents and the structures of a Lucene Segment]

Documents
  ProductA   brand : Apple     availability : T   price : 45000
  ProductB   brand : Samsung   availability : T   price : 23000
  ProductC   brand : Apple     availability : F   price : 5000

Lucene Segment

Document ID Mappings
  0  ProductA
  1  ProductB
  2  ProductC

Posting List (Inverted Index)
  Terms                Sparse Bitsets
  brand : Apple        0, 2
  brand : Samsung      1
  availability : T     0, 1

DocValues (columnar data)
  Price   45000  23000  5000
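The segment structures on this slide can be rebuilt from the three example documents. This is a minimal sketch in plain Python, not Lucene code. The variable names and the helper `sort_by_price_desc` are hypothetical, and real posting lists and DocValues are compressed on-disk structures rather than Python lists.

```python
from collections import defaultdict

# The three example documents from the slide, in doc ID order.
docs = [
    {"id": "ProductA", "brand": "Apple", "availability": "T", "price": 45000},
    {"id": "ProductB", "brand": "Samsung", "availability": "T", "price": 23000},
    {"id": "ProductC", "brand": "Apple", "availability": "F", "price": 5000},
]

# Document ID mapping: doc ID -> product.
doc_ids = {doc_id: d["id"] for doc_id, d in enumerate(docs)}

# Posting lists (inverted index): term -> doc IDs containing the term.
postings = defaultdict(list)
for doc_id, d in enumerate(docs):
    for field in ("brand", "availability"):
        postings[(field, d[field])].append(doc_id)

# DocValues (columnar data): one value per doc ID, addressed by position.
price_column = [d["price"] for d in docs]


def sort_by_price_desc(matching_doc_ids):
    """Rank matches using the column: doc ID -> value is a positional read."""
    return sorted(matching_doc_ids, key=lambda i: price_column[i], reverse=True)
```

Matching reads the posting lists (term to documents), while sorting and faceting read the column (document to value).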

Page 23

A Typical Search Flow

[Diagram labels]

•  Stages: Query, Query Rewrite, Matching, Ranking, Faceting, Stats, Other Components, Results
•  Lucene Segment: Posting List (Inverted Index), Doc Values (Forward Index)
•  NRT Store

Example query: samsung mobiles, Offer : exchange offer, price desc
Rewritten as: category : mobiles, brand : samsung, Offer : exchange offer

Page 24

NRT Forward Index - Considerations

●  Lookup efficiency
   –  50th percentile: ~10K matches
   –  99th percentile: ~1 million matches
●  Data on Java heap
   –  Memory efficiency

 

Page 25

NRT Forward Index - Naive Implementation

[Diagram: Lookup Engine between the Lucene Segment and the NRT Forward Index]

Lucene Segment
  DocId   ProductId
  0       ProductB
  1       ProductA
  2       ProductC
  3       ProductD

NRT Forward Index
  ProductId   Availability   Price
  ProductA    True           100
  ProductB    False          150
  ProductC    False          200
  ProductD    True           250

Lookup: DocId : 3, field : price
  ProductId(3) -> ProductD
  <ProductD, price> -> 250

Latency : ~10 secs for ~1 Million lookups
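The two-step lookup on this slide can be sketched as below, using the slide's own values. This is plain Python with hypothetical names (`segment_product_ids`, `nrt_forward_index`, `lookup`). It only shows the shape of the lookup and does not attempt to reproduce the reported latency.

```python
# Lucene side: doc ID -> product ID, as stored in the segment (slide values).
segment_product_ids = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index keyed by product ID (slide values).
nrt_forward_index = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def lookup(doc_id, field):
    """Resolve a field value for a Lucene doc ID via its product ID."""
    product_id = segment_product_ids[doc_id]      # step 1: doc ID -> product ID
    return nrt_forward_index[product_id][field]   # step 2: keyed lookup by product ID
```

Every matched document pays for both steps, so a query with ~1 million matches performs ~1 million product ID retrievals and keyed lookups.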

Page 26

NRT Store - Forward Index Optimized

[Diagram: Lookup Engine between the Lucene Segment and the NRT Forward Index]

Lucene Segment
  DocId   ProductId
  0       ProductB
  1       ProductA
  2       ProductC
  3       ProductD

DocId - NrtId
  DocId   0  1  2  3
  NrtId   3  0  1  2

NRT Forward Index (Segment Independent)
  NrtId          0          1          2          3
  ProductId      ProductA   ProductC   ProductD   ProductB
  Price          100        200        250        150
  Availability   T          F          F          T
  Status         01         10         01         00

Lookup: DocId : 3, Field : price
  NrtId(3) -> 2
  Price(2) -> 250

Latency : ~100 ms for ~1 Million lookups
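The optimized lookup replaces the product ID step with two array reads. A minimal sketch with the slide's values follows. The names `doc_to_nrt`, `lookup_price`, and `apply_price_update` are hypothetical, and the use of Python's `array` module merely stands in for primitive arrays on the Java heap.

```python
from array import array

# Per-segment mapping: Lucene doc ID -> NRT ID (slide values).
doc_to_nrt = array("i", [3, 0, 1, 2])

# Segment-independent columns, indexed by NRT ID (slide values).
price = array("i", [100, 200, 250, 150])
availability = [True, False, False, True]


def lookup_price(doc_id):
    """Two positional reads: doc ID -> NRT ID -> value."""
    return price[doc_to_nrt[doc_id]]


def apply_price_update(nrt_id, new_price):
    """An update overwrites one slot; no Lucene document is rewritten."""
    price[nrt_id] = new_price
```

For example, `lookup_price(3)` follows the slide's path NrtId(3) -> 2, Price(2) -> 250. Because the columns are keyed by NRT ID rather than by Lucene doc ID, they can stay unchanged when segments change, as long as the per-segment mapping is rebuilt.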

Page 27

NRT Store Filter - PostFilter

PostFilter(Price:[100 TO 150])

Lucene Segment
  DocId   ProductId
  0       ProductB
  1       ProductA
  2       ProductC
  3       ProductD

DocId - NrtId
  DocId   0  1  2  3
  NrtId   3  0  1  2

NRT Forward Index (Segment Independent)
  NrtId          0          1          2          3
  ProductId      ProductA   ProductC   ProductD   ProductB
  Price          100        200        250        150
  Availability   T          F          F          T
  Status         01         10         01         00

Check: DocId : 3
  NrtId(3) -> 2
  Price(2) -> 250
  Don’t Delegate
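The filter on this slide can be sketched as a per-document check against the forward index. In Solr, a PostFilter supplies a DelegatingCollector that decides for each collected document whether to pass it on. The generator below only imitates that decision in plain Python, and the name `post_filter_price` is hypothetical. The data are the slide's values.

```python
# Slide values: doc ID -> NRT ID, and price indexed by NRT ID.
doc_to_nrt = [3, 0, 1, 2]
price = [100, 200, 250, 150]


def post_filter_price(candidate_doc_ids, low, high):
    """Yield only the candidates whose current price lies in [low, high]."""
    for doc_id in candidate_doc_ids:
        current = price[doc_to_nrt[doc_id]]
        if low <= current <= high:
            yield doc_id  # delegate: pass the document to the next stage
        # otherwise: don't delegate, the document is dropped here


in_range = list(post_filter_price([0, 1, 2, 3], 100, 150))
```

With the slide's data, DocId 3 resolves to a price of 250 and is not delegated, while DocIds 0 and 1 (prices 150 and 100) pass. The check reads the live value, so it reflects updates immediately, at the cost of one lookup per matched document.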

Page 28

NRT Store - Invert Index

[Diagram labels: NRT Forward Store, NRT Inverter, NRT DocIdSet Cache, NRT Filter]

Lucene Segment
  DocId   ProductId
  0       ProductB
  1       ProductA
  2       ProductC
  3       ProductD

NRT DocIdSet Cache
  Availability : T    0 3
  Offer : O1          2 3

NRT Filter
  Offer:O1 -> DocIdSet
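Inverting the forward store into cached doc ID sets might look like the sketch below. It is plain Python with hypothetical names (`invert`, `doc_id_set_cache`, `filter_with_cache`). The forward-store columns are assumed values, chosen only so that the inversion reproduces the two cache entries shown on the slide.

```python
# Per-segment mapping: Lucene doc ID -> NRT ID (as on the earlier slides).
doc_to_nrt = [3, 0, 1, 2]

# Assumed forward-store columns indexed by NRT ID (hypothetical values).
availability = [False, False, True, True]
offer = [None, "O1", "O1", None]


def invert(column, value):
    """Collect the Lucene doc IDs whose forward-store value equals `value`."""
    return [doc_id for doc_id, nrt_id in enumerate(doc_to_nrt)
            if column[nrt_id] == value]


# Cache of inverted entries, keyed by (field, value).
doc_id_set_cache = {
    ("availability", True): invert(availability, True),
    ("offer", "O1"): invert(offer, "O1"),
}


def filter_with_cache(candidate_doc_ids, key):
    """Filter by set membership instead of one forward lookup per document."""
    allowed = set(doc_id_set_cache[key])
    return [d for d in candidate_doc_ids if d in allowed]
```

A cached set is a snapshot: in this sketch, an update applied to the forward store after inversion is not visible to the filter until the entry is rebuilt.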

Page 29

Solr Integration Points

•  ValueSources
•  Filtering
   –  Custom Filter Implementation for cached DocIdSet
   –  Custom PostFilter
•  Query
   –  Wrapper over Filter
•  Custom FacetComponent

Page 30

Near Real Time Solr Architecture

[Diagram labels]

•  Sources: Catalogue, Pricing, Availability, Offers, Seller Quality
•  Ingestion pipeline
•  Kafka
•  Redis
•  Bootstrap
•  NRT Updates
•  Lucene Updates
•  Solr Master
•  Commit + Replicate + Reopen
•  Solr: NRT Forward Index, NRT Inverted store, Lucene, Others
•  Ranking, Matching, Faceting

Page 31

Accomplishments

•  Real time sorting
•  Real time filtering: PostFilter
   –  Higher latency
•  Near real time filtering: cached DocIdSet
   –  No consistency between lookup and filtering
•  Independent of Lucene commits
•  Query latency comparable to DocValues
   –  Consistent 99% performance

Page 32

Accomplishments @ Flipkart

●  Real time consumption for ~150 Signals
●  Reduction in shown out of stock products by 2X
●  Production instances of ~50K updates/second real time

Page 33

Thank you & Questions