Feeding Solr at Large Scale with Cassandra @ Cassandra Day Atlanta 2015-03-19

Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Aug 22, 2015

Page 1: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Feeding Solr at Large Scale with Cassandra

@

Cassandra Day Atlanta 2015-03-19

Page 2: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

About Me

Joseph Streeky
Manager, Search Framework Development

●  Joined Careerbuilder in 2005
●  BS Computer Science - Georgia Tech
●  Natural Language Processing - Columbia University
●  Software Engineering for SaaS - University of California, Berkeley

Page 3: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

About Me

Joshua Smith
Database Administrator III

●  Joined Careerbuilder in 2011
●  Took over management of Cassandra in 2013
●  BS Computer Science - Georgia Tech

Page 4: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

About Careerbuilder

Careerbuilder is the global leader in human capital solutions, helping companies target and attract their most important assets - their people.

●  More than 22 million unique visitors a month
●  More than 300,000 employers post more than 1 million jobs on Careerbuilder
●  Careerbuilder operates in the United States, Europe, Canada and Asia. Its sites, combined with partnerships and acquisitions, give Careerbuilder a presence in more than 55 countries worldwide.

Page 5: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

About Search @

•  1 million active jobs each month
•  60 million actively searchable resumes
•  500 globally distributed search servers (in the U.S., Europe, & the cloud)
•  Thousands of unique, dynamically generated search indexes
•  1.5 billion search documents
•  2-3 million searches an hour

Page 6: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Search Powers…

Page 7: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Feeding Solr at Large Scale with Cassandra

Page 8: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Feeding Platform Requirements

●  Volume Requirements
    o  1000+ documents / second
●  Able to scale linearly
●  Highly available
●  Easily able to deploy to multiple locations (Private datacenter vs AWS)

Page 9: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Technologies

Page 10: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Our Feeding Infrastructure

[Diagram: Feeding Stack - Hadoop, SQL, RabbitMQ, Cassandra, Processing Tier, Solr]

Page 11: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Feeding Steps

●  Content Creation - Translate to Solr Indexing Format (we use XML)
●  Shard - Determine Routing Rules related to this document
●  Batch - Group together documents that have the same routing rules for batch feeding
●  Send - Send the batch to Solr
●  Verify - Verify that Solr received the batch
●  Reprocess - For any set of documents that failed during any step of the process, we place the document(s) here for reprocessing
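
The shard, batch, send, and reprocess steps can be sketched in Python as follows. This is a minimal illustration, not Careerbuilder's implementation: the `route` rule (a byte sum modulo a shard count), the `max_batch` size, and the `send` callback are all hypothetical stand-ins for details the slides do not give.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Document = Dict[str, str]
Batch = Tuple[int, List[Document]]


def route(doc: Document, num_shards: int = 4) -> int:
    """Hypothetical routing rule: map a document ID to a shard number."""
    return sum(doc["id"].encode("utf-8")) % num_shards


def batch_by_route(
    docs: Iterable[Document],
    route_fn: Callable[[Document], int] = route,
    max_batch: int = 2,
) -> List[Batch]:
    """Group documents that share a routing key, then split into batches."""
    groups: Dict[int, List[Document]] = defaultdict(list)
    for doc in docs:
        groups[route_fn(doc)].append(doc)

    batches: List[Batch] = []
    for key in sorted(groups):
        items = groups[key]
        for start in range(0, len(items), max_batch):
            batches.append((key, items[start:start + max_batch]))
    return batches


def feed(
    batches: Iterable[Batch],
    send: Callable[[int, List[Document]], bool],
) -> List[Document]:
    """Send each batch; return documents that need reprocessing.

    `send` is assumed to return True only when the target acknowledged
    the batch (the "verify" step). Any exception counts as a failure.
    """
    failed: List[Document] = []
    for key, batch in batches:
        try:
            acknowledged = send(key, batch)
        except Exception:
            acknowledged = False
        if not acknowledged:
            failed.extend(batch)
    return failed
```

Grouping before sending means one failed request affects only documents bound for the same destination, and the returned list is what would be placed in the reprocessing queue.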

Page 12: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Our Search Infrastructure

Page 13: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Our Search Infrastructure

[Diagram: Query Load Balancers, Solr servers, Feeding Platform]

Page 14: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Feeding Solr at Large Scale with Cassandra

Page 15: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Storage in Cassandra

●  Two column families per pool
    o  Initial data
    o  Translated for Solr
●  Both have a DocumentID as the key
●  The initial data column family is the key and some number of columns based on user data
●  The Translated column family is just a DocumentID and a single content field
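
As an illustration of the two column families, the sketch below models one row of each and a translation step between them. The class names, the `translate` helper, and the use of `id` as the Solr unique-key field are assumptions; the slides only state that the translated family holds a DocumentID and a single content field, and that the indexing format is XML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class InitialRow:
    """Row in the initial-data family: a key plus user-defined columns."""
    document_id: str
    columns: Dict[str, str] = field(default_factory=dict)


@dataclass
class TranslatedRow:
    """Row in the translated family: a key plus one content field."""
    document_id: str
    content: str


def translate(row: InitialRow) -> TranslatedRow:
    """Render the user columns as a Solr XML <add><doc> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Assumed unique-key field name; a real schema may use another name.
    id_field = ET.SubElement(doc, "field", {"name": "id"})
    id_field.text = row.document_id
    for name, value in row.columns.items():
        column_field = ET.SubElement(doc, "field", {"name": name})
        column_field.text = value
    xml_text = ET.tostring(add, encoding="unicode")
    return TranslatedRow(row.document_id, xml_text)
```

Storing the rendered XML as a single field means a refeed can read the translated row and send it without repeating the translation.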

Page 16: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Read and Writing

●  Quorum Read
●  Quorum Write
●  Needed for our specific use case: if we fail to read the newest data, we have no automated way to recover
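
The reason quorum reads paired with quorum writes return the newest data is an overlap argument: if the number of replicas read plus the number written exceeds the replication factor, every read touches at least one replica that holds the latest write. A small sketch of that arithmetic, using the RF = 3 reported for the production ring (the function names are illustrative only):

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a quorum: a strict majority."""
    return replication_factor // 2 + 1


def read_overlaps_write(
    replication_factor: int, read_replicas: int, write_replicas: int
) -> bool:
    """True when every read set must intersect every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)  # 2 replicas out of 3
print(q, read_overlaps_write(rf, q, q))  # quorum/quorum: 2 + 2 > 3
print(read_overlaps_write(rf, 1, q))     # ONE read + quorum write: 1 + 2 = 3
```

With RF = 3, a quorum is 2 replicas, so quorum reads and writes overlap, while a single-replica read paired with a quorum write does not guarantee it.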

Page 17: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Cassandra Ring Specs

●  3 Node ring test
●  21 Nodes Production
    o  56 TB
    o  Datastax Cassandra
    o  Version 2.0.5
    o  Vnodes
    o  RF = 3
    o  4K write/s - 3K read/s

Page 18: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Cassandra Node Specs

●  Dell R620
●  2 x E5-2630 V2
●  2.60 GHz CPU
●  128 GB RAM
●  3 x 1.6TB SAS SSD in RAID 5

Page 19: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Pre 1.2 Performance Stuff

●  Compaction Fun
●  Cold Read problem
●  Garbage collection

Page 20: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Compaction Fun

●  Cassandra version 1.0.8
●  Single threaded compaction
●  Eating up heap space until OOM error
●  JNA not installed
●  Reduced memtable and cache size
●  Increased the heap size to 12 GB

Page 21: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

Cold Read Problem

●  Refeeding involves all documents
●  Each row will be read multiple times
●  Cold reads mean lots of seeks
●  Spinning disks HATE seeks

Page 22: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

SSD

●  Nightly repair times decreased from 23 hours to less than 3 hours
●  Write latency decreased from 15 ms to 2.4 ms
●  cassandra.yaml
    o  concurrent_reads: 96
    o  concurrent_writes: 192

Page 23: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra

AWS vs Private

●  Combination of AMI and Chef to configure
●  R3.XLarge with EBS optimized
    o  4 vCPU, 30 GB RAM, Provisioned IOPS
●  RF = 3
●  2 Availability zones for high availability and local quorum
●  Comparable performance to local datacenter
●  Currently deploying version 2.0.12

Page 24: Cassandra Day Atlanta 2015: Feeding Solr at Large Scale with Apache Cassandra