An Introduction to Distributed Search with DataStax Enterprise Search

TOO BIG TO FAIL

An Introduction to Distributed Search with Cassandra and Solr

OpenSource Connections

@PatriciaGorla

pgorla@o19s.com

ABOUT ME

Systems Analyst

Programming

Information Retrieval

CASSANDRA

Created at Facebook to power inbox search

Distributed data store run on commodity servers

Highly available

No single point of failure

WHO USES CASSANDRA?

SEARCH + CASSANDRA, 1

• First implementation: Solandra (originally Lucandra)

• Replaced Lucene index with Cassandra column families

SEARCH + CASSANDRA, 2

• DataStax Enterprise Search

• Uses native Lucene index

• All data is retrieved from Cassandra

DataStax Enterprise Search Cluster

• Distributed

• Linearly scalable

• Highly available

• Eventually consistent

• Full-text search

• Aggregation

SETTING UP THE SCHEMA

<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="body" type="text" indexed="true" stored="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="date" type="string" indexed="true" stored="true"/>
</fields>

WRITING TO CLUSTER

• Write through either Cassandra clients or the Solr API

• The write process is the same either way

• True atomic updates to Cassandra

Each Cassandra node is responsible for a range of row-key hash values.

Data can be written directly to Cassandra

Data is distributed according to row key hash and replication factor

DSE first writes to Cassandra

And then updates the secondary index on Solr

The quorum responds with success / failure

Data is now distributed evenly
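The write path above can be sketched in a few lines of Python. This is a toy model, not DSE's implementation: the `Ring` class, the node names, and the use of MD5 as the row-key hash are all assumptions made for illustration. It shows how a row-key hash and a replication factor together pick the nodes that store a row, and how many of them make up a quorum.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a key to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: each node owns the hash range ending at its token."""

    def __init__(self, node_names, replication_factor):
        if replication_factor > len(node_names):
            raise ValueError("replication factor exceeds number of nodes")
        self.rf = replication_factor
        # Assumption: a node's token is simply the hash of its name.
        self.tokens = sorted((token(name), name) for name in node_names)

    def replicas(self, row_key: str):
        """Return the nodes that store row_key, walking clockwise."""
        positions = [tok for tok, _ in self.tokens]
        # First node whose token is >= the key's hash, wrapping past the end.
        start = bisect.bisect_left(positions, token(row_key)) % len(self.tokens)
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(self.rf)
        ]


def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a quorum."""
    return replication_factor // 2 + 1


ring = Ring(["node1", "node2", "node3", "node4"], replication_factor=3)
owners = ring.replicas("user:42")
print(owners, "quorum =", quorum(ring.rf))
```

With a replication factor of 3, a write lands on three distinct nodes and two of them must acknowledge it for a quorum.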

READING FROM CLUSTER

• Read either Cassandra-side or through Solr API

• Cassandra: fast reads*

• Solr: full-text search

• Read direction affects performance

• Data is stored in Cassandra

Query is sent to node

Node uses the cluster state it learned through gossip to find who has the information

QUERYING CASSANDRA

• Can query Solr or Cassandra directly

• CQL syntax is limited, but the solr_query parameter lets a CQL query carry a Solr query
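As a rough illustration of the solr_query parameter, the sketch below builds the CQL text for a Solr-backed query. The helper name `solr_query_cql` and the `blog.posts` keyspace and table are hypothetical. The `title` field comes from the schema shown earlier. It only assembles a string and does not talk to a cluster.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def solr_query_cql(keyspace: str, table: str, solr_query: str) -> str:
    """Build a CQL SELECT that passes a Solr query via solr_query.

    Hypothetical helper: keyspace and table names are validated, and
    single quotes in the Solr query are doubled, as CQL string
    literals require.
    """
    for name in (keyspace, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"invalid identifier: {name!r}")
    escaped = solr_query.replace("'", "''")
    return f"SELECT * FROM {keyspace}.{table} WHERE solr_query = '{escaped}';"


# Assumed keyspace and table names, for illustration only.
print(solr_query_cql("blog", "posts", "title:cassandra"))
```

The resulting statement can then be run through whichever Cassandra client is already in use.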

Querying Cassandra directly

Cassandra retrieves information from column family

Querying Solr index

Row-key hashes are stored in Solr, and Cassandra is queried for stored data

The Cassandra node sends the request to the node with the corresponding hash, which returns the information

Data is always synced

Both nodes respond with information

Updates can be committed and searched over in real time
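The Solr-side read described above (the index holds row keys, the stored data comes from Cassandra) can be modelled with two dictionaries. This is a single-process sketch under heavy assumptions: `SearchNode`, its whitespace tokenizer, and the sample rows are invented for illustration and stand in for the Lucene index and the Cassandra column family.

```python
from collections import defaultdict


class SearchNode:
    """Toy model of a search node: a row store plus an index of row keys."""

    def __init__(self):
        self.rows = {}                 # stands in for the Cassandra column family
        self.index = defaultdict(set)  # stands in for the index: term -> row keys

    @staticmethod
    def _terms(row):
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for value in row.values():
            yield from value.lower().split()

    def write(self, row_key, row):
        """Write the row to the store first, then update the index."""
        old = self.rows.get(row_key)
        if old is not None:
            # Drop index entries left over from the previous version.
            for term in self._terms(old):
                self.index[term].discard(row_key)
        self.rows[row_key] = row
        for term in self._terms(row):
            self.index[term].add(row_key)

    def search(self, term):
        """Look up row keys in the index, then fetch the rows from the store."""
        keys = sorted(self.index.get(term.lower(), ()))
        return [self.rows[key] for key in keys]


node = SearchNode()
node.write("1", {"title": "Intro to Cassandra", "body": "distributed store"})
node.write("2", {"title": "Solr search", "body": "full-text search over Cassandra"})
print([row["title"] for row in node.search("cassandra")])
```

Because the index only holds row keys, a rewritten row is visible to search as soon as its index entries are updated, which mirrors the store-first, index-second order on the write slides.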

PRODUCTION USE

• Production clusters will want a mix of analytics and search nodes

An OLTP - OLAP integrated solution

TRADEOFFS

• Changing the Solr schema requires reindex (standard for Solr)

• No multi-valued fields or composite columns

@PatriciaGorla

pgorla@o19s.com

o19s.com/blog
