Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Lessons Learned with Cassandra & Spark at the USPTO

Christopher Bradford

• DataStax Certified Cassandra Architect
• Contributed to CQLEngine, a Python C* ORM
• Developed Trireme, a migration engine for Cassandra & DSE
• Created the world’s smallest C* cluster

Twitter: @bradfordcp
GitHub: bradfordcp

OpenSource Connections

• Consulting firm based in Charlottesville, Virginia
• Founded in 2005
• Focused on Search in 2010, specifically Solr and Lucene
• Delivering Cassandra Consulting since 2012
• DataStax Gold Partner
• Great with Search, Analytics and Discovery

OpenSource Connections

Blog: http://o19s.com/blog/
Twitter: @o19s
GitHub: o19s

Exploring Search Technologies

Technologies

Architecture

Architecture – Data Layer

Data

[Chart: Patent Applications & Grants. X-axis: year, 1963 to 2013. Y-axis: count, 0 to 700,000. Series: Applications, Grants.]

WHERE Clauses

WHERE

YOU KEEP USING THAT CLAUSE. I DON’T THINK YOU KNOW WHAT THAT MEANS.

CQL vs SQL: WHERE

 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764

SQL
    SELECT * FROM names WHERE rank = 59290;
    last | VRANES | 59290

CQL
    SELECT * FROM names WHERE rank = 59290;
    InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "

    CREATE TABLE names (
        type VARCHAR,
        name VARCHAR,
        rank INT,
        PRIMARY KEY ((type, name))
    );

CQL vs SQL: WHERE



CQL
    SELECT * FROM names WHERE type = 'last' AND name = 'VRANES';
    last | VRANES | 59290

    CREATE TABLE names (
        type VARCHAR,
        name VARCHAR,
        rank INT,
        PRIMARY KEY ((type, name))
    );

CQL vs SQL: Tables

 rank  | type | name
-------+------+----------
 25067 | last | STOBAUGH
 65304 | last | BRUDNER
 12517 | last | SKLAR
 59290 | last | VRANES
 34764 | last | SCHRODT

SQL
    SELECT * FROM names_by_rank WHERE rank = 59290;
    last | VRANES | 59290

CQL
    SELECT * FROM names_by_rank WHERE rank = 59290;
    last | VRANES | 59290

Tables: names, names_by_rank
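The two-table idea can be sketched in plain Python. This is a toy in-memory model, not Cassandra and not a driver API: each dict stands in for a table keyed by its partition key, and `insert_name` is a hypothetical helper that writes every row to both tables.

```python
# Toy model: each dict stands in for a Cassandra table keyed by its partition key.
names = {}          # partition key: (type, name)
names_by_rank = {}  # partition key: rank


def insert_name(type_, name, rank):
    """Write the same row to both tables (hypothetical helper)."""
    row = (type_, name, rank)
    names[(type_, name)] = row
    names_by_rank[rank] = row


for type_, name, rank in [
    ("last", "STOBAUGH", 25067),
    ("last", "BRUDNER", 65304),
    ("last", "SKLAR", 12517),
    ("last", "VRANES", 59290),
    ("last", "SCHRODT", 34764),
]:
    insert_name(type_, name, rank)

# Each query reads from the table whose key matches its WHERE clause.
print(names[("last", "VRANES")])
print(names_by_rank[59290])
```

The cost is one extra write per row; in exchange, both lookups are single-key reads.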

CQL vs SQL: Indexes


CQL
    SELECT * FROM names WHERE rank = 59290;
    last | VRANES | 59290


CREATE INDEX ON names (rank);

CQL vs SQL: Recap

• Consider multiple tables with data models that support fast, efficient, querying.

• Remember that writes are extremely fast in C*. Writing to multiple tables is not necessarily a bad thing.

• Build an index table
    • Your model may support building an inverted index for lookups of record ids.

• Use secondary indexes***
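As a rough illustration of the index-table bullet, here is a plain-Python sketch of an inverted index that maps a term to record ids, followed by a second read that fetches the records. The record ids, fields, and the `lookup` helper are made up for the example; in Cassandra the index would be its own table, written alongside the main one.

```python
from collections import defaultdict

# Toy records keyed by id; the ids and fields are made up for illustration.
records = {
    "p1": {"title": "solar panel mount", "year": 2014},
    "p2": {"title": "solar water heater", "year": 2015},
    "p3": {"title": "panel hinge", "year": 2015},
}

# Index table: term -> set of record ids (an inverted index).
ids_by_term = defaultdict(set)
for record_id, record in records.items():
    for term in record["title"].split():
        ids_by_term[term].add(record_id)


def lookup(term):
    """Two-step read: index table first, then the records by id."""
    return [records[record_id] for record_id in sorted(ids_by_term.get(term, ()))]


print(sorted(ids_by_term["solar"]))
print([r["title"] for r in lookup("panel")])
```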

Cluster Balancing

Unbalanced Cluster Symptoms

• Certain nodes shutting down mid-way through ingestion
• Data is not cleanly distributed across the cluster

Unbalanced Cluster Causes

• Data Model – check your partitions!

• Configuration – how are your tokens split amongst the nodes?

• Hardware – is the server configured correctly?

Data

[Chart repeated: Patent Applications & Grants. X-axis: year, 1963 to 2013. Y-axis: count, 0 to 700,000. Series: Applications, Grants.]

Balancing: Data Model

CREATE TABLE images (
    year INT,
    id TEXT,
    page TEXT,
    image BLOB,
    PRIMARY KEY (year, id, page)
);

SELECT * FROM images WHERE year = 2015;

Sample unbalanced model

Balancing: Data Model

CREATE TABLE images (
    year INT,
    month INT,
    id TEXT,
    page INT,
    image BLOB,
    PRIMARY KEY ((year, month), id, page)
);

SELECT * FROM images WHERE year = 2015 AND month IN (1,…);

Switch partition key to use multiple fields instead of just year.
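To see why the composite key helps, the following plain-Python sketch counts rows per partition under both keys. The yearly row counts are hypothetical, and spreading each year evenly over twelve months is a simplifying assumption; nothing here talks to Cassandra.

```python
from collections import Counter

# Hypothetical row counts per year, skewed toward recent years.
rows_per_year = {2012: 120_000, 2013: 180_000, 2014: 240_000, 2015: 360_000}

by_year = Counter()        # partition key: year
by_year_month = Counter()  # partition key: (year, month)
for year, count in rows_per_year.items():
    by_year[year] += count
    # Assume rows are spread evenly over the 12 months (a simplification).
    for month in range(1, 13):
        by_year_month[(year, month)] += count // 12

print("partitions:", len(by_year), "->", len(by_year_month))
print("largest partition:", max(by_year.values()), "->", max(by_year_month.values()))
```

More, smaller partitions give the cluster more units to spread across nodes. The trade-off is that reading a whole year now touches twelve partitions, which is what the `month IN (1,…)` query does.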

Balancing: Configuration

Virtual Nodes?

Source: http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html

Hardware

Hardware

• Understand the type of hardware Cassandra runs well on.

LOCAL STORAGE

NETWORK STORAGE

Data Ingestion

Data Loading Performance

• Did it work?
• Why change it?
• How could we make it better?

Spark Data Loading

Collecting Metrics

Metrics

• GitHub: dropwizard/metrics

• Awesome Java library for collecting metrics in your code

• Demo later
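The kind of measurement such a library provides can be sketched with a small timer. This is a Python stand-in written for illustration, not the Dropwizard Metrics API (which is Java); the `Timer` class and its `timed` method are hypothetical names.

```python
import time
from contextlib import contextmanager


class Timer:
    """Minimal stand-in for a metrics timer (not the Dropwizard API)."""

    def __init__(self):
        self.count = 0
        self.total_seconds = 0.0

    @contextmanager
    def timed(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record one event and its duration, even if the block raised.
            self.count += 1
            self.total_seconds += time.perf_counter() - start

    def mean_seconds(self):
        return self.total_seconds / self.count if self.count else 0.0


push_timer = Timer()
for row in range(3):
    with push_timer.timed():
        sum(range(1000))  # placeholder for the work being measured

print(push_timer.count, push_timer.mean_seconds())
```

Wrapping the per-row or per-partition work in a timer like this is what makes it possible to compare the variants that follow with numbers rather than impressions.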

Poor Performance

joinedRDD = …
joinedRDD.foreach()
    document = … // build document
    sc = new SolrConnection()
    sc.push(document)
    sc.disconnect()
// Job is done


Optimum Performance

joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
    document = … // build document
    sc.push(document)
sc.disconnect()
// Job is done

Scope

Scope: Review

joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
    document = … // build document
    sc.push(document)
sc.disconnect()
// Job is done

Scope: ERROR

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
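The error arises because the connection is created on the driver and then referenced inside the function that Spark must ship to the workers. A Python analogue, using `pickle` as a stand-in for the JVM serialization Spark uses, shows the same failure mode. `FakeConnection` and `is_serializable` are hypothetical names, and the lock merely stands in for a socket or other live resource.

```python
import pickle
import threading


class FakeConnection:
    """Hypothetical connection; the lock stands in for a socket or other live resource."""

    def __init__(self):
        self._lock = threading.Lock()


def is_serializable(obj):
    """Return True if obj survives pickling, the way a shipped task would have to."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


print(is_serializable({"id": "doc-1"}))   # plain data
print(is_serializable(FakeConnection()))  # live resource
```

Plain data can be shipped; an object holding a live resource cannot, so it has to be created where it is used.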

Scope: Fixed!

joinedRDD = …
joinedRDD.foreachPartition()
    sc = new SolrConnection()
    partition.foreach()
        document = …
        sc.push(document)
// Job is done
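The difference between the per-row and per-partition variants can be counted with a toy model. The sketch below uses nested lists as partitions and a hypothetical `FakeSolrConnection` that only counts how often it is opened; it is not Spark or a Solr client.

```python
class FakeSolrConnection:
    """Hypothetical stand-in that only counts how often it is opened."""

    opened = 0

    def __init__(self):
        FakeSolrConnection.opened += 1
        self.pushed = []

    def push(self, document):
        self.pushed.append(document)

    def disconnect(self):
        pass


def push_per_row(partitions):
    for partition in partitions:
        for row in partition:
            conn = FakeSolrConnection()  # one connection per row
            conn.push({"id": row})
            conn.disconnect()


def push_per_partition(partitions):
    for partition in partitions:
        conn = FakeSolrConnection()  # one connection per partition
        for row in partition:
            conn.push({"id": row})
        conn.disconnect()


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

FakeSolrConnection.opened = 0
push_per_row(partitions)
per_row = FakeSolrConnection.opened

FakeSolrConnection.opened = 0
push_per_partition(partitions)
per_partition = FakeSolrConnection.opened

print(per_row, per_partition)
```

With nine rows in three partitions, the per-row variant opens nine connections and the per-partition variant opens three. The connection is created inside the partition function, so nothing unserializable has to leave the driver.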

Know your API

Java RDD != Scala RDD

APIs: mapPartitions()

joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    return partition.rows

APIs: Transformations & Actions

• Transformations: lazily executed; the code does not run until an action is applied.
    • Ex: map

• Actions: operate on RDD elements and return to the driver.
    • Ex: foreach
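Lazy execution can be demonstrated without Spark, because Python's built-in `map` is also lazy. The sketch below is an analogy only: `map` plays the transformation and `list` plays the action, and `build_document` is a hypothetical function with a visible side effect.

```python
built = []


def build_document(row):
    built.append(row)  # side effect so we can see when the function runs
    return {"id": row}


rows = [1, 2, 3]

# "Transformation": Python's built-in map is lazy, so nothing has run yet.
documents = map(build_document, rows)
ran_before_action = list(built)

# "Action": consuming the iterator forces the work to happen.
results = list(documents)

print(ran_before_action, built, results)
```

If the side effect (such as pushing to Solr) lives inside a transformation, it will not happen until some action consumes the result.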

APIs: mapPartitions()

joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    return partition.rows
.collect()

Understand how data is passed around

Memory: Solution

joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    return partition.rows.length
.collect()
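A toy model makes the memory difference concrete. Here `map_partitions` is a hypothetical stand-in for an RDD's mapPartitions followed by collect, with nested lists as partitions; the two partition functions differ only in what they send back to the driver.

```python
def map_partitions(partitions, func):
    """Apply func to each partition; a toy stand-in for mapPartitions plus collect."""
    return [func(partition) for partition in partitions]


def push_and_return_rows(partition):
    return list(partition)  # every row travels back to the driver


def push_and_return_count(partition):
    return len(partition)   # only one number per partition travels back


partitions = [[{"id": i} for i in range(1000)] for _ in range(4)]

collected_rows = map_partitions(partitions, push_and_return_rows)
collected_counts = map_partitions(partitions, push_and_return_count)

print(sum(len(rows) for rows in collected_rows))  # objects held by the driver
print(collected_counts, sum(collected_counts))
```

Returning a count still triggers the work in every partition, but the driver receives four integers instead of four thousand rows.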

How did it go?

Demo

Questions?

Thank You

© 2015. All Rights Reserved.
