Lessons Learned with Cassandra & Spark at the USPTO
Feb 20, 2017
Christopher Bradford
• DataStax Certified Cassandra Architect
• Contributed to CQLEngine - Python C* ORM
• Developed Trireme - a migration engine for Cassandra & DSE
• Created the world’s smallest C* cluster
Twitter: @bradfordcp
GitHub: bradfordcp
OpenSource Connections
• Consulting firm based in Charlottesville, Virginia
• Founded in 2005
• Focused on Search in 2010, specifically Solr and Lucene
• Delivering Cassandra Consulting since 2012
• DataStax Gold Partner
• Great with Search, Analytics and Discovery
OpenSource Connections
Blog: http://o19s.com/blog/
Twitter: @o19s
GitHub: o19s
Exploring Search Technologies
Technologies
Architecture
Architecture – Data Layer
Data
[Chart: Patent Applications & Grants per year. x-axis: year, ticks from 1963 to 2013; y-axis: count, 0 to 700,000; series: Applications, Grants. Over the plotted range, applications grow from 90,982 to 615,243 and grants from 48,971 to 326,033.]
WHERE Clauses
WHERE
YOU KEEP USING THAT CLAUSE
I DON’T THINK YOU KNOW WHAT THAT MEANS
CQL vs SQL: WHERE
type | name     | rank
-----+----------+-------
last | STOBAUGH | 25067
last | BRUDNER  | 65304
last | SKLAR    | 12517
last | VRANES   | 59290
last | SCHRODT  | 34764

CREATE TABLE names (
    type VARCHAR,
    name VARCHAR,
    rank INT,
    PRIMARY KEY ((type, name))
);

SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290

CQL
SELECT * FROM names WHERE rank = 59290;
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "
CQL vs SQL: WHERE
type | name     | rank
-----+----------+-------
last | STOBAUGH | 25067
last | BRUDNER  | 65304
last | SKLAR    | 12517
last | VRANES   | 59290
last | SCHRODT  | 34764

CREATE TABLE names (
    type VARCHAR,
    name VARCHAR,
    rank INT,
    PRIMARY KEY ((type, name))
);

SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290

CQL
SELECT * FROM names WHERE type = 'last' AND name = 'VRANES';
last | VRANES | 59290
CQL vs SQL: Tables
names
type | name     | rank
-----+----------+-------
last | STOBAUGH | 25067
last | BRUDNER  | 65304
last | SKLAR    | 12517
last | VRANES   | 59290
last | SCHRODT  | 34764

names_by_rank
rank  | type | name
------+------+----------
25067 | last | STOBAUGH
65304 | last | BRUDNER
12517 | last | SKLAR
59290 | last | VRANES
34764 | last | SCHRODT

SQL
SELECT * FROM names_by_rank WHERE rank = 59290;
last | VRANES | 59290

CQL
SELECT * FROM names_by_rank WHERE rank = 59290;
last | VRANES | 59290
CQL vs SQL: Indexes
type | name     | rank
-----+----------+-------
last | STOBAUGH | 25067
last | BRUDNER  | 65304
last | SKLAR    | 12517
last | VRANES   | 59290
last | SCHRODT  | 34764

CREATE INDEX ON names (rank);

SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290

CQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290
CQL vs SQL: Recap
• Consider multiple tables with data models that support fast, efficient querying.
• Remember that writes are extremely fast in C*. Writing to multiple tables is not necessarily a bad thing.
• Build an index table
  • Your model may support building an inverted index for lookups of record ids.
• Use secondary indexes***
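The "multiple tables" advice can be sketched without a cluster. This is a minimal in-memory model, assuming each Cassandra table behaves like a map from partition key to row. The class `NamesStore` and its methods are hypothetical and are not part of any Cassandra driver; the model also assumes ranks are unique, which a real `names_by_rank` table would need to handle with clustering columns if they were not.

```python
from typing import Dict, Optional, Tuple

Row = Tuple[str, str, int]  # (type, name, rank)


class NamesStore:
    """Toy model: one dict per table, keyed by that table's partition key."""

    def __init__(self) -> None:
        self.names: Dict[Tuple[str, str], Row] = {}  # PRIMARY KEY ((type, name))
        self.names_by_rank: Dict[int, Row] = {}      # assumed PRIMARY KEY (rank)

    def insert(self, type_: str, name: str, rank: int) -> None:
        row = (type_, name, rank)
        # Write the same row to both tables; each write is a cheap keyed put.
        self.names[(type_, name)] = row
        self.names_by_rank[rank] = row

    def by_name(self, type_: str, name: str) -> Optional[Row]:
        return self.names.get((type_, name))

    def by_rank(self, rank: int) -> Optional[Row]:
        return self.names_by_rank.get(rank)


store = NamesStore()
for name, rank in [("STOBAUGH", 25067), ("BRUDNER", 65304), ("SKLAR", 12517),
                   ("VRANES", 59290), ("SCHRODT", 34764)]:
    store.insert("last", name, rank)

print(store.by_rank(59290))
print(store.by_name("last", "VRANES"))
```

Each query is answered by a single keyed lookup, at the cost of one extra write per row.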
Cluster Balancing
Unbalanced Cluster Symptoms
• Certain nodes shutting down midway through ingestion
• Data is not cleanly distributed across the cluster
Unbalanced Cluster Causes
• Data Model – check your partitions!
• Configuration – how are your tokens split amongst the nodes?
• Hardware – is the server configured correctly?
Data
[Chart: Patent Applications & Grants per year. x-axis: year, ticks from 1963 to 2013; y-axis: count, 0 to 700,000; series: Applications, Grants. Yearly volume is far from uniform: applications grow from 90,982 to 615,243 over the plotted range.]
Balancing: Data Model
CREATE TABLE images (
    year INT,
    id TEXT,
    page TEXT,
    image BLOB,
    PRIMARY KEY (year, id, page)
);

SELECT * FROM images WHERE year = 2015;
Sample unbalanced model
Balancing: Data Model
CREATE TABLE images (
    year INT,
    month INT,
    id TEXT,
    page INT,
    image BLOB,
    PRIMARY KEY ((year, month), id, page)
);

SELECT * FROM images WHERE year = 2015 AND month IN (1,…);
Switch partition key to use multiple fields instead of just year.
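The effect of the wider partition key can be illustrated by counting rows per partition. This is a rough sketch with made-up data: the row counts are hypothetical and assume an even spread across months, which real data will not have.

```python
from collections import Counter


def partition_sizes(rows, key):
    """Count how many rows land in each partition for a given partition-key function."""
    sizes = Counter()
    for row in rows:
        sizes[key(row)] += 1
    return sizes


# Hypothetical sample: 3 years x 12 months x 100 rows, as (year, month, id) tuples.
rows = [
    (year, month, f"id-{i}")
    for year in (2013, 2014, 2015)
    for month in range(1, 13)
    for i in range(100)
]

by_year = partition_sizes(rows, key=lambda r: r[0])
by_year_month = partition_sizes(rows, key=lambda r: (r[0], r[1]))

print(len(by_year), max(by_year.values()))
print(len(by_year_month), max(by_year_month.values()))
```

More, smaller partitions give the cluster more units to spread across nodes, and no single partition grows with a whole year of data.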
Balancing: Configuration
Virtual Nodes?
Source: http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
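One way to build intuition for virtual nodes is to simulate token ownership. The sketch below is a toy model, not Cassandra's partitioner: it assumes a ring of size 2^64, assigns tokens at random, and the function name `ownership` and the node names are made up for illustration.

```python
import random
from typing import Dict, List, Tuple

RING_SIZE = 2 ** 64  # toy token space; an assumption of this sketch


def ownership(nodes: List[str], tokens_per_node: int, seed: int = 42) -> Dict[str, float]:
    """Return the fraction of the ring each node owns under random token assignment."""
    rng = random.Random(seed)
    ring: List[Tuple[int, str]] = sorted(
        (rng.randrange(RING_SIZE), node)
        for node in nodes
        for _ in range(tokens_per_node)
    )
    owned = {node: 0 for node in nodes}
    for i, (token, node) in enumerate(ring):
        previous = ring[i - 1][0]  # i == 0 wraps around to the last token
        owned[node] += (token - previous) % RING_SIZE
    return {node: size / RING_SIZE for node, size in owned.items()}


nodes = ["node1", "node2", "node3", "node4", "node5", "node6"]
for tokens in (1, 256):
    shares = ownership(nodes, tokens)
    print(tokens, round(min(shares.values()), 3), round(max(shares.values()), 3))
```

Comparing the smallest and largest share for 1 token per node against 256 tokens per node shows how evenly each setup splits the ring in this model.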
Hardware
Hardware
• Understand the type of hardware Cassandra runs well on.
LOCAL STORAGE
NETWORK STORAGE
Data Ingestion
Data Loading Performance
• Did it work?
• Why change it?
• How could we make it better?
Spark Data Loading
Collecting Metrics
Metrics
• GitHub: dropwizard/metrics
• Awesome Java library for collecting metrics in your code
• Demo later
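The idea of timing sections of code can be shown with a small stand-in. This is not the dropwizard/metrics API (that library is Java); it is a standard-library Python analog, and `TimerRegistry`, `timer`, and the metric name `solr.push` are hypothetical names chosen for the sketch.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class TimerRegistry:
    """Hypothetical stand-in for a metrics registry: named timers that record durations."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = {}

    @contextmanager
    def timer(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the timed block raises.
            self.samples.setdefault(name, []).append(time.perf_counter() - start)

    def summary(self, name: str) -> Dict[str, float]:
        values = self.samples[name]
        return {"count": len(values), "mean_s": sum(values) / len(values), "max_s": max(values)}


registry = TimerRegistry()
for _batch in range(3):
    with registry.timer("solr.push"):
        sum(range(10_000))  # placeholder for the real work being measured

print(registry.summary("solr.push"))
```

Wrapping each stage of a load job this way makes it possible to compare runs before and after a change instead of guessing.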
Poor Performance
joinedRDD = …
joinedRDD.foreach()
    document = … // build document
    sc = new SolrConnection()
    sc.push(document)
    sc.disconnect()
// Job is done
Optimum Performance
joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
    document = … // build document
    sc.push(document)
sc.disconnect()
// Job is done
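The cost difference between the two versions comes down to how many connections are opened. The sketch below counts them in a single process. `FakeSolrConnection` is a hypothetical stand-in for the slides' `SolrConnection`, and nothing here talks to Solr or Spark.

```python
class FakeSolrConnection:
    """Hypothetical stand-in for SolrConnection; counts how often it is opened."""

    opened = 0

    def __init__(self) -> None:
        FakeSolrConnection.opened += 1
        self.pushed = []

    def push(self, document) -> None:
        self.pushed.append(document)

    def disconnect(self) -> None:
        pass


def load_per_record(rows) -> None:
    # "Poor performance": a new connection for every record.
    for row in rows:
        conn = FakeSolrConnection()
        conn.push({"id": row})
        conn.disconnect()


def load_shared(rows) -> None:
    # "Optimum performance": one connection reused for all records.
    conn = FakeSolrConnection()
    for row in rows:
        conn.push({"id": row})
    conn.disconnect()


rows = list(range(1000))

FakeSolrConnection.opened = 0
load_per_record(rows)
per_record_opens = FakeSolrConnection.opened

FakeSolrConnection.opened = 0
load_shared(rows)
shared_opens = FakeSolrConnection.opened

print(per_record_opens, shared_opens)
```

In a single process the shared connection is a pure win. In Spark the loop body runs on executors, so where the connection is created starts to matter.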
Scope
Scope: Review
joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
    document = … // build document
    sc.push(document)
sc.disconnect()
// Job is done
Scope: ERROR
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Scope: Fixed!
joinedRDD = …
joinedRDD.foreachPartition()
    sc = new SolrConnection()
    partition.foreach()
        document = …
        sc.push(document)
// Job is done
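The "Task not serializable" error and its fix can be mimicked in plain Python. This sketch uses `pickle` as an analog for the serialization Spark applies to anything captured from the driver. `FakeSolrConnection`, `can_ship`, and `push_partition` are hypothetical names, and the lock stands in for a socket or other live resource.

```python
import pickle
import threading


class FakeSolrConnection:
    """Hypothetical connection; the lock stands in for a socket or other live resource."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.pushed = []

    def push(self, document) -> None:
        with self._lock:
            self.pushed.append(document)


def can_ship(obj) -> bool:
    """Return True if the object survives serialization, as a shipped task would need."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


def push_partition(partition) -> int:
    conn = FakeSolrConnection()  # created where the work runs, never serialized
    count = 0
    for row in partition:
        conn.push({"id": row})
        count += 1
    return count


partitions = [[1, 2, 3], [4, 5], [6]]
print(can_ship(FakeSolrConnection()))
print(can_ship(partitions[0]))
print([push_partition(p) for p in partitions])
```

Creating the connection inside the per-partition function keeps one connection per partition while ensuring only plain data crosses the driver/executor boundary.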
Know your API
Java RDD != Scala RDD
APIs: mapPartitions()
joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    return partition.rows
APIs: Transformations & Actions
• Transformations: lazily executed; the code does not run until an action is applied.
  • Ex: map
• Actions: operate on RDD elements and return to the driver.
  • Ex: foreach
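Lazy evaluation is easy to see in plain Python, where the built-in `map` is also lazy. This is only an analogy for Spark's behavior, and `build_document` is a hypothetical function used to make the side effect visible.

```python
calls = []


def build_document(row):
    calls.append(row)  # side effect shows when the function actually runs
    return {"id": row}


rows = [1, 2, 3]

pipeline = map(build_document, rows)  # "transformation": nothing has run yet
calls_before = len(calls)

documents = list(pipeline)            # "action": forces evaluation
calls_after = len(calls)

print(calls_before, calls_after)
```

A job built only from transformations does no work, which is why pushing documents inside `mapPartitions()` has no effect until an action follows it.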
APIs: mapPartitions()
joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    return partition.rows
.collect() // action: triggers the job and returns every row to the driver
Understand how data is passed around
Memory: Solution
joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    return partition.rows.length
.collect() // action: returns only one count per partition to the driver
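The memory difference between returning rows and returning a count can be sketched with a toy driver. `map_partitions` below is a hypothetical stand-in for `mapPartitions()` followed by `collect()`; it runs in one process and the partition sizes are made up.

```python
def map_partitions(partitions, fn):
    """Toy stand-in for mapPartitions() followed by collect()."""
    collected = []
    for partition in partitions:
        collected.extend(fn(iter(partition)))
    return collected


def push_and_return_rows(partition):
    rows = list(partition)
    # ... push each row to the search index here ...
    return iter(rows)  # every row travels back to the driver


def push_and_return_count(partition):
    count = 0
    for _row in partition:
        # ... push each row to the search index here ...
        count += 1
    return iter([count])  # one integer per partition travels back


partitions = [list(range(0, 1000)), list(range(1000, 1500))]

all_rows = map_partitions(partitions, push_and_return_rows)
counts = map_partitions(partitions, push_and_return_count)

print(len(all_rows), counts)
```

The driver still learns how many documents were pushed, but it holds one number per partition instead of the whole dataset.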
How did it go?
Demo
Questions?
Thank You
© 2015. All Rights Reserved.