
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Feb 20, 2017

Transcript
Page 1: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Lessons Learned with Cassandra & Spark at the USPTO

Page 2: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Christopher Bradford

• DataStax Certified Cassandra Architect

• Contributed to CQLEngine - Python C* ORM

• Developed Trireme - a migration engine for Cassandra & DSE

• Created the world’s smallest C* cluster

Twitter: @bradfordcp
GitHub: bradfordcp

Page 3: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

OpenSource Connections

• Consulting firm based in Charlottesville, Virginia
• Founded in 2005
• Focused on Search in 2010, specifically Solr and Lucene
• Delivering Cassandra Consulting since 2012
• DataStax Gold Partner
• Great with Search, Analytics and Discovery

Page 4: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

OpenSource Connections

Blog: http://o19s.com/blog/
Twitter: @o19s
GitHub: o19s

Page 5: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Exploring Search Technologies

Page 6: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Technologies

Page 7: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Architecture

Page 8: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Architecture – Data Layer

Page 9: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Data

[Chart: Patent Applications & Grants by year. X-axis: 1963 to 2013. Y-axis: 0 to 700,000. Series: Applications (rising from 90,982 to 615,243) and Grants (rising from 48,971 to 326,033).]

Page 10: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Page 11: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

WHERE Clauses

Page 12: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

WHERE

YOU KEEP USING THAT CLAUSE

I DON’T THINK YOU KNOW WHAT THAT MEANS

Page 13: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

CQL vs SQL: WHERE

 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764

SQL
    SELECT * FROM names WHERE rank = 59290;
    last | VRANES | 59290

CQL
    SELECT * FROM names WHERE rank = 59290;
    InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "

    CREATE TABLE names (
        type VARCHAR,
        name VARCHAR,
        rank INT,
        PRIMARY KEY ((type, name))
    );

Page 14: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

CQL vs SQL: WHERE

 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764

SQL
    SELECT * FROM names WHERE rank = 59290;
    last | VRANES | 59290

CQL
    SELECT * FROM names WHERE type = 'last' AND name = 'VRANES';
    last | VRANES | 59290

    CREATE TABLE names (
        type VARCHAR,
        name VARCHAR,
        rank INT,
        PRIMARY KEY ((type, name))
    );
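Why the first CQL query is rejected and the second succeeds can be sketched with a toy model. This is only an illustration in Python, not how Cassandra is implemented: rows are stored under their partition key (type, name), so a lookup by key goes straight to the row, while a filter on rank has to examine every row. The function names are made up for the example.

```python
# Toy model of a table whose rows are stored under their partition key.
# Illustration only; this is not how Cassandra is implemented.

# Rows from the slide, keyed by the partition key (type, name).
names = {
    ("last", "STOBAUGH"): 25067,
    ("last", "BRUDNER"): 65304,
    ("last", "SKLAR"): 12517,
    ("last", "VRANES"): 59290,
    ("last", "SCHRODT"): 34764,
}


def select_by_key(table, type_, name):
    """Direct lookup: the partition key says exactly where the row lives."""
    rank = table.get((type_, name))
    return None if rank is None else (type_, name, rank)


def select_by_rank(table, rank):
    """Filtering on a non-key column means checking every row."""
    rows_examined = 0
    matches = []
    for (type_, name), row_rank in table.items():
        rows_examined += 1
        if row_rank == rank:
            matches.append((type_, name, row_rank))
    return matches, rows_examined


print(select_by_key(names, "last", "VRANES"))
matches, examined = select_by_rank(names, 59290)
print(matches, "after examining", examined, "rows")
```

On five rows the scan is harmless; on a table spread across many nodes it means visiting every partition, which is the kind of query CQL refuses by default.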

Page 15: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

CQL vs SQL: Tables

names_by_rank

 rank  | type | name
-------+------+----------
 25067 | last | STOBAUGH
 65304 | last | BRUDNER
 12517 | last | SKLAR
 59290 | last | VRANES
 34764 | last | SCHRODT

names

 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764

SQL
    SELECT * FROM names_by_rank WHERE rank = 59290;
    last | VRANES | 59290

CQL
    SELECT * FROM names_by_rank WHERE rank = 59290;
    last | VRANES | 59290
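The two-table approach amounts to writing each record twice, once per query pattern. A minimal Python sketch of that idea follows, with dictionaries standing in for the tables; insert_name is a hypothetical helper, not part of any driver.

```python
# Toy model of "one table per query": every write goes to both tables.
names = {}          # keyed by (type, name)
names_by_rank = {}  # keyed by rank


def insert_name(type_, name, rank):
    """Write the same record to both tables so each query has its own key."""
    names[(type_, name)] = rank
    names_by_rank[rank] = (type_, name)


for type_, name, rank in [
    ("last", "STOBAUGH", 25067),
    ("last", "BRUDNER", 65304),
    ("last", "SKLAR", 12517),
    ("last", "VRANES", 59290),
    ("last", "SCHRODT", 34764),
]:
    insert_name(type_, name, rank)

# Both lookups are now direct key lookups.
print(names[("last", "VRANES")])
print(names_by_rank[59290])
```

The cost is an extra write per record and the need to keep both tables in step.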

Page 16: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

CQL vs SQL: Indexes

SQL
    SELECT * FROM names WHERE rank = 59290;
    last | VRANES | 59290

CQL
    SELECT * FROM names WHERE rank = 59290;
    last | VRANES | 59290

 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764

    CREATE INDEX ON names (rank);

Page 17: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

CQL vs SQL: Recap

• Consider multiple tables with data models that support fast, efficient querying.

• Remember that writes are extremely fast in C*. Writing to multiple tables is not necessarily a bad thing.

• Build an index table
    • Your model may support building an inverted index for lookups of record ids.

• Use secondary indexes***
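The index-table bullet can be pictured as an inverted index: a second structure that maps a searchable value to the ids of the records containing it. The sketch below is a Python illustration under assumed data; the document ids and titles are invented for the example.

```python
from collections import defaultdict

# Hypothetical records: id -> title. The ids and titles are made up.
documents = {
    "doc-1": "method for storing images",
    "doc-2": "apparatus for storing energy",
    "doc-3": "method for searching images",
}

# The "index table": term -> set of record ids containing that term.
index = defaultdict(set)
for doc_id, title in documents.items():
    for term in title.split():
        index[term].add(doc_id)


def lookup(term):
    """Return the sorted ids of records whose title contains the term."""
    return sorted(index.get(term, set()))


print(lookup("images"))
print(lookup("storing"))
```

In a CQL model the term would play the role of the partition key and the record id that of a clustering column, so a lookup by term stays a single-partition read.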

Page 18: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Cluster Balancing

Page 19: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Unbalanced Cluster Symptoms

• Certain nodes shutting down midway through ingestion
• Data is not cleanly distributed across the cluster

Page 20: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Unbalanced Cluster Causes

• Data Model – check your partitions!

• Configuration – how are your tokens split amongst the nodes?

• Hardware – is the server configured correctly?

Page 21: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Data

[Chart: Patent Applications & Grants by year. X-axis: 1963 to 2013. Y-axis: 0 to 700,000. Series: Applications (rising from 90,982 to 615,243) and Grants (rising from 48,971 to 326,033).]

Page 22: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Balancing: Data Model

Sample unbalanced model

    CREATE TABLE images (
        year INT,
        id TEXT,
        page TEXT,
        image BLOB,
        PRIMARY KEY (year, id, page)
    );

    SELECT * FROM images WHERE year = 2015;

Page 23: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Balancing: Data Model

Switch partition key to use multiple fields instead of just year.

    CREATE TABLE images (
        year INT,
        month INT,
        id TEXT,
        page INT,
        image BLOB,
        PRIMARY KEY ((year, month), id, page)
    );

    SELECT * FROM images WHERE year = 2015 AND month IN (1,…);
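A rough way to see the effect of the wider partition key is to count partitions and their sizes under each model. The Python sketch below uses hypothetical row counts and assumes rows are spread evenly over the twelve months, which real data will not do exactly.

```python
# Hypothetical row counts per year; real numbers would come from the cluster.
rows_per_year = {2013: 120_000, 2014: 360_000, 2015: 600_000}


def partitions_by_year(counts):
    """PRIMARY KEY (year, id, page): one partition per year."""
    return {(year,): rows for year, rows in counts.items()}


def partitions_by_year_month(counts):
    """PRIMARY KEY ((year, month), id, page): one partition per month.

    Assumes rows are spread evenly over the twelve months.
    """
    return {
        (year, month): rows // 12
        for year, rows in counts.items()
        for month in range(1, 13)
    }


for label, parts in [
    ("year", partitions_by_year(rows_per_year)),
    ("(year, month)", partitions_by_year_month(rows_per_year)),
]:
    print(label, "->", len(parts), "partitions, largest:", max(parts.values()))
```

More, smaller partitions give the cluster more units to spread across nodes. The trade-off shows up in the query on the slide: reading a whole year now touches twelve partitions instead of one.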

Page 24: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Balancing: Configuration

Page 25: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Virtual Nodes?

Source: http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
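The question of how tokens are split amongst the nodes can be explored with a toy token ring. The Python sketch below is a simplification, not Cassandra's partitioner: the ring size, node names, and random token assignment are all assumptions made for illustration. It reports the fraction of the ring each node owns with one token per node and with many tokens per node.

```python
import random

RING_SIZE = 2**32  # size of the toy token space (an assumption for this sketch)


def build_ring(nodes, tokens_per_node, seed=0):
    """Give each node `tokens_per_node` random tokens; return sorted (token, node)."""
    rng = random.Random(seed)
    tokens = rng.sample(range(RING_SIZE), len(nodes) * tokens_per_node)
    ring = []
    for i, node in enumerate(nodes):
        for token in tokens[i * tokens_per_node:(i + 1) * tokens_per_node]:
            ring.append((token, node))
    return sorted(ring)


def ownership(ring):
    """Fraction of the token space each node owns.

    A token owns the range from the previous token (exclusive) up to itself,
    wrapping around the ring.
    """
    owned = {}
    previous = ring[-1][0] - RING_SIZE  # wrap-around from the last token
    for token, node in ring:
        owned[node] = owned.get(node, 0) + (token - previous)
        previous = token
    return {node: size / RING_SIZE for node, size in owned.items()}


nodes = ["node1", "node2", "node3", "node4"]
for tokens_per_node in (1, 256):
    shares = ownership(build_ring(nodes, tokens_per_node))
    spread = max(shares.values()) - min(shares.values())
    print(tokens_per_node, "token(s) per node, spread in ownership:", round(spread, 3))
```

Comparing the two printed spreads gives a feel for why many small token ranges per node tend to even out ownership compared with a single randomly placed token.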

Page 26: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Hardware

Page 27: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Hardware

• Understand the type of hardware Cassandra runs well on.

LOCAL STORAGE

NETWORK STORAGE

Page 28: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Page 29: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Data Ingestion

Page 30: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Data Loading Performance

• Did it work?
• Why change it?
• How could we make it better?

Page 31: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Spark Data Loading

Page 32: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Collecting Metrics

Page 33: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Metrics

• GitHub: dropwizard/metrics

• Awesome Java library for collecting metrics in your code

• Demo later

Page 34: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Poor Performance

    joinedRDD = …
    joinedRDD.foreach()
        document = … // build document
        sc = new SolrConnection() // opens a new connection for every record
        sc.push(document)
        sc.disconnect()
    // Job is done

Page 36: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Optimum Performance

    joinedRDD = …
    sc = new SolrConnection() // one connection, created once outside the loop
    joinedRDD.foreach()
        document = … // build document
        sc.push(document)
    sc.disconnect()
    // Job is done

Page 37: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Scope

Page 38: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Scope: Review

    joinedRDD = …
    sc = new SolrConnection() // created on the driver, but used inside foreach()
    joinedRDD.foreach()
        document = … // build document
        sc.push(document)
    sc.disconnect()
    // Job is done

Page 39: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Scope: ERROR

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
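The error arises because whatever the function passed to foreach() refers to has to be serialized and shipped to the workers, and a live connection cannot be. A rough Python analogy, using pickle in place of Spark's serializer and a lock in place of an open socket, looks like this. The SolrConnection class here is a stand-in written for the example, not a real client.

```python
import pickle
import threading


class SolrConnection:
    """Stand-in for a client that holds a live, non-serializable resource."""

    def __init__(self):
        # A lock plays the role of the open socket inside a real client.
        self._resource = threading.Lock()
        self.pushed = []

    def push(self, document):
        self.pushed.append(document)


def can_ship_to_workers(obj):
    """Return True if the object survives serialization, as a closure must."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


print(can_ship_to_workers({"id": 1, "title": "a plain record"}))
print(can_ship_to_workers(SolrConnection()))
```

Plain data serializes; the object holding a live resource does not, which is why the connection has to be created where it is used.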

Page 40: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Scope: Fixed!

    joinedRDD = …
    joinedRDD.foreachPartition()
        sc = new SolrConnection() // created on the executor, once per partition
        partition.foreach()
            document = …
            sc.push(document)
    // Job is done
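The saving from moving the connection to partition scope can be counted in a small simulation. The Python sketch below uses plain lists as partitions and a hypothetical SolrConnection class that only counts how often it is constructed; no Spark or Solr is involved.

```python
class SolrConnection:
    """Hypothetical client that counts how many connections were opened."""

    opened = 0

    def __init__(self):
        SolrConnection.opened += 1
        self.documents = []

    def push(self, document):
        self.documents.append(document)

    def disconnect(self):
        pass


# Three partitions of made-up records.
partitions = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]


def push_per_record(partitions):
    """Mirrors foreach(): connect, push, disconnect for every record."""
    SolrConnection.opened = 0
    for partition in partitions:
        for record in partition:
            sc = SolrConnection()
            sc.push({"id": record})
            sc.disconnect()
    return SolrConnection.opened


def push_per_partition(partitions):
    """Mirrors foreachPartition(): one connection shared by a partition."""
    SolrConnection.opened = 0
    for partition in partitions:
        sc = SolrConnection()
        for record in partition:
            sc.push({"id": record})
        sc.disconnect()
    return SolrConnection.opened


print("per record:   ", push_per_record(partitions), "connections")
print("per partition:", push_per_partition(partitions), "connections")
```

The number of connections drops from one per record to one per partition, and each connection is created on the side that uses it, so nothing needs to be serialized.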

Page 41: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Know your API

Page 42: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Java RDD != Scala RDD

Page 43: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

APIs: mapPartitions()

    joinedRDD = …
    joinedRDD.mapPartitions()
        sc = new SolrConnection()
        partition.foreach()
            document = … // build document
            sc.push(document)
        return partition.rows

Page 44: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

APIs: Transformations & Actions

• Transformations: lazily executed; the code does not run until an action is applied.
    • Ex: map

• Actions: operate on RDD elements and return to the driver.
    • Ex: foreach
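Lazy execution can be demonstrated without Spark. In the Python sketch below the built-in map() plays the role of a transformation, since it returns a lazy iterator, and list() plays the role of an action that forces the work. This is an analogy to the Spark behaviour, not Spark itself, and build_document is a made-up function.

```python
executed = []


def build_document(row):
    """Record that the function ran, then return a made-up document."""
    executed.append(row)
    return {"id": row}


rows = [1, 2, 3]

# "Transformation": map() returns a lazy iterator; nothing has run yet.
documents = map(build_document, rows)
ran_before_action = list(executed)

# "Action": consuming the iterator forces the work to happen.
results = list(documents)

print("ran before action:", ran_before_action)
print("ran after action: ", executed)
```

If the action is never applied, the side effect inside the mapped function (here the append, in the slides the push to Solr) never happens.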

Page 45: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

APIs: mapPartitions()

    joinedRDD = …
    joinedRDD.mapPartitions()
        sc = new SolrConnection()
        partition.foreach()
            document = … // build document
            sc.push(document)
        return partition.rows
    .collect() // action: forces the lazy mapPartitions() to run

Page 46: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Understand how data is passed around

Page 47: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Memory: Solution

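    joinedRDD = …
    joinedRDD.mapPartitions()
        sc = new SolrConnection()
        partition.foreach()
            document = … // build document
            sc.push(document)
        return partition.rows.length // return a count, not the rows themselves
    .collect() // only the per-partition counts travel back to the driver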
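The difference between returning rows and returning a count can be seen in what ends up collected. The Python sketch below simulates partitions as lists; map_partitions is a hypothetical helper that applies a function to each partition and gathers the results in one place, as collect() would on the driver.

```python
# Three partitions of placeholder rows.
partitions = [["row"] * 4, ["row"] * 3, ["row"] * 3]


def map_partitions(partitions, func):
    """Apply func to each partition and gather the results in one list."""
    collected = []
    for partition in partitions:
        collected.extend(func(partition))
    return collected


def push_and_return_rows(partition):
    # ... push each row to the search index here ...
    return partition  # every row travels back to the driver


def push_and_return_count(partition):
    # ... push each row to the search index here ...
    return [len(partition)]  # one small integer per partition


print(len(map_partitions(partitions, push_and_return_rows)), "items collected")
print(map_partitions(partitions, push_and_return_count), "collected")
```

Returning the count keeps the driver's memory use proportional to the number of partitions rather than the number of rows, and the counts still confirm how many rows were pushed.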

Page 48: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

How did it go?

Page 49: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Demo

Page 50: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Questions?

Page 51: Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Thank You

© 2015. All Rights Reserved.