Lessons Learned with Spark at the US Patent & Trademark Office Christopher Bradford Big Data Architect at OpenSource Connections
Page 1: Lessons Learned with Spark at the US Patent & Trademark Office

Lessons Learned with Spark at the US Patent & Trademark Office

Christopher Bradford

Big Data Architect at OpenSource Connections

Page 2: Lessons Learned with Spark at the US Patent & Trademark Office

Christopher Bradford

Twitter: @bradfordcp

GitHub: bradfordcp

Page 3: Lessons Learned with Spark at the US Patent & Trademark Office

OpenSource Connections

Page 4: Lessons Learned with Spark at the US Patent & Trademark Office

Exploring Search Technologies - EST

Page 5: Lessons Learned with Spark at the US Patent & Trademark Office

EST – Technology Stack

Page 6: Lessons Learned with Spark at the US Patent & Trademark Office

EST – Data Loading

CSS Ingestion (CSS2C)

Solr Ingestion (C2S)

Page 7: Lessons Learned with Spark at the US Patent & Trademark Office

EST – C2S Process

Note: some connections are omitted for clarity

Page 8: Lessons Learned with Spark at the US Patent & Trademark Office

EST – C2S Process (Scaled Out)

Note: some connections are omitted for clarity

Page 9: Lessons Learned with Spark at the US Patent & Trademark Office

EST – C2S Review

Did it work?

Why change it?

How could we make it better?

Page 10: Lessons Learned with Spark at the US Patent & Trademark Office
Page 11: Lessons Learned with Spark at the US Patent & Trademark Office

EST – Old C2S Process

Note: some connections are omitted for clarity

Page 12: Lessons Learned with Spark at the US Patent & Trademark Office

EST – Spark C2S Process

Note: some connections are omitted for clarity

Page 13: Lessons Learned with Spark at the US Patent & Trademark Office

How did this work out?

Poorly

Page 14: Lessons Learned with Spark at the US Patent & Trademark Office

Poor Performance

joinedRDD = …

joinedRDD.foreach(row -> {
    document = … // build document
    sc = new SolrConnection() // a new connection for every row
    sc.push(document)
    sc.disconnect()
})

// Job is done

Page 15: Lessons Learned with Spark at the US Patent & Trademark Office

Poor Performance

sc = new SolrConnection()
sc.push(document)
sc.disconnect()
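To make the cost on this slide concrete, here is a minimal sketch in plain Java with no Spark or Solr dependency. FakeSolrConnection is a hypothetical stand-in that only counts how often a connection is opened, and the nested lists are an assumed stand-in for the partitions of joinedRDD. It is an illustration of the pattern, not the actual C2S job.

```java
import java.util.Arrays;
import java.util.List;

public class PerRecordConnectionSketch {
    // Hypothetical stand-in for a Solr client: it only counts how often it is opened.
    static class FakeSolrConnection {
        static int opened = 0;
        FakeSolrConnection() { opened++; }
        void push(String document) { /* a real client would send the document here */ }
        void disconnect() { /* a real client would release its resources here */ }
    }

    // Mirrors the slide: connect, push, and disconnect inside the per-row loop.
    public static int connectionsOpened(List<List<String>> partitions) {
        FakeSolrConnection.opened = 0;
        for (List<String> partition : partitions) {
            for (String row : partition) {
                String document = "doc:" + row; // placeholder for "build document"
                FakeSolrConnection sc = new FakeSolrConnection();
                sc.push(document);
                sc.disconnect();
            }
        }
        return FakeSolrConnection.opened;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("d", "e"),
            Arrays.asList("f"));
        // Six rows in total, so six connections are opened.
        System.out.println(connectionsOpened(partitions)); // prints 6
    }
}
```

With this pattern the number of connections grows with the number of rows, so connection setup and teardown is paid for every document.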

Page 16: Lessons Learned with Spark at the US Patent & Trademark Office

Optimum Performance

joinedRDD = …

sc = new SolrConnection()

joinedRDD.foreach(row -> {
    document = … // build document
    sc.push(document)
})

sc.disconnect()

// Job is done

joinedRDD = …

joinedRDD.foreachPartition(partition -> {
    sc = new SolrConnection() // one connection per partition
    partition.foreach(row -> {
        document = … // build document
        sc.push(document)
    })
    sc.disconnect()
})

// Job is done

Almost
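The per-partition variant can be sketched the same way in plain Java, without Spark or Solr. In Spark, foreachPartition calls the supplied function once per partition with an iterator over that partition's rows, so setup done inside it is paid once per partition rather than once per row. FakeSolrConnection and the nested lists standing in for partitions are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

public class PerPartitionConnectionSketch {
    // Hypothetical stand-in for a Solr client: it only counts how often it is opened.
    static class FakeSolrConnection {
        static int opened = 0;
        FakeSolrConnection() { opened++; }
        void push(String document) { /* a real client would send the document here */ }
        void disconnect() { /* a real client would release its resources here */ }
    }

    // Mirrors the foreachPartition slide: one connection is shared by all rows of a partition.
    public static int connectionsOpened(List<List<String>> partitions) {
        FakeSolrConnection.opened = 0;
        for (List<String> partition : partitions) {
            FakeSolrConnection sc = new FakeSolrConnection();
            for (String row : partition) {
                String document = "doc:" + row; // placeholder for "build document"
                sc.push(document);
            }
            sc.disconnect();
        }
        return FakeSolrConnection.opened;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("d", "e"),
            Arrays.asList("f"));
        // Three partitions, so three connections, regardless of the six rows.
        System.out.println(connectionsOpened(partitions)); // prints 3
    }
}
```

The connection count now depends on the number of partitions, not the number of rows.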

Page 17: Lessons Learned with Spark at the US Patent & Trademark Office

The Solution!

joinedRDD = …

joinedRDD.mapPartitions(partition -> {
    sc = new SolrConnection()
    partition.foreach(row -> {
        document = … // build document
        sc.push(document)
    })
    sc.close()
    return partition.rows
})
.collect()

joinedRDD = …

joinedRDD.mapPartitions(partition -> {
    sc = new SolrConnection()
    partition.foreach(row -> {
        document = … // build document
        sc.push(document)
    })
    sc.close()
    return partition.rows.count // return only a count per partition
})
.collect()
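The second variant, returning only a count per partition, can be approximated with Java streams as a rough analogy. As with mapPartitions in Spark, the mapping step of a stream is lazy: nothing runs until a terminal operation such as collect() is called. This is a sketch under stated assumptions, not the talk's job: FakeSolrConnection and indexPartition are hypothetical names, and the nested lists stand in for the partitions of joinedRDD.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PartitionCountSketch {
    // Hypothetical stand-in for a Solr client: it only counts pushed documents.
    static class FakeSolrConnection {
        int pushed = 0;
        void push(String document) { pushed++; }
        void close() { /* a real client would release its resources here */ }
    }

    // Indexes one partition over a single connection and returns how many rows it pushed.
    public static int indexPartition(List<String> partition) {
        FakeSolrConnection sc = new FakeSolrConnection();
        for (String row : partition) {
            String document = "doc:" + row; // placeholder for "build document"
            sc.push(document);
        }
        sc.close();
        return sc.pushed;
    }

    // Lazily maps each partition to its count; collect() triggers the work.
    public static List<Integer> run(List<List<String>> partitions) {
        Stream<Integer> pending = partitions.stream().map(PartitionCountSketch::indexPartition);
        return pending.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("d", "e"),
            Arrays.asList("f"));
        System.out.println(run(partitions)); // prints [3, 2, 1]
    }
}
```

Returning a count instead of the rows themselves keeps the collected result small: one integer per partition rather than every indexed row.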

Page 18: Lessons Learned with Spark at the US Patent & Trademark Office

Results?

Page 19: Lessons Learned with Spark at the US Patent & Trademark Office

Solr Indexing

Page 20: Lessons Learned with Spark at the US Patent & Trademark Office

Better Solr Indexing

Note: some connections are omitted for clarity

Page 21: Lessons Learned with Spark at the US Patent & Trademark Office

EST – Spark C2S Process v2

Note: some connections are omitted for clarity

Page 22: Lessons Learned with Spark at the US Patent & Trademark Office

Success?

YUP

5x faster than the original C2S process (with optimizations)

Page 23: Lessons Learned with Spark at the US Patent & Trademark Office

What’s Next?

• Optimization of the C2S Spark job
• More Spark jobs
• Newer version of Spark & DSE
• Scala Spark jobs instead of Java