Big Data meets Big Compute: Connecting MongoDB and Spark for Fun and Profit
{ name: "Ross Lawley", role: "Senior Software Engineer", twitter: "@RossC0" }

MongoDB Europe 2016
Transcript
Page 1: MongoDB Europe 2016 - Big Data meets Big Compute

Big Data meets Big Compute: Connecting MongoDB and Spark for Fun and Profit { name: "Ross Lawley", role: "Senior Software Engineer", twitter: "@RossC0" }

Page 2: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Agenda

01 Spark introduction – What is Spark? How does it work? What problems can it solve? What's the future of Spark?

02 The new connector – Introducing the new connector. How to install it and use it. How to use it in various languages. When to use it, when not to.

03 Internals – A deep dive into the connector: configuration options, partitioning challenges, how to scale and keep data local.

04 Demo – An impressive demonstration of MongoDB and Spark combined!

05 Conclusions – Quick recap. Where to go for more information.

06 Questions – I'll try and help answer any questions you might have. I'll also answer questions at the Drivers booth!

Page 3: MongoDB Europe 2016 - Big Data meets Big Compute

Spark an introduction

Page 4: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

What is Spark?

A fast, distributed, general-purpose computing engine

• Makes it easy and fast to process large datasets
• Libraries for SQL, streaming, machine learning, graphs
• APIs in Scala, Python, Java, R
• It's fundamentally different to what's come before

Page 5: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

So why not just use Hadoop?

Spark is FAST

• Faster to write
  • Friendly API in Scala, Python, Java and R
• Faster to run
  • Up to 100x faster than Hadoop in memory
  • 10x faster on disk

Page 6: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

A visual comparison

[Diagram: Hadoop vs. Spark]

Page 7: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Spark History

2009 – The Beginning: the Spark project started at UC Berkeley's AMPLab
2010 – Spark open sourced
2013 – Joined the Apache Foundation
2014 – Spark 1.0.0 – 1.2.0: Scala, Java & Python APIs, Spark SQL, Streaming, MLlib, GraphX
2015 – Spark 1.3.0 – 1.5.0: R support, Spark SQL out of alpha, DataFrames
2016 – Spark 1.6.0 / Spark 2.0: Datasets, Structured Streaming

Page 8: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Spark Programming Model

Resilient Distributed Datasets

• An RDD is a collection of elements that is immutable, distributed and fault-tolerant.
• Transformations can be applied to an RDD, resulting in a new RDD.
• Actions can be applied to an RDD to obtain a value.
• RDDs are evaluated lazily (see the sketch below).
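To make that concrete, here is a minimal sketch in plain Spark (nothing connector-specific; the app name and numbers are made up): transformations only describe new RDDs, and nothing executes until an action is called.

    import org.apache.spark.{SparkConf, SparkContext}

    // Transformations build new (lazy) RDDs; an action finally triggers the job.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-model").setMaster("local[*]"))
    val numbers = sc.parallelize(1 to 1000)        // an RDD
    val evens   = numbers.filter(_ % 2 == 0)       // transformation: nothing runs yet
    val doubled = evens.map(_ * 2)                 // another lazy transformation
    println(doubled.count())                       // action: the computation actually runs
    println(doubled.take(5).mkString(", "))        // another action, returning values to the driver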

Page 9: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

RDD Operations

Transformations: map, filter, flatMap, mapPartitions, sample, union, join, groupByKey, reduceByKey
Actions: reduce, collect, count, save, lookupKey, take, foreach

(A small example using a few of these follows.)
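A quick illustration of a couple of entries from each column, continuing with the SparkContext sc from the previous sketch (the data is made up):

    val pairs  = sc.parallelize(Seq(("uk", 1), ("de", 2), ("uk", 3)))
    val totals = pairs.reduceByKey(_ + _)                       // transformation
    val names  = sc.parallelize(Seq(("uk", "United Kingdom"), ("de", "Germany")))
    totals.join(names).collect().foreach(println)               // actions: collect, foreach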

Page 10: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Built in fault tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions:

    val searches = spark.textFile("hdfs://...")
                        .filter(_.contains("Search"))
                        .map(_.split("\t")(2))
                        .cache()
                        .filter(_.contains("MongoDB"))
                        .count()

[Lineage diagram: HDFS RDD → Filtered RDD → Mapped RDD → Cached RDD → Filtered RDD → Count]

Page 11: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Spark topology

[Diagram: the Spark Driver submits work via a Cluster Manager to Workers 1..n, which read from the Data source]

Page 12: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Spark high level view

Page 13: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Spark high level view

RDDs – unstructured data
Datasets – structured data

Libraries on top: Spark SQL, Spark Streaming, MLlib, GraphX

Page 14: MongoDB Europe 2016 - Big Data meets Big Compute

The new connector

Page 15: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Connecting MongoDB and Spark

Big Data Storage (MongoDB) + Big Data Compute (Spark)

Page 16: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Different use cases

• Applications – OLTP, fine-grained operations
• Offline processing – analytics, data warehousing

Page 17: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

The MongoDB Spark Connector

[Diagram: the MongoDB Spark Connector sits between MongoDB and Spark]

Page 18: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

The MongoDB Spark Connector

• Spark 1.6.x and Spark 2.0.x
• Scala, Python, Java, and R APIs
• Idiomatic Scala API
• Supports custom aggregations
• Multiple partitioning strategies
• Automatic schema inference
• Automatic conversion to Datasets

    > $SPARK_HOME/bin/spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0
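Once the shell is up, loading data looks roughly like the sketch below (the test.cities collection and its population field are hypothetical, and spark.mongodb.input.uri is assumed to be set):

    // Inside the spark-shell launched above
    import com.mongodb.spark._
    import org.bson.Document

    val rdd = MongoSpark.load(sc)                  // MongoRDD[Document]
    println(rdd.count())
    println(rdd.first.toJson)

    // Custom aggregation: push a $match stage down to MongoDB before the data reaches Spark
    val big = rdd.withPipeline(Seq(Document.parse("{ $match: { population: { $gt: 1000000 } } }")))
    println(big.count())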

Page 19: MongoDB Europe 2016 - Big Data meets Big Compute

"Users are already combining Apache Spark and MongoDB to build sophisticated analytics applications. The new native MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Apache Spark functionality than any MongoDB connector available today."

– Reynold Xin, Co-Founder and Chief Architect at Databricks

Page 20: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Fare Calculation Engine: one of the world's largest airlines migrates from Oracle to MongoDB and Apache Spark to support a 100x performance improvement

Problem
China Eastern targets 130,000 seats sold every day across its web and mobile channels. The new fare calculation engine needed to support 20,000 search queries per second, but the existing Oracle platform supported only 200 per second.

Solution
Apache Spark is used for fare calculations, using business rules stored in MongoDB. Fare calculations are written to MongoDB for access by the search application. The MongoDB Connector for Apache Spark allows seamless integration with data locality awareness across the cluster.

Results
A cluster of fewer than 20 API, Spark & MongoDB nodes supports 180m fare calculations & 1.6 billion searches per day. Each node delivers 15x higher performance and 10x lower latency than the existing Oracle servers. MongoDB Enterprise Advanced provided Ops Manager for operational automation and access to expert technical support.

Page 21: MongoDB Europe 2016 - Big Data meets Big Compute

Connector Internals

Page 22: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

What's needed to connect to Spark?

1. Create a connection

• This has some cost: the MongoDB Java driver runs a connection pool, authenticates connections, performs replica set discovery, etc.
• Only two modes to support: reads and writes

Page 23: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

What's needed to connect to Spark?

2. Partition the data
• Partitions provide parallelism – they split the collection into parts
• This is challenging for mutable data sources, as a partition is not a snapshot in time

[Diagram: an RDD / collection split into partitions]

Page 24: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

MongoSamplePartitioner – the default partitioner

• Over-samples the collection
• Calculates the number of partitions, using the average document size and the configured partition size
• Samples the collection, taking n sample documents per partition
• Sorts the samples by the partition key
• Takes every nth sample as a partition boundary
• Adds min and max key partitions at the start and end of the collection

    {$gte: {_id: minKey}, $lt: {_id: 1}}
    {$gte: {_id: 1},      $lt: {_id: 100}}
    {$gte: {_id: 100},    $lt: {_id: 200}}
    ...
    {$gte: {_id: 4900},   $lt: {_id: 5000}}
    {$gte: {_id: 5000},   $lt: {_id: maxKey}}

Page 25: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

MongoShardedPartitioner

Sharded collections are already partitioned

• Examines the shard config database
• Creates partitions based on the shard chunk min and max ranges
• Stores the shard location data for each chunk, to help promote locality
• Adds min and max key partitions at the start and end of the collection

    {$gte: {_id: minKey}, $lt: {_id: 1}}
    ...
    {$gte: {_id: 194},    $lt: {_id: 232}}
    ...
    {$gte: {_id: 1000},   $lt: {_id: maxKey}}

Page 26: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Alternative Partitioners

• MongoSplitVectorPartitioner – a partitioner for standalone deployments or replica sets; the splitVector command requires special privileges
• MongoPaginateByCountPartitioner – creates a maximum number of partitions; costs a query to calculate each partition
• MongoPaginateBySizePartitioner – as above, but uses the average document size to determine the partitions
• Create your own – just implement the MongoPartitioner trait and add its full class path to the config (see the configuration sketch below)
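The partitioner can be selected per read. A minimal sketch, assuming the ReadConfig option keys mirror the documented spark.mongodb.input.* settings (the 64MB partition size is illustrative):

    import com.mongodb.spark._
    import com.mongodb.spark.config.ReadConfig

    // Use the sharded partitioner for one read...
    val shardedRead = ReadConfig(Map("partitioner" -> "MongoShardedPartitioner"), Some(ReadConfig(sc)))

    // ...and a paginating partitioner with a target partition size for another
    val paginatedRead = ReadConfig(Map(
      "partitioner" -> "MongoPaginateBySizePartitioner",
      "partitionerOptions.partitionSizeMB" -> "64"
    ), Some(ReadConfig(sc)))

    println(MongoSpark.load(sc, shardedRead).partitions.length)
    println(MongoSpark.load(sc, paginatedRead).partitions.length)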

Page 27: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

What's needed to connect to Spark?

3. Support DataFrames & Datasets
• RDDs with a schema
• Supports simple types:
  BinaryType, BooleanType, ByteType, CalendarIntervalType, DateType, DoubleType, FloatType, IntegerType, LongType, NullType, ShortType, StringType, TimestampType
• Complex types:
  • ArrayType – typed array
  • StructType – map
• Unsupported BSON types use a StructType, similar to extended JSON

Page 28: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

DataFrames & Datasets

• Automatic schema inference:

    val dataFrame = MongoSpark.load(sparkSession)

• Supply the schema via a case class (a fuller sketch follows):

    case class Person(firstName: String, lastName: String)
    val dataFrame = MongoSpark.load[Person](sparkSession)
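Putting it together, a sketch that could be pasted into the spark-shell (the URI and the test.people collection are hypothetical):

    import org.apache.spark.sql.SparkSession
    import com.mongodb.spark._

    val sparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("mongo-spark-example")
      .config("spark.mongodb.input.uri", "mongodb://localhost/test.people")
      .config("spark.mongodb.output.uri", "mongodb://localhost/test.people")
      .getOrCreate()

    case class Person(firstName: String, lastName: String)
    val people = MongoSpark.load[Person](sparkSession)   // DataFrame with the declared schema
    people.createOrReplaceTempView("people")
    sparkSession.sql("SELECT firstName FROM people WHERE lastName = 'Lawley'").show()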

Page 29: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

What's needed to connect to Spark?

4. Configuration (see the sketch after this list)
• Read Config:
  uri, database, collection, partitioner, sampleSize, localThreshold, readPreference, readConcern
• Write Config:
  uri, database, collection, writeConcern

Page 30: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

The Anatomy of a read

    MongoSpark.load(sparkSession).count()

1. Create a MongoRDD[Row]
2. Infer the schema (none was provided)
3. Partition the data
4. Calculate the partitions
5. Allocate the workers
6. For each partition, on each worker:
   i.  Query and return the cursor
   ii. Iterate the cursor and sum up the data
7. Finally, the Spark application returns the sum of the sums

Page 31: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Performance

• MongoDB usual suspects: document design, indexes, read concern
• Spark specifics: partitioning strategy, data locality

Page 32: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16


Data locality

Page 33: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Data locality

[Diagram: Spark workers and a tier of mongos routers]

Page 34: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Data locality

Configure: LocalThreshold, MongoShardedPartitioner

[Diagram: with these settings, each Spark worker reads through its nearest mongos router]

Page 35: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Data locality

[Diagram: Spark workers, mongos routers, and the underlying mongod shard members]

Configure: ReadPreference, LocalThreshold, MongoShardedPartitioner
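Roughly, those settings translate into configuration like the sketch below (the mongos host, collection and threshold value are illustrative; "nearest" is one read preference that allows reads from the closest member):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("locality-example")
      .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/test.cities")
      .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
      .config("spark.mongodb.input.readPreference.name", "nearest")
      .config("spark.mongodb.input.localThreshold", "15")   // ms window when choosing among nearby servers
      .getOrCreate()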

Page 36: MongoDB Europe 2016 - Big Data meets Big Compute

Demo Time!

Page 37: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Scenario: You've won the EuroMillions lottery!

• To celebrate, you want to travel to Europe's 50 largest cities!
• The nouveau riche only have one way to travel: in style, by personal helicopter!
• It's a logistical nightmare – the classic "Travelling Salesman Problem"

Page 38: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

The scale of the problem

• With 50 places to visit there are 49 x 48 x 47 x … x 3 x 2 x 1 possible ways to travel between them. This number is 63 digits long (a quick check follows below):
  608,281,864,034,267,560,872,252,163,321,295,376,887,552,831,379,210,240,000,000,000

• We don't need to calculate all possible routes – we just need a route that is good enough.
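For the curious, verifying that figure (49 factorial) takes one line of Scala:

    // 49! – the number of orderings of the 49 remaining stops once the first city is fixed
    val routes = (1 to 49).map(BigInt(_)).product
    println(routes)                   // 608281864034267560872252163321295376887552831379210240000000000
    println(routes.toString.length)   // 63 digits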

Page 39: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Choosing MongoDB and Spark

Good fit:
• Not possible directly via the aggregation framework
• CPU-intensive task
• Needs code to solve the problem

Bad fit:
• Not an obviously parallel problem
• But the work can be forked, divided and joined using Spark

Page 40: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Finding a solution with a genetic algorithm

Slightly complex, but basically we're using evolution:
• Randomly generate a number of routes
• Then "evolve" the routes over a number of generations:
  • Crossover two parent routes to create a child route
  • Randomly mutate a percentage of the child routes
  • Keep a percentage of the best routes
• After X generations we end up with an evolved route that is short (see the sketch below)
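The demo's actual implementation isn't reproduced here, but the idea can be sketched in plain Scala (no Spark or MongoDB involved; the cities are random points and the population size, generation count and mutation rate are all illustrative):

    import scala.util.Random

    object TspGaSketch {
      val rnd = new Random(42)
      // 50 "cities" as random 2D points – stand-ins for the real city coordinates
      val cities: Vector[(Double, Double)] =
        Vector.fill(50)((rnd.nextDouble() * 100, rnd.nextDouble() * 100))

      // Total length of a route (a permutation of city indices), returning to the start
      def distance(route: Vector[Int]): Double =
        (route :+ route.head).sliding(2).map { pair =>
          val (ax, ay) = cities(pair(0)); val (bx, by) = cities(pair(1))
          math.hypot(ax - bx, ay - by)
        }.sum

      // Ordered crossover: keep a slice of parent 1, fill the rest in parent 2's order
      def crossover(p1: Vector[Int], p2: Vector[Int]): Vector[Int] = {
        val cuts  = Seq(rnd.nextInt(p1.size), rnd.nextInt(p1.size)).sorted
        val slice = p1.slice(cuts(0), cuts(1))
        val rest  = p2.filterNot(slice.contains)
        rest.take(cuts(0)) ++ slice ++ rest.drop(cuts(0))
      }

      // Mutation: swap two cities in the route
      def mutate(route: Vector[Int]): Vector[Int] = {
        val i = rnd.nextInt(route.size); val j = rnd.nextInt(route.size)
        route.updated(i, route(j)).updated(j, route(i))
      }

      def main(args: Array[String]): Unit = {
        var population = Vector.fill(200)(rnd.shuffle(cities.indices.toVector))
        for (_ <- 1 to 500) {
          val ranked = population.sortBy(distance)
          val elite  = ranked.take(20)                                 // keep a percentage of the best routes
          val children = Vector.fill(population.size - elite.size) {
            val child = crossover(ranked(rnd.nextInt(50)), ranked(rnd.nextInt(50)))
            if (rnd.nextDouble() < 0.1) mutate(child) else child       // randomly mutate a % of children
          }
          population = elite ++ children
        }
        println(f"Shortest route found: ${distance(population.minBy(distance))}%.2f")
      }
    }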

Page 41: MongoDB Europe 2016 - Big Data meets Big Compute

Conclusions

Page 42: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

An extremely powerful combination

• Many possible use cases – solve the right problems
  • Some operations may be faster if performed using the Aggregation Framework
• Performance
  • Pick the correct partitioning strategy
  • Tune MongoDB
  • Tune Spark
• Spark is evolving all the time

Page 43: MongoDB Europe 2016 - Big Data meets Big Compute

Questions?

https://docs.mongodb.com/spark-connector

https://github.com/mongodb/mongo-spark

https://university.mongodb.com/courses/M233/about

Page 44: MongoDB Europe 2016 - Big Data meets Big Compute