Big Data meets Big Compute: Connecting MongoDB and Spark for Fun and Profit
{ name: "Ross Lawley", role: "Senior Software Engineer", twitter: "@RossC0" }

MongoDB Europe 2016
Transcript
Page 1: MongoDB Europe 2016 - Big Data meets Big Compute

Big Data meets Big Compute: Connecting MongoDB and Spark for Fun and Profit { name: "Ross Lawley", role: "Senior Software Engineer", twitter: "@RossC0" }

Page 2: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Agenda

01 Spark introduction – What is Spark? How does it work? What problems can it solve? What's the future of Spark?

02 The new connector – Introducing the new connector. How to install it and use it. How to use it in various languages. When to use it, when not to.

03 Internals – A deep dive into the connector: configuration options, partitioning challenges, how to scale and keep data local.

04 Demo – An impressive demonstration of MongoDB and Spark combined!

05 Conclusions – Quick recap. Where to go for more information.

06 Questions – I'll try and help answer any questions you might have. I'll also answer questions at the Drivers booth!

Page 3: MongoDB Europe 2016 - Big Data meets Big Compute

Spark an introduction

Page 4: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

What is Spark?

A fast, distributed, general-purpose computing engine

• Makes it easy and fast to process large datasets
• Libraries for SQL, streaming, machine learning, graphs
• APIs in Scala, Python, Java, R
• It's fundamentally different to what's come before

Page 5: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

So why not just use Hadoop?

Spark is FAST

• Faster to write
  • Friendly API in Scala, Python, Java and R
• Faster to run
  • Up to 100x faster than Hadoop in memory
  • 10x faster on disk

Page 6: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

A visual comparison

[Diagram: Hadoop vs. Spark]

Page 7: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Spark History

2009 – The Beginning: the Spark project started at UC Berkeley's AMPLab
2010 – Spark open sourced
2013 – Joined the Apache Foundation
2014 – Spark 1.0.0 – 1.2.0: Scala, Java & Python APIs, Spark SQL, Streaming, MLlib, GraphX
2015 – Spark 1.3.0 – 1.5.0: R support, Spark SQL out of alpha, DataFrames
2016 – Spark 1.6.0 / Spark 2.0: Datasets, Structured Streaming

Page 8: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Spark Programming Model

Resilient Distributed Datasets

• An RDD is a collection of elements that is immutable, distributed and fault-tolerant.
• Transformations can be applied to an RDD, resulting in a new RDD.
• Actions can be applied to an RDD to obtain a value.
• RDDs are evaluated lazily (see the sketch below).
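To make that concrete, here is a minimal sketch in plain Spark (nothing connector-specific; the app name and numbers are made up): transformations only describe new RDDs, and nothing executes until an action is called.

    import org.apache.spark.{SparkConf, SparkContext}

    // Transformations build new (lazy) RDDs; an action finally triggers the job.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-model").setMaster("local[*]"))
    val numbers = sc.parallelize(1 to 1000)        // an RDD
    val evens   = numbers.filter(_ % 2 == 0)       // transformation: nothing runs yet
    val doubled = evens.map(_ * 2)                 // another lazy transformation
    println(doubled.count())                       // action: the computation actually runs
    println(doubled.take(5).mkString(", "))        // another action, returning values to the driver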

Page 9: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

RDD Operations

Transformations: map, filter, flatMap, mapPartitions, sample, union, join, groupByKey, reduceByKey
Actions: reduce, collect, count, save, lookupKey, take, foreach

(A small example using a few of these follows.)
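A quick illustration of a couple of entries from each column, continuing with the SparkContext sc from the previous sketch (the data is made up):

    val pairs  = sc.parallelize(Seq(("uk", 1), ("de", 2), ("uk", 3)))
    val totals = pairs.reduceByKey(_ + _)                       // transformation
    val names  = sc.parallelize(Seq(("uk", "United Kingdom"), ("de", "Germany")))
    totals.join(names).collect().foreach(println)               // actions: collect, foreach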

Page 10: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Built in fault tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions:

    val searches = spark.textFile("hdfs://...")
                        .filter(_.contains("Search"))
                        .map(_.split("\t")(2))
                        .cache()
                        .filter(_.contains("MongoDB"))
                        .count()

[Lineage diagram: HDFS RDD → Filtered RDD → Mapped RDD → Cached RDD → Filtered RDD → Count]

Page 11: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Spark topology

[Diagram: the Spark Driver submits work via a Cluster Manager to Workers 1..n, which read from the Data source]

Page 12: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Spark high level view

Page 13: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Spark high level view

RDDs – unstructured data
Datasets – structured data

Libraries on top: Spark SQL, Spark Streaming, MLlib, GraphX

Page 14: MongoDB Europe 2016 - Big Data meets Big Compute

The new connector

Page 15: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Connecting MongoDB and Spark

Big Data Storage (MongoDB) + Big Data Compute (Spark)

Page 16: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Different use cases

• Applications – OLTP, fine-grained operations
• Offline processing – analytics, data warehousing

Page 17: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

The MongoDB Spark Connector

[Diagram: the MongoDB Spark Connector sits between MongoDB and Spark]

Page 18: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

The MongoDB Spark Connector

• Spark 1.6.x and Spark 2.0.x
• Scala, Python, Java, and R APIs
• Idiomatic Scala API
• Supports custom aggregations
• Multiple partitioning strategies
• Automatic schema inference
• Automatic conversion to Datasets

    > $SPARK_HOME/bin/spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0
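Once the shell is up, loading data looks roughly like the sketch below (the test.cities collection and its population field are hypothetical, and spark.mongodb.input.uri is assumed to be set):

    // Inside the spark-shell launched above
    import com.mongodb.spark._
    import org.bson.Document

    val rdd = MongoSpark.load(sc)                  // MongoRDD[Document]
    println(rdd.count())
    println(rdd.first.toJson)

    // Custom aggregation: push a $match stage down to MongoDB before the data reaches Spark
    val big = rdd.withPipeline(Seq(Document.parse("{ $match: { population: { $gt: 1000000 } } }")))
    println(big.count())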

Page 19: MongoDB Europe 2016 - Big Data meets Big Compute

"Users are already combining Apache Spark and MongoDB to build sophisticated analytics applications. The new native MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Apache Spark functionality than any MongoDB connector available today."

– Reynold Xin, Co-Founder and Chief Architect at Databricks

Page 20: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Fare Calculation Engine: one of the world's largest airlines migrates from Oracle to MongoDB and Apache Spark to support a 100x performance improvement

Problem
China Eastern targets 130,000 seats sold every day across its web and mobile channels. The new fare calculation engine needed to support 20,000 search queries per second, but the existing Oracle platform supported only 200 per second.

Solution
Apache Spark is used for fare calculations, using business rules stored in MongoDB. Fare calculations are written to MongoDB for access by the search application. The MongoDB Connector for Apache Spark allows seamless integration with data locality awareness across the cluster.

Results
A cluster of fewer than 20 API, Spark & MongoDB nodes supports 180m fare calculations & 1.6 billion searches per day. Each node delivers 15x higher performance and 10x lower latency than the existing Oracle servers. MongoDB Enterprise Advanced provided Ops Manager for operational automation and access to expert technical support.

Page 21: MongoDB Europe 2016 - Big Data meets Big Compute

Connector Internals

Page 22: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

What's needed to connect to Spark?

1. Create a connection

• This has some cost: the MongoDB Java driver runs a connection pool, authenticates connections, performs replica set discovery, etc.
• Only two modes to support: reads and writes

Page 23: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

What's needed to connect to Spark?

2. Partition the data
• Partitions provide parallelism – they split the collection into parts
• This is challenging for mutable data sources, as a partition is not a snapshot in time

[Diagram: an RDD / collection split into partitions]

Page 24: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

MongoSamplePartitioner – the default partitioner

• Over-samples the collection
• Calculates the number of partitions, using the average document size and the configured partition size
• Samples the collection, taking n sample documents per partition
• Sorts the samples by the partition key
• Takes every nth sample as a partition boundary
• Adds min and max key partitions at the start and end of the collection

    {$gte: {_id: minKey}, $lt: {_id: 1}}
    {$gte: {_id: 1},      $lt: {_id: 100}}
    {$gte: {_id: 100},    $lt: {_id: 200}}
    ...
    {$gte: {_id: 4900},   $lt: {_id: 5000}}
    {$gte: {_id: 5000},   $lt: {_id: maxKey}}

Page 25: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

MongoShardedPartitioner

Sharded collections are already partitioned

• Examines the shard config database
• Creates partitions based on the shard chunk min and max ranges
• Stores the shard location data for each chunk, to help promote locality
• Adds min and max key partitions at the start and end of the collection

    {$gte: {_id: minKey}, $lt: {_id: 1}}
    ...
    {$gte: {_id: 194},    $lt: {_id: 232}}
    ...
    {$gte: {_id: 1000},   $lt: {_id: maxKey}}

Page 26: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Alternative Partitioners

• MongoSplitVectorPartitioner – a partitioner for standalone deployments or replica sets; the splitVector command requires special privileges
• MongoPaginateByCountPartitioner – creates a maximum number of partitions; costs a query to calculate each partition
• MongoPaginateBySizePartitioner – as above, but uses the average document size to determine the partitions
• Create your own – just implement the MongoPartitioner trait and add its full class path to the config (see the configuration sketch below)
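The partitioner can be selected per read. A minimal sketch, assuming the ReadConfig option keys mirror the documented spark.mongodb.input.* settings (the 64MB partition size is illustrative):

    import com.mongodb.spark._
    import com.mongodb.spark.config.ReadConfig

    // Use the sharded partitioner for one read...
    val shardedRead = ReadConfig(Map("partitioner" -> "MongoShardedPartitioner"), Some(ReadConfig(sc)))

    // ...and a paginating partitioner with a target partition size for another
    val paginatedRead = ReadConfig(Map(
      "partitioner" -> "MongoPaginateBySizePartitioner",
      "partitionerOptions.partitionSizeMB" -> "64"
    ), Some(ReadConfig(sc)))

    println(MongoSpark.load(sc, shardedRead).partitions.length)
    println(MongoSpark.load(sc, paginatedRead).partitions.length)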

Page 27: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

What's needed to connect to Spark?

3. Support DataFrames & Datasets
• RDDs with a schema
• Supports simple types:
  BinaryType, BooleanType, ByteType, CalendarIntervalType, DateType, DoubleType, FloatType, IntegerType, LongType, NullType, ShortType, StringType, TimestampType
• Complex types:
  • ArrayType – typed array
  • StructType – map
• Unsupported BSON types use a StructType, similar to extended JSON

Page 28: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

DataFrames & Datasets

• Automatic schema inference:

    val dataFrame = MongoSpark.load(sparkSession)

• Supply the schema via a case class (a fuller sketch follows):

    case class Person(firstName: String, lastName: String)
    val dataFrame = MongoSpark.load[Person](sparkSession)
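Putting it together, a sketch that could be pasted into the spark-shell (the URI and the test.people collection are hypothetical):

    import org.apache.spark.sql.SparkSession
    import com.mongodb.spark._

    val sparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("mongo-spark-example")
      .config("spark.mongodb.input.uri", "mongodb://localhost/test.people")
      .config("spark.mongodb.output.uri", "mongodb://localhost/test.people")
      .getOrCreate()

    case class Person(firstName: String, lastName: String)
    val people = MongoSpark.load[Person](sparkSession)   // DataFrame with the declared schema
    people.createOrReplaceTempView("people")
    sparkSession.sql("SELECT firstName FROM people WHERE lastName = 'Lawley'").show()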

Page 29: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

What's needed to connect to Spark?

4. Configuration (see the sketch after this list)
• Read Config:
  uri, database, collection, partitioner, sampleSize, localThreshold, readPreference, readConcern
• Write Config:
  uri, database, collection, writeConcern

Page 30: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

The Anatomy of a read

    MongoSpark.load(sparkSession).count()

1. Create a MongoRDD[Row]
2. Infer the schema (none was provided)
3. Partition the data
4. Calculate the partitions
5. Allocate the workers
6. For each partition, on each worker:
   i.  Query and return the cursor
   ii. Iterate the cursor and sum up the data
7. Finally, the Spark application returns the sum of the sums

Page 31: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Performance

• MongoDB usual suspects: document design, indexes, read concern
• Spark specifics: partitioning strategy, data locality

Page 32: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16


Data locality

Page 33: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Data locality

[Diagram: Spark workers and a tier of mongos routers]

Page 34: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Data locality

Configure: LocalThreshold, MongoShardedPartitioner

[Diagram: with these settings, each Spark worker reads through its nearest mongos router]

Page 35: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Data locality

[Diagram: Spark workers, mongos routers, and the underlying mongod shard members]

Configure: ReadPreference, LocalThreshold, MongoShardedPartitioner
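Roughly, those settings translate into configuration like the sketch below (the mongos host, collection and threshold value are illustrative; "nearest" is one read preference that allows reads from the closest member):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("locality-example")
      .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/test.cities")
      .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
      .config("spark.mongodb.input.readPreference.name", "nearest")
      .config("spark.mongodb.input.localThreshold", "15")   // ms window when choosing among nearby servers
      .getOrCreate()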

Page 36: MongoDB Europe 2016 - Big Data meets Big Compute

Demo Time!

Page 37: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Scenario: You've won the EuroMillions lottery!

• To celebrate, you want to travel to Europe's 50 largest cities!
• The nouveau riche only have one way to travel: in style, by personal helicopter!
• It's a logistical nightmare – the classic "Travelling Salesman Problem"

Page 38: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

The scale of the problem

• With 50 places to visit there are 49 x 48 x 47 x … x 3 x 2 x 1 possible ways to travel between them. This number is 63 digits long (a quick check follows below):
  608,281,864,034,267,560,872,252,163,321,295,376,887,552,831,379,210,240,000,000,000

• We don't need to calculate all possible routes – we just need a route that is good enough.
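For the curious, verifying that figure (49 factorial) takes one line of Scala:

    // 49! – the number of orderings of the 49 remaining stops once the first city is fixed
    val routes = (1 to 49).map(BigInt(_)).product
    println(routes)                   // 608281864034267560872252163321295376887552831379210240000000000
    println(routes.toString.length)   // 63 digits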

Page 39: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Choosing MongoDB and Spark

Good fit:
• Not possible directly via the aggregation framework
• CPU-intensive task
• Needs code to solve the problem

Bad fit:
• Not an obviously parallel problem
• But the work can be forked, divided and joined using Spark

Page 40: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

Finding a solution with a genetic algorithm

Slightly complex, but basically we're using evolution:
• Randomly generate a number of routes
• Then "evolve" the routes over a number of generations:
  • Crossover two parent routes to create a child route
  • Randomly mutate a percentage of the child routes
  • Keep a percentage of the best routes
• After X generations we end up with an evolved route that is short (see the sketch below)
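The demo's actual implementation isn't reproduced here, but the idea can be sketched in plain Scala (no Spark or MongoDB involved; the cities are random points and the population size, generation count and mutation rate are all illustrative):

    import scala.util.Random

    object TspGaSketch {
      val rnd = new Random(42)
      // 50 "cities" as random 2D points – stand-ins for the real city coordinates
      val cities: Vector[(Double, Double)] =
        Vector.fill(50)((rnd.nextDouble() * 100, rnd.nextDouble() * 100))

      // Total length of a route (a permutation of city indices), returning to the start
      def distance(route: Vector[Int]): Double =
        (route :+ route.head).sliding(2).map { pair =>
          val (ax, ay) = cities(pair(0)); val (bx, by) = cities(pair(1))
          math.hypot(ax - bx, ay - by)
        }.sum

      // Ordered crossover: keep a slice of parent 1, fill the rest in parent 2's order
      def crossover(p1: Vector[Int], p2: Vector[Int]): Vector[Int] = {
        val cuts  = Seq(rnd.nextInt(p1.size), rnd.nextInt(p1.size)).sorted
        val slice = p1.slice(cuts(0), cuts(1))
        val rest  = p2.filterNot(slice.contains)
        rest.take(cuts(0)) ++ slice ++ rest.drop(cuts(0))
      }

      // Mutation: swap two cities in the route
      def mutate(route: Vector[Int]): Vector[Int] = {
        val i = rnd.nextInt(route.size); val j = rnd.nextInt(route.size)
        route.updated(i, route(j)).updated(j, route(i))
      }

      def main(args: Array[String]): Unit = {
        var population = Vector.fill(200)(rnd.shuffle(cities.indices.toVector))
        for (_ <- 1 to 500) {
          val ranked = population.sortBy(distance)
          val elite  = ranked.take(20)                                 // keep a percentage of the best routes
          val children = Vector.fill(population.size - elite.size) {
            val child = crossover(ranked(rnd.nextInt(50)), ranked(rnd.nextInt(50)))
            if (rnd.nextDouble() < 0.1) mutate(child) else child       // randomly mutate a % of children
          }
          population = elite ++ children
        }
        println(f"Shortest route found: ${distance(population.minBy(distance))}%.2f")
      }
    }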

Page 41: MongoDB Europe 2016 - Big Data meets Big Compute

Conclusions

Page 42: MongoDB Europe 2016 - Big Data meets Big Compute

#MDBE16

An extremely powerful combination

• Many possible use cases – solve the right problems
  • Some operations may be faster if performed using the Aggregation Framework
• Performance
  • Pick the correct partitioning strategy
  • Tune MongoDB
  • Tune Spark
• Spark is evolving all the time

Page 43: MongoDB Europe 2016 - Big Data meets Big Compute

Questions?

https://docs.mongodb.com/spark-connector

https://github.com/mongodb/mongo-spark

https://university.mongodb.com/courses/M233/about

Page 44: MongoDB Europe 2016 - Big Data meets Big Compute