MongoDB + Spark
@blimpyacht
Level Setting
TROUGH OF DISILLUSIONMENT
HDFS: Distributed Data
HDFS + YARN: Distributed Resources
HDFS + YARN + MapReduce: Distributed Processing
Hive and Pig: Domain Specific Languages on top of MapReduce (HDFS + YARN)
Interactive Shell
Easy(-er)
Caching
HDFS + YARN + Spark (alongside Hadoop MapReduce): Distributed Processing
Spark ecosystem diagram: Spark Shell, Spark SQL, and Spark Streaming on top of Spark; Hive and Pig on top of Hadoop; cluster managers: Stand Alone, YARN, and Mesos; storage: HDFS
Spark architecture: a Driver coordinates Worker Nodes, each running an executor
Resilient Distributed Datasets (RDDs)
Parallelization
parallelize = x
Transformations
parallelize = x
t(x) = x'
t(x') = x''
Transformations: filter(func), map(func), union(otherDataset), intersection(otherDataset), distinct()
Action
parallelize = x
t(x) = x'
t(x') = x''
f(x'') = y
Actions: collect(), count(), first(), take(n), reduce(func)
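The lazy-transformation / eager-action split can be sketched with plain Java streams as a stand-in for Spark's RDD API (this is an analogy, not Spark itself): intermediate operations only describe the computation, and nothing runs until a terminal operation is invoked.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TransformVsAction {
    // "Transformations" are lazy: this only builds a description of the work.
    static Stream<Integer> transformed() {
        return Stream.of(1, 2, 3, 4, 5)
                .map(x -> x * x)          // t(x)  = x'
                .filter(x -> x % 2 == 1); // t(x') = x''
    }

    public static void main(String[] args) {
        // The "action" (a terminal operation) forces evaluation: f(x'') = y
        List<Integer> result = transformed().collect(Collectors.toList());
        System.out.println(result); // [1, 9, 25]
    }
}
```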
Lineage
parallelize = x
t(x) = x'
t(x') = x''
f(x'') = y
Lineage diagram: Parallelize → Transform → Transform → Action chains, one per RDD
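Lineage is what makes RDDs resilient: a partition remembers the recipe that produced it, not just the result. A minimal plain-Java sketch of the idea (the names here are illustrative, not Spark API):

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class Lineage {
    // An RDD partition records its source plus the chain of
    // transformations applied to it: its lineage.
    static final List<UnaryOperator<Integer>> LINEAGE = List.of(
            x -> x + 1,   // t(x)  = x'
            x -> x * 2    // t(x') = x''
    );

    // If a partition is lost, replay the lineage from the source data.
    static int recompute(int source) {
        int value = source;
        for (UnaryOperator<Integer> t : LINEAGE) {
            value = t.apply(value);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(recompute(5)); // (5 + 1) * 2 = 12
    }
}
```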
https://github.com/mongodb/mongo-hadoop
{
  "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
  "body" : "the scrimmage is still up in the air.",
  "subFolder" : "notes_inbox",
  "mailbox" : "bass-e",
  "filename" : "450.",
  "headers" : {
    "X-cc" : "",
    "From" : "[email protected]",
    "Subject" : "Re: Plays and other information",
    "X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
    "Content-Transfer-Encoding" : "7bit",
    "X-bcc" : "",
    "To" : "[email protected]",
    "X-Origin" : "Bass-E",
    "X-FileName" : "ebass.nsf",
    "X-From" : "Michael Simmons",
    "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
    "X-To" : "Eric Bass",
    "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}
{ _id : "[email protected]|[email protected]", value : 2 }
{ _id : "[email protected]|[email protected]", value : 2 }
{ _id : "[email protected]|[email protected]", value : 2 }
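The documents above are the output of a word-count-style aggregation keyed on sender|recipient. The reduce step can be sketched in plain Java (the addresses below are made up, and this shows only the aggregation logic, not the Hadoop/Spark job):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairCounts {
    // Count how often each sender|recipient pair occurs, mirroring the
    // { _id: "from|to", value: n } documents above.
    static Map<String, Integer> count(List<String[]> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] m : messages) {
            counts.merge(m[0] + "|" + m[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> messages = List.of(
                new String[]{"alice@example.com", "bob@example.com"},
                new String[]{"alice@example.com", "bob@example.com"},
                new String[]{"alice@example.com", "carol@example.com"});
        Map<String, Integer> counts = count(messages);
        System.out.println(counts.get("alice@example.com|bob@example.com")); // 2
    }
}
```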
Eratosthenes
Democritus
Hypatia
Shemp
Euripides
Spark Configuration

Configuration conf = new Configuration();
conf.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
conf.set("mongo.input.uri", "mongodb://localhost:27017/db.collection");
Spark Context

JavaPairRDD<Object, BSONObject> documents = context.newAPIHadoopRDD(
    conf,
    MongoInputFormat.class,
    Object.class,
    BSONObject.class
);
Data Services: mongos
Deployment Artifacts: Hadoop Connector Jar, Java Driver Jar, Fat Jar
Spark Submit

/usr/local/spark-1.5.1/bin/spark-submit \
  --class com.mongodb.spark.examples.DataframeExample \
  --master local \
  Examples-1.0-SNAPSHOT.jar
JavaRDD<Message> messages = documents.map(
    new Function<Tuple2<Object, BSONObject>, Message>() {
        public Message call(Tuple2<Object, BSONObject> tuple) {
            BSONObject header = (BSONObject) tuple._2.get("headers");
            Message m = new Message();
            m.setTo((String) header.get("To"));
            m.setX_From((String) header.get("From"));
            m.setMessage_ID((String) header.get("Message-ID"));
            m.setBody((String) tuple._2.get("body"));
            return m;
        }
    });
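The field extraction above can be exercised without a cluster. In this sketch a plain java.util.Map stands in for BSONObject and a String array stands in for the Message class (both substitutions are ours, for illustration only):

```java
import java.util.Map;

public class MessageMapper {
    // Pull the same fields out of a nested document as the Spark map()
    // function above: To, From, Message-ID from headers, plus body.
    static String[] toMessage(Map<String, Object> doc) {
        Map<?, ?> headers = (Map<?, ?>) doc.get("headers");
        return new String[]{
                (String) headers.get("To"),
                (String) headers.get("From"),
                (String) headers.get("Message-ID"),
                (String) doc.get("body")
        };
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of(
                "body", "the scrimmage is still up in the air.",
                "headers", Map.of(
                        "To", "eric@example.com",
                        "From", "michael@example.com",
                        "Message-ID", "<1@thyme>"));
        String[] m = toMessage(doc);
        System.out.println(m[3]); // the body field
    }
}
```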
MongoDB & Spark: code demo
THE FUTURE
AND BEYOND THE INFINITE
MongoDB + Spark
THANKS!
{
  name: 'Bryan Reinero',
  role: 'Developer Advocate',
  twitter: '@blimpyacht',
  email: