Meet Spark

© 2014 MapR Technologies

CHUG Spark: Hello Spark

Mike Emerick, Senior Architect, MapR

April 2014

Agenda

• Introductions

• Log file enrichment

• ETL with ML

• Recommendation Engine

• Ad hoc SQL queries

• The Future case


Who is Mike Emerick?

My bio: the highlights.

Architect for MapR for 2.5 years.

“creative hours at Workshop 88.”


Approach to this presentation

1. No API discussion

2. Architecture features and utilization

3. Use cases... and why Spark?


Spark at 10,000 feet

• Fundamentally, Spark is an MPP (massively parallel processing) engine.

• Can use many storage subsystems (great for development).

• RDDs, accumulators, broadcast variables.

• MapReduce, plus more.

• The Apache Spark site has great resources on architecture and API.
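As a rough, single-process illustration of the "Map Reduce +" idea, the sketch below mimics a map step followed by a reduce-by-key step in plain Python. It does not use the Spark API. The names `lines`, `broadcast_lookup`, and `bad_records`, along with the sample log lines, are hypothetical stand-ins for an RDD's contents, a broadcast variable, and an accumulator.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the contents of an RDD.
lines = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 POST /login",
    "malformed",
]

# Stand-in for a broadcast variable: a small read-only lookup table.
broadcast_lookup = {"10.0.0.1": "chicago", "10.0.0.2": "denver"}

# Stand-in for an accumulator: a counter that is only ever added to.
bad_records = 0


def map_line(line):
    """Map step: emit a (city, 1) pair, or None for an unparseable line."""
    parts = line.split()
    if len(parts) != 3:
        return None
    return (broadcast_lookup.get(parts[0], "unknown"), 1)


pairs = []
for line in lines:
    pair = map_line(line)
    if pair is None:
        bad_records += 1
    else:
        pairs.append(pair)

# Reduce step: sum the counts for each key.
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value
counts = dict(counts)
```

In a real cluster the map and reduce steps would run in parallel across partitions; here they run in one loop purely to show the data flow.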


Use case: SQL Queries

• “Interactive SQL on Hadoop...”

• How does Spark make this easier?

– Native Hive QL (SQL-92-ish)

– In memory and from disk

– Usually the first thought...

• Spark SQL



Use case: Log file enrichment

• Why enrich my log data?

• This is not Storm; it is batch

– Similar to HBase Async API...

• How does Spark make this easier?

– Streaming API

– Sliding windows

– SQL Hive/Shark

    • Connect to HBase

– NoSQL connectors

    • HBase
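To make the enrichment and sliding-window points concrete, here is a minimal plain-Python sketch. It is not the Spark Streaming API. The batches, the `user_region` table (standing in for a lookup against a NoSQL store such as HBase), and the helper names are all hypothetical.

```python
from collections import Counter

# Hypothetical micro-batches of (user_id, url) log records, one list per interval.
batches = [
    [("u1", "/a"), ("u2", "/b")],
    [("u1", "/c")],
    [("u3", "/a"), ("u1", "/a")],
    [("u2", "/c")],
]

# Hypothetical lookup table standing in for a NoSQL store such as HBase.
user_region = {"u1": "midwest", "u2": "west"}


def enrich(record):
    """Attach a region to a (user_id, url) record; unknown users get 'unknown'."""
    user_id, url = record
    return (user_id, url, user_region.get(user_id, "unknown"))


def sliding_windows(items, size, step):
    """Yield consecutive windows of `size` batches, advancing by `step` batches."""
    for start in range(0, len(items) - size + 1, step):
        yield items[start:start + size]


# For each window of 2 batches, sliding by 1, count enriched records per region.
window_counts = []
for window in sliding_windows(batches, size=2, step=1):
    regions = Counter(enrich(rec)[2] for batch in window for rec in batch)
    window_counts.append(dict(regions))
```

Each window overlaps the previous one by a batch, so a record can contribute to more than one window's counts.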



Use case: SQL mixing with ML

• Why are folks doing this?

• How does Spark make this easier?

– Native machine learning with MLlib

– Access to near-real-time ad hoc SQL queries

– R and SQL in the same place

– Bigger than memory, faster than MapReduce



Use case: Recommendation Engine

• It is a recommendation engine...

• How does Spark make this easier?

– ETL and enrichment

– MLlib makes it easy to import data.

– MLlib training in the same cluster

– NoSQL ad hoc lookups serve recommendations

– Dynamic
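The train-then-serve flow above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it swaps in simple item co-occurrence counting rather than any MLlib algorithm, and the `history` data and the `train` and `recommend` helpers are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical purchase history: user -> set of items.
history = {
    "alice": {"spark-book", "hadoop-book"},
    "bob": {"spark-book", "hadoop-book", "hbase-book"},
    "carol": {"spark-book", "hbase-book"},
}


def train(history):
    """Training step: count how often each ordered pair of items shares a user."""
    cooc = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in permutations(sorted(items), 2):
            cooc[a][b] += 1
    return cooc


def recommend(cooc, owned, n=1):
    """Serving step: rank items not yet owned by co-occurrence with owned items."""
    scores = defaultdict(int)
    for item in owned:
        for other, count in cooc.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]


model = train(history)
top = recommend(model, {"hadoop-book"}, n=2)
```

In the setup the slide describes, the trained model's output would be written to a NoSQL store and looked up at request time, rather than computed per request as it is here.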



Use cases build in complexity

• Adoption follows a curve of complexity

– Ingestion and query

– Ingestion, enrichment, query

– Ingestion, enrichment, machine learning, query

– Ingestion, enrichment, machine learning, serving recommendations

– .....

• Spark is flattening the curve

• Why?

– One framework

– Less data movement

– Access to preferred language


Future state: ~ in the year 2000

• ADAM - Genomics

• GraphX – Graph is near...

• MLlib – Look for lots of work here

• PySpark – Fastest evolving

• SparkR – Just getting started

• BlinkDB – Approximate (~) queries

• OEM...


MapR is hiring in Chicago

Apache Drill Beta this Summer

Happy National Making Day!

Check out W88 for Hadoop classes
