Spark Meetup at Viadeo, Wednesday 4 February 2015
• 19:00-19:45: Introduction to Spark and examples of new business use cases that real-time Big Data can address. Cédric Carbone, Co-founder of Influans, [email protected]
- Spark vs Hadoop MapReduce
- Spark Streaming vs Storm
- Machine Learning with Spark
- Business use case: NextProductToBuy
• 19:45-20:00: Extending Spark (Tachyon / Spark JobServer). Jonathan Lamiel, Talend Labs, [email protected]
- Spark's shared memory with Tachyon
- Making Spark interactive with Spark JobServer
• 20:00-21:00: Big Data analytics with Spark & Cassandra. DuyHai DOAN, Technical Advocate at DataStax, duy_hai.doan@datastax.com
Apache Spark is a general data processing framework that lets you perform data processing tasks in memory. Apache Cassandra is a highly available and massively scalable NoSQL data store. Combining Spark's flexible API with Cassandra's performance gives an interesting combination for both real-time and batch processing.
Paris Spark Meetup (Feb 2015), ccarbone: Spark Streaming vs Storm / MLlib / NextProductToBuy
Map Reduce
➜ Map(): parses the input and generates 0 to n <key, value> pairs
➜ Reduce(): combines all values of the same key (for WordCount, sums them) and generates a <key, value> pair
WordCount Example
➜ Each mapper takes a line as input and breaks it into words
• It emits a key/value pair of the word and 1
➜ Each reducer sums the counts for each word
• It emits a key/value pair of the word and the sum
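The WordCount flow above can be sketched in plain Python, with no Hadoop involved. This is a single-process illustration under stated assumptions, not Hadoop's API: the function names (`map_line`, `shuffle`, `reduce_word`, `word_count`) are hypothetical, and splitting on whitespace and lowercasing words are choices made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: break one input line into words and emit (word, 1) for each.

    An empty line emits nothing, which is the "0 to n pairs" case.
    """
    for word in line.split():
        yield word.lower(), 1  # lowercasing is an assumption of this sketch


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, between the map and reduce steps."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_word(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts for one word and emit (word, sum)."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(mapped)
    return dict(reduce_word(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["to be or not to be", "to do is to be"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls run on different machines and the grouping step moves data over the network; here everything runs in one process to keep the data flow visible.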
Hadoop MapReduce v1
MapReduce v1
➜ Not well suited to low-latency jobs on small datasets
➜ Good for offline batch jobs on massive data
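One way to read this trade-off is as a fixed per-job cost that dominates on small inputs and becomes negligible on large ones. The toy model below is a sketch under that assumption. The startup time and throughput are hypothetical numbers picked for illustration, not measurements of Hadoop, and the function names are invented for the example.

```python
def job_time_seconds(data_gb: float, startup_s: float, throughput_gb_per_s: float) -> float:
    """Toy model: a fixed startup cost plus time proportional to the data volume."""
    return startup_s + data_gb / throughput_gb_per_s


def overhead_share(data_gb: float, startup_s: float, throughput_gb_per_s: float) -> float:
    """Fraction of the total job time spent on the fixed startup cost."""
    return startup_s / job_time_seconds(data_gb, startup_s, throughput_gb_per_s)


# Hypothetical values, chosen only to show the shape of the curve.
STARTUP_S = 30.0   # assumed fixed scheduling and launch cost per job
THROUGHPUT = 1.0   # assumed aggregate processing rate in GB/s

if __name__ == "__main__":
    for size_gb in (0.1, 10_000.0):
        share = overhead_share(size_gb, STARTUP_S, THROUGHPUT)
        print(f"{size_gb} GB: {share:.1%} of the job time is startup overhead")
```

With these assumed values, startup accounts for nearly all of the time on the 0.1 GB job and well under 1% on the 10 TB job.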
Hadoop 1
➜ Batch ONLY
• High-latency jobs
HDFS (Redundant, Reliable Storage)
MapReduce 1 (Cluster Resource Management + Data Processing)
BATCH
Hive (Query)
Pig (Scripting)
Cascading (Accelerate Dev.)
Hadoop 2: Big Data Operating System
➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS
• Simultaneously & with predictable levels of service
• Data analysts and real-time applications
HDFS (Redundant, Reliable Storage)
MapReduce 1 (Data Processing)
BATCH
YARN (Cluster Resource Management)
Other Data Processing
…
HDFS (Redundant, Reliable Storage)
BATCH, INTERACTIVE, STREAMING, GRAPH, ML, IN-MEMORY, ONLINE, SEARCH
YARN (Cluster Resource Management)
HDFS (Redundant, Reliable Storage)
YARN (Cluster Resource Management)
BATCH (MapReduce)
INTERACTIVE (Tez)
STREAMING (Storm, Samza, Spark Streaming)
GRAPH (Giraph, GraphX)
Machine Learning (MLlib)
In-Memory (Spark)
ONLINE (HBase HOYA)
OTHER (Elasticsearch)
https://spark.apache.org
Apache Spark™ is a fast and general engine for large-scale data processing.
The most active project
[Bar charts: Patches and Lines Added, comparing MapReduce, Storm, YARN and Spark]
Spark won the Daytona GraySort contest!
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Sort 100 TB of data on disk 3x faster than Hadoop MapReduce using 10x fewer machines.
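The two ratios in the sort claim can be combined into a rough per-machine figure. The arithmetic below is a sketch that uses only the "3x faster" and "10x fewer machines" ratios from the slide and assumes the work is spread evenly over comparable machines. The slide does not state that assumption, and the function name is hypothetical.

```python
def per_machine_speedup(time_speedup: float, machine_reduction: float) -> float:
    """Relative amount of data sorted per machine per unit of time.

    Per-machine throughput is data / (time * machines). For the same data,
    dividing the time by `time_speedup` and the machine count by
    `machine_reduction` multiplies that throughput by their product.
    """
    return time_speedup * machine_reduction


if __name__ == "__main__":
    # Ratios taken from the slide: 3x faster, 10x fewer machines.
    print(per_machine_speedup(3.0, 10.0))  # prints 30.0
```

Under that assumption, each machine in the Spark run sorted roughly 30 times more data per unit of time than each machine in the Hadoop MapReduce run.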