Page 1: Introduction to spark phoenix meetup aug 19th 2014

CONFIDENTIAL - RESTRICTED

Introduction to Spark

Maxime Dumas – Systems Engineer, Cloudera

Page 2: Introduction to spark phoenix meetup aug 19th 2014

Thirty Seconds About Max

• Systems Engineer

• aka Sales Engineer

• SoCal, AZ, NV

• former coder of PHP

• teaches meditation + yoga

• from Montreal, Canada


Page 3: Introduction to spark phoenix meetup aug 19th 2014

What Does Cloudera Do?

• product

  • distribution of Hadoop components, Apache licensed

  • enterprise tooling

• support

• training

• services (aka consulting)

• community

Page 4: Introduction to spark phoenix meetup aug 19th 2014


But first… how did we get here?

Page 5: Introduction to spark phoenix meetup aug 19th 2014

What does Hadoop look like?

[Diagram: a Hadoop cluster. Each worker node runs an HDFS worker (DataNode, "DN") co-located with a MapReduce worker (TaskTracker, "TT"). An HDFS master (NameNode, "NN") and a MapReduce master (JobTracker, "JT") coordinate the workers, with a standby master for failover.]

Page 6: Introduction to spark phoenix meetup aug 19th 2014

But I want MORE!

[Diagram: the same cluster, with MapReduce drawn as a single layer over the HDFS workers, alongside the HDFS master ("NN"), MapReduce master ("JT"), and standby master.]

Page 7: Introduction to spark phoenix meetup aug 19th 2014

Hadoop as an Architecture

The Old Way: $30,000+ per TB. Expensive & unattainable.

• Hard to scale
• Network is a bottleneck
• Only handles relational data
• Difficult to add new fields & data types

Built on expensive, special-purpose, "reliable" servers and expensive licensed software, with compute (RDBMS, EDW) separated from data storage (SAN, NAS) by the network.

The Hadoop Way: $300–$1,000 per TB. Affordable & attainable.

• Scales out forever
• No bottlenecks
• Easy to ingest any data
• Agile data access

Built on commodity "unreliable" servers and hybrid open source software, with compute (CPU), memory, and storage (disk) co-located on every node.

Page 8: Introduction to spark phoenix meetup aug 19th 2014

CDH: the App Store for Hadoop

[Diagram: CDH as a unified stack. Engines for batch processing (MapReduce), analytic MPP DBMS, search, machine learning, in-memory processing, and NoSQL DBMS sit on shared storage, resource management, metadata, and integration layers, wrapped by system management, data management, security, and support.]

Page 9: Introduction to spark phoenix meetup aug 19th 2014


Introduction to Apache Spark

Credits:

• Ben White

• Todd Lipcon

• Ted Malaska

• Jairam Ranganathan

• Jayant Shekhar

• Sandy Ryza

Page 10: Introduction to spark phoenix meetup aug 19th 2014

Can we improve on MR?

• Problems with MR:

• Very low-level: requires a lot of code to do simple things

• Very constrained: everything must be described as “map” and “reduce”. Powerful but sometimes difficult to think in these terms.


Page 11: Introduction to spark phoenix meetup aug 19th 2014

Can we improve on MR?

• Two approaches to improve on MapReduce:

1. Special-purpose systems that solve one problem domain well:

  • Giraph / GraphLab (graph processing)

  • Storm (stream processing)

2. Generalizations of MapReduce that provide a richer foundation for solving problems:

  • Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs)

Both are viable strategies depending on the problem!

Page 12: Introduction to spark phoenix meetup aug 19th 2014

What is Apache Spark?

Spark is a general purpose computational framework

Retains the advantages of MapReduce:

• Linear scalability
• Fault tolerance
• Data-locality-based computation

…but offers so much more:

• Leverages distributed memory for better performance
• Supports iterative algorithms that are not feasible in MR
• Improved developer experience
• Full directed-graph expressions for data-parallel computations
• Comes with libraries for machine learning, graph analysis, etc.

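To make the distributed-memory point concrete, here is a minimal spark-shell sketch (the HDFS path is hypothetical): cache an RDD once, and later passes are served from cluster memory instead of re-reading storage.

val ratings = sc.textFile("hdfs:///data/ratings").cache()  // hypothetical path
ratings.count()  // first action reads from HDFS and fills the cache
ratings.count()  // runs again from distributed memory, much faster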

Page 13: Introduction to spark phoenix meetup aug 19th 2014

Getting started with Spark

• Java API

• Interactive shells:

• Scala (spark-shell)

• Python (pyspark)


Page 14: Introduction to spark phoenix meetup aug 19th 2014

Execution modes

• Standalone Mode

  • Dedicated Spark master and worker daemons; a dedicated Spark runtime with static resource limits

• YARN Client Mode

  • Launches a YARN application with the driver program running locally

• YARN Cluster Mode

  • Launches a YARN application with the driver program running in the YARN ApplicationMaster

• The YARN modes allow dynamic resource management between Spark, MR, Impala…
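A rough sketch of choosing the execution mode in code (Spark 1.x-era API; the master host name is an assumption, and YARN cluster mode is normally launched with spark-submit rather than set in code):

import org.apache.spark.{SparkConf, SparkContext}

// Standalone mode: point at the dedicated Spark master daemon.
val standaloneConf = new SparkConf()
  .setAppName("demo")
  .setMaster("spark://master-host:7077")   // hypothetical host

// YARN client mode: the driver runs locally; executors run in YARN containers.
val yarnClientConf = new SparkConf()
  .setAppName("demo")
  .setMaster("yarn-client")

val sc = new SparkContext(yarnClientConf)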

Page 15: Introduction to spark phoenix meetup aug 19th 2014

Spark Concepts


Page 16: Introduction to spark phoenix meetup aug 19th 2014

Parallelized Collections

scala> val data = 1 to 5
data: Range.Inclusive = Range(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]

Now I can apply parallel operations to this array:

scala> distData.reduce(_ + _)
[… Adding task set 0.0 with 56 tasks …]
res0: Int = 15

What just happened?!

Page 17: Introduction to spark phoenix meetup aug 19th 2014

RDD – Resilient Distributed Dataset

• Collections of objects partitioned across a cluster

• Stored in RAM or on Disk

• You can control persistence and partitioning

• Created by:

• Distributing local collection objects

• Transformation of data in storage

• Transformation of RDDs

• Automatically rebuilt on failure (resilient)

• Contains lineage to compute from storage

• Lazy materialization
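As a minimal sketch of those creation paths in spark-shell (the HDFS path is hypothetical):

val fromLocal = sc.parallelize(1 to 1000)            // distributing a local collection
val fromStorage = sc.textFile("hdfs:///data/logs")   // transformation of data in storage
val fromRdd = fromStorage.filter(_.nonEmpty)         // transformation of another RDD

fromRdd.cache()   // persistence is under your control
fromRdd.count()   // nothing materializes until an action runs (laziness)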


Page 18: Introduction to spark phoenix meetup aug 19th 2014

RDD transformations


Page 19: Introduction to spark phoenix meetup aug 19th 2014

Operations on RDDs

Transformations lazily transform an RDD into a new RDD

• map

• flatMap

• filter

• sample

• join

• sort

• reduceByKey

• …

Actions run computation to return a value

• collect

• reduce(func)

• foreach(func)

• count

• first, take(n)

• saveAs

• …
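A small spark-shell sketch of the distinction (the input path is hypothetical): the two transformations only describe new RDDs; work happens when the action runs.

val lines = sc.textFile("hdfs:///data/events")
val pairs = lines.map(line => (line.split("\t")(0), 1))  // transformation: lazy
val counts = pairs.reduceByKey(_ + _)                    // transformation: still lazy
counts.take(5)                                           // action: triggers the computation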


Page 20: Introduction to spark phoenix meetup aug 19th 2014

Fault Tolerance

• RDDs contain lineage.

• Lineage – source location and list of transformations

• Lost partitions can be re-computed from source data


msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

[Diagram: HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD]

Page 21: Introduction to spark phoenix meetup aug 19th 2014


Examples

Page 22: Introduction to spark phoenix meetup aug 19th 2014

Word Count in MapReduce


package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Page 23: Introduction to spark phoenix meetup aug 19th 2014

Word Count in Spark

sc.textFile("words")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()


Page 24: Introduction to spark phoenix meetup aug 19th 2014

Logistic Regression

• Read two sets of points

• Look for a plane W that separates them

• Perform gradient descent:

• Start with random W

• On each iteration, sum a function of W over the data

• Move W in a direction that improves it
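A hedged Scala sketch of that loop in spark-shell (the input path, the "label feature1 feature2" line format, the feature count, and the iteration count are all assumptions; production code would use MLlib):

case class Point(x: Array[Double], y: Double)  // y is -1.0 or +1.0

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

val points = sc.textFile("hdfs:///data/points").map { line =>
  val f = line.split(' ').map(_.toDouble)
  Point(f.tail, f.head)
}.cache()  // cached: re-read from memory on every iteration

var w = Array.fill(2)(math.random)  // start with a random W (2 features assumed)

for (i <- 1 to 10) {
  // Sum a function of W over the data...
  val gradient = points.map { p =>
    val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * scale)
  }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  // ...then move W in a direction that improves it.
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}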


Page 25: Introduction to spark phoenix meetup aug 19th 2014

Intuition


Page 26: Introduction to spark phoenix meetup aug 19th 2014

Logistic Regression


Page 27: Introduction to spark phoenix meetup aug 19th 2014

Logistic Regression Performance


Page 28: Introduction to spark phoenix meetup aug 19th 2014


Spark and Hadoop: a Framework within a Framework

Page 29: Introduction to spark phoenix meetup aug 19th 2014

[Diagram: the CDH stack again, now with Spark as a first-class engine alongside MapReduce, HBase, Impala, Solr, and others, over the shared storage, resource management, metadata, and integration layers, wrapped by system management, data management, security, and support.]

Page 30: Introduction to spark phoenix meetup aug 19th 2014


Page 31: Introduction to spark phoenix meetup aug 19th 2014


Page 32: Introduction to spark phoenix meetup aug 19th 2014

[Diagram: the same CDH stack as on Page 29, with Spark alongside MapReduce, HBase, Impala, and Solr.]

Page 33: Introduction to spark phoenix meetup aug 19th 2014

Spark Streaming

• Takes the concept of RDDs and extends it to DStreams

• Fault-tolerant like RDDs

• Transformable like RDDs

• Adds new “rolling window” operations

• Rolling averages, etc

• But keeps everything else!

• Regular Spark code works in Spark Streaming

• Can still access HDFS data, etc
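A minimal sketch of a streaming word count with a rolling window (host, port, and durations are assumptions):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(1))       // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)  // a DStream, transformable like an RDD

// Rolling window: counts over the last 30 seconds, recomputed every 10.
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

counts.print()
ssc.start()
ssc.awaitTermination()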


Page 34: Introduction to spark phoenix meetup aug 19th 2014

Micro-batching for on-the-fly ETL


Page 35: Introduction to spark phoenix meetup aug 19th 2014

Fault recovery

How fast can the system recover?


Page 36: Introduction to spark phoenix meetup aug 19th 2014

Fault Recovery

• RDDs store dependency graph

• Because RDDs are deterministic, missing RDDs are rebuilt in parallel on other nodes

• Stateful RDDs can have infinite lineage

• Periodic checkpoints to disk clear lineage

• Faster recovery times

• Better handling of stragglers vs row-by-row streaming
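A small sketch of the checkpoint mechanism (paths are hypothetical): once the checkpoint is written, recovery reads the materialized partitions instead of replaying the whole lineage.

sc.setCheckpointDir("hdfs:///tmp/checkpoints")
val counts = sc.textFile("hdfs:///data/events")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
counts.checkpoint()  // marked for checkpointing
counts.count()       // the action materializes the RDD to disk and truncates its lineage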


Page 37: Introduction to spark phoenix meetup aug 19th 2014


Summary

Page 38: Introduction to spark phoenix meetup aug 19th 2014

Why Spark?

• Flexible like MapReduce

• High performance

• Machine learning, iterative algorithms

• Interactive data exploration

• Concise, easy API for developer productivity


Page 39: Introduction to spark phoenix meetup aug 19th 2014


Page 40: Introduction to spark phoenix meetup aug 19th 2014

Spark


http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Installation-Guide/cm5ig_install_spark.html

Page 41: Introduction to spark phoenix meetup aug 19th 2014

A Brief History


Milestones, 2002–2014:

• Doug Cutting launches the Nutch project
• Google releases the GFS paper
• Google releases the MapReduce paper
• MapReduce implemented in Nutch; Nutch adds a distributed file system
• Hadoop spun out of the Nutch project
• Hadoop breaks the Terasort world record
• Cloudera founded
• CDH and CDH2 released; Cloudera Manager released
• HBase, ZooKeeper, Flume, and more added to CDH
• CDH3 released
• CDH4 released, adding HA
• Impala (SQL on Hadoop) launched
• Sentry and Search launched
• CDH5

Page 42: Introduction to spark phoenix meetup aug 19th 2014

What is Apache Hadoop?

• An open-source implementation of Google’s GFS and MapReduce papers

• An Apache Software Foundation top-level project

• Good at storing and processing all kinds of data

• Reliable storage at terabyte/petabyte scale on unreliable (cheap) hardware

• A distributed system for counting words


Page 43: Introduction to spark phoenix meetup aug 19th 2014

What is Apache Hadoop?


Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed.

Has the Flexibility to Store and Mine Any Type of Data

• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at Processing Complex Data

• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales Economically

• Can be deployed on industry-standard hardware
• Open source platform guards against vendor lock-in

CORE HADOOP SYSTEM COMPONENTS

• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework