YARN by default (Spark on YARN)
Ferran Galí i Reniu (@ferrangali)

Apr 16, 2017
Transcript
Page 1: YARN by default (Spark on YARN)

YARN by default (Spark on YARN)

Ferran Galí i Reniu (@ferrangali)

Page 2

About me

@ferrangali

Page 3

The Big Data problem

100 MB/s
2 TB = 3.5 hours

Page 4

The Big Data problem

100 MB/s
2 TB = 30 min

Page 6
Page 7
Page 8

HDFS

Diagram: eight Nodes (the Hardware layer)

Pages 9-10

HDFS

Diagram: eight Nodes (Hardware), with HDFS - Hadoop Distributed File System - as the Storage layer on top

Page 11

HDFS

$> hadoop fs -ls

Pages 12-16

HDFS

The full terminal session, built up one command at a time:

$> hadoop fs -ls
Found 2 items
drwxr-xr-x   - hadoop supergroup        0 2015-06-11 11:27 dir
-rw-r--r--   1 hadoop supergroup  2198927 2015-06-10 17:22 file1.txt
$> hadoop fs -ls dir
Found 2 items
-rw-r--r--   1 hadoop supergroup  2198927 2015-06-10 17:22 dir/file2.txt
-rw-r--r--   1 hadoop supergroup  2198927 2015-06-10 17:22 dir/file3.txt
$> hadoop fs -cat dir/file3.txt
line1
line2
line3
line4
line5

Page 17

MapReduce

Diagram: eight Nodes (Hardware) with HDFS - Hadoop Distributed File System - as the Storage layer

Page 18

Diagram: the same stack with MapReduce added as the Processing layer

Pages 19-20

Data Pipeline

Diagram: an Application submits a Processing Job that runs on the stack (Hardware / Storage: HDFS / Processing)

Page 21

MapReduce Job

Diagram: the input is Split four ways, each Split feeds a Map task, the Map outputs are shuffled into three Reduce tasks, and each Reduce Writes its output

map() {
  // Your code here
}

reduce() {
  // Your code here
}
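The map()/reduce() stubs above can be made concrete with the classic word-count job. This is a hedged sketch in plain Python that simulates the map/shuffle/reduce flow in-process; it is not Hadoop API code, and the function names are illustrative:

```python
from collections import defaultdict

def map_phase(line):
    # map(): emit a (word, 1) pair for every word in the input split
    for word in line.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # reduce(): sum all counts that were shuffled to this key
    return (word, sum(counts))

def run_job(splits):
    # Simulate the shuffle: group mapper output by key
    shuffled = defaultdict(list)
    for line in splits:
        for word, count in map_phase(line):
            shuffled[word].append(count)
    # Run one reduce call per key and collect the "written" results
    return dict(reduce_phase(w, c) for w, c in shuffled.items())

result = run_job(["hello yarn", "hello spark"])
# result == {"hello": 2, "yarn": 1, "spark": 1}
```

In the real framework the shuffle happens over the network between nodes; here a dictionary stands in for it.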

Pages 22-24

Data Pipeline

Diagram: the Application now chains several Processing Jobs (Job → Job) over the same stack

Page 25

MapReduce 1.0 Architecture

Diagram: one Node runs the JobTracker; four Nodes run TaskTrackers

Page 26

Diagram: each TaskTracker Node has a fixed set of Map and Reduce slots

Page 27

Diagram: an Application submits a job to the JobTracker

Page 28

Diagram: the JobTracker schedules Map tasks onto the TaskTrackers

Page 29

Diagram: ... then Reduce tasks

Pages 30-32

Limitations

Page 33

YARN

Page 34

YARN - Yet Another Resource Negotiator

Diagram: eight Nodes (Hardware) with HDFS (Storage), YARN as the Resource Manager, and Processing on top

Page 35

Cores

Page 36

Memory

Page 37

YARN Architecture

Diagram: a ResourceManager and several NodeManagers, each advertising x8 cores and x8 memory units

Page 38

Diagram: an Application asks for x6 containers of 1 core / 1024 MB each

Page 39

Diagram: the ResourceManager starts an ApplicationMaster on one of the NodeManagers

Page 40

Diagram: the ApplicationMaster obtains six Containers spread across the NodeManagers

Page 41

Diagram: the Containers run the Map tasks

Page 42

Diagram: ... then the Reduce tasks

Page 43

Diagram: a second Application (Application 2) asks for x4 containers of 2 cores / 2048 MB each

Page 44

Diagram: a second ApplicationMaster starts, alongside the first application's Containers

Page 45

Diagram: both applications' Containers run side by side on the shared NodeManagers
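The container counts in these diagrams follow from simple division. Assuming each NodeManager advertises 8 cores and 8192 MB (one reading of the x8 / x8 labels), this sketch shows how many containers of a given shape fit on one node; containers_per_node is an illustrative helper, not a YARN API:

```python
def containers_per_node(node_cores, node_mem_mb, req_cores, req_mem_mb):
    # A container must fit in both dimensions; the tighter one limits the count
    return min(node_cores // req_cores, node_mem_mb // req_mem_mb)

# Application 1: containers of 1 core / 1024 MB
print(containers_per_node(8, 8192, 1, 1024))  # 8 per node

# Application 2: containers of 2 cores / 2048 MB
print(containers_per_node(8, 8192, 2, 2048))  # 4 per node
```

The real scheduler also accounts for minimum allocation sizes and already-running containers, but the fit-in-both-dimensions rule is the core idea.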

Page 46

New Paradigms

Diagram: on the stack (Hardware / Storage: HDFS / Resource Manager: YARN), the Application's Processing layer is a Batch engine

Page 47

Diagram: ... or an In Memory / Streaming engine

Page 48

Diagram: ... or any other processing paradigm

Page 49

Improved Data Pipelines

Diagram: a pipeline of three chained MapReduce jobs, each one a Map → Reduce → Map sequence

Page 50

Diagram: the same pipeline with redundant steps removed, leaving fewer Map and Reduce stages

Page 51

Demo

Page 52

Page 53

Spark Job

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf()
  val sc = new SparkContext(sparkConf)

  sc.rdd(...).action()

  sc.stop()
}

Page 54

Spark Job

Diagram: the Main program builds RDDs

Page 55

Diagram: the RDD graph is cut into a chain of Stages

Page 56

Diagram: each Stage fans out into many parallel Tasks
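The Stage/Task fan-out on these slides can be modelled in a few lines. A toy sketch in plain Python (names are illustrative, not Spark API): each stage produces one task per data partition.

```python
def plan_tasks(num_partitions, stages):
    # One task per (stage, partition) pair, mirroring the diagram's fan-out
    return [(stage, p) for stage in stages for p in range(num_partitions)]

tasks = plan_tasks(num_partitions=4, stages=["stage0", "stage1", "stage2"])
# 3 stages x 4 partitions = 12 tasks in total
```

This is why executor sizing matters below: the scheduler has many small tasks to place, and each executor offers a fixed number of task slots.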

Page 57

Spark Architecture

Diagram: a Driver coordinates stage tasks across several Executors

Page 58

Diagram: the Main program runs inside the Driver

Page 59

Diagram: each Executor runs many Tasks in parallel

Page 60

Spark on YARN

Diagram: the stack (Hardware / Storage: HDFS / Resource Manager: YARN) with Spark as the Processing layer

Page 61

Configuration

export HADOOP_CONF_DIR="/opt/hadoop/hadoop_install/conf"
# must contain core-site.xml, hdfs-site.xml and yarn-site.xml

Page 62

Configuration

spark.yarn.jar hdfs:///user/hadoop/libs/spark-assembly.jar
# need to upload the file to HDFS

spark.eventLog.enabled true
spark.eventLog.dir hdfs:///app-logs/spark/logs
spark.yarn.historyServer.address historyserver_host:18080
# needed if you want to analyse finished jobs

Page 63

Executor configuration

spark.executor.cores 4
# the number of tasks each executor will run in parallel

spark.executor.memory 4g
# memory shared across all of the executor's tasks

spark.yarn.executor.memoryOverhead 614m
# 15-20% of total executor memory

Diagram: one Executor running four Tasks
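The 614m figure above is just 15% of the 4g executor memory; YARN sizes the container at executor memory plus overhead. A quick check of the arithmetic (plain Python, variable names illustrative):

```python
executor_memory_mb = 4 * 1024                 # spark.executor.memory 4g
overhead_mb = int(executor_memory_mb * 0.15)  # 15% of executor memory
container_mb = executor_memory_mb + overhead_mb

print(overhead_mb)   # 614, matching spark.yarn.executor.memoryOverhead
print(container_mb)  # 4710 MB requested from YARN per executor
```

If the overhead is set too low, YARN can kill executors whose off-heap usage pushes the container past its limit, which is why the 15-20% rule of thumb exists.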

Page 64

Deployment

$ ./spark-submit \
    --class my.main.class \
    --master {deploy-mode} \
    my-jar.jar \
    arg1 arg2 arg3 ...

where {deploy-mode} is yarn-client or yarn-cluster

Page 65

--master yarn-client

Diagram: the Client runs $ ./spark-submit and contacts the ResourceManager, which manages the NodeManagers

Page 66

Diagram: the Driver runs inside the Client process

Page 67

Diagram: YARN allocates Containers on the NodeManagers

Page 68

Diagram: the Containers run Executors, coordinated by the Driver on the Client

Page 69

Demo

Page 70

--master yarn-cluster

Diagram: the Client runs $ ./spark-submit and contacts the ResourceManager

Page 71

Diagram: the Driver starts inside a container on one of the NodeManagers

Page 72

Diagram: the Client only talks to the ResourceManager; the Driver keeps running inside the cluster

Page 73

Diagram: YARN allocates Containers on the NodeManagers

Page 74

Diagram: the Containers run Executors, coordinated by the in-cluster Driver

Page 75

Demo

Page 76

Static allocation

spark.executor.instances 10
# it will allocate this fixed number of executors

Page 77

Static allocation (diagram)

Page 78

Dynamic allocation

spark.shuffle.service.enabled true
# also requires installing an auxiliary service on the NodeManagers

spark.dynamicAllocation.enabled true
spark.dynamicAllocation.initialExecutors 0
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 10
# will allocate and release executors as needed

Page 79

Dynamic allocation (diagram)

Page 80

Demo

Page 81

Page 82

Trovit YARN cluster

Diagram: the stack (Hardware / Storage: HDFS / Resource Manager: YARN / Processing)

Page 83

Trovit

Business Intelligence
Search engine
Mailing / Push Notifications
Online Media Buying

Page 84

Questions?

Page 85

Thank You
Ferran Galí i Reniu
@ferrangali

Icons made by Freepik from Flaticon are licensed under CC BY 3.0