YARN by default (Spark on YARN)

Ferran Galí i Reniu, @ferrangali

About me

@ferrangali

The Big Data problem

100 MB/s

2 TB = 3.5 hours

100 MB/s

2 TB = 30 min

Source: Google data centers

Hardware

Node Node Node Node Node Node Node Node

HDFS - Hadoop Distributed File System

Hardware

Storage

$> hadoop fs -ls
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2015-06-11 11:27 dir
-rw-r--r--   1 hadoop supergroup    2198927 2015-06-10 17:22 file1.txt

$> hadoop fs -ls dir
Found 2 items
-rw-r--r--   1 hadoop supergroup    2198927 2015-06-10 17:22 dir/file2.txt
-rw-r--r--   1 hadoop supergroup    2198927 2015-06-10 17:22 dir/file3.txt

$> hadoop fs -cat dir/file3.txt
line1
line2
line3
line4
line5

MapReduce

Hardware

Storage

Processing Job

Data Pipeline

Hardware

Storage

Processing Job

Application

Reduce

MapReduce Job

map() {
  // Your code here
}

reduce() {
  // Your code here
}
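The two hooks of a MapReduce job can be illustrated with a toy word count. This is a minimal single-process sketch in plain Python, not Hadoop's API: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the grouping step stands in for the shuffle that the framework performs between the two phases.

```python
from collections import defaultdict


def map_fn(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Combine all values emitted for one key into a single result."""
    return word, sum(counts)


def run_job(lines):
    """Toy driver: map every line, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


result = run_job(["spark on yarn", "yarn by default"])
print(result)
```

In a real cluster the map calls run in parallel next to the data blocks and the grouped keys are partitioned across reducers; only the two user-supplied functions stay the same.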

Hardware

Storage

Processing Job Job

Application

MapReduce 1.0 Architecture

Node: JobTracker

Node: TaskTracker

Node: TaskTracker

Map Map Map Map

Map Map Map Reduce

Reduce Reduce Reduce Reduce

Application

Reduce

Limitations

YARN - Yet Another Resource Negotiator

Hardware

Storage

Resource Manager

Processing

Memory

YARN Architecture

ResourceManager

NodeManager

NodeManager

Application: x6, 1 core, 1024 MB

ApplicationMaster

Container

Container

Application 2: x4, 2 cores, 2048 MB

ApplicationMaster

Container

Container
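The way container requests are matched against NodeManager capacity can be sketched with a toy first-fit allocator. Everything here is an assumption made for illustration: the `Node` class, the `allocate` function and the node capacities (8 cores, 8192 MB) are hypothetical, the ApplicationMaster's own container is ignored, and real YARN schedulers apply queues and fairness policies that this sketch leaves out. Only the two request sizes (6 containers of 1 core and 1024 MB, 4 containers of 2 cores and 2048 MB) come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A NodeManager and its free capacity (hypothetical toy model)."""
    name: str
    free_cores: int
    free_mb: int
    containers: list = field(default_factory=list)


def allocate(nodes, app, count, cores, mb):
    """Place up to `count` containers first-fit; return how many were granted."""
    granted = 0
    for _ in range(count):
        for node in nodes:
            if node.free_cores >= cores and node.free_mb >= mb:
                node.free_cores -= cores
                node.free_mb -= mb
                node.containers.append(app)
                granted += 1
                break
        else:
            break  # no node can host another container of this size
    return granted


# Assumed cluster: two NodeManagers with 8 cores and 8192 MB each.
nodes = [Node("nm1", 8, 8192), Node("nm2", 8, 8192)]
granted_app1 = allocate(nodes, "app1", count=6, cores=1, mb=1024)
granted_app2 = allocate(nodes, "app2", count=4, cores=2, mb=2048)
print(granted_app1, granted_app2, [(n.name, n.free_cores, n.free_mb) for n in nodes])
```

The point of the sketch is that the ResourceManager only hands out cores and memory; what runs inside each container is up to the application.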

New Paradigms

Hardware

Storage

Resource Manager

Processing ...

In Memory / Streaming

Application

Improved Data Pipelines

Reduce

Spark Job

import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf()
  val sc = new SparkContext(sparkConf)

  // placeholder from the slide: build an RDD and run an action on it
  sc.rdd(...).action()
  sc.stop()
}

Stage Stage Stage

Task Task

Stage Stage

Task Task Task

Task Task

Spark Architecture

Driver: coordinates stage tasks across executors

Executor Executor Executor Executor

Task Task

Spark on YARN

Hardware

Storage

Resource Manager

Processing

Configuration

export HADOOP_CONF_DIR="/opt/hadoop/hadoop_install/conf"
core-site.xml
hdfs-site.xml
yarn-site.xml

spark.yarn.jar hdfs:///user/hadoop/libs/spark-assembly.jar
# need to upload the file to HDFS

spark.eventLog.enabled true
spark.eventLog.dir hdfs:///app-logs/spark/logs
spark.yarn.historyServer.address historyserver_host:18080
# needed if you want to analyse finished jobs

Executor configuration

spark.executor.cores 4
# the number of tasks each executor runs concurrently

spark.executor.memory 4g
# memory shared across all tasks of the executor

spark.yarn.executor.memoryOverhead 614m
# 15-20% of total executor memory

Executor
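The overhead figure follows from simple arithmetic on the executor memory. The sketch below, with the hypothetical helper `executor_container_mb`, assumes the lower 15% bound and a 4g executor expressed as 4096 MB; the per-task figure is only a rough average, since the memory is shared rather than partitioned between tasks.

```python
def executor_container_mb(executor_memory_mb, overhead_fraction=0.15):
    """Return (overhead, total) in MB for one executor container request."""
    overhead_mb = int(executor_memory_mb * overhead_fraction)
    return overhead_mb, executor_memory_mb + overhead_mb


executor_memory_mb = 4096  # spark.executor.memory 4g
executor_cores = 4         # spark.executor.cores 4

overhead_mb, container_mb = executor_container_mb(executor_memory_mb)
avg_memory_per_task_mb = executor_memory_mb // executor_cores

print(overhead_mb)             # 614
print(container_mb)            # 4710
print(avg_memory_per_task_mb)  # 1024
```

Under these assumptions each executor asks YARN for roughly 4710 MB, so the container size, not just `spark.executor.memory`, is what has to fit on a NodeManager.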

Deployment

$ ./spark-submit \
    --class my.main.class \
    --master {deploy-mode} \
    my-jar.jar \
    arg1 arg2 arg3 ...

yarn-client

yarn-cluster
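When submissions are scripted, the command line can be assembled programmatically. The helper below is a hypothetical sketch (`spark_submit_command` is not part of Spark); it only uses the `--class` and `--master` options and the two master values shown on the slide, and it builds the argument list without running anything.

```python
import shlex

# The two deploy modes named on the slide.
VALID_MASTERS = ("yarn-client", "yarn-cluster")


def spark_submit_command(main_class, master, jar, app_args=()):
    """Build the argument list for a spark-submit call (does not execute it)."""
    if master not in VALID_MASTERS:
        raise ValueError(f"unsupported master: {master}")
    return ["./spark-submit", "--class", main_class, "--master", master, jar, *app_args]


command = spark_submit_command("my.main.class", "yarn-cluster", "my-jar.jar", ["arg1", "arg2"])
print(shlex.join(command))
```

Keeping the command as a list makes it safe to hand to a process launcher later without worrying about shell quoting.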

--master yarn-client

Client: $ ./spark-submit (the Driver runs here)

ResourceManager

NodeManager: Container (Executor)

NodeManager: Container (Executor)

--master yarn-cluster

Client: $ ./spark-submit

ResourceManager

NodeManager: Container (Driver), Container (Executor), Container (Executor)

NodeManager: Container (Executor)

Static allocation

spark.executor.instances 10
# it will allocate this fixed number of executors

Dynamic allocation

spark.shuffle.service.enabled true
# also requires installing an auxiliary service on the NodeManagers

spark.dynamicAllocation.enabled true
spark.dynamicAllocation.initialExecutors 0
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 10
# will allocate and release executors as needed
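The idea of allocating and releasing executors as needed can be pictured as a target that tracks the workload and is clamped between the configured bounds. This is a simplified model under stated assumptions, not Spark's actual implementation: `target_executors` is a hypothetical function, and the real policy also waits for backlog and idle timeouts before it adds or removes executors.

```python
import math


def target_executors(pending_tasks, running_tasks, executor_cores, min_executors, max_executors):
    """Executors needed to run all known tasks at once, clamped to [min, max]."""
    needed = math.ceil((pending_tasks + running_tasks) / executor_cores)
    return max(min_executors, min(max_executors, needed))


# Bounds from the configuration above: minExecutors 1, maxExecutors 10, 4 cores each.
idle = target_executors(0, 0, executor_cores=4, min_executors=1, max_executors=10)
small = target_executors(10, 2, executor_cores=4, min_executors=1, max_executors=10)
busy = target_executors(100, 0, executor_cores=4, min_executors=1, max_executors=10)
print(idle, small, busy)
```

With static allocation the same application would hold all 10 executors for its whole lifetime, whether or not it has tasks to run.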

Trovit YARN cluster

Hardware

Storage

Resource Manager

Processing

Trovit

Business Intelligence

Search engine

Mailing

Push Notifications

Online Media Buying

Questions?

Thank You

Ferran Galí i Reniu

@ferrangali

Icons made by Freepik from Flaticon, licensed under CC BY 3.0
