Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Page 1
July 2015
Scaling Spark Workloads on YARN
Boulder/Denver Big Data – Shane Kumpf & Mac Moore, Solutions Engineers, Hortonworks. July 2015
Page 2
Agenda
§ Introduction – Why we love Spark, Spark Strategy, What's Next
§ YARN: The Data Operating System
§ Spark: Processing Internals Review
§ Spark: on YARN
§ Demo: Scaling Spark on YARN in the cloud
§ Q & A
Page 3
Made for Data Science – All apps need to get predictive at scale and fine granularity
Democratizes Machine Learning – Spark is doing for ML on Hadoop what Hive did for SQL on Hadoop
Elegant Developer APIs – DataFrames, Machine Learning, and SQL
Realize Value of Data Operating System – A key tool in the Hadoop toolbox
Community – Broad developer, customer and partner interest
Why We Love Spark at Hortonworks
[Diagram: YARN – Data Operating System, coordinating Storage, Governance, Security, Operations, and Resource Management]
Page 4
Hadoop/YARN Powered data operating system
• 100% open source, multi-tenant data platform for any application, any dataset, anywhere
• Built on a centralized architecture of shared enterprise services:
  – Scalable tiered storage
  – Resource and workload management
  – Trusted data governance and metadata management
  – Consistent operations
  – Comprehensive security
  – Developer APIs and tools
Data Operating System: Open Enterprise Hadoop
Page 5
Themes for Spark Strategy
Spark is made for Data Science
• Lead in the community for ML optimization
• Data Science theme of Spark Summit / Hadoop Summit
Provide Notebooks for data exploration & visualization
• iPython Ambari Stack
• Zeppelin – we're very excited about this project
Process more Hadoop data efficiently in Spark
• Hive/ORC data delivered, HBase work in progress
Innovate at the core
• Security, Spark on YARN improvements and more
Page 6
Current State of Security in Spark
Only Spark on YARN supports Kerberos today • Leverage Kerberos for authentication
Spark reads data from HDFS & ORC • HDFS file permissions (& Ranger integration) applicable to Spark jobs
Spark submits job to YARN queue • YARN queue ACL (& Ranger integration) applicable to Spark jobs
Wire Encryption • Spark has some coverage, not all channels are covered
LDAP Authentication • No Authentication in Spark UI OOB, supports filter for hooking in LDAP
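A minimal sketch of hooking such a filter into the Spark UI via the standard spark.ui.filters setting; the filter class name here is hypothetical (any javax.servlet.Filter implementation on the driver classpath would do):

import org.apache.spark.SparkConf

// spark.ui.filters takes a comma-separated list of servlet filter classes that
// every Spark UI request is routed through; com.example.LdapAuthFilter is a
// hypothetical filter that authenticates users against LDAP.
val conf = new SparkConf()
  .setAppName("secured-ui-demo") // illustrative app name
  .set("spark.ui.filters", "com.example.LdapAuthFilter")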
Page 7
What about ORC support?
ORC – Optimized Row Columnar format. ORC is an Apache TLP providing columnar storage for Hadoop.
Spark ORC Support • ORC support in HDP/Spark since 1.2.x – (Alpha) • ORC support merged into Apache Spark in 1.4
• Joint blog with Databricks @ hortonworks.com • Changes between ORC 1.3.1 and Spark 1.4.1
• ORC now uses standard API to read/write.
orc.apache.org
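A minimal sketch of reading and writing ORC through the Spark 1.4 DataFrame API, assuming an existing SparkContext sc and illustrative HDFS paths:

import org.apache.spark.sql.hive.HiveContext

// ORC support ships with Spark's Hive module, so a HiveContext is needed.
val hiveContext = new HiveContext(sc)

// Read an ORC file into a DataFrame, then write it back out as ORC.
// Both paths are hypothetical.
val people = hiveContext.read.format("orc").load("/data/people.orc")
people.write.format("orc").save("/data/people_out.orc")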
Page 8
Introducing Apache Zeppelin…
Page 9
Apache Zeppelin
Features
• A web-based notebook for interactive analytics
• Ad-hoc experimentation with Spark, Hive, Shell, Flink, Tajo, Ignite, Lens, etc
• Deeply integrated with Spark and Hadoop
• Can be managed via Ambari Stacks
• Supports multiple language backends
• Pluggable "Interpreters"
• Incubating at Apache
• 100% open source and open community
Use Cases
• Data exploration & discovery
• Visualization - tables, graphs, charts
• Interactive snippet-at-a-time experience
• Collaboration and publishing
• “Modern Data Science Studio”
Page 10
Where can I find more?
• Arun Murthy’s Keynote at Hadoop Summit & SparkSummit – Hadoop Summit (http://bit.ly/1IC1BEG) – Spark Summit (http://bit.ly/1M7qw47)
• Data Science with Spark & Zeppelin Session at Hadoop Summit – http://bit.ly/1DdKeTs
• Data Science with Spark + Zeppelin Blog – http://bit.ly/1HFd545
• ORC Support in Spark Blog – http://bit.ly/1OkA1uU
Page 11
YARN: The Data Operating System
2015
Page 12
YARN Introduction
The Architectural Center
• YARN moved Hadoop "beyond batch": run batch, interactive, and real-time applications simultaneously on shared hardware.
• Intelligently places workloads on cluster members based on resource requirements, labels, and data locality.
• Runs user code in containers, providing isolation and lifecycle management.
Hortonworks Data Platform 2.2
YARN: Data Operating System (Cluster Resource Management)
[Diagram: the HDP 2.2 stack. YARN sits over HDFS (Hadoop Distributed File System). Batch, interactive & real-time data access: Apache Pig, Apache Hive, Cascading, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, Apache Sqoop, Apache Flume, Apache Kafka. Governance: Apache Falcon. Security: Apache Ranger, Apache Knox, Apache Falcon. Operations: Apache Ambari, Apache Zookeeper, Apache Oozie.]
Page 13
YARN Architecture - Overview
Resource Manager
• Global resource scheduler
Node Manager
• Per-machine agent
• Manages the life-cycle of containers & resource monitoring
Container
• Basic unit of allocation
• Fine-grained resource allocation across multiple resource types (memory, cpu; future: disk, network, gpu, etc.)
Application Master
• Per-application master that manages application scheduling and task execution
• E.g. the MapReduce Application Master
Page 14
YARN Concepts
• Application – a job or a long-running service submitted to YARN. Examples:
  – Job: MapReduce job
  – Service: HBase cluster
• Container – basic unit of allocation
  – e.g. a MapReduce map or reduce task, an HBase HMaster or RegionServer
  – Fine-grained resource allocations: container_0 = 2GB, 1 CPU; container_1 = 1GB, 6 CPU
  – Replaces the fixed map/reduce slots from Hadoop 1
Page 15
YARN Resource Request
Resource Model
• Ask for a specific amount of resources (memory, CPU, etc.) on a specific machine or rack.
• Capabilities define how much memory and CPU are requested.
• Set relaxLocality = false to force containers onto subsets of machines, aka YARN node labels.
A ResourceRequest carries: priority, resourceName, capability, numContainers, relaxLocality.
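A minimal sketch of building such a request with the Hadoop 2.x client API (all values are illustrative):

import org.apache.hadoop.yarn.api.records.{Priority, Resource, ResourceRequest}

// Ask for 2 containers of 2048 MB / 1 vcore anywhere on the cluster.
// ResourceRequest.ANY ("*") means any host or rack; relaxLocality = true
// lets the scheduler fall back to other nodes if the preferred one is busy.
val capability = Resource.newInstance(2048, 1)
val request = ResourceRequest.newInstance(
  Priority.newInstance(1),  // priority
  ResourceRequest.ANY,      // resourceName
  capability,               // capability
  2,                        // numContainers
  true)                     // relaxLocality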
Page 16
YARN Capacity Scheduler
Function: Capacity Sharing
• Elasticity
• Queues to subdivide resources
• Job submission Access Control Lists
Function: Capacity Enforcement
• Max capacity per queue
• User limits within queue
• Preemption
Function: Administration
• Ambari Capacity Scheduler View
Page 17
Hierarchical Queues
[Diagram: an example hierarchy of parent and leaf queues]
root
  Adhoc 10%
    Dev 10%
    Reserved 20%
    Prod 70%
  DW 70%
    Prod 80%
    Dev 20%
      P0 70%
      P1 30%
  Mrktng 20%
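Spark jobs target one of these leaf queues at submission time. A minimal sketch with an illustrative queue name (the CLI equivalent is spark-submit --queue prod):

import org.apache.spark.{SparkConf, SparkContext}

// spark.yarn.queue selects the Capacity Scheduler leaf queue that the
// application's containers are drawn from; "prod" is illustrative.
val conf = new SparkConf()
  .setAppName("queue-demo") // illustrative app name
  .set("spark.yarn.queue", "prod")
val sc = new SparkContext(conf)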
Page 18
YARN capacity scheduler helps manage resources across the cluster
Page 19
YARN Application Submission - Walkthrough
[Diagram: a client submits an application to the ResourceManager; the Scheduler allocates a container for the Application Master (AM 1, AM 2) on a NodeManager; each AM then requests further containers (1.1-1.3, 2.1-2.4), which run on NodeManagers across the cluster.]
Page 20
Spark: Processing Internals Review
2015
Page 21
First, a bit of review - What is Spark?
• Distributed runtime engine for fast large scale data processing.
• Designed for iterative computations and interactive data mining.
• Provides an API framework to support in-memory cluster computing.
• Multi-language support – Scala, Java, Python
Page 22
So what makes Spark fast? Data access methods are not equal!
Page 23
MapReduce vs Spark
• MapReduce – On disk
• Spark – In memory
Page 24
RDD – The main programming abstraction
Resilient Distributed Datasets
• Collections of objects spread across a cluster, cached or stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
• Immutable; each transformation creates a new RDD
Operations
• Lazy transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Page 25
RDD In Action
[Diagram: transformations chain RDD → RDD → RDD → RDD; an action then produces a value on the driver.]

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()   # 74
linesWithSpark.first()   # Apache Spark
Page 26
RDD Graph
textFile → flatMap → map → reduceByKey → collect

sc.textFile("input.txt")              // RDD[String]
  .flatMap(line => line.split(" "))   // RDD[String] – one element per word
  .map(word => (word, 1))             // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)              // RDD[(String, Int)] – 3 partitions
  .collect()                          // Array[(String, Int)] on the driver
Page 27
DAG Scheduler
[Diagram: the lineage textFile → map → map → reduceByKey → collect is cut at the shuffle boundary, producing Stage 1 and Stage 2.]
Goals
• Split the graph into stages based on the types of transformations
• Pipeline narrow transformations (transformations without data movement) into a single stage
Page 28
DAG Scheduler - Double Click
[Diagram: Stage 1 covers textFile and both maps; Stage 2 covers reduceByKey and collect.]
Stage 1: 1. Read HDFS split 2. Apply both maps 3. Write shuffle data
Stage 2: 1. Read shuffle data 2. Final reduce 3. Send result to driver
Page 29
Tasks – How work gets done
The fundamental unit of work in Spark:
1. Fetch input based on the InputFormat or a shuffle.
2. Execute the task.
3. Materialize task output via shuffle, write, or a result to the driver.
Page 30
Input Formats control task input
• Hadoop InputFormats control how data on HDFS is read into each task.
  – Controls splits – how data is split up; each task (by default) gets one split, which is typically a single HDFS block.
  – Controls the concept of a record – is a record a whole line? A single word? An XML element?
• Spark can use both the old and new API InputFormats for creating RDDs.
  – newAPIHadoopRDD and hadoopRDD
  – Save time: use Hadoop InputFormats rather than writing a custom RDD (see the sketch below).
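A minimal sketch of creating an RDD through the new-API TextInputFormat, assuming an existing SparkContext sc and a hypothetical path:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Each record is (byte offset, line); we keep only the line text.
val lines = sc
  .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/events.log")
  .map { case (_, text) => text.toString }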
Page 31
Executor – The Spark Worker
Isolation for tasks
1. Each application gets its own executors.
2. Executors run tasks in threads and cache data.
3. Executors run in separate processes for isolation.
4. An executor lives for the duration of the application.
Page 32
Executor – The Spark Worker
[Diagram: one executor process with three cores; each core runs a task that fetches input, executes, and writes output, with further tasks queued behind it.]
Page 33
The gang's all here
[Diagram: the Application Master and Spark Driver coordinate executors running on worker nodes; each executor holds a cache and runs tasks, one task per RDD partition.]
Page 34
Spark: on YARN
2015
Page 35
Spark on YARN
Modus Operandi
• 1 executor = 1 YARN container
• 2 modes: yarn-client or yarn-cluster
• yarn-client = driver on the client side – good for the REPL
• yarn-cluster = driver inside the YARN Application Master – good for batch and automated jobs
[Diagram: the YARN ResourceManager launches the App Master; a monitoring UI observes the application.]
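A minimal sketch of selecting yarn-client mode programmatically (yarn-cluster mode is normally chosen via spark-submit --master yarn-cluster instead, since the driver must then be launched inside the Application Master):

import org.apache.spark.{SparkConf, SparkContext}

// yarn-client keeps the driver in this JVM; executors run in YARN containers.
val conf = new SparkConf()
  .setAppName("on-yarn-demo") // illustrative app name
  .setMaster("yarn-client")
val sc = new SparkContext(conf)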
Page 36
Why Spark on YARN
Core Features • Run other workloads along with Spark • Leverage Spark Dynamic Resource Allocation • Currently the only way to run in a kerberized environment • Ability to provide capacity guarantees via Capacity Scheduler
[Diagram: the HDP 2.2 stack again, as on Page 12 – YARN over HDFS with the batch, interactive & real-time engines, governance, security, and operations components.]
Page 37
Executor Allocations on YARN
Static Allocation
• Static number of executors started on the cluster.
• Executors live for the duration of the application, even when idle.
Dynamic Allocation
• Minimal number of executors started initially.
• Executors added exponentially based on pending tasks.
• After an idle period, executors are stopped and resources are returned to the resource pool.
Page 38
Static Allocation Details
Static Allocation
• Traditional means of starting executors on nodes.

spark-shell --master yarn-client \
  --driver-memory 3686m \
  --executor-memory 17g \
  --executor-cores 7 \
  --num-executors 7

• Static number of executors specified by the submitter.
• Size and count of executors is key for good performance.
Page 39
Dynamic Allocation Details
Dynamic Allocation
• Scale executor count based on pending tasks.

spark-shell --master yarn-client \
  --driver-memory 3686m \
  --executor-memory 3686m \
  --executor-cores 1 \
  --conf "spark.dynamicAllocation.enabled=true" \
  --conf "spark.dynamicAllocation.minExecutors=1" \
  --conf "spark.dynamicAllocation.maxExecutors=100" \
  --conf "spark.shuffle.service.enabled=true"

• Minimum and maximum number of executors specified.
• Exclusive to running Spark on YARN.
Page 40
Enabling Dynamic Allocation
spark_shuffle YARN aux service – dynamic allocation is not enabled OOTB.

--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.shuffle.service.enabled=true"

1. Copy the spark-shuffle jar onto the NodeManager classpath.
2. Configure the YARN aux service for spark_shuffle:
   – Add spark_shuffle to yarn.nodemanager.aux-services
   – Set yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService
3. Restart the NodeManagers to pick up the spark-shuffle jar.
4. Run the Spark job with the dynamic allocation configs.
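The same configs can also be set programmatically; a minimal sketch (app name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Enables dynamic allocation with the external shuffle service, so executors
// can come and go without losing their shuffle output.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-demo") // illustrative app name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "100")
val sc = new SparkContext(conf)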
Page 41
Dynamic Allocation Configuration Options
spark.dynamicAllocation.minExecutors
Minimum number of executors, and also the initial number spawned at job submission (the initial count can be overridden with initialExecutors).
--conf "spark.dynamicAllocation.minExecutors=1"

spark.dynamicAllocation.maxExecutors
Maximum number of executors; executors will be added based on pending tasks up to this maximum.
--conf "spark.dynamicAllocation.maxExecutors=100"
Page 42
Dynamic Allocation Configuration Options
spark.dynamicAllocation.schedulerBacklogTimeout
Initial delay to wait before allocating additional executors. Default: 5 seconds.
--conf "spark.dynamicAllocation.schedulerBacklogTimeout=10"

spark.dynamicAllocation.sustainedSchedulerBacklogTimeout
After the initial round of executors is scheduled, how long until the next round of scheduling? Default: 5 seconds.
--conf "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=10"
[Chart: executors started over time – rounds of executor launches grow exponentially while tasks remain pending.]
Page 43
Dynamic Allocation – Good citizenship in a shared environment
spark.dynamicAllocation.executorIdleTimeout
Amount of idle time in seconds before an executor container is killed and its resources returned to YARN. Default: 10 minutes.
--conf "spark.dynamicAllocation.executorIdleTimeout=60"

spark.dynamicAllocation.cachedExecutorIdleTimeout
Because caching RDDs is key to performance, this setting was introduced to keep executors holding cached data around longer.
--conf "spark.dynamicAllocation.cachedExecutorIdleTimeout=1800"
Page 44
Sizing your Spark job
Difficult Landscape
• Conflicting recommendations often found online.
• Requires knowledge of the data set, task distribution, cluster topology, RDD cache churn, hardware profile…
Common questions – 1 executor per core? 1 executor per node? 3-5 executors if I/O bound? yarn.nodemanager.resource.memory-mb? 18 GB max heap? It depends.
Page 45
Common Suggestions to improve performance
Do these things
1. Cache RDDs in memory*
2. Don't spill to disk if possible
3. Use a better serializer
4. Consider compression
5. Limit GC activity
6. Get parallelism right* … or scale elastically
* New considerations with Spark on YARN
Page 46
Sizing Spark Executors on YARN
Relationship
1. Setting the executor memory size sets the JVM heap, NOT the container size.
2. Executor memory + the greater of (10% or 384 MB) = container size.
3. YARN rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb, so to avoid wasted resources, size executors so that executor memory + memoryOverhead lands on (or just under) one of those increments.
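A worked example with illustrative numbers: --executor-memory 17g gives an overhead of max(0.10 × 17408 MB, 384 MB) ≈ 1741 MB, so the container request is ≈ 19149 MB. With yarn.scheduler.minimum-allocation-mb = 2048, YARN rounds that up to 20480 MB, wasting ≈ 1.3 GB per executor unless the heap is resized to align.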
Page 47
Sizing Spark Executors on YARN
Relevant YARN Container Settings
• yarn.nodemanager.resource.cpu-vcores
  – Number of vcores available for YARN containers per NodeManager
• yarn.nodemanager.resource.memory-mb
  – Total memory available for YARN containers per NodeManager
• yarn.scheduler.minimum-allocation-mb
  – Minimum resource request allowed per allocation, in megabytes
  – Smallest container available for an executor
• yarn.scheduler.maximum-allocation-mb
  – Maximum resource request allowed per allocation, in megabytes
  – Largest container available for an executor
  – Typically equal to yarn.nodemanager.resource.memory-mb
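Continuing the illustrative numbers above: a node with yarn.nodemanager.resource.memory-mb = 65536 and yarn.nodemanager.resource.cpu-vcores = 16 has memory for three 20480 MB containers (61440 MB ≤ 65536 MB), but at --executor-cores 7 three executors would need 21 vcores, so the vcore limit caps the node at two executors. Whichever of memory or vcores runs out first sets executors per node.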
Page 48
Tuning Advice
How do we get it right?
• Test, gather, and test some more
• Define an SLA
• Tune the job, not the cluster
• Tune the job to meet the SLA
• Don't tune prematurely; it's the root of all evil

Starting Points
• Keep your heap reasonable, but large enough to handle your dataset.
  – Recall that we only get about 60% of the heap for RDD caching.
  – Measure GC and ensure the percentage of time spent in GC is low.
• For jobs that depend heavily on cached RDDs, limit executors per machine to one where possible.
  – Per the first point: if RDD cache churn or GC are a problem, make smaller executors and run multiple per machine.
• On high-memory hardware, run multiple executors per machine.
  – Keep the heap reasonable.
• For CPU-bound tasks with limited data needs, more executors can be better.
  – Run with 2-4 GB executors with a single vcore and measure performance.
• Tune task parallelism.
  – As a rule of thumb, increase the task count by 1.5x each round of testing and measure the results.
Page 49
Avoid spilling or caching to disk
Caching strategies
• Use the default .cache() or .persist(), which stores data as deserialized Java objects (MEMORY_ONLY).
  – Trade-off: lower CPU usage versus size of data in memory.
• Don't use disk persistence.
  – It's typically faster to recompute the partition, and there is a good chance many of the blocks are still in the operating system page cache.
• If the default strategy results in the data not fitting in memory, use MEMORY_ONLY_SER, which stores the data as serialized objects.
  – Trade-off: higher CPU usage, but the data set is typically around 50% smaller in memory.
  – Can significantly affect job run time for larger data sets; use with caution.

import org.apache.spark.storage.StorageLevel._
theRdd.persist(MEMORY_ONLY_SER)
Page 50
Data Access with Spark on YARN
Gotchas
• Don't cache base RDDs – poor distribution.
  – Do cache intermediate data sets – good distribution across dynamically allocated executors.
• Ensure executors remain running until you are done with the cached data.
  – Cached data goes away when the executors do and is costly to recompute.
• Data locality is getting better, but isn't great.
  – SPARK-1767 introduced locality waits for cached data.
• computePreferredLocations is pretty broken.
  – Only use it if necessary; it gets overwritten in some scenarios, and better approaches are in the works.

val locData = InputFormatInfo.computePreferredLocations(Seq(
  new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
val sc = new SparkContext(conf, locData)
Page 51
Future Improvements for Spark on YARN
RDD Sharing
– Short term: keep executors with RDD cache around longer
– HDFS memory tier for RDD caching
– Experimental off-heap caching in Tachyon (lower overhead than persist())
– Cache rebalancing
Data Locality for Dynamic Allocation
– No more preferredLocations; discover locality from the RDD lineage.
Container/Executor Sizing
– Make it easier… automatically determine the appropriate size.
– Long term: specify task size only; memory, cores, and overhead are determined automatically.
Secure All The Things!
– SASL for shuffle data
– SSL for the HTTP endpoints
– Encrypted shuffle – SPARK-5682
Page 52
DEMO: Scaling Spark workloads on YARN
2015
Page 53
Scaling compute independent of storage
HDP 2.3 Hadoop Cluster
[Diagram: storage nodes each run a NodeManager plus HDFS; compute nodes run a NodeManager only; an edge node hosts the clients; management and master nodes host Ambari and the master services.]
Overview
1. Pattern that is gaining popularity in the cloud.
2. Save costs and leverage the elasticity of the cloud.
3. Scale NodeManagers (compute only) independent of traditional NodeManager/DataNode (compute + storage) workers.
Page 54
How it works
Overview
1. Leverage Spark dynamic allocation on YARN to scale the number of executors based on pending work.
2. If additional capacity is still needed, provision additional compute nodes, add them to the cluster, and continue to scale executors onto the new nodes.
HDP 2.3 Hadoop Cluster
[Diagram: the same cluster as before, now with eight compute nodes alongside the three storage nodes, edge node, and management/master nodes.]
Page 55
Process Overview
[Diagram: Cloudbreak orchestrates the HDP/Spark cluster over a REST API, with Ambari managing the nodes; the Spark client submits jobs, executors run in containers on the compute nodes, and metrics drive scaling, which adds more compute nodes and executors.]
1. Deploy cluster
2. Set alerts
3. Submit job
4. Executors increase
5. Capacity reached; alerts trigger
6. Scaling policy adds compute nodes
Page 56
DEMO – Leveraging Dynamic Allocation
Page 57
Scenarios
Promising Use Cases
1. CPU-bound workloads
2. Bursty usage
3. Zeppelin/ad-hoc data exploration
4. Multi-tenant, multi-use, centralized clusters
5. Dev/QA clusters
Page 58
Cloudbreak
• Developed by SequenceIQ
• Open source, with options to extend with a custom UI
• Launches Ambari and deploys the selected distribution via Blueprints in Docker containers
• Customer registers, delegates access to cloud credentials, and runs Hadoop on their own cloud account (Azure, AWS, etc.)
• Elastic – spin up any number of nodes, scale up/down on the fly
"Cloud-agnostic Hadoop-as-a-Service API"
Page 59
Launch HDP on Any Cloud for Any Application
[Diagram: example workloads around Cloudbreak – BI / Analytics (Hive), IoT Apps (Storm, HBase, Hive), Dev / Test (all HDP services), Data Science (Spark).]
Cloudbreak: 1. Pick a Blueprint 2. Choose a Cloud 3. Launch HDP!
Example Ambari Blueprints: IoT Apps, BI / Analytics, Data Science, Dev / Test
Page 60
Step 1: Sign up for a free Cloudbreak account
URL to sign up for a free account: https://accounts.sequenceiq.com/
General Cloudbreak documentation: http://sequenceiq.com/cloudbreak/#cloudbreak
Page 61
Step 2: Create or add credentials
• Varies by cloud, but typically only a couple of steps.
Page 62
Step 3: Note the blueprint for your use case
• An Ambari blueprint describes components of the HDP stack to include in the cloud deployment
• Cloudbreak comes with some default blueprints, such as a Spark cluster or a streaming architecture
• Pick the appropriate blueprint, or create your own!
Page 63
Step 4: Create Cluster
• Ensure your credential is selected by clicking on “select a credential”
• Click Create cluster, give it a name, choose a region, choose a network
• Choose desired blueprint
• Set the instance type and number of nodes.
• Click create and start cluster
Page 64
Step 5: Wait for cluster install to complete
• Depending on instance types and blueprint chosen, cluster install should complete in 10-35 mins
• Once cluster install is complete, click on the Ambari server address link (highlighted on screenshot) and login to Ambari with admin/admin
• Your HDP cluster is ready to use
Page 65
Periscope: Auto up and down scaling
• Define alerts for the number of pending YARN containers.
Page 66
Periscope: Auto up and down scaling
• Define scaling policies for how Periscope should react to the defined alerts.
Page 67
Periscope: Auto up and down scaling
• Define the min/max cluster size and “cooldown” period (how long to wait between scaling events).
• The number of compute nodes will automatically scale when out of capacity for containers.
Page 68
Benefits
Why do I care?
• Less contention between jobs
  – Less waiting for your neighbor's job to finish; elastic scale gives us all compute time.
• Improved job run times
  – Testing has shown a 30%+ decrease in job run times for moderate-duration CPU-bound jobs.
• Decreased costs over persistent IaaS clusters
  – Spin down resources not in use.
  – If time = money, improved job run times will decrease costs.
• Capacity planning hack!
  – Scaling up a lot? You should probably add more capacity…
  – Never scaling up? You probably overbuilt…
Page 69
DEMO – Auto Scaling IaaS
Page 70
Q & A