1 © Cloudera, Inc. All rights reserved. Spark Operations Kostas Sakellis

Apache Spark Operations

Jan 17, 2017

Cloudera, Inc.
Transcript
Page 1: Apache Spark Operations

Spark Operations

Kostas Sakellis

Page 2: Apache Spark Operations

Me

• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager

Page 3: Apache Spark Operations

Building a proof of concept!

Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg

Page 4: Apache Spark Operations

Example

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
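
The slides never show the Movie class, so the following is only a sketch of what it might look like. It assumes each line of u.item is pipe-delimited with the release date in the third field (for example "01-Jan-1995"); the field layout, the Movie fields, and the MovieExample helper are all assumptions made for illustration. It runs the same map and filter on a local collection rather than an RDD.

```scala
// Hypothetical record type: the slides only reveal that Movie has a `month`.
case class Movie(id: Int, title: String, month: String)

object Movie {
  // Assumed line shape: "id|title|dd-Mon-yyyy|...".
  def apply(line: String): Movie = {
    val fields = line.split('|')
    val month = fields(2).split('-')(1) // "01-Jan-1995" -> "Jan"
    Movie(fields(0).toInt, fields(1), month)
  }
}

object MovieExample {
  // Same shape as the slide's pipeline, applied to a local Seq instead of an RDD.
  def novemberMovies(lines: Seq[String]): Seq[Movie] =
    lines.map(Movie(_)).filter(_.month.equals("Nov"))
}
```

With a SparkContext in scope, the companion `apply` is what would let `.map(Movie(_))` on the slide compile.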

Page 7: Apache Spark Operations

Partitions

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS source read as Partitions 1 to 4]

Page 8: Apache Spark Operations

RDDs

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS source and one RDD with Partitions 1 to 4]

Page 9: Apache Spark Operations

RDDs

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS source and two RDDs, each with Partitions 1 to 4]

Page 10: Apache Spark Operations

RDDs

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS source and three RDDs, each with Partitions 1 to 4]

Page 11: Apache Spark Operations

RDDs

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS source and three RDDs, each with Partitions 1 to 4, followed by Collect]

Page 12: Apache Spark Operations

RDD Lineage

sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()

[Diagram: HDFS source and three RDDs, each with Partitions 1 to 4, followed by Collect; the chain of RDDs is labeled Lineage]
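
To make the lineage idea concrete, here is a toy sketch in plain Scala (no Spark dependency). LazyData and fromSeq are hypothetical names, not Spark APIs. Each transformation only records a step and a recipe; nothing is computed until collect() is called, which is roughly how the chain of RDDs on the slide behaves.

```scala
object LineageSketch {
  // Toy stand-in for an RDD: it stores how to compute its data, not the data itself.
  final class LazyData[A](val lineage: List[String], compute: () => Iterator[A]) {
    // Transformations extend the lineage and wrap the recipe; no work happens here.
    def map[B](f: A => B): LazyData[B] =
      new LazyData(lineage :+ "map", () => compute().map(f))

    def filter(p: A => Boolean): LazyData[A] =
      new LazyData(lineage :+ "filter", () => compute().filter(p))

    // The action: only now is the whole chain evaluated, starting from the source.
    def collect(): List[A] = compute().toList
  }

  def fromSeq[A](source: String, data: Seq[A]): LazyData[A] =
    new LazyData(List(source), () => data.iterator)
}
```

Because the recipe is kept rather than the result, calling collect() a second time recomputes everything from the source, which is the property that lets lost partitions be rebuilt from their lineage.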

Page 13: Apache Spark Operations

Task

[Diagram: HDFS source and three RDDs, each with Partitions 1 to 4, followed by Collect]

• A pipelined set of transformations on a single thread
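
As a rough illustration of "pipelined", the sketch below (plain Scala, hypothetical names, no Spark dependency) splits a collection into partitions and runs one task per partition. Within a task, map and filter are fused through an iterator, so each element passes through both steps before the next element is read and no intermediate collection is built.

```scala
object TaskSketch {
  // Split the input into up to numPartitions slices of roughly equal size.
  def partition[A](data: Seq[A], numPartitions: Int): Vector[Vector[A]] = {
    val size = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    data.grouped(size).map(_.toVector).toVector
  }

  // One task: a single pass over one partition with map and filter pipelined.
  def runTask[A, B](part: Seq[A], f: A => B, p: B => Boolean): Vector[B] =
    part.iterator.map(f).filter(p).toVector

  // "collect": run one task per partition and concatenate the results in order.
  def collect[A, B](parts: Seq[Seq[A]], f: A => B, p: B => Boolean): Vector[B] =
    parts.toVector.flatMap(part => runTask(part, f, p))
}
```

Here the tasks run one after another on the calling thread; in Spark each task would be scheduled onto an executor thread, which this sketch does not model.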

Page 14: Apache Spark Operations

Spark Architecture

Page 15: Apache Spark Operations

Spark System Architecture

Page 16: Apache Spark Operations

Deployments

• Spark supports pluggable Cluster Managers
  • local, Standalone, YARN and Mesos
• In early 2014, CDH 4.x with Spark 0.9 only supported Standalone
• CDH 5.x includes Spark on YARN support
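
From the application's point of view, the choice of cluster manager mostly shows up as the value passed to --master. The sketch below maps each option to the master URL form used in Spark 1.x, the era of this talk. The ClusterManager type and the host and port values in the usage are illustrative assumptions; only the URL shapes come from Spark.

```scala
object MasterUrls {
  // Hypothetical model of the cluster managers listed on the slide.
  sealed trait ClusterManager
  case class Local(threads: Int) extends ClusterManager
  case class Standalone(host: String, port: Int) extends ClusterManager
  case class Mesos(host: String, port: Int) extends ClusterManager
  case object YarnClient extends ClusterManager
  case object YarnCluster extends ClusterManager

  // Spark 1.x style values for spark-submit --master.
  def masterUrl(cm: ClusterManager): String = cm match {
    case Local(n)         => s"local[$n]"
    case Standalone(h, p) => s"spark://$h:$p"
    case Mesos(h, p)      => s"mesos://$h:$p"
    case YarnClient       => "yarn-client"
    case YarnCluster      => "yarn-cluster"
  }
}
```

For YARN the master value carries no host, because the Resource Manager address is read from the Hadoop configuration on the client.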

Page 17: Apache Spark Operations

Standalone

[Diagram: Client, Master, two Workers, two Process boxes, AppMaster]

Page 18: Apache Spark Operations

Standalone

• On cluster
  ./sbin/start-master.sh
  ./sbin/start-slave.sh <master-spark-URL>
• Submit job
  spark-submit --master <master-spark-URL>

Page 19: Apache Spark Operations

YARN Architecture

[Diagram: Client, Resource Manager, two Node Managers, three Containers, two Process boxes, AppMaster]

Page 20: Apache Spark Operations

Spark on YARN Architecture

[Diagram: Client, Resource Manager, two Node Managers, three Containers, two Process boxes, AppMaster]

Page 22: Apache Spark Operations

Spark on YARN

• Submit job
  spark-submit --master yarn-client …
• Cluster mode
  spark-submit --master yarn-cluster …
• Spark shell only works in client mode!

Page 23: Apache Spark Operations

Customers often have shared infrastructure

Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg

Page 24: Apache Spark Operations

Multi-tenancy

• Cluster utilization is the top metric
  • Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
  • Built-in resource manager

Page 25: Apache Spark Operations

Underutilized Clusters

Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG

Page 26: Apache Spark Operations

Dynamic Allocation

• Spark applications scale the number of executors based on load
  • Removes need for: --num-executors
  • Idle executors get killed
• First supported in CDH 5.4
• Ideal for:
  • Long ETL jobs with large shuffles
  • Shell applications: Hive and Spark shell
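
A minimal sketch of how dynamic allocation is typically switched on, written as a small Scala helper that renders spark-submit --conf arguments. The property names are standard Spark configuration keys; the values (executor bounds and idle timeout) and the DynamicAllocationConf helper itself are illustrative assumptions, not recommendations from the talk.

```scala
object DynamicAllocationConf {
  // Property names are Spark configuration keys; the values are example choices.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // External shuffle service, so shuffle files outlive executors that are removed.
    "spark.shuffle.service.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> "1",
    "spark.dynamicAllocation.maxExecutors" -> "20",
    // Executors idle for longer than this become candidates for removal.
    "spark.dynamicAllocation.executorIdleTimeout" -> "60s"
  )

  // Render the settings as arguments for spark-submit.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

The rendered arguments replace an explicit --num-executors on the command line.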

Page 27: Apache Spark Operations

Dynamic Allocation Limitations

• Still required to specify cores
  • --executor-cores
• Memory
  • --executor-memory
  • Includes JVM overhead
  • Need to do the math yourself
• Our customers still get it wrong!
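
The slide does not spell out the arithmetic, so the following is one reading of "do the math yourself" for Spark on YARN. It assumes the container request is the executor memory plus an overhead of max(384 MiB, 10% of executor memory), the default documented for spark.yarn.executor.memoryOverhead in later Spark 1.x releases. The function names and the node size in the usage note are hypothetical, and YARN's rounding of requests to its allocation increment is not modelled.

```scala
object ExecutorMemory {
  // Assumed defaults; check the documentation for the Spark version in use.
  val MinOverheadMiB = 384L
  val OverheadFactor = 0.10

  // Off-heap overhead requested on top of --executor-memory.
  def overheadMiB(executorMemoryMiB: Long): Long =
    math.max(MinOverheadMiB, (executorMemoryMiB * OverheadFactor).toLong)

  // Approximate size of the YARN container for one executor.
  def containerMiB(executorMemoryMiB: Long): Long =
    executorMemoryMiB + overheadMiB(executorMemoryMiB)

  // How many such executors fit in the memory a node offers to YARN.
  def executorsPerNode(nodeMemoryMiB: Long, executorMemoryMiB: Long): Long =
    nodeMemoryMiB / containerMiB(executorMemoryMiB)
}
```

Under these assumptions, an 8192 MiB executor needs a 9011 MiB container, so a node offering 65536 MiB to YARN fits seven executors rather than the eight that a naive division would suggest.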

Page 28: Apache Spark Operations

The Future of Dynamic Allocation

• Only “task size” needed: --task-size
• Eliminates
  • --executor-cores
  • --num-executors
  • --executor-memory
• Leads to better cluster utilization

Page 29: Apache Spark Operations

Security, now it’s getting serious.

Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg

Page 30: Apache Spark Operations

Authentication

• Kerberos – the necessary evil
• Ubiquitous amongst other services
  • YARN, HDFS, Hive, HBase, etc.
• Spark utilizes delegation tokens

Page 31: Apache Spark Operations

Encryption

• Control plane: SPARK-6028 (Replace with netty)
• File distribution: Replace with netty
• Block Manager: Spark 1.4
• User UI / REST API: SPARK-2750 (SSL)
• Data-at-rest (shuffle files): SPARK-5682

Page 32: Apache Spark Operations

Authorization

• Enterprises have sensitive data
• Beyond HDFS file permissions
  • Partial access to data
  • Column level granularity
• Apache Sentry
  • HDFS-Sentry synchronization plugin
• Record Service
  • Column level security for Spark!

Page 33: Apache Spark Operations

Thank you

We’re Hiring!