Ghislain Fourny Big Data for Engineers Spring 2020 10. Distributed Computations II: Spark 1

May 27, 2020

Page 1: Big Data for Engineers, Spring 2020. 10. Distributed Computations II: Spark

Ghislain Fourny

Page 2: YARN

Photo credit: kirtchanut / 123RF Stock Photo

Page 3: YARN

Scheduling, application management, monitoring

ResourceManager and five ApplicationMasters

Page 4: YARN

ResourceManager

NodeManager (x5)

Container (x3)

Page 5: Forward compatibility with DAGs of transformations

Page 6: Introduction to Spark

Photo credit: Nikki Zalewski / 123RF Stock Photo

Page 7: MapReduce high-level data

Map -> Shuffle -> Reduce

Page 8: MapReduce high-level data

This is a very specific topology!

Page 9: ... and it's under-using YARN

Page 10: ... because YARN supports any DAG

Page 11: YARN

We can build something more general

Page 12: MapReduce (even more) high-level: two-step...

Map -> Reduce

Page 13: ... to any DAGs

Page 14: Full-DAG query processing

Enters Spark

Page 15: Spark's first-class citizen: RDD

Resilient Distributed Dataset

Page 16: Spark's first-class citizen: RDD

It's just a big collection

Page 17: Spark's first-class citizen: RDD

... and it is partitioned

Page 18: RDD lifecycle: Creation

Sources: local filesystem, HDFS, S3, on the fly

Page 19: RDD lifecycle: Transformation

RDD -> RDD

Page 20: RDD lifecycle: Action

Destinations: local filesystem, HDFS, S3, on screen

Page 21: Lineage graph

(Figure: a graph of RDDs)

Page 22: Lazy evaluation

(Figure: a graph of RDDs)

Each action triggers an evaluation

Page 23: Lazy evaluation

(Figure: a graph of RDDs)
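The idea that transformations only extend the lineage, and that each action triggers an evaluation, can be sketched in plain Python. This is a toy model and not Spark: the class name LazyCollection, its methods, and the read_source helper are hypothetical names chosen for illustration.

```python
class LazyCollection:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source      # callable that produces the input items
        self._lineage = lineage    # tuple of (kind, function) steps

    def map(self, f):
        # Transformation: returns a new object with a longer lineage, computes nothing
        return LazyCollection(self._source, self._lineage + (("map", f),))

    def filter(self, p):
        # Transformation: same as above, nothing is evaluated yet
        return LazyCollection(self._source, self._lineage + (("filter", p),))

    def collect(self):
        # Action: replays the whole lineage from the source
        items = list(self._source())
        for kind, f in self._lineage:
            if kind == "map":
                items = [f(x) for x in items]
            else:
                items = [x for x in items if f(x)]
        return items


evaluations = []

def read_source():
    evaluations.append("read")   # record every time the source is actually read
    return [1, 2, 3, 4]

squares_of_evens = (LazyCollection(read_source)
                    .filter(lambda x: x % 2 == 0)
                    .map(lambda x: x * x))
calls_before_action = len(evaluations)   # nothing has been read so far
first = squares_of_evens.collect()       # first action: one evaluation
second = squares_of_evens.collect()      # second action: the lineage is replayed
```

In this sketch the source is read once per action, which is also why repeating actions on the same lineage can be wasteful.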

Page 24: Spark: Execution

Application or shell

Page 25: Spark: Hello, World!

val rdd1 = sc.parallelize(List("Hello, World!", "Hello, there!"))

Page 26: Spark: Hello, World!

val rdd1 = sc.parallelize(List("Hello, World!", "Hello, there!"))

rdd1:
  Value
  Hello, World!
  Hello, there!

Page 27: Spark: Hello, World!

val rdd1 = sc.parallelize(List("Hello, World!", "Hello, there!"))
val rdd2 = rdd1.flatMap(value => value.split(" "))

rdd1:
  Value
  Hello, World!
  Hello, there!

Page 28: Spark: Hello, World!

val rdd1 = sc.parallelize(List("Hello, World!", "Hello, there!"))
val rdd2 = rdd1.flatMap(value => value.split(" "))

rdd2:
  Value
  Hello,
  World!
  Hello,
  there!

Page 29: Spark: Hello, World!

val rdd1 = sc.parallelize(List("Hello, World!", "Hello, there!"))
val rdd2 = rdd1.flatMap(value => value.split(" "))
rdd2.countByValue()

rdd2:
  Value
  Hello,
  World!
  Hello,
  there!

Page 30: Spark: Hello, World!

val rdd1 = sc.parallelize(List("Hello, World!", "Hello, there!"))
val rdd2 = rdd1.flatMap(value => value.split(" "))
rdd2.countByValue()

Result:
  Key     Value
  Hello,  2
  there!  1
  World!  1
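For readers without a Spark shell at hand, the same Hello, World! pipeline can be emulated in plain Python. This is only a sketch of the semantics (flatMap as "apply, then flatten", countByValue as "count occurrences of each distinct element"); it runs on ordinary lists and involves no Spark API.

```python
from collections import Counter
from itertools import chain

rdd1 = ["Hello, World!", "Hello, there!"]

# flatMap: apply the function to every element, then flatten the results
rdd2 = list(chain.from_iterable(value.split(" ") for value in rdd1))

# countByValue: number of occurrences of each distinct element
counts = dict(Counter(rdd2))
```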

Page 31: Overview of transformations

Page 32: Transformations

Page 33: Transformations: filter

Parameter: function

Page 34: Transformations: map

Parameter: function

Page 35: Transformations: flatMap

Parameter: function

Page 36: Transformations: distinct

Page 37: Transformations: sample

Parameters: fraction + seed
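The figures for these single-RDD transformations are not part of the transcript, so here is a plain-Python sketch of what each one does to a small collection. It emulates the semantics on lists and is not Spark code; the sample helper is a hypothetical illustration that keeps each element independently with probability fraction, using the seed for reproducibility.

```python
import random

words = ["spark", "yarn", "spark", "hdfs"]

# filter: keep the elements for which the predicate holds
filtered = [w for w in words if w != "yarn"]

# map: exactly one output element per input element
mapped = [len(w) for w in words]

# flatMap: zero or more output elements per input element (here: the first two letters)
flat_mapped = [c for w in words for c in w[:2]]

# distinct: duplicates removed (sorted here only to get a stable order for display)
distinct = sorted(set(words))

def sample(items, fraction, seed):
    """Keep each element independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]

sampled = sample(list(range(100)), 0.1, seed=42)
```

The same seed always yields the same sample, while the sample size only approximates fraction times the input size.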

Page 38: Transformations on two RDDs

Page 39: Transformations: union

Page 40: Transformations: intersection

Page 41: Transformations: subtract

Page 42: Transformations: cartesian product
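A plain-Python sketch of the four two-RDD transformations, run on lists rather than on Spark. It assumes the usual RDD semantics: union keeps duplicates, intersection returns each common element once, subtract keeps the elements of the first collection that are absent from the second, and the cartesian product pairs every element with every other.

```python
from itertools import product

a = [1, 2, 2, 3]
b = [2, 3, 4]

union = a + b                                   # duplicates are kept
intersection = sorted(set(a) & set(b))          # each common element once
b_values = set(b)
subtract = [x for x in a if x not in b_values]  # elements of a that do not occur in b
cartesian = list(product(a, b))                 # every (x, y) with x from a and y from b
```

The cartesian product has len(a) * len(b) elements, which is why it gets expensive quickly on large inputs.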

Page 43: Overview of actions

Page 44: Actions: collect

Page 45: Actions: count

22

Page 46: Actions: count by value

12, 4, 6

Page 47: Actions: take

4

Page 48: Actions: top

4

Page 49: Actions: takeSample

4

Page 50: Actions: reduce

+
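The actions listed above return ordinary values to the driver instead of new RDDs. The following plain-Python sketch emulates count, take, top, and reduce on a list; the data is made up for illustration and no Spark API is involved.

```python
from functools import reduce
import heapq

data = [5, 3, 8, 1, 9, 2]

count = len(data)                          # count: number of elements
first_four = data[:4]                      # take(4): the first four elements
top_four = heapq.nlargest(4, data)         # top(4): the four largest, in descending order
total = reduce(lambda x, y: x + y, data)   # reduce with +
```

In Spark the function given to reduce should be associative and commutative, because partial results from different partitions are combined in no guaranteed order.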

Page 51: Pair RDDs

Page 52: Transformations: keys

Page 53: Transformations: values

Page 54: Transformations: reduce by key

+

Page 55: Transformations: group by key

Page 56: Transformations: sort by key

<=

Page 57: Transformations: map values

Parameter: function

Page 58: Transformations: join

Page 59: Transformations: subtract by key

Page 60: Actions: count by key

6, 1, 4

Page 61: Actions: lookup
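Pair RDDs are collections of key-value pairs, and the operations above work per key. A plain-Python sketch of a few of them follows; the helper names reduce_by_key, group_by_key, and join are hypothetical and merely mirror the Spark operations they emulate on lists of tuples.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

def reduce_by_key(kvs, f):
    """Combine all values that share a key with the binary function f."""
    out = {}
    for k, v in kvs:
        out[k] = f(out[k], v) if k in out else v
    return out

def group_by_key(kvs):
    """Collect all values that share a key into a list."""
    out = defaultdict(list)
    for k, v in kvs:
        out[k].append(v)
    return dict(out)

def join(left, right):
    """Inner join: one output pair per matching (left value, right value)."""
    grouped = group_by_key(right)
    return [(k, (v, w)) for k, v in left for w in grouped.get(k, [])]

sums = reduce_by_key(pairs, lambda x, y: x + y)
groups = group_by_key(pairs)
joined = join([("a", "x"), ("d", "y")], pairs)
count_by_key = {k: len(vs) for k, vs in groups.items()}
lookup_a = groups.get("a", [])   # lookup: all values stored under key "a"
```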

Page 62: Physical layer

Page 63: Parallel execution

Page 64: Parallel execution

Task 1, Task 2, Task 3, Task 4

Page 65: Parallel execution

Task 1, Task 2, Task 3, Task 4

Default: one task per HDFS block

Page 66: Parallel execution

Logical layer: RDD 1 -> Transformation -> RDD 2

Physical layer: Task 1, Task 2, Task 3, Task 4

Page 67: Spreading tasks over executors

Executors: Executor 1 to Executor 4

Tasks: Task 1 to Task 11

Page 68: Spreading tasks over cores

Executors: Executor 1 and Executor 2, each with Core 1, Core 2, and Memory

Tasks: Task 1 to Task 11

Page 69: Sequence of (parallelizable) transformations

Transformation -> Transformation -> Transformation

Page 70: Physical layer (3 transformations)

Page 71: Optimization

Page 72: Optimization

Stage

Page 73: Sequence of (parallelizable) transformations

Logical layer: Transformation -> Transformation -> Transformation

Stage

Page 74: Narrow dependency

Page 75: Narrow dependency

Stays on the same machine. Same stage.

Page 76: Spreading a stage over cores

Executors: Executor 1 and Executor 2, each with Core 1, Core 2, and Memory

Tasks: Task 1 to Task 11
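To make the picture of 11 tasks spread over 2 executors with 2 cores each concrete, here is a simplified plain-Python sketch. It assumes a static round-robin assignment, which is only an illustration: a real scheduler hands out tasks dynamically as cores become free. The function name assign_round_robin is hypothetical.

```python
import math

def assign_round_robin(num_tasks, num_executors, cores_per_executor):
    """Assign task numbers to (executor, core) slots in round-robin order."""
    slots = [(e, c)
             for e in range(1, num_executors + 1)
             for c in range(1, cores_per_executor + 1)]
    plan = {slot: [] for slot in slots}
    for task in range(1, num_tasks + 1):
        plan[slots[(task - 1) % len(slots)]].append(task)
    return plan

plan = assign_round_robin(num_tasks=11, num_executors=2, cores_per_executor=2)

# With 4 cores in total, 11 equally long tasks need ceil(11 / 4) rounds
waves = math.ceil(11 / (2 * 2))
```

With these numbers the last round uses only three of the four cores, so one core is idle while the stage finishes.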

Page 77: Setting up executors

spark-submit --num-executors 42 my-application.jar

Page 78: Setting up cores

spark-submit --executor-cores 2 my-application.jar

Page 79: Setting up memory

spark-submit --executor-memory 5G my-application.jar

Page 80: Most important parameters

spark-submit --num-executors 42 --executor-cores 2 --executor-memory 3G my-application.jar
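A quick back-of-the-envelope calculation shows what the combined command asks the cluster for. The sketch below assumes one task per core at a time (Spark's default) and ignores the driver and any per-executor memory overhead that the resource manager may add.

```python
num_executors = 42
executor_cores = 2
executor_memory_gb = 3

# Tasks that can run at the same time: one per executor core
concurrent_tasks = num_executors * executor_cores

# Executor memory requested across the whole application, in GB
total_executor_memory_gb = num_executors * executor_memory_gb
```

So this configuration can run up to 84 tasks in parallel and requests 126 GB of executor memory in total.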

Page 81: Wide dependency

Page 82: Wide dependency

Needs to be sent over the network. New stage.

Page 83: Job as sequence of stages

Stage 1 -> shuffle (wait for completion) -> Stage 2 -> shuffle (wait for completion) -> Stage 3

Page 84: General DAG with stages

Stage 1, Stage 2, Stage 3, Stage 4

Wide dependencies: join, simple shuffle

Page 85: Terminology

Transformation

Stage: vertical grouping

Task: horizontal splitting

Job: sequence
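The rule "narrow dependencies stay in the same stage, a wide dependency starts a new stage" can be sketched in plain Python for a linear chain of transformations. This is a simplification under stated assumptions: the classification sets below cover only a few operations, the function name split_into_stages is hypothetical, and a real scheduler works on a full DAG rather than a list.

```python
# Illustrative classification of a few transformations
NARROW = {"map", "filter", "flatMap", "mapValues"}
WIDE = {"reduceByKey", "groupByKey", "sortByKey", "join"}

def split_into_stages(transformations):
    """Cut a linear chain of transformations into stages at each wide dependency."""
    stages, current = [], []
    for t in transformations:
        if t not in NARROW and t not in WIDE:
            raise ValueError(f"unknown transformation: {t}")
        if t in WIDE and current:
            stages.append(current)   # a shuffle is needed: close the current stage
            current = []
        current.append(t)
    if current:
        stages.append(current)
    return stages

job = ["flatMap", "map", "reduceByKey", "mapValues", "sortByKey", "filter"]
stages = split_into_stages(job)
```

Each stage must complete before the next one starts, because the shuffle in between needs all of its input.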

Page 86: Performance tuning

Pages 87 to 93: Inefficiency

(Figure sequence)

Pages 94 to 98: Persisting RDDs

(Figure sequence)
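The point of persisting is to avoid recomputing the same intermediate result for every action. The plain-Python sketch below models that with a counter; the class CachedComputation and the function expensive are hypothetical names, and the real mechanism in Spark (persist or cache on an RDD) additionally deals with partitions, storage levels, and eviction.

```python
class CachedComputation:
    """Toy model: recompute on every action unless persist() was requested."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._has_cache = False
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def collect(self):
        if self.persisted and self._has_cache:
            return self._cached            # reuse the stored result
        result = self._compute()           # otherwise evaluate from scratch
        if self.persisted:
            self._cached, self._has_cache = result, True
        return result


runs = {"count": 0}

def expensive():
    runs["count"] += 1
    return [x * x for x in range(5)]

plain = CachedComputation(expensive)
plain.collect()
plain.collect()
runs_without_persist = runs["count"]   # evaluated once per action

runs["count"] = 0
kept = CachedComputation(expensive).persist()
kept.collect()
kept.collect()
runs_with_persist = runs["count"]      # evaluated only for the first action
```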

Page 99: General DAG with stages

Wide dependencies: join, simple shuffle

Page 100: Avoiding a wide dependency

Page 101: Avoiding a wide dependency

Pre-partitioning

Page 102: Avoiding a wide dependency

Key-values with the same key are already on the same machine

Page 103: Avoiding a wide dependency

No need to leave the machine (no shuffling needed!)

Page 104: General DAG with stages

Wide dependencies: join, simple shuffle

Page 105: Pre-partitioning

Pre-partitioned inputs, optimized join, simple shuffle
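Why pre-partitioning removes the shuffle from a join can be sketched in plain Python. The sketch assumes both inputs are split with the same partitioning function and the same number of partitions, so matching keys always land in partitions with the same index and each pair of partitions can be joined locally. All function names and the sample data are hypothetical; a CRC32 checksum is used only to get a hash that is stable across Python runs.

```python
import zlib

def partition_of(key, num_partitions):
    """Deterministic hash partitioner for string keys."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_by(kvs, num_partitions):
    """Split key-value pairs into partitions by the hash of their key."""
    parts = [[] for _ in range(num_partitions)]
    for k, v in kvs:
        parts[partition_of(k, num_partitions)].append((k, v))
    return parts

def local_join(left_part, right_part):
    """Join two partitions that live on the same machine."""
    return [(k, (v, w)) for k, v in left_part for k2, w in right_part if k == k2]

users = [("alice", "CH"), ("bob", "US"), ("carol", "IN")]
orders = [("bob", 12), ("alice", 7), ("alice", 30)]

n = 3
left = partition_by(users, n)
right = partition_by(orders, n)

# Partition i of the left input is only ever combined with partition i of the right input
joined = [row for i in range(n) for row in local_join(left[i], right[i])]
```

No pair ever has to be compared across partition indexes, which is the sense in which nothing needs to leave the machine.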

Page 106: DataFrames

Page 107: RDDs...

Parameter: function

Page 108: Spark and Python (PySpark)

rdd = spark.sparkContext.textFile('hdfs:///dataset.txt')
# keep only the lines that mention Spark
rdd2 = rdd.filter(lambda l: "Spark" in l)
# key each line by its length
rdd3 = rdd2.map(lambda l: (len(l), l))
# concatenate the lines that share a key
rdd4 = rdd3.reduceByKey(lambda l1, l2: l1 + l2)
result = rdd4.take(10)

Page 109: Spark and Python (PySpark): with JSON

import json

rdd = spark.sparkContext.textFile('hdfs:///dataset.json')
# parse each line into a Python dictionary
rdd2 = rdd.map(lambda l: json.loads(l))
rdd3 = rdd2.filter(lambda l: l['key'] == 0)
rdd4 = rdd3.map(lambda l: (l['key'], l['otherfield']))
result = rdd4.countByKey()

Page 110: DataFrames

Dataset<Row>

Page 111: DataFrames

Page 112: Columnar storage

Page 113: Memory footprint

"Raw" Spark compared with DataFrames

Page 114: DataFrames

df = spark.read.json('hdfs:///dataset.json')
df.createOrReplaceTempView("dataset")
df2 = spark.sql("SELECT * FROM dataset "
                "WHERE guess = target "
                "ORDER BY target ASC, country DESC, date DESC")
result = df2.take(10)

Page 115: DataFrames

df = spark.read.json('hdfs:///dataset.json')
df.createOrReplaceTempView("dataset")
df2 = spark.sql("SELECT * FROM dataset "
                "WHERE guess = target "
                "ORDER BY target ASC, country DESC, date DESC")
result = df2.take(10)

The query string is Spark SQL.

Page 116: Schema inference

dataset.csv:
  foo,bar
  1,true
  2,true
  3,false
  4,true
  5,true
  6,false
  7,true

Page 117: Schema inference

dataset.csv (as on page 116) becomes the DataFrame:
  foo (integer)  bar (boolean)
  1              true
  2              true
  3              false
  4              true
  5              true
  6              false
  7              true

Page 118: Schema inference

dataset.json:
  { "foo" : 1, "bar" : true}
  { "foo" : 2, "bar" : true}
  { "foo" : 3, "bar" : false}
  { "foo" : 4, "bar" : true}
  { "foo" : 5, "bar" : true}
  { "foo" : 6, "bar" : false}
  { "foo" : 7, "bar" : true}

Page 119: Schema inference

dataset.json (as on page 118) becomes the DataFrame:
  foo (integer)  bar (boolean)
  1              true
  2              true
  3              false
  4              true
  5              true
  6              false
  7              true

Page 120: DataFrames (with logical transformations)

from pyspark.sql.functions import asc, desc

df = spark.read.json('hdfs:///dataset.json')
df2 = df.filter(df['name'] == 'Einstein')
df3 = df2.sort(asc("theory"), desc("date"))
df4 = df3.select('year')
result = df4.take(10)

Page 121: Available types

Numbers: Byte, Short, Integer, Long, Float, Double, Decimal

Other atomics: String, Boolean, Binary, Timestamp, Date

Structured: Array, Struct, Map

Page 122: Type mapping

  DataFrame       Java
  ByteType        byte
  ShortType       short
  IntegerType     int
  LongType        long
  FloatType       float
  DoubleType      double
  BooleanType     boolean
  StringType      String
  DecimalType     java.math.BigDecimal
  TimestampType   java.sql.Timestamp
  DateType        java.sql.Date
  BinaryType      byte[]

  DataFrame       Java
  ArrayType       java.util.List
  MapType         java.util.Map
  StructType      Row

Page 123: Other data formats

df = spark.read.json("hdfs:///dataset.json")
df = spark.read.parquet("hdfs:///dataset.parquet")
df = spark.read.csv("hdfs:///dir/*.csv")
df = spark.read.text("hdfs:///dataset[0-7].txt")
df = spark.read.jdbc("jdbc:postgresql://localhost/test?user=fred&password=secret", ...)
df = spark.read.format("avro").load("hdfs:///dataset.avro")
...

Page 124: Your own schema

  First    Last     Picture  Birthday  Flag
  String   String   byte[]   Date      boolean

Dataset<Person>

Statically known

Page 125: DataFrames

df = spark.read.json('hdfs:///dataset.json')
df.createOrReplaceTempView("dataset")
df2 = spark.sql("SELECT * FROM dataset "
                "WHERE guess = target "
                "ORDER BY target ASC, country DESC, date DESC")
result = df2.take(10)

Page 126: DataFrames

df = spark.read.json('hdfs:///dataset.json')
df.createOrReplaceTempView("dataset")
df2 = spark.sql("SELECT * FROM dataset "
                "WHERE guess = target "
                "ORDER BY target ASC, country DESC, date DESC")
result = df2.take(10)

The query string is Spark SQL.

SQL Brush-Up!

Page 127: DataFrames (with logical transformations)

from pyspark.sql.functions import asc, desc

df = spark.read.json('hdfs:///dataset.json')
df2 = df.filter(df['name'] == 'Einstein')
df3 = df2.sort(asc("theory"), desc("date"))
df4 = df3.select('year')
result = df4.take(10)

Page 128: DataFrames

Query (DataFrames/Spark SQL) -> Logical plan -> Physical plan -> Physical query (RDDs)

Page 129: Optimizations with Catalyst

Source: Databricks blog entry on Catalyst

Page 130: Dealing with nestedness: arrays

SELECT Last, EXPLODE(Countries)
FROM input

Input:
  First (String)  Last (String)  Countries (Array of Strings)
  Albert          Einstein       [ "D", "I", "CH", "A", "BE", "US" ]
  Srinivasa       Ramanujan      [ "IN", "UK" ]
  Kurt            Gödel          [ "CZ", "A", "US" ]
  Leonhard        Euler          [ "CH", "RU" ]

Output:
  Last (String)  Countries (String)
  Einstein       D
  Einstein       I
  Einstein       CH
  Einstein       A
  Einstein       BE
  Einstein       US
  Ramanujan      IN
  Ramanujan      UK
  Gödel          CZ
  Gödel          A
  Gödel          US
  Euler          CH
  Euler          RU

Page 131: Dealing with nestedness: objects

SELECT Name.First, Name.Last
FROM input

Input:
  Name (Object)                                     Countries (Int)
  { "First" : "Albert", "Last" : "Einstein" }       6
  { "First" : "Srinivasa", "Last" : "Ramanujan" }   2
  { "First" : "Kurt", "Last" : "Gödel" }            3
  { "First" : "John", "Last" : "Nash" }             1
  { "First" : "Alan", "Last" : "Turing" }           1
  { "First" : "Leonhard", "Last" : "Euler" }        2

Output:
  First (String)  Last (String)
  Albert          Einstein
  Srinivasa       Ramanujan
  Kurt            Gödel
  John            Nash
  Alan            Turing
  Leonhard        Euler
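Both ways of flattening nested data shown above (one output row per array element, and projecting fields out of a nested object) can be emulated in plain Python on lists of dictionaries. This is a sketch of the semantics only, using rows taken from the two example tables; it is not Spark SQL.

```python
# Arrays: one output row per element of the nested array (what EXPLODE does)
people = [
    {"First": "Albert", "Last": "Einstein", "Countries": ["D", "I", "CH", "A", "BE", "US"]},
    {"First": "Srinivasa", "Last": "Ramanujan", "Countries": ["IN", "UK"]},
    {"First": "Kurt", "Last": "Gödel", "Countries": ["CZ", "A", "US"]},
    {"First": "Leonhard", "Last": "Euler", "Countries": ["CH", "RU"]},
]
exploded = [(p["Last"], country) for p in people for country in p["Countries"]]

# Objects: pull the nested fields up into top-level columns (Name.First, Name.Last)
records = [
    {"Name": {"First": "John", "Last": "Nash"}, "Countries": 1},
    {"Name": {"First": "Alan", "Last": "Turing"}, "Countries": 1},
]
projected = [(r["Name"]["First"], r["Name"]["Last"]) for r in records]
```

The exploded result has as many rows as there are array elements in total, here 6 + 2 + 3 + 2 = 13.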

Page 132: Limits of DataFrames: Heterogeneity

dataset.json:
  { "foo" : 1, "bar" : true}
  { "foo" : 2, "bar" : true}
  { "foo" : [3, 4], "bar" : false}
  { "foo" : 4, "bar" : true}
  { "foo" : 5, "bar" : true}
  { "foo" : 6, "bar" : false}
  { "foo" : 7, "bar" : true}

DataFrame:
  foo (string)  bar (boolean)
  "1"           true
  "2"           true
  "[3, 4]"      false
  "4"           true
  "5"           true
  "6"           false
  "7"           true
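The behaviour shown on the schema inference and heterogeneity slides can be imitated with a toy inference routine in plain Python: a column keeps a precise type only if all of its values agree, and otherwise falls back to string. The function names infer_type and infer_schema are hypothetical, and the rules are deliberately minimal (only integer and boolean are recognized); Spark's actual inference handles many more types.

```python
import json

def infer_type(values):
    """Return a type name if all values agree on it, otherwise fall back to string."""
    kinds = set()
    for v in values:
        if isinstance(v, bool):      # test bool first: in Python, bool is a subclass of int
            kinds.add("boolean")
        elif isinstance(v, int):
            kinds.add("integer")
        else:
            kinds.add("other")
    if len(kinds) == 1 and "other" not in kinds:
        return kinds.pop()
    return "string"

def infer_schema(json_lines):
    """Infer one type per field from a list of JSON Lines records."""
    records = [json.loads(line) for line in json_lines]
    fields = sorted({k for r in records for k in r})
    return {f: infer_type([r[f] for r in records if f in r]) for f in fields}

homogeneous = ['{"foo": 1, "bar": true}', '{"foo": 2, "bar": false}']
heterogeneous = ['{"foo": 1, "bar": true}', '{"foo": [3, 4], "bar": false}']

schema_ok = infer_schema(homogeneous)
schema_mixed = infer_schema(heterogeneous)
```

In the mixed case the single array value is enough to turn the whole foo column into strings, which is the limitation the slide points out.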

Page 133: DataFrames and RDDs

rdd = df.rdd

df = spark.createDataFrame(rdd, schema)
or
df = rdd.toDF()

Page 134: 10 Design Principles of Big Data

Page 135: 1. Learn from the past

Page 136: 2. Keep the design simple

Page 137: 3. Modularize the architecture

Page 138: 4. Homogeneity in the large

Page 139: 5. Heterogeneity in the small

Page 140: 6. Separate metadata from data

Page 141: 7. Abstract logical model from its physical implementation

Page 142: 8. Shard the data

Page 143: 9. Replicate the data

Page 144: 10. Buy lots of cheap hardware