Intro to Big Data - Spark

Apr 12, 2017

Transcript
Page 1: Intro to Big Data - Spark

Data Ingestion with Spark

UII

Yogyakarta

1

Page 2: Intro to Big Data - Spark

About Me

Sofian Hadiwijaya (@sofianhw, [email protected])

Co-Founder at Pinjam.co.id
Tech Advisor at Nodeflux.io
Software Innovator (IoT and AI) at Intel

2

Page 3: Intro to Big Data - Spark

3

Page 4: Intro to Big Data - Spark

4

Page 5: Intro to Big Data - Spark

5

Page 6: Intro to Big Data - Spark

Wikipedia big data

In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.

6

Source: http://en.wikipedia.org/wiki/Big_data

Page 7: Intro to Big Data - Spark

How big is big?

• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user data + 15 TB/day
• 2009: eBay has 6.5 PB user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day

7

Page 8: Intro to Big Data - Spark

That’s a lot of data

8

Credit: http://www.flickr.com/photos/19779889@N00/1367404058/

Page 9: Intro to Big Data - Spark

So what?

s/data/knowledge/g

9

Page 10: Intro to Big Data - Spark

No really, what do you do with it?

• User behavior analysis
• A/B test analysis
• Ad targeting
• Trending topics
• User and topic modeling
• Recommendations
• And more...

10

Page 11: Intro to Big Data - Spark

How to scale data?

11

Page 12: Intro to Big Data - Spark

Divide and Conquer

12

Page 13: Intro to Big Data - Spark

Parallel processing is complicated

• How do we assign tasks to workers?
• What if we have more tasks than slots?
• What happens when tasks fail?
• How do you handle distributed synchronization?

13

Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/

Page 14: Intro to Big Data - Spark

Data storage is not trivial

• Data volumes are massive
• Reliably storing PBs of data is challenging
• Disk/hardware/network failures
• The probability of a failure event increases with the number of machines

For example: 1,000 hosts, each with 10 disks; a disk lasts 3 years. How many failures per day?
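A rough back-of-envelope answer under those assumptions: 1,000 hosts × 10 disks = 10,000 disks; with a lifetime of 3 years ≈ 1,095 days, that is about 10,000 / 1,095 ≈ 9 disk failures every day.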

14

Page 15: Intro to Big Data - Spark

Hadoop cluster

15

Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)

Page 16: Intro to Big Data - Spark

Hadoop

16

Page 17: Intro to Big Data - Spark

Hadoop provides

• Redundant, fault-tolerant data storage
• Parallel computation framework
• Job coordination

17

http://hadoop.apache.org

Page 18: Intro to Big Data - Spark

Joy

18

Credit: http://www.flickr.com/photos/spyndle/3480602438/

Page 19: Intro to Big Data - Spark

Hadoop origins

• Hadoop is an open-source implementation based on GFS and MapReduce from Google
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System
• Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

19

Page 20: Intro to Big Data - Spark

Hadoop Stack

20

[Diagram of the Hadoop stack: MapReduce (Distributed Programming Framework), Pig (Data Flow), Hive (SQL), Cascading (Java), HBase (Columnar Database), HDFS (Hadoop Distributed File System).]

Page 21: Intro to Big Data - Spark

HDFS

21

Page 22: Intro to Big Data - Spark

HDFS is...

• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System

22

Page 23: Intro to Big Data - Spark

HDFS - files and blocks

• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files and blocks
• The SecondaryNameNode (SNN) holds a backup of the NN data
• DataNodes (DN) store and serve blocks
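To make the file/block/metadata split concrete, here is a minimal sketch (not from the slides) that asks HDFS where the blocks of a file live, using the Hadoop FileSystem API from Scala; the path is a made-up example.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocations {
  def main(args: Array[String]): Unit = {
    // Connects to the default file system (HDFS when core-site.xml points at it).
    val fs = FileSystem.get(new Configuration())
    val file = new Path("/data/example.txt")               // hypothetical file
    val status = fs.getFileStatus(file)
    // One BlockLocation per block, listing the DataNodes holding a replica.
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { b =>
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }
  }
}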

23

Page 24: Intro to Big Data - Spark

Replication

• Multiple copies of a block are stored
• Replication strategy:

  • Copy #1 on another node on the same rack
  • Copy #2 on another node on a different rack

24

Page 25: Intro to Big Data - Spark

HDFS - writes

25

[Diagram: a Client writes a File block by block; the NameNode (Master) tells it which DataNodes (Slave nodes, spread across Rack #1 and Rack #2) should store each Block. Note: write path for a single block shown; the client writes multiple blocks in parallel.]

Page 26: Intro to Big Data - Spark

HDFS - reads

26

[Diagram: a Client asks the NameNode (Master) where the blocks of a File live, then reads block 1, block 2, …, block N directly from the DataNodes (Slave nodes). The client reads multiple blocks in parallel and re-assembles them into a file.]

Page 27: Intro to Big Data - Spark

What about DataNode failures?

• DNs check in with the NN to report health
• Upon failure, the NN orders DNs to replicate under-replicated blocks

27

Credit: http://www.flickr.com/photos/18536761@N00/367661087/

Page 28: Intro to Big Data - Spark

MapReduce

28

Page 29: Intro to Big Data - Spark

MapReduce is...

• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop

29

Page 30: Intro to Big Data - Spark

Typical large-data problem

• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output

30

[Diagram: the iterate/extract steps form the Map phase and the aggregate/output steps form the Reduce phase (Dean and Ghemawat, OSDI 2004).]

Page 31: Intro to Big Data - Spark

MapReduce Flow

31

Page 32: Intro to Big Data - Spark

MapReduce architecture

32

[Diagram: a Client submits a Job to the JobTracker (Master); the JobTracker schedules Tasks on TaskTrackers running on the Slave nodes.]

Page 33: Intro to Big Data - Spark

What about failed tasks?

• Tasks will fail
• The JT will retry failed tasks up to N attempts
• After N failed attempts for a task, the job fails
• Some tasks are slower than others
• Speculative execution: the JT starts multiple copies of the same task
• The first one to complete wins; the others are killed

33

Credit: http://www.flickr.com/photos/phobia/2308371224/

Page 34: Intro to Big Data - Spark

MapReduce - Java API

• Mapper:
void map(WritableComparable key,
         Writable value,
         OutputCollector output,
         Reporter reporter)

• Reducer:
void reduce(WritableComparable key,
            Iterator values,
            OutputCollector output,
            Reporter reporter)

34

Page 35: Intro to Big Data - Spark

MapReduce - Java API

• Writable
  • Hadoop wrapper interface
  • Text, IntWritable, LongWritable, etc.
• WritableComparable
  • Writable classes implement WritableComparable
• OutputCollector
  • Class that collects keys and values
• Reporter
  • Reports progress, updates counters
• InputFormat
  • Reads data and provides InputSplits
  • Examples: TextInputFormat, KeyValueTextInputFormat
• OutputFormat
  • Writes data
  • Examples: TextOutputFormat, SequenceFileOutputFormat

35

Page 36: Intro to Big Data - Spark

MapReduce - Counters are...

• A distributed count of events during a job
• A way to indicate job metrics without logging
• Your friend

• Bad:
System.out.println("Couldn't parse value");

• Good:
reporter.incrCounter(BadParseEnum, 1L);

36

Page 37: Intro to Big Data - Spark

MapReduce - word count mapper

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

37

Page 38: Intro to Big Data - Spark

MapReduce - word count reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

38

Page 39: Intro to Big Data - Spark

MapReduce - word count main

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}

39

Page 40: Intro to Big Data - Spark

MapReduce - running a job

• To run word count, add files to HDFS and do:

$ bin/hadoop jar wordcount.jar org.myorg.WordCount input_dir output_dir

40

Page 41: Intro to Big Data - Spark

MapReduce is good for...

• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset

41

Page 42: Intro to Big Data - Spark

MapReduce is ok for...

• Iterative jobs (e.g., graph algorithms)
• Each iteration must read/write data to disk
• The IO and latency cost of each iteration is high

42

Page 43: Intro to Big Data - Spark

MapReduce is not good for...

• Jobs that need shared state/coordination
  • Tasks are shared-nothing
  • Shared state requires a scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records

43

Page 44: Intro to Big Data - Spark

Hadoop combined architecture

44

[Diagram: every Slave node runs both a TaskTracker and a DataNode; the JobTracker and NameNode run on the Master, with a SecondaryNameNode as Backup.]

Page 45: Intro to Big Data - Spark

Hadoop is too complicated

45

Page 46: Intro to Big Data - Spark

Welcome Spark

46

Page 47: Intro to Big Data - Spark

What is Spark?

Distributed data analytics engine, generalizing MapReduce.

Core engine, with streaming, SQL, machine learning, and graph processing modules.

Page 48: Intro to Big Data - Spark

Most Active Big Data Project

Activity in last 30 days (as of June 1, 2014)

[Charts: number of patches, lines of code added, and lines of code removed over the last 30 days for MapReduce, Storm, Yarn, and Spark.]

Page 49: Intro to Big Data - Spark

Big Data Systems Today

[Diagram: MapReduce for general batch processing alongside specialized systems for iterative, interactive and streaming apps (Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Impala, S4, …), versus a unified platform.]

Page 50: Intro to Big Data - Spark

Spark Core: RDDs

Distributed collection of objects. What's cool about them?

• In-memory
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure
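A minimal sketch of those three properties in Scala (assuming a SparkContext named sc, e.g. from spark-shell, and a made-up input path):

// Build an RDD via parallel transformations, keep it in memory, and reuse it.
val lines  = sc.textFile("hdfs:///logs/app.log")           // hypothetical input
val errors = lines
  .filter(_.contains("ERROR"))                             // transformation (lazy)
  .map(_.toLowerCase)                                      // another transformation
  .cache()                                                 // keep in memory for reuse

println(errors.count())                                    // first action: computes and caches
println(errors.filter(_.contains("timeout")).count())      // reuses the cached RDD
// If a partition is lost, Spark rebuilds it from the recorded lineage of transformations.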

Page 51: Intro to Big Data - Spark

Data Sharing in MapReduce

[Diagram: with MapReduce, every iteration goes Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …, and every query (query 1, query 2, query 3, …) starts with its own HDFS read to produce its result.]

Slow due to replication, serialization, and disk IO

Page 52: Intro to Big Data - Spark

What We'd Like

[Diagram: after one-time processing of the Input, iterations (iter. 1, iter. 2, …) and queries (query 1, query 2, query 3, …) share data through distributed memory instead of HDFS.]

10-100× faster than network and disk

Page 53: Intro to Big Data - Spark

A Unified Platform

[Diagram: Spark Core at the base, with Spark SQL, Spark Streaming (real-time), MLlib (machine learning), and GraphX (graph) built on top.]

Page 54: Intro to Big Data - Spark

Spark SQL

Unify tables with RDDs

Tables = Schema + Data

Page 55: Intro to Big Data - Spark

Spark SQL

Unify tables with RDDs

Tables = Schema + Data = SchemaRDD

coolPants = sql("""
  SELECT pid, color
  FROM pants JOIN opinions
  WHERE opinions.coolness > 90""")

chosenPair = coolPants.filter(lambda row: row(1) == "green").take(1)

Page 56: Intro to Big Data - Spark

GraphX

Unifies graphs with RDDs of edges and vertices
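As a rough illustration (not from the slides), a GraphX graph is literally built from one RDD of vertices and one RDD of edges; the names and data below are made up, and a SparkContext sc is assumed:

import org.apache.spark.graphx.{Edge, Graph}

// Vertex RDD: (id, attribute); Edge RDD: Edge(srcId, dstId, attribute).
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)          // a property graph backed by the two RDDs
graph.inDegrees.collect().foreach(println)  // e.g. how many followers each vertex has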


Page 60: Intro to Big Data - Spark

MLlib

Vectors, Matrices

Page 61: Intro to Big Data - Spark

MLlib

Vectors, Matrices = RDD[Vector]

Iterative computation
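A small, hedged sketch of what "RDD[Vector] plus iterative computation" looks like in practice, using MLlib's k-means (a SparkContext sc is assumed; the points are made up):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// An RDD[Vector]: two obvious clusters around (0,0) and (9,9).
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

// train() runs an iterative algorithm; the data stays in the cluster between iterations.
val model = KMeans.train(points, 2, 10)     // 2 clusters, at most 10 iterations
model.clusterCenters.foreach(println)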

Page 62: Intro to Big Data - Spark

Spark Streaming

[Diagram: an input stream arriving over time.]

Page 63: Intro to Big Data - Spark

Spark Streaming

[Diagram: the input stream is chopped into a series of RDDs over time.]

Express streams as a series of RDDs over time:

val pantsers = spark.sequenceFile("hdfs:/pantsWearingUsers")

spark.twitterStream(...)
  .filter(t => t.text.contains("Hadoop"))
  .transform(tweets => tweets.map(t => (t.user, t)))
  .join(pantsers)
  .print()

Page 64: Intro to Big Data - Spark

What it Means for Users

Separate frameworks:

[Diagram: with separate frameworks, each stage (ETL, train, query) does its own HDFS read and HDFS write; with Spark, a single HDFS read feeds ETL, train, and query for interactive analysis.]

Page 65: Intro to Big Data - Spark

Spark Cluster

65

Page 66: Intro to Big Data - Spark

Benefits of Unification

• No copying or ETLing data between systems
• Combine processing types in one program
• Code reuse
• One system to learn
• One system to maintain

Page 67: Intro to Big Data - Spark

Data Ingestion

67

Page 68: Intro to Big Data - Spark

68

Page 69: Intro to Big Data - Spark

Collection vs Ingestion

69

Page 70: Intro to Big Data - Spark

Data collection

• Happens where data originates
• “logging code”
• Batch v. Streaming
• Pull v. Push
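For illustration only, a tiny Scala sketch of "logging code": the application appends one JSON line per event at the point where the data originates, for an ingestion layer to pull later; the field names and file path are assumptions, not from the slides.

import java.io.{FileWriter, PrintWriter}

// Append-only event log, one JSON object per line (path and schema are made up).
def logEvent(userId: String, action: String): Unit = {
  val line = s"""{"ts":${System.currentTimeMillis()},"user":"$userId","action":"$action"}"""
  val out = new PrintWriter(new FileWriter("/var/log/app/events.log", true))  // append mode
  try out.println(line) finally out.close()
}

logEvent("u123", "view_item")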

70

Page 71: Intro to Big Data - Spark

Data Ingestion

• Receives data
• Sometimes coupled with storage
• Routing data
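To tie this back to Spark, here is a minimal, hedged ingestion sketch: a Spark Streaming job that receives records over a socket and routes each micro-batch into HDFS. The host, port, and output path are assumptions, not from the slides.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object IngestJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ingest")
    val ssc  = new StreamingContext(conf, Seconds(10))      // 10-second micro-batches

    // Receive data: one line per record from an upstream collector (assumed host/port).
    val lines = ssc.socketTextStream("collector-host", 9999)

    // Route data to storage: each batch is written to its own time-stamped directory.
    lines.saveAsTextFiles("hdfs:///ingest/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}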

71

Page 72: Intro to Big Data - Spark

Thanks

Sofianhw

[email protected]

Pinjam.co.id

72