Page 1: Hadoop & Big Data benchmarking

Benchmarking

Hadoop & Big Data benchmarking

Dr. ir. ing. Bart Vandewoestyne

Sizing Servers Lab, Howest, Kortrijk

IWT TETRA User Group Meeting - November 28, 2014

1 / 62

Page 2: Hadoop & Big Data benchmarking

Benchmarking

Outline

1 Intro: Hadoop essentials

2 Cloudera demo

3 Benchmarks
    Micro Benchmarks
    BigBench

4 Conclusions

2 / 62

Page 3: Hadoop & Big Data benchmarking

Benchmarking

Intro: Hadoop essentials

Outline

1 Intro: Hadoop essentials

2 Cloudera demo

3 Benchmarks
    Micro Benchmarks
    BigBench

4 Conclusions

3 / 62

Page 4: Hadoop & Big Data benchmarking

Benchmarking

Intro: Hadoop essentials

Hadoop

Hadoop is like VMware, but the other way around: VMware splits one physical machine into many virtual machines, whereas Hadoop makes a cluster of machines behave like one big system.

4 / 62

Page 5: Hadoop & Big Data benchmarking

Benchmarking

Intro: Hadoop essentials

Hadoop 1.0

Source: Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 (Hortonworks, 2014)

MapReduce and HDFS are the core components, while other components are built around the core.

5 / 62

Page 6: Hadoop & Big Data benchmarking

Benchmarking

Intro: Hadoop essentials

Hadoop 2.0

Source: Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 (Hortonworks, 2014)

YARN adds a more general interface to run non-MapReduce jobs within the Hadoop framework.

6 / 62

Page 7: Hadoop & Big Data benchmarking

Benchmarking

Intro: Hadoop essentials

HDFS

Hadoop Distributed File System

Source: http://www.cac.cornell.edu/vw/MapReduce/dfs.aspx

7 / 62

Page 8: Hadoop & Big Data benchmarking

Benchmarking

Intro: Hadoop essentials

MapReduce

MapReduce = Programming Model

WordCount example:

Source: Optimizing Hadoop for MapReduce, Khaled Tannir
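The WordCount data flow on this slide is shown as a figure; as a minimal sketch of the same idea, here is essentially the classic Java WordCount from the Apache Hadoop MapReduce tutorial: the mapper emits a (word, 1) pair for every word, and the reducer sums those counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it is launched with hadoop jar, just like the benchmark jobs later in this talk.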

8 / 62

Page 9: Hadoop & Big Data benchmarking

Benchmarking

Intro: Hadoop essentials

Hadoop distributions

9 / 62

Page 10: Hadoop & Big Data benchmarking

Benchmarking

Cloudera demo

Outline

1 Intro: Hadoop essentials

2 Cloudera demo

3 Benchmarks
    Micro Benchmarks
    BigBench

4 Conclusions

10 / 62

Page 11: Hadoop & Big Data benchmarking

Benchmarking

Cloudera demo

HDFS

11 / 62

Page 12: Hadoop & Big Data benchmarking

Benchmarking

Cloudera demo

NameNode and DataNodes

12 / 62

Page 13: Hadoop & Big Data benchmarking

Benchmarking

Cloudera demo

Hosts and their roles

13 / 62

Page 14: Hadoop & Big Data benchmarking

Benchmarking

Cloudera demo

NameNode WebUI

NameNode WebUI address

http://sandy-quad-1.sslab.lan:50070/

14 / 62

Page 15: Hadoop & Big Data benchmarking

Benchmarking

Cloudera demo

Replication factor

15 / 62

Page 16: Hadoop & Big Data benchmarking

Benchmarking

Cloudera demo

HDFS Blocks

16 / 62

Page 17: Hadoop & Big Data benchmarking

Benchmarking

Cloudera demo

Hue: file upload

17 / 62

Page 18: Hadoop & Big Data benchmarking

Benchmarking

Cloudera demo

Hadoop jobs: counters/metrics

18 / 62

Page 19: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Outline

1 Intro: Hadoop essentials

2 Cloudera demo

3 Benchmarks
    Micro Benchmarks
    BigBench

4 Conclusions

19 / 62

Page 20: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Why benchmark?

My three reasons for using benchmarks:

1 Evaluating the effect of a hardware/software upgrade:

OS, Java VM, ...

Hadoop, Cloudera CDH, Pig, Hive, Impala, ...

2 Debugging:

Compare with other clusters or published results.

3 Performance tuning:

E.g. Cloudera CDH default config is defensive, not optimal.

20 / 62

Page 21: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

Outline

1 Intro: Hadoop essentials

2 Cloudera demo

3 Benchmarks
    Micro Benchmarks
    BigBench

4 Conclusions

21 / 62

Page 22: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

Hadoop: Available tests

hadoop jar /some/path/to/hadoop-*test*.jar

22 / 62

Page 23: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

TestDFSIO

Read and write test for HDFS.

Helpful for

getting an idea of how fast your cluster is in terms of I/O,

stress testing HDFS,

discovering network performance bottlenecks,

shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes).

23 / 62

Page 24: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

TestDFSIO: write test

Generate 10 files of size 1 GB for a total of 10 GB:

$ hadoop jar hadoop-*test*.jar \

TestDFSIO -write -nrFiles 10 -fileSize 1000

TestDFSIO is designed to use 1 map task per file (a 1:1 mapping from files to map tasks).

24 / 62

Page 25: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

TestDFSIO: write test output

Typical output of write test

----- TestDFSIO ----- : write

Date & time: Mon Oct 06 10:21:28 CEST 2014

Number of files: 10

Total MBytes processed: 10000.0

Throughput mb/sec: 12.874702111579893

Average IO rate mb/sec: 13.013071060180664

IO rate std deviation: 1.4416050051562712

Test exec time sec: 114.346

25 / 62

Page 26: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

Interpreting TestDFSIO results

Definition (Throughput)

Throughput(N) = sum(filesize_i, i = 0..N) / sum(time_i, i = 0..N)

Definition (Average IO rate)

Average IO rate(N) = sum(rate_i, i = 0..N) / N = sum(filesize_i / time_i, i = 0..N) / N

Here, N is the number of map tasks.
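A small worked example (the numbers are made up) shows why the two metrics can differ: suppose two map tasks each write 1000 MB, one in 50 s and the other in 100 s. Then

Throughput = (1000 + 1000) / (50 + 100) ≈ 13.3 MB/s

Average IO rate = (1000/50 + 1000/100) / 2 = (20 + 10) / 2 = 15 MB/s

The average IO rate weights every map task equally, so slow tasks pull the throughput figure down more than the average IO rate.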

26 / 62

Page 27: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

TestDFSIO: read test

Read 10 input files, each of size 1 GB:

$ hadoop jar hadoop-*test*.jar \

TestDFSIO -read -nrFiles 10 -fileSize 1000

27 / 62

Page 28: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

TestDFSIO: read test output

Typical output of read test

----- TestDFSIO ----- : read

Date & time: Mon Oct 06 10:56:15 CEST 2014

Number of files: 10

Total MBytes processed: 10000.0

Throughput mb/sec: 402.4306813151435

Average IO rate mb/sec: 492.8257751464844

IO rate std deviation: 196.51233829270575

Test exec time sec: 33.206

28 / 62

Page 29: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

Influence of HDFS replication factor

When interpreting TestDFSIO results, keep in mind:

The HDFS replication factor plays an important role!

A higher replication factor leads to slower writes: every block has to be stored on several DataNodes, and with a factor above 1 the extra copies also travel over the network.

For three identical TestDFSIO write runs (units are MB/s):

                    HDFS replication factor
                    1           2          3
Throughput          190         25         13
Average IO rate     190 ± 10    25 ± 3     13 ± 1
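A minimal Java sketch (the file path and target factor are only illustrative; TestDFSIO typically writes its data under /benchmarks/TestDFSIO) of how the replication factor of an existing HDFS file can be inspected or changed through the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Illustrative path: one of the files written by TestDFSIO.
    Path file = new Path("/benchmarks/TestDFSIO/io_data/test_io_0");

    // Inspect the current replication factor of the file.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Current replication: " + status.getReplication());

    // Ask the NameNode to re-replicate this file with factor 2.
    fs.setReplication(file, (short) 2);
  }
}

Note that this only changes existing files; newly written files follow the dfs.replication setting of the writing client.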

29 / 62

Page 30: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

TeraSort

Goal

Sort 1TB of data (or any other amount of data) as fast as possible.

Probably the best-known Hadoop benchmark.

Combines testing of the HDFS and MapReduce layers of a Hadoop cluster.

Typical areas where TeraSort is helpful

Iron out your Hadoop configuration after your cluster has first passed a convincing TestDFSIO benchmark.

Determine whether your MapReduce-related parameters are set to proper values.

30 / 62

Page 31: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

TeraSort: workflow

TeraGen → /user/bart/terasort-input → TeraSort → /user/bart/terasort-output → TeraValidate → /user/bart/terasort-validate

31 / 62

Page 32: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

TeraSort: workflow

hadoop jar hadoop-mapreduce-examples.jar \

teragen 10000000000 /user/bart/input

≈ 4 hours on our 4-node cluster (10 000 000 000 rows of 100 bytes each, i.e. ≈ 1 TB)

hadoop jar hadoop-mapreduce-examples.jar \

terasort /user/bart/input /user/bart/output

≈ 5 hours on our 4-node cluster

hadoop jar hadoop-mapreduce-examples.jar \

teravalidate /user/bart/output /user/bart/validate

If something went wrong, TeraValidate’s output contains the problem report.

32 / 62

Page 35: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

TeraSort: duration

35 / 62

Page 36: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

TeraSort: counters

36 / 62

Page 37: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

NNBench

Goal

Load test the NameNode hardware and software.

Generates a lot of HDFS-related requests with normally very small payloads.

Purpose: put a high HDFS management stress on the NameNode.

Can simulate requests for creating, reading, renaming and deleting files on HDFS.

37 / 62

Page 38: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

NNBench: example

Create 1000 files using 12 maps and 6 reducers:

$ hadoop jar hadoop-*test*.jar nnbench \

-operation create_write \

-maps 12 \

-reduces 6 \

-blockSize 1 \

-bytesToWrite 0 \

-numberOfFiles 1000 \

-replicationFactorPerFile 3 \

-readFileAfterOpen true \

-baseDir /user/bart/NNBench-`hostname -s`

38 / 62

Page 39: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

MRBench

Goal

Loop a small job a number of times.

checks whether small job runs are responsive and running efficiently on the cluster

complementary to TeraSort

puts its focus on the MapReduce layer

impact on the HDFS layer is very limited

39 / 62

Page 40: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

Micro Benchmarks

MRBench: example

Run a loop of 50 small test jobs:

$ hadoop jar hadoop-*test*.jar \

mrbench -baseDir /user/bart/MRBench \

-numRuns 50

Example output:

DataLines Maps Reduces AvgTime (milliseconds)

1 2 1 28822

→ the average finish time of the executed jobs was about 28.8 seconds.

40 / 62

Page 42: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

Outline

1 Intro: Hadoop essentials

2 Cloudera demo

3 Benchmarks
    Micro Benchmarks
    BigBench

4 Conclusions

42 / 62

Page 43: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench

Source: http://mhpersonaltrainer.mhpersonaltrainer.com/mhpersonaltrainer/56616/index

43 / 62

Page 44: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench

Big Data benchmark based on TPC-DS.

Focus is mostly on MapReduce engines.

Collaboration between industry and academia.

https://github.com/intel-hadoop/Big-Bench/

History

Launched at the First Workshop on Big Data Benchmarking (May 8-9, 2012).

Full kit presented at the Fifth Workshop on Big Data Benchmarking (August 5-6, 2014).

44 / 62

Page 45: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench data model

Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013.

45 / 62

Page 46: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: Data Model - 3 V’s

Variety

BigBench data is

structured,

semi-structured,

unstructured.

Velocity

Periodic refreshes for all data.

Different velocity for different areas:

V_structured < V_unstructured < V_semi-structured

Volume

TPC-DS: discrete scale factors (100, 300, 1000, 3000, 10000, 30000 and 100000).

BigBench: continuous scale factor.

46 / 62

Page 47: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: Workload

Workload queries

30 “queries”

Specified in English (sort of)

No required syntax (first implementation in Aster SQL MR)

Kit implemented in Hive, Hadoop MR, Mahout, OpenNLP

Business functions (McKinsey)

Marketing

Merchandising

Operations

Supply chain

Reporting (customers and products)

47 / 62

Page 48: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: Workload - Technical Aspects

Data Sources        Number of Queries   Percentage
Structured          18                  60 %
Semi-structured     7                   23 %
Unstructured        5                   17 %

Analytic techniques   Number of Queries   Percentage
Statistics analysis   6                   20 %
Data mining           17                  57 %
Reporting             8                   27 %

48 / 62

Page 49: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: Workload - Technical Aspects

Query Types    Number of Queries   Percentage
Pure HiveQL    14                  46 %
Mahout         5                   17 %
OpenNLP        5                   17 %
Custom MR      6                   20 %

Note that your implementation may vary!

49 / 62

Page 51: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: Benchmark Process

Source: http://www.tele-task.de/archive/video/flash/24896/

51 / 62

Page 52: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: Metric

Number of queries run:

30 × (2 × S + 1), where S is the number of concurrent streams in the throughput tests

Measured times:

T_L: loading process

T_P: power test

T_TT1: first throughput test

T_DM: data maintenance task

T_TT2: second throughput test

Definition (BigBench queries per hour)

BBQpH = (30 · 3 · S · 3600) / (S · T_L + S · T_P + T_TT1 + S · T_DM + T_TT2)

Similar to TPC-DS metric.
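A purely illustrative plug-in of the formula (all times are invented): with S = 2 streams, T_L = 1800 s, T_P = 3600 s, T_TT1 = T_TT2 = 7200 s and T_DM = 1800 s:

BBQpH = (30 · 3 · 2 · 3600) / (2·1800 + 2·3600 + 7200 + 2·1800 + 7200) = 648000 / 28800 = 22.5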

52 / 62

Page 53: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: results

53 / 62

Page 54: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: monitoring

54 / 62

Page 55: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: monitoring

55 / 62

Page 56: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: monitoring

56 / 62

Page 57: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: monitoring

57 / 62

Page 58: Hadoop & Big Data benchmarking

Benchmarking

Benchmarks

BigBench

BigBench: in progress

Source: The Hortonworks Blog

58 / 62

Page 59: Hadoop & Big Data benchmarking

Benchmarking

Conclusions

Outline

1 Intro: Hadoop essentials

2 Cloudera demo

3 Benchmarks
    Micro Benchmarks
    BigBench

4 Conclusions

59 / 62

Page 60: Hadoop & Big Data benchmarking

Benchmarking

Conclusions

Conclusions

Use Hadoop distributions!

Hadoop cluster administration → Cloudera Manager.

Micro-benchmarks ↔ BigBench.

Your best benchmark is your own application!

60 / 62

Page 62: Hadoop & Big Data benchmarking

Benchmarking

Conclusions

Questions?

Source: https://gigaom.com/2011/12/19/my-hadoop-is-bigger-than-yours/

62 / 62