Profiling Hadoop Applications Basant Verma

Jul 17, 2015

Page 1: Profile hadoop apps

Profiling Hadoop Applications

Basant Verma

Page 2: Profile hadoop apps

Agenda

• Profiling General Background

• Available Options

• Profile using Free and Open Source tools

• Profile using YourKit

• Other troubleshooting tools

Page 3: Profile hadoop apps

What does Profiling Provide?

• Profiling runtime / CPU usage:
  – what lines of code the program is spending the most time in
  – what call/invocation paths were used to get to these lines
    • naturally represented as tree structures

• Profiling memory usage:
  – what kinds of objects are sitting on the heap
  – where they were allocated
  – who is pointing to them now
  – memory leaks

Page 4: Profile hadoop apps

Profiler Types and Components

• Components needed for profiling
  – Profiling Agent
    • Collects profiled data (samples, traces, exceptions etc.)
  – Analysis Tool
    • Provides an interface for analyzing profiled data and helps the user identify potential problems

• Types of Profilers
  – insertion
  – sampling
  – instrumenting
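
As a rough illustration of the sampling approach, the sketch below polls the stacks of all live threads at a fixed interval and counts which method is on top of each stack. It is not how hprof is implemented; the class name TinySampler and its parameters are hypothetical, and only standard JDK calls are used.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal in-process sampling profiler sketch (hypothetical helper, not hprof itself). */
final class TinySampler {

    /** Takes `samples` snapshots, `intervalMs` apart, and counts the top frame of every thread. */
    static Map<String, Integer> sample(int samples, long intervalMs) {
        Map<String, Integer> hits = new HashMap<>();
        Thread self = Thread.currentThread();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                // Skip the sampler thread itself and threads that report no Java frames.
                if (e.getKey() == self || stack.length == 0) {
                    continue;
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A busy thread so that the sampler has something to observe.
        Thread worker = new Thread(() -> {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1);
            }
        }, "busy-worker");
        worker.setDaemon(true);
        worker.start();

        Map<String, Integer> hits = sample(50, 10); // 10 ms matches the hprof default interval
        worker.interrupt();

        // Print the five most frequently seen top frames.
        hits.entrySet().stream()
            .sorted((a, b) -> b.getValue() - a.getValue())
            .limit(5)
            .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```

Sampling keeps overhead low because the profiled code is not modified, at the cost of missing short-lived calls that fall between two samples.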

Page 5: Profile hadoop apps

Available Options

• Sun JDK Tools
  – hprof: Profiler (uses JVMTI)
  – jmap: Dumps the heap (memory map)
  – jhat: Analyzes a heap dump
  – jstack: Provides a thread dump
  – jvisualvm: GUI-based profile data analyzer

• Open Source
  – VisualVM (same as jvisualvm but downloaded as an independent app)
    • Uses HPROF internally for profiling. Provides a GUI for analysis of heap dumps and profiler outputs
  – NetBeans Profiler
    • Similar to VisualVM but integrated into the IDE
  – Eclipse MAT (Memory Analyzer Tool)
    • Can load .hprof files

• Commercial
  – YourKit
  – JProfiler

Page 6: Profile hadoop apps

USING HPROF

Page 7: Profile hadoop apps


Official hprof Documentation

usage: java -Xrunhprof:[help]|[<option>=<value>, ...]

Option Name and Value  Description                     Default
---------------------  -----------                     -------
heap=dump|sites|all    heap profiling                  all
cpu=samples|times|old  CPU usage                       off
monitor=y|n            monitor contention              n
format=a|b             text(txt) or binary output      a
file=<file>            write data to file              java.hprof[.txt]
depth=<size>           stack trace depth               4
interval=<ms>          sample interval in ms           10
cutoff=<value>         output cutoff point             0.0001
lineno=y|n             line number in traces?          y
thread=y|n             thread in traces?               n
doe=y|n                dump on exit?                   y
msa=y|n                Solaris micro state accounting  n
force=y|n              force output to <file>          y
verbose=y|n            print messages about dumps      y

http://docs.oracle.com/javase/7/docs/technotes/samples/hprof.html

Page 8: Profile hadoop apps


Sample hprof usage

• To measure CPU usage, try the following:
  java -Xrunhprof:cpu=samples,depth=6,heap=dump

• Settings:
  – Takes samples of CPU execution
  – Records call traces that include the last 6 levels on the stack
  – Dumps the heap map (bigger file size but helps in finding problems)

• Creates the file java.hprof.txt in the current directory
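
The option string is a plain comma-separated list of name=value pairs, and a stray space inside it splits the JVM argument. The helper below is a small sketch that assembles such a string and rejects whitespace. HprofArgs is a hypothetical name, and the -agentlib:hprof= prefix only works on JDKs that still ship the hprof agent (it was removed in JDK 9).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds an hprof agent argument from option pairs (hypothetical helper). */
final class HprofArgs {

    /** Joins the options as name=value pairs separated by commas, in insertion order. */
    static String build(Map<String, String> options) {
        StringJoiner joined = new StringJoiner(",", "-agentlib:hprof=", "");
        for (Map.Entry<String, String> e : options.entrySet()) {
            String pair = e.getKey() + "=" + e.getValue();
            // A space would end the JVM argument early, so fail fast instead.
            if (pair.chars().anyMatch(Character::isWhitespace)) {
                throw new IllegalArgumentException("whitespace not allowed in: '" + pair + "'");
            }
            joined.add(pair);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("cpu", "samples");
        opts.put("depth", "6");
        opts.put("heap", "dump");
        System.out.println(build(opts)); // prints -agentlib:hprof=cpu=samples,depth=6,heap=dump
    }
}
```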

Page 9: Profile hadoop apps

HPROF with Hadoop

• Hadoop uses hprof as the default profiler

• Profiling related parameters

Purpose                             JobConf API              Command line parameter
----------------------------------  -----------------------  ---------------------------
Enable profiling                    setProfileEnabled(true)  mapred.task.profile=true
Additional parameters for profiler  setProfileParams(…)      mapred.task.profile.params
Range of sampled tasks to profile   setProfileTaskRange      mapred.task.profile.maps
                                                             mapred.task.profile.reduces
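
A range value such as 0-2 selects the tasks with indices 0, 1 and 2. The sketch below expands such a string into the task indices it covers. It is a simplified illustration, not Hadoop's own parser: the class name TaskRange is hypothetical, and open-ended bounds are not handled here.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Simplified illustration of range strings such as "0-2" (not Hadoop's own parser). */
final class TaskRange {

    /** Expands comma-separated entries like "0-2" or "5" into the indices they cover. */
    static SortedSet<Integer> expand(String spec) {
        SortedSet<Integer> ids = new TreeSet<>();
        for (String part : spec.split(",")) {
            String[] bounds = part.trim().split("-", 2);
            int lo = Integer.parseInt(bounds[0].trim());
            // A single number means a range of one task.
            int hi = bounds.length == 2 ? Integer.parseInt(bounds[1].trim()) : lo;
            for (int i = lo; i <= hi; i++) {
                ids.add(i);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(expand("0-2")); // prints [0, 1, 2]
        System.out.println(expand("0-0")); // prints [0]
    }
}
```

Profiling only a few tasks keeps the overhead and the amount of collected output small while still giving a representative sample.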

Page 10: Profile hadoop apps

Example

• Using Java API

jobConf.setProfileEnabled(true);
jobConf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites" +
    ",depth=4,thread=y,file=%s");
jobConf.setProfileTaskRange(true, "0-2");   // map tasks 0 to 2
jobConf.setProfileTaskRange(false, "0-1");  // reduce tasks 0 to 1

• Using Command line parameters

hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
  -Dmapred.task.profile=true \
  -Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=all,depth=4,thread=y,file=%s \
  -Dmapred.task.profile.maps=0-2 \
  -Dmapred.task.profile.reduces=0-1 \
  input output

Page 11: Profile hadoop apps

Collecting Profiler Output

• Hadoop JobClient automatically downloads profile logs from all the profiled tasks
  – If the output format type is not specified, hprof creates profile output in text format (format=a)

• Profiler outputs are also available via the History WebUI

• You can also download profile output using curl

curl -o attempt_201305161037_0004_m_000000_0.hprof \
  "http://17.115.13.191:50060/tasklog?plaintext=true&attemptid=attempt_201305161037_0004_m_000000_0&filter=profile"
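
When cpu=samples is enabled, the text output (format=a) contains a CPU SAMPLES section that ranks methods by how often they were seen. The sketch below extracts that section, assuming the column layout shown in the hprof documentation (rank, self, accum, count, trace, method). The class name HprofCpuSamples and the two sample rows are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Reads the CPU SAMPLES section of hprof text output (sketch; assumes the documented column order). */
final class HprofCpuSamples {

    /** One ranked row of the CPU SAMPLES table. */
    static final class Row {
        final int rank;
        final double selfPercent;
        final int count;
        final String method;

        Row(int rank, double selfPercent, int count, String method) {
            this.rank = rank;
            this.selfPercent = selfPercent;
            this.count = count;
            this.method = method;
        }
    }

    /** Returns the rows found between "CPU SAMPLES BEGIN" and "CPU SAMPLES END". */
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        boolean inSection = false;
        for (String line : lines) {
            String t = line.trim();
            if (t.startsWith("CPU SAMPLES BEGIN")) { inSection = true; continue; }
            if (t.startsWith("CPU SAMPLES END")) { break; }
            // Ignore everything outside the section, blank lines and the column header.
            if (!inSection || t.isEmpty() || t.startsWith("rank")) { continue; }
            String[] f = t.split("\\s+");
            if (f.length < 6) { continue; }
            rows.add(new Row(Integer.parseInt(f[0]),
                             Double.parseDouble(f[1].replace("%", "")),
                             Integer.parseInt(f[3]),
                             f[5]));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Invented sample input that follows the assumed layout.
        List<String> sample = Arrays.asList(
            "CPU SAMPLES BEGIN (total = 100) Mon Jan  1 00:00:00 2015",
            "rank   self  accum   count trace method",
            "   1 60.00% 60.00%      60 300001 com.example.WordCount$Map.map",
            "   2 40.00% 100.00%     40 300002 com.example.WordCount$Reduce.reduce",
            "CPU SAMPLES END");
        for (Row row : parse(sample)) {
            System.out.println(row.rank + ". " + row.method + " (" + row.count + " samples)");
        }
    }
}
```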

Page 12: Profile hadoop apps

Task User Log

Page 13: Profile hadoop apps

Analyze Profiler output

• You can use VisualVM, NetBeans profiler or YourKit for analyzing the profiling data.
  – These tools support only the binary format of hprof output (i.e. option format=b)

• Example
  – Run profiler with Hadoop job

hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
  -Dmapred.task.profile=true \
  -Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=all,depth=4,thread=y,format=b,file=%s \
  input output

  – Load profiler output using VisualVM menu option

Page 14: Profile hadoop apps

Analyze Profile Output in VisualVM

Page 15: Profile hadoop apps

Object Query Language

• VisualVM and jhat support a special query language (OQL) to query the Java heap.
  – Example: Select all Strings with length 1K or more

select s from java.lang.String s where s.count >= 1024

• More information about OQL is available at http://visualvm.java.net/oqlhelp.html

Page 16: Profile hadoop apps

Analyze Profile Output in Eclipse MAT

Page 17: Profile hadoop apps

Profiling Pig Jobs

• Use Hadoop command line parameters

pig -Dmapred.task.profile=true \
  -Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=sites,thread=y,verbose=n \
  -Dmapred.task.profile.maps=0-2 \
  -Dmapred.task.profile.reduces=0-0 \
  mypigscript.pig

• More information about Pig job profiling is available at Pig Wiki
  – https://cwiki.apache.org/PIG/howtoprofile.html

Page 18: Profile hadoop apps

Profiling Hive Queries

• Set appropriate Hadoop parameters before submitting the queries

hive> set mapred.task.profile=true;

hive> set mapred.task.profile.params=-agentlib:hprof=heap=dump,format=b,file=%s;

hive> set mapred.task.profile.maps=0-2;

hive> set mapred.task.profile.reduces=0-0;

hive>

hive> <hive query>

Page 19: Profile hadoop apps

USING YOURKIT

Page 20: Profile hadoop apps

YourKit Profiler - Summary

• Commercial Java Profiling Tool

– Free trial and Open Source licenses are available

• Used by many Open Source projects including Hadoop, Pig, Hive etc.

• Features

– On-Demand Profiling

– CPU, Memory and Concurrency profiling methods

– Has IDE integration (Eclipse, NetBeans, IntelliJ)

– Above all, has relatively low performance overhead

Page 21: Profile hadoop apps

Using YourKit Profiler

• You will need to install YourKit profiler (just the profiler lib) on each TaskTracker

• Tell Hadoop to use a different profiler

hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
  -Dmapred.task.profile=true \
  -Dmapred.task.profile.params=-agentpath:<yourkit_path>/libyjpagent.jnilib=dir=/tmp/yourkit_snapshot,sampling,disablej2ee \
  -Dmapred.task.profile.maps=0-2 \
  -Dmapred.task.profile.reduces=0-1 \
  input output

• Theoretically, you can also use DistributedCache to make binaries available on TaskTracker machines
  – Though, I did not have success with this

Page 22: Profile hadoop apps

Small Glitch

• Hadoop JobClient.waitForCompletion(…) will throw an error since profile logs are not available in the default directory.

• However, the job will continue to run successfully.

• To avoid this, you can instead use the mapred.child.java.opts option to specify the profiling parameters

Page 23: Profile hadoop apps

YourKit to Analyze Jobs

• Can analyze profile output from both YourKit Profiler and hprof/jmap.

Page 24: Profile hadoop apps

OTHER TOOLS

Page 25: Profile hadoop apps

Using other Tools

• JDK Tool ‘jmap’
  – Can be used for capturing a heap dump of a running Java process, for later analysis inside VisualVM or YourKit
    • $ jmap -dump:live,format=b,file=xyz.hprof <jvm-pid>
    • Don’t run jmap with the -histo:live option on the JobTracker (JT) or NameNode (NN)
  – A Java process can also be instructed to generate an hprof dump of the heap in case of OutOfMemoryError
    • -XX:+HeapDumpOnOutOfMemoryError

• JDK Tool ‘jhat’
  – Can read a heap dump in hprof format and provides a lightweight web interface to analyze profiler output
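
A heap dump in the same hprof binary format can also be triggered from inside the process. The sketch below uses the JDK's HotSpotDiagnosticMXBean, which is available on HotSpot-based JVMs only. The class name HeapDumper and the output location are assumptions made for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

/** Writes a heap dump of the current JVM (sketch; HotSpot-based JVMs only). */
final class HeapDumper {

    /**
     * Dumps the heap to `target` in hprof binary format.
     * With live=true only reachable objects are written, similar to jmap -dump:live.
     * The target file must not exist yet, otherwise the JVM refuses to write it.
     */
    static File dump(File target, boolean live) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(target.getAbsolutePath(), live);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    public static void main(String[] args) {
        // Unique file name in the temp directory; the .hprof suffix is what analysis tools expect.
        File out = new File(System.getProperty("java.io.tmpdir"),
                            "sketch-" + System.nanoTime() + ".hprof");
        dump(out, true);
        System.out.println(out + " (" + out.length() + " bytes)");
    }
}
```

The resulting file can be opened with the same analysis tools as a jmap dump.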

Page 26: Profile hadoop apps

Other Tools (Cont…)

• Hadoop Vaidya (Simple Diagnostic Tool)

– Identifies common performance problems related to Hadoop jobs (unbalanced partitioning, granularity of tasks, combiners etc.)

– Works only at the Hadoop job level (does not understand the specifics of Hive/Pig)

Page 27: Profile hadoop apps

Other Recommendation

• If possible try running Hadoop (MR/Pig/Hive) in local mode using LocalJobRunner

– LocalJobRunner runs the entire MapReduce job in a single JVM

– It simplifies profiling and log collection

– Can also be used for attaching debugger from IDE

Page 28: Profile hadoop apps

Resources

• Troubleshooting Java applications
  – http://www.oracle.com/technetwork/java/javase/toc-135973.html

• Profiling a Hadoop job (Chapter 5 of “Hadoop: The Definitive Guide”)
  – http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/tuning-a-job/id3545664

• Profiling Pig jobs
  – https://cwiki.apache.org/PIG/howtoprofile.html

• Official hprof documentation
  – http://docs.oracle.com/javase/7/docs/technotes/samples/hprof.html

• YourKit Profiler
  – http://www.yourkit.com