Profile hadoop apps

Profiling Hadoop Applications

Basant Verma

Agenda

• Profiling General Background

• Available Options

• Profile using Free and Open Source tools

• Profile using YourKit

• Other troubleshooting tools

What does Profiling Provide?

• Profiling runtime / CPU usage:
  – which lines of code the program spends the most time in
  – which call/invocation paths were used to get to these lines
    • naturally represented as tree structures
• Profiling memory usage:
  – what kinds of objects are sitting on the heap
  – where they were allocated
  – who is pointing to them now
  – memory leaks

Profiler Types and Components

• Components needed for profiling
  – Profiling Agent
    • Collects profiled data (samples, traces, exceptions, etc.)
  – Analysis Tool
    • Provides an interface for analyzing profiled data and helps the user identify potential problems
• Types of Profilers
  – insertion
  – sampling
  – instrumenting

Available Options

• Sun JDK Tools
  – hprof: profiler (uses JVMTI)
  – jmap: dumps the heap (memory map)
  – jhat: analyzes heap dumps
  – jstack: provides thread dumps
  – jvisualvm: GUI-based profile data analyzer
• Open Source
  – VisualVM (same as jvisualvm, but downloaded as an independent app)
    • Uses HPROF internally for profiling. Provides a GUI for analysis of heap dumps and profiler output
  – NetBeans Profiler
    • Similar to VisualVM but integrated into the IDE
  – Eclipse MAT (Memory Analyzer Tool)
    • Can load .hprof files
• Commercial
  – YourKit
  – JProfiler

USING HPROF


Official hprof Documentation

usage: java -Xrunhprof:[help]|[<option>=<value>, ...]

Option Name and Value   Description                      Default
---------------------   -----------                      -------
heap=dump|sites|all     heap profiling                   all
cpu=samples|times|old   CPU usage                        off
monitor=y|n             monitor contention               n
format=a|b              text(txt) or binary output       a
file=<file>             write data to file               java.hprof[.txt]
net=<host>:<port>       send data over a socket          off
depth=<size>            stack trace depth                4
interval=<ms>           sample interval in ms            10
cutoff=<value>          output cutoff point              0.0001
lineno=y|n              line number in traces?           y
thread=y|n              thread in traces?                n
doe=y|n                 dump on exit?                    y
msa=y|n                 Solaris micro state accounting   n
force=y|n               force output to <file>           y
verbose=y|n             print messages about dumps       y

http://docs.oracle.com/javase/7/docs/technotes/samples/hprof.html


Sample hprof usage

• To measure CPU usage, try the following:
  java -Xrunhprof:cpu=samples,depth=6,heap=dump <main-class>
• Settings:
  – Takes samples of CPU execution
  – Records call traces that include the last 6 levels on the stack
  – Dumps the heap (bigger file size, but helps in finding problems)

• Creates the file java.hprof.txt in the current directory
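Once java.hprof.txt exists, the CPU hot spots can be pulled out of it with a few lines of code. The sketch below assumes the text layout shown in the hprof documentation: a block delimited by "CPU SAMPLES BEGIN" and "CPU SAMPLES END" with the columns rank, self, accum, count, trace, method. The class name `HprofCpuSamples` is hypothetical, and the parser should be treated as a starting point rather than a complete reader of the format.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Extracts the CPU SAMPLES table from hprof text output (format=a).
class HprofCpuSamples {

    /** One data row of the CPU SAMPLES table. */
    static final class Row {
        final int rank;
        final double selfPercent;
        final int count;
        final String method;

        Row(int rank, double selfPercent, int count, String method) {
            this.rank = rank;
            this.selfPercent = selfPercent;
            this.count = count;
            this.method = method;
        }
    }

    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        boolean inSection = false;
        for (String line : lines) {
            if (line.startsWith("CPU SAMPLES BEGIN")) { inSection = true; continue; }
            if (line.startsWith("CPU SAMPLES END")) break;
            if (!inSection) continue;
            // Assumed columns: rank self accum count trace method
            String[] f = line.trim().split("\\s+");
            if (f.length < 6 || !f[0].matches("\\d+")) continue; // skips the header row
            rows.add(new Row(Integer.parseInt(f[0]),
                    Double.parseDouble(f[1].replace("%", "")),
                    Integer.parseInt(f[3]),
                    f[5]));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "java.hprof.txt";
        for (Row r : parse(Files.readAllLines(Paths.get(path)))) {
            System.out.printf("%2d  %6.2f%%  %6d  %s%n", r.rank, r.selfPercent, r.count, r.method);
        }
    }
}
```

Run it from the directory containing java.hprof.txt, or pass the path as the first argument.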

HPROF with Hadoop

• Hadoop uses hprof as the default profiler

• Profiling related parameters

Purpose                              JobConf API               Command-line parameter
-----------------------------------  ------------------------  ---------------------------
Enable profiling                     setProfileEnabled(true)   mapred.task.profile=true
Additional parameters for profiler   setProfileParams(…)       mapred.task.profile.params
Range of sampled tasks to profile    setProfileTaskRange(…)    mapred.task.profile.maps
                                                               mapred.task.profile.reduces

Example

• Using Java API

jobConf.setProfileEnabled(true);
jobConf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites"
    + ",depth=4,thread=y,file=%s");
jobConf.setProfileTaskRange(true, "0-2");   // map tasks 0 to 2
jobConf.setProfileTaskRange(false, "0-1");  // reduce tasks 0 to 1

• Using command line parameters

hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
  -Dmapred.task.profile=true \
  -Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=all,depth=4,thread=y,file=%s \
  -Dmapred.task.profile.maps=0-2 \
  -Dmapred.task.profile.reduces=0-1 \
  input output
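Two details in the examples above are easy to misread: the `file=%s` placeholder and the task ranges such as `0-2`. The sketch below illustrates how they are meant to behave, under the assumption that the framework substitutes a per-attempt output path for `%s` in the style of `String.format` and profiles only tasks whose index falls in the range. `ProfileParamsSketch` is a hypothetical helper, not Hadoop code, and it handles only closed ranges like `0-2` or `0-1,4`.

```java
// Illustration only: mimics how profile parameters and task ranges are interpreted.
class ProfileParamsSketch {

    /** Returns true if taskIndex falls in a range spec such as "0-2" or "0-1,4". */
    static boolean inRange(String spec, int taskIndex) {
        for (String part : spec.split(",")) {
            String[] bounds = part.trim().split("-");
            int lo = Integer.parseInt(bounds[0].trim());
            int hi = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : lo;
            if (taskIndex >= lo && taskIndex <= hi) return true;
        }
        return false;
    }

    /** Fills the %s placeholder with the (assumed) per-attempt profile output path. */
    static String agentOption(String params, String profileFile) {
        return String.format(params, profileFile);
    }

    public static void main(String[] args) {
        String params = "-agentlib:hprof=cpu=samples,heap=sites,depth=4,thread=y,file=%s";
        for (int task = 0; task < 5; task++) {
            if (inRange("0-2", task)) {
                // "/tmp/profile-N.out" is a made-up path for demonstration.
                System.out.println("task " + task + ": "
                        + agentOption(params, "/tmp/profile-" + task + ".out"));
            } else {
                System.out.println("task " + task + ": not profiled");
            }
        }
    }
}
```

Keeping the range small matters because every profiled task pays the hprof overhead; a handful of representative map and reduce tasks is usually enough to see where the time goes.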

Collecting Profiler Output

• Hadoop JobClient automatically downloads profile logs from all the profiled tasks
  – If the output format is not specified, hprof creates profile output in text format (format=a)
• Profiler outputs are also available via the History WebUI
• You can also download profile output using curl
  – curl -o attempt_201305161037_0004_m_000000_0.hprof "http://17.115.13.191:50060/tasklog?plaintext=true&attemptid=attempt_201305161037_0004_m_000000_0&filter=profile"

Task User Log
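The task log URL used in the curl example follows a fixed pattern, so fetching profiles for several attempts is easy to script. A minimal sketch, assuming the TaskTracker address and attempt IDs are known; `TaskProfileFetcher` is a hypothetical helper and the URL layout is copied from the curl example rather than from any API contract.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Downloads the profile log of one or more task attempts from a TaskTracker.
class TaskProfileFetcher {

    /** Builds the tasklog URL for the profile output of one task attempt. */
    static String profileUrl(String trackerHostPort, String attemptId) {
        return "http://" + trackerHostPort
                + "/tasklog?plaintext=true&attemptid=" + attemptId + "&filter=profile";
    }

    /** Saves the profile output as <attemptId>.hprof inside dir. */
    static Path download(String trackerHostPort, String attemptId, Path dir) throws IOException {
        Path target = dir.resolve(attemptId + ".hprof");
        try (InputStream in = URI.create(profileUrl(trackerHostPort, attemptId)).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TaskProfileFetcher <tracker-host:port> <attempt-id>...
        for (int i = 1; i < args.length; i++) {
            System.out.println("saved " + download(args[0], args[i], Paths.get(".")));
        }
    }
}
```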

Analyze Profiler output

• You can use VisualVM, NetBeans Profiler or YourKit for analyzing the profiling data.
  – These tools support only the binary format of hprof output (i.e. option format=b)
• Example
  – Run the profiler with a Hadoop job

    -Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=all,depth=4,thread=y,format=b,file=%s \
    input output

  – Load the profiler output using the VisualVM menu option

Analyze Profile Output in VisualVM

Object Query Language

• VisualVM and jhat support a special query language (OQL) for querying the Java heap.
  – Example: select all Strings with a length of 1K or more

    select s from java.lang.String s where s.count >= 1024

• More information about OQL is available at http://visualvm.java.net/oqlhelp.html


Analyze Profile Output in Eclipse MAT

Profiling Pig Jobs

• Use Hadoop command line parameters

  pig -Dmapred.task.profile=true \
    -Dmapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=sites,thread=y,verbose=n \
    mypigscript.pig

• More information about Pig job profiling is available on the Pig Wiki
  – https://cwiki.apache.org/PIG/howtoprofile.html


Profiling Hive Queries

• Set appropriate Hadoop parameters before submitting the queries

hive> set mapred.task.profile=true;

hive> set mapred.task.profile.params=-agentlib:hprof=heap=dump,format=b,file=%s;

hive> set mapred.task.profile.maps=0-2;

hive> set mapred.task.profile.reduces=0-0;

hive>

hive> <hive query>

USING YOURKIT

YourKit Profiler - Summary

• Commercial Java profiling tool
  – Free trial and open source licenses are available
• Used by many open source projects, including Hadoop, Pig and Hive
• Features
  – On-demand profiling
  – CPU, memory and concurrency profiling
  – IDE integration (Eclipse, NetBeans, IntelliJ)
  – Above all, relatively low performance overhead

Using YourKit Profiler

• You will need to install the YourKit profiler (just the profiler library) on each TaskTracker
• Tell Hadoop to use a different profiler

  -Dmapred.task.profile.params=-agentpath:<yourkit_path>/libyjpagent.jnilib=dir=/tmp/yourkit_snapnshot,sampling,disablej2ee \
  input output

• Theoretically, you can also use DistributedCache to make the binaries available on the TaskTracker machines
  – Though I did not have success with this

Small Glitch

• Hadoop's JobClient.waitForCompletion(…) will throw an error because the profile logs are not available in the default directory.
• However, the job will continue to run successfully.
• To avoid this, you can instead use the mapred.child.java.opts option to specify the profiling parameters

YourKit to Analyze Jobs

• Can analyze profile output from both the YourKit Profiler and hprof/jmap.

OTHER TOOLS

Using other Tools

• JDK Tool ‘jmap’
  – Can be used to capture a heap dump of a running Java process, which can later be analyzed in VisualVM or YourKit
    • $ jmap -dump:live,format=b,file=xyz.hprof <jvm-pid>
    • Don’t run jmap with the -histo:live option on the JT (JobTracker) or NN (NameNode)
  – A Java process can also be instructed to generate an hprof heap dump in case of an OutOfMemoryError
    • -XX:+HeapDumpOnOutOfMemoryError
• JDK Tool ‘jhat’
  – Can read a heap dump in hprof format and provides a lightweight web interface for analyzing it
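When attaching jmap from outside is inconvenient, a process can also write the same kind of binary hprof dump from inside itself. The sketch below uses the HotSpot-specific `HotSpotDiagnosticMXBean`, so it assumes a HotSpot-based JDK; the class name `HeapDumper` and the output file name are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

// In-process counterpart of: jmap -dump:live,format=b,file=<path> <jvm-pid>
class HeapDumper {

    /**
     * Writes a binary hprof heap dump of the current JVM.
     * liveOnly=true dumps only reachable objects, like jmap's "live" option.
     * The target file must not exist yet.
     */
    static File dump(String path, boolean liveOnly) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(path, liveOnly);
            return new File(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String path = args.length > 0 ? args[0] : "heap-" + System.currentTimeMillis() + ".hprof";
        File out = dump(path, true);
        System.out.println("wrote " + out.getAbsolutePath() + " (" + out.length() + " bytes)");
    }
}
```

The resulting file can be opened in VisualVM, Eclipse MAT or YourKit in the same way as a jmap dump.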

Other Tools (Cont…)

• Hadoop Vaidya (Simple Diagnostic Tool)

– Identifies common performance problems in Hadoop jobs (unbalanced partitioning, granularity of tasks, combiners, etc.)

– Works only at the level of the Hadoop job (it does not understand the specifics of Hive/Pig)

Other Recommendations

• If possible try running Hadoop (MR/Pig/Hive) in local mode using LocalJobRunner

– LocalJobRunner runs the entire MapReduce job in a single JVM

– It simplifies profiling and log collection

– Can also be used for attaching debugger from IDE

Resources

• Troubleshooting Java applications
  – http://www.oracle.com/technetwork/java/javase/toc-135973.html
• Profiling a Hadoop job (Chapter 5 of “Hadoop: The Definitive Guide”)
  – http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/tuning-a-job/id3545664
• Profiling Pig jobs
  – https://cwiki.apache.org/PIG/howtoprofile.html
• Official ‘hprof’ documentation
  – http://docs.oracle.com/javase/7/docs/technotes/samples/hprof.html
• YourKit Profiler
  – http://www.yourkit.com

