
myHadoop - Hadoop-on-Demand on Traditional HPC Resources

May 26, 2015


Sriram Krishnan, Ph.D. ([email protected])

Slides from "Hadoop-on-Demand on Traditional HPC Resources," presented at the UC Cloud Summit 2011 (http://www.ucgrid.org/cloud2011/UCCloudSummit2011.html).
Transcript
Page 1: myHadoop - Hadoop-on-Demand on Traditional HPC Resources

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

myHadoop - Hadoop-on-Demand on Traditional HPC Resources

Sriram Krishnan, [email protected]

Page 2: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Acknowledgements

• Mahidhar Tatineni
• Chaitanya Baru
• Jim Hayes
• Shava Smallen

Page 3: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Outline

• Motivations
• Technical Challenges
• Implementation Details
• Performance Evaluation

Page 4: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Motivations

• An open source tool for running Hadoop jobs on HPC resources
  • Easy to configure and use for the end-user
  • Plays nicely with existing batch systems on HPC resources

• Why do we need such a tool?
  • End-users: "I already have Hadoop code, and I only have access to regular HPC-style resources"
  • Computer scientists: "I want to study the implications of using Hadoop on HPC resources"
  • "And I don't have root access to these resources"

Page 5: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Some Ground Rules

• What this presentation is:
  • A "how-to" for running Hadoop jobs on HPC resources using myHadoop
  • A description of the performance implications of using myHadoop

• What this presentation is not:
  • Propaganda for the use of Hadoop on HPC resources

Page 6: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Main Challenges

• Shared-nothing (Hadoop) versus HPC-style architectures
  • In terms of both philosophy and implementation

• Control and co-existence of Hadoop and HPC batch systems
  • Typically, both Hadoop and HPC batch systems (viz., SGE, PBS) need complete control over the resources for scheduling purposes

Page 7: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Traditional HPC Architecture

[Diagram: compute cluster with minimal local storage, backed by a parallel file system]

Shared-nothing (MapReduce-style) Architectures

[Diagram: compute/data cluster with local storage, interconnected over Ethernet]

Page 8: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Hadoop and HPC Batch Systems

• Access to HPC resources is typically via batch systems, viz. PBS, SGE, Condor, etc.
  • These systems have complete control over the compute resources
  • Users typically can't log in directly to the compute nodes (via ssh) to start various daemons

• Hadoop manages its resources using its own set of daemons
  • NameNode & DataNode for the Hadoop Distributed File System (HDFS)
  • JobTracker & TaskTracker for MapReduce jobs

• Hadoop daemons and batch systems can't co-exist seamlessly
  • They will interfere with each other's scheduling algorithms (see the sketch below)
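The way myHadoop reconciles the two is to confine the Hadoop daemons to the nodes that the batch system has already granted to the job. The following is a minimal sketch of that idea, assuming PBS (which lists the allocated nodes in $PBS_NODEFILE) and a per-job $HADOOP_CONF_DIR; it is illustrative, not myHadoop's actual code.

#!/bin/bash
# Minimal sketch (not myHadoop's actual code): derive Hadoop's node
# configuration from the nodes PBS allocated to this job, so the Hadoop
# daemons only ever run inside the batch allocation.
# Assumes PBS sets $PBS_NODEFILE and that $HADOOP_CONF_DIR points to a
# per-user, per-job copy of the Hadoop configuration directory.

sort -u "$PBS_NODEFILE" > "$HADOOP_CONF_DIR/slaves"              # DataNode/TaskTracker hosts
head -n 1 "$HADOOP_CONF_DIR/slaves" > "$HADOOP_CONF_DIR/masters" # NameNode/JobTracker host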

Page 9: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


myHadoop Requirements

1. Enabling execution of Hadoop jobs on shared HPC resources via traditional batch systems
   a) Working with a variety of batch systems (PBS, SGE, etc.)

2. Allowing users to run Hadoop jobs without needing root-level access

3. Enabling multiple users to simultaneously execute Hadoop jobs on the shared resource

4. Allowing users to either run a fresh Hadoop instance each time (a), or store HDFS state for future runs (b) (see the sketch below)
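In practice, requirement 4 comes down to where the HDFS state directory lives. The snippet below is a hedged sketch of that choice, not myHadoop's actual interface; the MODE variable and the directory paths (node-local /scratch, a Lustre/Data Oasis path) are illustrative assumptions.

#!/bin/bash
# Hedged sketch of requirement 4: choose where HDFS keeps its state.
# MODE and the directory paths are illustrative, not myHadoop's exact interface.

MODE=non-persistent   # or: persistent

if [ "$MODE" = "non-persistent" ]; then
    # 4(a): fresh Hadoop instance per job; HDFS lives on node-local scratch
    # and disappears when the batch job ends.
    HDFS_BASE_DIR=/scratch/$USER/$PBS_JOBID
else
    # 4(b): persistent mode; HDFS state lives on the parallel file system
    # (e.g., a Lustre directory) so later jobs can reuse it.
    HDFS_BASE_DIR=/oasis/projects/$USER/hdfs
fi

mkdir -p "$HDFS_BASE_DIR"
# The chosen path would then be written into dfs.name.dir / dfs.data.dir in
# the job's private hdfs-site.xml during the bootstrap step.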

Page 10: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


myHadoop Architecture

[Architecture diagram: compute nodes running the Hadoop daemons are allocated through the batch processing system (PBS, SGE), alongside a parallel file system. The labels [1], [2, 3], [4(a)], and [4(b)] map the components to the requirements on the previous slide: non-persistent mode keeps HDFS on the compute nodes, persistent mode keeps it on the parallel file system.]

Page 11: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Implementation Details: PBS, SGE

Page 12: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


User Workflow

[Workflow diagram: BOOTSTRAP a per-job Hadoop instance, run Hadoop jobs against it, then TEARDOWN.]
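End to end, this workflow fits in a single batch script: bootstrap a private Hadoop instance on the allocated nodes, run ordinary Hadoop commands against it, and tear it down before the job exits. Below is a hedged sketch of such a PBS script; the script names (pbs-configure.sh, pbs-cleanup.sh), install paths, and the wordcount example are assumptions that should be checked against the installed myHadoop and Hadoop versions.

#!/bin/bash
#PBS -N myhadoop-example
#PBS -l nodes=4:ppn=8
#PBS -l walltime=01:00:00
# Hedged sketch of the user workflow; paths and script names are assumptions.

export HADOOP_HOME=/opt/hadoop-0.20.2                 # assumed site install path
export MY_HADOOP_HOME=/opt/myhadoop                   # assumed site install path
export HADOOP_CONF_DIR=$HOME/myhadoop-conf/$PBS_JOBID # per-job config directory

# BOOTSTRAP: generate a per-job Hadoop configuration for the allocated nodes
# and start the HDFS/MapReduce daemons (persistent mode would add extra flags).
$MY_HADOOP_HOME/bin/pbs-configure.sh -n 4 -c $HADOOP_CONF_DIR
$HADOOP_HOME/bin/start-all.sh

# RUN: ordinary Hadoop commands against the private instance
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR dfs -copyFromLocal $HOME/input input
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR jar \
    $HADOOP_HOME/hadoop-0.20.2-examples.jar wordcount input output
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR dfs -copyToLocal output $HOME/output

# TEARDOWN: stop the daemons and clean up node-local state
$HADOOP_HOME/bin/stop-all.sh
$MY_HADOOP_HOME/bin/pbs-cleanup.sh -n 4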

Page 13: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Performance Evaluation

• Goals and non-goals
  • Study the performance overhead and implications of myHadoop
  • Not to optimize/improve existing Hadoop code

• Software and hardware
  • Triton Compute Cluster (http://tritonresource.sdsc.edu/)
  • Triton Data Oasis (Lustre-based parallel file system) for data storage, and for HDFS in "persistent mode"
  • Apache Hadoop version 0.20.2, with various parameters tuned for performance on Triton

• Applications
  • Compute-intensive: HadoopBlast (Indiana University)
    • Modest-sized inputs: 128 query sequences (70K each)
    • Compared against the NR database (200MB in size)
  • Data-intensive: Data Selections (OpenTopography Facility at SDSC)
    • Input size from 1GB to 100GB
    • Sub-selecting around 10% of the entire dataset (see the sketch below)
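The OpenTopography data-selection code itself is not shown in the talk; purely to illustrate what a "sub-select a fraction of the points" job can look like under myHadoop, here is a hedged sketch using Hadoop streaming with a map-only bounding-box filter. The column layout, coordinate bounds, and file names are invented for the example.

#!/bin/bash
# Illustration only: a map-only Hadoop streaming job that keeps the points
# falling inside a bounding box. Column layout, bounds, and paths are made up.

cat > select_bbox.sh <<'EOF'
#!/bin/bash
# Mapper: pass through records whose x (column 1) and y (column 2) are in the box.
awk '$1 > 480000 && $1 < 490000 && $2 > 3610000 && $2 < 3620000'
EOF
chmod +x select_bbox.sh

$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR jar \
    $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -input lidar/points.txt -output lidar/selection \
    -mapper "bash select_bbox.sh" -file select_bbox.sh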

Page 14: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


HadoopBlast

Page 15: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Data Selections

Page 16: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Related Work

• Recipe for running Hadoop over PBS in the blogosphere
  • http://jaliyacgl.blogspot.com/2008/08/hadoop-as-batch-job-using-pbs.html
  • myHadoop is "inspired" by their approach, but is more general-purpose and configurable

• Apache Hadoop On Demand (HOD)
  • http://hadoop.apache.org/common/docs/r0.17.0/hod.html
  • Only PBS support, needs external HDFS, harder to use, and has trouble with multiple concurrent Hadoop instances

• CloudBatch: a batch queuing system on clouds
  • Uses Hadoop to run batch systems like PBS
  • The exact opposite of our goals, but a similar approach

Page 17: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Center for Large-Scale Data Systems Research (CLDS)

• An industry-university consortium on software for large-scale data systems
  • Industry Advisory Board and Academic Advisory Board
  • Benchmarking, performance evaluation, and systems development projects
  • Industry forums and professional education
  • "How Much Information?" project (public, private, personal)
  • Visiting fellows

• Focus areas: information metrology (data growth, information management), cloud storage architecture, cloud storage and performance benchmarking, industry interchange management and technical forums

• Student internships
• Joint collaborations

Page 18: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Summary

• myHadoop: an open source tool for running Hadoop jobs on HPC resources
  • Without need for root-level access
  • Co-exists with traditional batch systems
  • Allows "persistent" and "non-persistent" modes to save HDFS state across runs
  • Tested on SDSC Triton, TeraGrid, and UC Grid resources

• More information
  • Software: https://sourceforge.net/projects/myhadoop/
  • SDSC Tech Report: http://www.sdsc.edu/pub/techreports/SDSC-TR-2011-2-Hadoop.pdf

Page 19: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Questions?

• Email me at [email protected]

Page 20: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Appendix

Page 21: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


core-site.xml:

  io.file.buffer.size                       131072       Size of read/write buffer
  fs.inmemory.size.mb                       650          Size of in-memory FS for merging outputs
  io.sort.mb                                650          Memory limit for sorting data

hdfs-site.xml:

  dfs.replication                           2            Number of times data is replicated
  dfs.block.size                            134217728    HDFS block size in bytes
  dfs.datanode.handler.count                64           Number of handlers to serve block requests

mapred-site.xml:

  mapred.reduce.parallel.copies             4            Number of parallel copies run by reducers
  mapred.tasktracker.map.tasks.maximum      4            Max map tasks to run simultaneously
  mapred.tasktracker.reduce.tasks.maximum   2            Max reduce tasks to run simultaneously
  mapred.job.reuse.jvm.num.tasks            1            Reuse the JVM between tasks
  mapred.child.java.opts                    -Xmx1024m    Large heap size for child JVMs
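For reference, parameters like these end up in the corresponding per-job configuration files. Below is a minimal sketch, assuming the bootstrap step created $HADOOP_CONF_DIR; only the hdfs-site.xml group is shown, and the others follow the same pattern in core-site.xml and mapred-site.xml.

#!/bin/bash
# Minimal sketch: write the tuned HDFS parameters into the job's private
# hdfs-site.xml (assumes $HADOOP_CONF_DIR was created by the bootstrap step).
cat > $HADOOP_CONF_DIR/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>dfs.replication</name><value>2</value></property>
  <property><name>dfs.block.size</name><value>134217728</value></property>
  <property><name>dfs.datanode.handler.count</name><value>64</value></property>
</configuration>
EOF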

Page 22: myHadoop - Hadoop-on-Demand on Traditional HPC Resources


Data Selection Counts on Dash