Hadoop on Palmetto HPC Pengfei Xuan Jan 8, 2015 School of Computing Clemson University

Dec 17, 2015

Jemimah Barker
Hadoop on Palmetto HPC

Pengfei Xuan

Jan 8, 2015

School of Computing, Clemson University

Outline

• Introduction

• Hadoop over Palmetto HPC Cluster

• Hands-on Practice

HPC Cluster vs. Hadoop Cluster

[Figure: HPC cluster vs. Hadoop cluster. Diagram labels: Networking, Compute Node, Data Node, HDD, SSD, RAM.]

HPC Clusters

Forge (NVIDIA GPU Cluster), The National Center for Supercomputing Applications

• 44 GPU Nodes

• 6 or 8 NVIDIA Fermi M2070 GPUs per Node

• 6 GB Graphics Memory per GPU

• 600 TB GPFS File System

• 40 Gb/sec InfiniBand QDR per Node (point-to-point unidirectional speed)

[Figure: InfiniBand switch + 40 Gb InfiniBand adapter + 8 NVIDIA Fermi M2070 GPUs]

Hadoop Clusters

History of Hadoop

• 1998: Google Funded

• 2003: GFS Paper

• 2004: MapReduce Paper; Nutch DFS Impl.

• 2005: Nutch MR Impl.

• 2006: BigTable Paper; Hadoop Project

• 2008: World’s Largest Hadoop

• 2010: Facebook 21 PB Data

• 2011: Microsoft, IBM, Oracle, Twitter, Amazon

• Now: Everywhere, Our Class!

[Photos: Jeffrey Dean, Doug Cutting]

Google vs. Hadoop Infrastructures

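[Figure: Google and Hadoop-based infrastructure stacks. Components are listed per stack in the order they appear in the diagram.]

• Google stack: Dremel, Evenflow, MySQL Gateway, Sawzall, Bigtable, MapReduce / GFS, Chubby

• Hadoop-based stack: Azkaban, Sqoop, Kafka, Pig, Voldemort, Hadoop, Zookeeper

• Hadoop-based stack: HiPal, Databee, Hive, Scribe, HBase, Hadoop, Zookeeper

• Hadoop-based stack: Oozie, Hive, Data Highway, Hive / Pig, HBase, Hadoop, Zookeeper

• Hadoop-based stack: Hue, Crunch, Oozie, Hive, Sqoop, Flume, Hive / Pig, HBase, Hadoop, Zookeeper

Layers in each stack:

I. Data Coordination Layer

II. Data Storage and Computing Layer

III. Data Flow Layer

IV. Data Analysis Layer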

MapReduce Word Count Example

MapReduce as a Unix pipeline:

cat * | grep | sort | uniq -c | cat > file

(input | map | shuffle and sort | reduce | output)
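The pipeline analogy can be made concrete with a local word count that needs no Hadoop at all. This is a minimal sketch: the `wordcount` function name and the sample input are illustrative assumptions, not part of the original slides. Splitting into words plays the role of the map phase, `sort` stands in for the shuffle, and `uniq -c` acts as the reducer.

```shell
#!/bin/sh
# Local word count as a Unix pipeline (illustrative sketch, not Hadoop itself).
# LC_ALL=C makes the sort order byte-wise and therefore reproducible.
LC_ALL=C
export LC_ALL

wordcount() {
    # map:     split on whitespace so each word sits on its own line
    # shuffle: sort brings identical words next to each other
    # reduce:  uniq -c counts each run of identical words
    # The final awk prints "word<TAB>count", the layout Hadoop's
    # WordCount example also uses for its output.
    tr -s '[:space:]' '\n' \
        | grep -v '^$' \
        | sort \
        | uniq -c \
        | awk '{ print $2 "\t" $1 }'
}

# Sample input (made up for this example).
printf 'hadoop hpc hadoop\npalmetto hadoop hpc\n' | wordcount
# prints:
# hadoop	3
# hpc	2
# palmetto	1
```

On a cluster, the map and reduce stages run in parallel on many nodes and the shuffle moves data between them, but the data flow is the same.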

Run Hadoop over Palmetto Cluster

1. Setup Hadoop configuration files

2. Start Hadoop services

3. Copy input files to HDFS (stage-in)

4. Run Hadoop job (MapReduce WordCount)

5. Copy output files from HDFS to your home directory (stage-out)

6. Stop Hadoop services

7. Clean up
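The seven steps above might look roughly like the following when written out as shell commands. This is a hedged sketch and not the contents of `runHadoop.pbs`: it assumes a Hadoop 2.x directory layout (`sbin/start-dfs.sh`, `sbin/start-yarn.sh`), and every path, the examples jar name, and the `setup_hadoop_conf` helper are hypothetical placeholders. By default it runs in dry-run mode and only prints each command.

```shell
#!/bin/sh
# Sketch of the seven-step workflow. NOT the real runHadoop.pbs.
# Assumptions: Hadoop 2.x layout; all paths below are placeholders.
# With DRY_RUN=1 (the default) commands are printed, not executed.

DRY_RUN="${DRY_RUN:-1}"
HADOOP_HOME="${HADOOP_HOME:-/path/to/hadoop}"                   # placeholder
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/tmp/hadoop-conf-example}"  # placeholder
INPUT_DIR="${INPUT_DIR:-./wordcount-input}"                     # placeholder
OUTPUT_DIR="${OUTPUT_DIR:-./wordcount-output}"
# Real jar names carry a version suffix; this name is an assumption.
EXAMPLES_JAR="${EXAMPLES_JAR:-$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: a real setup step would also write the
# site configuration files for the nodes assigned to the job.
setup_hadoop_conf() {
    mkdir -p "$1"
}

hadoop_workflow() {
    run setup_hadoop_conf "$HADOOP_CONF_DIR"            # 1. configuration
    run "$HADOOP_HOME/sbin/start-dfs.sh"                # 2. start services
    run "$HADOOP_HOME/sbin/start-yarn.sh"
    run hdfs dfs -put "$INPUT_DIR" /wordcount-input     # 3. stage-in
    run hadoop jar "$EXAMPLES_JAR" wordcount /wordcount-input /wordcount-output  # 4. run job
    run hdfs dfs -get /wordcount-output "$OUTPUT_DIR"   # 5. stage-out
    run "$HADOOP_HOME/sbin/stop-yarn.sh"                # 6. stop services
    run "$HADOOP_HOME/sbin/stop-dfs.sh"
    run rm -rf "$HADOOP_CONF_DIR"                       # 7. clean up
}

hadoop_workflow
```

Staging data in and out is needed because HDFS exists only for the lifetime of the job: it is created on the allocated nodes in step 2 and torn down again in steps 6 and 7.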

Commands

1. Create job directory:
$> mkdir myHadoopJob1
$> cd myHadoopJob1

2. Get Hadoop PBS script:
$> cp /tmp/runHadoop.pbs .
Or,
$> wget https://raw.githubusercontent.com/pfxuan/myhadoop/master/examples/runHadoop.pbs

3. Submit job to Palmetto cluster:
$> qsub runHadoop.pbs

4. Check status of your job:
$> qstat -anu your_cu_username

5. Verify the correctness of your result:
$> grep Hadoop wordcount-output/* | grep 51
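The double `grep` in step 5 also matches lines such as `Hadoop 151` or `HadoopFS 51`. A stricter check could compare the word and the count as whole fields, as in the sketch below. The `verify_count` function is a hypothetical helper, and it assumes the output files hold one tab-separated `word<TAB>count` pair per line, which is the usual layout for Hadoop's WordCount example.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if some line in the output directory
# has exactly the given word in field 1 and the given count in field 2.
# Assumes tab-separated "word<TAB>count" lines in the output files.
verify_count() {
    word=$1
    expected=$2
    dir=$3
    cat "$dir"/* | awk -F '\t' -v w="$word" -v n="$expected" '
        $1 == w && $2 == n { ok = 1 }
        END { exit ok ? 0 : 1 }
    '
}
```

For the job above, `verify_count Hadoop 51 wordcount-output && echo OK` would print `OK` only when the word `Hadoop` was counted exactly 51 times.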