Introduction to Data Analysis with Hadoop PRESENTED BY: Tatineni, Mahidhar, PhD, SDSC Feng “Kevin” Chen, PhD, TACC and Weijia Xu, PhD, TACC

HadoopTraining05 03 2017 V2 - HPC Universityhpcuniversity.org/media/TrainingMaterials/49/Hadoop_JSU_May_2017__1_-1.pdf– YARN, A resource manager to assign resources to the computational

Mar 13, 2020


dariahiddleston

Introduction to Data Analysis with Hadoop

PRESENTED BY: Tatineni, Mahidhar, PhD, SDSC

Feng “Kevin” Chen, PhD, TACC

and Weijia Xu, PhD, TACC


Today’s Agenda
• Overview of Big Data – scope, challenges
• Introduction to Hadoop Architecture
  • HDFS
  • YARN
• MapReduce Paradigm
• Hadoop on Wrangler
  • Wrangler overview
  • Hadoop reservations, usage
• Hadoop hands-on session
  • HDFS commands
  • Simple wordcount example
  • Hadoop streaming
  • Mahout


XSEDE Training Survey

Please complete a short on-line survey about this module at http://bit.ly/xsedejackson. We value your feedback, and will use your feedback to help improve our training offerings. Slides from this workshop are available at

http://hpcuniversity.org/trainingMaterials/238/


What/Why Big Data?

Big data: high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

• 5 billion mobile phones in use in 2010
• Facebook: 34,722 likes every minute, 100 TB uploaded daily
• YouTube: users upload 48 hours of new video every minute.
• Wal-Mart handles more than 1 million customer transactions every hour.
• Akamai analyzes 75 million events per day to better target advertisements.
• Twitter: roughly 400 million tweets every day, and holds 465M accounts.
• 571 new websites are created every minute.


Continued..
• Data volume
  • 50x increase from 2010 to 2020
• Data velocity
  • 10k payment card transactions are made every second around the globe.
  • Wal-Mart handles 1M+ transactions an hour.
• Data variety
  • Structured data, such as ATM and POS bank transactions.
  • Semi-structured data has some tagging/method of differentiation.
  • Unstructured: everything else falls in this category, e.g. tweets, FB likes, posts etc.


Continued..
• Rapid growth, with almost 90% of the data generated in the last 2 years.
• Classification of data
  • 51% is structured data
  • 27% is semi-structured data
  • 22% is unstructured data
• Total 6M jobs, out of which 2M are in the US. Limited by the number of people with deep analysis skills, and it will be difficult to address big data requirements in the US by 2018.


What’s the big challenge with big data analysis?

• The data analysis process requires a lot of computational resources:
  – Storage: triple the size of the raw data to store the intermediate files, output etc.
  – Memory: e.g. an algorithm may want to store the pair-wise distance matrix among data points.
• The analysis process would take much longer.
  – Typical hard drive read speed is about 150 MB/sec
    • Reading 1 TB takes ~2 hours
  – Analysis could require processing time quadratic in the size of the data
    • An analysis that took 1 second for 1 GB of data would require about 11 days to finish for 1 TB of data
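The two estimates above can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes, so 1 TB is 1000 times 1 GB); the class and method names are illustrative only. Under those assumptions the read takes about 1.85 hours and the quadratic analysis about 11.6 days, in line with the slide's rounded figures.

```java
/** Back-of-the-envelope estimates for the figures quoted above (decimal units assumed). */
class ScalingEstimate {

    /** Seconds needed to read `bytes` sequentially at `bytesPerSecond`. */
    static double readSeconds(double bytes, double bytesPerSecond) {
        return bytes / bytesPerSecond;
    }

    /** Runtime of an O(n^2) analysis after the input grows by `growthFactor`. */
    static double quadraticSeconds(double baseSeconds, double growthFactor) {
        return baseSeconds * growthFactor * growthFactor;
    }

    public static void main(String[] args) {
        double hours = readSeconds(1e12, 150e6) / 3600.0;       // 1 TB at 150 MB/s
        double days = quadraticSeconds(1.0, 1000.0) / 86400.0;  // 1 GB -> 1 TB is 1000x
        System.out.printf("read 1 TB: %.2f hours%n", hours);
        System.out.printf("quadratic analysis: %.1f days%n", days);
    }
}
```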


What is Hadoop

Apache Hadoop is an open source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. It’s a top-level Apache project being built and used by a global community of contributors/users. It is licensed under the Apache License 2.0.

Features:
• Cost effective

• Flexible/Heterogeneous hardware

• Scalable

• Resilient to failure


Hadoop
• Implementation of the MapReduce programming model in Java, with interfaces to other programming languages such as C/C++ and Python.
• Hadoop includes
  – HDFS, a distributed file system based on the Google File System (GFS), as its shared file system.
  – YARN, a resource manager to assign resources to the computational tasks.
  – MapReduce, a library to enable efficient distributed data processing easily.
  – Mahout, a scalable machine learning and data mining library.
  – Hadoop streaming, to enable processing with other languages.
  – …


HDFS

Features:
• Based on GoogleFS
• Fault tolerant and easy to manage
• Scalable and extremely simple to expand
• Hadoop supports shell-like commands to interact with HDFS directly
• The typical replication factor is 3, but it can be set at file level as well
• 128 MB is the default block size
• Daemon services of HDFS
  • NameNode
  • Secondary NameNode
  • DataNode

Limitations:
• Cannot be mounted directly by an existing OS.
• Not suited to low-latency access or to systems requiring concurrent writes
• Not designed for parallel writes or arbitrary (random-access) reads
• Meant for a small number of huge files, not a large number of small files


NameNode

• Master node in the cluster; a single point of failure
• Data nodes send heartbeats every 3 seconds
• Every 10th heartbeat is a block report
• The NameNode builds metadata from block reports
• All requests (read/write) are processed by the NameNode only


YARN Architecture

• Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.
• The Scheduler is responsible for allocating resources, but takes no responsibility for job completion.
• The ApplicationMgr accepts job submissions, spins up the AppMaster and tracks its progress.
• The AppMaster is responsible for negotiating resources from the Scheduler, tracking their status and monitoring progress.


Hadoop and YARN Architecture

YARN has two main components: the Scheduler and the ApplicationManager.
The Scheduler does the scheduling and allocates resources to running applications. It does so based on the abstract notion of a resource container, which incorporates elements such as memory, CPU, disk, network etc.
The ApplicationMgr is responsible for accepting job submissions, creating the AppMaster container and restarting the AppMaster in case of failure.
The NodeMgr is the per-machine framework agent, responsible for containers, monitoring their resource usage and reporting the same to the RM.


MapReduce

MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters in a reliable, fault-tolerant manner.

Features:
• Accessibility
• Flexibility
• Reliability


MapReduce deep dive

The reduce process can start before the mappers end. It can start the shuffle, since that is data transfer only, but not the sort and reduce. The property mapreduce.job.reduce.slowstart.completedmaps controls when to start the reducers.
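As a rough illustration of what the slow-start property means, the sketch below turns a fraction of completed maps into a task count. The class name, the sample fractions and the use of rounding up are assumptions made for illustration; the exact rule inside Hadoop may differ between versions.

```java
/** Sketch: how many map tasks must finish before reducers may be scheduled. */
class SlowStartSketch {

    /**
     * slowstartFraction stands in for mapreduce.job.reduce.slowstart.completedmaps
     * (a value between 0.0 and 1.0). Rounding up is an assumption of this sketch.
     */
    static int mapsBeforeReducers(double slowstartFraction, int totalMaps) {
        return (int) Math.ceil(slowstartFraction * totalMaps);
    }

    public static void main(String[] args) {
        int totalMaps = 266; // e.g. one map task per input block
        System.out.println("fraction 0.05 -> " + mapsBeforeReducers(0.05, totalMaps) + " maps");
        System.out.println("fraction 0.80 -> " + mapsBeforeReducers(0.80, totalMaps) + " maps");
        System.out.println("fraction 1.00 -> " + mapsBeforeReducers(1.00, totalMaps) + " maps");
    }
}
```

A small fraction lets the shuffle overlap with the map phase; a fraction of 1.0 delays reducers until every mapper has finished, which frees containers for map tasks on a busy cluster.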


“Big” Ideas in MapReduce

• Move computations to data.
  – Do not use/assume a high-bandwidth interconnect between nodes.
  – If possible, avoid or reduce the need for data transfer over the network, as this is often the bottleneck to scaling.
• Process data in sequential order; avoid random access.
  – Same idea as applied in the RDBMS
  – However, assume data can be processed in any order
  – All data are packed into blocks (default 64 MB or 128 MB).


“Big” Ideas in MapReduce

• Scale out instead of scale up:
  – The same code can run with 1 node or 100 nodes.
• Hide system details from the user
  – Provides abstractions for writing parallel code.
    • Mapper and reducer
    • Partitioner and combiner
  – Isolates the developer from (and makes code independent of) system hardware details
  – Once all required components are specified, data is automatically “sliced” and processed in parallel.


How does MapReduce work in Hadoop?

• The computation is broken down into two major steps:
  – Map instances:
    • Process the data stored within one data block sequentially
    • The result is in the format of a list of key-value pairs
  – Reduce instances:
    • Collect the (key, value) pairs emitted by Map instances
    • Pairs with the same key are sent to the same reducer
    • Each reducer processes the key-value pairs received and writes output to a file.
• The user is only required to develop functions for Map and Reduce
  – The workload distribution is automatically handled by the Hadoop cluster
• Either the “Key” or the “Value” in a key-value pair could be any type of data.



WordCount Example

• Read text files and count how often words occur.
  – The input is a text file
  – The output is a text file
    • each line: word, tab, count
• Map: Produce pairs of (word, count)
• Reduce: For each word, sum up the counts.
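Before looking at the Hadoop version, it may help to see the same map, group-by-key and reduce steps in plain Java with no cluster involved. This is only a single-process sketch: the class `WordCountSketch` and its method names are made up for illustration and are not part of the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Single-process imitation of the map / shuffle / reduce steps of word count. */
class WordCountSketch {

    /** Map step: emit a (word, 1) pair for every token in one line of text. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle step: group the emitted values by key, with keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: sum the counts collected for one word. */
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    /** Run all three steps over the input lines and return word -> count. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(emitted).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            wordCount(Arrays.asList("the quick brown fox", "the lazy dog the end"));
        // One line per word: word, tab, count
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```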


• The AppMaster launches one map task for each map split. Typically there is one map split per block of each input file (so a small file yields a single split).
• Mappers transform the input data to intermediate data. Output pairs do not need to be of the same types as input pairs. A given input pair may map to zero or many output pairs.
• All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
• The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job.
• Users can optionally specify a combiner to perform local aggregation of the intermediate outputs. This helps to cut down the amount of data flow.
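The "partitioned per Reducer" step can be pictured as a hash-modulo rule, which is the idea behind Hadoop's default hash partitioner. The sketch below applies that rule to plain Java strings; the class name is hypothetical, and because Hadoop's `Text` keys hash differently from `String`, the actual partition numbers on a cluster would not match these.

```java
/** Sketch of hash partitioning: one partition per reduce task. */
class HashPartitionSketch {

    /** Partition index in [0, numReduceTasks); the same key always lands in the same partition. */
    static int partitionFor(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        for (String word : new String[] {"hadoop", "yarn", "hdfs", "hadoop"}) {
            System.out.println(word + " -> partition " + partitionFor(word, numReduceTasks));
        }
    }
}
```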


WordCount Overview

import ...

public class WordCount {

    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        public void map ...
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce ...
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        ...
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


WordCount Mapper

public static class Map extends Mapper<Object, Text, Text, IntWritable> {
    // Reused for every emitted pair: the constant count 1 and the current word.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per input record; emits (word, 1) for every token in the line.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}


WordCount Reducer

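public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Holds the summed count written out for each word.
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // values is an Iterable, so use for-each rather than hasNext()/next().
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}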


WordCount main

public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");

    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);   // the reducer doubles as the combiner
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));     // input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}


Mapper

• Maps input key-value pairs to a set of intermediate key-value pairs
  – Class for individual tasks to run.
  – One mapper task per InputSplit.
  – The map function is automatically called per key-value pair.

public static class Map extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}


Partitioner and Combiner

• The intermediate output generated by the Mapper will be sorted and partitioned based on the number of reducers.
• Partitioner (optional implementation): determines how to group intermediate key-value pairs for each reducer.
• Combiner (optional implementation): combines key-value pairs with the same key.
  – Think of it as running the reducer locally.
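To see why "running the reducer locally" saves network traffic, the sketch below combines the words emitted by a single mapper before anything is sent to a reducer. It is plain Java for illustration only; `CombinerSketch` and `combine` are hypothetical names, and a real combiner operates on Hadoop key-value types inside the framework.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of a combiner: local aggregation of one mapper's output. */
class CombinerSketch {

    /** Sum the counts of identical words emitted by ONE mapper (each emission counts as 1). */
    static Map<String, Integer> combine(List<String> emittedWords) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : emittedWords) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("the", "cat", "the", "hat", "the");
        Map<String, Integer> combined = combine(emitted);
        // Without a combiner every emitted pair crosses the network; with one, only the combined pairs do.
        System.out.println("pairs before combining: " + emitted.size());
        System.out.println("pairs after combining:  " + combined.size());
    }
}
```

This works for word count because addition is associative and commutative, so summing partial sums gives the same total; that is also why the WordCount job can reuse its reducer class as the combiner.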


Reducer
• Reduces the set of values of the same key to a smaller set.
  – Each reducer will process the subset generated by the partitioner
  – Each reducer will generate an output file

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}


Shuffle and Sort

• There are two other phases during the reducer:
  – Shuffle: gathering the partitions from each mapper
  – Sort: merge and sort the partitions gathered from multiple mappers.
• These two phases usually run simultaneously.
• A common cause of bottlenecks, as this may involve large data movement.
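The sort phase is essentially a merge of runs that each mapper has already sorted. A minimal in-memory sketch of such a k-way merge is shown below, using a priority queue over the head of each run; the class name is hypothetical, and Hadoop's real merge works on spilled files and buffers rather than Java lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of the reducer-side sort: merge already-sorted runs from several mappers. */
class MergeSortedRunsSketch {

    /** Merge several individually sorted lists of keys into one sorted list. */
    static List<String> merge(List<List<String>> sortedRuns) {
        // Queue entries are {runIndex, positionInRun}, ordered by the key they point at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> sortedRuns.get(a[0]).get(a[1]).compareTo(sortedRuns.get(b[0]).get(b[1])));
        for (int run = 0; run < sortedRuns.size(); run++) {
            if (!sortedRuns.get(run).isEmpty()) {
                heads.add(new int[] {run, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<String> run = sortedRuns.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heads.add(new int[] {head[0], head[1] + 1}); // advance within the same run
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
            Arrays.asList("apple", "cat"),
            Arrays.asList("bat", "dog"),
            Arrays.asList("ant"));
        System.out.println(merge(runs));
    }
}
```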


Accessing Hadoop Cluster using Wrangler


Computing Cluster at a Glance


Big data processing platform
• Big data processing platforms like Hadoop and Spark are not simple applications.
• Co-locate computations and data.
• Need a dedicated storage manager and resource manager.
• Deployment requires mapping and configuration with the underlying hardware.


Hadoop Support at TACC: Wrangler

[Figure: Wrangler system diagram. TACC site: Access & Analysis System (96 nodes, 128 GB+ memory, Haswell CPUs) and Mass Storage Subsystem (10 PB, replicated). Indiana site: Access & Analysis System (24 nodes, 128 GB+ memory, Haswell CPUs) and Mass Storage Subsystem (10 PB, replicated). High Speed Storage System: 500+ TB, 1 TB/s, 250M+ IOPS. Labels: IB interconnect, 120 lanes (56 Gb/s), non-blocking; interconnect with 1 TB/s throughput; 40 Gb/s Ethernet; 100 Gbps public network; Globus.]

• A direct attached PCI interface allows access to the NAND flash.

• Not limited by networking connection

• Flash storage not tied to individual nodes

• The Hadoop cluster can be dynamically created over 2 to 48 nodes for each project to use in allocated time

• Each node has access to 4 TB of flash storage across four channels

• Accessible via the Hadoop cluster via idev, batch job submission and VNC sessions.


File systems on Wrangler

• /home
  – Home directory for each user
  – Small area for configuration files and programs/source code
• /work
  – Each user gets 1 TB of space on work, shared across all systems at TACC
  – Your work directory is stored in the $WORK environment variable for Wrangler
• /data
  – Staging input and output files
  – Only available on Wrangler
  – Supports allocation of a shared project directory


Get Started with Hadoop on Wrangler

• Step 1: Create a Hadoop reservation through the Wrangler data portal
  – What do you need?
    • Any web browser
• Step 2: Access your Hadoop cluster and submit jobs
  – What do you need?
    • Secure shell client
    • Any VNC client


Create Hadoop Reservation
• Wrangler data portal: portal.wrangler.tacc.utexas.edu


On project page choose: Manage -> Create Hadoop Reservation

the number of nodes (1 ~10) to be used for the Hadoop cluster.

Duration(1-30 Days)

Schedule Start time


About Hadoop Reservation on Wrangler

• Anyone in the project can submit a Hadoop reservation request
• Anyone in the project can access the Hadoop reservation
• The SU on Wrangler is a “node hour”
• A Hadoop reservation requires a minimum of 2 nodes and 1 day, i.e. 48 SU.
  – One node in the reservation will be used as the NameNode, resource manager and application master etc.
  – The rest of the nodes will be used as data nodes.
  – Each node will have 4 TB of flash storage mounted as part of HDFS.
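The 48 SU minimum follows directly from the "node hour" definition: 2 nodes for 24 hours. The sketch below spells out that arithmetic and the split between the master node and the data nodes; the class and method names are illustrative, and actual charging is determined by TACC accounting rather than by this formula.

```java
/** Sketch of reservation arithmetic, assuming 1 SU = 1 node hour. */
class ReservationSketch {

    static final int HOURS_PER_DAY = 24;

    /** Service units consumed by a reservation of `nodes` nodes for `days` days. */
    static long serviceUnits(int nodes, int days) {
        return (long) nodes * days * HOURS_PER_DAY;
    }

    /** Nodes left as data nodes after one node is set aside for the NameNode and other master roles. */
    static int dataNodes(int nodes) {
        return nodes - 1;
    }

    public static void main(String[] args) {
        System.out.println("minimum reservation: " + serviceUnits(2, 1) + " SU, "
            + dataNodes(2) + " data node");
        System.out.println("10 nodes for 30 days: " + serviceUnits(10, 30) + " SU, "
            + dataNodes(10) + " data nodes");
    }
}
```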


Check Reservation Status

• The web portal will show the reservation status after the request has been submitted

• More information about the reservation is available through the Slurm command: scontrol


Check Hadoop Reservations Using Command Line

• Once logged on to a Wrangler login node, a user can check the reservation status with the `scontrol` command:

>scontrol show reservation

• The reservation will include all users from the project
• The first node in the reservation will be used as the NameNode
• The Hadoop cluster will start with a set of default settings
• Users may override most settings, such as replication factor and block size, at run time per application.
• A Hadoop cluster with specific settings is available upon request


Access Hadoop Reservation
• Once the reservation status is “active”, a user can access it (via Slurm jobs) in multiple ways:
  – VNC job: starts a VNC server session on one of the nodes in the Hadoop cluster
    • Check cluster information and Hadoop job status
    • Applications with a graphical/web user interface
  – idev job: assigns one node in the Hadoop cluster to the user
    • Manage data in and out of the Hadoop cluster
    • Submit Hadoop jobs via command line
    • Code testing
  – Batch job: submit jobs to the YARN resource manager in the Hadoop cluster
    • Submit a large analysis job
    • Submit a batch of processing jobs to run sequentially


Access Hadoop Cluster with VNC

• Please visit: vis.tacc.utexas.edu

Choose “TACC User Portal User”

Enter credential


1. Choose Wrangler Tab

2. Set VNC password (only needed once)

3. Fill in reservation name: hadoop+TRAINING-HPC+1419
   And choose the “hadoop” queue


Access Hadoop Reservation via idev Session

• Users can submit an idev session using the Hadoop reservation
  ➢ idev -r hadoop+JSU-Training
• It defaults to your default project. Use the -A allocation_name option to specify the allocation to use
• The default duration for idev is 30 minutes. The -m minutes option can specify the duration of the idev session
• Please limit your usage to Hadoop-related tasks; you can also submit idev without using the reservation for non-Hadoop tasks.


Access Hadoop Reservation via idev Session

• Access by secure shell client
  – ssh wrangler.tacc.utexas.edu
  – idev -r hadoop+JSU-Training -m 240
• Access by vis portal
  – Go to vis.tacc.utexas.edu using a web browser
  – Log in with your credentials
  – Go to the Wrangler tab to start VNC sessions using the reservation and the Hadoop queue


Working with Hadoop Cluster


Hadoop Distributed File System (HDFS)

• The HDFS will be set up with three top-level directories
  • /tmp   publicly writeable; used by many Hadoop-based applications as temporary space.
  • /user  all users’ home directories: /user/$USERNAME
  • /var   publicly readable; used by many Hadoop-based applications to store log files etc.


Working with HDFS
• HDFS has a file system shell

hadoop fs [commands]

• The file system shell includes a set of commands to work with HDFS.
  – Commands are similar to common Linux commands, e.g.
    • >hadoop fs -ls          # to list content of the default user directory
    • >hadoop fs -mkdir abc   # to make a directory in HDFS.


Getting Data in and out of HDFS

• hadoop fs -put local_file [path_in_HDFS]
  – Put a file in your local system into HDFS
  – Each file would be stored in one or more “blocks”
    • The default block size is 128 MB.
    • The block size used can be overridden by users
• hadoop fs -get path_in_hdfs [path_in_local]
  – Get a file from the Hadoop cluster to the local file system.


Other File Shell commands
• -stat       returns stat information of a path
• -cat/tail   output to stdout
• -setrep     set replication factor

• For a complete list just do hadoop fs

• Reference and more details: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html

Note: For dfs-specific commands, newer versions of Hadoop have transitioned to using “hdfs” commands. For example: “hadoop dfs -ls” is the same as “hdfs dfs -ls”


HDFS Command Examples
• hdfs fsck <path>

• File system checking utility. Can identify problems like missing blocks, under-replication etc.


HDFS Command Examples
• hdfs dfsadmin -report

• Reports basic file system information and statistics; needs admin privileges.


YARN

• YARN: Yet Another Resource Negotiator
  – Manages computing resources within the Hadoop cluster
  – All jobs should be submitted to YARN to run.
    • E.g. using either yarn jar or hadoop jar
  – When using other Hadoop-supported applications, such as Spark, please also specify YARN as the resource manager.
• YARN commands
  – Show cluster status
  – Help manage jobs running inside the Hadoop cluster.


YARN Commands

• yarn application
  -list                   to list applications submitted to YARN; by default shows active/queued jobs
  -kill                   to kill the application specified by job ID
  -appStates/-appTypes    filter options


YARN Commands

• yarn node
  – -list   list the status of data nodes
  – Lets us know if there are fewer live data nodes than expected.
• yarn logs
  – dump the logs of a finished application
  – -applicationId   specify which application’s logs to dump
  – -containerId     specify which container’s logs to dump


Running Hadoop Application
• All Hadoop jobs can be run as a console command.
• The basic format is as follows:

hadoop jar java_jar_name java_class_name [parameters]

  – Users can use -D to specify more Hadoop options.
    -Dmapred.map.tasks      # number of map instances to be generated.
    -Dmapred.reduce.tasks   # number of reduce instances to be used.


How many Mapper and Reducer tasks?
• Unfortunately, it depends on the particular case.
• The input parameter is a “suggestion”.
• Mapper
  – More or less depends on the input workload
  – InputSplit, InputFormat
    • Number of blocks used to store the input
    • Number of lines in a text file.
• Reducer
  – More or less depends on the computation workload and available resources, e.g. 0.95 or 1.75 × available containers
  – Also to consider:
    • How many partitions could be generated?
    • How many output splits are desired?
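The 0.95 and 1.75 factors can be turned into a quick estimate. The sketch below assumes "available containers" means nodes times containers per node, and the cluster size in the example is hypothetical. With 0.95 all reducers can launch in a single wave with a little headroom; with 1.75 faster nodes finish a first round and start a second, which tends to balance load better.

```java
/** Sketch of the reducer-count rule of thumb: factor x available containers. */
class ReducerCountSketch {

    /** Suggested number of reduce tasks; rounding to the nearest integer is a choice of this sketch. */
    static int suggestedReducers(double factor, int nodes, int containersPerNode) {
        return (int) Math.round(factor * nodes * containersPerNode);
    }

    public static void main(String[] args) {
        int nodes = 10;             // hypothetical cluster size
        int containersPerNode = 8;  // hypothetical containers per node
        System.out.println("single wave (0.95): " + suggestedReducers(0.95, nodes, containersPerNode));
        System.out.println("two waves (1.75):   " + suggestedReducers(1.75, nodes, containersPerNode));
    }
}
```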


Running Wordcount example code

• The Hadoop distribution comes with an example jar which includes a set of exemplar MapReduce codes.

• To run the wordcount example from that jar file, you can run a command like the following:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount \
    -Dmapred.map.tasks=500 \
    -Dmapred.reduce.tasks=256 \
    /tmp/data/enwiki-20120104-pages-articles.xml \
    wiki_wc

  – wordcount: Java class name to run
  – -Dmapred.map.tasks=500: number of mapper instances
  – -Dmapred.reduce.tasks=256: number of reducer instances
  – /tmp/data/enwiki-20120104-pages-articles.xml: input file on HDFS
  – wiki_wc: folder to store the output


Running Wordcount Example code

• The on-screen output will show the job running status


Running WordCount Example

• The number of mappers actually created may be limited by the number of data blocks in HDFS.
  – In the example, the input file is about 35 GB; with the default block size of 128 MB, the file is stored in 266 blocks in HDFS.
  – The mappers may not all run at the same time if there are not enough resources.

• Each reducer generates an output file independently of the others.
  – So 256 reducers would result in 256 files in the output folder.
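The block count quoted above can be checked with a little arithmetic. The Python sketch below is only an illustration: the exact file size is not given on the slide, so the value of 35.7 billion bytes is an assumption chosen to be consistent with "about 35 GB", and `num_blocks` is a hypothetical helper name.

```python
MIB = 1024 * 1024


def num_blocks(file_size_bytes, block_size_bytes=128 * MIB):
    """Number of HDFS blocks needed to store a file (the last block may be partial)."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    assumed_size = 35_700_000_000  # assumed size in bytes, not taken from the slides
    print(num_blocks(assumed_size))  # 266
```

With one input split per block, this is also a reasonable estimate of how many mappers the job will start.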


Running WordCount example

• Each output file is a text file as well.
  – Each line contains a word and its count.
  – You will notice that within each output file, the words are sorted in alphabetical order.
  – You can copy the files out of HDFS using a command like
    • hadoop fs -get /tmp/wiki_wc wiki_wc
  – Or you can view the content of each output file using a command like
    • hadoop fs -cat /tmp/wiki_wc/part-r-00238 | more
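Once the output folder has been copied out of HDFS, the per-reducer files can be combined locally. The following Python sketch assumes each line has the form word, tab, count (the default text output layout) and that the files sit in a local folder named `wiki_wc`; the helper name `parse_wordcount_lines` is hypothetical.

```python
import glob
from collections import Counter


def parse_wordcount_lines(lines):
    """Parse 'word<TAB>count' lines into a Counter, summing repeated words."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the count is always the final field.
        word, _, count = line.rpartition("\t")
        counts[word] += int(count)
    return counts


if __name__ == "__main__":
    totals = Counter()
    # Assumed local copy of the output folder, one part-r-* file per reducer.
    for path in sorted(glob.glob("wiki_wc/part-r-*")):
        with open(path, encoding="utf-8", errors="replace") as handle:
            totals.update(parse_wordcount_lines(handle))
    for word, count in totals.most_common(10):
        print(word, count)
```

Because each word is sent to exactly one reducer, a word normally appears in only one part file; summing in a `Counter` simply makes the merge robust.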


Other examples

• There are more examples in the examples jar. You can use the following command to get the list:
  – hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

• Here are a few examples:
  – grep: A map/reduce program that counts the matches of a regex in the input.
  – pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  – terasort: Sorts a large set of randomly generated 100-byte records.
  – sudoku: A sudoku solver.
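To give a feel for what a quasi-Monte Carlo estimate of Pi looks like, here is a single-process Python sketch. It uses a Halton sequence (bases 2 and 3) as the quasi-random point source; that choice, and the function names, are assumptions for illustration and are not claimed to match the bundled example line by line. In a MapReduce version, each mapper would count points for its own slice of the sequence and a reducer would sum the counts.

```python
import math


def halton(index, base):
    """Return element number `index` (index >= 1) of the Halton sequence for `base`."""
    result, fraction = 0.0, 1.0
    while index > 0:
        fraction /= base
        result += fraction * (index % base)
        index //= base
    return result


def estimate_pi(num_points):
    """Estimate Pi from the share of quasi-random points inside a quarter unit circle."""
    inside = 0
    for i in range(1, num_points + 1):
        x, y = halton(i, 2), halton(i, 3)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_points


if __name__ == "__main__":
    estimate = estimate_pi(100_000)
    print(estimate, abs(estimate - math.pi))
```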


Hands On Session

1. Use the training account to log in to login.xsede.org (remember to use DUO)
2. Access Wrangler using gsissh


Logging into Single Sign On Host

Mac/Linux:

ssh [email protected]

Windows (PuTTY):

login.xsede.org


Log in to Wrangler from Single Sign On host

gsissh wrangler-tacc


Try It Out

• Example code/data location:
  /work/03024/chenk/Hadoop_Training_TACC.tar.gz

  Copy or extract to your home directory, e.g.
  > tar -xvf /work/03024/chenk/Hadoop_Training_TACC.tar.gz
  > cd ~/Hadoop_Training_TACC

• Today's reservation is: hadoop+TRAINING-HPC+2188

• Use "idev" to access nodes:
  > idev -r hadoop+JSU-Training -m 240


• Take a look at exercise.txt (https://drive.google.com/file/d/0B4PqgCa0ORIgNUNIVWpyMVVIZnc/view?usp=sharing)
  – Try exercise 1 and exercise 2 first

• hadoop fs command syntax
  – hadoop fs -{ls | rm | cat | mkdir | put | get}


Hadoop Streaming

• Enabling MR jobs in other scripting languages: Python, Perl, R, C, etc.

• The user needs to provide scripts/programs for Map and Reduce processing
  – The input/output format needs to be compatible with key-value pairs

• Intermediate data are passed through stdin and stdout
  – A trade-off between convenience and performance


Hadoop Streaming API

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
    -input /path/to/input/in/hdfs      # input file location
    -output /path/to/output/in/hdfs    # output file location
    -mapper map                        # mapper implementation
    -reducer reduce                    # reducer implementation
    -file map                          # location of the map code on local file system
    -file reduce                       # location of the reduce code on local file system

• The map and reduce could be implemented in any programming language, even with a bash script.
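As one possible illustration of a streaming mapper and reducer, the Python sketch below implements word count over stdin/stdout. It assumes the usual streaming contract: the mapper writes key, tab, value lines, and the reducer receives those lines sorted by key. The script name `wc_stream.py` and the function names are hypothetical, and this is not the exemplar code shipped with the training material.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Run as "wc_stream.py map" or "wc_stream.py reduce"; do nothing otherwise.
    stages = {"map": map_lines, "reduce": reduce_lines}
    if len(sys.argv) > 1 and sys.argv[1] in stages:
        for out in stages[sys.argv[1]](sys.stdin):
            print(out)
```

The pair can be tried without a cluster by piping a text file through the map stage, `sort`, and then the reduce stage, which mimics the shuffle step that Hadoop performs between the two.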


WordCount using Bash with Hadoop streaming

• The map code:


WordCount using Bash with Hadoop streaming

• The reduce code:


WordCount using Bash with Hadoop streaming

• Putting it together (-Dstream.num.map.output.key.fields specifies the position of the key):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapred.map.tasks=512 \
    -Dmapred.reduce.tasks=256 \
    -Dstream.num.map.output.key.fields=1 \
    -input /tmp/data/20news-all/alt.atheism \
    -output wiki_wc_bash \
    -mapper ./mapwc.sh -reducer ./reducewc.sh \
    -file ./mapwc.sh -file ./reducewc.sh


Hadoop Streaming

• The default data split behavior of Hadoop is by blocks.

• However, Hadoop streaming splits data per "line", so
  – You can use a higher number of mappers
  – By default, streaming only works with text files


Try It Out

• If you are comfortable with Java
  – Try exercise 3

• Exercise 4 is about Hadoop streaming
  – Can you write a wordcount example with your favorite programming language?
  – Exemplar code for Python, R and bash is available in the corresponding directory.

• Questions to think about:
  – Where/how did we specify parallelism?


Hadoop WordCount Example (Screenshot)


Mahout

• Machine learning libraries for Hadoop

• Not a very comprehensive one, but it still provides good coverage

• In practice somewhat raw and complex to utilize. Much of the code is specific to the particular dataset being processed (e.g. 20 newsgroups)

• A subset of the analytic methods can also be run from the command line.


Running Mahout from Command Line

• Typing mahout will show a list of the programs that can be run

• Some of potential interest:
  – buildforest: Build the random forest classifier
  – kmeans: K-means clustering
  – recommenditembased: Compute recommendations using item-based collaborative filtering
  – runlogistic: Run a logistic regression model against CSV data
  – svd: Lanczos Singular Value Decomposition
  – …


Walkthrough of using Mahout for K-means clustering

• Dataset: Reuters-21578 news data
  – wget https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
  – mkdir reuters-sgm
  – tar -zxvf reuters21578.tar.gz -C reuters-sgm

• Step 1, preparing the files and moving them to HDFS
  – $ mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-sgm reuters-out
  – $ hadoop fs -put reuters-out reuters-sgm-extract

• Step 2, converting the extracted text files into the sequence file format in Mahout
  – $ mahout seqdirectory -i reuters-sgm-extract -o reuters-seqdir -c UTF-8 -chunk 5


Walkthrough of using Mahout for K-means clustering

• Hadoop Sequence File
  – Sequence of records, where each record is a <Key, Value> pair, e.g.
    • <Key1, Value1>
    • <Key2, Value2>

• For this example,
  – Key <- document ID
  – Value <- content of the document


Walkthrough of using Mahout for K-means clustering

• Step 3: creating the vector representation from the sequence file

> mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-vectors

• This step creates the term frequency/inverse document frequency (TF-IDF) matrix for the document set.
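For readers new to TF-IDF, here is a small Python sketch of the idea on an in-memory toy corpus. It uses the textbook weighting tf × log(N / df) and plain whitespace tokenization; both are assumptions for illustration, and Mahout applies its own tokenizer and weighting variant. The function name `tfidf` and the example documents are made up.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


if __name__ == "__main__":
    docs = ["oil price rises", "oil exports fall"]  # toy documents, not from the dataset
    for vector in tfidf(docs):
        print(vector)
```

A term that occurs in every document (here "oil") gets weight 0, while terms unique to one document get the largest weights, which is what makes the vectors useful for clustering.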


Walkthrough of using Mahout for K-means clustering

• Step 4: run k-means clustering with Mahout

mahout kmeans \
    -i reuters-seqdir-vectors/tfidf-vectors/ \
    -c reuters-kmeans-clusters \
    -o reuters-kmeans \
    -x 10 -k 20

• Step 5: checking the result

mahout clusterdump \
    -i reuters-kmeans/clusters-* \
    -d reuters-seqdir-vectors/dictionary.file-0 \
    -dt sequencefile -b 100 -n 20 \
    -o ./cluster-output.txt
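The clustering step itself can be pictured with a small in-memory Python sketch of Lloyd's k-means. This is not Mahout's implementation: the parameters `k`, `max_iter` and `convergence_delta` only loosely mirror the -k, -x and -cd options, the convergence test on squared centroid movement is an assumption, and all names are hypothetical.

```python
import random


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points]


def kmeans(points, k, max_iter=10, convergence_delta=0.5, seed=0):
    """Lloyd's k-means on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # Move the centroid to the mean of its assigned points.
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        shift = max(squared_euclidean(old, new) for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:  # centroids barely moved: stop early
            break
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # toy points
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

In the Mahout run, the points are the TF-IDF vectors from step 3, and each iteration is a MapReduce job: mappers assign vectors to the nearest centroid and reducers recompute the centroids.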


Walkthrough of using Mahout for K-means clustering

• You can get the list of options of each program by
  – mahout [program_name]

• Options for k-means in Mahout
  --input (-i) input: Path to job input directory.
  --clusters (-c) clusters: The input centroids, as Vectors. (optional)
  --k (-k) k: The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path.
  --output (-o) output: The directory pathname for output.
  --distanceMeasure (-dm) distanceMeasure: The classname of the DistanceMeasure. Default is SquaredEuclidean.
  --convergenceDelta (-cd) convergenceDelta: The convergence delta value. Default is 0.5.
  --maxIter (-x) maxIter: The maximum number of iterations.
  --maxRed (-r) maxRed: The number of reduce tasks. Defaults to 2.
  --overwrite (-ow): If present, overwrite the output directory before running job.
  --help (-h): Print out help.
  --clustering (-cl): If present, run clustering after the iterations have taken place.


Other Packages and Tools

• HBase
  – Open source implementation of BigTable
  – Columnar, append-only key-value store
  – Built on HDFS but not MR. Provides fast random lookup for HDFS data.

• Spark
  – A different programming model with Hadoop
  – Distributed collections of objects that can be cached in memory across cluster nodes
  – Currently supported via the YARN resource manager

• XSEDE machines provide a broad range of capabilities. Please let us know other tools/packages you may need.


Summary

• The Hadoop framework is extensively used for scalable distributed processing of large datasets.
• Several tools are available that leverage the framework.
• Hadoop Streaming can be used to write mapper/reducer functions in any language.
• Mahout w/ Hadoop provides scalable machine learning tools.
• Tools like Spark have expanded capabilities, taking advantage of in-memory storage/caching. They can also go beyond the MapReduce paradigm.

• Please fill out the survey:
• http://bit.ly/xsedejackson