Introduction to Data Analysis with Hadoop PRESENTED BY: Tatineni, Mahidhar, PhD, SDSC Feng “Kevin” Chen, PhD, TACC and Weijia Xu, PhD, TACC
Today’s Agenda
• Overview of Big Data – scope, challenges
• Introduction to Hadoop Architecture
  – HDFS
  – YARN
• MapReduce Paradigm
• Hadoop on Wrangler
  – Wrangler overview
  – Hadoop reservations, usage
• Hadoop hands-on session
  – HDFS commands
  – Simple word count example
  – Hadoop streaming
  – Mahout
XSEDE Training Survey
Please complete a short online survey about this module at http://bit.ly/xsedejackson. We value your feedback and will use it to help improve our training offerings.
Slides from this workshop are available at http://hpcuniversity.org/trainingMaterials/238/
What/Why Big Data?
Big data: high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• 5 billion mobile phones in use in 2010
• Facebook: 34,722 likes every minute, 100 TB uploaded daily
• YouTube: users upload 48 hours of new video every minute
• Wal-Mart handles more than 1 million customer transactions every hour
• Akamai analyzes 75 million events per day to better target advertisements
• Twitter: roughly 400 million tweets every day, and it holds 465M accounts
• 571 new websites are created every minute
Continued..
• Data volume
  – 50x increase from 2010 to 2020
• Data velocity
  – 10k payment card transactions are made every second around the globe
  – Wal-Mart handles 1M+ transactions an hour
• Data variety
  – Structured data, such as ATM and POS bank transactions
  – Semi-structured data has some tagging or other method of differentiation
  – Unstructured: everything else falls in this category, e.g. tweets, FB likes, posts, etc.
Continued..
• Rapid growth, with almost 90% of the data generated in the last 2 years
• Classification of data
  – 51% is structured data
  – 27% is semi-structured data
  – 22% is unstructured data
• A total of 6M jobs, of which 2M are in the US. This is limited by the number of people with deep analysis skills, so it will be difficult to address big data requirements in the US by 2018.
What’s the big challenge with big data analysis?
• The data analysis process requires a lot of computational resources
  – Storage: triple the size of the raw data to store the intermediate files, output, etc.
  – Memory: e.g. an algorithm may want to store the pair-wise distance matrix among data points
• The analysis process would take much longer
  – Typical hard drive read speed is about 150 MB/sec
    • Reading 1 TB takes ~2 hours
  – Analysis could require processing time quadratic in the size of the data
    • An analysis that took 1 second for 1 GB of data would require 11 days to finish for 1 TB of data
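
The two estimates on this slide can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 1000 GB = 1,000,000 MB); the function names are illustrative only.

```python
# Back-of-the-envelope estimates for the two figures quoted on the slide.
# Assumes decimal units (1 TB = 1000 GB = 1,000,000 MB).

READ_SPEED_MB_PER_S = 150  # typical hard drive read speed from the slide


def read_time_hours(size_mb, speed_mb_per_s=READ_SPEED_MB_PER_S):
    """Time to read size_mb sequentially from a single drive, in hours."""
    return size_mb / speed_mb_per_s / 3600


def quadratic_time_days(base_seconds, base_gb, target_gb):
    """Scale a run time that grows with the square of the data size."""
    factor = (target_gb / base_gb) ** 2
    return base_seconds * factor / 86400


one_tb_in_mb = 1_000_000
print(round(read_time_hours(one_tb_in_mb), 2))    # 1.85 (hours)
print(round(quadratic_time_days(1, 1, 1000), 1))  # 11.6 (days)
```

A 1000x increase in data size becomes a 1,000,000x increase in run time for a quadratic algorithm, which is where the "11 days" figure comes from.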
What is Hadoop
Apache Hadoop is an open source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. It is a top-level Apache project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.
Features:
• Cost effective
• Flexible/heterogeneous hardware
• Scalable
• Resilient to failure
Hadoop
• Implementation of the MapReduce programming model in Java, with interfaces to other programming languages such as C/C++ and Python
• Hadoop includes
  – HDFS, a distributed file system based on the Google File System (GFS), as its shared file system
  – YARN, a resource manager to assign resources to the computational tasks
  – MapReduce, a library to enable efficient distributed data processing easily
  – Mahout, a scalable machine learning and data mining library
  – Hadoop streaming, which enables processing with other languages
  – …
HDFS
Features:
• Based on Google FS
• Fault tolerant and easy to manage
• Scalable and extremely simple to expand
• Hadoop supports shell-like commands to interact with HDFS directly
• The typical replication factor is 3, but it can be set at the file level as well
• 128 MB is the default block size
• Daemon services of HDFS
  – NameNode
  – Secondary NameNode
  – DataNode
Limitations:
• Cannot be mounted directly by an existing OS
• Not suited to low-latency access or to systems requiring concurrent writes
• Parallel write / arbitrary read
• Meant for a small number of huge files, not a large number of small files
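
To make the block size, replication and "few huge files" points concrete, the sketch below counts blocks and raw storage for a file. The helper name `hdfs_footprint` is hypothetical, and the numbers use the defaults quoted on the slide.

```python
import math

BLOCK_SIZE_MB = 128   # default block size from the slide
REPLICATION = 3       # typical replication factor from the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, number of block replicas, raw MB stored)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_mb = file_size_mb * replication  # a partly filled last block uses only the space it needs
    return blocks, replicas, raw_mb


# One 1000 MB file versus 1000 files of 1 MB each (same amount of data).
print(hdfs_footprint(1000))                               # (8, 24, 3000)
print(sum(hdfs_footprint(1)[0] for _ in range(1000)))     # 1000
```

The same data stored as many small files needs far more blocks, and every block is one more entry for the NameNode to track.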
NameNode
• Master node in the cluster / single point of failure
• Each data node sends a heartbeat every 3 seconds
• Every 10th heartbeat is a block report
• The name node builds its metadata from block reports
• All requests (read/write) are processed by the name node only
YARN Architecture
• Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.
• The Scheduler is responsible for allocating resources, though it has no responsibility for job completion.
• The ApplicationMgr accepts job submissions, spins up the AppMaster and tracks its progress.
• The AppMaster has the responsibility of negotiating resources from the Scheduler, tracking their status and monitoring progress.
Hadoop and YARN Architecture
• YARN has two main components: the Scheduler and the ApplicationManager.
• The Scheduler does the scheduling and allocates resources to running applications. It does so based on the abstract notion of a resource container, which incorporates elements such as memory, CPU, disk, network, etc.
• The ApplicationMgr is responsible for accepting job submissions, creating the AppMaster container and restarting the AppMaster in case of failure.
• The NodeMgr is the per-machine framework agent, which is responsible for containers, monitoring their resource usage and reporting the same to the RM.
MapReduce
MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters in a reliable, fault-tolerant manner.
Features:
• Accessibility
• Flexibility
• Reliability
MapReduce deep dive
The reduce phase can start before all mappers have finished. Reducers can begin the shuffle early, since it is only data transfer, but not the sort and reduce steps. The property mapreduce.job.reduce.slowstart.completedmaps controls when reducers are started.
“Big” Ideas in MapReduce
• Move computations to the data
  – Do not use/assume a high-bandwidth interconnection between nodes
  – If possible, avoid or reduce the need for data transfer over the network, as this is often the bottleneck to scaling
• Process data in sequential order, avoid random access
  – Same idea as applied in RDBMSs
  – However, assume data can be processed in any order
  – All data are packed into blocks (64 MB or 128 MB by default)
“Big” Ideas in MapReduce
• Scale out instead of scale up
  – The same code can run with 1 node or 100 nodes
• Hide system details from the user
  – Provide abstractions for writing parallel code
    • Mapper and reducer
    • Partitioner and combiner
  – Isolate the developer from (and keep code independent of) system hardware details
  – Once all required components are specified, data is automatically “sliced” and processed in parallel
How does MapReduce work in Hadoop?
• The computation is broken down into two major steps:
  – Map instances:
    • Process the data stored within one data block sequentially
    • The result is in the format of a list of key-value pairs
  – Reduce instances:
    • Collect (key, value) pairs emitted by Map instances
    • Pairs with the same key are sent to the same reducer
    • Each Reducer processes the key-value pairs received and writes output to a file
• The user is only required to develop functions for Map and Reduce
  – The workload distribution is handled automatically by the Hadoop cluster
• Either the “Key” or the “Value” in a key-value pair can be any type of data
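
The two-step flow described on this slide can be imitated on a single machine. The following is a plain-Python sketch, not Hadoop code: the function names and the sample "blocks" are made up for illustration, and the grouping function stands in for what the framework does between Map and Reduce.

```python
from collections import defaultdict


def map_instance(block):
    """Process one data block sequentially and emit (key, value) pairs."""
    for line in block:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Group values by key so each key reaches exactly one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_instance(key, values):
    """Combine all values received for one key."""
    return key, sum(values)


# Two hypothetical "blocks" of input text.
blocks = [["the quick fox", "the lazy dog"], ["the end"]]

emitted = [pair for block in blocks for pair in map_instance(block)]
result = dict(reduce_instance(k, v) for k, v in shuffle(emitted).items())
print(result["the"])  # 3
```

Only `map_instance` and `reduce_instance` correspond to code a user would write; everything else is handled by the cluster.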
WordCount Example
• Read text files and count how often words occur
  – The input is text files
  – The output is a text file
    • Each line: word, tab, count
• Map: produce pairs of (word, count)
• Reduce: for each word, sum up the counts
• The AppMaster launches one map task for each map split. Typically there is a map split for each input file.
• Mappers transform the input data into intermediate data. Output pairs do not need to be of the same types as input pairs. A given input pair may map to zero or many output pairs.
• All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer(s) to determine the final output.
• The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job.
• Users can optionally specify a combiner to perform local aggregation of the intermediate outputs. This helps to cut down the amount of data transferred.
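
The effect of a combiner is easiest to see by counting pairs before and after local aggregation. This is a plain-Python sketch with a made-up map split; it only imitates what a combiner does on the mapper side.

```python
from collections import Counter


def map_words(lines):
    """Mapper: emit (word, 1) for every word in the split."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: sum counts per word on the mapper side, before the shuffle."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


split = ["to be or not to be", "to do or not to do"]  # hypothetical map split
raw = map_words(split)
combined = combine(raw)
print(len(raw), len(combined))  # 12 5
```

Twelve pairs shrink to five before anything crosses the network, and the final counts are unchanged because addition can be applied in any grouping.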
WordCount Overview
    import ...

    public class WordCount {

        public static class Map extends Mapper<Object, Text, Text, IntWritable> {
            public void map ...
        }

        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce ...
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount");
            ...
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
WordCount Mapper
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        // Reused for every emitted pair to avoid allocating new objects
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input record; emits (word, 1) for every token
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
WordCount Reducer
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Called once per word with all the counts emitted for it
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
WordCount main
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");

        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);   // the reducer doubles as the combiner
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output folder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
Mapper
• Maps input key-value pairs to a set of intermediate key-value pairs
  – Class for individual tasks to run
  – One mapper task per InputSplit
  – The map function is called automatically for each key-value pair
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
Partitioner and Combiner
• The intermediate output generated by the Mapper will be sorted and partitioned based on the number of reducers
• Partitioner (optional implementation): determines how to group intermediate key-value pairs for each reducer
• Combiner (optional implementation): combines key-value pairs with the same key
  – Think of it as running a reducer locally
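
A partitioner only has to guarantee that equal keys land on the same reducer. The sketch below shows one common way to do that, hashing the key modulo the number of reducers. It is an illustration in plain Python: `zlib.crc32` is used here as a stand-in hash, and the function names are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index in [0, num_reducers)."""
    # crc32 is a stand-in for the key's hash code; it is stable across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def split_by_reducer(pairs, num_reducers):
    """Sort map output by key and place each pair in its reducer's partition."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in sorted(pairs):
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions


pairs = [("fox", 1), ("the", 2), ("dog", 1), ("the", 1)]
parts = split_by_reducer(pairs, 3)
print(len(parts))  # 3, one partition per reducer (some may be empty)
```

Because the partition index depends only on the key, both ("the", ...) pairs end up in the same partition regardless of which mapper produced them.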
Reducer
• Reduces the set of values of the same key to a smaller set
  – Each reducer will process the subset generated by the partitioner
  – Each reducer will generate an output file
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
Shuffle and Sort
• There are two other phases during the reduce step
  – Shuffle: gathering the partitions from each mapper
  – Sort: merging and sorting the partitions gathered from multiple mappers
• These two phases usually run simultaneously
• A common cause of bottlenecks, as this may involve large data movement
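
Since each mapper's partition arrives already sorted, the sort phase is essentially a merge. The sketch below imitates that on one machine with `heapq.merge`; the three input lists are hypothetical partitions fetched for a single reducer.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical sorted partitions fetched from three mappers for one reducer.
from_mapper_1 = [("apple", 2), ("cat", 1)]
from_mapper_2 = [("apple", 1), ("bird", 4)]
from_mapper_3 = [("cat", 3)]


def merge_sorted(*partitions):
    """Sort phase: merge already-sorted partitions into one sorted stream."""
    return heapq.merge(*partitions, key=itemgetter(0))


def reduce_stream(stream):
    """Group consecutive pairs with the same key and sum their values."""
    return [(key, sum(value for _, value in group))
            for key, group in groupby(stream, key=itemgetter(0))]


merged = reduce_stream(merge_sorted(from_mapper_1, from_mapper_2, from_mapper_3))
print(merged)  # [('apple', 3), ('bird', 4), ('cat', 4)]
```

The merge is lazy, so it can begin as soon as partitions start arriving, which matches the note that shuffle and sort usually run simultaneously.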
Big data processing platform
• Big data processing platforms like Hadoop and Spark are not simple applications
• They co-locate computations and data
• They need a dedicated storage manager and resource manager
• Deployment requires mapping and configuration to match the underlying hardware
Hadoop Support at TACC: Wrangler
[Figure: Wrangler system architecture across two sites, TACC and Indiana. Each site has a Mass Storage Subsystem (10 PB, replicated) and an Access & Analysis System (96 nodes at TACC, 24 nodes at Indiana; 128 GB+ memory, Haswell CPUs). Other labels: High Speed Storage System (500+ TB, 1 TB/s, 250M+ IOPS); interconnect with 1 TB/s throughput; IB interconnect, 120 lanes (56 Gb/s), non-blocking; 40 Gb/s Ethernet; 100 Gbps public network; Globus.]
• A direct-attached PCI interface allows access to the NAND flash
• Not limited by the networking connection
• Flash storage is not tied to individual nodes
• The Hadoop cluster can be dynamically created over 2 to 48 nodes for each project to use in its allocated time
• Each node has access to 4 TB of flash storage across four channels
• The Hadoop cluster is accessible via idev, batch job submission and VNC sessions
Filesystems on Wrangler
• /home
  – Home directory for each user
  – Small area for configuration files and programs/source code
• /work
  – Each user gets 1 TB of space on work, shared across all systems at TACC
  – Your work directory is stored in the $WORK environment variable for Wrangler
• /data
  – Staging input and output files
  – Only available on Wrangler
  – Supports allocation of a shared project directory
Get Started with Hadoop on Wrangler
• Step 1: create a Hadoop reservation through the Wrangler data portal
  – What do you need?
    • Any web browser
• Step 2: access your Hadoop cluster and submit jobs
  – What do you need?
    • Secure shell client
    • Any VNC client
On the project page choose: Manage -> Create Hadoop Reservation
• The number of nodes (1 ~ 10) to be used for the Hadoop cluster
• Duration (1-30 days)
• Scheduled start time
About Hadoop Reservations on Wrangler
• Anyone in the project can submit a Hadoop reservation request
• Anyone in the project can access the Hadoop reservation
• The SU on Wrangler is the “node hour”
• A Hadoop reservation requires a minimum of 2 nodes and 1 day, i.e. 48 SU
  – One node in the reservation will be used as the NameNode, resource manager, application master, etc.
  – The rest of the nodes will be used as data nodes
  – Each node will have 4 TB of flash storage mounted as part of HDFS
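
The "48 SU" minimum follows directly from the node-hour definition. A small sketch of that arithmetic, with hypothetical helper names and only the rules stated on this slide:

```python
def reservation_su(nodes, days):
    """Service units for a Hadoop reservation: one SU per node hour."""
    if nodes < 2 or days < 1:
        raise ValueError("a reservation needs at least 2 nodes and 1 day")
    return nodes * days * 24


def data_node_count(nodes):
    """One node hosts the NameNode and other services; the rest are data nodes."""
    return nodes - 1


print(reservation_su(2, 1))    # 48
print(reservation_su(10, 3))   # 720
print(data_node_count(10))     # 9
```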
Check Reservation Status
• The web portal will show the reservation status after the request has been submitted
• More information about the reservation is available through the Slurm command scontrol
Check Hadoop Reservations Using the Command Line
• Once logged on to a Wrangler login node, a user can check the reservation status with the `scontrol` command:
    > scontrol show reservation
• The reservation will include all users from the project
• The first node in the reservation will be used as the NameNode
• The Hadoop cluster will start with a set of default settings
• Users may override most settings, such as replication factor and block size, at run time per application
• A Hadoop cluster with specific settings can be set up upon request
Access Hadoop Reservation
• Once the reservation status is “active”, a user can access it (via Slurm jobs) in multiple ways:
  – VNC job: starts a VNC server session on one of the nodes in the Hadoop cluster
    • Check cluster information and Hadoop job status
    • Applications with a graphical/web user interface
  – idev job: assigns one node in the Hadoop cluster to the user
    • Manage data in and out of the Hadoop cluster
    • Submit Hadoop jobs via the command line
    • Code testing
  – Batch job: submits jobs to the YARN resource manager in the Hadoop cluster
    • Submit a large analysis job
    • Submit a batch of processing jobs to run sequentially
Access Hadoop Cluster with VNC
• Please visit: vis.tacc.utexas.edu
• Choose “TACC User Portal User”
• Enter your credentials
1. Choose the Wrangler tab
2. Set a VNC password (only needed once)
3. Fill in the reservation name: hadoop+TRAINING-HPC+1419, and choose the “hadoop” queue
Access Hadoop Reservation via idev Session
• A user can submit an idev session using the Hadoop reservation
    > idev -r hadoop+JSU-Training
• It defaults to your default project; use the -A allocation_name option to specify the allocation to use
• The default duration for idev is 30 minutes; the -m minutes option specifies the length of the idev session
• Please limit your usage to Hadoop-related tasks. You can also submit idev without using the reservation for non-Hadoop tasks.
Access Hadoop Reservation via idev Session
• Access by secure shell client
  – ssh wrangler.tacc.utexas.edu
  – idev -r hadoop+JSU-Training -m 240
• Access by vis portal
  – Go to vis.tacc.utexas.edu using a web browser
  – Log in with your credentials
  – Go to the Wrangler tab to start VNC sessions using the reservation and the Hadoop queue
Hadoop Distributed File System (HDFS)
• HDFS will be set up with three top-level directories
  – /tmp: publicly writeable, used by many Hadoop-based applications as temporary space
  – /user: all users’ home directories, /user/$USERNAME
  – /var: publicly readable, used by many Hadoop-based applications to store log files, etc.
Working with HDFS
• HDFS has a file system shell
    hadoop fs [commands]
• The file system shell includes a set of commands to work with HDFS
  – Commands are similar to common Linux commands, e.g.
    • > hadoop fs -ls            # to list the content of the default user directory
    • > hadoop fs -mkdir abc     # to make a directory in HDFS
Getting Data in and out of HDFS
• hadoop fs -put local_file [path_in_HDFS]
  – Put a file from your local system into HDFS
  – Each file is stored in one or more “blocks”
    • The default block size is 128 MB
    • The block size used can be overridden by users
• hadoop fs -get path_in_hdfs [path_in_local]
  – Get a file from the Hadoop cluster to the local file system
Other File Shell commands
• -stat: returns stat information of a path
• -cat / -tail: output to stdout
• -setrep: set replication factor
• For a complete list, just run hadoop fs
• Reference and more details: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html
Note: For dfs-specific commands, newer versions of Hadoop have transitioned to using “hdfs” commands. For example: “hadoop dfs -ls” is the same as “hdfs dfs -ls”
HDFS Command Examples
• hdfs fsck <path>
  – File system checking utility. It can identify problems like missing blocks, under-replication, etc.
• hdfs dfsadmin -report
  – Reports basic file system information and statistics; needs admin privileges.
YARN
• YARN: Yet Another Resource Negotiator
  – Manages computing resources within the Hadoop cluster
  – All jobs should be submitted to YARN to run
    • E.g. using either yarn jar or hadoop jar
  – When using other Hadoop-supported applications, such as Spark, please also specify YARN as the resource manager
• YARN commands
  – Show cluster status
  – Help manage jobs running inside the Hadoop cluster
YARN Commands
• yarn application
  – -list: lists applications submitted to YARN; by default it shows active/queued jobs
  – -kill: kills the application specified by its application ID
  – -appStates / -appTypes: filter options
YARN Commands
• yarn node
  – -list: lists the status of data nodes
  – Lets us know if there are fewer live data nodes than expected
• yarn logs
  – Dumps the logs of a finished application
  – -applicationId: specifies which application’s logs to dump
  – -containerId: specifies which container’s logs to dump
Running Hadoop Application
• All Hadoop jobs can be run as a console command
• The basic format is as follows:
    hadoop jar java_jar_name java_class_name [parameters]
  – Users can use -D to specify more Hadoop options
    -Dmapred.map.tasks       # number of map instances to be generated
    -Dmapred.reduce.tasks    # number of reduce instances to be used
How many Mapper and Reducer tasks?
• Unfortunately, it depends on the particular case
• The input parameter is a “suggestion”
• Mapper
  – More or less depends on the input workload
  – InputSplit, InputFormat
    • Number of blocks used to store the input
    • Number of lines in a text file
• Reducer
  – More or less depends on the computation workload and available resources, e.g. 0.95 or 1.75 × available containers
  – Also to consider
    • How many partitions could be generated?
    • How many output splits are desired?
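
The 0.95 / 1.75 rule of thumb for reducers is simple to apply once the number of available containers is known. A minimal sketch; the function name and the figure of 40 containers are assumptions made for illustration.

```python
def suggested_reducers(available_containers, factor=0.95):
    """Rule-of-thumb reducer count: 0.95 or 1.75 times the available containers."""
    if factor not in (0.95, 1.75):
        raise ValueError("the rule of thumb uses 0.95 or 1.75")
    return max(1, round(available_containers * factor))


containers = 40  # hypothetical number of containers available to the job
print(suggested_reducers(containers))        # 38
print(suggested_reducers(containers, 1.75))  # 70
```

With 0.95 all reducers can start in a single wave with a little headroom; with 1.75 the faster nodes pick up a second wave of reducers, which tends to balance the load better at the cost of more tasks.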
Running Wordcount example code
• The Hadoop distribution comes with an example jar which includes a set of exemplar MapReduce code
• To run the wordcount example from that jar file, you can run a command like the following:
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        wordcount \
        -Dmapred.map.tasks=500 \
        -Dmapred.reduce.tasks=256 \
        /tmp/data/enwiki-20120104-pages-articles.xml \
        wiki_wc
  – wordcount: Java class name to run
  – -Dmapred.map.tasks=500: number of mapper instances
  – -Dmapred.reduce.tasks=256: number of reducer instances
  – /tmp/data/enwiki-20120104-pages-articles.xml: input file on HDFS
  – wiki_wc: folder to store the output
Running WordCount Example
• The number of mappers actually created may be limited by the number of data blocks in HDFS
  – In the example, the input file is about 35 GB; with the default block size of 128 MB, the file is stored in 266 blocks in HDFS
  – All mappers may not run at the same time if there are not enough resources
• Each reducer will generate an output file independently of the others
  – So 256 reducers would result in 256 files in the output folder
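
As a quick check of the numbers on this slide, the sketch below relates block count to file size and lists the output files that 256 reducers would produce. The helper names are hypothetical; the `part-r-NNNNN` naming is the standard pattern for reducer output files.

```python
import math


def map_task_count(file_size_mb, block_size_mb=128):
    """Upper bound used on the slide: one map task per HDFS block of the input."""
    return math.ceil(file_size_mb / block_size_mb)


def output_files(num_reducers):
    """Each reducer writes one file, named part-r-NNNNN, in the output folder."""
    return [f"part-r-{i:05d}" for i in range(num_reducers)]


# 266 blocks of 128 MB hold at most 266 * 128 = 34048 MB of data.
print(map_task_count(34048))    # 266
files = output_files(256)
print(len(files), files[238])   # 256 part-r-00238
```

So even though 500 map tasks were requested on the command line, the block count of the input is what bounds the number actually created in this example.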
Running WordCount example
• Each output file is a text file as well
  – Each line contains a word and its count
  – You will notice that within each output file, the words are sorted in alphabetical order
  – You can copy the files out of HDFS using a command like
    • hadoop fs -get /tmp/wiki_wc wiki_wc
  – Or you can view the content of each output file using a command like
    • hadoop fs -cat /tmp/wiki_wc/part-r-00238 | more
Other examples
• There are more examples in the examples jar. You can use the following command to get the list
  – hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
• Here are a few examples:
  – grep: a map/reduce program that counts the matches of a regex in the input
  – pi: a map/reduce program that estimates Pi using a quasi-Monte Carlo method
  – terasort: sorts a large set of randomly generated 100-byte records
  – sudoku: a sudoku solver
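
To give a feel for how the pi example splits work, here is a simplified single-machine sketch of a quasi-Monte Carlo estimate. It is not the code shipped in the examples jar: it uses a Halton sequence over a quarter circle, and `map_task` / `estimate_pi` are illustrative names standing in for the map and reduce roles.

```python
import math


def halton(index, base):
    """Return the index-th element (1-based) of the Halton sequence in a base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def map_task(offset, size):
    """Map role: count how many of `size` points after `offset` fall inside."""
    inside = 0
    for i in range(offset + 1, offset + size + 1):
        x, y = halton(i, 2), halton(i, 3)
        if x * x + y * y <= 1.0:
            inside += 1
    return inside


def estimate_pi(num_maps, points_per_map):
    """Reduce role: add up the per-map counts and convert them to an estimate."""
    inside = sum(map_task(m * points_per_map, points_per_map) for m in range(num_maps))
    return 4.0 * inside / (num_maps * points_per_map)


estimate = estimate_pi(num_maps=10, points_per_map=1000)
print(estimate, abs(estimate - math.pi))
```

Each map task works on its own slice of the point sequence and emits a single count, so almost no data has to move during the shuffle.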
Hands-On Session
1. Use your training account to log in to login.xsede.org (remember to use DUO)
2. Access Wrangler using gsissh
Try It Out
• Example code/data location: /work/03024/chenk/Hadoop_Training_TACC.tar.gz
  Copy or extract it to your home directory, e.g.
    > tar -xvf /work/03024/chenk/Hadoop_Training_TACC.tar.gz
    > cd ~/Hadoop_Training_TACC
• Today’s reservation is: hadoop+TRAINING-HPC+2188
• Use “idev” to access nodes:
    > idev -r hadoop+JSU-Training -m 240
• Take a look at exercise.txt (https://drive.google.com/file/d/0B4PqgCa0ORIgNUNIVWpyMVVIZnc/view?usp=sharing)
  – Try exercise 1 and exercise 2 first
• hadoop fs command syntax
  – hadoop fs -{ls | rm | cat | mkdir | put | get}
Hadoop Streaming
• Enables MR jobs in other scripting languages: Python, Perl, R, C, etc.
• The user needs to provide scripts/programs for Map and Reduce processing
  – The input/output format needs to be compatible with key-value pairs
• Intermediate data are passed through stdin and stdout
  – A trade-off between convenience and performance
Hadoop Streaming API
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -input /path/to/input/in/hdfs \
        -output /path/to/output/in/hdfs \
        -mapper map \
        -reducer reduce \
        -file map \
        -file reduce
  – -input: input file location
  – -output: output file location
  – -mapper: mapper implementation
  – -reducer: reducer implementation
  – -file map: location of the map code on the local file system
  – -file reduce: location of the reduce code on the local file system
• The map and reduce steps can be implemented in any programming language, even as a bash script.
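
As an illustration of what a streaming mapper and reducer look like, here is a word count sketch in Python. The two functions work on any iterable of text lines; the names are illustrative, and the local `sorted` call is an assumption that stands in for Hadoop's shuffle and sort between the two steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word found in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum counts per word; input lines must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorting stands in for the shuffle and sort done by Hadoop.
text = ["the cat", "the hat"]
mapped = sorted(mapper(text))
counts = list(reducer(mapped))
print(counts)  # ['cat\t1', 'hat\t1', 'the\t2']
```

In an actual streaming job each function would live in its own script, read from `sys.stdin` and print its output lines, since intermediate data are passed through stdin and stdout.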
WordCount using Bash with Hadoop streaming
• Putting it together:
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -Dmapred.map.tasks=512 \
        -Dmapred.reduce.tasks=256 \
        -Dstream.num.map.output.key.fields=1 \
        -input /tmp/data/20news-all/alt.atheism \
        -output wiki_wc_bash \
        -mapper ./mapwc.sh \
        -reducer ./reducewc.sh \
        -file ./mapwc.sh \
        -file ./reducewc.sh
  – -Dstream.num.map.output.key.fields specifies the position of the key
Hadoop Streaming
• The default data split behavior of Hadoop is by blocks
• However, Hadoop streaming splits data per “line”, so
  – You can use a higher number of mappers
  – Streaming only works with text files
Try It Out
• If you are comfortable with Java
  – Try exercise 3
• Exercise 4 is about Hadoop streaming
  – Can you write a word count example in your favorite programming language?
  – Exemplar code for Python, R and bash is available in the corresponding directories
• Questions to think about
  – Where/how did we specify parallelism?
Mahout
• Machine learning libraries for Hadoop
• Not very comprehensive, but still provides good coverage
• In practice somewhat raw and complex to utilize. Much of the code is specific to the particular dataset being processed (e.g. 20 newsgroups)
• A subset of the analytic methods can also be run from the command line
Running Mahout from Command Line
• Typing mahout will show a list of programs that can be run
• Some of potential interest
  – buildforest: build the random forest classifier
  – kmeans: K-means clustering
  – recommenditembased: compute recommendations using item-based collaborative filtering
  – runlogistic: run a logistic regression model against CSV data
  – svd: Lanczos Singular Value Decomposition
  – …
Walkthrough of using Mahout for K-means clustering
• Dataset: Reuters-21578 news data
  – wget https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
  – mkdir reuters-sgm
  – tar -zxvf reuters21578.tar.gz -C reuters-sgm
• Step 1: prepare the files and move them to HDFS
  – $ mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-sgm reuters-out
  – $ hadoop fs -put reuters-out reuters-sgm-extract
• Step 2: convert the extracted text files into sequence file format in Mahout
  – $ mahout seqdirectory -i reuters-sgm-extract -o reuters-seqdir -c UTF-8 -chunk 5
Walkthrough of using Mahout for K-means clustering
• Hadoop Sequence File
  – Sequence of records, where each record is a <Key, Value> pair, e.g.
    • <Key1, Value1>
    • <Key2, Value2>
• For this example
  – Key <- document ID
  – Value <- content of the document
Walkthrough of using Mahout for K-means clustering
• Step 3: create a vector representation from the sequence file
    > mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-vectors
• This step creates the term frequency/inverse document frequency (TF-IDF) matrix for the document set
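
For readers new to TF-IDF, the sketch below computes one common variant, tf × log(N / df), on three made-up documents. This is an illustration of the idea only; the exact weighting and normalization Mahout applies may differ.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in term_freq.items()})
    return vectors


docs = ["oil prices rise", "oil exports fall", "grain prices fall"]  # hypothetical
vectors = tfidf(docs)
print(vectors[0]["oil"] < vectors[0]["rise"])  # True
```

A term that appears in many documents ("oil") gets a lower weight than one that is specific to a single document ("rise"), which is what makes the vectors useful for clustering.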
Walkthrough of using Mahout for K-means clustering
• Step 4: run k-means clustering with Mahout
    mahout kmeans \
        -i reuters-seqdir-vectors/tfidf-vectors/ \
        -c reuters-kmeans-clusters \
        -o reuters-kmeans \
        -x 10 -k 20
• Step 5: check the result
    mahout clusterdump \
        -i reuters-kmeans/clusters-* \
        -d reuters-seqdir-vectors/dictionary.file-0 \
        -dt sequencefile -b 100 -n 20 \
        -o ./cluster-output.txt
Walkthrough of using Mahout for K-means clustering
• You can get the list of options of each program with
  – mahout [program_name]
• Options for k-means in Mahout
  – --input (-i) input: Path to job input directory.
  – --clusters (-c) clusters: The input centroids, as Vectors. (optional)
  – --k (-k) k: The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path.
  – --output (-o) output: The directory pathname for output.
  – --distanceMeasure (-dm) distanceMeasure: The classname of the DistanceMeasure. Default is SquaredEuclidean
  – --convergenceDelta (-cd) convergenceDelta: The convergence delta value. Default is 0.5
  – --maxIter (-x) maxIter: The maximum number of iterations.
  – --maxRed (-r) maxRed: The number of reduce tasks. Defaults to 2
  – --overwrite (-ow): If present, overwrite the output directory before running job
  – --help (-h): Print out help
  – --clustering (-cl): If present, run clustering after the iterations have taken place
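
To show what the k, maxIter and convergenceDelta options control, here is a small single-machine k-means sketch (Lloyd's algorithm) in plain Python. It is not Mahout's implementation; the parameter names mirror the options above, the sample points are made up, and how Mahout applies the convergence test may differ in detail.

```python
import random


def squared_euclidean(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: squared_euclidean(point, centroids[c]))


def kmeans(points, k, max_iter=10, convergence_delta=0.5, seed=0):
    """Return (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # like -k: random initial centroids
    for _ in range(max_iter):                      # like -x: cap on iterations
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        shift = max(squared_euclidean(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:             # like -cd: stop once centroids settle
            break
    return centroids, [nearest(p, centroids) for p in points]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

In the distributed version, the assignment step maps naturally onto mappers (each point finds its nearest centroid) and the update step onto reducers (each cluster averages its points), repeated once per iteration.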
Other Packages and Tools
• HBase
  – Open source implementation of BigTable
  – Columnar, append-only key-value store
  – Built on HDFS but not MR. Provides fast random lookup for HDFS data.
• Spark
  – A different programming model with Hadoop
  – Distributed collections of objects that can be cached in memory across cluster nodes
  – Currently supported via the YARN resource manager
• XSEDE machines provide a broad range of capabilities. Please let us know about other tools/packages you may need.
Summary
• The Hadoop framework is extensively used for scalable distributed processing of large datasets.
• Several tools are available that leverage the framework.
• Hadoop Streaming can be used to write mapper/reduce functions in any language.
• Mahout with Hadoop provides scalable machine learning tools.
• Tools like Spark have expanded capabilities, taking advantage of in-memory storage/caching. They can also go beyond the MapReduce paradigm.
• Please fill out the survey: http://bit.ly/xsedejackson