© Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent. Cloudera Developer Training for Apache Hadoop 201212

Cloudera_Developer_Training

Dec 04, 2014


datadisk10

Transcript
Page 1: Cloudera_Developer_Training

Cloudera Developer Training for Apache Hadoop

201212

Page 2: Cloudera_Developer_Training

Introduction
Chapter 1

Page 3: Cloudera_Developer_Training

Course Chapters

Course sections:
• Course Introduction (current)
• Introduction to Apache Hadoop and its Ecosystem
• Basic Programming with the Hadoop Core API
• Problem Solving with MapReduce
• The Hadoop Ecosystem
• Course Conclusion and Appendices

Chapters:
• Introduction (current)
• The Motivation for Hadoop
• Hadoop: Basic Concepts
• Writing a MapReduce Program
• Unit Testing MapReduce Programs
• Delving Deeper into the Hadoop API
• Practical Development Tips and Techniques
• Data Input and Output
• Common MapReduce Algorithms
• Joining Data Sets in MapReduce Jobs
• Integrating Hadoop into the Enterprise Workflow
• Machine Learning and Mahout
• An Introduction to Hive and Pig
• An Introduction to Oozie
• Conclusion
• Appendix: Cloudera Enterprise
• Appendix: Graph Manipulation in MapReduce

Page 4: Cloudera_Developer_Training

Chapter Topics

Course Introduction
Introduction

• About this course (current)
• About Cloudera
• Course logistics

Page 5: Cloudera_Developer_Training

Course Objectives

During this course, you will learn:

• The core technologies of Hadoop
• How HDFS and MapReduce work
• How to develop MapReduce applications
• How to unit test MapReduce applications
• How to use MapReduce combiners, partitioners, and the distributed cache
• Best practices for developing and debugging MapReduce applications
• How to implement data input and output in MapReduce applications

Page 6: Cloudera_Developer_Training

Course Objectives (cont’d)

• Algorithms for common MapReduce tasks
• How to join data sets in MapReduce
• How Hadoop integrates into the data center
• How to use Mahout’s Machine Learning algorithms
• How Hive and Pig can be used for rapid application development
• How to create large workflows using Oozie

Page 7: Cloudera_Developer_Training

Chapter Topics

Course Introduction
Introduction

• About this course
• About Cloudera (current)
• Course logistics

Page 8: Cloudera_Developer_Training

About Cloudera

• Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
• Provides consulting and training services for Hadoop users
• Staff includes committers to virtually all Hadoop projects
• Many authors of industry standard books on Apache Hadoop projects
  – Lars George, Tom White, Eric Sammer, etc.

Page 9: Cloudera_Developer_Training

Cloudera Software

• Cloudera’s Distribution, including Apache Hadoop (CDH)
  – A set of easy-to-install packages built from the Apache Hadoop core repository, integrated with several additional open source Hadoop ecosystem projects
  – Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
  – 100% open source
• Cloudera Manager, Free Edition
  – The easiest way to deploy a Hadoop cluster
  – Automates installation of Hadoop software
  – Installation, monitoring and configuration are performed from a central machine
  – Manages up to 50 nodes
  – Completely free

Page 10: Cloudera_Developer_Training

Cloudera Enterprise

• Cloudera Enterprise Core
  – Complete package of software and support
  – Built on top of CDH
  – Includes full version of Cloudera Manager
  – Install, manage, and maintain a cluster of any size
  – LDAP integration
  – Resource consumption tracking
  – Proactive health checks
  – Alerting
  – Configuration change audit trails
  – And more
• Cloudera Enterprise RTD
  – Includes support for Apache HBase

Page 11: Cloudera_Developer_Training

Cloudera Services

• Provides consultancy and support services to many key users of Hadoop
  – Including eBay, JPMorganChase, Experian, Groupon, Morgan Stanley, Nokia, Orbitz, National Cancer Institute, RIM, The Walt Disney Company…
• Solutions Architects are experts in Hadoop and related technologies
  – Many are committers to the Apache Hadoop and ecosystem projects
• Provides training in key areas of Hadoop administration and development
  – Courses include System Administrator training, Developer training, Hive and Pig training, HBase Training, Essentials for Managers
  – Custom course development available
  – Both public and on-site training available

Page 12: Cloudera_Developer_Training

Chapter Topics

Course Introduction
Introduction

• About this course
• About Cloudera
• Course logistics (current)

Page 13: Cloudera_Developer_Training

Logistics

• Course start and end times
• Lunch
• Breaks
• Restrooms
• Can I come in early/stay late?
• Certification

Page 14: Cloudera_Developer_Training

Introductions

• About your instructor
• About you
  – Experience with Hadoop?
  – Experience as a developer?
  – What programming languages do you use?
  – Expectations from the course?

Page 15: Cloudera_Developer_Training

The Motivation for Hadoop
Chapter 2

Page 16: Cloudera_Developer_Training

Course Chapters

Course sections:
• Course Introduction
• Introduction to Apache Hadoop and its Ecosystem (current)
• Basic Programming with the Hadoop Core API
• Problem Solving with MapReduce
• The Hadoop Ecosystem
• Course Conclusion and Appendices

Chapters:
• Introduction
• The Motivation for Hadoop (current)
• Hadoop: Basic Concepts
• Writing a MapReduce Program
• Unit Testing MapReduce Programs
• Delving Deeper into the Hadoop API
• Practical Development Tips and Techniques
• Data Input and Output
• Common MapReduce Algorithms
• Joining Data Sets in MapReduce Jobs
• Integrating Hadoop into the Enterprise Workflow
• Machine Learning and Mahout
• An Introduction to Hive and Pig
• An Introduction to Oozie
• Conclusion
• Appendix: Cloudera Enterprise
• Appendix: Graph Manipulation in MapReduce

Page 17: Cloudera_Developer_Training

The Motivation For Hadoop

In this chapter you will learn

• What problems exist with traditional large-scale computing systems
• What requirements an alternative approach should have
• How Hadoop addresses those requirements

Page 18: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
The Motivation for Hadoop

• Problems with traditional large-scale systems (current)
• Requirements for a new approach
• Introducing Hadoop
• Conclusion

Page 19: Cloudera_Developer_Training

Traditional Large-Scale Computation

• Traditionally, computation has been processor-bound
  – Relatively small amounts of data
  – Significant amount of complex processing performed on that data
• For decades, the primary push was to increase the computing power of a single machine
  – Faster processor, more RAM
• Distributed systems evolved to allow developers to use multiple machines for a single job
  – MPI
  – PVM
  – Condor

MPI: Message Passing Interface
PVM: Parallel Virtual Machine

Page 20: Cloudera_Developer_Training

Distributed Systems: Problems

• Programming for traditional distributed systems is complex
  – Data exchange requires synchronization
  – Finite bandwidth is available
  – Temporal dependencies are complicated
  – It is difficult to deal with partial failures of the system
• Ken Arnold, CORBA designer:
  – “Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure”
  – Developers spend more time designing for failure than they do actually working on the problem itself

CORBA: Common Object Request Broker Architecture

Page 21: Cloudera_Developer_Training

Distributed Systems: Data Storage

• Typically, data for a distributed system is stored on a SAN
• At compute time, data is copied to the compute nodes
• Fine for relatively limited amounts of data

SAN: Storage Area Network

Page 22: Cloudera_Developer_Training

The Data-Driven World

• Modern systems have to deal with far more data than was the case in the past
  – Organizations are generating huge amounts of data
  – That data has inherent value, and cannot be discarded
• Examples:
  – Facebook – over 70PB of data
  – eBay – over 5PB of data
• Many organizations are generating data at a rate of terabytes per day

Page 23: Cloudera_Developer_Training

Data Becomes the Bottleneck

• Moore’s Law has held firm for over 40 years
  – Processing power doubles every two years
  – Processing speed is no longer the problem
• Getting the data to the processors becomes the bottleneck
• Quick calculation
  – Typical disk data transfer rate: 75MB/sec
  – Time taken to transfer 100GB of data to the processor: approx 22 minutes!
    – Assuming sustained reads
    – Actual time will be worse, since most servers have less than 100GB of RAM available
• A new approach is needed
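The quick calculation on this slide can be reproduced with a few lines of Java. This is only a sketch: the class and method names are illustrative, and it assumes 1 GB = 1024 MB and a perfectly sustained sequential read. Under those assumptions, 100 × 1024 / 75 is about 1365 seconds, which is in line with the slide's figure of roughly 22 minutes.

```java
public class TransferTime {
    // Seconds needed to stream `gigabytes` of data at `megabytesPerSecond`,
    // assuming 1 GB = 1024 MB and a sustained sequential read.
    public static double seconds(double gigabytes, double megabytesPerSecond) {
        return gigabytes * 1024.0 / megabytesPerSecond;
    }

    public static void main(String[] args) {
        double secs = seconds(100, 75);
        System.out.println(secs + " seconds");
        System.out.println(secs / 60.0 + " minutes");
    }
}
```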

Page 24: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
The Motivation for Hadoop

• Problems with traditional large-scale systems
• Requirements for a new approach (current)
• Introducing Hadoop
• Conclusion

Page 25: Cloudera_Developer_Training

Partial Failure Support

• The system must support partial failure
  – Failure of a component should result in a graceful degradation of application performance
  – Not complete failure of the entire system

Page 26: Cloudera_Developer_Training

Data Recoverability

• If a component of the system fails, its workload should be assumed by still-functioning units in the system
  – Failure should not result in the loss of any data

Page 27: Cloudera_Developer_Training

Component Recovery

• If a component of the system fails and then recovers, it should be able to rejoin the system
  – Without requiring a full restart of the entire system

Page 28: Cloudera_Developer_Training

Consistency

• Component failures during execution of a job should not affect the outcome of the job

Page 29: Cloudera_Developer_Training

Scalability

• Adding load to the system should result in a graceful decline in performance of individual jobs
  – Not failure of the system
• Increasing resources should support a proportional increase in load capacity

Page 30: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
The Motivation for Hadoop

• Problems with traditional large-scale systems
• Requirements for a new approach
• Introducing Hadoop (current)
• Conclusion

Page 31: Cloudera_Developer_Training

Hadoop’s History

• Hadoop is based on work done by Google in the late 1990s/early 2000s
  – Specifically, on papers describing the Google File System (GFS) published in 2003, and MapReduce published in 2004
• This work takes a radical new approach to the problem of distributed computing
  – Meets all the requirements we have for reliability and scalability
• Core concept: distribute the data as it is initially stored in the system
  – Individual nodes can work on data local to those nodes
  – No data transfer over the network is required for initial processing

Page 32: Cloudera_Developer_Training

Core Hadoop Concepts

• Applications are written in high-level code
  – Developers need not worry about network programming, temporal dependencies or low-level infrastructure
• Nodes talk to each other as little as possible
  – Developers should not write code which communicates between nodes
  – ‘Shared nothing’ architecture
• Data is spread among machines in advance
  – Computation happens where the data is stored, wherever possible
  – Data is replicated multiple times on the system for increased availability and reliability

Page 33: Cloudera_Developer_Training

Hadoop: Very High-Level Overview

• When data is loaded into the system, it is split into ‘blocks’
  – Typically 64MB or 128MB
• Map tasks (the first part of the MapReduce system) work on relatively small portions of data
  – Typically a single block
• A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible
  – Many nodes work in parallel, each on their own part of the overall dataset
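To make the block arithmetic concrete, the sketch below computes how many blocks a file of a given size would occupy and how large the final block is. The class, constant and method names are hypothetical and this is not Hadoop code; it simply assumes fixed-size blocks with a shorter last block. Since a Map task typically processes one block, the block count also approximates the number of Map tasks for the file.

```java
public class BlockMath {
    public static final long MB = 1024L * 1024L;

    // Number of blocks needed to hold fileBytes when each block holds blockBytes.
    public static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes <= 0) {
            return 0;
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Size of the final block; it is shorter than blockBytes unless the file
    // size is an exact multiple of the block size.
    public static long lastBlockBytes(long fileBytes, long blockBytes) {
        if (fileBytes <= 0) {
            return 0;
        }
        long remainder = fileBytes % blockBytes;
        return remainder == 0 ? blockBytes : remainder;
    }

    public static void main(String[] args) {
        long file = 200 * MB;
        long block = 64 * MB;
        System.out.println("blocks: " + blockCount(file, block));
        System.out.println("last block MB: " + lastBlockBytes(file, block) / MB);
    }
}
```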

Page 34: Cloudera_Developer_Training

Fault Tolerance

• If a node fails, the master will detect that failure and re-assign the work to a different node on the system
• Restarting a task does not require communication with nodes working on other portions of the data
• If a failed node restarts, it is automatically added back to the system and assigned new tasks
• If a node appears to be running slowly, the master can redundantly execute another instance of the same task
  – Results from the first to finish will be used
  – Known as ‘speculative execution’
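The re-assignment idea can be illustrated with a small, self-contained sketch. It is not how Hadoop implements failure handling; the class name, the round-robin choice of replacement node, and the task and node names are all assumptions made for illustration. The point it shows is that only the failed node's tasks move, and no other task is touched.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaskReassignment {
    // Returns a copy of the task -> node assignment in which every task that
    // was running on failedNode is moved to one of the live nodes (round-robin).
    // Tasks on healthy nodes are left exactly where they were.
    public static Map<String, String> reassign(Map<String, String> assignment,
                                               String failedNode,
                                               List<String> liveNodes) {
        Map<String, String> result = new TreeMap<>(assignment);
        int next = 0;
        for (Map.Entry<String, String> entry : result.entrySet()) {
            if (entry.getValue().equals(failedNode)) {
                entry.setValue(liveNodes.get(next % liveNodes.size()));
                next++;
            }
        }
        return result;
    }
}
```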

Page 35: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
The Motivation for Hadoop

• Problems with traditional large-scale systems
• Requirements for a new approach
• Introducing Hadoop
• Conclusion (current)

Page 36: Cloudera_Developer_Training

Conclusion

In this chapter you have learned

• What problems exist with traditional large-scale computing systems
• What requirements an alternative approach should have
• How Hadoop addresses those requirements

Page 37: Cloudera_Developer_Training

Hadoop: Basic Concepts
Chapter 3

Page 38: Cloudera_Developer_Training

Course Chapters

Course sections:
• Course Introduction
• Introduction to Apache Hadoop and its Ecosystem (current)
• Basic Programming with the Hadoop Core API
• Problem Solving with MapReduce
• The Hadoop Ecosystem
• Course Conclusion and Appendices

Chapters:
• Introduction
• The Motivation for Hadoop
• Hadoop: Basic Concepts (current)
• Writing a MapReduce Program
• Unit Testing MapReduce Programs
• Delving Deeper into the Hadoop API
• Practical Development Tips and Techniques
• Data Input and Output
• Common MapReduce Algorithms
• Joining Data Sets in MapReduce Jobs
• Integrating Hadoop into the Enterprise Workflow
• Machine Learning and Mahout
• An Introduction to Hive and Pig
• An Introduction to Oozie
• Conclusion
• Appendix: Cloudera Enterprise
• Appendix: Graph Manipulation in MapReduce

Page 39: Cloudera_Developer_Training

Hadoop: Basic Concepts

In this chapter you will learn

• What Hadoop is
• What features the Hadoop Distributed File System (HDFS) provides
• The concepts behind MapReduce
• How a Hadoop cluster operates
• What other Hadoop Ecosystem projects exist

Page 40: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts

• The Hadoop project and Hadoop components (current)
• The Hadoop Distributed File System (HDFS)
• Hands-On Exercise: Using HDFS
• How MapReduce works
• Hands-On Exercise: Running a MapReduce Job
• How a Hadoop cluster operates
• Other Hadoop ecosystem components
• Conclusion

Page 41: Cloudera_Developer_Training

The Hadoop Project

• Hadoop is an open-source project overseen by the Apache Software Foundation
• Originally based on papers published by Google in 2003 and 2004
• Hadoop committers work at several different organizations
  – Including Cloudera, Yahoo!, Facebook, LinkedIn

Page 42: Cloudera_Developer_Training

Hadoop Components

• Hadoop consists of two core components
  – The Hadoop Distributed File System (HDFS)
  – MapReduce
• There are many other projects based around core Hadoop
  – Often referred to as the ‘Hadoop Ecosystem’
  – Pig, Hive, HBase, Flume, Oozie, Sqoop, etc
  – Many are discussed later in the course
• A set of machines running HDFS and MapReduce is known as a Hadoop Cluster
  – Individual machines are known as nodes
  – A cluster can have as few as one node, as many as several thousand
  – More nodes = better performance!

Page 43: Cloudera_Developer_Training

Hadoop Components: HDFS

• HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
• Data is split into blocks and distributed across multiple nodes in the cluster
  – Each block is typically 64MB or 128MB in size
• Each block is replicated multiple times
  – Default is to replicate each block three times
  – Replicas are stored on different nodes
  – This ensures both reliability and availability

Page 44: Cloudera_Developer_Training

Hadoop Components: MapReduce

• MapReduce is the system used to process data in the Hadoop cluster
• Consists of two phases: Map, and then Reduce
  – Between the two is a stage known as the shuffle and sort
• Each Map task operates on a discrete portion of the overall dataset
  – Typically one HDFS block of data
• After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase
  – Much more on this later!
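The three stages named on this slide (Map, shuffle and sort, Reduce) can be sketched in plain Java on a single machine. The example below is an illustration only: it does not use the Hadoop API, it runs everything in one process, and the word-count job and all names in it are assumptions chosen because word counting is a familiar way to show the data flow.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class MiniMapReduce {
    // Map phase: one input line in, zero or more (word, 1) pairs out.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle and sort: group every intermediate value under its key,
    // with keys held in sorted order.
    public static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values for one key into a single result.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three stages in order over the given input lines.
    public static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }
}
```

On a real cluster each call to `map` would run as a separate task near its block of data, and the grouped keys would be partitioned across several Reducers rather than processed in one loop.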

Page 45: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts

• The Hadoop project and Hadoop components
• The Hadoop Distributed File System (HDFS) (current)
• Hands-On Exercise: Using HDFS
• How MapReduce works
• Hands-On Exercise: Running a MapReduce Job
• How a Hadoop cluster operates
• Other Hadoop ecosystem components
• Conclusion

Page 46: Cloudera_Developer_Training

HDFS Basic Concepts

• HDFS is a filesystem written in Java
  – Based on Google’s GFS
• Sits on top of a native filesystem
  – Such as ext3, ext4 or xfs
• Provides redundant storage for massive amounts of data
  – Using readily-available, industry-standard computers

Page 47: Cloudera_Developer_Training

HDFS Basic Concepts (cont’d)

• HDFS performs best with a ‘modest’ number of large files
  – Millions, rather than billions, of files
  – Each file typically 100MB or more
• Files in HDFS are ‘write once’
  – No random writes to files are allowed
• HDFS is optimized for large, streaming reads of files
  – Rather than random reads

Page 48: Cloudera_Developer_Training

How Files Are Stored

• Files are split into blocks
  – Each block is usually 64MB or 128MB
• Data is distributed across many machines at load time
  – Different blocks from the same file will be stored on different machines
  – This provides for efficient MapReduce processing (see later)
• Blocks are replicated across multiple machines, known as DataNodes
  – Default replication is three-fold
  – Meaning that each block exists on three different machines
• A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are located
  – Known as the metadata

Page 49: Cloudera_Developer_Training

How Files Are Stored: Example

• NameNode holds metadata for the two files (Foo.txt and Bar.txt)
• DataNodes hold the actual blocks
  – Each block will be 64MB or 128MB in size
  – Each block is replicated three times on the cluster

[Diagram]
NameNode metadata:
  Foo.txt: blk_001, blk_002, blk_003
  Bar.txt: blk_004, blk_005
DataNodes: each of blk_001 through blk_005 is stored on three different DataNodes.
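The metadata in this example can be pictured as two lookup tables: file name to ordered block list, and block to the set of DataNodes holding a replica. The sketch below is a simplified in-memory model, not the NameNode's actual data structures. The file and block names come from the slide, but the class name, the five node names (`node1` to `node5`) and the placement rule are hypothetical, because the original diagram's block-to-node layout is not recoverable from this transcript.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NameNodeMetadata {
    // File name -> ordered list of block IDs that make up the file.
    private final Map<String, List<String>> fileToBlocks = new LinkedHashMap<>();
    // Block ID -> names of the DataNodes that hold a replica of it.
    private final Map<String, Set<String>> blockToNodes = new HashMap<>();

    public void addFile(String file, List<String> blocks) {
        fileToBlocks.put(file, new ArrayList<>(blocks));
    }

    public void addReplica(String block, String node) {
        blockToNodes.computeIfAbsent(block, b -> new TreeSet<>()).add(node);
    }

    public List<String> blocksOf(String file) {
        return fileToBlocks.getOrDefault(file, Collections.emptyList());
    }

    public int replicaCount(String block) {
        return blockToNodes.getOrDefault(block, Collections.emptySet()).size();
    }

    // Builds the two files from the slide. The placement below (block i on
    // three consecutive nodes out of five) is an illustrative assumption.
    public static NameNodeMetadata slideExample() {
        NameNodeMetadata meta = new NameNodeMetadata();
        meta.addFile("Foo.txt", Arrays.asList("blk_001", "blk_002", "blk_003"));
        meta.addFile("Bar.txt", Arrays.asList("blk_004", "blk_005"));
        String[] blocks = {"blk_001", "blk_002", "blk_003", "blk_004", "blk_005"};
        for (int i = 0; i < blocks.length; i++) {
            for (int replica = 0; replica < 3; replica++) {
                meta.addReplica(blocks[i], "node" + ((i + replica) % 5 + 1));
            }
        }
        return meta;
    }
}
```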

Page 50: Cloudera_Developer_Training

More On The HDFS NameNode

• The NameNode daemon must be running at all times
  – If the NameNode stops, the cluster becomes inaccessible
  – Your system administrator will take care to ensure that the NameNode hardware is reliable!
• The NameNode holds all of its metadata in RAM for fast access
  – It keeps a record of changes on disk for crash recovery
• A separate daemon known as the Secondary NameNode takes care of some housekeeping tasks for the NameNode
  – Be careful: The Secondary NameNode is not a backup NameNode!

Page 51: Cloudera_Developer_Training

NameNode High Availability in CDH4

• CDH4 introduced High Availability for the NameNode
• Instead of a single NameNode, there are now two
  – An Active NameNode
  – A Standby NameNode
• If the Active NameNode fails, the Standby NameNode can automatically take over
• The Standby NameNode does the work performed by the Secondary NameNode in ‘classic’ HDFS
  – HA HDFS does not run a Secondary NameNode daemon
• Your system administrator will choose whether to set the cluster up with NameNode High Availability or not

Page 52: Cloudera_Developer_Training

HDFS: Points To Note

• Although files are split into 64MB or 128MB blocks, a file smaller than the block size does not occupy the full 64MB/128MB
• Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop’s configuration files
  – This will be set by the system administrator
• Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster
• When a client application wants to read a file:
  – It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
  – It then communicates directly with the DataNodes to read the data
  – The NameNode will not be a bottleneck
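The two-step read path described above (ask the NameNode for block locations, then fetch the data directly from DataNodes) can be sketched with plain maps standing in for the daemons. This is a toy model under stated assumptions: it is not the HDFS client API, block contents are strings rather than bytes, and the class name, method name and map layouts are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class HdfsReadSketch {
    // fileToBlocks and blockToNodes play the role of NameNode metadata;
    // dataNodes maps a live node name to the blocks it stores (block ID -> content).
    public static String read(String file,
                              Map<String, List<String>> fileToBlocks,
                              Map<String, List<String>> blockToNodes,
                              Map<String, Map<String, String>> dataNodes) {
        // Step 1: metadata lookup. Only block IDs and locations come from the NameNode.
        List<String> blocks = fileToBlocks.get(file);
        if (blocks == null) {
            throw new IllegalArgumentException("no such file: " + file);
        }
        StringBuilder contents = new StringBuilder();
        for (String block : blocks) {
            // Step 2: read the block directly from the first live node that holds it.
            String data = null;
            for (String node : blockToNodes.getOrDefault(block, Collections.emptyList())) {
                Map<String, String> stored = dataNodes.get(node);
                if (stored != null && stored.containsKey(block)) {
                    data = stored.get(block);
                    break;
                }
            }
            if (data == null) {
                throw new IllegalStateException("no live replica of " + block);
            }
            contents.append(data);
        }
        return contents.toString();
    }
}
```

Because the file data never passes through the metadata maps, the sketch mirrors why the NameNode is not in the data path for reads.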

Page 53: Cloudera_Developer_Training

Accessing HDFS

• Applications can read and write HDFS files directly via the Java API
  – Covered later in the course
• Typically, files are created on a local filesystem and must be moved into HDFS
• Likewise, files stored in HDFS may need to be moved to a machine’s local filesystem
• Access to HDFS from the command line is achieved with the hadoop fs command

Page 54: Cloudera_Developer_Training

hadoop fs Examples

• Copy file foo.txt from local disk to the user’s directory in HDFS

    hadoop fs -put foo.txt foo.txt

  – This will copy the file to /user/username/foo.txt
• Get a directory listing of the user’s home directory in HDFS

    hadoop fs -ls

• Get a directory listing of the HDFS root directory

    hadoop fs -ls /

Page 55: Cloudera_Developer_Training

hadoop fs Examples (cont’d)

• Display the contents of the HDFS file /user/fred/bar.txt

    hadoop fs -cat /user/fred/bar.txt

• Copy that file to the local disk, named as baz.txt

    hadoop fs -get /user/fred/bar.txt baz.txt

• Create a directory called input under the user’s home directory

    hadoop fs -mkdir input

Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for get

Page 56: Cloudera_Developer_Training

hadoop fs Examples (cont’d)

• Delete the directory input_old and all its contents

    hadoop fs -rm -r input_old

Page 57: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts

• The Hadoop project and Hadoop components
• The Hadoop Distributed File System (HDFS)
• Hands-On Exercise: Using HDFS (current)
• How MapReduce works
• Hands-On Exercise: Running a MapReduce Job
• How a Hadoop cluster operates
• Other Hadoop ecosystem components
• Conclusion

Page 58: Cloudera_Developer_Training

Aside: The Training Virtual Machine

• During this course, you will perform numerous hands-on exercises using the Cloudera Training Virtual Machine (VM)
• The VM has Hadoop installed in pseudo-distributed mode
  – This essentially means that it is a cluster comprised of a single node
  – Using a pseudo-distributed cluster is the typical way to test your code before you run it on your full cluster
  – It operates almost exactly like a ‘real’ cluster
  – A key difference is that the data replication factor is set to 1, not 3

Page 59: Cloudera_Developer_Training

Hands-On Exercise: Using HDFS

• In this Hands-On Exercise you will gain familiarity with manipulating files in HDFS
• Please refer to the Hands-On Exercise Manual

Page 60: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts

• The Hadoop project and Hadoop components
• The Hadoop Distributed File System (HDFS)
• Hands-On Exercise: Using HDFS
• How MapReduce works (current)
• Hands-On Exercise: Running a MapReduce Job
• How a Hadoop cluster operates
• Other Hadoop ecosystem components
• Conclusion

Page 61: Cloudera_Developer_Training

What Is MapReduce?

• MapReduce is a method for distributing a task across multiple nodes
• Each node processes data stored on that node
  – Where possible
• Consists of two phases:
  – Map
  – Reduce

Page 62: Cloudera_Developer_Training

Features of MapReduce

• Automatic parallelization and distribution
• Fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
  – MapReduce programs are usually written in Java
  – Can be written in any language using Hadoop Streaming (see later)
  – All of Hadoop is written in Java
• MapReduce abstracts all the ‘housekeeping’ away from the developer
  – Developer can concentrate simply on writing the Map and Reduce functions

Page 63: Cloudera_Developer_Training

MapReduce: The Big Picture

Page 64: Cloudera_Developer_Training

MapReduce: The JobTracker

• MapReduce jobs are controlled by a software daemon known as the JobTracker
• The JobTracker resides on a ‘master node’
  – Clients submit MapReduce jobs to the JobTracker
  – The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
  – These nodes each run a software daemon known as the TaskTracker
  – The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker

Page 65: Cloudera_Developer_Training

Aside: MapReduce Version 2

- CDH4 contains ‘standard’ MapReduce (MR1)
- CDH4 also includes MapReduce version 2 (MR2)
  - Also known as YARN (Yet Another Resource Negotiator)
  - A complete rewrite of the Hadoop MapReduce framework
- MR2 is not yet considered production-ready
  - Included in CDH4 as a ‘technology preview’
- Existing code will work with no modification on MR2 clusters when the technology matures
  - Code will need to be re-compiled, but the API remains identical
- For production use, we strongly recommend using MR1

Page 66: Cloudera_Developer_Training

MapReduce: Terminology

- A job is a ‘full program’
  - A complete execution of Mappers and Reducers over a dataset
- A task is the execution of a single Mapper or Reducer over a slice of data
- A task attempt is a particular instance of an attempt to execute a task
  - There will be at least as many task attempts as there are tasks
  - If a task attempt fails, another will be started by the JobTracker
  - Speculative execution (see later) can also result in more task attempts than completed tasks

Page 67: Cloudera_Developer_Training

MapReduce: The Mapper

- Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffic
  - Multiple Mappers run in parallel, each processing a portion of the input data
- The Mapper reads data in the form of key/value pairs
- It outputs zero or more key/value pairs (pseudo-code):

    map(in_key, in_value) -> (inter_key, inter_value) list

Page 68: Cloudera_Developer_Training

MapReduce: The Mapper (cont’d)

- The Mapper may use or completely ignore the input key
  - For example, a standard pattern is to read a line of a file at a time
    - The key is the byte offset into the file at which the line starts
    - The value is the contents of the line itself
    - Typically the key is considered irrelevant
- If the Mapper writes anything out, the output must be in the form of key/value pairs

Page 69: Cloudera_Developer_Training

Example Mapper: Upper Case Mapper

- Turn input into upper case (pseudo-code):

    let map(k, v) = emit(k.toUpper(), v.toUpper())

    ('foo', 'bar') -> ('FOO', 'BAR')
    ('foo', 'other') -> ('FOO', 'OTHER')
    ('baz', 'more data') -> ('BAZ', 'MORE DATA')

Page 70: Cloudera_Developer_Training

Example Mapper: Explode Mapper

- Output each input character separately (pseudo-code):

    let map(k, v) =
      foreach char c in v:
        emit(k, c)

    ('foo', 'bar') -> ('foo', 'b'), ('foo', 'a'), ('foo', 'r')
    ('baz', 'other') -> ('baz', 'o'), ('baz', 't'), ('baz', 'h'), ('baz', 'e'), ('baz', 'r')

Page 71: Cloudera_Developer_Training

Example Mapper: Filter Mapper

- Only output key/value pairs where the input value is a prime number (pseudo-code):

    let map(k, v) =
      if (isPrime(v)) then emit(k, v)

    ('foo', 7) -> ('foo', 7)
    ('baz', 10) -> nothing

Page 72: Cloudera_Developer_Training

Example Mapper: Changing Keyspaces

- The key output by the Mapper does not need to be identical to the input key
- Output the word length as the key (pseudo-code):

    let map(k, v) = emit(v.length(), v)

    ('foo', 'bar') -> (3, 'bar')
    ('baz', 'other') -> (5, 'other')
    ('foo', 'abracadabra') -> (11, 'abracadabra')
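The four example Mappers above can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Hadoop API, and the class name `ExampleMappers` and its method names are hypothetical. Each method returns the list of pairs that the pseudo-code would emit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of the example Mappers; no Hadoop classes are used. */
final class ExampleMappers {

    /** Upper Case Mapper: emits exactly one pair with key and value upper-cased. */
    static List<Map.Entry<String, String>> upperCase(String k, String v) {
        return List.of(Map.entry(k.toUpperCase(), v.toUpperCase()));
    }

    /** Explode Mapper: emits one pair per character of the value. */
    static List<Map.Entry<String, Character>> explode(String k, String v) {
        List<Map.Entry<String, Character>> out = new ArrayList<>();
        for (char c : v.toCharArray()) {
            out.add(Map.entry(k, c));
        }
        return out;
    }

    /** Filter Mapper: emits the pair only when the value is prime, otherwise nothing. */
    static List<Map.Entry<String, Integer>> filterPrimes(String k, int v) {
        if (!isPrime(v)) {
            return List.of();
        }
        return List.of(Map.entry(k, v));
    }

    /** Changing Keyspaces: the output key is the length of the value. */
    static List<Map.Entry<Integer, String>> lengthAsKey(String k, String v) {
        return List.of(Map.entry(v.length(), v));
    }

    /** Trial division; adequate for a small illustration. */
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(upperCase("foo", "bar"));
        System.out.println(explode("foo", "bar"));
        System.out.println(filterPrimes("foo", 7));
        System.out.println(filterPrimes("baz", 10));
        System.out.println(lengthAsKey("foo", "abracadabra"));
    }
}
```

Note that a Mapper may emit zero, one, or many pairs per input record, which is why every method returns a list.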

Page 73: Cloudera_Developer_Training

MapReduce: The Reducer

- After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
- This list is given to a Reducer
  - There may be a single Reducer, or multiple Reducers
    - This is specified as part of the job configuration (see later)
  - All values associated with a particular intermediate key are guaranteed to go to the same Reducer
  - The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
  - This step is known as the ‘shuffle and sort’
- The Reducer outputs zero or more final key/value pairs
  - These are written to HDFS
  - In practice, the Reducer usually emits a single key/value pair for each input key
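The two guarantees above (same key goes to the same Reducer, keys arrive in sorted order) can be sketched in plain Java. This is an in-memory illustration, not the Hadoop implementation: `ShuffleSketch` and its methods are hypothetical names, and hashing the key modulo the number of Reducers is assumed here as one simple way to make the assignment deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of 'shuffle and sort'; no Hadoop classes are used. */
final class ShuffleSketch {

    /** Chooses a Reducer index for a key; the same key always yields the same index. */
    static int partitionFor(String key, int numReducers) {
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * Groups intermediate pairs by key, one group map per Reducer.
     * A TreeMap keeps its keys sorted, so each Reducer sees keys in sorted order.
     */
    static List<TreeMap<String, List<Integer>>> shuffleAndSort(
            List<Map.Entry<String, Integer>> pairs, int numReducers) {
        List<TreeMap<String, List<Integer>>> perReducer = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            perReducer.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : pairs) {
            perReducer.get(partitionFor(pair.getKey(), numReducers))
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return perReducer;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("the", 1), Map.entry("cat", 1), Map.entry("the", 1));
        System.out.println(shuffleAndSort(pairs, 2));
    }
}
```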

Page 74: Cloudera_Developer_Training

Example Reducer: Sum Reducer

- Add up all the values associated with each intermediate key (pseudo-code):

    let reduce(k, vals) =
      sum = 0
      foreach int i in vals:
        sum += i
      emit(k, sum)

    ('bar', [9, 3, -17, 44]) -> ('bar', 39)
    ('foo', [123, 100, 77]) -> ('foo', 300)

Page 75: Cloudera_Developer_Training

Example Reducer: Identity Reducer

- The Identity Reducer is very common (pseudo-code):

    let reduce(k, vals) =
      foreach v in vals:
        emit(k, v)

    ('bar', [123, 100, 77]) -> ('bar', 123), ('bar', 100), ('bar', 77)
    ('foo', [9, 3, -17, 44]) -> ('foo', 9), ('foo', 3), ('foo', -17), ('foo', 44)
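The Sum Reducer and Identity Reducer can be sketched in plain Java in the same way. Again this is an assumption-laden illustration rather than Hadoop code: `ExampleReducers` and its method names are hypothetical, and the value list is a plain `List<Integer>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of the two example Reducers; no Hadoop classes are used. */
final class ExampleReducers {

    /** Sum Reducer: emits a single pair holding the total of all values for the key. */
    static Map.Entry<String, Integer> sum(String k, List<Integer> vals) {
        int sum = 0;
        for (int i : vals) {
            sum += i;
        }
        return Map.entry(k, sum);
    }

    /** Identity Reducer: emits every (key, value) pair unchanged. */
    static List<Map.Entry<String, Integer>> identity(String k, List<Integer> vals) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Integer v : vals) {
            out.add(Map.entry(k, v));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sum("bar", List.of(9, 3, -17, 44)));
        System.out.println(identity("bar", List.of(123, 100, 77)));
    }
}
```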

Page 76: Cloudera_Developer_Training

MapReduce Example: Word Count

- Count the number of occurrences of each word in a large amount of input data
  - This is the ‘hello world’ of MapReduce programming

    map(String input_key, String input_value)
      foreach word w in input_value:
        emit(w, 1)

    reduce(String output_key, Iterator<int> intermediate_vals)
      set count = 0
      foreach v in intermediate_vals:
        count += v
      emit(output_key, count)

Page 77: Cloudera_Developer_Training

MapReduce Example: Word Count (cont’d)

- Input to the Mapper:

    (3414, 'the cat sat on the mat')
    (3437, 'the aardvark sat on the sofa')

- Output from the Mapper:

    ('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1),
    ('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)

Page 78: Cloudera_Developer_Training

MapReduce Example: Word Count (cont’d)

- Intermediate data sent to the Reducer:

    ('aardvark', [1])
    ('cat', [1])
    ('mat', [1])
    ('on', [1, 1])
    ('sat', [1, 1])
    ('sofa', [1])
    ('the', [1, 1, 1, 1])

- Final Reducer output:

    ('aardvark', 1)
    ('cat', 1)
    ('mat', 1)
    ('on', 2)
    ('sat', 2)
    ('sofa', 1)
    ('the', 4)
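The whole Word Count walk-through can be reproduced with a small in-memory sketch. This is plain Java with no Hadoop dependency, and `WordCountSketch` is a hypothetical name; a `TreeMap` stands in for the shuffle and sort, which is an assumption made purely to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory Word Count: map, shuffle and sort, reduce. Not Hadoop code. */
final class WordCountSketch {

    /** Mapper: ignores the byte-offset key and emits (word, 1) for every word. */
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    /** Reducer: sums the list of counts collected for one word. */
    static int reduce(String word, List<Integer> counts) {
        int count = 0;
        for (int v : counts) {
            count += v;
        }
        return count;
    }

    /** Runs every record through map, groups by sorted key, then reduces each group. */
    static TreeMap<String, Integer> run(List<Map.Entry<Long, String>> records) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Long, String> record : records) {
            for (Map.Entry<String, Integer> pair : map(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
                Map.entry(3414L, "the cat sat on the mat"),
                Map.entry(3437L, "the aardvark sat on the sofa"))));
    }
}
```

Run against the two input records from the slides, the result holds the same seven words and counts as the final Reducer output shown above.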

Page 79: Cloudera_Developer_Training

MapReduce: Data Locality

- Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block of data stored locally on that node via HDFS
- If this is not possible, the Map task will have to transfer the data across the network as it processes that data
- Once the Map tasks have finished, data is then transferred across the network to the Reducers
  - Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers
  - All Mappers will, in general, have to communicate with all Reducers

Page 80: Cloudera_Developer_Training

MapReduce: Is Shuffle and Sort a Bottleneck?

- It appears that the shuffle and sort phase is a bottleneck
  - The reduce method in the Reducers cannot start until all Mappers have finished
- In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers finish work
  - This avoids a huge amount of data transfer starting only when the last Mapper finishes
  - Note that this behavior is configurable
    - The developer can specify the percentage of Mappers which should finish before Reducers start retrieving data
  - The developer’s reduce method still does not start until all intermediate data has been transferred and sorted

Page 81: Cloudera_Developer_Training

MapReduce: Is a Slow Mapper a Bottleneck?

- It is possible for one Map task to run more slowly than the others
  - Perhaps due to faulty hardware, or just a very slow machine
- It would appear that this would create a bottleneck
  - The reduce method in the Reducer cannot start until every Mapper has finished
- Hadoop uses speculative execution to mitigate this
  - If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
  - The results of the first Mapper to finish will be used
  - Hadoop will kill off the Mapper which is still running
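The idea behind speculative execution (run a duplicate attempt on the same data, use whichever finishes first, cancel the other) can be illustrated with standard Java concurrency utilities. This is only an analogy, not how Hadoop schedules tasks: `SpeculativeSketch` is a hypothetical name, both attempts are started at once for simplicity (Hadoop starts the duplicate only after it detects a slow task), and sleep times stand in for a slow and a fast machine.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Analogy for speculative execution using ExecutorService.invokeAny. */
final class SpeculativeSketch {

    /** One attempt at the same task: same input, same result, different speed. */
    static Callable<Integer> attempt(String data, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis); // simulated machine speed
            return data.length();      // stand-in for the task's real work
        };
    }

    /** Runs two attempts on the same data and returns the first successful result. */
    static int runWithBackup(String data, long slowMillis, long fastMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // invokeAny returns the result of one attempt that completed successfully
            // and cancels the attempts that have not finished.
            return pool.invokeAny(List.of(
                    attempt(data, slowMillis), attempt(data, fastMillis)));
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("all attempts failed", e);
        } finally {
            pool.shutdownNow(); // 'kill off' anything still running
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithBackup("hello", 500, 10));
    }
}
```

Because both attempts operate on the same data, the result is the same whichever one wins; only the completion time differs.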

Page 82: Cloudera_Developer_Training

Creating and Running a MapReduce Job

- Write the Mapper and Reducer classes
- Write a Driver class that configures the job and submits it to the cluster
  - Driver classes are covered in the next chapter
- Compile the Mapper, Reducer, and Driver classes
  - Example: javac -classpath `hadoop classpath` *.java
- Create a jar file with the Mapper, Reducer, and Driver classes
  - Example: jar cvf foo.jar *.class
- Run the hadoop jar command to submit the job to the Hadoop cluster
  - Example: hadoop jar foo.jar Foo in_dir out_dir

Page 83: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts

- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job (current topic)
- How a Hadoop cluster operates
- Other Hadoop ecosystem components
- Conclusion

Page 84: Cloudera_Developer_Training

Hands-On Exercise: Running A MapReduce Job

- In this Hands-On Exercise, you will run a MapReduce job on your pseudo-distributed Hadoop cluster
- Please refer to the Hands-On Exercise Manual

Page 85: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts

- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates (current topic)
- Other Hadoop ecosystem components
- Conclusion

Page 86: Cloudera_Developer_Training

Installing A Hadoop Cluster

- Cluster installation is usually performed by the system administrator, and is outside the scope of this course
  - Cloudera offers a training course for System Administrators specifically aimed at those responsible for commissioning and maintaining Hadoop clusters
- However, it’s very useful to understand how the component parts of the Hadoop cluster work together
- Typically, a developer will configure their machine to run in pseudo-distributed mode
  - This effectively creates a single-machine cluster
  - All five Hadoop daemons are running on the same machine
  - Very useful for testing code before it is deployed to the real cluster

Page 87: Cloudera_Developer_Training

Installing A Hadoop Cluster (cont’d)

- Easiest way to download and install Hadoop, either for a full cluster or in pseudo-distributed mode, is by using Cloudera’s Distribution, including Apache Hadoop (CDH)
  - Vanilla Hadoop plus many patches, backports, bugfixes
  - Supplied as a Debian package (for Linux distributions such as Ubuntu), an RPM (for CentOS/RedHat Enterprise Linux), and as a tarball
  - Full documentation available at http://cloudera.com/

Page 88: Cloudera_Developer_Training

The Five Hadoop Daemons

- Hadoop comprises five separate daemons
- NameNode
  - Holds the metadata for HDFS
- Secondary NameNode
  - Performs housekeeping functions for the NameNode
  - Is not a backup or hot standby for the NameNode!
- DataNode
  - Stores actual HDFS data blocks
- JobTracker
  - Manages MapReduce jobs, distributes individual tasks to machines running the…
- TaskTracker
  - Instantiates and monitors individual Map and Reduce tasks

Page 89: Cloudera_Developer_Training

The Five Hadoop Daemons (cont’d)

- Each daemon runs in its own Java Virtual Machine (JVM)
- No node on a real cluster will run all five daemons
  - Although this is technically possible
- We can consider nodes to be in two different categories:
  - Master Nodes
    - Run the NameNode, Secondary NameNode, JobTracker daemons
    - Only one of each of these daemons runs on the cluster
  - Slave Nodes
    - Run the DataNode and TaskTracker daemons
      - A slave node will run both of these daemons

Page 90: Cloudera_Developer_Training

Basic Cluster Configuration

- On very small clusters, the NameNode, JobTracker and Secondary NameNode daemons can all reside on a single machine
  - It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes
- Each daemon runs in a separate Java Virtual Machine (JVM)

Page 91: Cloudera_Developer_Training

Submitting A Job

- When a client submits a job, its configuration information is packaged into an XML file
- This file, along with the .jar file containing the actual program code, is handed to the JobTracker
  - The JobTracker then parcels out individual tasks to TaskTracker nodes
  - When a TaskTracker receives a request to run a task, it instantiates a separate JVM for that task
  - TaskTracker nodes can be configured to run multiple tasks at the same time
    - If the node has enough processing power and memory

Page 92: Cloudera_Developer_Training

Submitting A Job (cont’d)

- The intermediate data is held on the TaskTracker’s local disk
- As Reducers start up, the intermediate data is distributed across the network to the Reducers
- Reducers write their final output to HDFS
- Once the job has completed, the TaskTracker can delete the intermediate data from its local disk
  - Note that the intermediate data is not deleted until the entire job completes

Page 93: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts

- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates
- Other Hadoop ecosystem components (current topic)
- Conclusion

Page 94: Cloudera_Developer_Training

Other Ecosystem Projects: Introduction

- The term ‘Hadoop core’ refers to HDFS and MapReduce
- Many other projects exist which use Hadoop core
  - Either both HDFS and MapReduce, or just HDFS
- Most are Apache projects or Apache Incubator projects
  - Some others are not hosted by the Apache Software Foundation
    - These are often hosted on GitHub or a similar repository
- We will investigate many of these projects later in the course
- Following is an introduction to some of the most significant projects

Page 95: Cloudera_Developer_Training

Hive

- Hive is an abstraction on top of MapReduce
- Allows users to query data in the Hadoop cluster without knowing Java or MapReduce
- Uses the HiveQL language
  - Very similar to SQL
- The Hive Interpreter runs on a client machine
  - Turns HiveQL queries into MapReduce jobs
  - Submits those jobs to the cluster
- Note: this does not turn the cluster into a relational database server!
  - It is still simply running MapReduce jobs
  - Those jobs are created by the Hive Interpreter

Page 96: Cloudera_Developer_Training

Hive (cont’d)

- Sample Hive query:

    SELECT stock.product, SUM(orders.purchases)
    FROM stock JOIN orders
    ON (stock.id = orders.stock_id)
    WHERE orders.quarter = 'Q1'
    GROUP BY stock.product;

- We will investigate Hive in greater detail later in the course

Page 97: Cloudera_Developer_Training

Pig

- Pig is an alternative abstraction on top of MapReduce
- Uses a dataflow scripting language
  - Called PigLatin
- The Pig interpreter runs on the client machine
  - Takes the PigLatin script and turns it into a series of MapReduce jobs
  - Submits those jobs to the cluster
- As with Hive, nothing ‘magical’ happens on the cluster
  - It is still simply running MapReduce jobs

Page 98: Cloudera_Developer_Training

Pig (cont’d)

- Sample Pig script:

    stock = LOAD '/user/fred/stock' AS (id, item);
    orders = LOAD '/user/fred/orders' AS (id, cost);
    grpd = GROUP orders BY id;
    totals = FOREACH grpd GENERATE group, SUM(orders.cost) AS t;
    result = JOIN stock BY id, totals BY group;
    DUMP result;

- We will investigate Pig in more detail later in the course

Page 99: Cloudera_Developer_Training

Impala

- Impala is an open-source project created by Cloudera
- Facilitates real-time queries of data in HDFS
- Does not use MapReduce
  - Uses its own daemon, running on each slave node
  - Queries data stored in HDFS
- Uses a language very similar to HiveQL
  - But produces results much, much faster
    - Typically between five and 40 times faster than Hive
- Currently in beta
  - Although being used in production by multiple organizations

Page 100: Cloudera_Developer_Training

Flume and Sqoop

- Flume provides a method to import data into HDFS as it is generated
  - Instead of batch-processing the data later
  - For example, log files from a Web server
- Sqoop provides a method to import data from tables in a relational database into HDFS
  - Does this very efficiently via a Map-only MapReduce job
  - Can also ‘go the other way’
    - Populate database tables from files in HDFS
- We will investigate Flume and Sqoop later in the course

Page 101: Cloudera_Developer_Training

Oozie

- Oozie allows developers to create a workflow of MapReduce jobs
  - Including dependencies between jobs
- The Oozie server submits the jobs to the cluster in the correct sequence
- We will investigate Oozie later in the course

Page 102: Cloudera_Developer_Training

HBase

- HBase is ‘the Hadoop database’
- A ‘NoSQL’ datastore
- Can store massive amounts of data
  - Gigabytes, terabytes, and even petabytes of data in a table
- Scales to provide very high write throughput
  - Hundreds of thousands of inserts per second
- Copes well with sparse data
  - Tables can have many thousands of columns
    - Even if most columns are empty for any given row
- Has a very constrained access model
  - Insert a row, retrieve a row, do a full or partial table scan
  - Only one column (the ‘row key’) is indexed
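The constrained access model above (put a row, get a row, scan a range of row keys, with only the row key indexed) can be pictured as a sorted map keyed by row key. The sketch below is a toy model in plain Java, not the HBase client API: `RowKeyStoreSketch` and its method signatures are hypothetical, and everything is held in memory.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of a row-key-indexed, sparse table. Not the HBase API. */
final class RowKeyStoreSketch {

    // Rows are kept sorted by row key; each row stores only the columns it actually has.
    private final NavigableMap<String, Map<String, String>> rows = new TreeMap<>();

    /** Inserts or updates one column of one row. */
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    /** Retrieves one row by its key; an unknown key yields an empty row. */
    Map<String, String> get(String rowKey) {
        return rows.getOrDefault(rowKey, Map.of());
    }

    /** Partial table scan over row keys in [startInclusive, stopExclusive). */
    NavigableMap<String, Map<String, String>> scan(String startInclusive, String stopExclusive) {
        return rows.subMap(startInclusive, true, stopExclusive, false);
    }

    public static void main(String[] args) {
        RowKeyStoreSketch table = new RowKeyStoreSketch();
        table.put("user1", "name", "fred");
        table.put("user2", "email", "fred@example.com");
        table.put("video1", "title", "intro");
        System.out.println(table.get("user1"));
        System.out.println(table.scan("user", "user~"));
    }
}
```

Looking up rows by anything other than the row key would require a full scan in this model, which mirrors the ‘only the row key is indexed’ constraint.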

Page 103: Cloudera_Developer_Training

HBase vs Traditional RDBMSs

                               RDBMS                          HBase
Data layout                    Row-oriented                   Column-oriented
Transactions                   Yes                            Single row only
Query language                 SQL                            get/put/scan
Security                       Authentication/Authorization   Kerberos
Indexes                        On arbitrary columns           Row-key only
Max data size                  TBs                            PB+
Read/write throughput limits   1000s queries/second           Millions of queries/second

Page 104: Cloudera_Developer_Training

Chapter Topics

Introduction to Apache Hadoop and its Ecosystem
Hadoop: Basic Concepts

- The Hadoop project and Hadoop components
- The Hadoop Distributed File System (HDFS)
- Hands-On Exercise: Using HDFS
- How MapReduce works
- Hands-On Exercise: Running a MapReduce Job
- How a Hadoop cluster operates
- Other Hadoop ecosystem components
- Conclusion (current topic)

Page 105: Cloudera_Developer_Training

Conclusion

In this chapter you have learned

- What Hadoop is
- What features the Hadoop Distributed File System (HDFS) provides
- The concepts behind MapReduce
- How a Hadoop cluster operates
- What other Hadoop Ecosystem projects exist

Page 106: Cloudera_Developer_Training

Writing a MapReduce Program
Chapter 4

Page 107: Cloudera_Developer_Training

Course Chapters

Course Introduction
- Introduction

Introduction to Apache Hadoop and its Ecosystem
- The Motivation for Hadoop
- Hadoop: Basic Concepts

Basic Programming with the Hadoop Core API (current section)
- Writing a MapReduce Program (current chapter)
- Unit Testing MapReduce Programs
- Delving Deeper into the Hadoop API

Problem Solving with MapReduce
- Practical Development Tips and Techniques
- Data Input and Output
- Common MapReduce Algorithms
- Joining Data Sets in MapReduce Jobs

The Hadoop Ecosystem
- Integrating Hadoop into the Enterprise Workflow
- Machine Learning and Mahout
- An Introduction to Hive and Pig
- An Introduction to Oozie

Course Conclusion and Appendices
- Conclusion
- Appendix: Cloudera Enterprise
- Appendix: Graph Manipulation in MapReduce

Page 108: Cloudera_Developer_Training

Writing a MapReduce Program

In this chapter you will learn

- The MapReduce flow
- Basic MapReduce API concepts
- How to write MapReduce drivers, Mappers, and Reducers in Java
- How to write Mappers and Reducers in other languages using the Streaming API
- How to speed up your Hadoop development by using Eclipse
- The differences between the old and new MapReduce APIs

Page 109: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API
Writing a MapReduce Program

- The MapReduce flow (current topic)
- Basic MapReduce API concepts
- Writing MapReduce applications in Java
  - The driver
  - The Mapper
  - The Reducer
- Writing Mappers and Reducers in other languages with the Streaming API
- Speeding up Hadoop development by using Eclipse
- Hands-On Exercise: Writing a MapReduce Program
- Differences between the Old and New MapReduce APIs
- Conclusion

Page 110: Cloudera_Developer_Training

A Sample MapReduce Program: Introduction

- In the previous chapter, you ran a sample MapReduce program
  - WordCount, which counted the number of occurrences of each unique word in a set of files
- In this chapter, we will examine the code for WordCount
  - This will demonstrate the Hadoop API
- We will also investigate Hadoop Streaming
  - Allows you to write MapReduce programs in (virtually) any language

Page 111: Cloudera_Developer_Training

The MapReduce Flow: Introduction

- On the following slides we show the MapReduce flow
- Each of the portions (RecordReader, Mapper, Partitioner, Reducer, etc.) can be created by the developer
- We will cover each of these as we move through the course
- You will always create at least a Mapper, Reducer, and driver code
  - Those are the portions we will investigate in this chapter

Page 112: Cloudera_Developer_Training

The MapReduce Flow: The Mapper

Page 113: Cloudera_Developer_Training

The MapReduce Flow: Shuffle and Sort

Page 114: Cloudera_Developer_Training

The MapReduce Flow: Reducers to Outputs

Page 115: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API
Writing a MapReduce Program

- The MapReduce flow
- Basic MapReduce API concepts (current topic)
- Writing MapReduce applications in Java
  - The driver
  - The Mapper
  - The Reducer
- Writing Mappers and Reducers in other languages with the Streaming API
- Speeding up Hadoop development by using Eclipse
- Hands-On Exercise: Writing a MapReduce Program
- Differences between the Old and New MapReduce APIs
- Conclusion

Page 116: Cloudera_Developer_Training

Our MapReduce Program: WordCount

- To investigate the API, we will dissect the WordCount program you ran in the previous chapter
- This consists of three portions
  - The driver code
    - Code that runs on the client to configure and submit the job
  - The Mapper
  - The Reducer
- Before we look at the code, we need to cover some basic Hadoop API concepts

Page 117: Cloudera_Developer_Training

Getting Data to the Mapper

- The data passed to the Mapper is specified by an InputFormat
  - Specified in the driver code
  - Defines the location of the input data
    - A file or directory, for example
  - Determines how to split the input data into input splits
    - Each Mapper deals with a single input split
  - InputFormat is a factory for RecordReader objects to extract (key, value) records from the input source

Page 118: Cloudera_Developer_Training

Getting Data to the Mapper (cont’d)

Page 119: Cloudera_Developer_Training

Some Standard InputFormats

- FileInputFormat
  - The base class used for all file-based InputFormats
- TextInputFormat
  - The default
  - Treats each \n-terminated line of a file as a value
  - Key is the byte offset within the file of that line
- KeyValueTextInputFormat
  - Maps \n-terminated lines as ‘key SEP value’
    - By default, separator is a tab
- SequenceFileInputFormat
  - Binary file of (key, value) pairs with some additional metadata
- SequenceFileAsTextInputFormat
  - Similar, but maps (key.toString(), value.toString())
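The (byte offset, line) records described for TextInputFormat can be illustrated without Hadoop. The sketch below is plain Java with a hypothetical class name, `LineRecordSketch`; it assumes UTF-8 text and `\n` line endings, and it works on an in-memory string rather than an input split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of line-oriented records: key = byte offset of the line, value = the line. */
final class LineRecordSketch {

    /** Splits text into (byte offset, line) records, one per \n-terminated line. */
    static List<Map.Entry<Long, String>> records(String text) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        long offset = 0; // byte offset at which the current line starts
        int start = 0;   // character index at which the current line starts
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            if (end < 0) {
                end = text.length(); // final line without a trailing newline
            }
            String line = text.substring(start, end);
            out.add(Map.entry(offset, line));
            // Advance by the line's encoded length plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("the cat sat on the mat\nthe aardvark sat on the sofa\n"));
    }
}
```

The first line is 22 bytes plus a newline, so the second record starts 23 bytes later. That matches the Word Count input shown earlier, where the keys 3414 and 3437 also differ by 23.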

Page 120: Cloudera_Developer_Training

Keys and Values are Objects

- Keys and values in Hadoop are Objects
- Values are objects which implement Writable
- Keys are objects which implement WritableComparable

Page 121: Cloudera_Developer_Training

What is Writable?

- Hadoop defines its own ‘box classes’ for strings, integers and so on
  - IntWritable for ints
  - LongWritable for longs
  - FloatWritable for floats
  - DoubleWritable for doubles
  - Text for strings
  - Etc.
- The Writable interface makes serialization quick and easy for Hadoop
- Any value’s type must implement the Writable interface
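To make the serialization idea concrete, here is a sketch of a box class that writes itself to, and reads itself from, a binary stream. It deliberately avoids the Hadoop library so that it runs on its own: `SimpleWritable` is a hypothetical stand-in whose two methods mirror the shape of Hadoop's `Writable` (`write(DataOutput)` and `readFields(DataInput)`), and `IntBox` only imitates the role of `IntWritable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Sketch of Writable-style serialization; does not use the Hadoop classes. */
final class WritableSketch {

    /** Hypothetical stand-in with the same two-method shape as Hadoop's Writable. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** A box class for an int, in the spirit of IntWritable. */
    static final class IntBox implements SimpleWritable {
        private int value;

        IntBox() { }

        IntBox(int value) { this.value = value; }

        int get() { return value; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(value); // always four bytes
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            value = in.readInt(); // overwrite this object's state in place
        }
    }

    /** Serializes any SimpleWritable to a byte array. */
    static byte[] serialize(SimpleWritable w) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            w.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fills an existing object from previously serialized bytes. */
    static void deserialize(SimpleWritable w, byte[] data) {
        try {
            w.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new IntBox(42));
        IntBox copy = new IntBox();
        deserialize(copy, data);
        System.out.println(data.length + " bytes, value " + copy.get());
    }
}
```

Because `readFields` fills an existing object, one instance can be reused for many records instead of allocating a new object each time.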

Page 122: Cloudera_Developer_Training

What is WritableComparable?

- A WritableComparable is a Writable which is also Comparable
  - Two WritableComparables can be compared against each other to determine their ‘order’
  - Keys must be WritableComparables because they are passed to the Reducer in sorted order
  - We will talk more about WritableComparables later
- Note that despite their names, all Hadoop box classes implement both Writable and WritableComparable
  - For example, IntWritable is actually a WritableComparable
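The reason keys need an ordering can be shown with a small key class that implements `Comparable`. This sketch is plain Java and leaves out the serialization half of a real WritableComparable; `ComparableKeySketch` and `WordKey` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of a comparable key type; serialization methods are omitted for brevity. */
final class ComparableKeySketch {

    /** A key wrapping a word; compareTo defines the order in which keys reach the Reducer. */
    static final class WordKey implements Comparable<WordKey> {
        private final String word;

        WordKey(String word) { this.word = word; }

        String get() { return word; }

        @Override
        public int compareTo(WordKey other) {
            return word.compareTo(other.word);
        }

        // equals and hashCode are kept consistent with compareTo.
        @Override
        public boolean equals(Object o) {
            return o instanceof WordKey && word.equals(((WordKey) o).word);
        }

        @Override
        public int hashCode() {
            return word.hashCode();
        }
    }

    /** Wraps each word in a key, sorts the keys, and returns the words in key order. */
    static List<String> sortedKeys(List<String> words) {
        List<WordKey> keys = new ArrayList<>();
        for (String w : words) {
            keys.add(new WordKey(w));
        }
        Collections.sort(keys); // uses WordKey.compareTo
        List<String> out = new ArrayList<>();
        for (WordKey k : keys) {
            out.add(k.get());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortedKeys(List.of("the", "cat", "aardvark")));
    }
}
```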

Page 123: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API
Writing a MapReduce Program

- The MapReduce flow
- Basic MapReduce API concepts
- Writing MapReduce applications in Java (current topic)
  - The driver (current topic)
  - The Mapper
  - The Reducer
- Writing Mappers and Reducers in other languages with the Streaming API
- Speeding up Hadoop development by using Eclipse
- Hands-On Exercise: Writing a MapReduce Program
- Differences between the Old and New MapReduce APIs
- Conclusion

Page 124: Cloudera_Developer_Training

The Driver Code: Introduction

- The driver code runs on the client machine
- It configures the job, then submits it to the cluster

The Driver: Complete Code

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.printf("Usage: WordCount <input dir> <output dir>\n");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

The Driver: Import Statements

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

You will typically import these classes into every MapReduce job you write. We will omit the import statements in future slides for brevity.


The Driver Class: main Method

public static void main(String[] args) throws Exception

The main method accepts two command-line arguments: the input and output directories.

Sanity Checking The Job’s Invocation

if (args.length != 2) {
  System.out.printf("Usage: WordCount <input dir> <output dir>\n");
  System.exit(-1);
}

The first step is to ensure that we have been given two command-line arguments. If not, print a help message and exit.

Configuring The Job With the Job Object

Job job = new Job();
job.setJarByClass(WordCount.class);

To configure the job, create a new Job object and specify the class which will be called to run the job.

Creating a New Job Object

• The Job class allows you to set configuration options for your MapReduce job
  – The classes to be used for your Mapper and Reducer
  – The input and output directories
  – Many other options
• Any options not explicitly set in your driver code will be read from your Hadoop configuration files
  – Usually located in /etc/hadoop/conf
• Any options not specified in your configuration files will receive Hadoop’s default values
• You can also use the Job object to submit the job, control its execution, and query its state
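The lookup order described here (driver code first, then configuration files, then built-in defaults) can be sketched with java.util.Properties. This is only an analogy, not how Hadoop implements it: LayeredSettings, resolve, and the example.* option names are all hypothetical.

```java
import java.util.Properties;

// Illustration only: java.util.Properties stands in for Hadoop's own configuration classes.
class LayeredSettings {

    // Resolves a key by checking the driver's settings, then the configuration files,
    // then the built-in defaults.
    static String resolve(String key, Properties defaults, Properties configFiles, Properties driver) {
        Properties fileLayer = new Properties(defaults);     // falls back to the defaults
        fileLayer.putAll(configFiles);
        Properties driverLayer = new Properties(fileLayer);  // falls back to the file layer
        driverLayer.putAll(driver);
        return driverLayer.getProperty(key);
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("example.reducers", "1");        // hypothetical option names
        defaults.setProperty("example.job.name", "unnamed");

        Properties configFiles = new Properties();
        configFiles.setProperty("example.reducers", "4");     // overrides the default

        Properties driver = new Properties();
        driver.setProperty("example.job.name", "Word Count"); // overrides everything else

        System.out.println(resolve("example.reducers", defaults, configFiles, driver)); // prints 4
        System.out.println(resolve("example.job.name", defaults, configFiles, driver)); // prints Word Count
    }
}
```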

Naming The Job

job.setJobName("Word Count");

Give the job a meaningful name.

Specifying Input and Output Directories

FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Next, specify the input directory from which data will be read, and the output directory to which final output will be written.

Specifying the InputFormat

• The default InputFormat (TextInputFormat) will be used unless you specify otherwise
• To use an InputFormat other than the default, use e.g. job.setInputFormatClass(KeyValueTextInputFormat.class)
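To give a feel for what changing the InputFormat means for the Mapper, the sketch below imitates, in plain Java, the way KeyValueTextInputFormat is generally described as treating each line: the text before the first tab becomes the key and the rest becomes the value. KeyValueLineSketch and parse are hypothetical names, and this is an approximation of the behavior, not Hadoop's code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java approximation of splitting a line into a key and a value at the first tab.
class KeyValueLineSketch {

    // Text before the first tab is the key; everything after it is the value.
    // A line with no tab becomes a key with an empty value.
    static Map.Entry<String, String> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> pair = parse("apple\t3");
        System.out.println(pair.getKey() + " -> " + pair.getValue()); // prints apple -> 3
    }
}
```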

Determining Which Files To Read

• By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to Mappers
  – Exceptions: items whose names begin with a period (.) or underscore (_)
  – Globs can be specified to restrict input
    – For example, /2010/*/01/*
• Alternatively, FileInputFormat.addInputPath() can be called multiple times, specifying a single file or directory each time
• More advanced filtering can be performed by implementing a PathFilter
  – Interface with a method named accept
    – Takes a path to a file, returns true or false depending on whether or not the file should be processed
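The filtering rules can be sketched without Hadoop by treating file names as strings. In the sketch below, a java.util.function.Predicate plays the role that a PathFilter's accept method plays in Hadoop; InputFileSelection, isVisible, and select are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java illustration of the default hidden-file rule plus a custom filter.
class InputFileSelection {

    // Mirrors the default rule: skip names that start with '.' or '_'.
    static boolean isVisible(String fileName) {
        return !fileName.startsWith(".") && !fileName.startsWith("_");
    }

    // Applies the default rule, then an extra caller-supplied test
    // (the role an accept method would play).
    static List<String> select(List<String> fileNames, Predicate<String> extraFilter) {
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            if (isVisible(name) && extraFilter.test(name)) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("part-00000", "_SUCCESS", ".hidden", "notes.tmp", "data.log");
        // Extra rule for this example: ignore temporary files.
        System.out.println(select(names, name -> !name.endsWith(".tmp"))); // prints [part-00000, data.log]
    }
}
```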

Specifying Final Output With OutputFormat

• FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output
• The driver can also specify the format of the output data
  – Default is a plain text file
  – Could be explicitly written as job.setOutputFormatClass(TextOutputFormat.class)
• We will discuss OutputFormats in more depth in a later chapter

Specify The Classes for Mapper and Reducer

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);

Give the Job object information about which classes are to be instantiated as the Mapper and Reducer.

Specify The Intermediate Data Types

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

Specify the types for the intermediate output key and value produced by the Mapper.

Specify The Final Output Data Types

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

Specify the types for the Reducer’s output key and value.

Running The Job

boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);

Start the job, wait for it to complete, and exit with a return code.

Running The Job (cont’d)

• There are two ways to run your MapReduce job:
  – job.waitForCompletion()
    – Blocks (waits for the job to complete before continuing)
  – job.submit()
    – Does not block (driver code continues as the job is running)
• The job determines the proper division of input data into InputSplits, and then sends the job information to the JobTracker daemon on the cluster
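The difference between blocking and non-blocking submission can be illustrated with the standard java.util.concurrent classes. This is an analogy only: BlockingVersusSubmit and its two methods are hypothetical stand-ins that borrow the names used on the slide, and a latch simulates a job that takes a while to finish.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

// Analogy for submit() versus waitForCompletion(); no Hadoop classes are involved.
class BlockingVersusSubmit {

    // Non-blocking: starts a simulated job and returns immediately.
    // The simulated job finishes only once the latch is released.
    static CompletableFuture<Boolean> submit(CountDownLatch finishSignal) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                finishSignal.await();
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        });
    }

    // Blocking: does not return until the simulated job has finished.
    static boolean waitForCompletion(CompletableFuture<Boolean> job) {
        return job.join();
    }

    public static void main(String[] args) {
        CountDownLatch finishSignal = new CountDownLatch(1);
        CompletableFuture<Boolean> job = submit(finishSignal);
        System.out.println("Finished right after submit? " + job.isDone()); // prints false
        finishSignal.countDown();                                           // let the job finish
        System.out.println("Succeeded: " + waitForCompletion(job));         // prints true
    }
}
```

With the non-blocking style, the driver is free to do other work, or to poll the job's state, while the job runs.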


The Mapper: Complete Code

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();

    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

The Mapper: import Statements

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

You will typically import java.io.IOException, and the org.apache.hadoop classes shown, in every Mapper you write. We will omit the import statements in future slides for brevity.

The Mapper: Main Code

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>

Your Mapper class should extend the Mapper class. The Mapper class expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types, the second two define the output key and value types.

The map Method

@Override
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException

The map method’s signature looks like this. It will be passed a key, a value, and a Context object. The Context is used to write the intermediate data. It also contains information about the job’s configuration (see later).

The map Method: Processing The Line

String line = value.toString();

value is a Text object, so we retrieve the string it contains.

The map Method: Processing The Line (cont’d)

for (String word : line.split("\\W+")) {
  if (word.length() > 0) {
    context.write(new Text(word), new IntWritable(1));
  }
}

We split the string up into words using a regular expression with non-alphanumeric characters as the delimiter, and then loop through the words.
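The splitting step can be tried on its own with the standard library. The sketch below also shows why the length check is needed: when a line starts with a delimiter, String.split returns an empty first element. Note too that \W treats the underscore as a word character. LineTokenizer and words are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone version of the tokenizing loop from the map method.
class LineTokenizer {

    // Splits on runs of non-word characters and drops empty tokens.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The leading spaces produce an empty first element in the raw split result.
        System.out.println("  the cat".split("\\W+").length); // prints 3
        System.out.println(words("  the cat, the hat!"));     // prints [the, cat, the, hat]
    }
}
```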

Outputting Intermediate Data

context.write(new Text(word), new IntWritable(1));

To emit a (key, value) pair, we call the write method of our Context object. The key will be the word itself, the value will be the number 1. Recall that the output key must be a WritableComparable, and the value must be a Writable.


The Reducer: Complete Code

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int wordCount = 0;

    for (IntWritable value : values) {
      wordCount += value.get();
    }

    context.write(key, new IntWritable(wordCount));
  }
}

The Reducer: Import Statements

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

As with the Mapper, you will typically import java.io.IOException, and the org.apache.hadoop classes shown, in every Reducer you write. We will omit the import statements in future slides for brevity.

The Reducer: Main Code

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>

Your Reducer class should extend Reducer. The Reducer class expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types, the second two define the final output key and value types. The keys are WritableComparables, the values are Writables.

The reduce Method

@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException

The reduce method receives a key and an Iterable collection of objects (which are the values emitted from the Mappers for that key); it also receives a Context object.

Processing The Values

int wordCount = 0;

for (IntWritable value : values) {
  wordCount += value.get();
}

We use the Java for-each syntax to step through all the elements in the collection. In our example, we are merely adding all the values together. We use value.get() to retrieve the actual numeric value.
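To see the summing loop in context, the following sketch simulates on one machine, with plain Java collections, what happens between the Mapper and the Reducer: emitted words are grouped by key in sorted order, and each group of 1s is summed. WordCountSimulation, sum, and countWords are hypothetical names, and a TreeMap stands in for the framework's sort and grouping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine simulation of grouping by key and then reducing each group.
class WordCountSimulation {

    // Reduce step for one key: add up all the values collected for it.
    static int sum(Iterable<Integer> values) {
        int wordCount = 0;
        for (int value : values) {
            wordCount += value;
        }
        return wordCount;
    }

    // Groups the emitted words by key in sorted order, then reduces each group.
    static Map<String, Integer> countWords(List<String> emittedWords) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : emittedWords) {
            // Each emitted word carries the value 1, as in the Mapper.
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            totals.put(entry.getKey(), sum(entry.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("the", "cat", "the"))); // prints {cat=1, the=2}
    }
}
```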

Writing The Final Output

context.write(key, new IntWritable(wordCount));

Finally, we write the output key/value pair to HDFS using the write method of our Context object.


The Streaming API: Motivation

• Many organizations have developers skilled in languages other than Java, such as
  – Ruby
  – Python
  – Perl
• The Streaming API allows developers to use any language they wish to write Mappers and Reducers
  – As long as the language can read from standard input and write to standard output

The Streaming API: Advantages and Disadvantages

• Advantages of the Streaming API:
  – No need for non-Java coders to learn Java
  – Fast development time
  – Ability to use existing code libraries
• Disadvantages of the Streaming API:
  – Performance
  – Primarily suited for handling data that can be represented as text
  – Streaming jobs can use excessive amounts of RAM or fork excessive numbers of processes
  – Although Mappers and Reducers can be written using the Streaming API, Partitioners, InputFormats etc. must still be written in Java

How Streaming Works

• To implement streaming, write separate Mapper and Reducer programs in the language of your choice
  – They will receive input via stdin
  – They should write their output to stdout
• If TextInputFormat (the default) is used, the streaming Mapper just receives each line from the file on stdin
  – No key is passed
• The streaming Mapper’s and streaming Reducer’s output should be sent to stdout as key (tab) value (newline)
• Separators other than tab can be specified
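The line-oriented contract can be sketched in any language that reads and writes text streams. For consistency with the rest of this chapter the sketch is in Java, although Streaming is normally used with other languages. StreamingStyleMapper, map, and mapText are hypothetical names; a real streaming program would pass its standard input and standard output to map instead of the in-memory strings used here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Word-count mapper written against text streams: lines in, "key<TAB>value" lines out.
class StreamingStyleMapper {

    // Reads lines from 'in' and writes one "word<TAB>1" line per word to 'out'.
    static void map(Reader in, Writer out) {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        writer.print(word + "\t1\n");
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        writer.flush();
    }

    // Convenience wrapper for trying the mapper on an in-memory string.
    static String mapText(String input) {
        StringWriter out = new StringWriter();
        map(new StringReader(input), out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints three lines: "the", "cat" and "sat", each followed by a tab and 1.
        System.out.print(mapText("the cat sat"));
    }
}
```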

Streaming: Example Mapper

• Example streaming wordcount Mapper:

#!/usr/bin/env perl
while (<>) {                 # Read lines from stdin
  chomp;                     # Get rid of the trailing newline
  (@words) = split /\s+/;    # Create an array of words
  foreach $w (@words) {      # Loop through the array
    print "$w\t1\n";         # Print out the key and value
  }
}


Streaming Reducers: Caution

• Recall that in Java, all the values associated with a key are passed to the Reducer as an Iterable

• Using Hadoop Streaming, the Reducer receives its input as (key, value) pairs
    – One per line of standard input

• Your code will have to keep track of the key so that it can detect when values from a new key start appearing
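The key-tracking logic described on this slide can be sketched as follows. This is a minimal sketch, written in Java only to match the language used in the rest of the course (a streaming Reducer is more commonly a script). It assumes word-count style input lines of the form key, tab, count, already sorted by key. The class and method names are illustrative and are not part of Hadoop.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical streaming Reducer: sums the counts seen for each key.
final class StreamingSumReducer {

    // Takes sorted "key<TAB>count" lines and returns one "key<TAB>sum" line per key.
    static List<String> reduce(Iterable<String> sortedLines) {
        List<String> output = new ArrayList<>();
        String currentKey = null;   // key whose values are currently being summed
        long sum = 0;

        for (String line : sortedLines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue;           // assumption: silently skip lines without a tab
            }
            String key = line.substring(0, tab);
            long value = Long.parseLong(line.substring(tab + 1).trim());

            if (currentKey != null && !key.equals(currentKey)) {
                // Values for a new key have started: emit the previous key's total.
                output.add(currentKey + "\t" + sum);
                sum = 0;
            }
            currentKey = key;
            sum += value;
        }
        if (currentKey != null) {
            output.add(currentKey + "\t" + sum);   // emit the total for the last key
        }
        return output;
    }

    public static void main(String[] args) {
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        // For simplicity the results are collected before printing; a Reducer
        // handling many keys would print each total as soon as the key changes.
        for (String result : reduce(stdin.lines()::iterator)) {
            System.out.println(result);
        }
    }
}
```

The comparison of each incoming key with `currentKey` is the part that the Java API does for you by grouping values into an Iterable.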


Launching a Streaming Job

• To launch a Streaming job, use a command such as:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myMapScript.pl \
    -reducer myReduceScript.pl \
    -file myMapScript.pl \
    -file myReduceScript.pl

• Many other command-line options are available
    – See the documentation for full details

• Note that system commands can be used as a Streaming Mapper or Reducer
    – For example: awk, grep, sed, or wc


Integrated Development Environments

• There are many Integrated Development Environments (IDEs) available

• Eclipse is one such IDE
    – Open source
    – Very popular among Java developers
    – Has plug-ins to speed development in several different languages

• If you would prefer to write your code this week using a terminal-based editor such as vi, we certainly won’t stop you!
    – But using Eclipse can dramatically speed up your development process

• On the next few slides we will demonstrate how to use Eclipse to write a MapReduce program


Starting Eclipse

• Double-click the Eclipse icon on the Desktop to launch Eclipse

• Import pre-built projects for all Java API hands-on exercises in this course


Locating Source Code

• In Package Explorer, expand the project you want to work with

• Locate the class you want to edit

• Double-click the class


Specifying the Java Build Path


Editing Source Code

• Edit the class in the right window pane


Accessing the Javadoc

• If you have network access, you can select an element and press Shift + F2 to access the element’s full Javadoc in a browser

• Or, simply hover your mouse over an element for which you want to access the top-level Javadoc


Accessing the Hadoop Source Code

• Your virtual machine has been provisioned with the Hadoop source code

• Select a Hadoop element and press F3 to access the element’s source code


Creating a Jar File

• When you are ready to test your code, right-click the default package and choose Export


Creating a Jar File (cont’d)

• Expand ‘Java’, select the ‘JAR file’ option, and then click Next


Creating a Jar File (cont’d)

• Enter a path and filename inside /home/training (your home directory), and click Finish

• Your JAR file will be saved; you can now run it from the command line with the standard hadoop jar ... command


Hands-On Exercise: Writing a MapReduce Program

• In this Hands-On Exercise, you will write a MapReduce program using either Java or Hadoop’s Streaming interface

• Please refer to the Hands-On Exercise Manual


What Is The Old API?

• When Hadoop 0.20 was released, a ‘New API’ was introduced
    – Designed to make the API easier to evolve in the future
    – Favors abstract classes over interfaces

• Some developers still use the Old API
    – Until CDH4, the New API was not absolutely feature-complete

• All the code examples in this course use the New API
    – Old API-based solutions for many of the Hands-On Exercises for this course are available in the sample_solutions_oldapi directory


New API vs. Old API: Some Key Differences

Imports
    New API: import org.apache.hadoop.mapreduce.*
    Old API: import org.apache.hadoop.mapred.*

Driver code, New API:

Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(Driver.class);
job.setSomeProperty(...);
...
job.waitForCompletion(true);

Driver code, Old API:

JobConf conf = new JobConf(new Configuration(), Driver.class);
conf.setSomeProperty(...);
...
JobClient.runJob(conf);

Mapper, New API:

public class MyMapper extends Mapper {
  public void map(Keytype k, Valuetype v, Context c) {
    ...
    c.write(key, val);
  }
}

Mapper, Old API:

public class MyMapper extends MapReduceBase implements Mapper {
  public void map(Keytype k, Valuetype v, OutputCollector o, Reporter r) {
    ...
    o.collect(key, val);
  }
}


New API vs. Old API: Some Key Differences (cont’d)

Reducer, New API:

public class MyReducer extends Reducer {
  public void reduce(Keytype k, Iterable<Valuetype> v, Context c) {
    for (Valuetype eachval : v) {
      // process eachval
      c.write(key, val);
    }
  }
}

Reducer, Old API:

public class MyReducer extends MapReduceBase implements Reducer {
  public void reduce(Keytype k, Iterator<Valuetype> v, OutputCollector o, Reporter r) {
    while (v.hasNext()) {
      // process v.next()
      o.collect(key, val);
    }
  }
}

Setup and teardown methods
    New API: setup(Context c) (see later)      Old API: configure(JobConf job)
    New API: cleanup(Context c) (see later)    Old API: close()
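One practical consequence of the two Reducer signatures is how the values are consumed: the New API passes a java.lang.Iterable, which works with the for-each loop, while the Old API passes a java.util.Iterator, which is read with hasNext() and next(). The plain-Java sketch below shows the two loop styles side by side. It uses Integer values and hypothetical method names rather than Hadoop Writable types, so that it runs without Hadoop on the classpath.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustration only: not Hadoop code, just the two ways of walking the values.
final class ValueLoopStyles {

    // New API style: the values arrive as an Iterable, so for-each works.
    static int sumNewStyle(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Old API style: the values arrive as an Iterator, read with hasNext()/next().
    static int sumOldStyle(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> counts = Arrays.asList(1, 1, 1);
        System.out.println(sumNewStyle(counts));
        System.out.println(sumOldStyle(counts.iterator()));
    }
}
```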


MRv1 vs MRv2, Old API vs New API

• There is a lot of confusion about the New and Old APIs, and MapReduce version 1 and MapReduce version 2

• The chart below should clarify what is available with each version of MapReduce

• Summary: Code using either the Old API or the New API will run under MRv1 and MRv2
    – You will have to recompile the code to move from MRv1 to MRv2, but you will not have to change the code itself

                 Old API    New API
MapReduce v1        ✔          ✔
MapReduce v2        ✔          ✔


Conclusion

In this chapter you have learned

• The MapReduce flow
• Basic MapReduce API concepts
• How to write MapReduce drivers, Mappers, and Reducers in Java
• How to write Mappers and Reducers in other languages using the Streaming API
• How to speed up your Hadoop development by using Eclipse
• The differences between the old and new MapReduce APIs


Unit Testing MapReduce Programs
Chapter 5


Unit Testing MapReduce Programs

In this chapter you will learn

• What unit testing is, and why you should write unit tests
• What the JUnit testing framework is, and how MRUnit builds on the JUnit framework
• How to write unit tests with MRUnit
• How to run unit tests


Chapter Topics

Basic Programming with the Hadoop Core API: Unit Testing MapReduce Programs

• Unit testing
• The JUnit and MRUnit testing frameworks
• Writing unit tests with MRUnit
• Running unit tests
• Hands-On Exercise: Writing Unit Tests with the MRUnit Framework
• Conclusion


An Introduction to Unit Testing

• A ‘unit’ is a small piece of your code
    – A small piece of functionality

• A unit test verifies the correctness of that unit of code
    – A purist might say that in a well-written unit test, only a single ‘thing’ should be able to fail
    – Generally accepted rule-of-thumb: a unit test should take less than a second to complete


Why Write Unit Tests?

• Unit testing provides verification that your code is functioning correctly

• Much faster than testing your entire program each time you modify the code
    – Fastest MapReduce job on a cluster will take many seconds
        – Even in pseudo-distributed mode
    – Even running in LocalJobRunner mode will take several seconds
        – LocalJobRunner mode is discussed later in the course
    – Unit tests help you iterate faster in your code development


Why MRUnit?

• JUnit is a very popular Java unit testing framework

• Problem: JUnit cannot be used directly to test Mappers or Reducers
    – Unit tests require mocking up classes in the MapReduce framework
        – A lot of work

• MRUnit is built on top of JUnit
    – Works with the mockito framework to provide required mock objects

• Allows you to test your code from within an IDE
    – Much easier to debug


JUnit Basics

• We are using JUnit 4 in class
    – Earlier versions would also work

• @Test
    – Java annotation
    – Indicates that this method is a test which JUnit should execute

• @Before
    – Java annotation
    – Tells JUnit to call this method before every @Test method
        – Two @Test methods would result in the @Before method being called twice


JUnit Basics (cont’d)

• JUnit test methods:
    – assertEquals(), assertNotNull() etc
        – Fail if the conditions of the statement are not met
    – fail(msg)
        – Fails the test with the given error message

• With a JUnit test open in Eclipse, run all tests in the class by going to Run → Run

• Eclipse also provides functionality to run all JUnit tests in your project

• Other IDEs have similar functionality


JUnit: Example Code

import static org.junit.Assert.assertEquals;

import org.junit.Before;
import org.junit.Test;

public class JUnitHelloWorld {

  protected String s;

  @Before
  public void setup() {
    s = "HELLO WORLD";
  }

  @Test
  public void testHelloWorldSuccess() {
    s = s.toLowerCase();
    assertEquals("hello world", s);
  }

  // will fail even if testHelloWorldSuccess is called first
  @Test
  public void testHelloWorldFail() {
    assertEquals("hello world", s);
  }
}


Using MRUnit to Test MapReduce Code

• MRUnit builds on top of JUnit

• Provides a mock InputSplit and other classes

• Can test just the Mapper, just the Reducer, or the full MapReduce flow


MRUnit: Example Code – Mapper Unit Test

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TestWordCount {

  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    WordMapper mapper = new WordMapper();
    mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
    mapDriver.setMapper(mapper);
  }

  @Test
  public void testMapper() {
    mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
    mapDriver.withOutput(new Text("cat"), new IntWritable(1));
    mapDriver.withOutput(new Text("dog"), new IntWritable(1));
    mapDriver.runTest();
  }
}


Import the relevant JUnit classes and the MRUnit MapDriver class, as we will be writing a unit test for our Mapper.


MapDriver is an MRUnit class (not a user-defined driver).


The setUp() method sets up the test. It will be called before every test, just as with JUnit.


The testMapper() method is the test itself. Note that the order in which the output is specified is important: it must match the order in which the output will be created by the Mapper.


MRUnit Drivers

• MRUnit has a MapDriver, a ReduceDriver, and a MapReduceDriver

• Methods to specify test input and output:
    – withInput
        – Specifies input to the Mapper/Reducer
        – Builder method that can be chained
    – withOutput
        – Specifies expected output from the Mapper/Reducer
        – Builder method that can be chained
    – addInput
        – Similar to withInput but returns void
    – addOutput
        – Similar to withOutput but returns void


MRUnit Drivers (cont’d)

• Methods to run tests:
    – runTest
        – Runs the test and verifies the output
    – run
        – Runs the test and returns the result set
        – Ignores previous withOutput and addOutput calls

• Drivers take a single (key, value) pair as input

• Can take multiple (key, value) pairs as expected output

• If you are calling driver.runTest() or driver.run() multiple times, call driver.resetOutput() between each call
    – MRUnit will fail if you do not do this
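To make the driver semantics on these two slides concrete without requiring MRUnit on the classpath, here is a small plain-Java stand-in. MiniMapDriver is a hypothetical class written for this illustration and is not part of MRUnit. It only mimics the behaviors described above: withInput and withOutput return the driver so that calls can be chained, run returns the actual output and ignores expectations, runTest compares actual and expected output in order, and resetOutput clears the expectations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a test driver; records are plain "key<TAB>value" strings.
final class MiniMapDriver {
    private final Function<String, List<String>> mapper;
    private final List<String> expected = new ArrayList<>();
    private String input;

    MiniMapDriver(Function<String, List<String>> mapper) {
        this.mapper = mapper;
    }

    // Builder-style methods: each returns the driver so calls can be chained.
    MiniMapDriver withInput(String line) {
        this.input = line;
        return this;
    }

    MiniMapDriver withOutput(String record) {
        expected.add(record);
        return this;
    }

    // Runs the mapper and returns its output; expectations are ignored.
    List<String> run() {
        return mapper.apply(input);
    }

    // Runs the mapper and verifies the output, including its order.
    void runTest() {
        List<String> actual = run();
        if (!actual.equals(expected)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    // Clears the expected output so the driver can be reused.
    void resetOutput() {
        expected.clear();
    }

    // A word-count mapper to test: emits "word<TAB>1" for each word in the line.
    static List<String> wordMapper(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word + "\t1");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        new MiniMapDriver(MiniMapDriver::wordMapper)
            .withInput("cat dog")
            .withOutput("cat\t1")
            .withOutput("dog\t1")
            .runTest();
        System.out.println("test passed");
    }
}
```

Because the expected records are held in a list, swapping the two withOutput calls makes runTest fail, which mirrors the ordering requirement noted for the Mapper test.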


MRUnit Conclusions

• You should write unit tests for your code!

• As you are performing the Hands-On Exercises in the rest of the course we strongly recommend that you write unit tests as you proceed
    – This will help greatly in debugging your code


Running Unit Tests From Eclipse


Running Unit Tests From the Command Line

[training@localhost sample_solution]$ java -cp `hadoop classpath`:/home/training/lib/mrunit-0.9.0-incubating-hadoop2.jar:. org.junit.runner.JUnitCore TestWordCount
JUnit version 4.8.2
...
Time: 0.51

OK (3 tests)


Hands-On Exercise: Writing Unit Tests With the MRUnit Framework

• In this Hands-On Exercise, you will gain practice creating unit tests

• Please refer to the Hands-On Exercise Manual


Conclusion

In this chapter you have learned

• What unit testing is, and why you should write unit tests
• What the JUnit testing framework is, and how MRUnit builds on the JUnit framework
• How to write unit tests with MRUnit
• How to run unit tests


Delving Deeper into the Hadoop API
Chapter 6


Delving Deeper Into The Hadoop API

In this chapter you will learn

• How to use the ToolRunner class
• How to decrease the amount of intermediate data with Combiners
• How to set up and tear down Mappers and Reducers by using the setup and cleanup methods
• How to write custom Partitioners for better load balancing
• How to access HDFS programmatically
• How to use the distributed cache
• How to use the Hadoop API’s library of Mappers, Reducers, and Partitioners


Chapter Topics

Basic Programming with the Hadoop Core API: Delving Deeper into the Hadoop API

• Using the ToolRunner class
• Decreasing the amount of intermediate data with Combiners
• Hands-On Exercise: Writing and Implementing a Combiner
• Setting up and tearing down Mappers and Reducers using the setup and cleanup methods
• Writing custom Partitioners for better load balancing
• Hands-On Exercise: Writing a Partitioner
• Accessing HDFS programmatically
• Using the Distributed Cache
• Using the Hadoop API’s library of Mappers, Reducers and Partitioners
• Conclusion


Why Use ToolRunner?

• You can use ToolRunner in MapReduce driver classes
    – This is not required, but is a best practice

• ToolRunner uses the GenericOptionsParser class internally
    – Allows you to specify configuration options on the command line
    – Also allows you to specify items for the Distributed Cache on the command line (see later)


How to Implement ToolRunner

• Import the relevant classes in your driver

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

• Change your driver class so that it extends Configured and implements Tool

public class WordCount extends Configured implements Tool {


How to Implement ToolRunner (cont’d)

• The main method should call ToolRunner.run

public static void main(String[] args) throws Exception {
  int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
  System.exit(exitCode);
}

• Create a run method
    – Configure and submit the job in this method
    – Note how the Job object is created when using ToolRunner

public int run(String[] args) throws Exception {
  Job job = new Job(getConf());
  job.setJarByClass(WordCount.class);
  ...


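How to Implement ToolRunner: Complete Driver

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf(
          "Usage: %s [generic options] <input dir> <output dir>\n",
          getClass().getSimpleName());
      return -1;
    }

    // Build the Job from the configuration that ToolRunner has populated
    Job job = new Job(getConf());
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(exitCode);
  }
}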


ToolRunner Command Line Options

• ToolRunner allows the user to specify configuration options on the command line

• Commonly used to specify Hadoop properties using the -D flag
  – Will override any default or site properties in the configuration
  – But will not override those set in the driver code

• Note that -D options must appear before any additional program arguments

• Can specify an XML configuration file with -conf

• Can specify the default filesystem with -fs uri
  – Shortcut for -D fs.defaultFS=uri

$ hadoop jar myjar.jar MyDriver \
    -D mapreduce.job.reduces=10 myinputdir myoutputdir
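The ordering rule for -D options can be illustrated with a small plain-Java model. This is only a sketch under stated assumptions: GenericOptionsSketch is a hypothetical class, it handles only the "-D key=value" form, and it is not Hadoop's own option parser. It consumes leading -D pairs into a property map and hands everything else to the program, which is roughly why the options need to come first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of generic-option handling; not Hadoop's parser. */
final class GenericOptionsSketch {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remainingArgs = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        int i = 0;
        // Consume leading "-D key=value" pairs; stop at the first other argument.
        while (i + 1 < args.length && args[i].equals("-D")) {
            String[] kv = args[i + 1].split("=", 2);
            properties.put(kv[0], kv.length > 1 ? kv[1] : "");
            i += 2;
        }
        // Whatever is left is what the driver's run(String[] args) would receive.
        for (; i < args.length; i++) {
            remainingArgs.add(args[i]);
        }
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch(new String[] {
            "-D", "mapreduce.job.reduces=10", "myinputdir", "myoutputdir"});
        System.out.println(parsed.properties);
        System.out.println(parsed.remainingArgs);
    }
}
```

In this model the driver sees only the input and output directories as args[0] and args[1], which matches the argument check in the complete driver.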

Aside: Deprecated Configuration Properties

• In CDH 4, a large number of configuration properties were deprecated

• The new property names work in CDH 4 but do not work in CDH 3 (the deprecated names still work in CDH 4)

• All configuration property names shown in this course are the new property names
  – The deprecated property names are also provided for students who are still working with CDH 3

• CDH 3 equivalents for configuration properties on the previous slide are:
  – mapred.reduce.tasks (for mapreduce.job.reduces)
  – fs.default.name (for fs.defaultFS)
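One way to keep driver code portable across both versions is a small lookup from new names to deprecated ones. The sketch below is illustrative only: DeprecatedNames and nameFor are hypothetical names, and the table holds just the two pairs listed on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps new (CDH 4) property names to their CDH 3 equivalents. */
final class DeprecatedNames {
    // new name -> deprecated name, taken from the slide above
    static final Map<String, String> OLD_NAME = new LinkedHashMap<>();
    static {
        OLD_NAME.put("mapreduce.job.reduces", "mapred.reduce.tasks");
        OLD_NAME.put("fs.defaultFS", "fs.default.name");
    }

    /** Returns the property name to use for the given CDH major version (3 or 4). */
    static String nameFor(String newName, int cdhMajorVersion) {
        if (cdhMajorVersion >= 4) {
            return newName;
        }
        // Fall back to the new name when no deprecated equivalent is known.
        return OLD_NAME.getOrDefault(newName, newName);
    }

    public static void main(String[] args) {
        System.out.println(nameFor("mapreduce.job.reduces", 3));
        System.out.println(nameFor("mapreduce.job.reduces", 4));
    }
}
```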

Chapter Topics

Basic Programming with the Hadoop Core API
Delving Deeper into the Hadoop API

• Using the ToolRunner class
• Decreasing the amount of intermediate data with Combiners
• Hands-On Exercise: Writing and Implementing a Combiner
• Setting up and tearing down Mappers and Reducers using the setup and cleanup methods
• Writing custom Partitioners for better load balancing
• Hands-On Exercise: Writing a Partitioner
• Accessing HDFS programmatically
• Using the Distributed Cache
• Using the Hadoop API’s library of Mappers, Reducers and Partitioners
• Conclusion

The Combiner

• Often, Mappers produce large amounts of intermediate data
  – That data must be passed to the Reducers
  – This can result in a lot of network traffic

• It is often possible to specify a Combiner
  – Like a ‘mini-Reducer’
  – Runs locally on a single Mapper’s output
  – Output from the Combiner is sent to the Reducers
  – Input and output data types for the Combiner/Reducer must be identical

• Combiner and Reducer code are often identical
  – Technically, this is possible if the operation performed is commutative and associative

MapReduce Example: Word Count

• To see how a Combiner works, let’s revisit the WordCount example we covered earlier

map(String input_key, String input_value)
  foreach word w in input_value:
    emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_vals)
  set count = 0
  foreach v in intermediate_vals:
    count += v
  emit(output_key, count)

MapReduce Example: Word Count (cont’d)

• Input to the Mapper:

(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

• Output from the Mapper:

('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1),
('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)

MapReduce Example: Word Count (cont’d)

• Intermediate data sent to the Reducer:

('aardvark', [1])
('cat', [1])
('mat', [1])
('on', [1, 1])
('sat', [1, 1])
('sofa', [1])
('the', [1, 1, 1, 1])

• Final Reducer output:

('aardvark', 1)
('cat', 1)
('mat', 1)
('on', 2)
('sat', 2)
('sofa', 1)
('the', 4)

Word Count With Combiner

• A Combiner would decrease the amount of data sent to the Reducer
  – Intermediate data sent to the Reducer after a Combiner using the same code as the Reducer:

('aardvark', [1])
('cat', [1])
('mat', [1])
('on', [2])
('sat', [2])
('sofa', [1])
('the', [4])

• Combiners decrease the amount of network traffic required during the shuffle and sort phase
  – Often also decrease the amount of work needed to be done by the Reducer
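The reduction in intermediate data can be checked with a plain-Java simulation of the word count example. This is a sketch, not Hadoop code: CombinerSketch and its methods are hypothetical names, and it assumes both input lines are handled by the same Mapper, as in the slide.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of word count with a Combiner; no Hadoop classes involved. */
final class CombinerSketch {
    /** Mapper: emit (word, 1) for every word in the line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    /** Combiner and Reducer share this logic: sum the values for each key. */
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.addAll(map("the cat sat on the mat"));
        mapOutput.addAll(map("the aardvark sat on the sofa"));
        Map<String, Integer> combined = sum(mapOutput);
        System.out.println("records without Combiner: " + mapOutput.size());
        System.out.println("records with Combiner:    " + combined.size());
        System.out.println(combined);
    }
}
```

Without the Combiner, one record per word occurrence crosses the network; with it, one record per distinct word does.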

Specifying a Combiner

• To specify the Combiner class to be used in your MapReduce code, put the following line in your Driver:

job.setCombinerClass(YourCombinerClass.class);

• The Combiner uses the same interface as the Reducer
  – Takes in a key and a list of values
  – Outputs zero or more (key, value) pairs
  – The actual method called is the reduce method in the class

• VERY IMPORTANT: The Combiner may run once, or more than once, on the output from any given Mapper
  – Do not put code in the Combiner which could influence your results if it runs more than once
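The warning about repeated runs can be made concrete by comparing an operation that is safe to apply to partial results (a sum) with one that is not (a mean). The following plain-Java sketch uses hypothetical names and fixed sample values; it only illustrates the arithmetic, not the Hadoop API.

```java
/** Illustrates why only some operations are safe to reuse as a Combiner. */
final class CombinerSafety {
    static long sum(long... values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static double mean(double... values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.length;
    }

    public static void main(String[] args) {
        // Summing partial sums gives the same answer as summing everything at once.
        System.out.println(sum(1, 2, 3, 4, 5) == sum(sum(1, 2), sum(3, 4, 5)));
        // Averaging partial averages does not: the group sizes are lost.
        System.out.println(mean(1, 2, 3, 4, 5) + " vs " + mean(mean(1, 2), mean(3, 4, 5)));
    }
}
```

A job that needs a mean can still use a Combiner if the intermediate values carry both a running sum and a count, so that the final division happens only in the Reducer.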

Hands-On Exercise: Writing and Implementing a Combiner

• In this Hands-On Exercise, you will gain practice writing Combiners

• Please refer to the Hands-On Exercise Manual

The setup Method

• It is common to want your Mapper or Reducer to execute some code before the map or reduce method is called
  – Initialize data structures
  – Read data from an external file
  – Set parameters

• The setup method is run before the map or reduce method is called for the first time

public void setup(Context context)

The cleanup Method

• Similarly, you may wish to perform some action(s) after all the records have been processed by your Mapper or Reducer

• The cleanup method is called before the Mapper or Reducer terminates

public void cleanup(Context context) throws IOException, InterruptedException
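The order in which setup, map and cleanup are invoked can be modelled in plain Java. The class below is a hypothetical stand-in for a task, not the Hadoop Mapper class: it records each call so the sequence (setup once, map once per record, cleanup once) is visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of a task lifecycle; class and method names are illustrative. */
class LifecycleSketch {
    final List<String> calls = new ArrayList<>();

    void setup() {
        calls.add("setup");          // e.g. initialize data structures
    }

    void map(String record) {
        calls.add("map:" + record);  // called once per input record
    }

    void cleanup() {
        calls.add("cleanup");        // e.g. emit totals, release resources
    }

    /** Drives the lifecycle: setup once, map per record, cleanup once. */
    void run(List<String> records) {
        setup();
        for (String record : records) {
            map(record);
        }
        cleanup();
    }

    public static void main(String[] args) {
        LifecycleSketch task = new LifecycleSketch();
        task.run(Arrays.asList("a", "b"));
        System.out.println(task.calls);
    }
}
```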

Passing Parameters: The Wrong Way!

public class MyClass {
  private static int param;
  ...
  private static class MyMapper extends Mapper ... {
    public void map... {
      // Wrong: param is assigned only in the driver's JVM. Each task runs in
      // its own JVM, where this static field still holds its default value.
      int v = param;
    }
  }
  ...
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    param = 5;
    ...
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

Passing Parameters: The Right Way

public class MyClass {
  private static class MyMapper extends Mapper ... {
    public void setup(Context context) {
      Configuration conf = context.getConfiguration();
      // The second argument is the default used when "param" is not set
      int v = conf.getInt("param", 0);
      ...
    }
    public void map...
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("param", 5);
    Job job = new Job(conf);
    ...
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
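Why the configuration works where a static field does not can be sketched without Hadoop: the configuration is serialized on the driver side and rebuilt on the task side, so values stored in it travel with the job. The model below uses java.util.Properties as a stand-in for the job configuration; ParamPassingSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Models shipping a parameter from driver to task through a serialized configuration. */
final class ParamPassingSketch {
    /** Driver side: store the parameter in the configuration and serialize it. */
    static byte[] driverSide(int param) {
        Properties conf = new Properties();
        conf.setProperty("param", Integer.toString(param));
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            conf.store(bytes, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Task side: rebuild the configuration and read the parameter, defaulting to 0. */
    static int taskSide(byte[] shippedConf) {
        Properties conf = new Properties();
        try {
            conf.load(new ByteArrayInputStream(shippedConf));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Integer.parseInt(conf.getProperty("param", "0"));
    }

    public static void main(String[] args) {
        System.out.println(taskSide(driverSide(5)));
    }
}
```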

What Does The Partitioner Do?

• The Partitioner divides up the keyspace
  – Controls which Reducer each intermediate key and its associated values goes to

• Often, the default behavior is fine
  – Default is the HashPartitioner

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
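The bit mask in getPartition matters because hashCode() can be negative, and a negative operand to % yields a negative result. The sketch below applies the same expression to raw int hash codes; HashPartitionSketch is a hypothetical class used only to show the arithmetic.

```java
/** Demonstrates the arithmetic used by the default hash partitioning expression. */
final class HashPartitionSketch {
    /** Same expression as in the HashPartitioner shown above, applied to a raw hash code. */
    static int partition(int hashCode, int numReduceTasks) {
        // Clearing the sign bit guarantees a non-negative value before the modulo.
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partition("the".hashCode(), 10));
        System.out.println(partition(-7, 10));
        // Math.abs is not a safe substitute: abs(Integer.MIN_VALUE) is still negative.
        System.out.println(Math.abs(Integer.MIN_VALUE) % 10);
        System.out.println(partition(Integer.MIN_VALUE, 10));
    }
}
```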

Custom Partitioners

• Sometimes you will need to write your own Partitioner

• Example: your key is a custom WritableComparable which contains a pair of values (a, b)
  – You may decide that all keys with the same value for a need to go to the same Reducer
  – The default Partitioner is not sufficient in this case

Custom Partitioners (cont’d)

• Custom Partitioners are needed when performing a secondary sort (see later)

• Custom Partitioners are also useful to avoid potential performance issues
  – To avoid one Reducer having to deal with many very large lists of values
  – Example: in our word count job, we wouldn't want a single Reducer dealing with all the three- and four-letter words, while another only had to handle 10- and 11-letter words

Creating a Custom Partitioner

• To create a custom Partitioner:

1. Create a class for the Partitioner
  – Should extend Partitioner

2. Create a method in the class called getPartition
  – Receives the key, the value, and the number of Reducers
  – Should return an int between 0 and one less than the number of Reducers
  – e.g., if it is told there are 10 Reducers, it should return an int between 0 and 9

3. Specify the custom Partitioner in your driver code

job.setPartitionerClass(MyPartitioner.class);
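For the (a, b) pair key mentioned earlier, the partitioning decision can be sketched in plain Java. Pair and PairPartitionSketch are hypothetical names, and the key is an ordinary class rather than a WritableComparable; the point is only that hashing a alone sends every key with the same a to the same Reducer and keeps the result within 0 to numReduceTasks - 1.

```java
/** Sketch of partitioning a composite (a, b) key on its first field only. */
final class PairPartitionSketch {
    /** Hypothetical composite key holding the pair (a, b). */
    static final class Pair {
        final String a;
        final int b;

        Pair(String a, int b) {
            this.a = a;
            this.b = b;
        }
    }

    /** Ignores b, so all keys that share a land on the same Reducer. */
    static int getPartition(Pair key, int numReduceTasks) {
        return (key.a.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(getPartition(new Pair("x", 1), 10));
        System.out.println(getPartition(new Pair("x", 99), 10));
    }
}
```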

Aside: Setting up Variables for your Partitioner

• If you need to set up variables for use in your partitioner, it should implement Configurable

• Example:

class MyPartitioner<K, V> extends Partitioner<K, V> implements Configurable {

  private Configuration configuration;
  // Define your own variables here

  @Override
  public void setConf(Configuration configuration) {
    this.configuration = configuration;
    // Set up your variables here
  }

  @Override
  public Configuration getConf() {
    return configuration;
  }
  ...
}

Aside: Setting up Variables for your Partitioner (cont’d)

• If a Hadoop object implements Configurable, its setConf() method will be called once, when it is instantiated

• You can therefore set up variables in the setConf() method which your getPartition() method will then be able to access

Hands-On Exercise: Writing a Partitioner

• In this Hands-On Exercise, you will write code which uses a Partitioner and multiple Reducers

• Please refer to the Hands-On Exercise Manual

Accessing HDFS Programmatically

• In addition to using the command-line shell, you can access HDFS programmatically
  – Useful if your code needs to read or write ‘side data’ in addition to the standard MapReduce inputs and outputs
  – Or for programs outside of Hadoop which need to read the results of MapReduce jobs

• Beware: HDFS is not a general-purpose filesystem!
  – Files cannot be modified once they have been written, for example

• Hadoop provides the FileSystem abstract base class
  – Provides an API to generic file systems
    • Could be HDFS
    • Could be your local file system
    • Could even be, for example, Amazon S3

The FileSystem API

• In order to use the FileSystem API, retrieve an instance of it

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

• The conf object has read in the Hadoop configuration files, and therefore knows the address of the NameNode etc.

• A file in HDFS is represented by a Path object

Path p = new Path("/path/to/my/file");

The FileSystem API (cont’d)

• Some useful API methods:
  – FSDataOutputStream create(...)
    • Extends java.io.DataOutputStream
    • Provides methods for writing primitives, raw bytes etc
  – FSDataInputStream open(...)
    • Extends java.io.DataInputStream
    • Provides methods for reading primitives, raw bytes etc
  – boolean delete(...)
  – boolean mkdirs(...)
  – void copyFromLocalFile(...)
  – void copyToLocalFile(...)
  – FileStatus[] listStatus(...)
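Because the streams returned by create and open extend java.io.DataOutputStream and java.io.DataInputStream, the write and read calls behave like the plain java.io versions. The sketch below exercises those calls against an in-memory byte array instead of HDFS; DataStreamSketch and its methods are hypothetical names, and no Hadoop classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Round-trips raw bytes and an int through java.io data streams (in memory, not HDFS). */
final class DataStreamSketch {
    /** Writes the raw bytes followed by a 4-byte int. */
    static byte[] write(byte[] raw, int number) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.write(raw);        // raw bytes
            out.writeInt(number);  // a primitive, written big-endian
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Skips rawLength bytes, then reads the int that follows them. */
    static int readIntAfter(byte[] data, int rawLength) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readFully(new byte[rawLength]);
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = write(new byte[] {1, 2, 3}, 42);
        System.out.println(data.length + " bytes, int = " + readIntAfter(data, 3));
    }
}
```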

The FileSystem API: Directory Listing

• Get a directory listing:

Path p = new Path("/my/path");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] fileStats = fs.listStatus(p);
for (int i = 0; i < fileStats.length; i++) {
  Path f = fileStats[i].getPath();
  // do something interesting
}

The FileSystem API: Writing Data

• Write data to a file

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path p = new Path("/my/path/foo");
// second argument: do not overwrite an existing file
FSDataOutputStream out = fs.create(p, false);
// write some raw bytes
out.write(getBytes());
// write an int
out.writeInt(getInt());
...
out.close();

The Distributed Cache: Motivation

• A common requirement is for a Mapper or Reducer to need access to some ‘side data’
  – Lookup tables
  – Dictionaries
  – Standard configuration values

• One option: read directly from HDFS in the setup method
  – Works, but is not scalable

• The Distributed Cache provides an API to push data to all slave nodes
  – Transfer happens behind the scenes before any task is executed
  – Note: Distributed Cache is read-only
  – Files in the Distributed Cache are automatically deleted from slave nodes when the job finishes

Using the Distributed Cache: The Difficult Way

• Place the files into HDFS

• Configure the Distributed Cache in your driver code
  – .jar files added with addFileToClassPath will be added to your Mapper or Reducer’s classpath
  – Files added with addCacheArchive will automatically be dearchived/decompressed

Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat"), conf);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), conf);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), conf);

Using the DistributedCache: The Easy Way

• If you are using ToolRunner, you can add files to the Distributed Cache directly from the command line when you run the job
  – No need to copy the files to HDFS first

• Use the -files option to add files (a comma-separated list, with no spaces between entries)

hadoop jar myjar.jar MyDriver -files file1,file2,file3,...

• The -archives flag adds archived files, and automatically unarchives them on the destination machines

• The -libjars flag adds jar files to the classpath

Accessing Files in the Distributed Cache

• Files added to the Distributed Cache are made available in your task’s local working directory
  – Access them from your Mapper or Reducer the way you would read any ordinary local file

File f = new File("file_name_here");
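Since a cached file is read like any local file, a lookup table can be loaded into memory once, for example from the setup method. The sketch below assumes a tab-separated key/value file; that format, the class name LookupTableSketch, and the writeTemp helper (which only creates sample data) are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Loads a tab-separated lookup table from a local file into memory. */
final class LookupTableSketch {
    /** Reads "key<TAB>value" lines; lines without a tab are skipped. */
    static Map<String, String> load(File f) {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(f.toPath(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    table.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }

    /** Helper for the demonstration only: writes sample content to a temporary file. */
    static File writeTemp(String content) {
        try {
            Path p = Files.createTempFile("lookup", ".tsv");
            Files.write(p, content.getBytes(StandardCharsets.UTF_8));
            return p.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File f = writeTemp("us\tUnited States\nfr\tFrance\n");
        System.out.println(load(f));
    }
}
```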

Reusable Classes for the New API

• The org.apache.hadoop.mapreduce.lib.*/* packages contain a library of Mappers, Reducers, and Partitioners supporting the new API

• Example classes:
  – InverseMapper – Swaps keys and values
  – RegexMapper – Extracts text based on a regular expression
  – IntSumReducer, LongSumReducer – Add up all values for a key
  – TotalOrderPartitioner – Reads a previously-created partition file and partitions based on the data from that file
    • Sample the data first to create the partition file
    • Allows you to partition your data into n partitions without hard-coding the partitioning information

• Refer to the Javadoc for classes available in your version of CDH
  – Available classes vary greatly from version to version
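The idea behind partitioning from a file of split points can be sketched in plain Java. This is a simplified model under stated assumptions, not the library's implementation: TotalOrderSketch is a hypothetical class, the split points are passed in as a sorted array rather than read from a partition file, and keys are plain strings. With n split points it yields n + 1 partitions whose key ranges are in ascending order.

```java
import java.util.Arrays;

/** Simplified model of range partitioning from sorted split points. */
final class TotalOrderSketch {
    /**
     * Returns the index of the range containing key.
     * splitPoints must be sorted ascending; a key equal to a split point
     * belongs to the range that starts at that split point.
     */
    static int getPartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Found: the key starts the next range. Not found: binarySearch
        // returns -(insertionPoint) - 1, and the insertion point is the range index.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] splitPoints = {"g", "n", "t"};
        for (String word : new String[] {"cat", "mat", "sofa", "the"}) {
            System.out.println(word + " -> partition " + getPartition(word, splitPoints));
        }
    }
}
```

Because every key in partition i sorts before every key in partition i + 1, concatenating the Reducer outputs in partition order gives globally sorted output.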

Conclusion

In this chapter you have learned

• How to use the ToolRunner class

• How to decrease the amount of intermediate data with Combiners

• How to set up and tear down Mappers and Reducers by using the setup and cleanup methods

• How to write custom Partitioners for better load balancing

• How to access HDFS programmatically

• How to use the distributed cache

• How to use the Hadoop API’s library of Mappers, Reducers, and Partitioners

Practical Development Tips and Techniques
Chapter 7

Course Chapters

• Introduction
• The Motivation for Hadoop
• Hadoop: Basic Concepts
• Writing a MapReduce Program
• Unit Testing MapReduce Programs
• Delving Deeper into the Hadoop API
• Practical Development Tips and Techniques
• Data Input and Output
• Common MapReduce Algorithms
• Joining Data Sets in MapReduce Jobs

• Conclusion
• Cloudera Enterprise
• Graph Manipulation in MapReduce

• Integrating Hadoop into the Enterprise Workflow
• Machine Learning and Mahout
• An Introduction to Hive and Pig
• An Introduction to Oozie

Introduction to Apache Hadoop and its Ecosystem

Basic Programming with the Hadoop Core API

Problem Solving with MapReduce

Course Conclusion and Appendices

Course Introduction

The Hadoop Ecosystem

Practical Development Tips and Techniques

In this chapter you will learn

• Strategies for debugging MapReduce code

• How to test MapReduce code locally by using LocalJobRunner

• How to write and view log files

• How to retrieve job information with counters

• How to determine the optimal number of Reducers for a job

• Why reusing objects is a best practice

• How to create Map-only MapReduce jobs

Chapter Topics

Basic Programming with the Hadoop Core API
Practical Development Tips and Techniques

• Strategies for debugging MapReduce code
• Testing MapReduce code locally using LocalJobRunner
• Writing and viewing log files
• Retrieving job information with Counters
• Determining the optimal number of Reducers for a job
• Reusing objects
• Creating Map-only MapReduce jobs
• Hands-On Exercise: Using Counters and a Map-Only Job
• Conclusion

Introduction to Debugging

• Debugging MapReduce code is difficult!
  – Each instance of a Mapper runs as a separate task
    • Often on a different machine
  – Difficult to attach a debugger to the process
  – Difficult to catch ‘edge cases’

• Very large volumes of data mean that unexpected input is likely to appear
  – Code which expects all data to be well-formed is likely to fail

Common-Sense Debugging Tips

• Code defensively
  – Ensure that input data is in the expected format
  – Expect things to go wrong
  – Catch exceptions

• Start small, build incrementally

• Make as much of your code as possible Hadoop-agnostic
  – Makes it easier to test

• Write unit tests

• Test locally whenever possible
  – With small amounts of data

• Then test in pseudo-distributed mode

• Finally, test on the cluster
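Two of these tips, coding defensively and keeping logic Hadoop-agnostic, can be combined: put record parsing in a plain class that validates its input and never throws on bad data, so it can be unit tested without a cluster. The sketch below assumes a hypothetical "word<TAB>count" record format; DefensiveParseSketch and its members are illustrative names.

```java
/** Hadoop-agnostic record parser that tolerates malformed input. */
final class DefensiveParseSketch {
    /** Number of records rejected so far; a real job might report this via a counter. */
    int badRecords = 0;

    /** Parses "word<TAB>count"; returns -1 and counts the record as bad if malformed. */
    int parseCount(String record) {
        String[] fields = record.split("\t");
        if (fields.length != 2) {
            badRecords++;            // wrong number of fields
            return -1;
        }
        try {
            return Integer.parseInt(fields[1].trim());
        } catch (NumberFormatException e) {
            badRecords++;            // count field is not a number
            return -1;
        }
    }

    public static void main(String[] args) {
        DefensiveParseSketch parser = new DefensiveParseSketch();
        System.out.println(parser.parseCount("the\t4"));
        System.out.println(parser.parseCount("the 4"));
        System.out.println(parser.parseCount("the\tfour"));
        System.out.println("bad records: " + parser.badRecords);
    }
}
```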

Testing Strategies

• When testing in pseudo-distributed mode, ensure that you are testing with a similar environment to that on the real cluster
  – Same amount of RAM allocated to the task JVMs
  – Same version of Hadoop
  – Same version of Java
  – Same versions of third-party libraries

Testing Locally

• Hadoop can run MapReduce in a single, local process
  – Does not require any Hadoop daemons to be running
  – Uses the local filesystem instead of HDFS
  – Known as LocalJobRunner mode

• This is a very useful way of quickly testing incremental changes to code

Testing Locally (cont’d)

• To run in LocalJobRunner mode, add the following lines to the driver code:

// conf is the Configuration used to create the Job
conf.set("mapreduce.jobtracker.address", "local");
conf.set("fs.defaultFS", "file:///");

  – CDH3:
    • mapred.job.tracker, fs.default.name
  – Or set these options on the command line with the -D flag
    • If your code is using ToolRunner

• Some limitations of LocalJobRunner mode:
  – Distributed Cache does not work
  – The job can only specify a single Reducer
  – Some ‘beginner’ mistakes may not be caught
    • For example, attempting to share data between Mappers will work, because the code is running in a single JVM

LocalJobRunner Mode in Eclipse

• The installation of Eclipse on your VMs is configured to run Hadoop code in LocalJobRunner mode
  – From within the IDE

• This allows rapid development iterations
  – ‘Agile programming’

LocalJobRunner Mode in Eclipse (cont’d)

• Specify a Run Configuration

LocalJobRunner Mode in Eclipse (cont’d)

• Select Java Application, then select the New button

• Verify that the Project and Main Class fields are pre-filled correctly

LocalJobRunner Mode in Eclipse (cont’d)

• Specify values in the Arguments tab
  – Local input and output files
  – Any configuration options needed when your job runs

• Define breakpoints if desired

• Execute the application in run mode or debug mode

LocalJobRunner Mode in Eclipse (cont’d)

• Review output in the Eclipse console window

Page 281: Cloudera_Developer_Training

07#16%©"Copyright"2010/2012"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri=en"consent."

Chapter"Topics"

Basic%Programming%with%the%%Hadoop%Core%API%

Prac+cal%Development%Tips%%and%Techniques%

!  Strategies"for"debugging"MapReduce"code"!  Tes@ng"MapReduce"code"locally"using"LocalJobRunner"! Wri+ng%and%viewing%log%files%!  Retrieving"job"informa@on"with"Counters"!  Determining"the"op@mal"number"of"Reducers"for"a"job"!  Reusing"objects"!  Crea@ng"Map/only"MapReduce"jobs"!  Hands/On"Exercise:"Using"Counters"and"a"Map/Only"Job"!  Conclusion"

Page 282: Cloudera_Developer_Training

Before Logging: stdout and stderr

• Tried-and-true debugging technique: write to stdout or stderr
• If running in LocalJobRunner mode, you will see the results of System.err.println()
• If running on a cluster, that output will not appear on your console
    – Output is visible via Hadoop’s Web UI

Page 283: Cloudera_Developer_Training

Aside: The Hadoop Web UI

• All Hadoop daemons contain a Web server
    – Exposes information on a well-known port
• Most important for developers is the JobTracker Web UI
    – http://<job_tracker_address>:50030/
    – http://localhost:50030/ if running in pseudo-distributed mode
• Also useful: the NameNode Web UI
    – http://<name_node_address>:50070/

Page 284: Cloudera_Developer_Training

Aside: The Hadoop Web UI (cont’d)

• Your instructor will now demonstrate the JobTracker UI

Page 285: Cloudera_Developer_Training

Logging: Better Than Printing

• println statements rapidly become awkward
    – Turning them on and off in your code is tedious, and leads to errors
• Logging provides much finer-grained control over:
    – What gets logged
    – When something gets logged
    – How something is logged

Page 286: Cloudera_Developer_Training

Logging With log4j

• Hadoop uses log4j to generate all its log files
• Your Mappers and Reducers can also use log4j
    – All the initialization is handled for you by Hadoop
• Add the $HADOOP_HOME/lib/log4j-1.2.15.jar file to your classpath when you reference the log4j classes

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    class FooMapper implements Mapper {
        private static final Logger LOGGER =
            Logger.getLogger(FooMapper.class.getName());
        ...
    }

Page 287: Cloudera_Developer_Training

Logging With log4j (cont’d)

• Simply send strings to loggers tagged with severity levels:

    LOGGER.trace("message");
    LOGGER.debug("message");
    LOGGER.info("message");
    LOGGER.warn("message");
    LOGGER.error("message");

• Beware expensive operations like concatenation
    – To avoid the performance penalty, make the call conditional like this:

    if (LOGGER.isDebugEnabled()) {
        LOGGER.debug("Account info:" + acct.getReport());
    }
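The guard pattern above can be tried outside Hadoop. The sketch below is a stand-in, not log4j code: it uses java.util.logging from the JDK, where isLoggable(Level.FINE) plays the role of isDebugEnabled(). The class GuardedLoggingDemo and the helper buildReport() are hypothetical names introduced for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class GuardedLoggingDemo {
    private static final Logger LOGGER =
        Logger.getLogger(GuardedLoggingDemo.class.getName());

    // Counts how often the expensive message is actually built.
    static int reportsBuilt = 0;

    // Stand-in for an expensive call such as acct.getReport().
    static String buildReport() {
        reportsBuilt++;
        return "balance=42";
    }

    // Builds the message only if the FINE (debug-like) level is enabled.
    static void logGuarded() {
        if (LOGGER.isLoggable(Level.FINE)) {
            LOGGER.fine("Account info: " + buildReport());
        }
    }

    // Always builds the message, even if the logger then discards it.
    static void logUnguarded() {
        LOGGER.fine("Account info: " + buildReport());
    }

    public static void main(String[] args) {
        LOGGER.setLevel(Level.INFO);   // FINE messages are disabled
        logUnguarded();
        logGuarded();
        System.out.println("Reports built: " + reportsBuilt);
    }
}
```

With the level set to INFO, only the unguarded call pays the cost of building the message.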

Page 288: Cloudera_Developer_Training

log4j Configuration

• Configuration for log4j is stored in /etc/hadoop/conf/log4j.properties
• Can change global log settings with the hadoop.root.logger property
• Can override log level on a per-class basis:

    log4j.logger.org.apache.hadoop.mapred.JobTracker=WARN
    log4j.logger.com.mycompany.myproject.FooMapper=DEBUG

• Programmatically:

    LOGGER.setLevel(Level.WARN);

Page 289: Cloudera_Developer_Training

Dynamically Setting Log Levels

• Although log levels can be set in log4j.properties, this would require modification of files on all slave nodes
    – In practice, this is unrealistic
• Instead, a good solution is to set the log level in your code based on a command-line parameter

Page 290: Cloudera_Developer_Training

Dynamically Setting Log Levels (cont’d)

• In the code for your Mapper or Reducer:

    public void configure(JobConf conf) {
        if ("DEBUG".equals(conf.get("com.cloudera.job.logging"))) {
            LOGGER.setLevel(Level.DEBUG);
            LOGGER.debug("** Log Level set to DEBUG **");
        }
    }

• Then on the command line, specify the log level:

    $ hadoop jar wc.jar WordCountWTool \
        -D com.cloudera.job.logging=DEBUG indir outdir

Page 291: Cloudera_Developer_Training

Where Are Log Files Stored?

• Log files are stored by default at /var/log/hadoop-0.20-mapreduce/userlogs/${task.id}/syslog on the machine where the task attempt ran
    – Configurable
• Tedious to have to ssh in to a node to view its logs
    – Much easier to use the JobTracker Web UI
    – Automatically retrieves and displays the log files for you

Page 292: Cloudera_Developer_Training

Restricting Log Output

• If you suspect the input data of being faulty, you may be tempted to log the (key, value) pairs your Mapper receives
    – Reasonable for small amounts of input data
    – Caution! If your job runs across 500GB of input data, you could be writing up to 500GB of log files!
    – Remember to think at scale…
• Instead, wrap vulnerable sections of code in try {...} blocks
    – Write logs in the catch {...} block
    – This way only critical data is logged
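The try/catch approach can be sketched in plain Java without any Hadoop classes. In this hypothetical example (the class RestrictedLoggingDemo, the method sumValidRecords() and the sample records are assumptions for illustration), only records that fail to parse produce a log line; in a real Mapper the catch block would call a logger instead of appending to a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RestrictedLoggingDemo {
    // Collects log lines so the effect is easy to inspect; real code would use a logger.
    static final List<String> logLines = new ArrayList<>();

    // Parses each record as an integer and sums the valid ones.
    // Only records that fail to parse are logged.
    static long sumValidRecords(List<String> records) {
        long sum = 0;
        for (String record : records) {
            try {
                sum += Integer.parseInt(record.trim());
            } catch (NumberFormatException e) {
                logLines.add("Bad record skipped: '" + record + "'");
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long sum = sumValidRecords(Arrays.asList("10", "20", "oops", "30"));
        System.out.println("sum=" + sum + ", logged=" + logLines.size());
    }
}
```

The amount of log output now grows with the number of bad records rather than with the size of the input.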

Page 293: Cloudera_Developer_Training

Aside: Throwing Exceptions

• You can throw exceptions if a particular condition is met
    – For example, if illegal data is found

    throw new RuntimeException("Your message here");

• Usually not a good idea
    – Exception causes the task to fail
    – If a task fails four times, the entire job will fail

Page 294: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API: Practical Development Tips and Techniques

• Current topic: Retrieving job information with Counters

Page 295: Cloudera_Developer_Training

What Are Counters?

• Counters provide a way for Mappers or Reducers to pass aggregate values back to the driver after the job has completed
    – Their values are also visible from the JobTracker’s Web UI
    – And are reported on the console when the job ends
• Very basic: just have a name and a value
    – Value can be incremented within the code
• Counters are collected into Groups
    – Within the group, each Counter has a name
• Example: A group of Counters called RecordType
    – Names: TypeA, TypeB, TypeC
    – Appropriate Counter will be incremented as each record is read in the Mapper

Page 296: Cloudera_Developer_Training

What Are Counters? (cont’d)

• Counters provide a way for Mappers or Reducers to pass aggregate values back to the driver after the job has completed
    – Their values are also visible from the JobTracker’s Web UI
• Counters can be set and incremented via the method

    context.getCounter(group, name).increment(amount);

• Example:

    context.getCounter("RecordType","A").increment(1);

Page 297: Cloudera_Developer_Training

Retrieving Counters in the Driver Code

• To retrieve Counters in the driver code after the job is complete, use code like this:

    long typeARecords =
        job.getCounters().findCounter("RecordType","A").getValue();

    long typeBRecords =
        job.getCounters().findCounter("RecordType","B").getValue();
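The group/name structure of Counters can be modelled with nested maps. The following is a plain-Java stand-in, not Hadoop's Counter implementation: the class CounterGroupsDemo and its methods increment() and getValue() are hypothetical and only mirror the shape of the calls shown above.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

class CounterGroupsDemo {
    // group name -> (counter name -> value)
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    // Adds amount to the named counter, creating the group and counter on first use.
    void increment(String group, String name, long amount) {
        groups.computeIfAbsent(group, g -> new TreeMap<>())
              .merge(name, amount, Long::sum);
    }

    // Returns the current value, or 0 if the counter was never incremented.
    long getValue(String group, String name) {
        return groups.getOrDefault(group, Collections.emptyMap())
                     .getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        CounterGroupsDemo counters = new CounterGroupsDemo();
        // Increment one counter per record, as a Mapper would while reading input.
        for (String recordType : new String[] {"A", "B", "A", "C", "A"}) {
            counters.increment("RecordType", recordType, 1);
        }
        System.out.println("A=" + counters.getValue("RecordType", "A"));
        System.out.println("B=" + counters.getValue("RecordType", "B"));
    }
}
```

In a real job the framework aggregates the per-task values for you; this sketch keeps everything in one process purely to show the naming scheme.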

Page 298: Cloudera_Developer_Training

Counters: Caution

• Do not rely on a counter’s value from the Web UI while a job is running
    – Due to possible speculative execution, a counter’s value could appear larger than the actual final value
    – Modifications to counters from subsequently killed/failed tasks will be removed from the final count

Page 299: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API: Practical Development Tips and Techniques

• Current topic: Determining the optimal number of Reducers for a job

Page 300: Cloudera_Developer_Training

How Many Reducers Do You Need?

• An important consideration when creating your job is how many Reducers to specify
• Default is a single Reducer
• With a single Reducer, one task receives all keys in sorted order
    – This is sometimes advantageous if the output must be in completely sorted order
    – Can cause significant problems if there is a large amount of intermediate data
    – Node on which the Reducer is running may not have enough disk space to hold all intermediate data
    – The Reducer will take a long time to run

Page 301: Cloudera_Developer_Training

Jobs Which Require a Single Reducer

• If a job needs to output a file where all keys are listed in sorted order, a single Reducer must be used
• Alternatively, the TotalOrderPartitioner can be used
    – Uses an externally generated file which contains information about intermediate key distribution
    – Partitions data such that all keys which go to the first Reducer are smaller than any which go to the second, etc
    – In this way, multiple Reducers can be used
    – Concatenating the Reducers’ output files results in a totally ordered list
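The idea behind total-order partitioning can be illustrated in plain Java. This is a sketch of the principle only, not the TotalOrderPartitioner class: the split points "h" and "q" are made up (in Hadoop they come from the externally generated file), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TotalOrderPartitionDemo {
    // Sorted split points: keys below "h" go to partition 0,
    // keys from "h" up to (not including) "q" go to 1, the rest go to 2.
    static final String[] SPLIT_POINTS = {"h", "q"};

    // Returns the index of the Reducer that should receive the key.
    static int getPartition(String key) {
        int idx = Arrays.binarySearch(SPLIT_POINTS, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is not a split point.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }

    // Partitions the keys, sorts each partition on its own (as each Reducer would),
    // and concatenates the partitions in order.
    static List<String> partitionSortAndConcatenate(String[] keys) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i <= SPLIT_POINTS.length; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(getPartition(key)).add(key);
        }
        List<String> concatenated = new ArrayList<>();
        for (List<String> partition : partitions) {
            Collections.sort(partition);
            concatenated.addAll(partition);
        }
        return concatenated;
    }

    public static void main(String[] args) {
        String[] keys = {"pear", "apple", "zebra", "kiwi", "grape", "tomato"};
        System.out.println(partitionSortAndConcatenate(keys));
    }
}
```

Because every key in one partition is smaller than every key in the next, sorting within each partition is enough to make the concatenated result globally sorted.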

Page 302: Cloudera_Developer_Training

Jobs Which Require a Fixed Number of Reducers

• Some jobs will require a specific number of Reducers
• Example: a job must output one file per day of the week
    – Key will be the weekday
    – Seven Reducers will be specified
    – A Partitioner will be written which sends one key to each Reducer
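The partitioning rule for the weekday example might look like the following plain-Java sketch. It shows only the key-to-partition mapping, not a Hadoop Partitioner subclass; the class name WeekdayPartitionDemo and the assumption that keys are English weekday names such as "Monday" are illustrative.

```java
import java.time.DayOfWeek;
import java.util.Locale;

class WeekdayPartitionDemo {
    // Maps a weekday key such as "Monday" to a partition number 0..6 (Monday = 0).
    // Throws IllegalArgumentException if the key is not a weekday name.
    static int getPartition(String weekdayKey) {
        DayOfWeek day = DayOfWeek.valueOf(weekdayKey.trim().toUpperCase(Locale.ROOT));
        return day.getValue() - 1;   // getValue() is 1 (Monday) to 7 (Sunday)
    }

    public static void main(String[] args) {
        for (String key : new String[] {"Monday", "Thursday", "Sunday"}) {
            System.out.println(key + " -> partition " + getPartition(key));
        }
    }
}
```

With seven Reducers, each partition number corresponds to exactly one Reducer and therefore to one output file.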

Page 303: Cloudera_Developer_Training

Jobs With a Variable Number of Reducers

• Many jobs can be run with a variable number of Reducers
• Developer must decide how many to specify
    – Each Reducer should get a reasonable amount of intermediate data, but not too much
    – Chicken-and-egg problem
• Typical way to determine how many Reducers to specify:
    – Test the job with a relatively small test data set
    – Extrapolate to calculate the amount of intermediate data expected from the ‘real’ input data
    – Use that to calculate the number of Reducers which should be specified
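The extrapolation described above is simple arithmetic. The sketch below assumes that intermediate data grows linearly with input size and that you have chosen a target amount of intermediate data per Reducer; both assumptions, the 5 GB target and all names, are illustrative rather than taken from the course.

```java
class ReducerEstimateDemo {
    // Estimates the number of Reducers from a small test run.
    static int estimateReducers(long sampleInputBytes, long sampleIntermediateBytes,
                                long realInputBytes, long targetBytesPerReducer) {
        // Ratio of intermediate data to input data observed in the test run.
        double ratio = (double) sampleIntermediateBytes / sampleInputBytes;
        // Extrapolate to the real input, assuming the ratio stays the same.
        double expectedIntermediateBytes = ratio * realInputBytes;
        return (int) Math.max(1, Math.ceil(expectedIntermediateBytes / targetBytesPerReducer));
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        // Test run: 1 GB of input produced 0.5 GB of intermediate data.
        // Real run: 500 GB of input, aiming for about 5 GB per Reducer.
        int reducers = estimateReducers(gb, gb / 2, 500 * gb, 5 * gb);
        System.out.println("Estimated Reducers: " + reducers);
    }
}
```

The estimate is only as good as the test data set: if the sample is not representative, the ratio will not carry over to the real input.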

Page 304: Cloudera_Developer_Training

Jobs With a Variable Number of Reducers (cont’d)

• Note: you should take into account the number of Reduce slots likely to be available on the cluster
    – If your job requires one more Reduce slot than there are available, a second ‘wave’ of Reducers will run
        – Consisting just of that single Reducer
        – Potentially doubling the amount of time spent on the Reduce phase
    – In this case, increasing the number of Reducers further may cut down the time spent in the Reduce phase
        – Two or more waves will run, but the Reducers in each wave will have to process less data
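The effect of waves can be made concrete with a small calculation. This is a deliberately crude model: it assumes every Reducer receives an equal share of the data and that a wave takes time proportional to that share. The class ReduceWavesDemo and the figure of 10 Reduce slots are hypothetical.

```java
class ReduceWavesDemo {
    // Number of waves needed to run all Reducers on the available Reduce slots.
    static int waves(int reducers, int reduceSlots) {
        return (reducers + reduceSlots - 1) / reduceSlots;
    }

    // Crude relative duration of the Reduce phase: each wave takes time
    // proportional to the share of data one Reducer handles (1 / reducers).
    static double relativeTime(int reducers, int reduceSlots) {
        return waves(reducers, reduceSlots) * (1.0 / reducers);
    }

    public static void main(String[] args) {
        int slots = 10;
        for (int reducers : new int[] {10, 11, 20}) {
            System.out.printf("reducers=%d waves=%d relativeTime=%.3f%n",
                reducers, waves(reducers, slots), relativeTime(reducers, slots));
        }
    }
}
```

Under this model, 11 Reducers on 10 slots need two waves and come out slower than either 10 or 20 Reducers, which is the situation the slide warns about.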

Page 305: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API: Practical Development Tips and Techniques

• Current topic: Reusing objects

Page 306: Cloudera_Developer_Training

Reuse of Objects is Good Practice

• It is generally good practice to reuse objects
    – Instead of creating many new objects
• Example: our original WordCount Mapper code

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            for (String word : line.split("\\W+")) {
                if (word.length() > 0) {
                    context.write(new Text(word), new IntWritable(1));
                }
            }
        }
    }

    – Each time the map() method is called, we create a new Text object and a new IntWritable object for every word in the line

Page 307: Cloudera_Developer_Training

Reuse of Objects is Good Practice (cont’d)

• Instead, this is better practice:

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text wordObject = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            for (String word : line.split("\\W+")) {
                if (word.length() > 0) {
                    wordObject.set(word);
                    context.write(wordObject, one);
                }
            }
        }
    }

    – Create objects for the key and value outside of your map() method

Page 308: Cloudera_Developer_Training

Reuse of Objects is Good Practice (cont’d)

• In the improved WordMapper (same code as on the previous slide):
    – Within the map() method, populate the objects and write them out
    – Hadoop will take care of serializing the data, so it is perfectly safe to reuse the objects

Page 309: Cloudera_Developer_Training

Object Reuse: Caution!

• Hadoop re-uses objects all the time
• For example, each time the Reducer is passed a new value, the same object is reused
• This can cause subtle bugs in your code
    – For example, if you build a list of value objects in the Reducer, each element of the list will point to the same underlying object
    – Unless you do a deep copy
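The bug described above can be reproduced in plain Java. In this sketch the class MutableInt stands in for a Writable such as IntWritable, and reusingValues() imitates an iterator that hands back the same object on every call; both are hypothetical stand-ins and not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ObjectReuseDemo {
    // Minimal mutable holder, standing in for a Writable such as IntWritable.
    static class MutableInt {
        int value;
        MutableInt(int value) { this.value = value; }
    }

    // Imitates a reusing value iterator: next() refills and returns one shared object.
    static Iterable<MutableInt> reusingValues(int... raw) {
        final MutableInt shared = new MutableInt(0);
        return () -> new Iterator<MutableInt>() {
            private int i = 0;
            @Override public boolean hasNext() { return i < raw.length; }
            @Override public MutableInt next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.value = raw[i++];
                return shared;
            }
        };
    }

    // Buggy: stores references, so every element ends up showing the last value.
    static List<Integer> collectWithoutCopy(Iterable<MutableInt> values) {
        List<MutableInt> kept = new ArrayList<>();
        for (MutableInt v : values) kept.add(v);
        return unwrap(kept);
    }

    // Correct: copies each value before storing it.
    static List<Integer> collectWithCopy(Iterable<MutableInt> values) {
        List<MutableInt> kept = new ArrayList<>();
        for (MutableInt v : values) kept.add(new MutableInt(v.value));
        return unwrap(kept);
    }

    private static List<Integer> unwrap(List<MutableInt> kept) {
        List<Integer> result = new ArrayList<>();
        for (MutableInt v : kept) result.add(v.value);
        return result;
    }

    public static void main(String[] args) {
        System.out.println("without copy: " + collectWithoutCopy(reusingValues(1, 2, 3)));
        System.out.println("with copy:    " + collectWithCopy(reusingValues(1, 2, 3)));
    }
}
```

Without the copy, the list holds three references to one object and reports the last value three times; with the copy, the original values survive.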

Page 310: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API: Practical Development Tips and Techniques

• Current topic: Creating Map-only MapReduce jobs

Page 311: Cloudera_Developer_Training

Map-Only MapReduce Jobs

• There are many types of job where only a Mapper is needed
• Examples:
    – Image processing
    – File format conversion
    – Input data sampling
    – ETL

Page 312: Cloudera_Developer_Training

Creating Map-Only Jobs

• To create a Map-only job, set the number of Reducers to 0 in your Driver code

    job.setNumReduceTasks(0);

• Call the Job.setOutputKeyClass and Job.setOutputValueClass methods to specify the output classes
    – Not the Job.setMapOutputKeyClass and Job.setMapOutputValueClass methods
• Anything written using the Context.write method will be written to HDFS
    – Rather than written as intermediate data
    – One file per Mapper will be written

Page 313: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API: Practical Development Tips and Techniques

• Current topic: Hands-On Exercise: Using Counters and a Map-Only Job

Page 314: Cloudera_Developer_Training

Hands-On Exercise: Using Counters and a Map-Only Job

• In this Hands-On Exercise you will write a Map-Only MapReduce job using Counters
• Please refer to the Hands-On Exercise Manual

Page 315: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API: Practical Development Tips and Techniques

• Current topic: Conclusion

Page 316: Cloudera_Developer_Training

Conclusion

In this chapter you have learned

• Strategies for debugging MapReduce code
• How to test MapReduce code locally by using LocalJobRunner
• How to write and view log files
• How to retrieve job information with counters
• How to determine the optimal number of Reducers for a job
• Why reusing objects is a best practice
• How to create Map-only MapReduce jobs

Page 317: Cloudera_Developer_Training

Data Input and Output
Chapter 8

Page 318: Cloudera_Developer_Training

Course Chapters

• Current chapter: Data Input and Output
    – Part of the section: Basic Programming with the Hadoop Core API

Page 319: Cloudera_Developer_Training

Data Input and Output

In this chapter you will learn

• How to create custom Writable and WritableComparable implementations
• How to save binary data using SequenceFile and Avro data files
• How to implement custom InputFormats and OutputFormats
• What issues to consider when using file compression

Page 320: Cloudera_Developer_Training

Recap: Inputs to Mappers

Page 321: Cloudera_Developer_Training

Recap: Sort and Shuffle

Page 322: Cloudera_Developer_Training

Recap: Reducers to Outputs

Page 323: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API: Data Input and Output

• Creating custom Writable and WritableComparable implementations (current topic)
• Saving binary data using SequenceFiles and Avro data files
• Implementing custom InputFormats and OutputFormats
• Issues to consider when using file compression
• Hands-On Exercise: Using SequenceFiles and File Compression
• Conclusion

Page 324: Cloudera_Developer_Training

Data Types in Hadoop

• Writable
    – Defines a de/serialization protocol. Every data type in Hadoop is a Writable
• WritableComparable
    – Defines a sort order. All keys must be WritableComparable
• IntWritable, LongWritable, Text, …
    – Concrete classes for different data types

Page 325: Cloudera_Developer_Training

‘Box’ Classes in Hadoop

• Hadoop’s built-in data types are ‘box’ classes
    – They contain a single piece of data
        – Text: String
        – IntWritable: int
        – LongWritable: long
        – FloatWritable: float
        – etc.
• Writable defines the wire transfer format
    – How the data is serialized and deserialized

Page 326: Cloudera_Developer_Training

Creating a Complex Writable

• Example: say we want a tuple (a, b)
    – We could artificially construct it by, for example, saying

    Text t = new Text(a + "," + b);
    ...
    String[] arr = t.toString().split(",");

• Inelegant
• Problematic
    – If a or b contained commas, for example
• Not always practical
    – Doesn’t easily work for binary objects
• Solution: create your own Writable object
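The comma problem is easy to demonstrate with plain strings. In this sketch the helpers pack() and unpack() and the sample values are hypothetical; they mirror the concatenate-and-split approach shown above without using the Text class.

```java
class DelimitedTupleDemo {
    // Packs a two-element tuple into one comma-separated string.
    static String pack(String a, String b) {
        return a + "," + b;
    }

    // Splits the packed string back into its parts.
    static String[] unpack(String packed) {
        return packed.split(",");
    }

    public static void main(String[] args) {
        // Works as long as neither element contains a comma.
        System.out.println(unpack(pack("Smith", "42")).length + " parts");
        // Breaks when the first element itself contains a comma.
        System.out.println(unpack(pack("Smith, John", "42")).length + " parts");
    }
}
```

The second tuple comes back as three parts instead of two, so the original boundary between a and b is lost.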

Page 327: Cloudera_Developer_Training

The Writable Interface

    public interface Writable {
        void readFields(DataInput in) throws IOException;
        void write(DataOutput out) throws IOException;
    }

• The readFields and write methods will define how your custom object will be serialized and deserialized by Hadoop
• The DataInput and DataOutput interfaces support
    – boolean
    – byte, char (Unicode: 2 bytes)
    – double, float, int, long
    – String (Unicode or UTF-8)
    – Line until line terminator
    – unsigned byte, short
    – byte array

Page 328: Cloudera_Developer_Training

A Sample Custom Writable: DateWritable

    class DateWritable implements Writable {
        int month, day, year;

        // Constructors omitted for brevity

        public void readFields(DataInput in) throws IOException {
            this.month = in.readInt();
            this.day = in.readInt();
            this.year = in.readInt();
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(this.month);
            out.writeInt(this.day);
            out.writeInt(this.year);
        }
    }
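The write/readFields pair can be exercised with the JDK's own data streams. In the sketch below, DateRecord has the same fields and methods as the DateWritable above but does not implement Hadoop's Writable interface, so that the example runs without Hadoop on the classpath. The names DateRoundTripDemo, DateRecord, serialize() and deserialize() are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class DateRoundTripDemo {
    // Same fields and methods as the slide's DateWritable, minus the Hadoop interface.
    static class DateRecord {
        int month, day, year;

        DateRecord() { }

        DateRecord(int month, int day, int year) {
            this.month = month;
            this.day = day;
            this.year = year;
        }

        void write(DataOutput out) throws IOException {
            out.writeInt(month);
            out.writeInt(day);
            out.writeInt(year);
        }

        // Fields must be read in exactly the order they were written.
        void readFields(DataInput in) throws IOException {
            month = in.readInt();
            day = in.readInt();
            year = in.readInt();
        }
    }

    static byte[] serialize(DateRecord date) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            date.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static DateRecord deserialize(byte[] data) {
        DateRecord date = new DateRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            date.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return date;
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new DateRecord(12, 4, 2014));
        DateRecord copy = deserialize(bytes);
        System.out.println(bytes.length + " bytes -> "
            + copy.month + "/" + copy.day + "/" + copy.year);
    }
}
```

Three int fields serialize to 12 bytes, and reading them back in the same order restores the original date.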

Page 329: Cloudera_Developer_Training

What About Binary Objects?

• Solution: use byte arrays
• Write idiom:
    – Serialize object to byte array
    – Write byte count
    – Write byte array
• Read idiom:
    – Read byte count
    – Create byte array of proper size
    – Read byte array
    – Deserialize object
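The write and read idioms above map directly onto java.io.DataOutput and java.io.DataInput. The sketch below covers the length-prefix part only; how an object is turned into bytes and back is left out, and the names LengthPrefixedBytesDemo, writeBytes(), readBytes() and roundTrip() are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class LengthPrefixedBytesDemo {
    // Write idiom: byte count first, then the bytes themselves.
    static void writeBytes(DataOutput out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Read idiom: read the count, allocate an array of that size, then fill it.
    static byte[] readBytes(DataInput in) throws IOException {
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload);
        return payload;
    }

    // Writes the payload to an in-memory buffer and reads it back.
    static byte[] roundTrip(byte[] payload) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeBytes(out, payload);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return readBytes(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = {0, 1, 2, (byte) 255};
        byte[] copy = roundTrip(original);
        System.out.println("Round trip intact: " + java.util.Arrays.equals(original, copy));
    }
}
```

Writing the count first lets the reader allocate exactly the right amount of space, so the payload may contain any byte values, including ones that would otherwise look like delimiters.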

Page 330: Cloudera_Developer_Training

WritableComparable

• WritableComparable is a sub-interface of Writable
    – Must implement compareTo, hashCode, equals methods
• All keys in MapReduce must be WritableComparable

Page 331: Cloudera_Developer_Training

Making our Sample Object a WritableComparable

    class DateWritable implements WritableComparable<DateWritable> {
        int month, day, year;

        // Constructors omitted for brevity

        public void readFields(DataInput in) . . .   // Refer to Writable example
        public void write(DataOutput out) . . .      // Refer to Writable example

        public boolean equals(Object o) {
            if (o instanceof DateWritable) {
                DateWritable other = (DateWritable) o;
                return this.year == other.year
                    && this.month == other.month
                    && this.day == other.day;
            }
            return false;
        }

Page 332: Cloudera_Developer_Training

Making our Sample Object a WritableComparable (cont’d)

        public int compareTo(DateWritable other) {
            // Return -1 if this date is earlier
            // Return 0 if dates are equal
            // Return 1 if this date is later
            if (this.year != other.year) {
                return (this.year < other.year ? -1 : 1);
            } else if (this.month != other.month) {
                return (this.month < other.month ? -1 : 1);
            } else if (this.day != other.day) {
                return (this.day < other.day ? -1 : 1);
            }
            return 0;
        }

        public int hashCode() {
            int seed = 163; // Arbitrary seed value
            return this.year * seed + this.month * seed + this.day * seed;
        }
    }
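One property of the hashCode() above is worth noticing: because every field is multiplied by the same seed, dates whose fields add up to the same total get the same hash code. That is legal, since equal hash codes for unequal objects are allowed, but a hash that treats each field position differently may spread keys more evenly. The sketch below compares the slide's formula with java.util.Objects.hash; the class DateHashDemo and its method names are hypothetical, and the alternative is only a suggestion, not part of the course code.

```java
import java.util.Objects;

class DateHashDemo {
    // The hashCode formula from the slide: every field is multiplied by the same seed.
    static int slideHash(int year, int month, int day) {
        int seed = 163;
        return year * seed + month * seed + day * seed;
    }

    // Alternative: Objects.hash combines the fields position by position.
    static int positionalHash(int year, int month, int day) {
        return Objects.hash(year, month, day);
    }

    public static void main(String[] args) {
        // Two different dates: 2 January 2012 and 1 February 2012.
        System.out.println("slide formula collides: "
            + (slideHash(2012, 1, 2) == slideHash(2012, 2, 1)));
        System.out.println("Objects.hash collides:  "
            + (positionalHash(2012, 1, 2) == positionalHash(2012, 2, 1)));
    }
}
```

Whichever formula is used, hashCode() must stay consistent with equals(): two dates that are equal must always return the same hash code.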

Page 333: Cloudera_Developer_Training

Using Custom Types in MapReduce Jobs

• Use methods in Job to specify your custom key/value types
• For output of Mappers:

    job.setMapOutputKeyClass()
    job.setMapOutputValueClass()

• For output of Reducers:

    job.setOutputKeyClass()
    job.setOutputValueClass()

• Input types are defined by InputFormat
    – See later

Page 334: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API: Data Input and Output

• Current topic: Saving binary data using SequenceFiles and Avro data files

Page 335: Cloudera_Developer_Training

What Are SequenceFiles?

• SequenceFiles are files containing binary-encoded key-value pairs
    – Work naturally with Hadoop data types
    – SequenceFiles include metadata which identifies the data type of the key and value
• Actually, three file types in one
    – Uncompressed
    – Record-compressed
    – Block-compressed
• Often used in MapReduce
    – Especially when the output of one job will be used as the input for another
    – SequenceFileInputFormat
    – SequenceFileOutputFormat

Page 336: Cloudera_Developer_Training

Directly Accessing SequenceFiles

• It is possible to directly access SequenceFiles from your code:

    Configuration config = new Configuration();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(config), path, config);
    Text key = (Text) reader.getKeyClass().newInstance();
    IntWritable value = (IntWritable) reader.getValueClass().newInstance();
    while (reader.next(key, value)) {
        // do something here
    }
    reader.close();

Page 337: Cloudera_Developer_Training

Problems With SequenceFiles

• SequenceFiles are very useful but have some potential problems
• They are typically accessible only via the Java API
    – Some work has been done to allow access from other languages
• If the definition of the key or value object changes, the file becomes unreadable

Page 338: Cloudera_Developer_Training

An Alternative to SequenceFiles: Avro

• Apache Avro is a serialization format which is becoming a popular alternative to SequenceFiles
    – Project was created by Doug Cutting, the creator of Hadoop
• Self-describing file format
    – The schema for the data is included in the file itself
• Compact file format
• Portable across multiple languages
    – Support for C, C++, Java, Python, Ruby and others
• Compatible with Hadoop
    – Via the AvroMapper and AvroReducer classes

Page 339: Cloudera_Developer_Training

Chapter Topics

Basic Programming with the Hadoop Core API: Data Input and Output

• Current topic: Implementing custom InputFormats and OutputFormats

Page 340: Cloudera_Developer_Training

Reprise: The Role of the InputFormat

Page 341: Cloudera_Developer_Training

Most Common InputFormats

• Most common InputFormats:
    – TextInputFormat
    – KeyValueTextInputFormat
    – SequenceFileInputFormat
• Others are available
    – NLineInputFormat
        – Every n lines of an input file is treated as a separate InputSplit
        – Configure in the driver code by setting:
            mapreduce.input.lineinputformat.linespermap (CDH 4)
            mapred.line.input.format.linespermap (CDH 3)
    – MultiFileInputFormat
        – Abstract class that manages the use of multiple files in a single task
        – You must supply a getRecordReader() implementation

Page 342: Cloudera_Developer_Training

How FileInputFormat Works

• All file-based InputFormats inherit from FileInputFormat
• FileInputFormat computes InputSplits based on the size of each file, in bytes
  – HDFS block size is used as upper bound for InputSplit size
  – Lower bound can be specified in your driver code
  – This means that an InputSplit typically correlates to an HDFS block
    – So the number of Mappers will equal the number of HDFS blocks of input data to be processed
• Important: InputSplits do not respect record boundaries!
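The split-sizing rule above can be sketched in plain Java. This is a minimal local sketch, not Hadoop code: the class and method names are hypothetical, and the 10% allowance on the final split is an assumption modelled on how FileInputFormat is commonly described.

```java
import java.util.ArrayList;
import java.util.List;

// Local sketch of FileInputFormat-style split sizing; no Hadoop classes are used.
class SplitSizeSketch {
    // Assumption: the final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Block size is the default split size; a configured lower bound can raise it.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns the length in bytes of each split for a file of fileLength bytes.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining > 0) {
            lengths.add(remaining); // last (possibly short or slightly long) split
        }
        return lengths;
    }
}
```

With a 128 MB block size and default bounds, a 300 MB file yields three splits (128 MB, 128 MB, 44 MB) in this sketch, and so three Mappers.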

Page 343: Cloudera_Developer_Training

What RecordReaders Do

• InputSplits are handed to the RecordReaders
  – Specified by the path, starting position offset, length
• RecordReaders must:
  – Ensure each (key, value) pair is processed
  – Ensure no (key, value) pair is processed more than once
  – Handle (key, value) pairs which are split across InputSplits
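One common way to meet these three requirements for line-oriented data is sketched below in plain Java. It is a local simulation over a byte array, not Hadoop's RecordReader. The convention assumed here is that every split except the first skips its first line, and every reader reads past the end of its split to finish its last line. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Local sketch: reading newline-terminated records from one split of a file.
class LineSplitReaderSketch {
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // The reader of the previous split owns the line that reaches into this split.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        List<String> records = new ArrayList<>();
        // "<= end": a line starting exactly at the boundary is read here,
        // because the next split's reader will skip it.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++; // may run past 'end'
            records.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++;
        }
        return records;
    }
}
```

Reading every split of a file with this method returns each line exactly once, even when a line straddles a split boundary.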

Page 344: Cloudera_Developer_Training

Sample InputSplit

Page 345: Cloudera_Developer_Training

From InputSplits to RecordReaders

Page 346: Cloudera_Developer_Training

Writing Custom InputFormats

• Use FileInputFormat as a starting point
  – Extend it
• Write your own custom RecordReader
• Override the getRecordReader method in FileInputFormat
  – In the new (org.apache.hadoop.mapreduce) API, the equivalent method is createRecordReader
• Override isSplitable if you don’t want input files to be split
  – Method is passed each file name in turn
  – Return false for non-splittable files

Page 347: Cloudera_Developer_Training

Reprise: Role of the OutputFormat

Page 348: Cloudera_Developer_Training

OutputFormat

• OutputFormats work much like InputFormat classes
• Custom OutputFormats must provide a RecordWriter implementation

Page 349: Cloudera_Developer_Training

Chapter Topics (Basic Programming with the Hadoop Core API: Data Input and Output)

• Current topic: Issues to consider when using file compression

Page 350: Cloudera_Developer_Training

Hadoop and Compressed Files

• Hadoop understands a variety of file compression formats
  – Including GZip
• If a compressed file is included as one of the files to be processed, Hadoop will automatically decompress it and pass the decompressed contents to the Mapper
  – There is no need for the developer to worry about decompressing the file
• However, GZip is not a ‘splittable file format’
  – A GZipped file can only be decompressed by starting at the beginning of the file and continuing on to the end
  – You cannot start decompressing the file part of the way through it
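The point that a GZipped file cannot be read from the middle can be checked with the JDK alone. The sketch below uses only java.util.zip and no Hadoop classes. The class and method names are hypothetical, and it illustrates the format restriction rather than how Hadoop handles compressed input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Local sketch: a GZip stream decompresses from byte 0, but not from an arbitrary offset.
class GzipSplitDemo {
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Tries to decompress everything from 'offset' to the end of the compressed bytes.
    static boolean canDecompressFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            while (in.read() != -1) {
                // drain the stream; the trailer checksum is verified at the end
            }
            return true;
        } catch (IOException e) {
            return false; // no valid GZip header or trailer when starting mid-file
        }
    }
}
```

Calling canDecompressFrom(gz, 0) succeeds, while starting at gz.length / 2 fails because the reader finds no valid GZip header there.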

Page 351: Cloudera_Developer_Training

Non-Splittable File Formats and Hadoop

• If the MapReduce framework receives a non-splittable file (such as a GZipped file) it passes the entire file to a single Mapper
• This can result in one Mapper running for far longer than the others
  – It is dealing with an entire file, while the others are dealing with smaller portions of files
  – Speculative execution could occur
    – Although this will provide no benefit
• Typically it is not a good idea to use GZip to compress files which will be processed by MapReduce

Page 352: Cloudera_Developer_Training

Splittable Compression Formats: LZO

• One splittable compression format is LZO
• Because of licensing restrictions, LZO cannot be shipped with Hadoop
  – But it is easy to add
  – See https://github.com/cloudera/hadoop-lzo
• To make an LZO file splittable, you must first index the file
• The index file contains information about how to break the LZO file into splits that can be decompressed
• Access the splittable LZO file as follows:
  – In Java MapReduce programs, use the LzoTextInputFormat class
  – In Streaming jobs, specify -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat on the command line

Page 353: Cloudera_Developer_Training

Splittable Compression for SequenceFiles and Avro Files Using the Snappy Codec

• Snappy is a relatively new compression codec
  – Developed at Google
  – Very fast
• Snappy does not compress a SequenceFile and produce, e.g., a file with a .snappy extension
  – Instead, it is a codec that can be used to compress data within a file
  – That data can be decompressed automatically by Hadoop (or other programs) when the file is read
  – Works well with SequenceFiles, Avro files
• Snappy is now preferred over LZO

Page 354: Cloudera_Developer_Training

Compressing Output SequenceFiles With Snappy

• Specify output compression in the Job object
• Specify block or record compression
  – Block compression is recommended for the Snappy codec
• Set the compression codec to the Snappy codec in the Job object
• For example:

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
. . .
// Write the job output as a SequenceFile
job.setOutputFormatClass(SequenceFileOutputFormat.class);
// Turn on output compression and select the Snappy codec
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
// Compress blocks of records rather than individual records
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

Page 355: Cloudera_Developer_Training

Chapter Topics (Basic Programming with the Hadoop Core API: Data Input and Output)

• Current topic: Hands-On Exercise: Using SequenceFiles and File Compression

Page 356: Cloudera_Developer_Training

Hands-On Exercise: Using SequenceFiles and File Compression

• In this Hands-On Exercise, you will explore reading and writing uncompressed and compressed SequenceFiles
• Please refer to the Hands-On Exercise Manual

Page 357: Cloudera_Developer_Training

Chapter Topics (Basic Programming with the Hadoop Core API: Data Input and Output)

• Current topic: Conclusion

Page 358: Cloudera_Developer_Training

Conclusion

In this chapter you have learned
• How to create custom Writable and WritableComparable implementations
• How to save binary data using SequenceFiles and Avro data files
• How to implement custom InputFormats and OutputFormats
• What issues to consider when using file compression

Page 359: Cloudera_Developer_Training

Common MapReduce Algorithms
Chapter 9

Page 360: Cloudera_Developer_Training

Course Chapters

• Current chapter: Common MapReduce Algorithms (Problem Solving with MapReduce)

Page 361: Cloudera_Developer_Training

Common MapReduce Algorithms

In this chapter you will learn
• How to sort and search large data sets
• How to perform a secondary sort
• How to index data
• How to compute term frequency – inverse document frequency (TF-IDF)
• How to calculate word co-occurrence

Page 362: Cloudera_Developer_Training

Introduction

• MapReduce jobs tend to be relatively short in terms of lines of code
• It is typical to combine multiple small MapReduce jobs together in a single workflow
  – Often using Oozie (see later)
• You are likely to find that many of your MapReduce jobs use very similar code
• In this chapter we present some very common MapReduce algorithms
  – These algorithms are frequently the basis for more complex MapReduce jobs

Page 363: Cloudera_Developer_Training

Chapter Topics

Problem Solving with MapReduce: Common MapReduce Algorithms

• Sorting and searching large data sets (current topic)
• Performing a secondary sort
• Indexing data
• Hands-On Exercise: Creating an Inverted Index
• Computing term frequency – inverse document frequency (TF-IDF)
• Calculating word co-occurrence
• Hands-On Exercise: Calculating Word Co-Occurrence
• Optional Hands-On Exercise: Implementing Word Co-Occurrence with a Custom WritableComparable
• Conclusion

Page 364: Cloudera_Developer_Training

Sorting

• MapReduce is very well suited to sorting large data sets
• Recall: keys are passed to the Reducer in sorted order
• Assuming the file to be sorted contains lines with a single value:
  – Mapper is merely the identity function for the value
      (k, v) -> (v, _)
  – Reducer is the identity function
      (k, _) -> (k, '')

Page 365: Cloudera_Developer_Training

Sorting (cont’d)

• Trivial with a single Reducer
• For multiple Reducers, need to choose a partitioning function such that if
    k1 < k2, partition(k1) <= partition(k2)
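A partitioning function with this ordering property can be built from a sorted list of boundary keys. The sketch below is plain Java, not a Hadoop Partitioner. The class name and the boundary keys are hypothetical, and the boundaries are assumed to be distinct.

```java
import java.util.Arrays;

// Local sketch of a range partitioner for a total sort across several Reducers.
class RangePartitionSketch {
    private final String[] splitPoints; // distinct boundary keys, sorted ascending

    RangePartitionSketch(String... splitPoints) {
        this.splitPoints = splitPoints.clone();
        Arrays.sort(this.splitPoints);
    }

    int numPartitions() {
        return splitPoints.length + 1;
    }

    // Partition index = number of boundary keys <= key, so it never decreases as keys grow.
    int partition(String key) {
        int idx = Arrays.binarySearch(splitPoints, key);
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }
}
```

With boundaries "h" and "p", keys below "h" go to partition 0, keys from "h" up to but excluding "p" go to partition 1, and the rest go to partition 2. Concatenating the Reducer outputs in partition order then gives a fully sorted result. Hadoop ships a TotalOrderPartitioner class built on the same idea.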

Page 366: Cloudera_Developer_Training

Sorting as a Speed Test of Hadoop

• Sorting is frequently used as a speed test for a Hadoop cluster
  – Mapper and Reducer are trivial
    – Therefore sorting is effectively testing the Hadoop framework’s I/O
• Good way to measure the increase in performance if you enlarge your cluster
  – Run and time a sort job before and after you add more nodes
  – terasort is one of the sample jobs provided with Hadoop
    – Creates and sorts very large files

Page 367: Cloudera_Developer_Training

Searching

• Assume the input is a set of files containing lines of text
• Assume the Mapper has been passed the pattern for which to search as a special parameter
  – We saw how to pass parameters to your Mapper in the previous chapter
• Algorithm:
  – Mapper compares the line against the pattern
  – If the pattern matches, Mapper outputs (line, _)
    – Or (filename+line, _), or …
  – If the pattern does not match, Mapper outputs nothing
  – Reducer is the Identity Reducer
    – Just outputs each intermediate key
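The search algorithm can be simulated locally in a few lines of Java. This is a sketch, not Hadoop Mapper or Reducer code: the class name is hypothetical, the pattern is assumed to be a Java regular expression, and sorting the matched lines stands in for the shuffle that delivers keys to the Reducer in order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Local sketch of the MapReduce search (grep) algorithm.
class SearchSketch {
    // "Mapper": emits the line as the key when it matches, nothing otherwise.
    static List<String> map(String line, Pattern pattern) {
        List<String> out = new ArrayList<>();
        if (pattern.matcher(line).find()) {
            out.add(line);
        }
        return out;
    }

    // Simulated shuffle plus Identity Reducer: matched lines come out in sorted key order.
    static List<String> run(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            keys.addAll(map(line, pattern));
        }
        Collections.sort(keys);
        return keys;
    }
}
```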

Page 368: Cloudera_Developer_Training

Chapter Topics (Problem Solving with MapReduce: Common MapReduce Algorithms)

• Current topic: Performing a secondary sort

Page 369: Cloudera_Developer_Training

Secondary Sort: Motivation

• Recall that keys are passed to the Reducer in sorted order
• The list of values for a particular key is not sorted
  – Order may well change between different runs of the MapReduce job
• Sometimes a job needs to receive the values for a particular key in a sorted order
  – This is known as a secondary sort

Page 370: Cloudera_Developer_Training

Secondary Sort: Motivation (cont’d)

• Example: Your Reducer will emit the largest value produced by Mappers for each different key
• Naïve solution
  – Loop through all values, keeping track of the largest
  – Finally, emit the largest value
• Better solution
  – Arrange for the values for a given key to be presented to the Reducer in sorted, descending order
  – Reducer just needs to read and emit the first value it is given for a key

Page 371: Cloudera_Developer_Training

Aside: Comparator Classes

• Comparator classes are classes that compare objects
• Custom comparators can be used in a secondary sort to compare composite keys
• Grouping comparators can be used in a secondary sort to ensure that only the natural key is used when grouping values for a call to the Reducer
  – Partitioning on the natural key is handled separately, by the Partitioner

Page 372: Cloudera_Developer_Training

Implementing the Secondary Sort

• To implement a secondary sort, the intermediate key should be a composite of the ‘actual’ (natural) key and the value
• Define a Partitioner which partitions just on the natural key
• Define a Comparator class which sorts on the entire composite key
  – Ensures that the keys are passed to the Reducer in the desired order
  – Orders by natural key and, for the same natural key, on the value portion of the key
  – Specified in the driver code by

      job.setSortComparatorClass(MyOKCC.class);

Page 373: Cloudera_Developer_Training

Implementing the Secondary Sort (cont’d)

• Now we know that all values for the same natural key will go to the same Reducer
  – And they will be in the order we desire
• We must now ensure that all the values for the same natural key are passed in one call to the Reducer
• Achieved by defining a Grouping Comparator class
  – Determines which keys and values are passed in a single call to the Reducer
  – Looks at just the natural key
  – Specified in the driver code by

      job.setGroupingComparatorClass(MyOVGC.class);

Page 374: Cloudera_Developer_Training

Secondary Sort: Example

• Assume we have input with (key, value) pairs like this

    foo 98
    foo 101
    bar 12
    baz 18
    foo 22
    bar 55
    baz 123

• We want the Reducer to receive the intermediate data for each key in descending numerical order

Page 375: Cloudera_Developer_Training

Secondary Sort: Example (cont’d)

• Write the Mapper such that the intermediate key is a composite of the natural key and value
  – For example, intermediate output may look like this:

    ('foo#98', 98)
    ('foo#101', 101)
    ('bar#12', 12)
    ('baz#18', 18)
    ('foo#22', 22)
    ('bar#55', 55)
    ('baz#123', 123)

Page 376: Cloudera_Developer_Training

Secondary Sort: Example (cont’d)

• Write a class that extends WritableComparator and sorts on natural key, and for identical natural keys, sorts on the value portion in descending order
  – Just override compare(WritableComparable, WritableComparable)
  – Supply a reference to this class in your driver using the Job.setSortComparatorClass method
  – Will result in keys being passed to the Reducer in this order:

    ('bar#55', 55)
    ('bar#12', 12)
    ('baz#123', 123)
    ('baz#18', 18)
    ('foo#101', 101)
    ('foo#98', 98)
    ('foo#22', 22)

Page 377: Cloudera_Developer_Training

Secondary Sort: Example (cont’d)

• Finally, write another WritableComparator subclass which just examines the first (‘natural’) portion of the key
  – Again, just override compare(WritableComparable, WritableComparable)
  – Supply a reference to this class in your driver using the Job.setGroupingComparatorClass method
  – This will ensure that values associated with the same natural key are passed in the same call to the Reducer
  – But they’re sorted in descending order, as we required
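The whole secondary-sort example can be simulated without a cluster. The sketch below is plain Java rather than WritableComparator subclasses. All names are hypothetical, and the two comparators stand in for the sort comparator and the grouping comparator described above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local sketch of a secondary sort: composite key, sort comparator, grouping comparator.
class SecondarySortSketch {
    // Composite key: the natural key plus the value it was emitted with.
    static final class CompositeKey {
        final String natural;
        final int value;

        CompositeKey(String natural, int value) {
            this.natural = natural;
            this.value = value;
        }
    }

    // Sort comparator: natural key ascending, then value descending.
    static final Comparator<CompositeKey> SORT = (a, b) -> {
        int byNatural = a.natural.compareTo(b.natural);
        return byNatural != 0 ? byNatural : Integer.compare(b.value, a.value);
    };

    // Grouping comparator: looks at just the natural key.
    static final Comparator<CompositeKey> GROUP = (a, b) -> a.natural.compareTo(b.natural);

    // "Reducer": the first key of each group carries the largest value, so emit only that.
    static Map<String, Integer> largestPerKey(List<CompositeKey> mapOutput) {
        List<CompositeKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        Map<String, Integer> result = new LinkedHashMap<>();
        CompositeKey previous = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                result.put(key.natural, key.value); // start of a new group
            }
            previous = key;
        }
        return result;
    }
}
```

Fed the seven (key, value) pairs from the example, largestPerKey returns bar=55, baz=123, foo=101, in that order.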

Page 378: Cloudera_Developer_Training

Chapter Topics (Problem Solving with MapReduce: Common MapReduce Algorithms)

• Current topic: Indexing data

Page 379: Cloudera_Developer_Training

Indexing

• Assume the input is a set of files containing lines of text
• Key is the byte offset of the line, value is the line itself
• We can retrieve the name of the file using the Context object
  – More details on how to do this later

Page 380: Cloudera_Developer_Training

Inverted Index Algorithm

• Mapper:
  – For each word in the line, emit (word, filename)
• Reducer:
  – Identity function
    – Collect together all values for a given key (i.e., all filenames for a particular word)
    – Emit (word, filename_list)
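The inverted index algorithm can be sketched as a local Java simulation. This is not Hadoop code, and several choices here are assumptions not made by the slide: the class name, lower-casing of words, splitting on non-word characters, and keeping each filename only once per word.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Local sketch of the inverted index algorithm.
class InvertedIndexSketch {
    // documents: filename -> text. Returns word -> sorted set of filenames containing it.
    static Map<String, TreeSet<String>> build(Map<String, String> documents) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            // "Mapper": emit (word, filename) for every word in the text
            for (String word : doc.getValue().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue; // split() can yield a leading empty token
                }
                // "Reducer": collect together all filenames seen for this word
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }
}
```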

Page 381: Cloudera_Developer_Training

Inverted Index: Dataflow

Page 382: Cloudera_Developer_Training

Aside: Word Count

• Recall the WordCount example we used earlier in the course
  – For each word, Mapper emitted (word, 1)
  – Very similar to the inverted index
• This is a common theme: reuse of existing Mappers, with minor modifications

Page 383: Cloudera_Developer_Training

Chapter Topics (Problem Solving with MapReduce: Common MapReduce Algorithms)

• Current topic: Hands-On Exercise: Creating an Inverted Index

Page 384: Cloudera_Developer_Training

Hands-On Exercise: Creating an Inverted Index

• In this Hands-On Exercise, you will write a MapReduce program to generate an inverted index of a set of documents
• Please refer to the Hands-On Exercise Manual

Page 385: Cloudera_Developer_Training

Chapter Topics (Problem Solving with MapReduce: Common MapReduce Algorithms)

• Current topic: Computing term frequency – inverse document frequency (TF-IDF)

Page 386: Cloudera_Developer_Training

Term Frequency – Inverse Document Frequency

• Term Frequency – Inverse Document Frequency (TF-IDF)
  – Answers the question “How important is this term in a document?”
• Known as a term weighting function
  – Assigns a score (weight) to each term (word) in a document
• Very commonly used in text processing and search
• Has many applications in data mining

Page 387: Cloudera_Developer_Training

TF-IDF: Motivation

• Merely counting the number of occurrences of a word in a document is not a good enough measure of its relevance
  – If the word appears in many other documents, it is probably less relevant
  – Some words appear too frequently in all documents to be relevant
    – Known as ‘stopwords’
• TF-IDF considers both the frequency of a word in a given document and the number of documents which contain the word

Page 388: Cloudera_Developer_Training

TF-IDF: Data Mining Example

• Consider a music recommendation system
  – Given many users’ music libraries, provide “you may also like” suggestions
• If user A and user B have similar libraries, user A may like an artist in user B’s library
  – But some artists will appear in almost everyone’s library, and should therefore be ignored when making recommendations
  – Almost everyone has The Beatles in their record collection!

Page 389: Cloudera_Developer_Training

TF-IDF Formally Defined

• Term Frequency (TF)
  – Number of times a term appears in a document (i.e., the count)
• Inverse Document Frequency (IDF)

    $$\mathrm{idf} = \log\left(\frac{N}{n}\right)$$

  – N: total number of documents
  – n: number of documents that contain a term
• TF-IDF
  – TF × IDF

Page 390: Cloudera_Developer_Training

Computing TF-IDF

• What we need:
  – Number of times t appears in a document
    – Different value for each document
  – Number of documents that contain t
    – One value for each term
  – Total number of documents
    – One value

Page 391: Cloudera_Developer_Training

Computing TF-IDF With MapReduce

• Overview of algorithm: 3 MapReduce jobs
  – Job 1: compute term frequencies
  – Job 2: compute number of documents each word occurs in
  – Job 3: compute TF-IDF
• Notation in following slides:
  – docid = a unique ID for each document
  – contents = the complete text of each document
  – N = total number of documents
  – term = a term (word) found in the document
  – tf = term frequency
  – n = number of documents a term appears in
• Note that real-world systems typically perform ‘stemming’ on terms
  – Removal of plurals, tense, possessives etc

Page 392: Cloudera_Developer_Training

Computing TF-IDF: Job 1 – Compute tf

• Mapper
  – Input: (docid, contents)
  – For each term in the document, generate a (term, docid) pair
    – i.e., we have seen this term in this document once
  – Output: ((term, docid), 1)
• Reducer
  – Sums counts for word in document
  – Outputs ((term, docid), tf)
    – i.e., the term frequency of term in docid is tf
• We can add a Combiner, which will use the same code as the Reducer

Page 393: Cloudera_Developer_Training

Computing TF-IDF: Job 2 – Compute n

• Mapper
  – Input: ((term, docid), tf)
  – Output: (term, (docid, tf, 1))
• Reducer
  – Sums 1s to compute n (number of documents containing term)
  – Note: need to buffer (docid, tf) pairs while we are doing this (more later)
  – Outputs ((term, docid), (tf, n))

Page 394: Cloudera_Developer_Training

Computing TF-IDF: Job 3 – Compute TF-IDF

• Mapper
  – Input: ((term, docid), (tf, n))
  – Assume N is known (easy to find)
  – Output ((term, docid), TF × IDF)
• Reducer
  – The identity function
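The three jobs can be collapsed into one in-memory pass to check the arithmetic on a small corpus. The sketch below is a local Java simulation and not MapReduce code. The class name, the "term@docid" key encoding, the tokenization, and the use of the natural logarithm are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Local sketch of the three TF-IDF jobs, run in memory.
class TfIdfSketch {
    // documents: docid -> contents. Returns "term@docid" -> tf * log(N / n).
    static Map<String, Double> compute(Map<String, String> documents) {
        int totalDocs = documents.size(); // N
        Map<String, Integer> tf = new HashMap<>();               // Job 1: (term, docid) -> tf
        Map<String, Set<String>> docsWithTerm = new HashMap<>(); // Job 2: term -> docids
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                tf.merge(term + "@" + doc.getKey(), 1, Integer::sum);
                docsWithTerm.computeIfAbsent(term, t -> new HashSet<>()).add(doc.getKey());
            }
        }
        Map<String, Double> scores = new HashMap<>();            // Job 3: TF x IDF
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            String term = e.getKey().substring(0, e.getKey().indexOf('@'));
            int n = docsWithTerm.get(term).size();
            scores.put(e.getKey(), e.getValue() * Math.log((double) totalDocs / n));
        }
        return scores;
    }
}
```

For the two documents d1 = "a b a" and d2 = "a c", the term "a" scores 0 in both documents because it appears in every document (log(2/2) = 0), while "b" in d1 scores 1 × log(2/1), about 0.693.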

Page 395: Cloudera_Developer_Training

Computing TF-IDF: Working At Scale

• Job 2: We need to buffer (docid, tf) pairs while summing 1’s (to compute n)
  – Possible problem: pairs may not fit in memory!
  – In how many documents does the word “the” occur?
• Possible solutions
  – Ignore very-high-frequency words
  – Write out intermediate data to a file
  – Use another MapReduce pass

Page 396: Cloudera_Developer_Training

TF-IDF: Final Thoughts

• Several small jobs add up to full algorithm
  – Thinking in MapReduce often means decomposing a complex algorithm into a sequence of smaller jobs
• Beware of memory usage for large amounts of data!
  – Any time when you need to buffer data, there’s a potential scalability bottleneck

Page 397: Cloudera_Developer_Training

Chapter Topics (Problem Solving with MapReduce: Common MapReduce Algorithms)

• Current topic: Calculating word co-occurrence

Page 398: Cloudera_Developer_Training

Word Co-Occurrence: Motivation

• Word Co-Occurrence measures the frequency with which two words appear close to each other in a corpus of documents
  – For some definition of ‘close’
• This is at the heart of many data-mining techniques
  – Provides results for “people who did this, also do that”
  – Examples:
    – Shopping recommendations
    – Credit risk analysis
    – Identifying ‘people of interest’

Page 399: Cloudera_Developer_Training

Word Co-Occurrence: Algorithm

• Mapper

    map(docid a, doc d) {
      foreach w in d do
        foreach u near w do
          emit(pair(w, u), 1)
    }

• Reducer

    reduce(pair p, Iterator counts) {
      s = 0
      foreach c in counts do
        s += c
      emit(p, s)
    }
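The pseudocode above can be turned into a small local Java simulation. This is a sketch and not Hadoop code. The class name is hypothetical, and ‘near’ is assumed to mean "the word immediately following on the same line", which is only one possible definition of closeness.

```java
import java.util.Map;
import java.util.TreeMap;

// Local sketch of word co-occurrence counting for adjacent words.
class CoOccurrenceSketch {
    // Counts ordered pairs "w,u" where u is the word immediately after w on the same line.
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] words = line.toLowerCase().split("\\W+");
            for (int i = 0; i + 1 < words.length; i++) {
                if (words[i].isEmpty()) {
                    continue; // split() can yield a leading empty token
                }
                // "Mapper" emits (pair(w, u), 1); merge() plays the summing "Reducer"
                counts.merge(words[i] + "," + words[i + 1], 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For the lines "the cat sat" and "the cat ran", the pair (the, cat) is counted twice and the pairs (cat, sat) and (cat, ran) once each.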

Page 400: Cloudera_Developer_Training

Chapter Topics (Problem Solving with MapReduce: Common MapReduce Algorithms)

• Current topics: Hands-On Exercise: Calculating Word Co-Occurrence; Optional Hands-On Exercise: Implementing Word Co-Occurrence with a Custom WritableComparable

Page 401: Cloudera_Developer_Training

Hands-On Exercises: Calculating Word Co-Occurrence, Using a Custom WritableComparable

• In these Hands-On Exercises you will write an application that counts the number of times words appear next to each other
• If you complete the first exercise, please attempt the optional follow-up exercise, in which you will rewrite your code to use a custom WritableComparable
• Please refer to the Hands-On Exercise Manual

Page 402: Cloudera_Developer_Training

Chapter Topics (Problem Solving with MapReduce: Common MapReduce Algorithms)

• Current topic: Conclusion

Page 403: Cloudera_Developer_Training

Conclusion

In this chapter you have learned
• How to sort and search large data sets
• How to perform a secondary sort
• How to index data
• How to compute term frequency – inverse document frequency (TF-IDF)
• How to calculate word co-occurrence

Page 404: Cloudera_Developer_Training

Joining Data Sets in MapReduce Jobs
Chapter 10

Page 405: Cloudera_Developer_Training

Course Chapters

• Current chapter: Joining Data Sets in MapReduce Jobs (Problem Solving with MapReduce)

Page 406: Cloudera_Developer_Training

Joining Data Sets in MapReduce Jobs

In this chapter you will learn

•  How to write a Map-side join
•  How to write a Reduce-side join

Page 407: Cloudera_Developer_Training

Introduction

•  We frequently need to join data together from two sources as part of a MapReduce job, such as
   – Lookup tables
   – Data from database tables
•  There are two fundamental approaches: Map-side joins and Reduce-side joins
•  Map-side joins are easier to write, but have potential scaling issues
•  We will investigate both types of joins in this chapter

Page 408: Cloudera_Developer_Training

But First…

•  Avoid writing joins in Java MapReduce if you can!
•  Abstractions such as Pig and Hive are much easier to use
   – Save hours of programming
•  If you are dealing with text-based data, there really is no reason not to use Pig or Hive

Page 409: Cloudera_Developer_Training

Chapter Topics

Problem Solving with MapReduce
Joining Data Sets in MapReduce Jobs

•  Writing a Map-side join
•  Writing a Reduce-side join
•  Conclusion

Page 410: Cloudera_Developer_Training

Map-Side Joins: The Algorithm

•  Basic idea for Map-side joins:
   – Load one set of data into memory, stored in a hash table
      – Key of the hash table is the join key
   – Map over the other set of data, and perform a lookup on the hash table using the join key
      – If the join key is found, you have a successful join
      – Otherwise, do nothing
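The algorithm above can be sketched in plain Java without any Hadoop classes. This is a minimal illustration, not the course's implementation: the class name MapSideJoinSketch, the comma-separated record layouts, and both method names are assumptions made for the example. In a real job the hash table would typically be loaded once per Mapper before any records are processed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, framework-free sketch of a Map-side (hash) join.
final class MapSideJoinSketch {

    // Build the in-memory hash table: join key (location ID) -> location name.
    // Each line is assumed to look like "13,New York City".
    static Map<Integer, String> loadLookup(List<String> locationLines) {
        Map<Integer, String> lookup = new HashMap<Integer, String>();
        for (String line : locationLines) {
            String[] f = line.split(",", 2);
            lookup.put(Integer.parseInt(f[0].trim()), f[1].trim());
        }
        return lookup;
    }

    // "Map" over the other data set. Each employee line is assumed to look
    // like "42,Aaron,13" (employee ID, name, location ID).
    static List<String> join(List<String> employeeLines, Map<Integer, String> lookup) {
        List<String> joined = new ArrayList<String>();
        for (String line : employeeLines) {
            String[] f = line.split(",");
            String locationName = lookup.get(Integer.parseInt(f[2].trim()));
            if (locationName != null) {
                joined.add(line + "," + locationName);  // join key found: emit
            }
            // Otherwise, do nothing.
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> lookup = loadLookup(Arrays.asList("13,New York City"));
        System.out.println(join(Arrays.asList("42,Aaron,13", "43,Betty,99"), lookup));
    }
}
```

The employee with location 99 has no entry in the hash table, so it is silently dropped, which makes this an inner join.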

Page 411: Cloudera_Developer_Training

Map-Side Joins: Problems, Possible Solutions

•  Map-side joins have scalability issues
   – The associative array may become too large to fit in memory
•  Possible solution: break one data set into smaller pieces
   – Load each piece into memory individually, mapping over the second data set each time
   – Then combine the result sets together
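The "smaller pieces" idea can be sketched as follows, assuming each record is a two-element {joinKey, value} array and that join keys are unique on the side being chunked. The class name ChunkedMapSideJoinSketch and the chunkSize parameter are hypothetical; in a cluster each pass would more likely be a separate job than a loop.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: join in several passes so that only one chunk of the
// hash-table side is held in memory at a time.
final class ChunkedMapSideJoinSketch {

    // smallSide: {joinKey, value} pairs, loaded chunkSize entries at a time.
    // largeSide: {joinKey, value} pairs, scanned once per chunk.
    static List<String> join(List<String[]> smallSide, List<String[]> largeSide, int chunkSize) {
        List<String> results = new ArrayList<String>();
        for (int start = 0; start < smallSide.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, smallSide.size());

            // Load one piece into memory.
            Map<String, String> chunk = new HashMap<String, String>();
            for (String[] kv : smallSide.subList(start, end)) {
                chunk.put(kv[0], kv[1]);
            }

            // Map over the second data set for this piece.
            for (String[] kv : largeSide) {
                String match = chunk.get(kv[0]);
                if (match != null) {
                    results.add(kv[1] + "," + match);
                }
            }
            // Combining the result sets is just appending to one list here.
        }
        return results;
    }
}
```

The trade-off is visible in the loops: memory use is bounded by chunkSize, but the second data set is read once for every chunk.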


Page 413: Cloudera_Developer_Training

Reduce-Side Joins: The Basic Concept

•  For a Reduce-side join, the basic concept is:
   – Map over both data sets
   – Emit a (key, value) pair for each record
      – Key is the join key, value is the entire record
   – In the Reducer, do the actual join
      – Because of the Shuffle and Sort, values with the same key are brought together

Page 414: Cloudera_Developer_Training

Reduce-Side Joins: Example

•  Example input data:

    EMP: 42, Aaron, loc(13)
    LOC: 13, New York City

•  Required output:

    EMP: 42, Aaron, loc(13), New York City

Page 415: Cloudera_Developer_Training

Example Record Data Structure

•  A data structure to hold a record could look like this:

    class Record {
      enum Typ { emp, loc };
      Typ type;

      String empName;
      int empId;
      int locId;
      String locationName;
    }

Page 416: Cloudera_Developer_Training

Reduce-Side Join: Mapper

    void map(k, v) {
      Record r = parse(v);
      emit(r.locId, r);
    }

Page 417: Cloudera_Developer_Training

Reduce-Side Join: Reducer

    void reduce(k, values) {
      Record thisLocation = null;
      List<Record> employees = new ArrayList<Record>();

      // Separate the location record from the employee records
      for (Record v in values) {
        if (v.type == Typ.loc) {
          thisLocation = v;
        } else {
          employees.add(v);
        }
      }

      // Attach the location name to each buffered employee and emit it
      for (Record e in employees) {
        e.locationName = thisLocation.locationName;
        emit(e);
      }
    }
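To make the pseudocode concrete, here is a minimal, Hadoop-free simulation of the same Reduce-side join. The names ReduceSideJoinSketch, Rec, shuffle and run are hypothetical, and a TreeMap stands in for the Shuffle and Sort. One added assumption: a group with no location record emits nothing, whereas the slide's pseudocode does not say what should happen in that case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a Reduce-side join that buffers employees per key.
final class ReduceSideJoinSketch {

    static final class Rec {
        final boolean isLocation;
        final int locId;
        final String name;   // employee name or location name

        Rec(boolean isLocation, int locId, String name) {
            this.isLocation = isLocation;
            this.locId = locId;
            this.name = name;
        }
    }

    // Stand-in for the Shuffle and Sort: bring records with the same join key together.
    static Map<Integer, List<Rec>> shuffle(List<Rec> mapOutput) {
        Map<Integer, List<Rec>> groups = new TreeMap<Integer, List<Rec>>();
        for (Rec r : mapOutput) {
            List<Rec> group = groups.get(r.locId);
            if (group == null) {
                group = new ArrayList<Rec>();
                groups.put(r.locId, group);
            }
            group.add(r);
        }
        return groups;
    }

    // Reducer logic: buffer the employees, then attach the location name.
    static List<String> reduce(List<Rec> values) {
        Rec location = null;
        List<Rec> employees = new ArrayList<Rec>();
        for (Rec v : values) {
            if (v.isLocation) {
                location = v;
            } else {
                employees.add(v);
            }
        }
        List<String> out = new ArrayList<String>();
        if (location == null) {
            return out;   // assumption: no location record means no output
        }
        for (Rec e : employees) {
            out.add(e.name + ", loc(" + e.locId + "), " + location.name);
        }
        return out;
    }

    static List<String> run(List<Rec> mapOutput) {
        List<String> out = new ArrayList<String>();
        for (List<Rec> group : shuffle(mapOutput).values()) {
            out.addAll(reduce(group));
        }
        return out;
    }
}
```

The employees list is the part that can grow without bound for a popular location, which is the scalability concern raised next in the deck.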

Page 418: Cloudera_Developer_Training

Scalability Problems With Our Reducer

•  All employees for a given location must potentially be buffered in the Reducer
   – Could result in out-of-memory errors for large data sets
•  Solution: Ensure the location record is the first one to arrive at the Reducer
   – Using a Secondary Sort

Page 419: Cloudera_Developer_Training

A Better Intermediate Key

    class LocKey {
      boolean isPrimary;
      int locId;

      public int compareTo(LocKey k) {
        if (locId != k.locId) {
          return Integer.compare(locId, k.locId);
        } else {
          return Boolean.compare(k.isPrimary, isPrimary);
        }
      }

      public int hashCode() {
        return locId;
      }
    }

Page 420: Cloudera_Developer_Training

A Better Intermediate Key (cont’d)

•  The compareTo method of LocKey ensures that primary keys will sort earlier than non-primary keys for the same location

Page 421: Cloudera_Developer_Training

A Better Intermediate Key (cont’d)

•  The hashCode method of LocKey ensures that all records with the same location ID will go to the same Reducer, whatever the value of isPrimary
   – This is an alternative to providing a custom Partitioner
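The two properties claimed for LocKey can be checked with a small standalone sketch. The wrapper class LocKeySortSketch, the toString labels, and the helper methods are hypothetical. The partition method follows the formula used by Hadoop's default HashPartitioner as I understand it; it is included only to show why hashing on locId alone keeps both key variants on the same Reducer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: verify LocKey's sort order and partitioning behavior.
final class LocKeySortSketch {

    static final class LocKey implements Comparable<LocKey> {
        final boolean isPrimary;
        final int locId;

        LocKey(boolean isPrimary, int locId) {
            this.isPrimary = isPrimary;
            this.locId = locId;
        }

        @Override
        public int compareTo(LocKey k) {
            if (locId != k.locId) {
                return Integer.compare(locId, k.locId);
            }
            // Arguments are swapped so that true (primary) sorts before false.
            return Boolean.compare(k.isPrimary, isPrimary);
        }

        @Override
        public int hashCode() {
            return locId;   // isPrimary is deliberately ignored
        }

        @Override
        public String toString() {
            return locId + (isPrimary ? "P" : "s");
        }
    }

    // Sorts a copy of the keys and returns their labels, e.g. "13P" or "13s".
    static List<String> sortedLabels(List<LocKey> keys) {
        List<LocKey> copy = new ArrayList<LocKey>(keys);
        Collections.sort(copy);
        List<String> labels = new ArrayList<String>();
        for (LocKey k : copy) {
            labels.add(k.toString());
        }
        return labels;
    }

    // Assumed to mirror the default hash partitioning formula.
    static int partition(LocKey key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Sorting a mix of primary and non-primary keys places each location's primary key immediately before its non-primary keys, which is the ordering the streaming Reducer relies on.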

Page 422: Cloudera_Developer_Training

A Better Mapper

    void map(k, v) {
      Record r = parse(v);
      if (r.type == Typ.emp) {
        emit(setisPrimaryFalse(r.locId), r);
      } else {
        emit(setisPrimaryTrue(r.locId), r);
      }
    }

Page 423: Cloudera_Developer_Training

A Better Reducer

    Record thisLoc;

    void reduce(k, values) {
      for (Record v in values) {
        if (v.type == Typ.loc) {
          thisLoc = v;
        } else {
          v.locationName = thisLoc.locationName;
          emit(v);
        }
      }
    }
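A runnable approximation of this streaming Reducer is sketched below. It assumes the values arrive already sorted so that each location record precedes its employees, and it represents each value as a hypothetical {type, locId, name} string array. Unlike the pseudocode, it also checks that the remembered location matches the employee's locId; that guard is an addition for the case where a location record is missing, not something the slide specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: join in a single pass, holding only one location in memory.
final class StreamingJoinReducerSketch {

    // Each value is {type, locId, name}, where type is "loc" or "emp".
    static List<String> reduce(Iterable<String[]> sortedValues) {
        List<String> out = new ArrayList<String>();
        String[] thisLoc = null;
        for (String[] v : sortedValues) {
            if (v[0].equals("loc")) {
                thisLoc = v;   // remember the most recent location record
            } else if (thisLoc != null && thisLoc[1].equals(v[1])) {
                out.add(v[2] + ", loc(" + v[1] + "), " + thisLoc[2]);
            }
            // Employees without a matching location are skipped (an assumption).
        }
        return out;
    }
}
```

No employee list is built, so memory use no longer depends on how many employees share a location.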

Page 424: Cloudera_Developer_Training

Create a Grouping Comparator…

•  Create a Grouping Comparator to ensure that all records with the same location are passed to the Reducer in one call

    class LocIDComparator extends WritableComparator {
      public LocIDComparator() {
        super(LocKey.class, true);   // compare deserialized LocKey objects
      }

      // A grouping comparator compares intermediate keys; isPrimary is ignored
      public int compare(WritableComparable a, WritableComparable b) {
        return Integer.compare(((LocKey) a).locId, ((LocKey) b).locId);
      }
    }

Page 425: Cloudera_Developer_Training

…And Configure Hadoop To Use It In The Driver

    job.setOutputValueGroupingComparator(LocIDComparator.class);

•  This method is defined on JobConf (the old API); when the driver uses the new API's Job class, the equivalent call is job.setGroupingComparatorClass(LocIDComparator.class)


Page 427: Cloudera_Developer_Training

Conclusion

In this chapter you have learned

•  How to write a Map-side join
•  How to write a Reduce-side join

Page 428: Cloudera_Developer_Training

Integrating Hadoop into the Enterprise Workflow
Chapter 11


Page 430: Cloudera_Developer_Training

Integrating Hadoop Into The Enterprise Workflow

In this chapter you will learn

•  How Hadoop can be integrated into an existing enterprise
•  How to load data from an existing RDBMS into HDFS by using Sqoop
•  How to manage real-time data such as log files using Flume
•  How to access HDFS from legacy systems with FuseDFS and HttpFS

Page 431: Cloudera_Developer_Training

Chapter Topics

The Hadoop Ecosystem
Integrating Hadoop into the Enterprise Workflow

•  Integrating Hadoop into an existing enterprise
•  Loading data into HDFS from an RDBMS using Sqoop
•  Hands-On Exercise: Importing Data With Sqoop
•  Managing real-time data using Flume
•  Accessing HDFS from legacy systems with FuseDFS and HttpFS
•  Conclusion

Page 432: Cloudera_Developer_Training

Introduction

•  Your data center already has a lot of components
   – Database servers
   – Data warehouses
   – File servers
   – Backup systems
•  How does Hadoop fit into this ecosystem?

Page 433: Cloudera_Developer_Training

RDBMS Strengths

•  Relational Database Management Systems (RDBMSs) have many strengths
   – Ability to handle complex transactions
   – Ability to process hundreds or thousands of queries per second
   – Real-time delivery of results
   – Simple but powerful query language

Page 434: Cloudera_Developer_Training

RDBMS Weaknesses

•  There are some areas where RDBMSs are less ideal
   – Data schema is determined before data is ingested
      – Can make ad-hoc data collection difficult
   – Upper bound on data storage of 100s of terabytes
   – Practical upper bound on data in a single query of 10s of terabytes

Page 435: Cloudera_Developer_Training

Typical RDBMS Scenario

•  Typical scenario: use an interactive RDBMS to serve queries from a Web site etc.
•  Data is later extracted and loaded into a data warehouse for future processing and archiving
   – Usually denormalized into an OLAP cube

OLAP: OnLine Analytical Processing

Page 436: Cloudera_Developer_Training

Typical RDBMS Scenario (cont’d)

[Diagram labels: Enterprise web site; Interactive database; Data export; OLAP load; Oracle, SAP...; Business intelligence apps]

Page 437: Cloudera_Developer_Training

OLAP Database Limitations

•  All dimensions must be prematerialized
   – Re-materialization can be very time consuming
•  Daily data load-in times can increase
   – Typically this leads to some data being discarded

Page 438: Cloudera_Developer_Training

Using Hadoop to Augment Existing Databases

[Diagram labels: Enterprise web site; Interactive database; Hadoop; New data; Dynamic OLAP queries; Recommendations, etc...; Oracle, SAP...; Business intelligence apps]

Page 439: Cloudera_Developer_Training

Benefits of Hadoop

•  Processing power scales with data storage
   – As you add more nodes for storage, you get more processing power ‘for free’
•  Views do not need prematerialization
   – Ad-hoc full or partial dataset queries are possible
•  Total query size can be multiple petabytes

Page 440: Cloudera_Developer_Training

Hadoop Tradeoffs

•  Cannot serve interactive queries
   – The fastest Hadoop job will still take several seconds to run
•  Less powerful updates
   – No transactions
   – No modification of existing records

Page 441: Cloudera_Developer_Training

Traditional High-Performance File Servers

•  Enterprise data is often held on large fileservers, such as
   – NetApp
   – EMC
•  Advantages:
   – Fast random access
   – Many concurrent clients
•  Disadvantages:
   – High cost per terabyte of storage

Page 442: Cloudera_Developer_Training

File Servers and Hadoop

•  Choice of destination medium depends on the expected access patterns
   – Sequentially read, append-only data: HDFS
   – Random access: file server
•  HDFS can crunch sequential data faster
•  Offloading data to HDFS leaves more room on file servers for ‘interactive’ data
•  Use the right tool for the job!


Page 444: Cloudera_Developer_Training

Importing Data From an RDBMS to HDFS

•  Typical scenario: the need to use data stored in an RDBMS (such as Oracle database, MySQL or Teradata) in a MapReduce job
   – Lookup tables
   – Legacy data
•  Possible to read directly from an RDBMS in your Mapper
   – Can lead to the equivalent of a distributed denial of service (DDoS) attack on your RDBMS
   – In practice – don’t do it!
•  Better scenario: import the data into HDFS beforehand

Page 445: Cloudera_Developer_Training

Sqoop: SQL to Hadoop

•  Sqoop: open source tool originally written at Cloudera
   – Now a top-level Apache Software Foundation project
•  Imports tables from an RDBMS into HDFS
   – Just one table
   – All tables in a database
   – Just portions of a table
      – Sqoop supports a WHERE clause
•  Uses MapReduce to actually import the data
   – ‘Throttles’ the number of Mappers to avoid DDoS scenarios
      – Uses four Mappers by default
      – Value is configurable
•  Uses a JDBC interface
   – Should work with any JDBC-compatible database

Page 446: Cloudera_Developer_Training

Sqoop: SQL to Hadoop (cont’d)

•  Imports data to HDFS as delimited text files or SequenceFiles
   – Default is a comma-delimited text file
•  Can be used for incremental data imports
   – First import retrieves all rows in a table
   – Subsequent imports retrieve just rows created since the last import
•  Generates a class file which can encapsulate a row of the imported data
   – Useful for serializing and deserializing data in subsequent MapReduce jobs

Page 447: Cloudera_Developer_Training

Custom Sqoop Connectors

•  Cloudera has partnered with other organizations to create custom Sqoop connectors
   – Use a system’s native protocols to access data rather than JDBC
   – Provides much faster performance
•  Current systems supported by custom connectors include:
   – Netezza
   – Teradata
   – Oracle Database (connector developed with Quest Software)
•  Others are in development
•  Custom connectors are not open source, but are free
   – Available from the Cloudera Web site

Page 448: Cloudera_Developer_Training

Sqoop: Basic Syntax

•  Standard syntax:

    sqoop tool-name [tool-options]

•  Tools include:

    import
    import-all-tables
    list-tables

•  Options include:

    --connect
    --username
    --password

Page 449: Cloudera_Developer_Training

Sqoop: Example

•  Example: import a table called employees from a database called personnel in a MySQL RDBMS

    sqoop import --username fred --password derf \
      --connect jdbc:mysql://database.example.com/personnel \
      --table employees

•  Example: as above, but only records with an ID greater than 1000

    sqoop import --username fred --password derf \
      --connect jdbc:mysql://database.example.com/personnel \
      --table employees \
      --where "id > 1000"
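When an import is launched from a program rather than typed at a shell, the same arguments can be assembled as a list. The sketch below is only an illustration: SqoopCommandSketch and importArgs are hypothetical names, and no Sqoop Java API is used. The options shown are the command-line options from the slides plus --num-mappers, which to my knowledge is the Sqoop 1 option for changing the default of four Mappers; confirm the exact spelling with sqoop help import on your installation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that builds a "sqoop import" command line.
final class SqoopCommandSketch {

    // where may be null to import the whole table.
    static List<String> importArgs(String jdbcUrl, String user, String table,
                                   String where, int numMappers) {
        List<String> args = new ArrayList<String>(Arrays.asList(
                "sqoop", "import",
                "--connect", jdbcUrl,
                "--username", user,
                "--table", table,
                "--num-mappers", String.valueOf(numMappers)));
        if (where != null) {
            args.add("--where");
            args.add(where);
        }
        return args;
    }

    public static void main(String[] unused) {
        List<String> cmd = importArgs(
                "jdbc:mysql://database.example.com/personnel",
                "fred", "employees", "id > 1000", 2);
        System.out.println(cmd);
        // To actually run it (requires Sqoop on the PATH):
        //   new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

The password is deliberately left out of the argument list, since a password given on the command line is visible to other users of the machine.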

Page 450: Cloudera_Developer_Training

Sqoop: Other Options

•  Sqoop can take data from HDFS and insert it into an already-existing table in an RDBMS with the command

    sqoop export [options]

•  For general Sqoop help:

    sqoop help

•  For help on a particular command:

    sqoop help command


Page 452: Cloudera_Developer_Training

Hands-On Exercise: Importing Data With Sqoop

•  In this Hands-On Exercise, you will import data into HDFS from MySQL
•  Please refer to the Hands-On Exercise Manual


Page 454: Cloudera_Developer_Training

Flume: Basics

•  Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
   – Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
•  Flume is Open Source
   – Initially developed by Cloudera
•  Flume’s design goals:
   – Reliability
   – Scalability
   – Manageability
   – Extensibility

Page 455: Cloudera_Developer_Training

Flume: High-Level Overview

[Diagram: several Flume Agents; labels include compress, encrypt, batch]

•  Optionally process incoming data: perform transformations, suppressions, metadata enrichment
•  Each agent can be configured with an in-memory or durable channel
•  Writes to multiple HDFS file formats (text, SequenceFile, JSON, Avro, others)
•  Parallelized writes across many collectors – as much write throughput as required

Page 456: Cloudera_Developer_Training

Flume Agent Characteristics

•  Each Flume agent has a source, a sink and a channel
•  Source
   – Tells the node where to receive data from
•  Sink
   – Tells the node where to send data to
•  Channel
   – A queue between the Source and Sink
   – Can be in-memory only or ‘Durable’
      – Durable channels will not lose data if power is lost
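The source, channel and sink roles can be pictured with a small in-memory model. This is not Flume's API: AgentPipelineSketch and its method names are hypothetical, and a bounded queue from java.util.concurrent stands in for a memory channel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical model of one agent: source -> channel (queue) -> sink.
final class AgentPipelineSketch {

    // The channel: a bounded, in-memory queue between source and sink.
    private final BlockingQueue<String> channel;

    AgentPipelineSketch(int capacity) {
        this.channel = new ArrayBlockingQueue<String>(capacity);
    }

    // Source side: accept an event only if the channel has room.
    boolean source(String event) {
        return channel.offer(event);
    }

    // Sink side: take up to batchSize events off the channel for delivery.
    List<String> sink(int batchSize) {
        List<String> batch = new ArrayList<String>();
        channel.drainTo(batch, batchSize);
        return batch;
    }
}
```

Because the queue lives only in memory, its contents vanish if the process dies, which is the weakness a durable channel addresses.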

Page 457: Cloudera_Developer_Training

Flume’s Design Goals: Reliability

•  Channels provide Flume’s reliability
•  Memory Channel
   – Data will be lost if power is lost
•  File Channel
   – Data stored on disk
   – Guarantees durability of data in face of a power loss
•  Data transfer between Agents and Channels is transactional
   – A failed data transfer to a downstream agent rolls back and retries
•  Can configure multiple Agents with the same task
   – e.g., two Agents doing the job of one “collector” – if one agent fails then upstream agents would fail over
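The roll-back-and-retry behavior can be illustrated with a toy transfer routine in which events leave the channel only after the downstream hop confirms delivery. All names here (TransactionalTransferSketch, Downstream, transfer) are hypothetical, and the sketch makes no claim about how Flume implements its transactions internally.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: events are removed from the channel only on success.
final class TransactionalTransferSketch {

    // Stand-in for the next agent; returns true when the batch was accepted.
    interface Downstream {
        boolean deliver(List<String> batch);
    }

    // Tries to move up to batchSize events; returns true if they were committed.
    static boolean transfer(Deque<String> channel, Downstream next, int batchSize) {
        // Read the batch without removing anything yet.
        List<String> batch = new ArrayList<String>();
        for (String event : channel) {
            if (batch.size() == batchSize) {
                break;
            }
            batch.add(event);
        }
        if (batch.isEmpty() || !next.deliver(batch)) {
            return false;   // "roll back": the channel is unchanged, so a retry is safe
        }
        for (int i = 0; i < batch.size(); i++) {
            channel.removeFirst();   // "commit": drop the delivered events
        }
        return true;
    }
}
```

Calling transfer again after a failure resends the same events, so delivery in this model is at-least-once rather than exactly-once.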

Page 458: Cloudera_Developer_Training

Flume’s Design Goals: Scalability

•  Scalability
   – The ability to increase system performance linearly – or better – by adding more resources to the system
   – Flume scales horizontally
      – As load increases, more machines can be added to the configuration

Page 459: Cloudera_Developer_Training

Flume’s Design Goals: Manageability

•  Manageability
   – The ability to control data flows, monitor nodes, modify the settings, and control outputs of a large system
•  Configuration is loaded from a properties file
   – Properties file can be reloaded on the fly
   – File must be pushed out to each node (using scp, Puppet, Chef, etc.)

Page 460: Cloudera_Developer_Training

Flume’s Design Goals: Extensibility

•  Extensibility
   – The ability to add new functionality to a system
•  Flume can be extended by adding Sources and Sinks to existing storage layers or data platforms
   – General Sources include data from files, syslog, and standard output from a process
   – General Sinks include files on the local filesystem or HDFS
   – Developers can write their own Sources or Sinks

Page 461: Cloudera_Developer_Training

Flume: Usage Patterns

•  Flume is typically used to ingest log files from real-time systems such as Web servers, firewalls and mailservers into HDFS
•  Currently in use in many large organizations, ingesting millions of events per day
   – At least one organization is using Flume to ingest over 200 million events per day
•  Flume is typically installed and configured by a system administrator
   – Check the Flume documentation if you intend to install it yourself


Page 463: Cloudera_Developer_Training

FuseDFS and HttpFS: Motivation

•  Many applications generate data which will ultimately reside in HDFS
•  If Flume is not an appropriate solution for ingesting the data, some other method must be used
•  Typically this is done as a batch process
•  Problem: many legacy systems do not ‘understand’ HDFS
   – Difficult to write to HDFS if the application is not written in Java
   – May not have Hadoop installed on the system generating the data
•  We need some way for these systems to access HDFS

Page 464: Cloudera_Developer_Training

FuseDFS

•  FuseDFS is based on FUSE (Filesystem in USEr space)
•  Allows you to mount HDFS as a ‘regular’ filesystem
•  Note: HDFS limitations still exist!
   – Not intended as a general-purpose filesystem
   – Files are write-once
   – Not optimized for low latency
•  FuseDFS is included as part of the Hadoop distribution

Page 465: Cloudera_Developer_Training

HttpFS

•  Provides an HTTP/HTTPS REST interface to HDFS
   – Supports both reads and writes from/to HDFS
   – Can be accessed from within a program
   – Can be used via command-line tools such as curl or wget
•  Client accesses the HttpFS server
   – HttpFS server then accesses HDFS
•  Example: curl http://httpfs-host:14000/webhdfs/v1/user/foo/README.txt returns the contents of the HDFS /user/foo/README.txt file

REST: REpresentational State Transfer
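Access "from within a program" might look like the following sketch, which uses only the JDK's HttpURLConnection. The class and method names are hypothetical. The op=OPEN and user.name query parameters are assumptions drawn from the WebHDFS REST conventions and do not appear in the slide's example; check which parameters your HttpFS version and security settings require.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: read an HDFS file through an HttpFS server over HTTP.
final class HttpFsReadSketch {

    // Builds a WebHDFS-style URL; op=OPEN and user.name are assumed parameters.
    static String openUrl(String host, int port, String hdfsPath, String user) {
        return "http://" + host + ":" + port + "/webhdfs/v1" + hdfsPath
                + "?op=OPEN&user.name=" + user;
    }

    // Issues a GET request and returns the response body as text.
    static String read(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Requires a reachable HttpFS server; host, port and path follow the slide.
        System.out.println(read(openUrl("httpfs-host", 14000, "/user/foo/README.txt", "foo")));
    }
}
```

Nothing Hadoop-specific is needed on the client, which is the point for legacy systems: any language with an HTTP library can do the same.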


Page 467: Cloudera_Developer_Training

Conclusion

In this chapter you have learned

•  How Hadoop can be integrated into an existing enterprise
•  How to load data from an existing RDBMS into HDFS by using Sqoop
•  How to manage real-time data such as log files using Flume
•  How to access HDFS from legacy systems with FuseDFS and HttpFS

Page 468: Cloudera_Developer_Training

Machine Learning and Mahout
Chapter 12


Page 470: Cloudera_Developer_Training

Machine Learning and Mahout

In this chapter you will learn

•  Machine Learning basics
•  Mahout basics

Page 471: Cloudera_Developer_Training

Chapter Topics

The Hadoop Ecosystem
Machine Learning and Mahout

•  Introduction to Machine Learning
•  Using Mahout
•  Hands-On Exercise: Using a Mahout Recommender
•  Conclusion

Page 472: Cloudera_Developer_Training

Machine Learning: Introduction

•  Machine Learning is a complex discipline
•  Much research is ongoing
•  Here we merely give a very high-level overview of some aspects of ML

Page 473: Cloudera_Developer_Training

What Is Machine Learning Not?

•  Most programs tell computers exactly what to do
   – Database transactions and queries
   – Controllers
      – Phone systems, manufacturing processes, transport, weaponry, etc.
   – Media delivery
   – Simple search
   – Social systems
      – Chat, blogs, e-mail etc.

Page 474: Cloudera_Developer_Training

What Is Machine Learning?

•  An alternative technique is to have computers learn what to do
•  Machine Learning refers to a few classes of program that leverage collected data to drive future program behavior
•  This represents another major opportunity to gain value from data

Page 475: Cloudera_Developer_Training

Why Use Hadoop for Machine Learning?

•  Machine Learning systems are sensitive to the skill you bring to them
•  However, practitioners often agree [Banko and Brill, 2001]:

   “It’s not who has the best algorithms that wins. It’s who has the most data.”

   or…

   “There’s no data like more data.”

Page 476: Cloudera_Developer_Training

The ‘Three Cs’

• Machine Learning is an active area of research and new applications
• There are three well-established categories of techniques for exploiting data:
  – Collaborative filtering (recommendations)
  – Clustering
  – Classification

Page 477: Cloudera_Developer_Training

Collaborative Filtering

• Collaborative Filtering is a technique for recommendations
• Example application: given people who each like certain books, learn to suggest what someone may like based on what they already like
• Very useful in helping users navigate data by expanding to topics that have affinity with their established interests
• Collaborative Filtering algorithms are agnostic to the different types of data items involved
  – So they are equally useful in many different domains
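The book example can be sketched in a few lines of Python. This is only an illustration of the idea, not Mahout's implementation: it assumes each user is represented by the set of items they like, measures user similarity with the Tanimoto coefficient, and scores unseen items by the similarity of the users who like them. The data and function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two sets of liked items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(likes, user, top_n=3):
    """Score items the user has not seen by the similarity of users who like them."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        sim = tanimoto(likes[user], items)
        for item in items - likes[user]:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


# Hypothetical toy data: user -> set of books they like.
likes = {
    "ann": {"dune", "emma", "ulysses"},
    "bob": {"dune", "emma", "hamlet"},
    "cat": {"ulysses", "walden"},
}
print(recommend(likes, "ann"))  # ['hamlet', 'walden']
```

Nothing in the code depends on the items being books, which is the sense in which such algorithms are agnostic to the type of data.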

Page 478: Cloudera_Developer_Training

Clustering

• Clustering algorithms discover structure in collections of data
  – Where no formal structure previously existed
• They discover what clusters, or ‘groupings’, naturally occur in data
• Examples:
  – Finding related news articles
  – Computer vision (groups of pixels that cohere into objects)
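As a rough illustration of how a clustering algorithm finds groupings, here is a minimal single-machine k-means sketch in Python. It assumes 2-D points and caller-supplied starting centroids; it is not how Mahout distributes the work across a cluster, and the sample points are made up.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                xs = [x for x, _ in cluster]
                ys = [y for _, y in cluster]
                new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        centroids = new_centroids
    # Returns the final centroids and the clusters from the last assignment step.
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(clusters)
```

No labels are supplied; the two groups emerge from the distances alone.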

Page 479: Cloudera_Developer_Training

Classification

• The previous two techniques are considered ‘unsupervised’ learning
  – The algorithm discovers groups or recommendations itself
• Classification is a form of ‘supervised’ learning
• A classification system takes a set of data records with known labels
  – Learns how to label new records based on that information
• Examples:
  – Given a set of e-mails identified as spam/not spam, label new e-mails as spam/not spam
  – Given tumors identified as benign or malignant, classify new tumors
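The spam example can be sketched with a tiny naïve Bayes classifier in Python. This is a simplified illustration under stated assumptions (whitespace tokenization, add-one smoothing, a made-up four-message training set); it is not Mahout's implementation.

```python
import math
from collections import Counter


def train(labelled_docs):
    """Count words per label from (text, label) training records."""
    word_counts = {}
    label_counts = Counter()
    for text, label in labelled_docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled records: the 'known labels' the slide refers to.
training = [
    ("win cash now", "spam"),
    ("cheap cash prize", "spam"),
    ("meeting agenda attached", "not spam"),
    ("lunch meeting tomorrow", "not spam"),
]
model = train(training)
print(classify("cash prize now", *model))
print(classify("agenda for lunch meeting", *model))
```

The labelled training set is what makes this supervised: the classifier only learns the two categories it was shown.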


Page 481: Cloudera_Developer_Training

Mahout: A Machine Learning Library

• Mahout is a Machine Learning library written in Java
  – Included in CDH3 onwards
  – Contains algorithms for each of the categories listed
• Algorithms included in Mahout:

  Recommendation                      Clustering                         Classification
  Pearson correlation                 k-means clustering                 Stochastic gradient descent (SGD)
  Log likelihood                      Canopy clustering                  Support vector machine (SVM)
  Spearman correlation                Fuzzy k-means                      Naïve Bayes
  Tanimoto coefficient                Latent Dirichlet allocation (LDA)  Complementary naïve Bayes
  Singular value decomposition (SVD)                                     Random forests
  Linear interpolation
  Cluster-based recommenders

Page 482: Cloudera_Developer_Training

Mahout: A Machine Learning Library (cont’d)

• Some Mahout algorithms can be used by stand-alone programs
• Many are optimized to work with Hadoop
• Mahout also comes with some pre-built scripts to analyze data
  – We will use one of these in the Hands-On Exercise
• The libraries are ‘data agnostic’
  – Example: the Recommender engines don’t care whether you are getting recommendations for books, music, movies, brands of toothpaste…


Page 484: Cloudera_Developer_Training

Hands-On Exercise: Using a Mahout Recommender

• In this Hands-On Exercise, you will use a Mahout recommender to generate a set of movie recommendations
• Please refer to the Hands-On Exercise Manual


Page 486: Cloudera_Developer_Training

Conclusion

In this chapter you have learned

• Machine Learning basics
• Mahout basics

Page 487: Cloudera_Developer_Training

An Introduction to Hive and Pig
Chapter 13


Page 489: Cloudera_Developer_Training

An Introduction to Hive and Pig

In this chapter you will learn

• What features Hive provides
• What features Pig provides
• How to choose between Pig and Hive

Page 490: Cloudera_Developer_Training

Chapter Topics

The Hadoop Ecosystem: An Introduction to Hive and Pig

• The motivation for Hive and Pig
• Hive basics
• Hands-On Exercise: Manipulating Data with Hive
• Pig basics
• Hands-On Exercise: Using Pig to Retrieve Movie Names from our Recommender
• Choosing between Hive and Pig
• Conclusion

Page 491: Cloudera_Developer_Training

Hive and Pig: Motivation

• MapReduce code is typically written in Java
  – Although it can be written in other languages using Hadoop Streaming
• Requires:
  – A programmer
  – Who is a good Java programmer
  – Who understands how to think in terms of MapReduce
  – Who understands the problem they’re trying to solve
  – Who has enough time to write and test the code
  – Who will be available to maintain and update the code in the future as requirements change

Page 492: Cloudera_Developer_Training

Hive and Pig: Motivation (cont’d)

• Many organizations have only a few developers who can write good MapReduce code
• Meanwhile, many other people want to analyze data
  – Business analysts
  – Data scientists
  – Statisticians
  – Data analysts
• What’s needed is a higher-level abstraction on top of MapReduce
  – Providing the ability to query the data without needing to know MapReduce intimately
  – Hive and Pig address these needs


Page 494: Cloudera_Developer_Training

Hive: Introduction

• Hive was originally developed at Facebook
  – Provides a very SQL-like language
  – Can be used by people who know SQL
  – Under the covers, generates MapReduce jobs that run on the Hadoop cluster
  – Enabling Hive requires almost no extra work by the system administrator
• Hive is now a top-level Apache Software Foundation project

Page 495: Cloudera_Developer_Training

The Hive Data Model

• Hive ‘layers’ table definitions on top of data in HDFS
• Tables
  – Typed columns (int, float, string, boolean and so on)
  – Also array, struct, map (for JSON-like data)
• Partitions
  – e.g., to range-partition tables by date
• Buckets
  – Hash partitions within ranges (useful for sampling, join optimization)

Page 496: Cloudera_Developer_Training

Hive Data Types

• Primitive types:
  – TINYINT
  – SMALLINT
  – INT
  – BIGINT
  – FLOAT
  – BOOLEAN
  – DOUBLE
  – STRING
  – BINARY (available starting in CDH4)
  – TIMESTAMP (available starting in CDH4)
• Type constructors:
  – ARRAY < primitive-type >
  – MAP < primitive-type, data-type >
  – STRUCT < col-name : data-type, ... >

Page 497: Cloudera_Developer_Training

The Hive Metastore

• Hive’s Metastore is a database containing table definitions and other metadata
  – By default, stored locally on the client machine in a Derby database
  – If multiple people will be using Hive, the system administrator should create a shared Metastore
    – Usually in MySQL or some other relational database server

Page 498: Cloudera_Developer_Training

Hive Data: Physical Layout

• Hive tables are stored in Hive’s ‘warehouse’ directory in HDFS
  – By default, /user/hive/warehouse
• Tables are stored in subdirectories of the warehouse directory
  – Partitions form subdirectories of tables
• Possible to create external tables if the data is already in HDFS and should not be moved from its current location
• Actual data is stored in flat files
  – Control character-delimited text, or SequenceFiles
  – Can be in arbitrary format with the use of a custom Serializer/Deserializer (‘SerDe’)
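To make the layout concrete, the following Python sketch builds the directory a partition would occupy and picks a bucket for a key. The table name logs and partition column dt are hypothetical. The bucket function is a simplification that assumes an integer key hashed by plain modulo; the hash Hive actually applies depends on the column type.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # default location named on the slide


def partition_dir(table, partitions):
    """Directory for one partition: <warehouse>/<table>/<col>=<value>/..."""
    parts = ["{}={}".format(col, value) for col, value in partitions]
    return posixpath.join(WAREHOUSE, table, *parts)


def bucket_for(key, num_buckets):
    """Illustrative bucket choice: an integer key modulo the bucket count."""
    return key % num_buckets


print(partition_dir("logs", [("dt", "2012-12-01")]))
print(bucket_for(42, 8))
```

Because each partition is just a subdirectory, a query restricted to one date only needs to read the files under that directory.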

Page 499: Cloudera_Developer_Training

Starting The Hive Shell

• To launch the Hive shell, start a terminal and run

    $ hive

• Results in the Hive prompt:

    hive>

Page 500: Cloudera_Developer_Training

Hive Basics: Creating Tables

    hive> SHOW TABLES;

    hive> CREATE TABLE shakespeare
            (freq INT, word STRING)
            ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '\t'
            STORED AS TEXTFILE;

    hive> DESCRIBE shakespeare;

Page 501: Cloudera_Developer_Training

Loading Data Into Hive

• Data is loaded into Hive with the LOAD DATA INPATH statement
  – Assumes that the data is already in HDFS
• If the data is on the local filesystem, use LOAD DATA LOCAL INPATH
  – Automatically loads it into HDFS in the correct directory

    LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;
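LOAD DATA does not convert the file, so the data must already match the table's row format. As a hedged illustration, this Python sketch produces lines in the shape the shakespeare table expects (freq, a tab, then word); the sample text and function name are made up.

```python
from collections import Counter


def word_freq_lines(text):
    """Yield 'freq<TAB>word' lines, matching the (freq INT, word STRING) columns."""
    counts = Counter(text.lower().split())
    for word, freq in sorted(counts.items()):
        yield "{}\t{}".format(freq, word)


sample = "to be or not to be"
for line in word_freq_lines(sample):
    print(line)
```

The tab separator corresponds to FIELDS TERMINATED BY '\t' in the CREATE TABLE statement, and the column order matches the table definition.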

Page 502: Cloudera_Developer_Training

Using Sqoop to Import Data into Hive Tables

• The Sqoop option --hive-import will automatically create a Hive table from the imported data
  – Imports the data
  – Generates the Hive CREATE TABLE statement based on the table definition in the RDBMS
  – Runs the statement
  – Note: This will move the imported table into Hive’s warehouse directory

Page 503: Cloudera_Developer_Training

Basic SELECT Queries

• Hive supports most familiar SELECT syntax

    hive> SELECT * FROM shakespeare LIMIT 10;

    hive> SELECT * FROM shakespeare
          WHERE freq > 100 ORDER BY freq ASC
          LIMIT 10;

Page 504: Cloudera_Developer_Training

Joining Tables

• Joining datasets is a complex operation in standard Java MapReduce
  – We saw this earlier in the course
• In Hive, it’s easy!

    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN kjv k ON (s.word = k.word)
    WHERE s.freq >= 5;

Page 505: Cloudera_Developer_Training

Storing Output Results

• The SELECT statement on the previous slide would write the data to the console
• To store the results in HDFS, create a new table then write, for example:

    INSERT OVERWRITE TABLE newTable
      SELECT s.word, s.freq, k.freq
      FROM shakespeare s JOIN kjv k ON (s.word = k.word)
      WHERE s.freq >= 5;

  – Results are stored in the table
  – Results are just files within the newTable directory
    – Data can be used in subsequent queries, or in MapReduce jobs

Page 506: Cloudera_Developer_Training

Using User-Defined Code

• Hive supports manipulation of data via User-Defined Functions (UDFs)
  – Written in Java
• Also supports user-created scripts written in any language via the TRANSFORM operator
  – Essentially leverages Hadoop Streaming
  – Example:

    INSERT OVERWRITE TABLE u_data_new
      SELECT TRANSFORM (userid, movieid, rating, unixtime)
      USING 'python weekday_mapper.py'
      AS (userid, movieid, rating, weekday)
      FROM u_data;
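The slide does not show weekday_mapper.py itself. A script of that kind might look like the Python sketch below, which assumes rows arrive as tab-separated text on standard input and are returned the same way, as with Hadoop Streaming. The function names are hypothetical, and interpreting the timestamp in UTC is an assumption made here so the result does not depend on the machine's time zone.

```python
import sys
from datetime import datetime, timezone


def to_weekday_row(line):
    """Map 'userid<TAB>movieid<TAB>rating<TAB>unixtime' to a row ending in the ISO weekday."""
    userid, movieid, rating, unixtime = line.rstrip("\n").split("\t")
    # UTC is assumed so the weekday is the same on every machine.
    weekday = datetime.fromtimestamp(float(unixtime), tz=timezone.utc).isoweekday()
    return "\t".join([userid, movieid, rating, str(weekday)])


def run(lines, out=sys.stdout):
    """Transform every non-empty input row and write it to the output stream."""
    for line in lines:
        if line.strip():
            out.write(to_weekday_row(line) + "\n")


# A real mapper script would call run(sys.stdin); a fixed sample row is used here.
run(["196\t242\t3\t881250949\n"])
```

The four input fields and four output fields line up with the TRANSFORM (...) and AS (...) column lists in the query.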

Page 507: Cloudera_Developer_Training

Hive Limitations

• Not all ‘standard’ SQL is supported
  – Subqueries are only supported in the FROM clause
    – No correlated subqueries
• No support for UPDATE or DELETE
• No support for INSERTing single rows

Page 508: Cloudera_Developer_Training

Hive: Where To Learn More

• Main Web site is at http://hive.apache.org/
• Cloudera training course: Cloudera Training for Apache Hive And Pig


Page 510: Cloudera_Developer_Training

Hands-On Exercise: Manipulating Data With Hive

• In this Hands-On Exercise, you will manipulate a dataset using Hive
• Please refer to the Hands-On Exercise Manual


Page 512: Cloudera_Developer_Training

Pig: Introduction

• Pig was originally created at Yahoo! to answer a similar need to Hive
  – Many developers did not have the Java and/or MapReduce knowledge required to write standard MapReduce programs
  – But still needed to query data
• Pig is a high-level platform for creating MapReduce programs
  – Language is called PigLatin
  – Relatively simple syntax
  – Under the covers, PigLatin scripts are turned into MapReduce jobs and executed on the cluster
• Pig is now a top-level Apache project

Page 513: Cloudera_Developer_Training

Pig Installation

• Installation of Pig requires no modification to the cluster
• The Pig interpreter runs on the client machine
  – Turns PigLatin into standard Java MapReduce jobs, which are then submitted to the JobTracker
• There is (currently) no shared metadata, so no need for a shared metastore of any kind

Page 514: Cloudera_Developer_Training

Pig Concepts

• In Pig, a single element of data is an atom
• A collection of atoms – such as a row, or a partial row – is a tuple
• Tuples are collected together into bags
• Typically, a PigLatin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has

Page 515: Cloudera_Developer_Training

Pig Features

• Pig supports many features which allow developers to perform sophisticated data analysis without having to write Java MapReduce code
  – Joining datasets
  – Grouping data
  – Referring to elements by position rather than name
    – Useful for datasets with many elements
  – Loading non-delimited data using a custom SerDe
  – Creation of user-defined functions, written in Java
  – And more

Page 516: Cloudera_Developer_Training

Using the Grunt Shell to Run PigLatin

• Starting Grunt

    $ pig
    grunt>

• Useful commands:

    $ pig -help (or -h)
    $ pig -version (-i)
    $ pig -execute (-e)
    $ pig script.pig

Page 517: Cloudera_Developer_Training

A Sample Pig Script

    emps = LOAD 'people' AS (id, name, salary);
    rich = FILTER emps BY salary > 100000;
    srtd = ORDER rich BY salary DESC;
    STORE srtd INTO 'rich_people';

• Here, we load a directory of data into a bag called emps
• Then we create a new bag called rich which contains just those records where the salary portion is greater than 100000
• Next, we sort rich by salary in descending order to create a bag called srtd
• Finally, we write the contents of the srtd bag to a new directory in HDFS
  – By default, the data will be written in tab-separated format
• Alternatively, to write the contents of a bag to the screen, say

    DUMP srtd;
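For readers more comfortable with a general-purpose language, the same three steps can be mirrored on an in-memory list in Python. This only illustrates what each statement means; Pig itself runs them as MapReduce jobs over files in HDFS. The sample rows are invented.

```python
# Hypothetical rows standing in for the 'people' data: (id, name, salary).
emps = [(1, "ann", 250000), (2, "bob", 90000), (3, "cat", 120000)]

# FILTER emps BY salary > 100000
rich = [row for row in emps if row[2] > 100000]

# ORDER rich BY salary DESC
srtd = sorted(rich, key=lambda row: row[2], reverse=True)

# STORE writes tab-separated text by default; build equivalent lines here.
lines = ["\t".join(str(field) for field in row) for row in srtd]
print("\n".join(lines))
```

Each statement produces a new collection from an existing one, which is the pattern of creating new bags from those the script already has.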

Page 518: Cloudera_Developer_Training

More PigLatin

• To view the structure of a bag:

    DESCRIBE bagname;

• Joining two datasets:

    data1 = LOAD 'data1' AS (col1, col2, col3, col4);
    data2 = LOAD 'data2' AS (colA, colB, colC);
    jnd = JOIN data1 BY col3, data2 BY colA;
    STORE jnd INTO 'outfile';
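What the JOIN produces can be sketched in Python as an equi-join that concatenates matching tuples. This illustrates the result only, not how Pig executes a join; the rows are made up, and the ordering of Pig's output is not guaranteed to match.

```python
from collections import defaultdict


def inner_join(left, left_idx, right, right_idx):
    """Equi-join two lists of tuples, concatenating each pair of matching rows."""
    index = defaultdict(list)
    for row in right:
        index[row[right_idx]].append(row)
    return [l + r for l in left for r in index.get(l[left_idx], [])]


# Hypothetical rows: data1 has four fields, data2 has three.
data1 = [("a", 1, "k1", "x"), ("b", 2, "k2", "y"), ("c", 3, "k9", "z")]
data2 = [("k1", "red", 10), ("k2", "blue", 20), ("k2", "green", 30)]

# JOIN data1 BY col3, data2 BY colA  (col3 is index 2, colA is index 0)
jnd = inner_join(data1, 2, data2, 0)
for row in jnd:
    print(row)
```

Every joined tuple carries all four fields of data1 followed by all three fields of data2. Rows with no match on the key, such as the one with k9, are dropped.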

Page 519: Cloudera_Developer_Training

More PigLatin: Grouping

• Grouping:

    grpd = GROUP bag1 BY elementX;

• Creates a new bag
  – Each tuple in grpd has an element called group, and an element called bag1
  – The group element has a unique value for elementX from bag1
  – The bag1 element is itself a bag, containing all the tuples from bag1 with that value for elementX
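The nested result of GROUP is easier to see with data. This Python sketch mimics the shape of grpd for a small hypothetical bag, using the second field in the role of elementX; it is an illustration of the structure, not of Pig's execution.

```python
from collections import defaultdict


def group_by(bag, index):
    """Return (group, bag_of_tuples) pairs, one per distinct value at `index`."""
    groups = defaultdict(list)
    for tup in bag:
        groups[tup[index]].append(tup)
    return sorted(groups.items())


# Hypothetical bag1 whose second field plays the role of elementX.
bag1 = [("ann", "sales"), ("bob", "eng"), ("cat", "sales")]
grpd = group_by(bag1, 1)
for group, members in grpd:
    print(group, members)
```

Each pair holds the group value and an inner collection of the complete original tuples, which is why aggregate functions are applied to the inner bag.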

Page 520: Cloudera_Developer_Training

More PigLatin: FOREACH

• The FOREACH...GENERATE statement iterates over members of a bag
• Example:

    justnames = FOREACH emps GENERATE name;

• Can combine with COUNT:

    summedUp = FOREACH grpd GENERATE group, COUNT(bag1) AS elementCount;

Page 521: Cloudera_Developer_Training

Pig: Where To Learn More

• Main Web site is at http://pig.apache.org
• To locate the Pig documentation:
  – For CDH3, select the Release 0.8.1 link under documentation on the left side of the page
  – For CDH4, select the Release 0.9.2 link under documentation on the left side of the page
• Cloudera training course: Cloudera Training for Apache Hive And Pig


Page 523: Cloudera_Developer_Training

Hands-On Exercise: Using Pig to Retrieve Movie Names From Our Recommender

• In this Hands-On Exercise, you will use Pig to take the data you generated with Mahout earlier in the course and produce the actual movie names that have been recommended
• Please refer to the Hands-On Exercise Manual


Page 525: Cloudera_Developer_Training

Choosing Between Pig and Hive

• Typically, organizations wanting an abstraction on top of standard MapReduce will choose to use either Hive or Pig
• Which one is chosen depends on the skillset of the target users
  – Those with an SQL background will naturally gravitate towards Hive
  – Those who do not know SQL will often choose Pig
• Each has strengths and weaknesses; it is worth spending some time investigating each so you can make an informed decision
• Some organizations are now choosing to use both
  – Pig deals better with less-structured data, so Pig is used to manipulate the data into a more structured form, then Hive is used to query that structured data


Page 527: Cloudera_Developer_Training

Conclusion

In this chapter you have learned

• What features Hive provides
• What features Pig provides
• How to choose between Pig and Hive

Page 528: Cloudera_Developer_Training

An Introduction to Oozie
Chapter 14


Page 530: Cloudera_Developer_Training

An Introduction to Oozie

In this chapter you will learn

• What Oozie is
• How to create Oozie workflows

Page 531: Cloudera_Developer_Training

Chapter Topics

The Hadoop Ecosystem: An Introduction to Oozie

• Introduction to Oozie
• Creating Oozie workflows
• Hands-On Exercise: Running an Oozie Workflow
• Conclusion

Page 532: Cloudera_Developer_Training

The Motivation for Oozie

• Many problems cannot be solved with a single MapReduce job
• Instead, a workflow of jobs must be created
• Simple workflow:
  – Run Job A
  – Use output of Job A as input to Job B
  – Use output of Job B as input to Job C
  – Output of Job C is the final required output
• Easy if the workflow is linear like this
  – Can be created as standard Driver code

[Diagram: Start Data → Job A → Job B → Job C → Final Result]

Page 533: Cloudera_Developer_Training

The Motivation for Oozie (cont’d)

• If the workflow is more complex, Driver code becomes much more difficult to maintain
• Example: running multiple jobs in parallel, using the output from all of those jobs as the input to the next job
• Example: including Hive or Pig jobs as part of the workflow

Page 534: Cloudera_Developer_Training

What is Oozie?

• Oozie is a ‘workflow engine’
• Runs on a server
  – Typically outside the cluster
• Runs workflows of Hadoop jobs
  – Including Pig, Hive, Sqoop jobs
  – Submits those jobs to the cluster based on a workflow definition
• Workflow definitions are submitted via HTTP
• Jobs can be run at specific times
  – One-off or recurring jobs
• Jobs can be run when data is present in a directory


Page 536: Cloudera_Developer_Training

Oozie Workflow Basics

• Oozie workflows are written in XML
• Workflow is a collection of actions
  – MapReduce jobs, Pig jobs, Hive jobs etc.
• A workflow consists of control flow nodes and action nodes
• Control flow nodes define the beginning and end of a workflow
  – They provide methods to determine the workflow execution path
    – Example: Run multiple jobs simultaneously
• Action nodes trigger the execution of a processing task, such as
  – A MapReduce job
  – A Pig job
  – A Sqoop data import job

Page 537: Cloudera_Developer_Training

Simple Oozie Example

• Simple example workflow for WordCount:

Page 538: Cloudera_Developer_Training

Simple Oozie Example (cont’d)

    <workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
      <start to='wordcount'/>
      <action name='wordcount'>
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.mapper.class</name>
              <value>org.myorg.WordCount.Map</value>
            </property>
            <property>
              <name>mapred.reducer.class</name>
              <value>org.myorg.WordCount.Reduce</value>
            </property>
            <property>
              <name>mapred.input.dir</name>
              <value>${inputDir}</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>${outputDir}</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
      </action>
      <kill name='kill'>
        <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
      </kill>
      <end name='end'/>
    </workflow-app>
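One way to check how the nodes in such a definition connect is to parse it and list the transitions. The Python sketch below uses only the standard library and a trimmed, well-formed copy of the WordCount workflow with the configuration block omitted and the kill element closed as </kill>. It is a reading aid rather than part of Oozie, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{uri:oozie:workflow:0.1}"  # namespace declared on workflow-app

# Trimmed copy of the WordCount workflow (configuration omitted for brevity).
WORKFLOW = """
<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'>
    <message>Something went wrong: ${wf:errorCode('wordcount')}</message>
  </kill>
  <end name='end'/>
</workflow-app>
"""


def transitions(xml_text):
    """Return {node: {outcome: target}} for the start node and each action node."""
    root = ET.fromstring(xml_text)
    result = {"start": {"to": root.find(NS + "start").get("to")}}
    for action in root.findall(NS + "action"):
        result[action.get("name")] = {
            "ok": action.find(NS + "ok").get("to"),
            "error": action.find(NS + "error").get("to"),
        }
    return result


print(transitions(WORKFLOW))
```

The result shows the path at a glance: start leads to wordcount, which leads to end on success and to kill on error.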

Page 539: Cloudera_Developer_Training

Simple Oozie Example (cont’d)

    <workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
      ...
    </workflow-app>

A workflow is wrapped in the workflow entity (the workflow-app element).

Page 540: Cloudera_Developer_Training

Simple Oozie Example (cont’d)

    <start to='wordcount'/>

The start node is the control node which tells Oozie which workflow node should be run first. There must be one start node in an Oozie workflow. In our example, we are telling Oozie to start by transitioning to the wordcount workflow node.

Page 541: Cloudera_Developer_Training

Simple Oozie Example (cont’d)

The wordcount action node defines a map-reduce action: a standard Java MapReduce job.

Page 542: Cloudera_Developer_Training

Simple Oozie Example (cont’d)

Within the action, we define the job’s properties.

Page 543: Cloudera_Developer_Training

Simple Oozie Example (cont’d)

We specify what to do if the action ends successfully, and what to do if it fails. In this example, if the job is successful we go to the end node. If it fails we go to the kill node.

Page 544: Cloudera_Developer_Training

Simple Oozie Example (cont’d)

Every workflow must have an end node. This indicates that the workflow has completed successfully.

Page 545: Cloudera_Developer_Training

Simple Oozie Example (cont’d)

If the workflow reaches a kill node, it will kill all running actions and then terminate with an error. A workflow can have zero or more kill nodes.
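The transitions described in this example (start, then the action, then end on success or kill on failure) can be traced with a small Python sketch. It is only a toy model of the control flow, not how Oozie itself executes a workflow: the function name trace and the action_succeeds callback are hypothetical, and the workflow is trimmed to an empty <map-reduce/> placeholder for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree puts in front of every tag name.
NS = "{uri:oozie:workflow:0.1}"

# Trimmed copy of the example workflow; <map-reduce/> is only a placeholder here.
WORKFLOW_XML = """<workflow-app name='wordcount-wf' xmlns="uri:oozie:workflow:0.1">
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce/>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'>
    <message>Something went wrong</message>
  </kill>
  <end name='end'/>
</workflow-app>"""


def trace(xml_text, action_succeeds):
    """Follow transitions from the start node; return (visited names, final node type)."""
    root = ET.fromstring(xml_text)
    # Index every named node (action, kill, end) by its name attribute.
    nodes = {child.get("name"): child for child in root if child.get("name")}
    current = root.find(NS + "start").get("to")
    path = []
    while True:
        node = nodes[current]
        path.append(current)
        tag = node.tag[len(NS):]
        if tag in ("end", "kill"):
            return path, tag
        # An action node transitions via <ok> on success and <error> on failure.
        outcome = "ok" if action_succeeds(current) else "error"
        current = node.find(NS + outcome).get("to")


print(trace(WORKFLOW_XML, lambda name: True))
print(trace(WORKFLOW_XML, lambda name: False))
```

With a succeeding action the trace finishes at the end node; with a failing action it finishes at the kill node.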

Page 546: Cloudera_Developer_Training

Other Oozie Control Nodes

- A decision control node allows Oozie to determine the workflow execution path based on some criteria
    - Similar to a switch/case statement
- fork and join control nodes split one execution path into multiple execution paths which run concurrently
    - fork splits the execution path
    - join waits for all concurrent execution paths to complete before proceeding
    - fork and join are used in pairs
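As a rough analogy in ordinary Python (this is not Oozie syntax), the behavior of these control nodes might be sketched as follows. The function names, the node names small-job and large-job, and the record-count criterion are all hypothetical; the sketch also assumes at least one path is supplied.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fork_join(paths):
    """Start every path concurrently (fork), then wait for all of them (join)."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        # fork: each named path begins running in its own thread
        futures = {name: pool.submit(fn) for name, fn in paths.items()}
        # join: result() blocks until that path has completed
        return {name: future.result() for name, future in futures.items()}


def decide(record_count):
    """Decision-node analogue: choose the next node name from some criterion."""
    if record_count == 0:
        return "end"
    elif record_count < 1000:
        return "small-job"
    return "large-job"


print(run_fork_join({"left": lambda: 1, "right": lambda: 2}))
print(decide(5))
```

The join step returns only after every forked path has finished, which is why fork and join come in pairs.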

Page 547: Cloudera_Developer_Training

Oozie Workflow Action Nodes

Node Name     Description
map-reduce    Runs either a Java MapReduce or Streaming job
fs            Creates directories, moves or deletes files or directories
java          Runs the main() method in the specified Java class as a single-Map, Map-only job on the cluster
pig           Runs a Pig job
hive          Runs a Hive job
sqoop         Runs a Sqoop job
email         Sends an e-mail message

Page 548: Cloudera_Developer_Training

Submitting an Oozie Workflow

- To submit an Oozie workflow using the command-line tool:

$ oozie job -oozie http://<oozie_server>/oozie -config config_file -run

- Oozie can also be called from within a Java program
    - Via the Oozie client API
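The file passed with -config supplies values for the variables the workflow refers to, such as ${jobTracker}, ${nameNode}, ${inputDir} and ${outputDir}, together with oozie.wf.application.path, which points at the workflow's directory in HDFS. A minimal Python sketch that renders such a file is shown below; every host name and path in it is a hypothetical placeholder, not a value taken from the course environment.

```python
def render_job_properties(params):
    """Render a dict as Java-properties text, one key=value pair per line."""
    return "".join(f"{key}={value}\n" for key, value in params.items())


# All hosts and paths below are hypothetical examples.
params = {
    "nameNode": "hdfs://namenode.example.com:8020",
    "jobTracker": "jobtracker.example.com:8021",
    "inputDir": "/user/training/input",
    "outputDir": "/user/training/output",
    # Directory in HDFS that holds workflow.xml for this application.
    "oozie.wf.application.path": "${nameNode}/user/training/wordcount-wf",
}

properties_text = render_job_properties(params)
print(properties_text)
```

Writing properties_text to a file and passing that file name in place of config_file would supply the workflow's parameters at submission time.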

Page 549: Cloudera_Developer_Training

More on Oozie

Information                                             Resource
Oozie installation and configuration                    CDH Installation Guide, http://docs.cloudera.com
Oozie workflows and actions                             https://oozie.apache.org
The procedure of running a MapReduce job using Oozie    https://cwiki.apache.org/OOZIE/map-reduce-cookbook.html

Page 551: Cloudera_Developer_Training

Hands-On Exercise: Running an Oozie Workflow

- In this Hands-On Exercise you will run Oozie jobs
- Please refer to the Hands-On Exercise Manual

Page 553: Cloudera_Developer_Training

Conclusion

In this chapter you have learned
- What Oozie is
- How to create Oozie workflows

Page 554: Cloudera_Developer_Training

Conclusion

Chapter 15

Page 556: Cloudera_Developer_Training

Conclusion

During this course, you have learned:
- The core technologies of Hadoop
- How HDFS and MapReduce work
- How to develop MapReduce applications
- How to unit test MapReduce applications
- How to use MapReduce combiners, partitioners, and the distributed cache
- Best practices for developing and debugging MapReduce applications
- How to implement data input and output in MapReduce applications

Page 557: Cloudera_Developer_Training

Conclusion (cont’d)

- Algorithms for common MapReduce tasks
- How to join data sets in MapReduce
- How Hadoop integrates into the data center
- How to use Mahout’s Machine Learning algorithms
- How Hive and Pig can be used for rapid application development
- How to create large workflows using Oozie

Page 558: Cloudera_Developer_Training

Certification

- This course helps to prepare you for the Cloudera Certified Developer for Apache Hadoop exam
- For more information about Cloudera certification, refer to http://university.cloudera.com/certification.html
- Thank you for attending the course!
- If you have any questions or comments, please contact us via http://www.cloudera.com

Page 559: Cloudera_Developer_Training

Cloudera Enterprise

Appendix A

Page 561: Cloudera_Developer_Training

Cloudera Enterprise Core

- Includes support and management for all core components of CDH

Page 562: Cloudera_Developer_Training

Cloudera Manager

- Cloudera Manager provides enterprise-grade Hadoop deployment and management
- Built-in intelligence and best practices
- Integrates with Cloudera’s support infrastructure

Page 563: Cloudera_Developer_Training

Cloudera Manager (cont’d)

Page 564: Cloudera_Developer_Training

Activity Monitor

Page 565: Cloudera_Developer_Training

Cloudera Enterprise RTD

- Includes support and management for HBase

Page 566: Cloudera_Developer_Training

Conclusion

- Cloudera Enterprise makes it easy to run open source Hadoop in production
- Includes
    - Cloudera’s Distribution including Apache Hadoop (CDH)
    - Cloudera Manager
    - Production Support
- Cloudera Manager enables you to:
    - Simplify and accelerate Hadoop deployment
    - Reduce the costs and risks of adopting Hadoop in production
    - Reliably operate Hadoop in production with repeatable success
    - Apply SLAs to Hadoop
    - Increase control over Hadoop cluster provisioning and management

Page 567: Cloudera_Developer_Training

Graph Manipulation in MapReduce

Appendix B

Page 569: Cloudera_Developer_Training

Graph Manipulation in MapReduce

In this appendix you will learn
- What graphs are
- Best practices for representing graphs in Hadoop
- How to implement a single source shortest path algorithm in MapReduce

Page 570: Cloudera_Developer_Training

Chapter Topics

Course Conclusion and Appendices: Graph Manipulation in MapReduce

- Graphs
- Best practices for representing graphs in MapReduce
- Implementing a single-source shortest-path algorithm in MapReduce
- Conclusion

Page 571: Cloudera_Developer_Training

Introduction: What Is A Graph?

- Loosely speaking, a graph is a set of vertices, or nodes, connected by edges, or lines
- There are many different types of graphs
    - Directed
    - Undirected
    - Cyclic
    - Acyclic
    - Weighted
    - Unweighted
    - DAG (Directed, Acyclic Graph) is a very common graph type

Page 572: Cloudera_Developer_Training

What Can Graphs Represent?

- Graphs are everywhere
    - Hyperlink structure of the Web
    - Physical structure of computers on a network
    - Roadmaps
    - Airline flights
    - Social networks

Page 573: Cloudera_Developer_Training

Examples of Graph Problems

- Finding the shortest path through a graph
    - Routing Internet traffic
    - Giving driving directions
- Finding the minimum spanning tree
    - Lowest-cost way of connecting all nodes in a graph
    - Example: telecoms company laying fiber
        - Must cover all customers
        - Need to minimize fiber used
- Finding maximum flow
    - Move the greatest amount of ‘traffic’ through a network
    - Example: airline scheduling

Page 574: Cloudera_Developer_Training

Examples of Graph Problems (cont’d)

- Finding critical nodes without which a graph would break into disjoint components
    - Controlling the spread of epidemics
    - Breaking up terrorist cells

Page 575: Cloudera_Developer_Training

Graphs and MapReduce

- Graph algorithms typically involve:
    - Performing computations at each vertex
    - Traversing the graph in some manner
- Key questions:
    - How do we represent graph data in MapReduce?
    - How do we traverse a graph in MapReduce?

Page 577: Cloudera_Developer_Training

Representing Graphs

- Imagine we want to represent this simple graph:

[Figure: graph with four nodes labeled 1, 2, 3, 4]

- Two approaches:
    - Adjacency matrices
    - Adjacency lists

Page 578: Cloudera_Developer_Training

Adjacency Matrices

- Represent the graph as an n x n square matrix

[Figure: graph with four nodes labeled 1, 2, 3, 4]

      v1  v2  v3  v4
v1     0   1   0   1
v2     1   0   1   1
v3     1   0   0   0
v4     1   0   1   0

Page 579: Cloudera_Developer_Training

Adjacency Matrices: Critique

- Advantages:
    - Naturally encapsulates iteration over nodes
    - Rows correspond to outlinks and columns to inlinks
- Disadvantages:
    - Lots of zeros for sparse matrices
    - Lots of wasted space
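These trade-offs can be checked on the four-node example matrix with a short Python sketch. The function names are hypothetical; the sketch assumes that row i, column j holds 1 when there is an edge from vertex i to vertex j, which is the reading that matches the adjacency lists used in this appendix.

```python
# The 4 x 4 example matrix; MATRIX[i][j] == 1 means an edge from v(i+1) to v(j+1).
MATRIX = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]


def outlinks(matrix, v):
    """Vertices that v points to: the non-zero entries of row v (1-based IDs)."""
    return [j + 1 for j, cell in enumerate(matrix[v - 1]) if cell]


def inlinks(matrix, v):
    """Vertices that point to v: the non-zero entries of column v (1-based IDs)."""
    return [i + 1 for i, row in enumerate(matrix) if row[v - 1]]


def zero_fraction(matrix):
    """Share of cells that store no edge, i.e. the space a sparse graph wastes."""
    cells = [cell for row in matrix for cell in row]
    return cells.count(0) / len(cells)


print(outlinks(MATRIX, 1), inlinks(MATRIX, 1), zero_fraction(MATRIX))
```

Even in this tiny graph half of the cells are zeros; the fraction grows quickly as graphs become larger and sparser.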

Page 580: Cloudera_Developer_Training

Adjacency Lists

- Take an adjacency matrix… and throw away all the zeros

      v1  v2  v3  v4
v1     0   1   0   1
v2     1   0   1   1
v3     1   0   0   0
v4     1   0   1   0

v1: v2, v4
v2: v1, v3, v4
v3: v1
v4: v1, v3
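The conversion from matrix to lists is mechanical, as the following Python sketch illustrates. The function name to_adjacency_lists is hypothetical, and vertices are identified by 1-based integers rather than the v1 to v4 labels.

```python
def to_adjacency_lists(matrix):
    """Keep, for every row, only the column positions that hold a non-zero entry."""
    return {
        i + 1: [j + 1 for j, cell in enumerate(row) if cell]
        for i, row in enumerate(matrix)
    }


MATRIX = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]

print(to_adjacency_lists(MATRIX))
```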

Page 581: Cloudera_Developer_Training

Adjacency Lists: Critique

- Advantages:
    - Much more compact representation
    - Easy to compute outlinks
    - Graph structure can be broken up and distributed
- Disadvantages:
    - More difficult to compute inlinks
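To see why inlinks are harder, note that an adjacency list answers "where does this vertex point?" directly, but answering "what points to this vertex?" requires a pass over every list. A minimal Python sketch of that inversion, using the example graph and a hypothetical function name:

```python
from collections import defaultdict

# Adjacency lists of the example graph: vertex ID -> list of outlinks.
ADJACENCY = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}


def invert(adjacency):
    """Build the inlink lists by scanning every outlink list once."""
    inlinks = defaultdict(list)
    for source in sorted(adjacency):
        for target in adjacency[source]:
            inlinks[target].append(source)
    return dict(inlinks)


print(ADJACENCY[1])           # outlinks of vertex 1: a single lookup
print(invert(ADJACENCY)[1])   # inlinks of vertex 1: needs the full scan
```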

Page 582: Cloudera_Developer_Training

Encoding Adjacency Lists

- Adjacency lists are the preferred way of representing graphs in MapReduce
    - Typically we represent each vertex (node) with an ID number
        - A field of type long usually suffices
- Typical encoding format (Writable)
    - long: vertex ID of the source
    - int: number of outgoing edges
    - Sequence of longs: destination vertices

v1: v2, v4          1: [2] 2, 4
v2: v1, v3, v4      2: [3] 1, 3, 4
v3: v1              3: [1] 1
v4: v1, v3          4: [2] 1, 3
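The byte layout of this encoding can be illustrated with Python's struct module. This is a sketch of the layout only, not a Hadoop Writable: the function names are hypothetical, and big-endian byte order is assumed because that is the order in which Java's DataOutput writes long and int values.

```python
import struct


def encode(source, destinations):
    """Pack source ID (8-byte long), edge count (4-byte int), then one 8-byte long per destination."""
    layout = f">qi{len(destinations)}q"
    return struct.pack(layout, source, len(destinations), *destinations)


def decode(data):
    """Read the fixed 12-byte header first, then as many longs as the count says."""
    source, count = struct.unpack_from(">qi", data, 0)
    destinations = list(struct.unpack_from(f">{count}q", data, 12))
    return source, destinations


record = encode(2, [1, 3, 4])
print(len(record), decode(record))
```

Storing the edge count ahead of the destinations lets a reader know how many longs follow without needing a delimiter.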

Page 584: Cloudera_Developer_Training

Single Source Shortest Path

- Problem: find the shortest path from a source node to one or more target nodes
- Serial algorithm: Dijkstra’s Algorithm
    - Not suitable for parallelization
- MapReduce algorithm: parallel breadth-first search

Page 585: Cloudera_Developer_Training

Parallel Breadth-First Search

- The algorithm, intuitively:
    - Distance to the source = 0
    - For all nodes directly reachable from the source, distance = 1
    - For any other node n reachable from a set of nodes M, distance from source to n = 1 + min(distance from source to m) over all m in M

Page 586: Cloudera_Developer_Training

Parallel Breadth-First Search: Algorithm

- Mapper:
    - Input key is some vertex ID
    - Input value is D (distance from source), adjacency list
    - Processing: For all nodes in the adjacency list, emit (node ID, D + 1)
        - If the distance to this node is D, then the distance to any node reachable from this node is D + 1
- Reducer:
    - Receives vertex and list of distance values
    - Processing: Selects the shortest distance value for that node
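The Mapper and Reducer described here can be simulated in plain Python, with an in-memory dictionary standing in for Hadoop's shuffle. This is a sketch under stated assumptions, not Hadoop code: the function names are hypothetical, and an unknown distance is represented as infinity.

```python
from collections import defaultdict

INF = float("inf")  # stands for "distance not yet known"


def mapper(vertex, value):
    """value is (D, adjacency list); emit (neighbor, D + 1) for every neighbor."""
    distance, adjacency = value
    for neighbor in adjacency:
        yield neighbor, distance + 1


def reducer(vertex, distances):
    """Select the shortest of the candidate distances proposed for this vertex."""
    return vertex, min(distances)


def run_iteration(records):
    """Simulate one job: map every record, group by key, reduce each group."""
    grouped = defaultdict(list)
    for vertex, value in records.items():
        for key, distance in mapper(vertex, value):
            grouped[key].append(distance)
    return dict(reducer(vertex, ds) for vertex, ds in grouped.items())


# Example graph with vertex 1 as the source.
RECORDS = {1: (0, [2, 4]), 2: (INF, [1, 3, 4]), 3: (INF, [1]), 4: (INF, [1, 3])}
print(run_iteration(RECORDS))
```

After one pass, vertices 2 and 4 have distance 1. Note that the output contains distances only: the adjacency lists are gone and the source's own distance of 0 has been overwritten, so this version cannot feed a second iteration as it stands.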

Page 587: Cloudera_Developer_Training

Iterations of Parallel BFS

- A MapReduce job corresponds to one iteration of parallel breadth-first search
    - Each iteration advances the ‘known frontier’ by one hop
    - Iteration is accomplished by using the output from one job as the input to the next
- How many iterations are needed?
    - Multiple iterations are needed to explore the entire graph
    - As many as the diameter of the graph
        - Graph diameters are surprisingly small, even for large graphs
        - ‘Six degrees of separation’
- Controlling iterations in Hadoop
    - Use counters; when you reach a node, ‘count’ it
    - At the end of each iteration, check the counters
        - When you’ve reached all the nodes, you’re finished
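A driver loop along these lines might look like the following Python sketch, in which each call to bfs_iteration stands in for one MapReduce job and the returned count plays the role of a Hadoop counter. The names are hypothetical, every neighbor is assumed to appear as a key in the graph, and the early exit for unreachable nodes is an addition of this sketch rather than something the slide specifies.

```python
INF = float("inf")  # stands for "not yet reached"


def bfs_iteration(graph, distances):
    """One 'job': advance the frontier by one hop; also count newly reached nodes."""
    updated = dict(distances)
    for vertex, neighbors in graph.items():
        if distances[vertex] == INF:
            continue  # this vertex has not been reached, so it proposes nothing
        for neighbor in neighbors:
            updated[neighbor] = min(updated[neighbor], distances[vertex] + 1)
    newly_reached = sum(
        1 for v in graph if distances[v] == INF and updated[v] != INF
    )
    return updated, newly_reached


def run_until_done(graph, source):
    """Chain iterations, feeding each output into the next, until all nodes are reached."""
    distances = {v: INF for v in graph}
    distances[source] = 0
    reached, iterations = 1, 0
    while reached < len(graph):
        distances, newly_reached = bfs_iteration(graph, distances)
        if newly_reached == 0:
            break  # the remaining nodes are unreachable; stop instead of looping forever
        reached += newly_reached
        iterations += 1
    return distances, iterations


GRAPH = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
print(run_until_done(GRAPH, 1))
```

For the example graph two iterations suffice, because the farthest vertex (3) is two hops from the source.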

Page 588: Cloudera_Developer_Training

One More Trick: Preserving Graph Structure

- Characteristics of Parallel BFS
    - Mappers emit distances, Reducers select the shortest distance
    - Output of the Reducers becomes the input of the Mappers for the next iteration
- Problem: where did the graph structure (adjacency lists) go?
- Solution: Mapper must emit the adjacency lists as well
    - Mapper emits two types of key/value pairs
        - Representing distances
        - Representing adjacency lists
    - Reducer recovers the adjacency list and preserves it for the next iteration
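One way to sketch this trick in plain Python is to tag each emitted value with its type, so the Reducer can tell a distance from an adjacency list. The tags GRAPH and DIST, the function names, and the choice to re-emit each vertex's own current distance are assumptions of this sketch; it is not the pseudo-code from Lin & Dyer.

```python
from collections import defaultdict

INF = float("inf")  # stands for "distance not yet known"


def mapper(vertex, record):
    """Emit the structure and the current distance, plus D + 1 for each neighbor."""
    distance, adjacency = record
    yield vertex, ("GRAPH", adjacency)   # pass the adjacency list along
    yield vertex, ("DIST", distance)     # keep this vertex's own best distance
    if distance != INF:
        for neighbor in adjacency:
            yield neighbor, ("DIST", distance + 1)


def reducer(vertex, values):
    """Recover the adjacency list and select the shortest distance seen."""
    adjacency, best = [], INF
    for kind, payload in values:
        if kind == "GRAPH":
            adjacency = payload
        else:
            best = min(best, payload)
    return vertex, (best, adjacency)


def run_iteration(records):
    """Simulate one job; the output has the same shape as the input."""
    grouped = defaultdict(list)
    for vertex, record in records.items():
        for key, value in mapper(vertex, record):
            grouped[key].append(value)
    return dict(reducer(vertex, vals) for vertex, vals in grouped.items())


RECORDS = {1: (0, [2, 4]), 2: (INF, [1, 3, 4]), 3: (INF, [1]), 4: (INF, [1, 3])}
first = run_iteration(RECORDS)
second = run_iteration(first)
print(second)
```

Because each output record again holds a distance and an adjacency list, the result of one iteration can be fed straight into the next.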

Page 589: Cloudera_Developer_Training

Parallel BFS: Pseudo-Code

From Lin & Dyer (2010), Data-Intensive Text Processing with MapReduce

Page 590: Cloudera_Developer_Training

Parallel BFS: Demonstration

- Your instructor will now demonstrate the parallel breadth-first search algorithm

Page 591: Cloudera_Developer_Training

Graph Algorithms: General Thoughts

- MapReduce is adept at manipulating graphs
    - Store graphs as adjacency lists
- Typically, MapReduce graph algorithms are iterative
    - Iterate until some termination condition is met
    - Remember to pass the graph structure from one iteration to the next

Page 593: Cloudera_Developer_Training

Conclusion

In this appendix you have learned
- What graphs are
- Best practices for representing graphs in Hadoop
- How to implement a single source shortest path algorithm in MapReduce