
Copyright © 2010-2014 Cloudera, Inc. All rights reserved. Not to be reproduced without prior written consent.


Cloudera Custom Training Hands-On Exercises

General Notes
Hands-On Exercise: Data Ingest With Hadoop Tools
Hands-On Exercise: Running Queries from the Shell, Scripts, and Hue
Hands-On Exercise: Data Management
Hands-On Exercise: Relational Analysis
Hands-On Exercise: Working with Impala
Hands-On Exercise: Analyzing Text and Complex Data With Hive
Hands-On Exercise: Data Transformation with Hive
Hands-On Exercise: View the Spark Documentation
Hands-On Exercise: Use the Spark Shell
Hands-On Exercise: Use RDDs to Transform a Dataset
Hands-On Exercise: Process Data Files with Spark
Hands-On Exercise: Use Pair RDDs to Join Two Datasets


Hands-On Exercise: Write and Run a Spark Application
Hands-On Exercise: Configure a Spark Application
Hands-On Exercise: View Jobs and Stages in the Spark Application UI
Hands-On Exercise: Persist an RDD
Hands-On Exercise: Implement an Iterative Algorithm
Hands-On Exercise: Use Broadcast Variables
Hands-On Exercise: Use Accumulators
Hands-On Exercise: Use Spark SQL for ETL
Appendix A: Enabling iPython Notebook
Data Model Reference
Regular Expression Reference


General Notes

Cloudera's training courses use a virtual machine (VM) with a recent version of CDH already installed and configured for you. The VM runs in pseudo-distributed mode, a configuration that enables a Hadoop cluster to run on a single machine.

Points to Note While Working in the VM

1. The VM is set to automatically log in as the user training. If you log out, you can log back in as the user training with the password training. The root password is also training, though you can prefix any command with sudo to run it as root.

2. Exercises often contain steps with commands that look like this:

$ hdfs dfs -put accounting_reports_taxyear_2013 \

/user/training/tax_analysis/

The $ symbol represents the command prompt. Do not include this character when copying and pasting commands into your terminal window. Also, the backslash (\) signifies that the command continues on the next line. You may either enter the code as shown (on two lines), or omit the backslash and type the command on a single line.

Some commands are to be executed in the Python or Scala Spark shells; those are color-coded and shown with pyspark> (blue) or scala> (red) prompts, respectively. Linux command steps that apply to only one language or the other are also color-coded, but still preceded with the $ prompt.

3. Although many students are comfortable using UNIX text editors like vi or emacs, some might prefer a graphical text editor. To invoke the graphical editor from the command line, type gedit followed by the path of the file you wish to edit. Appending & to the command allows you to type additional commands while the editor is still open. Here is an example of how to edit a file named myfile.txt:

$ gedit myfile.txt &


Class-Specific VM Customization

Your VM is used in several of Cloudera's training classes. This particular class does not require some of the services that start by default, while other services that do not start by default are required for this class. Before starting the course exercises, run the course setup script:

$ ~/scripts/analyst/training_setup_da.sh

You may safely ignore any messages about services that have already been started or shut down. You only need to run this script once.

Points to Note During the Exercises

Sample Solutions

If you need a hint or want to check your work, the sample_solution subdirectory within each exercise directory contains complete code samples.

Catch-up Script

If you are unable to complete an exercise, we have provided a script to catch you up automatically. Each exercise has instructions for running the catch-up script.

$ADIR Environment Variable

$ADIR is a shortcut that points to the /home/training/training_materials/analyst directory, which contains the code and data you will use in the exercises.

Fewer Step-by-Step Instructions as You Work Through These Exercises

As the exercises progress, and you gain more familiarity with the tools you're using, we provide fewer step-by-step instructions. You should feel free to ask your instructor for assistance at any time, or to consult with your fellow students.


Bonus Exercises

Many of the exercises contain one or more optional "bonus" sections. We encourage you to work through these if time remains after you finish the main exercise and would like an additional challenge to practice what you have learned.


Hands-On Exercise: Data Ingest With Hadoop Tools

In this exercise you will practice using the Hadoop command line utility to interact with Hadoop's Distributed Filesystem (HDFS) and use Sqoop to import tables from a relational database to HDFS.

To begin, you must launch the Data Analyst VM.

Be sure you have run the setup script as described in the General Notes section above. If you have not run it yet, do so now:

$ ~/scripts/analyst/training_setup_da.sh

Step 1: Exploring HDFS using the Hue File Browser

1. Start the Firefox Web browser on your VM by clicking the icon in the system toolbar.

2. In Firefox, click on the Hue bookmark in the bookmark toolbar (or type http://localhost:8888/home into the address bar and then hit the [Enter] key).

3. After a few seconds, you should see Hue's home screen. The first time you log in, you will be prompted to create a new username and password. Enter training in both the username and password fields, and then click the "Sign In" button.

4. Whenever you log in to Hue, a Tips popup will appear. To stop it appearing in the future, check the Do not show this dialog again option before dismissing the popup.


5. Click File Browser in the Hue toolbar. Your HDFS home directory (/user/training) displays. (Since your user ID on the cluster is training, your home directory in HDFS is /user/training.) The directory contains no files or directories yet.

6. Create a temporary sub-directory: select the +New menu and click Directory.

7. Enter directory name test and click the Create button. Your home directory now contains a directory called test.

8. Click on test to view the contents of that directory; currently it contains no files or subdirectories.

9. Upload a file to the directory by selecting Upload → Files.

10. Click Select Files to bring up a file browser. By default, the /home/training/Desktop folder displays. Click the home directory button (training), then navigate to the course data directory: training_materials/analyst/data.

11. Choose any of the data files in that directory and click the Open button.

12. The file you selected will be loaded into the current HDFS directory. Click the file name to see the file's contents. Because HDFS is designed to store very large files, Hue will not display the entire file, just the first page of data. You can click the arrow buttons or use the scrollbar to see more of the data.

13. Return to the test directory by clicking View file location in the left-hand panel.

14. Above the list of files in your current directory is the full path of the directory you are currently displaying. You can click on any directory in the path, or on the first slash (/) to go to the top level (root) directory. Click training to return to your home directory.

15. Delete the temporary test directory you created, including the file in it, by selecting the checkbox next to the directory name then clicking the Move to trash button. (Confirm that you want to delete by clicking Yes.)

Step 2: Exploring HDFS using the command line

4. You can use the hdfs dfs command to interact with HDFS from the command line. Close or minimize Firefox, then open a terminal window by clicking the icon in the system toolbar.

16. In the terminal window, enter:

$ hdfs dfs

This displays a help message describing all subcommands associated with hdfs dfs.

17. Run the following command:

$ hdfs dfs -ls /

This lists the contents of the HDFS root directory. One of the directories listed is /user. Each user on the cluster has a 'home' directory below /user corresponding to his or her user ID.

18. If you do not specify a path, hdfs dfs assumes you are referring to your home directory:


$ hdfs dfs -ls

19. Note the /dualcore directory. Most of your work in this course will be in that directory. Try creating a temporary subdirectory in /dualcore:

$ hdfs dfs -mkdir /dualcore/test1

20. Next, add a Web server log file to this new directory in HDFS:

$ hdfs dfs -put $ADIR/data/access.log /dualcore/test1/

Overwriting Files in Hadoop

Unlike the UNIX shell, Hadoop won't overwrite files and directories. This feature helps protect users from accidentally replacing data that may have taken hours to produce. If you need to replace a file or directory in HDFS, you must first remove the existing one. Please keep this in mind in case you make a mistake and need to repeat a step during the Hands-On Exercises.

To remove a file:

$ hdfs dfs -rm /dualcore/example.txt

To remove a directory and all its files and subdirectories (recursively):

$ hdfs dfs -rm -r /dualcore/example/

21. Verify the last step by listing the contents of the /dualcore/test1 directory. You should observe that the access.log file is present and occupies 106,339,468 bytes of space in HDFS:

$ hdfs dfs -ls /dualcore/test1

22. Remove the temporary directory and its contents:


$ hdfs dfs -rm -r /dualcore/test1

Step 3: Importing Database Tables into HDFS with Sqoop

Dualcore stores information about its employees, customers, products, and orders in a MySQL database. In the next few steps, you will examine this database before using Sqoop to import its tables into HDFS.

5. In a terminal window, log in to MySQL and select the dualcore database:

$ mysql --user=training --password=training dualcore

23. Next, list the available tables in the dualcore database (mysql> represents the MySQL client prompt and is not part of the command):

mysql> SHOW TABLES;

24. Review the structure of the employees table and examine a few of its records:

mysql> DESCRIBE employees;

mysql> SELECT emp_id, fname, lname, state, salary FROM employees LIMIT 10;

25. Exit MySQL by typing quit, and then hit the Enter key:

mysql> quit

Data Model Reference

For your convenience, you will find a reference section depicting the structure of the tables you will use in the exercises at the end of this Exercise Manual.

26. Next, run the following command, which imports the employees table into the /dualcore directory created earlier, using tab characters to separate each field:


$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table employees

Hiding Passwords

Typing the database password on the command line is a potential security risk since others may see it. An alternative to using the --password argument is to use -P and let Sqoop prompt you for the password, which is then not visible when you type it.

Sqoop Code Generation

After running the sqoop import command above, you may notice a new file named employees.java in your local directory. This is an artifact of Sqoop's code generation and is really only of interest to Java developers, so you can ignore it.

27. Revise the previous command and import the customers table into HDFS.

28. Revise the previous command and import the products table into HDFS.

29. Revise the previous command and import the orders table into HDFS.


30. Next, import the order_details table into HDFS. The command is slightly different because this table only holds references to records in the orders and products tables, and lacks a primary key of its own. Consequently, you will need to specify the --split-by option and instruct Sqoop to divide the import work among tasks based on values in the order_id field. An alternative is to use the -m 1 option to force Sqoop to import all the data with a single task, but this would significantly reduce performance.

$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table order_details \
--split-by=order_id

This is the end of the Exercise


Hands-On Exercise: Running Queries from the Shell, Scripts, and Hue

Exercise directory: $ADIR/exercises/queries

In this exercise you will practice using the Hue query editor and the Impala and Hive shells to execute simple queries. These exercises use the tables that have been populated with data you imported to HDFS using Sqoop in the "Data Ingest With Hadoop Tools" exercise.

IMPORTANT: In order to prepare the data for this exercise, you must run the following command before continuing:

$ ~/scripts/analyst/catchup.sh

Step #1: Explore the customers table using Hue

One way to run Impala and Hive queries is through your Web browser using Hue's Query Editors. This is especially convenient if you use more than one computer – or if you use a device (such as a tablet) that isn't capable of running the Impala or Beeline shells itself – because it does not require any software other than a browser.

6. Start the Firefox Web browser if it isn't running, then click on the Hue bookmark in the Firefox bookmark toolbar (or type http://localhost:8888/home into the address bar and then hit the [Enter] key).

7. After a few seconds, you should see Hue's home screen. If you don't currently have an active session, you will first be prompted to log in. Enter training in both the username and password fields, and then click the Sign In button.

8. Select the Query Editors menu in the Hue toolbar. Note that there are query editors for both Impala and Hive (as well as other tools such as Pig). The interface is very similar for both Hive and Impala. For these exercises, select the Impala query editor.


9. This is the first time we have run Impala since we imported the data using Sqoop. Tell Impala to reload the HDFS metadata for the table by entering the following command in the query area, then clicking Execute.

INVALIDATE METADATA

10. Make sure the default database is selected in the database list on the left side of the page.

11. Below the selected database is a list of the tables in that database. Select the customers table to view the columns in the table.

12. Click the Preview Sample Data icon next to the table name to view sample data from the table. When you are done, click the OK button to close the window.

Step #2: Run a Query Using Hue

Dualcore ran a contest in which customers posted videos of interesting ways to use their new tablets. A $5,000 prize will be awarded to the customer whose video received the highest rating.

However, the registration data was lost due to an RDBMS crash, and the only information we have is from the videos. The winning customer introduced herself only as "Bridget from Kansas City" in her video.

You will need to run a query that identifies the winner's record in our customer database so that we can send her the $5,000 prize.

13. All you know about the winner is that her name is Bridget and she lives in Kansas City. In the Impala Query Editor, enter a query in the text area to find the winning customer. Use the LIKE operator to do a wildcard search for names such as "Bridget", "Bridgette" or "Bridgitte". Remember to filter on the customer's city.
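
If you would like a starting point, one possible form of the query is sketched below; it assumes the customers table includes fname and city columns as shown in the Data Model Reference at the end of this manual:

SELECT cust_id, fname, lname
FROM customers
WHERE fname LIKE 'Bridget%'
AND city = 'Kansas City';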

14. After entering the query, click the Execute button.

While the query is executing, the Log tab displays ongoing log output from the query. When the query is complete, the Results tab opens, displaying the results of the query.

Question: Which customer did your query identify as the winner of the $5,000 prize?


Step #3: Run a Query from the Impala Shell

Run a top-N query to identify the three most expensive products that Dualcore currently offers.

15. Start a terminal window if you don't currently have one running.

16. On the Linux command line in the terminal window, start the Impala shell:

$ impala-shell

Impala displays the URL of the Impala server in the shell command prompt, e.g.:

[localhost.localdomain:21000] >

17. At the prompt, review the schema of the products table by entering:

DESCRIBE products;

Remember that SQL commands in the shell must be terminated by a semicolon (;), unlike in the Hue query editor.

18. Show a sample of 10 records from the products table:

SELECT * FROM products LIMIT 10;

19. Execute a query that displays the three most expensive products. Hint: Use ORDER BY.
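
One possible query, assuming the products table stores the selling price in a column named price (as shown in the Data Model Reference):

SELECT name, price
FROM products
ORDER BY price DESC
LIMIT 3;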

20. When you are done, exit the Impala shell:

exit;

Step #4: Run a Script in the Impala Shell

The rules for the contest described earlier require that the winner bought the advertised tablet from Dualcore between May 1, 2013 and May 31, 2013. Before we can authorize our accounting department to pay the $5,000 prize, you must ensure that Bridget is eligible.


Since this query involves joining data from several tables, and we have not yet covered JOIN, you've been provided with a script in the exercise directory.

21. Change to the directory for this hands-on exercise:

$ cd $ADIR/exercises/queries

22. Review the code for the query:

$ cat verify_tablet_order.sql

23. Execute the script using the shell's -f option:

$ impala-shell -f verify_tablet_order.sql

Question: Did Bridget order the advertised tablet in May?

Step #5: Run a Query Using Beeline

24. At the Linux command line in a terminal window, start Beeline:

$ beeline -u jdbc:hive2://localhost:10000

Beeline displays the URL of the Hive server in the shell command prompt, e.g.:

0: jdbc:hive2://localhost:10000>

25. Execute a query to find all the Gigabux brand products whose price is less than 1000 ($10).
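
One possible query, assuming brand and price columns as shown in the Data Model Reference:

SELECT name, price
FROM products
WHERE brand = 'Gigabux'
AND price < 1000;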

26. Exit the Beeline shell by entering:

!exit

This is the end of the Exercise


Hands-On Exercise: Data Management

Exercise directory: $ADIR/exercises/data_mgmt

In this exercise you will practice using several common techniques for creating and populating tables.

IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:

$ ~/scripts/analyst/catchup.sh

Step #1: Review Existing Tables using the Metastore Manager

1. In Firefox, visit the Hue home page, and then choose Data Browsers → Metastore Tables in the Hue toolbar.

2. Make sure the default database is selected.

3. Select the customers table to display the table browser and review the list of columns.

4. Select the Sample tab to view the first hundred rows of data.

Step #2: Create and Load a New Table using the Metastore Manager

Create and then load a table with product ratings data.

5. Before creating the table, review the files containing the product ratings data. The files are in /home/training/training_materials/analyst/data. You can use the head command in a terminal window to see the first few lines:


$ head $ADIR/data/ratings_2012.txt

$ head $ADIR/data/ratings_2013.txt

6. Copy the data files to the /dualcore directory in HDFS. You may use either the Hue File Browser, or the hdfs command in the terminal window:

$ hdfs dfs -put $ADIR/data/ratings_2012.txt /dualcore/

$ hdfs dfs -put $ADIR/data/ratings_2013.txt /dualcore/

7. Return to the Metastore Manager in Hue. Select the default database to view the table browser.

8. Click on Create a new table manually to start the table definition wizard.

9. The first wizard step is to specify the table's name (required) and a description (optional). Enter table name ratings, then click Next.

10. In the next step you can choose whether the table will be stored as a regular text file or use a custom Serializer/Deserializer, or SerDe. SerDes will be covered later in the course. For now, select Delimited, then click Next.

11. The next step allows you to change the default delimiters. For a simple table, only the field terminator is relevant; collection and map delimiters are used for complex data in Hive, and will be covered later in the course. Select Tab (\t) for the field terminator, then click Next.

12. In the next step, choose a file format. File formats will be covered later in the course. For now, select TextFile, then click Next.

13. In the next step, you can choose whether to store the file in the default data warehouse directory or a different location. Make sure the Use default location box is checked, then click Next.

14. The next step in the wizard lets you add columns. The first column of the ratings table is the timestamp of the time that the rating was posted. Enter column name posted and choose column type timestamp.


15. You can add additional columns by clicking the Add a column button. Repeat the steps above to enter a column name and type for all the columns of the ratings table:

Field Name    Field Type
posted        timestamp
cust_id       int
prod_id       int
rating        tinyint
message       string

16. When you have added all the columns, scroll down and click Create table. This will start a job to define the table in the Metastore, and create the warehouse directory in HDFS to store the data.

17. When the job is complete, the new table will appear in the table browser.

18. Optional: Use the Hue File Browser or the hdfs command to view the /user/hive/warehouse directory to confirm creation of the ratings subdirectory.

19. Now that the table is created, you can load data from a file. One way to do this is in Hue. Click Import Table under Actions.

20. In the Import data dialog box, enter or browse to the HDFS location of the 2012 product ratings data file: /dualcore/ratings_2012.txt. Then click Submit. (You will load the 2013 ratings in a moment.)

21. Next, verify that the data was loaded by selecting the Sample tab in the table browser for the ratings table.

22. Try querying the data in the table. In Hue, switch to the Impala Query Editor.


23. Initially the new table will not appear. You must first reload Impala's metadata cache by entering and executing the command below. (Impala metadata caching will be covered in depth later in the course.)

INVALIDATE METADATA;

24. If the table does not appear in the table list on the left, click the Reload button. (This refreshes the page, not the metadata itself.)

25. Try executing a query, such as counting the number of ratings:

SELECT COUNT(*) FROM ratings;

The total number of records should be 464.

26. Another way to load data into a table is using the LOAD DATA command. Load the 2013 ratings data:

LOAD DATA INPATH '/dualcore/ratings_2013.txt' INTO TABLE ratings;

27. The LOAD DATA INPATH command moves the file to the table's directory. Using the Hue File Browser or hdfs command, verify that the file is no longer present in the original directory:

$ hdfs dfs -ls /dualcore/ratings_2013.txt

28. Optional: Verify that the 2013 data is shown alongside the 2012 data in the table's warehouse directory.

29. Finally, count the records in the ratings table to ensure that all 21,997 are available:

SELECT COUNT(*) FROM ratings;


Step #3: Create an External Table Using CREATE TABLE

You imported data from the employees table in MySQL into HDFS in an earlier exercise. Now we want to be able to query this data. Since the data already exists in HDFS, this is a good opportunity to use an external table.

In the last exercise you practiced creating a table using the Metastore Manager; this time, use an Impala SQL statement. You may use either the Impala shell, or the Impala Query Editor in Hue.

30. Write and execute a CREATE TABLE statement to create an external table for the tab-delimited records in HDFS at /dualcore/employees. The data format is shown below:

Field Name    Field Type
emp_id        STRING
fname         STRING
lname         STRING
address       STRING
city          STRING
state         STRING
zipcode       STRING
job_title     STRING
email         STRING
active        STRING
salary        INT
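
One possible statement is sketched below; it assumes the tab-delimited data Sqoop wrote to /dualcore/employees, and uses the column names and types from the field list above:

CREATE EXTERNAL TABLE employees (
emp_id STRING,
fname STRING,
lname STRING,
address STRING,
city STRING,
state STRING,
zipcode STRING,
job_title STRING,
email STRING,
active STRING,
salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/dualcore/employees';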

31. Run the following query to verify that you have created the table correctly.

SELECT job_title, COUNT(*) AS num
FROM employees
GROUP BY job_title
ORDER BY num DESC
LIMIT 3;

It should show that Sales Associate, Cashier, and Assistant Manager are the three most common job titles at Dualcore.


Bonus Exercise #1: Use Sqoop’s Hive Import Option to Create a Table

If you have successfully finished the main exercise and still have time, feel free to continue with this bonus exercise.

You used Sqoop in an earlier exercise to import data from MySQL into HDFS. Sqoop can also create a Hive/Impala table with the same fields as the source table in addition to importing the records, which saves you from having to write a CREATE TABLE statement.

32. In a terminal window, execute the following command to import the suppliers table from MySQL as a new managed table:

$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--table suppliers \
--hive-import

33. It is always a good idea to validate data after adding it. Execute the following query to count the number of suppliers in Texas. You may use either the Impala shell or the Hue Impala Query Editor. Remember to invalidate the metadata cache so that Impala can find the new table.

INVALIDATE METADATA;

SELECT COUNT(*) FROM suppliers WHERE state='TX';

The query should show that nine records match.

Bonus Exercise #2: Alter a Table

If you have successfully finished the main exercise and still have time, feel free to continue with this bonus exercise. You can compare your work against the files found in the bonus_02/sample_solution/ subdirectory.


In this exercise you will modify the suppliers table you imported using Sqoop in the previous exercise. You may complete these exercises using either the Impala shell or the Impala query editor in Hue.

34. Use ALTER TABLE to rename the company column to name.
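
One possible statement, assuming the company column was imported as a STRING (the column type is restated when renaming a column):

ALTER TABLE suppliers CHANGE company name STRING;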

35. Use the DESCRIBE command on the suppliers table to verify the change.

36. Use ALTER TABLE to rename the entire table to vendors.
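
One possible statement for renaming the table:

ALTER TABLE suppliers RENAME TO vendors;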

37. Although the ALTER TABLE command often requires that we make a corresponding change to the data in HDFS, renaming a table or column does not. You can verify this by running a query on the table using the new names, e.g.:

SELECT supp_id, name FROM vendors LIMIT 10;

This is the end of the Exercise


Hands-On Exercise: Relational Analysis

Exercise directory: $ADIR/exercises/relational_analysis

In this exercise you will write queries to analyze data in tables that have been populated with data you imported to HDFS using Sqoop in the "Data Ingest" exercise.

IMPORTANT: In order to prepare the data for this exercise, you must run the following command before continuing:

$ ~/scripts/analyst/catchup.sh

Several analysis questions are described below and you will need to write the SQL code to answer them. You can use whichever tool you prefer – Impala or Hive – using whichever method you like best, including shell, script, or the Hue Query Editor, to run your queries.

Step #1: Calculate Top N Products

• Which top three products has Dualcore sold more of than any other? Hint: Remember that if you use a GROUP BY clause, you must group by all fields listed in the SELECT clause that are not part of an aggregate function.
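
One possible approach is sketched below; essentially the same query appears later in the "Working with Impala" exercise, joining products to order_details and counting how often each product was ordered:

SELECT brand, name, COUNT(p.prod_id) AS sold
FROM products p
JOIN order_details d
ON (p.prod_id = d.prod_id)
GROUP BY brand, name, p.prod_id
ORDER BY sold DESC
LIMIT 3;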

Step #2: Calculate Order Total

• Which orders had the highest total?
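
One possible query, assuming each order's total is the sum of the prices of the products referenced by its order_details records:

SELECT d.order_id, SUM(p.price) AS total
FROM order_details d
JOIN products p
ON (d.prod_id = p.prod_id)
GROUP BY d.order_id
ORDER BY total DESC
LIMIT 1;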

Step #3: Calculate Revenue and Profit

• Write a query to show Dualcore's revenue (total price of products sold) and profit (price minus cost) by date.

o Hint: The order_date column in the orders table is of type TIMESTAMP. Use TO_DATE to get just the date portion of the value.
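
One possible query is sketched below; it assumes price and cost columns in the products table and an order_date column in the orders table, as shown in the Data Model Reference:

SELECT TO_DATE(o.order_date) AS order_dt,
SUM(p.price) AS revenue,
SUM(p.price - p.cost) AS profit
FROM orders o
JOIN order_details d
ON (o.order_id = d.order_id)
JOIN products p
ON (d.prod_id = p.prod_id)
GROUP BY TO_DATE(o.order_date);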


There are several ways you could write these queries. One possible solution for each is in the sample_solution/ directory.

Bonus Exercise #1: Rank Daily Profits by Month

If you have successfully finished the earlier steps and still have time, feel free to continue with this optional bonus exercise.

• Write a query to show how each day's profit ranks compared to other days within the same year and month.

o Hint: Use the previous exercise's solution as a sub-query; find the ROW_NUMBER of the results within each year and month.
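
A rough sketch of one approach is shown below, using the daily-profit query from the previous exercise as a sub-query; exact date-function syntax differs slightly between Hive and Impala, so treat this as an outline rather than a finished solution:

SELECT order_dt, profit,
ROW_NUMBER() OVER (PARTITION BY YEAR(order_dt), MONTH(order_dt)
ORDER BY profit DESC) AS day_rank
FROM (SELECT TO_DATE(o.order_date) AS order_dt,
SUM(p.price - p.cost) AS profit
FROM orders o
JOIN order_details d ON (o.order_id = d.order_id)
JOIN products p ON (d.prod_id = p.prod_id)
GROUP BY TO_DATE(o.order_date)) daily;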

There are several ways you could write this query. One possible solution is in the bonus_01/sample_solution/ directory.

This is the end of the Exercise


Hands-On Exercise: Working with Impala

In this exercise you will explore the query execution plan for various types of queries in Impala.

IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:

$ ~/scripts/analyst/catchup.sh

Step #1: Review Query Execution Plans

1. Review the execution plan for the following query. You may use either the Impala Query Editor in Hue or the Impala shell command line tool.

SELECT * FROM products;

2. Note that the query explanation includes a warning that table and column stats are not available for the products table. Compute the stats by executing:

COMPUTE STATS products;

3. Now view the query plan again, this time without the warning.

4. The previous query was a very simple query against a single table. Try reviewing the query plan of a more complex query. The following query returns the top 3 products sold. Before EXPLAINing the query, compute stats on the tables to be queried.


SELECT brand, name, COUNT(p.prod_id) AS sold
FROM products p
JOIN order_details d
ON (p.prod_id = d.prod_id)
GROUP BY brand, name, p.prod_id
ORDER BY sold DESC
LIMIT 3;

Questions: How many stages are there in this query? What are the estimated per-host memory requirements for this query? What is the total size of all partitions to be scanned?

5. The tables in the queries above all have only a single partition. Try reviewing the query plan for a partitioned table. Recall that in the "Data Storage and Performance" exercise, you created an ads table partitioned on the network column. Compare the query plans for the following two queries. The first calculates the total cost of clicked ads for each ad campaign; the second does the same, but only for ads on one of the ad networks.

SELECT campaign_id, SUM(cpc)
FROM ads
WHERE was_clicked=1
GROUP BY campaign_id
ORDER BY campaign_id;

SELECT campaign_id, SUM(cpc)
FROM ads
WHERE network=1
GROUP BY campaign_id
ORDER BY campaign_id;

Questions: What are the estimated per-host memory requirements for the two queries? What explains the difference?


Bonus Exercise #1: Review the Query Summary

If you have successfully finished the earlier steps and still have time, feel free to continue with this optional bonus exercise.

This exercise must be completed in the Impala Shell command line tool, because it uses features not yet available in Hue. Refer to the "Running Queries from the Shell, Scripts, and Hue" exercise for how to use the shell if needed.

6. Try executing one of the queries you examined above, e.g.:

SELECT brand, name, COUNT(p.prod_id) AS sold
FROM products p
JOIN order_details d
ON (p.prod_id = d.prod_id)
GROUP BY brand, name, p.prod_id
ORDER BY sold DESC
LIMIT 3;

7. After the query completes, execute the SUMMARY command:

SUMMARY;

8. Questions: Which stage took the longest average time to complete? Which took the most memory?

This is the end of the Exercise


Hands-On Exercise: Analyzing Text and Complex Data With Hive

Exercise directory: $ADIR/exercises/complex_data

In this exercise, you will:

• Use Hive's ability to store complex data to work with data from a customer loyalty program

• Use a RegexSerDe to load web log data into Hive

• Use Hive's text processing features to analyze customers' comments and product ratings, uncover problems, and propose potential solutions

IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:

$ ~/scripts/analyst/catchup.sh

Step #1: Create, Load and Query a Table with Complex Data

Dualcore recently started a loyalty program to reward our best customers. A colleague has already provided us with a sample of the data that contains information about customers who have signed up for the program, including their phone numbers (as a map), a list of past order IDs (as an array), and a struct that summarizes the minimum, maximum, average, and total value of past orders. You will create the table, populate it with the provided data, and then run a few queries to practice referencing these types of fields.

You may use either the Beeline shell or Hue's Hive Query Editor to complete these exercises.

1. Create a table with the following characteristics:


Name: loyalty_program
Type: EXTERNAL
Columns:

Field Name    Field Type
cust_id       STRING
fname         STRING
lname         STRING
email         STRING
level         STRING
phone         MAP<STRING,STRING>
order_ids     ARRAY<INT>
order_value   STRUCT<min:INT, max:INT, avg:INT, total:INT>

Field terminator: | (vertical bar)
Collection item terminator: , (comma)
Map key terminator: : (colon)
Location: /dualcore/loyalty_program
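
A possible CREATE TABLE statement matching the specification above, using Hive's complex-type and delimiter syntax:

CREATE EXTERNAL TABLE loyalty_program (
cust_id STRING,
fname STRING,
lname STRING,
email STRING,
level STRING,
phone MAP<STRING,STRING>,
order_ids ARRAY<INT>,
order_value STRUCT<min:INT, max:INT, avg:INT, total:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION '/dualcore/loyalty_program';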

2. Examine the data in $ADIR/data/loyalty_data.txt to see how it corresponds to the fields in the table.

3. Load the data file by placing it into the HDFS data warehouse directory for the new table. You can use either the Hue File Browser, or the hdfs command:

$ hdfs dfs -put $ADIR/data/loyalty_data.txt \

/dualcore/loyalty_program/

4. Run a query to select the HOME phone number (hint: map keys are case-sensitive) for customer ID 1200866. You should see 408-555-4914 as the result.

5. Select the third element from the order_ids array for customer ID 1200866 (hint: elements are indexed from zero). The query should return 5278505.

6. Select the total attribute from the order_value struct for customer ID 1200866. The query should return 401874.
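
Possible queries for steps 4 through 6, using Hive's bracket and dot syntax for map, array, and struct fields (cust_id is a STRING, so the ID is quoted):

SELECT phone['HOME'] FROM loyalty_program WHERE cust_id = '1200866';

SELECT order_ids[2] FROM loyalty_program WHERE cust_id = '1200866';

SELECT order_value.total FROM loyalty_program WHERE cust_id = '1200866';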


Step #2: Create and Populate the Web Logs Table

Many interesting analyses can be done on data from the usage of a website. The first step is to load the semi-structured data in the web log files into a Hive table. Typical log file formats are not delimited, so you will need to use the RegexSerDe and specify a pattern Hive can use to parse lines into individual fields you can then query.

7. Examine the create_web_logs.hql script to get an idea of how it uses a RegexSerDe to parse lines in the log file (an example log line is shown in the comment at the top of the file). When you have examined the script, run it to create the table. You can paste the code into the Hive Query Editor, or use HCatalog:

$ hcat -f $ADIR/exercises/complex_data/create_web_logs.hql

8. Populate the table by adding the log file to the table's directory in HDFS:

$ hdfs dfs -put $ADIR/data/access.log /dualcore/web_logs/

9. Verify that the data is loaded correctly by running this query to show the top three items users searched for on our Web site:

SELECT term, COUNT(term) AS num FROM
(SELECT LOWER(REGEXP_EXTRACT(request,
'/search\\?phrase=(\\S+)', 1)) AS term
FROM web_logs
WHERE request REGEXP '/search\\?phrase=') terms
GROUP BY term
ORDER BY num DESC
LIMIT 3;

You should see that it returns tablet (303), ram (153) and wifi (148).

Note: The REGEXP operator, which is available in some SQL dialects, is similar to LIKE, but uses regular expressions for more powerful pattern matching. The REGEXP operator is synonymous with the RLIKE operator.


Bonus Exercise #1: Analyze Numeric Product Ratings

If you have successfully finished the earlier steps and still have time, feel free to continue with this optional bonus exercise.

Customer ratings and feedback are great sources of information for both customers and retailers like Dualcore.

However, customer comments are typically free-form text and must be handled differently. Fortunately, Hive provides extensive support for text processing.

Before delving into text processing, you'll begin by analyzing the numeric ratings customers have assigned to various products. In the next bonus exercise, you will use these results in doing text analysis.

10. Review the ratings table structure using the Hive Query Editor or using the DESCRIBE command in the Beeline shell.

11. We want to find the product that customers like most, but must guard against being misled by products that have few ratings assigned. Run the following query to find the product with the highest average among all those with at least 50 ratings:

SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
COUNT(*) AS num
FROM ratings
GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating DESC
LIMIT 1;

12. Rewrite, and then execute, the query above to find the product with the lowest average among products with at least 50 ratings. You should see that the result is product ID 1274673 with an average rating of 1.10.
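
One way to do this is simply to reverse the sort order of the previous query:

SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
COUNT(*) AS num
FROM ratings
GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating ASC
LIMIT 1;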


Bonus Exercise #2: Analyze Rating Comments

We observed earlier that customers are very dissatisfied with one of the products we sell. Although numeric ratings can help identify which product that is, they don't tell us why customers don't like the product. Although we could simply read through all the comments associated with that product to learn this information, that approach doesn't scale. Next, you will use Hive's text processing support to analyze the comments.

13. The following query normalizes all comments on that product to lowercase, breaks them into individual words using the SENTENCES function, and passes those to the NGRAMS function to find the five most common bigrams (two-word combinations). Run the query:

SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 2, 5))
AS bigrams
FROM ratings
WHERE prod_id = 1274673;

14. Most of these words are too common to provide much insight, though the word "expensive" does stand out in the list. Modify the previous query to find the five most common trigrams (three-word combinations), and then run that query in Hive.
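
Changing the second argument of NGRAMS from 2 to 3 is all that is required:

SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 3, 5))
AS trigrams
FROM ratings
WHERE prod_id = 1274673;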

15. Among the patterns you see in the results is the phrase "ten times more." This might be related to the complaints that the product is too expensive. Now that you've identified a specific phrase, look at a few comments that contain it by running this query:

SELECT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%ten times more%'
LIMIT 3;

You should see three comments that say, "Why does the red one cost ten times more than the others?"


16. We can infer that customers are complaining about the price of this item, but the comment alone doesn't provide enough detail. One of the words ("red") in that comment was also found in the list of trigrams from the earlier query. Write and execute a query that will find all distinct comments containing the word "red" that are associated with product ID 1274673.
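
One possible query; a simple LIKE pattern is enough here, though it would also match words that merely contain "red":

SELECT DISTINCT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%red%';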

17. The previous step should have displayed two comments:

18. "What is so special about red?"

19. "Why does the red one cost ten times more than the others?"

The second comment implies that this product is overpriced relative to similar products. Write and run a query that will display the record for product ID 1274673 in the products table.
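
A simple lookup by product ID is sufficient here:

SELECT * FROM products WHERE prod_id = 1274673;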

20. Your query should have shown that the product was a "16 GB USB Flash Drive (Red)" from the "Orion" brand. Next, run this query to identify similar products:

SELECT *
FROM products
WHERE name LIKE '%16 GB USB Flash Drive%'
AND brand='Orion';

The query results show that we have three almost identical products, but the product with the negative reviews (the red one) costs about ten times as much as the others, just as some of the comments said.

Based on the cost and price columns, it appears that doing text processing on the product ratings has helped us uncover a pricing error.

This is the end of the Exercise


Hands-On Exercise: Data Transformation with Hive

Exercise directory: $ADIR/exercises/transform

In this exercise you will explore the data from Dualcore's Web server that you loaded in the "Analyzing Text and Complex Data" exercise. Queries on that data will reveal that many customers abandon their shopping carts before completing the checkout process. You will create several additional tables, using data from a TRANSFORM script and a supplied UDF, which you will use later to analyze how Dualcore could turn this problem into an opportunity.

IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:

$ ~/scripts/analyst/catchup.sh

Step #1: Analyze Customer Checkouts

As on many Web sites, Dualcore's customers add products to their shopping carts and then follow a "checkout" process to complete their purchase. We want to figure out if customers who start the checkout process are completing it. Since each part of the four-step checkout process can be identified by its URL in the logs, we can use a regular expression to identify them:

Step  Request URL                          Description
1     /cart/checkout/step1-viewcart        View list of items added to cart
2     /cart/checkout/step2-shippingcost    Notify customer of shipping cost
3     /cart/checkout/step3-payment         Gather payment information
4     /cart/checkout/step4-receipt         Show receipt for completed order


Note: Because the web_logs table uses a RegexSerDe, which is a feature not supported by Impala, this step must be completed in Hive. You may use either the Beeline shell or the Hive Query Editor in Hue.

1. Run the following query in Hive to show the number of requests for each step of the checkout process:

SELECT COUNT(*), request
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY request;

The results of this query highlight a major problem. About one out of every three customers abandons their cart after the second step. This might mean millions of dollars in lost revenue, so let's see if we can determine the cause.

2. The log file's cookie field stores a value that uniquely identifies each user session. Since not all sessions involve checkouts at all, create a new table containing the session ID and number of checkout steps completed for just those sessions that do:

CREATE TABLE checkout_sessions AS
SELECT cookie, ip_address, COUNT(request) AS steps_completed
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY cookie, ip_address;

3. Run this query to show the number of people who abandoned their cart after each step:

SELECT steps_completed, COUNT(cookie) AS num
FROM checkout_sessions
GROUP BY steps_completed;

You should see that most customers who abandoned their order did so after the second step, which is when they first learn how much it will cost to ship their order.


4. Optional: Because the new checkout_sessions table does not use a RegexSerDe, it can be queried in Impala. Try running the same query as in the previous step in Impala. What happens?

Step #2: Use TRANSFORM for IP Geolocation

Based on what you've just seen, it seems likely that customers abandon their carts due to high shipping costs. The shipping cost is based on the customer's location and the weight of the items they've ordered. Although this information isn't in the database (since the order wasn't completed), we can gather enough data from the logs to estimate it.

We don't have the customer's address, but we can use a process known as "IP geolocation" to map the computer's IP address in the log file to an approximate physical location. Since this isn't a built-in capability of Hive, you'll use a provided Python script to TRANSFORM the ip_address field from the checkout_sessions table to a ZIP code, as part of a HiveQL statement that creates a new table called cart_zipcodes.

Regarding TRANSFORM and UDF Examples in this Exercise

During this exercise, you will use a Python script for IP geolocation and a UDF to calculate shipping costs. Both are implemented merely as a simulation – compatible with the fictitious data we use in class and intended to work even when Internet access is unavailable. The focus of these exercises is on how to use external scripts and UDFs, rather than how the code for the examples works internally.

5. Examine the create_cart_zipcodes.hql script and observe the following:

a. It creates a new table called cart_zipcodes based on a SELECT statement.

b. That SELECT statement transforms the ip_address, cookie, and steps_completed fields from the checkout_sessions table using a Python script.

c. The new table contains the ZIP code instead of an IP address, plus the other two fields from the original table.
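
For reference, the general shape of such a statement looks like the sketch below; the actual create_cart_zipcodes.hql script may differ in details such as how the script is made available and the column types used:

ADD FILE hdfs:/dualcore/ipgeolocator.py;

CREATE TABLE cart_zipcodes AS
SELECT TRANSFORM (ip_address, cookie, steps_completed)
USING 'ipgeolocator.py'
AS (zipcode STRING, cookie STRING, steps_completed INT)
FROM checkout_sessions;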

6. Examine the ipgeolocator.py script and observe the following:


a. Records are read from Hive on standard input.

b. The script splits them into individual fields using a tab delimiter.

c. The ip_addr field is converted to zipcode, but the cookie and steps_completed fields are passed through unmodified.

d. The three fields in each output record are delimited with tabs and printed to standard output.

7. Copy the Python file to HDFS so that the HiveServer can access it. You may use the Hue File Browser or the hdfs command:

$ hdfs dfs -put $ADIR/exercises/transform/ipgeolocator.py \

/dualcore/

8. Run the script to create the cart_zipcodes table. You can either paste the code into the Hive Query Editor, or use Beeline in a terminal window:

$ beeline -u jdbc:hive2://localhost:10000 \

-f $ADIR/exercises/transform/create_cart_zipcodes.hql

Step #3: Extract List of Products Added to Each Cart

As described earlier, estimating the shipping cost also requires a list of items in the customer's cart. You can identify products added to the cart since the request URL looks like this (only the product ID changes from one record to the next): /cart/additem?productid=1234567

9. Write a HiveQL statement to create a table called cart_items with two fields, cookie and prod_id, based on data selected from the web_logs table. Keep the following in mind when writing your statement:

a. The prod_id field should contain only the seven-digit product ID (hint: use the REGEXP_EXTRACT function)


b. Use a WHERE clause with REGEXP using the same regular expression as above, so that you only include records where customers are adding items to the cart.

c. If you need a hint on how to write the statement, look at the create_cart_items.hql file in the exercise's sample_solution directory; a rough sketch also appears below.
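
A possible sketch of one approach; the exact regular expression and any type casting may differ from the official create_cart_items.hql solution:

CREATE TABLE cart_items AS
SELECT cookie,
REGEXP_EXTRACT(request, 'productid=([0-9]{7})', 1) AS prod_id
FROM web_logs
WHERE request REGEXP '/cart/additem\\?productid=';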

10. Verify the contents of the new table by running this query:

SELECT COUNT(DISTINCT cookie) FROM cart_items
WHERE prod_id=1273905;

If this doesn't return 47, then compare your statement to the create_cart_items.hql file, make the necessary corrections, and then re-run your statement (after dropping the cart_items table).

Step #4: Create Tables to Join Web Logs with Product Data

YounowhavetablesrepresentingtheZIPcodesandproductsassociatedwithcheckoutsessions,butyou'llneedtojointhesewiththeproductstabletogettheweightoftheseitemsbeforeyoucanestimateshippingcosts.Inordertodosomemoreanalysislater,we’llalsoincludetotalsellingpriceandtotalwholesalecostinadditiontothetotalshippingweightforallitemsinthecart.

11. Run the following HiveQL to create a table called cart_orders with the information:


CREATE TABLE cart_orders AS

SELECT z.cookie, steps_completed, zipcode,

SUM(shipping_wt) AS total_weight,

SUM(price) AS total_price,

SUM(cost) AS total_cost

FROM cart_zipcodes z

JOIN cart_items i

ON (z.cookie = i.cookie)

JOIN products p

ON (i.prod_id = p.prod_id)

GROUP BY z.cookie, zipcode, steps_completed;

Step #5: Create a Table Using a UDF to Estimate Shipping Cost

We finally have all the information we need to estimate the shipping cost for each abandoned order. One of the developers on our team has already written, compiled, and packaged a Hive UDF that will calculate the shipping cost given a ZIP code and the total weight of all items in the order.

12. Before you can use a UDF, you must make it available to Hive. First, copy the file to HDFS so that the Hive Server can access it. You may use the Hue File Browser or the hdfs command:


$ hdfs dfs -put \

$ADIR/exercises/transform/geolocation_udf.jar \

/dualcore/

13. Next, register the function with Hive and provide the name of the UDF class as well as the alias you want to use for the function. Run the Hive command below to associate the UDF with the alias CALC_SHIPPING_COST:

CREATE TEMPORARY FUNCTION CALC_SHIPPING_COST

AS 'com.cloudera.hive.udf.UDFCalcShippingCost'

USING JAR 'hdfs:/dualcore/geolocation_udf.jar';

14. Now create a new table called cart_shipping that will contain the session ID, number of steps completed, total retail price, total wholesale cost, and the estimated shipping cost for each order based on data from the cart_orders table:

CREATE TABLE cart_shipping AS

SELECT cookie, steps_completed, total_price, total_cost,

CALC_SHIPPING_COST(zipcode, total_weight) AS shipping_cost

FROM cart_orders;

15. Finally, verify your table by running the following query to check a record:

SELECT * FROM cart_shipping WHERE cookie='100002920697';

This should show that session as having two completed steps, a total retail price of $263.77, a total wholesale cost of $236.98, and a shipping cost of $9.09.

Note: The total_price, total_cost, and shipping_cost columns in the cart_shipping table contain the number of cents as integers. Be sure to divide results containing monetary amounts by 100 to get dollars and cents.


You may now shut down the Data Analyst VM and launch the Spark VM. Run the following script:

$ ~/scripts/sparkdev/training_setup_sparkdev.sh

This is the end of the Exercise


Hands-On Exercise: View the Spark Documentation

In this exercise, you will familiarize yourself with the Spark documentation.

You must now shut down the Data Analyst VM and launch the Spark VM, if you have not already done so. IMPORTANT: In order to prepare for this exercise, you must run the following command before continuing:

$ ~/scripts/sparkdev/training_setup_sparkdev.sh

1. Start Firefox in your Virtual Machine and visit the Spark documentation on your local machine, using the provided bookmark or opening the URL file:/usr/lib/spark/docs/_site/index.html

2. From the Programming Guides menu, select the Spark Programming Guide. Briefly review the guide. You may wish to bookmark the page for later review.

3. From the API Docs menu, select either Scala or Python, depending on your language preference. Bookmark the API page for use during class. Later exercises will refer you to this documentation.

This is the end of the Exercise


Hands-On Exercise: Use the Spark Shell

In this exercise, you will start the Spark Shell and view the SparkContext object.

You may choose to do this exercise using either Scala or Python. Follow the instructions below for Python, or skip to the next section for Scala.

Most of the later exercises assume you are using Python, but Scala solutions are provided on your virtual machine, so you should feel free to use Scala if you prefer.

Using the Python Spark Shell

1. In a terminal window, start the pyspark shell:

$ pyspark

You may get several INFO and WARNING messages, which you can disregard. If you don't see the In [n]> prompt after a few seconds, hit Return a few times to clear the screen output.

Note: Your environment is set up to use IPython shell by default. If you would prefer to

use the regular Python shell, set IPYTHON=0 before starting pyspark.

4. Spark creates a SparkContext object for you called sc. Make sure the object exists:

pyspark> sc

Pyspark will display information about the sc object such as:

<pyspark.context.SparkContext at 0x2724490>

5. Using command completion, you can see all the available SparkContext methods: type sc. (sc followed by a dot) and then the [TAB] key.

6. You can exit the shell by hitting Ctrl-D or by typing exit.


Using the Scala Spark Shell

7. In a terminal window, start the Scala Spark Shell:

$ spark-shell

You may get several INFO and WARNING messages, which you can disregard. If you don't see the scala> prompt after a few seconds, hit Enter a few times to clear the screen output.

8. Spark creates a SparkContext object for you called sc. Make sure the object exists:

scala> sc

Scala will display information about the sc object such as:

res0: org.apache.spark.SparkContext =

org.apache.spark.SparkContext@2f0301fa

9. Using command completion, you can see all the available SparkContext methods: type sc. (sc followed by a dot) and then the [TAB] key.

10. You can exit the shell by hitting Ctrl-D or typing exit.

This is the end of the Exercise


Hands-On Exercise: Use RDDs to Transform a Dataset

Files and Data Used in This Exercise:

Data files (local):

~/training_materials/sparkdev/data/frostroad.txt

~/training_materials/sparkdev/data/weblogs/2013-09-15.log

Solutions:

~/training_materials/sparkdev/solutions/LogIPs.pyspark

~/training_materials/sparkdev/solutions/LogIPs.scalaspark

In this exercise you will practice using RDDs in the Spark Shell.

You will start by reading a simple text file. Then you will use Spark to explore and transform the Apache web server output logs of the customer service site of a fictional mobile phone service provider called Loudacre.

Loading and Viewing a Text File

1. Review the simple text file we will be using by viewing (without editing) the file in a text editor. The file is located at: ~/training_materials/sparkdev/data/frostroad.txt

2. Start the Spark Shell if you exited it from the previous exercise. You may use either Scala (spark-shell) or Python (pyspark).

3. Define an RDD to be created by reading in a simple test file. For Python, enter:

pyspark> mydata = sc.textFile(\

"file:/home/training/training_materials/sparkdev/\

data/frostroad.txt")

Or for Scala, enter:


scala> val mydata = sc.textFile(
"file:/home/training/training_materials/sparkdev/data/frostroad.txt")

• Note: For the remainder of the Hands-On Exercises, note the color coding and prompt in exercise text snippets to follow the instructions for whichever language you are using.

4. Note that Spark has not yet read the file. It will not do so until you perform an operation on the RDD. Try counting the number of lines in the dataset:

pyspark> mydata.count()

scala> mydata.count()

The count operation causes the RDD to be materialized (created and populated), after which the result (23) is displayed. The example below shows the output for Pyspark (Scala produces the same result, but the output format will differ slightly):

Out[2]: 23

5. Try executing the collect operation to display the data in the RDD. Note that this returns and displays the entire dataset. This is convenient for very small RDDs like this one, but be careful using collect for large datasets, which are common when using Spark.

pyspark> mydata.collect()

scala> mydata.collect()

6. Using command completion, you can see all the available transformations and operations you can perform on an RDD. Type mydata. and then the [TAB] key.


Exploring the Loudacre Web Log Files

In this exercise, you will be using data in ~/training_materials/sparkdev/data/weblogs. Initially you will work with the log file from a single day. Later you will work with the full dataset consisting of many days' worth of logs.

7. Review one of the .log files in the directory. Note the format of the lines, for example:

116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100] "GET /KBDOC-00031.html HTTP/1.0" 200 1388 "http://www.loudacre.com" "Loudacre CSR Browser"

Here 116.180.70.237 is the IP address, 128 is the user ID, and GET /KBDOC-00031.html HTTP/1.0 is the request.

8. In the previous example you used a local data file. In the real world, you will almost always be working with data on the HDFS cluster instead. Create an HDFS directory for the course data, then copy the weblogs dataset into it. In a separate terminal window (not your Spark shell), execute:

$ hdfs dfs -mkdir /loudacre

$ hdfs dfs -put \

~/training_materials/sparkdev/data/weblogs/ \

/loudacre/

9. In your Spark Shell, set a variable for the data file so you do not have to retype it each time.

pyspark> logfile="/loudacre/weblogs/2013-09-15.log"

scala> val logfile="/loudacre/weblogs/2013-09-15.log"

10. Create an RDD from the data file.


pyspark> logs = sc.textFile(logfile)

scala> val logs = sc.textFile(logfile)

11. Create an RDD containing only those lines that are requests for JPG files.

pyspark> jpglogs=\

logs.filter(lambda x: ".jpg" in x)

scala> val jpglogs = logs.

filter(line => line.contains(".jpg"))

12. View the first 10 lines of the data using take:

pyspark> jpglogs.take(10)

scala> jpglogs.take(10)

13. Sometimes you do not need to store intermediate data in a variable, in which case you can combine the steps into a single line of code. For instance, if all you need is to count the number of JPG requests, you can execute this in a single command:

pyspark> sc.textFile(logfile).filter(lambda x: \

".jpg" in x).count()

scala> sc.textFile(logfile).filter(line =>

line.contains(".jpg")).count()

14. Now try using the map function to define a new RDD. Start with a very simple map that returns the length of each line in the log file.


pyspark> logs.map(lambda s: len(s)).take(5)

scala> logs.map(line => line.length).take(5)

This prints out an array of five integers corresponding to the first five lines in the file.

15. That's not very useful. Instead, try mapping to an array of words for each line:

pyspark> logs.map(lambda s: s.split()).take(5)

scala> logs.map(line => line.split(' ')).take(5)

This time it prints out five arrays, each containing the words in the corresponding log file line.

16. Now that you know how map works, define a new RDD containing just the IP addresses from each line in the log file. (The IP address is the first field in each line.)

pyspark> ips = logs.map(lambda s: s.split()[0])

pyspark> ips.take(5)

scala> val ips = logs.map(line => line.split(' ')(0))

scala> ips.take(5)

17. Although take and collect are useful ways to look at data in an RDD, their output is sometimes not very readable. Fortunately, though, they return arrays, which you can iterate through:

pyspark> for x in ips.take(5): print x

scala> ips.take(5).foreach(println)


18. Finally, save the list of IP addresses as a text file:

pyspark> ips.saveAsTextFile("/loudacre/iplist")

scala> ips.saveAsTextFile("/loudacre/iplist")

19. In a terminal window, list the contents of the /loudacre/iplist directory in HDFS:

$ hdfs dfs -ls /loudacre/iplist

20. You should see multiple files. The one you care about is part-00000, which should contain the list of IP addresses. "Part" (partition) files are numbered because there may be results from multiple tasks running on the cluster; you will learn more about this later.

If You Have More Time

If you have more time, attempt the following challenges:

21. Challenge 1: As you did in the previous step, save a list of IP addresses, but this time, use the whole web log dataset (weblogs/*) instead of a single day's log.

• Tip: You can use the up-arrow to edit and execute previous commands. You should only need to modify the lines that read and save the files.

22. Challenge 2: Use RDD transformations to create a dataset consisting of the IP address and corresponding user ID for each request for an HTML file. (Disregard requests for other file types.) The user ID is the third field in each log file line.

Display the data in the form ipaddress/userid, such as:


165.32.101.206/8

100.219.90.44/102

182.4.148.56/173

246.241.6.175/45395

175.223.172.207/4115
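For reference, one possible pyspark approach is sketched below. It is a sketch only (the provided LogIPs solution files may differ), and it assumes that matching the substring ".html" is enough to identify HTML requests in this dataset:

# Sketch for Challenge 2: (ipaddress/userid) for each HTML request.
htmlreqs = sc.textFile("/loudacre/weblogs/*") \
    .filter(lambda line: ".html" in line) \
    .map(lambda line: line.split()[0] + "/" + line.split()[2])

for entry in htmlreqs.take(10):
    print entry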

This is the end of the Exercise


Hands-On Exercise: Process Data Files with Spark

Files and Data Used in This Exercise:

Data files (local):

~/training_materials/sparkdev/data/activations/*

~/training_materials/sparkdev/data/devicestatus.txt (Bonus)

Stubs: stubs/ActivationModels.pyspark

stubs/ActivationModels.scalaspark

Solutions: solutions/ActivationModels.pyspark

solutions/ActivationModels.scalaspark

solutions/DeviceStatusETL.pyspark (Bonus)

solutions/DeviceStatusETL.scalaspark (Bonus)

In this exercise you will parse a set of activation records in XML format to extract the account numbers and model names.

One of the common uses for Spark is doing data Extract/Transform/Load operations. Sometimes data is stored in line-oriented records, like the web logs in the previous exercise, but sometimes the data is in a multi-line format that must be processed as a whole file. In this exercise you will practice working with file-based instead of line-based formats.

Reviewing the API Documentation for RDD Operations

Visit the Spark API page you bookmarked previously. Follow the link at the top for the RDD class and review the list of available methods.

The Data

1. Review the data in activations (in the course data directory). Each XML file contains data for all the devices activated by customers during a specific month.


Sample input data:

<activations>

<activation timestamp="1225499258" type="phone">

<account-number>316</account-number>

<device-id>

d61b6971-33e1-42f0-bb15-aa2ae3cd8680

</device-id>

<phone-number>5108307062</phone-number>

<model>iFruit 1</model>

</activation>

</activations>

2. Copy this data to HDFS:

$ hdfs dfs -put \

~/training_materials/sparkdev/data/activations \

/loudacre/

The Task

Your code should go through a set of activation XML files, extract the account number and device model for each activation, and save the list to a file as account_number:model.

The output will look something like:


1234:iFruit 1

987:Sorrento F00L

4566:iFruit 1

3. Start with the ActivationModels stub script. (A stub is provided for Scala and Python; use whichever language you prefer.) Note that for convenience you have been provided with functions to parse the XML, as that is not the focus of this exercise. Copy the stub code into the Spark Shell.

4. Use wholeTextFiles to create an RDD from the activations dataset. The resulting RDD will consist of tuples, in which the first value is the name of the file, and the second value is the contents of the file (XML) as a string.

5. Each XML file can contain many activation records; use flatMap to map the contents of each file to a collection of XML records by calling the provided getactivations function. getactivations takes an XML string, parses it, and returns a collection of XML records; flatMap maps each record to a separate RDD element.

6. Map each activation record to a string in the format account-number:model. Use the provided getaccount and getmodel functions to find the values from the activation record.

7. Save the formatted strings to a text file in the directory /loudacre/account-models.
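Putting steps 4–7 together, a minimal pyspark sketch might look like the following. It assumes the getactivations, getaccount, and getmodel functions from the provided stub have already been pasted into the shell; the ActivationModels solution file is the authoritative version.

# Sketch only -- relies on the XML parsing helpers defined in the stub.
actfiles = sc.wholeTextFiles("/loudacre/activations")        # (filename, file contents) pairs
activations = actfiles.flatMap(lambda pair: getactivations(pair[1]))
accountmodels = activations.map(lambda record: getaccount(record) + ":" + getmodel(record))
accountmodels.saveAsTextFile("/loudacre/account-models")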

Bonus Exercise

Another common part of the ETL process is data scrubbing. In this bonus exercise, you will process data in order to get it into a standardized format for later processing.

Review the contents of the data file devicestatus.txt. This file contains data collected from mobile devices on Loudacre's network, including device ID, current status, location, and so on. Because Loudacre previously acquired other mobile providers' networks, the data from different subnetworks has a different format. Note that the records in this file


have different field delimiters: some use commas, some use pipes (|), and so on. Your tasks are to:

• Load the dataset

• Determine which delimiter to use (hint: the character at position 19 is the first use of the delimiter)

• Filter out any records which do not parse correctly (hint: each record should have exactly 14 values)

• Extract the date (first field), model (second field), device ID (third field), and latitude and longitude (13th and 14th fields respectively)

• The second field contains the device manufacturer and model name (e.g., Ronin S2). Split this field by spaces to separate the manufacturer from the model (e.g., manufacturer Ronin, model S2).

• Save the extracted data to comma-delimited text files in the /loudacre/devicestatus_etl directory on HDFS.

• Confirm that the data in the file(s) was saved correctly.
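A minimal pyspark sketch of one way to approach this bonus follows. Field positions are taken from the bullet list above; the provided DeviceStatusETL solution file is the authoritative version and may keep a slightly different set of fields.

# Sketch only -- the delimiter is the character at position 19 of each line.
def split_record(line):
    return line.split(line[19])

def extract_fields(fields):
    # date, manufacturer, model, device ID, latitude, longitude
    maker_model = fields[1].split(' ', 1)
    manufacturer = maker_model[0]
    model = maker_model[1] if len(maker_model) > 1 else ''
    return (fields[0], manufacturer, model, fields[2], fields[12], fields[13])

devstatus = sc.textFile("/loudacre/devicestatus.txt")
devstatus.map(split_record) \
    .filter(lambda fields: len(fields) == 14) \
    .map(extract_fields) \
    .map(lambda values: ','.join(values)) \
    .saveAsTextFile("/loudacre/devicestatus_etl")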

This is the end of the Exercise


Hands-On Exercise: Use Pair RDDs to Join Two Datasets

Files Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

Data files (local):

~/training_materials/sparkdev/data/accounts.csv

Solution: solutions/UserRequests.pyspark

solutions/UserRequests.scalaspark

In this exercise you will continue exploring the Loudacre web server log files, as well as the Loudacre user account data, using key-value Pair RDDs.

Exploring Web Log Files

Continue working with the web log files, as in the previous exercise.

Tip: In this exercise you will be reducing and joining large datasets, which can take a lot of time. You may wish to perform the exercises below using a smaller dataset, consisting of only a few of the web log files, rather than all of them. Remember that you can specify a wildcard; textFile("/loudacre/weblogs/*6.log") would include only filenames ending with the digit 6 and having a log file extension.

1. Using map and reduce, count the number of requests from each user.

a. Use map to create a Pair RDD with the user ID as the key, and the integer 1 as the value. (The user ID is the third field in each line.) Your data will look something like this:

(userid,1) (userid,1) (userid,1) …


b. Use reduce to sum the values for each user ID. Your RDD data will be similar to:

(userid,5) (userid,7) (userid,2) …

2. Use countByKey to determine how many users visited the site for each frequency. That is, how many users visited once, twice, three times, and so on.

a. Use map to reverse the key and value, like this:

(5,userid) (7,userid) (2,userid) …

b. Use the countByKey action to return a Map of frequency:user-count pairs.

3. Create an RDD where the user ID is the key, and the value is the list of all the IP addresses that user has connected from. (The IP address is the first field in each request line.)

• Hint: Map to (userid, ipaddress) pairs, such as:

(userid,20.1.34.55) (userid,245.33.1.1) (userid,65.50.196.141) …

and then use groupByKey to produce records like:

(userid,[20.1.34.55, 74.125.239.98]) (userid,[75.175.32.10, 245.33.1.1, 66.79.233.99]) (userid,[65.50.196.141]) …
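A condensed pyspark sketch of steps 1–3 follows (the UserRequests solution file covers the same ground and may differ in detail):

# Sketch only -- per-user hit counts, visit-frequency summary, and IP lists.
weblogs = sc.textFile("/loudacre/weblogs/*")                  # or a smaller subset, as suggested above
userreqs = weblogs.map(lambda line: (line.split()[2], 1)) \
                  .reduceByKey(lambda v1, v2: v1 + v2)        # step 1: (userid, hitcount)
freqcounts = userreqs.map(lambda pair: (pair[1], pair[0])) \
                     .countByKey()                            # step 2: {hitcount: number of users}
userips = weblogs.map(lambda line: (line.split()[2], line.split()[0])) \
                 .groupByKey()                                # step 3: (userid, [ip, ip, ...])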


Joining Web Log Data with Account Data

4. Copy the accounts.csv data file to HDFS:

$ hdfs dfs -put \

~/training_materials/sparkdev/data/accounts.csv \

/loudacre/

This dataset consists of information about Loudacre's user accounts. The first field in each line is the user ID, which corresponds to the user ID in the web server logs. The other fields include account details such as creation date, first and last name, and so on.

5. Join the accounts data with the weblog data to produce a dataset keyed by user ID which contains the user account information and the number of website hits for that user.

a. Map the accounts data to key/value-list pairs: (userid, [values…]), such as:

(userid1,[userid1,2008-11-24 10:04:08,\N,Cheryl,West,4905 Olive Street,San Francisco,CA,…]) (userid2,[userid2,2008-11-23 14:05:07,\N,Elizabeth,Kerns,4703 Eva Pearl Street,Richmond,CA,…]) (userid3,[userid3,2008-11-02 17:12:12,2013-07-18 16:42:36,Melissa,Roman,3539 James Martin Circle,Oakland,CA,…]) …

b. Join the Pair RDD with the set of userid/hit counts calculated in the first step, producing records like:

(userid1,([userid1,2008-11-24 10:04:08,\N,Cheryl,West,4905 Olive Street,San Francisco,CA,…],4)) (userid2,([userid2,2008-11-23 14:05:07,\N,Elizabeth,Kerns,4703 Eva Pearl Street,Richmond,CA,…],8)) (userid3,([userid3,2008-11-02 17:12:12,2013-07-18 16:42:36,Melissa,Roman,3539 James Martin Circle,Oakland,CA,…],1)) …


c. Display the user ID, hit count, and first name (3rd value) and last name (4th value) for the first 5 elements, e.g.:

userid1 4 Cheryl West

userid2 8 Elizabeth Kerns

userid3 1 Melissa Roman
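A sketch of step 5 in pyspark, assuming the userreqs RDD from step 1 is still defined (field positions follow the accounts.csv layout shown above):

# Sketch only -- join account details with per-user hit counts.
accounts = sc.textFile("/loudacre/accounts.csv") \
             .map(lambda line: (line.split(',')[0], line.split(',')))  # (userid, [values...])
accounthits = accounts.join(userreqs)                                  # (userid, ([values...], hitcount))

for userid, (values, hitcount) in accounthits.take(5):
    print userid, hitcount, values[3], values[4]                       # first and last name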

Bonus Exercises

If you have more time, attempt the following challenges:

6. Challenge 1: Use keyBy to create an RDD of account data with the postal code (9th field in the CSV file) as the key.

• Tip: Assign this new RDD to a variable for use in the next challenge.

7. Challenge 2: Create a pair RDD with postal code as the key and a list of names (Last Name,First Name) in that postal code as the value.

• Hint: First name and last name are the 4th and 5th fields respectively.

• Optional: Try using the mapValues operation.

8. Challenge 3: Sort the data by postal code, then for the first five postal codes, display the code and list the names in that postal zone, such as:


--- 85003

Jenkins,Thad

Rick,Edward

Lindsay,Ivy

--- 85004

Morris,Eric

Reiser,Hazel

Gregg,Alicia

Preston,Elizabeth
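One possible pyspark sketch for the three bonus challenges (field positions assume the accounts.csv layout shown earlier; the class solution may differ):

# Sketch only -- key accounts by postal code, build name lists, show the first five codes.
accountsByPCode = sc.textFile("/loudacre/accounts.csv") \
    .map(lambda line: line.split(',')) \
    .keyBy(lambda values: values[8])                        # Challenge 1: 9th field is the postal code
namesByPCode = accountsByPCode \
    .mapValues(lambda values: values[4] + ',' + values[3]) \
    .groupByKey()                                           # Challenge 2: (postalcode, [Last,First ...])
for pcode, names in namesByPCode.sortByKey().take(5):       # Challenge 3: first five postal codes
    print '---', pcode
    for name in names:
        print name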

This is the end of the Exercise


Hands-On Exercise: Write and Run a Spark Application

Files and Directories Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

Scala Project Directory: projects/countjpgs

Scala Classes: stubs.CountJPGs

solution.CountJPGs

Python Stub: stubs/CountJPGs.py

Python Solution: solutions/CountJPGs.py

In this exercise, you will write your own Spark application instead of using the interactive Spark Shell application.

Write a simple program that counts the number of JPG requests in a web log file. The name of the file should be passed into the program as an argument.

This is the same task you did earlier in the "Getting Started With RDDs" exercise. The logic is the same, but this time you will need to set up the SparkContext object yourself.

Depending on which programming language you are using, follow the appropriate set of instructions below to write a Spark program.

Before running your program, be sure to exit from the Spark Shell.


Writing a Spark Application in Python

You may use any text editor you wish. If you don’t have an editor preference, you may

wish to use gedit, which includes language-specific support for Python.

1. A simple stub file to get started has been provided: ~/training_materials/sparkdev/stubs/CountJPGs.py. This stub imports the required Spark class and sets up your main code block. Copy this stub to your work area and edit it to complete this exercise.

2. Set up a SparkContext using the following code:

sc = SparkContext()

3. In the body of the program, load the file passed into the program, count the number of JPG requests, and display the count. You may wish to refer back to the "Getting Started with RDDs" exercise for the code to do this.

4. At the end of the program, be sure to call:

sc.stop()

5. Run the program locally, passing the name of the log file to process, such as:

$ spark-submit CountJPGs.py /loudacre/weblogs/*

6. Skip the Scala instructions below and proceed to the section "Starting the Spark Standalone Cluster."
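For reference, a minimal sketch of what the finished Python program might look like (the provided stub and solution may structure it differently):

# CountJPGs.py -- sketch only: count JPG requests in the file(s) named on the command line.
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print >> sys.stderr, "Usage: CountJPGs.py <logfile>"
        sys.exit(1)

    sc = SparkContext()
    count = sc.textFile(sys.argv[1]) \
              .filter(lambda line: ".jpg" in line) \
              .count()
    print "Number of JPG requests:", count
    sc.stop()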

Writing a Spark Application in Scala

You may use any text editor you wish. If you don’t have an editor preference, you may

wish to use gedit, which includes language-specific support for Scala. If you are familiar

with the IntelliJ IDEA IDE, you may choose to use that; the provided project directories

include IntelliJ configuration.


A Maven project to get started has been provided in the projects/countjpgs directory.

7. Edit the Scala code in src/main/scala/stubs/CountJPGs.scala.

8. Set up a SparkContext using the following code:

val sc = new SparkContext()

9. In the body of the program, load the file passed into the program, count the number of JPG requests, and display the count. You may wish to refer back to the "Getting Started with RDDs" exercise for the code to do this.

10. At the end of the program, be sure to call:

sc.stop

11. From the countjpgs working directory, build your project using the following command:

$ mvn package

12. If the build is successful, it will generate a JAR file called countjpgs-1.0.jar in countjpgs/target. Run the program using the following command:

$ spark-submit \

--class stubs.CountJPGs \

target/countjpgs-1.0.jar /loudacre/weblogs/*

• Note: Use --class solution.CountJPGs to run the solution instead.

Starting the Spark Standalone Cluster

13. In a terminal window, start the Spark Master and Spark Worker daemons:


$ sudo service spark-master start

$ sudo service spark-worker start

Note: You can stop the services by replacing start with stop, or force the service to restart by using restart. You may need to do this if you suspend and restart the VM.

14. View the Spark Standalone Cluster UI: Start Firefox on your VM and visit the Spark Master UI by using the provided bookmark or visiting http://localhost:18080/.

15. You should not see any applications in the Running Applications or Completed Applications areas because you have not run any applications on the cluster yet.

16. A real-world Spark cluster would have several workers configured. In this class we have just one, running locally, which is named by the date it started, the host it is running on, and the port it is listening on.

17. Click on the worker ID link to view the Spark Worker UI and note that there are no executors currently running on the node.

18. In the previous section, you ran your application locally, because you did not specify a master when starting it. Re-run the program, specifying the cluster master in order to run it on the cluster.

For Python:

$ spark-submit --master spark://localhost:7077 \

CountJPGs.py /loudacre/weblogs/*

For Scala:


$ spark-submit \

--class stubs.CountJPGs \

--master spark://localhost:7077 \

target/countjpgs-1.0.jar /loudacre/weblogs/*

19. Visit the Standalone Spark Master UI and confirm that the program is running on the cluster.

This is the end of the Exercise


Hands-On Exercise: Configure a Spark Application

Files Used in This Exercise:

Data files (HDFS): /loudacre/weblogs

Properties files (local): spark.conf

log4j.properties

In this exercise, you will practice setting various Spark configuration options.

You will work with the CountJPGs program you wrote in the prior exercise.

Setting Configuration Options at the Command Line

1. Re-run the CountJPGs Python or Scala program you wrote in the previous exercise, this time specifying an application name. For example:

$ spark-submit --master spark://localhost:7077 \

--name 'Count JPGs' \

CountJPGs.py /loudacre/weblogs/*

$ spark-submit \

--class stubs.CountJPGs \

--master spark://localhost:7077 \

--name 'Count JPGs' \

target/countjpgs-1.0.jar /loudacre/weblogs/*

2. Visit the Standalone Spark Master UI (http://localhost:18080/) and note that the application name listed is the one specified in the command line.


3. Optional: While the application is running, visit the Spark Application UI and view the Environment tab. Take note of the spark.* properties such as master, appName, and driver properties.

Setting Configuration Options in a Configuration File

4. Change directories to your working directory. (If you are working in Scala, that is the countjpgs project directory.)

5. Using a text editor, create a file in the working directory called myspark.conf, containing settings for the properties shown below:

spark.app.name My Spark App

spark.master yarn-client

spark.executor.memory 400M

6. Re-run your application, this time using the properties file instead of using the script options to configure Spark properties:

$ spark-submit --properties-file myspark.conf \

CountJPGs.py /loudacre/weblogs/*

$ spark-submit --properties-file myspark.conf \

--class stubs.CountJPGs \

target/countjpgs-1.0.jar /loudacre/weblogs/*

7. While the application is running, view the Standalone Spark Master UI to confirm that the application name is correctly displayed as "My Spark App".


Setting Logging Levels

8. Copy the template file /etc/spark/conf/log4j.properties.template to log4j.properties in your working directory.

9. Edit log4j.properties. The first line currently reads:

log4j.rootCategory=INFO, console

Replace INFO with DEBUG:

log4j.rootCategory=DEBUG, console

10. Re-run your Spark application. Because the current directory is on the Java classpath, your log4j.properties file will set the logging level to DEBUG.

11. Notice that the output now contains both the INFO messages it did before and DEBUG messages, similar to what is shown below:

15/03/19 11:40:45 INFO MemoryStore: ensureFreeSpace(154293) called with

curMem=0, maxMem=311387750

15/03/19 11:40:45 INFO MemoryStore: Block broadcast_0 stored as values to

memory (estimated size 150.7 KB, free 296.8 MB)

15/03/19 11:40:45 DEBUG BlockManager: Put block broadcast_0 locally took

79 ms

15/03/19 11:40:45 DEBUG BlockManager: Put for block broadcast_0 without

replication took 79 ms

Debug logging can be useful when debugging, testing, or optimizing your code, but in most cases it generates unnecessarily distracting output.

12. Edit the log4j.properties file to replace DEBUG with WARN and try again. This time notice that no INFO or DEBUG messages are displayed; only WARN messages.

13. You can also set the log level for the Spark Shell by placing the log4j.properties file in your working directory before starting the shell. Try starting the shell from the directory in which you placed the file and note that only WARN messages now appear.


Note: During the rest of the exercises, you may change these settings depending on whether you find the extra logging messages helpful or distracting.

This is the end of the Exercise


Hands-On Exercise: View Jobs and Stages in the Spark Application UI

Files and Data Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

/loudacre/accounts.csv

Solutions: solutions/SparkStages.pyspark

solutions/SparkStages.scalaspark

In this exercise you will use the Spark Application UI to view the execution stages for a job.

In a previous exercise, you wrote a script in the Spark Shell to join data from the accounts dataset with the weblogs dataset, in order to determine the total number of web hits for every account. Now you will explore the stages and tasks involved in that job.

Exploring Partitioning of File-Based RDDs

1. Start (or restart, if necessary) the Spark Shell. Although you would typically run a Spark application on a cluster, your course VM cluster has only a single worker node that can support only a single executor. To simulate a more realistic multi-node cluster, run in local mode with two threads. For Python:

$ pyspark --master local[2]

or for Scala:

$ spark-shell --master local[2]


2. Create an RDD based on the accounts data file (/loudacre/accounts.csv) and then call toDebugString on the RDD, which displays the number of partitions in parentheses () before the RDD ID. How many partitions are in the resulting RDD?

pyspark> accounts=sc.textFile("/loudacre/accounts.csv")

pyspark> print accounts.toDebugString()

scala> val accounts=sc.

textFile("/loudacre/accounts.csv")

scala> accounts.toDebugString

3. Repeat this process, but specify a minimum of three partitions: textFile(filename, 3). Does the RDD correctly have three partitions?

4. Create another RDD based on all the weblogs dataset files (/loudacre/weblogs/*) and then call toDebugString on the RDD. How many partitions are in the weblogs RDD?

pyspark> weblogs=sc.textFile("/loudacre/weblogs/*")

pyspark> print weblogs.toDebugString()

scala> val weblogs=sc.textFile("/loudacre/weblogs/*")

scala> weblogs.toDebugString

How does the number of files in the dataset compare to the number of partitions in the RDD?

5. Repeat this process, but specify only a subset of the files: those for the month of October in 2013, /loudacre/weblogs/2013-10-*.log.

6. Bonus: Use foreachPartition to print out the first record of each partition.
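A possible sketch for the bonus, assuming the weblogs RDD defined above (in local mode the output appears in the shell; on a real cluster it would go to the executor logs):

# Sketch only -- print the first record of each partition.
def print_first(records):
    for record in records:
        print record
        break                      # stop after the first record in this partition

weblogs.foreachPartition(print_first)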


Setting up the Job

7. First, create an RDD of accounts, keyed by ID and with first name, last name for the value.

pyspark> accountsByID = accounts \

.map(lambda s: s.split(',')) \

.map(lambda values: \

(values[0],values[4] + ',' + values[3]))

scala> val accountsByID = accounts.

map(line => line.split(',')).

map(values => (values(0),values(4)+','+values(3)))

8. Construct an RDD with the total number of web hits for each user ID:

pyspark> userreqs = weblogs \

.map(lambda line: line.split()) \

.map(lambda words: (words[2],1)) \

.reduceByKey(lambda v1,v2: v1+v2)

scala> val userreqs = weblogs.

map(line => line.split(' ')).

map(words => (words(2),1)).

reduceByKey((v1,v2) => v1 + v2)

9. Then join the two RDDs by user ID, and construct a new RDD based on first name, last name, and total hits:

pyspark> accounthits = accountsByID.join(userreqs)\

.values()


scala> val accounthits =

accountsByID.join(userreqs).values

10. Print the results of accounthits.toDebugString and review the output. Based on this, see if you can determine:

a. How many stages are in this job?

b. Which stages are dependent on which?

c. How many tasks will each stage consist of?

Running and Reviewing the Job in the Spark Application UI

11. In your browser, visit the Spark Application UI by using the provided toolbar bookmark, or by visiting http://localhost:4040/.

12. In the Spark UI, make sure the Jobs tab is selected. No jobs are yet running so the list will be empty.

13. Return to the shell and start the job by executing an action (saveAsTextFile):

pyspark> accounthits.\

saveAsTextFile("/loudacre/userreqs")

scala> accounthits.

saveAsTextFile("/loudacre/userreqs")

14. Reload the Spark UI Jobs page in your browser. Your job will appear in the Active Jobs list until it completes, and then it will display in the Completed Jobs list.


15. Click on the job description (which is the last action in the job) to see the stages. As the job progresses you may want to refresh the page a few times.

Things to note:

a. How many stages are in the job? Does it match the number you expected from the RDD's toDebugString output?

b. The stages are numbered, but the numbers do not relate to the order of execution. Note the times the stages were submitted to determine the order. Does the order match what you expected based on RDD dependency?

c. How many tasks are in each stage? The number of tasks in the first stages corresponds to the number of partitions.

d. The Shuffle Read and Shuffle Write columns indicate how much data was copied between tasks. This is useful to know because copying too much data across the network can cause performance issues.

16. Click on a stage to view details about that stage.

Things to note:

a. The Summary Metrics area shows you how much time was spent on various steps. This can help you narrow down performance problems.

b. The Tasks area lists each task. The Locality Level column indicates whether the process ran on the same node where the partition was physically stored or not. Remember that Spark will attempt to always run tasks where the data is, but may not always be able to, if the node is busy.

c. In a real-world cluster, the executor column in the Tasks area would display the different worker nodes that ran the tasks. In this single-node cluster, all tasks run on the same host: localhost.

17. When the job is complete, return to the Jobs tab to see the final statistics for the number of tasks executed and the time the job took.

18. Optional: Try re-running the last action. You will need to either delete the saveAsTextFile output directory in HDFS, or specify a different directory name.


You will probably find that the job completes much faster, and that several stages (and the tasks in them) show as "skipped."

Bonus question: Which tasks were skipped and why?

Leave the Spark Shell running for the next exercise.

This is the end of the Exercise


Hands-On Exercise: Persist an RDD

Files and Data Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

/loudacre/accounts.csv

Job Setup: solutions/SparkStages.pyspark

solutions/SparkStages.scalaspark

In this exercise you will explore the performance effect of caching (that is, persisting to memory) an RDD.

1. Make sure the Spark Shell is still running from the last exercise. If it isn't, restart it (in local mode with two threads) and paste in the job setup code from the solution file or the previous exercise.

2. This time, to start the job, you are going to perform a slightly different action than last time: count the number of user accounts with a total hit count greater than five:

pyspark> accounthits\

.filter(lambda (firstlast,hitcount): hitcount > 5)\

.count()

scala> accounthits.filter(pair => pair._2 > 5).count()

3. Cache (persist to memory) the RDD by calling accounthits.persist().

4. In your browser, view the Spark Application UI and select the Storage tab. At this point, you have marked your RDD to be persisted, but have not yet performed an action that would cause it to be materialized and persisted, so you will not yet see any persisted RDDs.

5. In the Spark Shell, execute the count again.


6. View the RDD's toDebugString. Notice that the output indicates the persistence level selected.

7. Reload the Storage tab in your browser, and this time note that the RDD you persisted is shown. Click on the RDD ID to see details about partitions and persistence.

8. Click on the Executors tab and take note of the amount of memory used and available for your one worker node.

Note that the classroom environment has a single worker node with a small amount of memory allocated, so you may see that not all of the dataset is actually cached in memory. In the real world, for good performance a cluster will have more nodes, each with more memory, so that more of your active data can be cached.

9. Optional: Set the RDD's persistence level to StorageLevel.DISK_ONLY and compare the storage report in the Spark Application Web UI. Hint: Because you have already persisted the RDD at a different level, you will need to unpersist() first before you can set a new level.
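A sketch of the optional step in pyspark:

pyspark> from pyspark import StorageLevel
pyspark> accounthits.unpersist()
pyspark> accounthits.persist(StorageLevel.DISK_ONLY)
pyspark> accounthits.count()      # re-run an action so the RDD is materialized again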

This is the end of the Exercise


Hands-On Exercise: Implement an Iterative Algorithm

Files and Data Used in This Exercise:

Data files (HDFS): /loudacre/devicestatus_etl/

Stubs: stubs/KMeansCoords.pyspark

stubs/KMeansCoords.scalaspark

Solutions: solutions/KMeansCoords.pyspark

solutions/KMeansCoords.scalaspark

In this exercise, you will practice implementing iterative algorithms in Spark by calculating k-means for a set of points.

Reviewing the Data

In the bonus section of the "Use RDDs to Transform a Dataset" exercise, you used Spark to extract the date, maker, device ID, latitude, and longitude from the devicestatus.txt data file, and store the results in the HDFS directory /loudacre/devicestatus_etl.

If you did not have time to complete that bonus exercise, run the solution script now, following the two steps below.

• Copy ~/training_materials/sparkdev/data/devicestatus.txt to the /loudacre/ directory in HDFS.

• Run the Spark script ~/training_materials/sparkdev/solutions/DeviceStatusETL (either .pyspark or .scalaspark, depending on which language you are using).


Examine the data in the dataset. Note that the latitude and longitude are the 4th and 5th fields, respectively, such as:

2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253

2014-03-15:10:10:20,MeeToo,ef8c7564-0a1a-4650-a655-c8bbd5f8f943,37.4321088904,-121.485029632

Calculate k-means for Device Location

If you are already familiar with calculating k-means, try doing the exercise on your own. Otherwise, follow the step-by-step process below.

1. Start by copying the provided KMeansCoords stub file, which contains the following convenience functions used in calculating k-means:

• closestPoint: given a (latitude/longitude) point and an array of current center points, returns the index in the array of the center closest to the given point

• addPoints: given two points, return a point which is the sum of the two points – that is, (x1+x2, y1+y2)

• distanceSquared: given two points, returns the squared distance of the two. This is a common calculation required in graph analysis.

2. Set the variable K (the number of means to calculate). For this exercise use K = 5.

3. Set the variable convergeDist. This will be used to decide when the k-means calculation is done – when the amount the locations of the means change between iterations is less than convergeDist. A "perfect" solution would be 0; this number represents a "good enough" solution. For this exercise, use a value of 0.1.

4. Parse the input file, which is delimited by a comma character, into (latitude, longitude) pairs (the 4th and 5th fields in each line). Only include known locations (that is, filter out (0,0) locations). Be sure to persist (cache) the resulting RDD because you will access it each time through the iteration.


5. Create a K-length array called kPoints by taking a random sample of K location points from the RDD as starting means (center points); for example:

data.takeSample(False, K, 42)

6. Iteratively calculate a new set of K means until the total distance between the means calculated for this iteration and the last is smaller than convergeDist. For each iteration:

a. For each coordinate point, use the provided closestPoint function to map each point to the index in the kPoints array of the location closest to that point. The resulting RDD should be keyed by the index, and the value should be the pair (point, 1). The value "1" will later be used to count the number of points closest to a given mean; for example:

(1, ((37.43210, -121.48502), 1))

(4, ((33.11310, -111.33201), 1))

(0, ((39.36351, -119.40003), 1))

(1, ((40.00019, -116.44829), 1))

b. Reduce the result: for each center in the kPoints array, sum the latitudes and longitudes, respectively, of all the points closest to that center, and the number of closest points. For example:

(0, ((2638919.87653,-8895032.182481), 74693))

(1, ((3654635.24961,-12197518.55688), 101268))

(2, ((1863384.99784,-5839621.052003), 48620))

(3, ((4887181.82600,-14674125.94873), 126114))

(4, ((2866039.85637,-9608816.13682), 81162))

c. The reduced RDD should have (at most) K members. Map each to a new center point by calculating the average latitude and longitude for each set of closest points: that is, map (index, ((totalX, totalY), n)) to (index, (totalX/n, totalY/n)).

d. Collect these new points into a local map or array keyed by index.

e. Use the provided distanceSquared method to calculate how much each center "moved" between the current iteration and the last. That is, for each center in kPoints, calculate the distance between that point and the corresponding new point, and sum those distances. That is the delta between iterations; when the delta is less than convergeDist, stop iterating.

f. Copy the new center points to the kPoints array in preparation for the next iteration.

7. When the iteration is complete, display the final K center points.
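Tying the steps together, a condensed pyspark sketch of the iteration might look like the following. It relies on the closestPoint, addPoints, and distanceSquared functions from the provided stub, and on the field positions described above; the KMeansCoords solution file is the authoritative version.

# Sketch only -- k-means over (latitude, longitude) pairs.
K = 5
convergeDist = 0.1

# Latitude and longitude are the 4th and 5th comma-delimited fields; drop unknown (0,0) locations.
points = sc.textFile("/loudacre/devicestatus_etl/*") \
    .map(lambda line: line.split(',')) \
    .map(lambda fields: (float(fields[3]), float(fields[4]))) \
    .filter(lambda point: point != (0.0, 0.0)) \
    .persist()

kPoints = points.takeSample(False, K, 42)       # starting center points
tempDist = float("inf")

while tempDist > convergeDist:
    # a. index of the closest current center, paired with (point, 1)
    closest = points.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    # b. sum the coordinates and the counts of all points assigned to each center
    pointStats = closest.reduceByKey(
        lambda a, b: (addPoints(a[0], b[0]), a[1] + b[1]))
    # c/d. new center = average of the assigned points, collected to the driver
    newPoints = pointStats.map(
        lambda pair: (pair[0], (pair[1][0][0] / pair[1][1],
                                pair[1][0][1] / pair[1][1]))).collect()
    # e. total movement of the centers in this iteration
    tempDist = sum(distanceSquared(kPoints[index], p) for (index, p) in newPoints)
    # f. install the new centers for the next iteration
    for (index, p) in newPoints:
        kPoints[index] = p

print "Final center points:", kPoints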

This is the end of the Exercise


Hands-On Exercise: Use Broadcast Variables

Files Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

Data files (local):

~/training_materials/sparkdev/data/targetmodels.txt

Stubs: stubs/TargetModels.pyspark

stubs/TargetModels.scalaspark

Solutions: solutions/TargetModels.pyspark

solutions/TargetModels.scalaspark

In this exercise, you will filter web requests to include only those from devices included in a list of target models.

Loudacre wants to do some analysis on web traffic produced from specific devices. The list of target models is in ~/training_materials/sparkdev/data/targetmodels.txt.

Filter the web server logs to include only those requests from devices in the list. The model name of the device appears in each line of the log file. Use a broadcast variable to pass the list of target devices to the workers that will run the filter tasks.

Hint: Use the stub file for this exercise in ~/training_materials/sparkdev/stubs for the code to load in the list of target models.
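A minimal pyspark sketch of one approach follows. The stub file contains the actual code for reading the target model list; the loading step shown here is an assumption of what it might look like.

# Sketch only -- broadcast the list of target models to the workers.
targetfile = "/home/training/training_materials/sparkdev/data/targetmodels.txt"
targetlist = [line.strip() for line in open(targetfile)]     # local list of model names
targetModelsBC = sc.broadcast(targetlist)

# Keep only those requests whose log line mentions one of the target models.
targetreqs = sc.textFile("/loudacre/weblogs/*") \
    .filter(lambda line: any(model in line for model in targetModelsBC.value))
print targetreqs.count()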

This is the end of the Exercise


Hands-On Exercise: Use Accumulators

Files Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

Solutions: solutions/RequestAccumulator.pyspark

solutions/RequestAccumulator.scalaspark

In this exercise, you will count the number of different types of files requested in a set of web server logs.

Using accumulators, count the number of each type of file (HTML, CSS, and JPG) requested in the web server log files.

Hint: Use the file extension string to determine the type of request, such as .html, .css, or .jpg.
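A minimal sketch using three accumulators (the RequestAccumulator solution files may take a different approach):

# Sketch only -- count requests for each file type with accumulators.
jpgcount = sc.accumulator(0)
htmlcount = sc.accumulator(0)
csscount = sc.accumulator(0)

def count_type(line):
    if '.jpg' in line:
        jpgcount.add(1)
    elif '.html' in line:
        htmlcount.add(1)
    elif '.css' in line:
        csscount.add(1)

sc.textFile("/loudacre/weblogs/*").foreach(count_type)
print "JPG:", jpgcount.value, "HTML:", htmlcount.value, "CSS:", csscount.value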

This is the end of the Exercise


Hands-On Exercise: Use Spark SQL for ETL

Files and Data Used in this Exercise

MySQL table: loudacre.webpage

Output HDFS directory: /loudacre/webpage_files

Solutions: solutions/SparkSQL-webpage-files.pyspark

solutions/SparkSQL-webpage-files.scalaspark

In this exercise, you will use Spark SQL to load data from MySQL, process it, and store it to HDFS.

Reviewing the Data in MySQL

Review the data currently in the MySQL loudacre.webpage table.

1. List the columns and types in the table:

$ mysql -utraining -ptraining loudacre \

-e"DESCRIBE webpage"

2. View the first few rows from the table:

$ mysql -utraining -ptraining loudacre \

-e"SELECT * FROM webpage LIMIT 5"

Note that the data in the associated_files column is a comma-delimited string. Loudacre would like to make this data available in an Impala table, but in order to perform the required analysis, the associated_files data must be extracted and normalized. Your goal in the next section is to use Spark SQL to extract the data in the column, split the string, and create a new dataset in HDFS containing each web page number and its associated files in separate rows.


Loading the Data from MySQL

3. If necessary, start the Spark Shell.

4. Import the SQLContext class definition, and define a SQL context:

pyspark> from pyspark.sql import SQLContext

pyspark> sqlCtx = SQLContext(sc)

scala> import org.apache.spark.sql.SQLContext

scala> val sqlCtx = new SQLContext(sc)

5. Create a new DataFrame based on the webpage table from the database:

pyspark> webpages = sqlCtx.load(source="jdbc", \
url="jdbc:mysql://localhost/loudacre?user=training&password=training", \
dbtable="webpage")

scala> val webpages = sqlCtx.load("jdbc",
Map("url" -> "jdbc:mysql://localhost/loudacre?user=training&password=training",
"dbtable" -> "webpage"))

6. Examine the schema of the new DataFrame by calling webpages.printSchema().

7. Create a new DataFrame by selecting the web_page_num and associated_files columns from the existing DataFrame:

pyspark> assocfiles = \
webpages.select(webpages.web_page_num,\
webpages.associated_files)


scala> val assocfiles =

webpages.select(webpages("web_page_num"),webpages("associated

_files"))

8. In order to manipulate the data using Spark, convert the DataFrame into a Pair RDD using the map method. The input into the map method is a Row object. The key is the web_page_num value (the first value in the row), and the value is the associated_files string (the second value in the row).

In Python, you can dynamically reference the column value of the row by name:

pyspark> afilesrdd = assocfiles.map(lambda row: \
             (row.web_page_num, row.associated_files))

In Scala, use the correct get method for the type of value with the column index:

scala> val afilesrdd = assocfiles.map(row =>
           (row.getInt(0), row.getString(1)))

9. Now that you have an RDD, you can use the familiar flatMapValues transformation to split and extract the file names in the associated_files column:

pyspark> afilesrdd2 = afilesrdd \
             .flatMapValues(lambda filestring: filestring.split(','))

scala> val afilesrdd2 = afilesrdd.flatMapValues(filestring =>
           filestring.split(','))

10. Create a new DataFrame from the RDD:

pyspark> afiledf = sqlCtx.createDataFrame(afilesrdd2)


scala> val afiledf = sqlCtx.createDataFrame(afilesrdd2)

11. Call printSchema on the new DataFrame. Note that Spark SQL gave the columns generic names: _1 and _2.

12. Create a new DataFrame by renaming the columns to reflect the data they hold.

In Python, use the withColumnRenamed method to rename the two columns:

pyspark> finaldf = afiledf. \
             withColumnRenamed('_1','web_page_num'). \
             withColumnRenamed('_2','associated_file')

In Scala, you can use the toDF shortcut method to create a new DataFrame based on an existing one with the columns renamed:

scala> val finaldf = afiledf.
           toDF("web_page_num","associated_file")

13. Call printSchema to confirm that the new DataFrame has the correct column names.

14. Your final DataFrame contains the processed data, so call finaldf.collect() to confirm the data is correct.

15. Optional: Save the final DataFrame in Parquet format (the default) in /loudacre/webpage_files. The code is the same in Scala and Python.

> finaldf.save("/loudacre/webpage_files")

Confirm that the data saved in HDFS is correct. Note that the data will be in Parquet file format, which is a binary format. This means that when you view the files, only some of the content will be in readable string form. This is expected behavior.
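
One way to check, assuming the standard HDFS command-line client is available on the course VM (these commands are a suggested check, not part of the provided solution scripts):

$ hdfs dfs -ls /loudacre/webpage_files

$ hdfs dfs -cat /loudacre/webpage_files/part* | head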

This is the end of the Exercise


Appendix A: Enabling iPython Notebook

iPython Notebook is installed on the VM for this course. To use it instead of the command-line version of iPython, follow these steps:

1. Open the following file for editing: /home/training/.bashrc

2. Uncomment the following line (remove the leading #).

# export PYSPARK_DRIVER_PYTHON_OPTS='notebook ……..jax'

3. Save the file.

4. Open a new terminal window. (It must be a new terminal so that it loads your edited .bashrc file.)

5. Enter pyspark in the terminal. This will cause a browser window to open and display the iPython Notebook home page.

6. On the right-hand side of the page, select Python 2 from the New menu.


7. Enter some Spark code, such as the following, and use the play button to execute it.
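
For example, a simple snippet such as this (illustrative only; the values and variable names are assumptions, not the exact code shown in the course screenshot):

rdd = sc.parallelize([1, 2, 3, 4, 5])   # build a small RDD in the notebook cell
rdd.map(lambda x: x * x).collect()      # square each element and return the results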

8. Notice the output displayed.

This is the end of the Appendix


Data Model Reference

Note that not all of the information below applies to this customized version of the course.

Tables Imported from MySQL

The following depicts the structure of the MySQL tables imported into HDFS using Sqoop. The primary key column from the database, if any, is denoted by bold text:

customers: 201,375 records (imported to /dualcore/customers)

Index  Field    Description           Example
0      cust_id  Customer ID           1846532
1      fname    First name            Sam
2      lname    Last name             Jones
3      address  Address of residence  456 Clue Road
4      city     City                  Silicon Sands
5      state    State                 CA
6      zipcode  Postal code           94306

employees: 61,712 records (imported to /dualcore/employees and later used as an external table in Hive)

Index  Field      Description              Example
0      emp_id     Employee ID              BR5331404
1      fname      First name               Betty
2      lname      Last name                Richardson
3      address    Address of residence     123 Shady Lane
4      city       City                     Anytown
5      state      State                    CA
6      zipcode    Postal code              90210
7      job_title  Employee's job title     Vice President
8      email      E-mail address           [email protected]
9      active     Is actively employed?    Y
10     salary     Annual pay (in dollars)  136900

orders: 1,662,951 records (imported to /dualcore/orders)

Index  Field       Description         Example
0      order_id    Order ID            3213254
1      cust_id     Customer ID         1846532
2      order_date  Date/time of order  2013-05-31 16:59:34

order_details: 3,333,244 records (imported to /dualcore/order_details)

Index  Field     Description  Example
0      order_id  Order ID     3213254
1      prod_id   Product ID   1754836

products: 1,114 records (imported to /dualcore/products)

Index  Field        Description                   Example
0      prod_id      Product ID                    1273641
1      brand        Brand name                    Foocorp
2      name         Name of product               4-port USB Hub
3      price        Retail sales price, in cents  1999
4      cost         Wholesale cost, in cents      1463
5      shipping_wt  Shipping weight (in pounds)   1

suppliers: 66 records (imported to /dualcore/suppliers)

Index  Field    Description          Example
0      supp_id  Supplier ID          1000
1      fname    First name           ACME Inc.
2      lname    Last name            Sally Jones
3      address  Address of office    123 Oak Street
4      city     City                 New Athens
5      state    State                IL
6      zipcode  Postal code          62264
7      phone    Office phone number  (618) 555-5914


Hive/Impala Tables

The following is a record count for tables that are created or queried during the hands-on exercises. Use the DESCRIBE tablename command to see the table structure.

Table Name         Record Count
ads                788,952
cart_items         33,812
cart_orders        12,955
cart_shipping      12,955
cart_zipcodes      12,955
checkout_sessions  12,955
customers          201,375
employees          61,712
loyalty_program    311
order_details      3,333,244
orders             1,662,951
products           1,114
ratings            21,997
web_logs           412,860


Other Data Added to HDFS

The following describes the structure of other important datasets added to HDFS.

Combined Ad Campaign Data: 788,952 records total, stored in two directories:

• /dualcore/ad_data1 (438,389 records)
• /dualcore/ad_data2 (350,563 records)

Index  Field         Description                 Example
0      campaign_id   Uniquely identifies our ad  A3
1      date          Date of ad display          05/23/2013
2      time          Time of ad display          15:39:26
3      keyword       Keyword that triggered ad   tablet
4      display_site  Domain where ad shown       news.example.com
5      placement     Location of ad on Web page  INLINE
6      was_clicked   Whether ad was clicked      1
7      cpc           Cost per click, in cents    106

access.log: 412,860 records (uploaded to /dualcore/access.log). This file is used to populate the web_logs table in Hive. Note that the RFC931 and Username fields are seldom populated in log files for modern public Web sites and are ignored in our RegexSerDe.

Index  Field/Description     Example
0      IP address            192.168.1.15
1      RFC931 (Ident)        -
2      Username              -
3      Date/Time             [22/May/2013:15:01:46 -0800]
4      Request               "GET /foo?bar=1 HTTP/1.1"
5      Status code           200
6      Bytes transferred     762
7      Referer               "http://dualcore.com/"
8      User agent (browser)  "Mozilla/4.0 [en] (WinNT; I)"
9      Cookie (session ID)   "SESSION=8763723145"


Regular Expression Reference

The following is a brief tutorial intended for the convenience of students who don't have experience using regular expressions or may need a refresher. A more complete reference can be found in the documentation for Java's Pattern class:

http://tiny.cloudera.com/regexpattern

Introduction to Regular Expressions

Regular expressions are used for pattern matching. There are two kinds of patterns in regular expressions: literals and metacharacters. Literal values are used to match precise patterns while metacharacters have special meaning; for example, a dot will match any single character. Here's the complete list of metacharacters, followed by explanations of those that are commonly used:

< ( [ { \ ^ - = $ ! | ] } ) ? * + . >

Literal characters are any characters not listed as a metacharacter. They're matched exactly, but if you want to match a metacharacter, you must escape it with a backslash. Since a backslash is itself a metacharacter, it must also be escaped with a backslash. For example, you would use the pattern \\. to match a literal dot.

Regular expressions support patterns much more flexible than simply using a dot to match any character. The following explains how to use character classes to restrict which characters are matched.

Character Classes

[057]  Matches any single digit that is either 0, 5, or 7
[0-9]  Matches any single digit between 0 and 9
[3-6]  Matches any single digit between 3 and 6
[a-z]  Matches any single lowercase letter
[C-F]  Matches any single uppercase letter between C and F

For example, the pattern [C-F][3-6] would match the string D3 or F5 but would fail to match G3 or C7.
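
If you want to experiment with these patterns, Python's re module makes a quick sandbox (an illustration only, not part of the exercises; note that in Python source you write patterns directly, without the doubled backslashes required inside Java/Hive string literals):

import re

# character class example from the text
print re.match('[C-F][3-6]', 'D3') is not None   # True
print re.match('[C-F][3-6]', 'G3') is not None   # False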

There are also some built-in character classes that are shortcuts for common sets of characters.


Predefined Character Classes

\\d  Matches any single digit
\\w  Matches any word character (letters of any case, plus digits or underscore)
\\s  Matches any whitespace character (space, tab, newline, etc.)

For example, the pattern \\d\\d\\d\\w would match the string 314d or 934X but would fail to match 93X or Z871.

Sometimes it's easier to choose what you don't want to match instead of what you do want to match. These three can be negated by using an uppercase letter instead.

Negated Predefined Character Classes

\\D  Matches any single non-digit character
\\W  Matches any non-word character
\\S  Matches any non-whitespace character

For example, the pattern \\D\\D\\W would match the string ZX# or @ P but would fail to match 93X or 36_.

The metacharacters shown above each match exactly one character. You can specify them multiple times to match more than one character, but regular expressions support the use of quantifiers to eliminate this repetition.

Matching Quantifiers

{5}    Preceding character may occur exactly five times
{0,6}  Preceding character may occur between zero and six times
?      Preceding character is optional (may occur zero or one times)
+      Preceding character may occur one or more times
*      Preceding character may occur zero or more times

By default, quantifiers try to match as many characters as possible. If you used the pattern ore.+a on the string Dualcore has a store in Florida, you might be surprised to learn that it matches ore has a store in Florida rather than ore ha or ore in Florida as you might have expected. This is because matches are "greedy" by default. Adding a question mark makes the quantifier match as few characters as possible instead, so the pattern ore.+?a on this string would match ore ha.
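
You can verify the greedy versus non-greedy behavior with Python's re module (an illustration only, not part of the exercises):

import re

s = 'Dualcore has a store in Florida'
print re.search('ore.+a', s).group()    # greedy: 'ore has a store in Florida'
print re.search('ore.+?a', s).group()   # non-greedy: 'ore ha'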


Finally, there are two special metacharacters that match zero characters. They are used to ensure that a string matches a pattern only when it occurs at the beginning or end of a string.

Boundary Matching Metacharacters

^  Matches only at the beginning of a string
$  Matches only at the end of a string

NOTE: When used inside square brackets (which denote a character class), the ^ character is interpreted differently. In that context, it negates the match. Therefore, specifying the pattern [^0-9] is equivalent to using the predefined character class \\D described earlier.