
Copyright © 2010-2014 Cloudera, Inc. All rights reserved. Not to be reproduced without prior written consent.


Cloudera Custom Training Hands-On Exercises

General Notes
Hands-On Exercise: Data Ingest With Hadoop Tools
Hands-On Exercise: Running Queries from the Shell, Scripts, and Hue
Hands-On Exercise: Data Management
Hands-On Exercise: Relational Analysis
Hands-On Exercise: Working with Impala
Hands-On Exercise: Analyzing Text and Complex Data With Hive
Hands-On Exercise: Data Transformation with Hive
Hands-On Exercise: View the Spark Documentation
Hands-On Exercise: Use the Spark Shell
Hands-On Exercise: Use RDDs to Transform a Dataset
Hands-On Exercise: Process Data Files with Spark
Hands-On Exercise: Use Pair RDDs to Join Two Datasets


Hands-On Exercise: Write and Run a Spark Application
Hands-On Exercise: Configure a Spark Application
Hands-On Exercise: View Jobs and Stages in the Spark Application UI
Hands-On Exercise: Persist an RDD
Hands-On Exercise: Implement an Iterative Algorithm
Hands-On Exercise: Use Broadcast Variables
Hands-On Exercise: Use Accumulators
Hands-On Exercise: Use Spark SQL for ETL
Appendix A: Enabling iPython Notebook
Data Model Reference
Regular Expression Reference


General Notes

Cloudera's training courses use a virtual machine (VM) with a recent version of CDH already installed and configured for you. The VM runs in pseudo-distributed mode, a configuration that enables a Hadoop cluster to run on a single machine.

Points to Note While Working in the VM

1. The VM is set to automatically log in as the user training. If you log out, you can log back in as the user training with the password training. The root password is also training, though you can prefix any command with sudo to run it as root.

2. Exercises often contain steps with commands that look like this:

$ hdfs dfs -put accounting_reports_taxyear_2013 \

/user/training/tax_analysis/

The $ symbol represents the command prompt. Do not include this character when copying and pasting commands into your terminal window. Also, the backslash (\) signifies that the command continues on the next line. You may either enter the code as shown (on two lines), or omit the backslash and type the command on a single line.

Some commands are to be executed in the Python or Scala Spark shells; those are color-coded and shown with pyspark> (blue) or scala> (red) prompts, respectively. Linux command steps that apply to only one language or the other are also color-coded, but still preceded with the $ prompt.

3. Although many students are comfortable using UNIX text editors like vi or emacs, some might prefer a graphical text editor. To invoke the graphical editor from the command line, type gedit followed by the path of the file you wish to edit. Appending & to the command allows you to type additional commands while the editor is still open. Here is an example of how to edit a file named myfile.txt:

$ gedit myfile.txt &


Class-Specific VM Customization

Your VM is used in several of Cloudera's training classes. This particular class does not require some of the services that start by default, while other services that do not start by default are required for this class. Before starting the course exercises, run the course setup script:

$ ~/scripts/analyst/training_setup_da.sh

You may safely ignore any messages about services that have already been started or shut down. You only need to run this script once.

Points to Note During the Exercises

Sample Solutions

If you need a hint or want to check your work, the sample_solution subdirectory within each exercise directory contains complete code samples.

Catch-up Script

If you are unable to complete an exercise, we have provided a script to catch you up automatically. Each exercise has instructions for running the catch-up script.

$ADIR Environment Variable

$ADIR is a shortcut that points to the /home/training/training_materials/analyst directory, which contains the code and data you will use in the exercises.

Fewer Step-by-Step Instructions as You Work Through These Exercises

As the exercises progress, and you gain more familiarity with the tools you're using, we provide fewer step-by-step instructions. You should feel free to ask your instructor for assistance at any time, or to consult with your fellow students.


Bonus Exercises

Many of the exercises contain one or more optional "bonus" sections. We encourage you to work through these if time remains after you finish the main exercise and would like an additional challenge to practice what you have learned.


Hands-On Exercise: Data Ingest With Hadoop Tools

In this exercise you will practice using the Hadoop command line utility to interact with Hadoop's Distributed Filesystem (HDFS) and use Sqoop to import tables from a relational database to HDFS.

To begin, you must launch the Data Analyst VM.

Be sure you have run the setup script as described in the General Notes section above. If you have not run it yet, do so now:

$ ~/scripts/analyst/training_setup_da.sh

Step 1: Exploring HDFS using the Hue File Browser

1. Start the Firefox Web browser on your VM by clicking the icon in the system toolbar.

2. In Firefox, click on the Hue bookmark in the bookmark toolbar (or type http://localhost:8888/home into the address bar and then hit the [Enter] key).

3. After a few seconds, you should see Hue's home screen. The first time you log in, you will be prompted to create a new username and password. Enter training in both the username and password fields, and then click the "Sign In" button.

4. Whenever you log in to Hue, a Tips popup will appear. To stop it appearing in the future, check the Do not show this dialog again option before dismissing the popup.


5. Click File Browser in the Hue toolbar. Your HDFS home directory (/user/training) displays. (Since your user ID on the cluster is training, your home directory in HDFS is /user/training.) The directory contains no files or directories yet.

6. Create a temporary sub-directory: select the +New menu and click Directory.

7. Enter directory name test and click the Create button. Your home directory now contains a directory called test.

8. Click on test to view the contents of that directory; currently it contains no files or subdirectories.

9. Upload a file to the directory by selecting Upload → Files.

10. Click Select Files to bring up a file browser. By default, the /home/training/Desktop folder displays. Click the home directory button (training), then navigate to the course data directory: training_materials/analyst/data.

11. Choose any of the data files in that directory and click the Open button.

12. The file you selected will be loaded into the current HDFS directory. Click the file name to see the file's contents. Because HDFS is designed to store very large files, Hue will not display the entire file, just the first page of data. You can click the arrow buttons or use the scrollbar to see more of the data.

13. Return to the test directory by clicking View file location in the left-hand panel.

14. Above the list of files in your current directory is the full path of the directory you are currently displaying. You can click on any directory in the path, or on the first slash (/) to go to the top level (root) directory. Click training to return to your home directory.

15. Delete the temporary test directory you created, including the file in it, by selecting the checkbox next to the directory name then clicking the Move to trash button. (Confirm that you want to delete by clicking Yes.)

Step 2: Exploring HDFS using the command line

4. You can use the hdfs dfs command to interact with HDFS from the command line. Close or minimize Firefox, then open a terminal window by clicking the icon in the system toolbar.

16. In the terminal window, enter:

$ hdfs dfs

This displays a help message describing all subcommands associated with hdfs dfs.

17. Run the following command:

$ hdfs dfs -ls /

This lists the contents of the HDFS root directory. One of the directories listed is /user. Each user on the cluster has a 'home' directory below /user corresponding to his or her user ID.

18. If you do not specify a path, hdfs dfs assumes you are referring to your home directory:


$ hdfs dfs -ls

19. Note the /dualcore directory. Most of your work in this course will be in that directory. Try creating a temporary subdirectory in /dualcore:

$ hdfs dfs -mkdir /dualcore/test1

20. Next, add a Web server log file to this new directory in HDFS:

$ hdfs dfs -put $ADIR/data/access.log /dualcore/test1/

Overwriting Files in Hadoop

Unlike the UNIX shell, Hadoop won't overwrite files and directories. This feature helps protect users from accidentally replacing data that may have taken hours to produce. If you need to replace a file or directory in HDFS, you must first remove the existing one. Please keep this in mind in case you make a mistake and need to repeat a step during the Hands-On Exercises.

To remove a file:

$ hdfs dfs -rm /dualcore/example.txt

To remove a directory and all its files and subdirectories (recursively):

$ hdfs dfs -rm -r /dualcore/example/

21. Verify the last step by listing the contents of the /dualcore/test1 directory. You should observe that the access.log file is present and occupies 106,339,468 bytes of space in HDFS:

$ hdfs dfs -ls /dualcore/test1

22. Remove the temporary directory and its contents:


$ hdfs dfs -rm -r /dualcore/test1

Step 3: Importing Database Tables into HDFS with Sqoop

Dualcore stores information about its employees, customers, products, and orders in a MySQL database. In the next few steps, you will examine this database before using Sqoop to import its tables into HDFS.

5. In a terminal window, log in to MySQL and select the dualcore database:

$ mysql --user=training --password=training dualcore

23. Next, list the available tables in the dualcore database (mysql> represents the MySQL client prompt and is not part of the command):

mysql> SHOW TABLES;

24. Review the structure of the employees table and examine a few of its records:

mysql> DESCRIBE employees;

mysql> SELECT emp_id, fname, lname, state, salary FROM employees LIMIT 10;

25. Exit MySQL by typing quit, and then hit the Enter key:

mysql> quit

Data Model Reference

For your convenience, you will find a reference section depicting the structure of the tables you will use in the exercises at the end of this Exercise Manual.

26. Next, run the following command, which imports the employees table into the /dualcore directory created earlier, using tab characters to separate each field:


$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table employees

Hiding Passwords

Typing the database password on the command line is a potential security risk since others may see it. An alternative to using the --password argument is to use -P and let Sqoop prompt you for the password, which is then not visible when you type it.

Sqoop Code Generation

After running the sqoop import command above, you may notice a new file named employees.java in your local directory. This is an artifact of Sqoop's code generation and is really only of interest to Java developers, so you can ignore it.

27. Revise the previous command and import the customers table into HDFS.

28. Revise the previous command and import the products table into HDFS.

29. Revise the previous command and import the orders table into HDFS.


30. Next, import the order_details table into HDFS. The command is slightly different because this table only holds references to records in the orders and products tables, and lacks a primary key of its own. Consequently, you will need to specify the --split-by option and instruct Sqoop to divide the import work among tasks based on values in the order_id field. An alternative is to use the -m 1 option to force Sqoop to import all the data with a single task, but this would significantly reduce performance.

$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table order_details \
--split-by=order_id

This is the end of the Exercise


Hands-On Exercise: Running Queries from the Shell, Scripts, and Hue

Exercise directory: $ADIR/exercises/queries

In this exercise you will practice using the Hue query editor and the Impala and Hive shells to execute simple queries. These exercises use the tables that have been populated with data you imported to HDFS using Sqoop in the "Data Ingest With Hadoop Tools" exercise.

IMPORTANT: In order to prepare the data for this exercise, you must run the following command before continuing:

$ ~/scripts/analyst/catchup.sh

Step #1: Explore the customers table using Hue

One way to run Impala and Hive queries is through your Web browser using Hue's Query Editors. This is especially convenient if you use more than one computer – or if you use a device (such as a tablet) that isn't capable of running the Impala or Beeline shells itself – because it does not require any software other than a browser.

6. Start the Firefox Web browser if it isn't running, then click on the Hue bookmark in the Firefox bookmark toolbar (or type http://localhost:8888/home into the address bar and then hit the [Enter] key).

7. After a few seconds, you should see Hue's home screen. If you don't currently have an active session, you will first be prompted to log in. Enter training in both the username and password fields, and then click the Sign In button.

8. Select the Query Editors menu in the Hue toolbar. Note that there are query editors for both Impala and Hive (as well as other tools such as Pig). The interface is very similar for both Hive and Impala. For these exercises, select the Impala query editor.


9. This is the first time we have run Impala since we imported the data using Sqoop. Tell Impala to reload the HDFS metadata for the table by entering the following command in the query area, then clicking Execute.

INVALIDATE METADATA

10. Make sure the default database is selected in the database list on the left side of the page.

11. Below the selected database is a list of the tables in that database. Select the customers table to view the columns in the table.

12. Click the Preview Sample Data icon next to the table name to view sample data from the table. When you are done, click the OK button to close the window.

Step #2: Run a Query Using Hue

Dualcore ran a contest in which customers posted videos of interesting ways to use their new tablets. A $5,000 prize will be awarded to the customer whose video received the highest rating.

However, the registration data was lost due to an RDBMS crash, and the only information we have is from the videos. The winning customer introduced herself only as "Bridget from Kansas City" in her video.

You will need to run a query that identifies the winner's record in our customer database so that we can send her the $5,000 prize.

13. All you know about the winner is that her name is Bridget and she lives in Kansas City. In the Impala Query Editor, enter a query in the text area to find the winning customer. Use the LIKE operator to do a wildcard search for names such as "Bridget", "Bridgette" or "Bridgitte". Remember to filter on the customer's city.
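
If you would like a starting point, one possible form of the query is sketched below; it assumes the customers table includes fname and city columns as shown in the Data Model Reference at the end of this manual:

SELECT cust_id, fname, lname
FROM customers
WHERE fname LIKE 'Bridget%'
AND city = 'Kansas City';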

14. After entering the query, click the Execute button.

While the query is executing, the Log tab displays ongoing log output from the query. When the query is complete, the Results tab opens, displaying the results of the query.

Question: Which customer did your query identify as the winner of the $5,000 prize?


Step #3: Run a Query from the Impala Shell

Run a top-N query to identify the three most expensive products that Dualcore currently offers.

15. Start a terminal window if you don't currently have one running.

16. On the Linux command line in the terminal window, start the Impala shell:

$ impala-shell

Impala displays the URL of the Impala server in the shell command prompt, e.g.:

[localhost.localdomain:21000] >

17. At the prompt, review the schema of the products table by entering:

DESCRIBE products;

Remember that SQL commands in the shell must be terminated by a semicolon (;), unlike in the Hue query editor.

18. Show a sample of 10 records from the products table:

SELECT * FROM products LIMIT 10;

19. Execute a query that displays the three most expensive products. Hint: Use ORDER BY.
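
One possible query, assuming the products table stores the selling price in a column named price (as shown in the Data Model Reference):

SELECT name, price
FROM products
ORDER BY price DESC
LIMIT 3;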

20. When you are done, exit the Impala shell:

exit;

Step #4: Run a Script in the Impala Shell

The rules for the contest described earlier require that the winner bought the advertised tablet from Dualcore between May 1, 2013 and May 31, 2013. Before we can authorize our accounting department to pay the $5,000 prize, you must ensure that Bridget is eligible.


Since this query involves joining data from several tables, and we have not yet covered JOIN, you've been provided with a script in the exercise directory.

21. Change to the directory for this hands-on exercise:

$ cd $ADIR/exercises/queries

22. Review the code for the query:

$ cat verify_tablet_order.sql

23. Execute the script using the shell's -f option:

$ impala-shell -f verify_tablet_order.sql

Question: Did Bridget order the advertised tablet in May?

Step #5: Run a Query Using Beeline

24. At the Linux command line in a terminal window, start Beeline:

$ beeline -u jdbc:hive2://localhost:10000

Beeline displays the URL of the Hive server in the shell command prompt, e.g.:

0: jdbc:hive2://localhost:10000>

25. Execute a query to find all the Gigabux brand products whose price is less than 1000 ($10).
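
One possible query, assuming brand and price columns as shown in the Data Model Reference:

SELECT name, price
FROM products
WHERE brand = 'Gigabux'
AND price < 1000;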

26. Exit the Beeline shell by entering:

!exit

This is the end of the Exercise


Hands-On Exercise: Data Management

Exercise directory: $ADIR/exercises/data_mgmt

In this exercise you will practice using several common techniques for creating and populating tables.

IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:

$ ~/scripts/analyst/catchup.sh

Step #1: Review Existing Tables using the Metastore Manager

1. In Firefox, visit the Hue home page, and then choose Data Browsers → Metastore Tables in the Hue toolbar.

2. Make sure the default database is selected.

3. Select the customers table to display the table browser and review the list of columns.

4. Select the Sample tab to view the first hundred rows of data.

Step #2: Create and Load a New Table using the Metastore Manager

Create and then load a table with product ratings data.

5. Before creating the table, review the files containing the product ratings data. The files are in /home/training/training_materials/analyst/data. You can use the head command in a terminal window to see the first few lines:


$ head $ADIR/data/ratings_2012.txt

$ head $ADIR/data/ratings_2013.txt

6. Copy the data files to the /dualcore directory in HDFS. You may use either the Hue File Browser, or the hdfs command in the terminal window:

$ hdfs dfs -put $ADIR/data/ratings_2012.txt /dualcore/

$ hdfs dfs -put $ADIR/data/ratings_2013.txt /dualcore/

7. Return to the Metastore Manager in Hue. Select the default database to view the table browser.

8. Click on Create a new table manually to start the table definition wizard.

9. The first wizard step is to specify the table's name (required) and a description (optional). Enter table name ratings, then click Next.

10. In the next step you can choose whether the table will be stored as a regular text file or use a custom Serializer/Deserializer, or SerDe. SerDes will be covered later in the course. For now, select Delimited, then click Next.

11. The next step allows you to change the default delimiters. For a simple table, only the field terminator is relevant; collection and map delimiters are used for complex data in Hive, and will be covered later in the course. Select Tab (\t) for the field terminator, then click Next.

12. In the next step, choose a file format. File formats will be covered later in the course. For now, select TextFile, then click Next.

13. In the next step, you can choose whether to store the file in the default data warehouse directory or a different location. Make sure the Use default location box is checked, then click Next.

14. The next step in the wizard lets you add columns. The first column of the ratings table is the timestamp of the time that the rating was posted. Enter column name posted and choose column type timestamp.


15. You can add additional columns by clicking the Add a column button. Repeat the steps above to enter a column name and type for all the columns of the ratings table:

Field Name    Field Type
posted        timestamp
cust_id       int
prod_id       int
rating        tinyint
message       string

16. When you have added all the columns, scroll down and click Create table. This will start a job to define the table in the Metastore, and create the warehouse directory in HDFS to store the data.

17. When the job is complete, the new table will appear in the table browser.

18. Optional: Use the Hue File Browser or the hdfs command to view the /user/hive/warehouse directory to confirm creation of the ratings subdirectory.

19. Now that the table is created, you can load data from a file. One way to do this is in Hue. Click Import Table under Actions.

20. In the Import data dialog box, enter or browse to the HDFS location of the 2012 product ratings data file: /dualcore/ratings_2012.txt. Then click Submit. (You will load the 2013 ratings in a moment.)

21. Next, verify that the data was loaded by selecting the Sample tab in the table browser for the ratings table.

22. Try querying the data in the table. In Hue, switch to the Impala Query Editor.


23. Initially the new table will not appear. You must first reload Impala's metadata cache by entering and executing the command below. (Impala metadata caching will be covered in depth later in the course.)

INVALIDATE METADATA;

24. If the table does not appear in the table list on the left, click the Reload button. (This refreshes the page, not the metadata itself.)

25. Try executing a query, such as counting the number of ratings:

SELECT COUNT(*) FROM ratings;

The total number of records should be 464.

26. Another way to load data into a table is using the LOAD DATA command. Load the 2013 ratings data:

LOAD DATA INPATH '/dualcore/ratings_2013.txt' INTO TABLE ratings;

27. The LOAD DATA INPATH command moves the file to the table's directory. Using the Hue File Browser or hdfs command, verify that the file is no longer present in the original directory:

$ hdfs dfs -ls /dualcore/ratings_2013.txt

28. Optional: Verify that the 2013 data is shown alongside the 2012 data in the table's warehouse directory.

29. Finally, count the records in the ratings table to ensure that all 21,997 are available:

SELECT COUNT(*) FROM ratings;


Step #3: Create an External Table Using CREATE TABLE

You imported data from the employees table in MySQL into HDFS in an earlier exercise. Now we want to be able to query this data. Since the data already exists in HDFS, this is a good opportunity to use an external table.

In the last exercise you practiced creating a table using the Metastore Manager; this time, use an Impala SQL statement. You may use either the Impala shell, or the Impala Query Editor in Hue.

30. Write and execute a CREATE TABLE statement to create an external table for the tab-delimited records in HDFS at /dualcore/employees. The data format is shown below:

Field Name    Field Type
emp_id        STRING
fname         STRING
lname         STRING
address       STRING
city          STRING
state         STRING
zipcode       STRING
job_title     STRING
email         STRING
active        STRING
salary        INT
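
One possible statement is sketched below; it assumes the tab-delimited data Sqoop wrote to /dualcore/employees, and uses the column names and types from the field list above:

CREATE EXTERNAL TABLE employees (
emp_id STRING,
fname STRING,
lname STRING,
address STRING,
city STRING,
state STRING,
zipcode STRING,
job_title STRING,
email STRING,
active STRING,
salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/dualcore/employees';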

31. Run the following query to verify that you have created the table correctly.

SELECT job_title, COUNT(*) AS num
FROM employees
GROUP BY job_title
ORDER BY num DESC
LIMIT 3;

It should show that Sales Associate, Cashier, and Assistant Manager are the three most common job titles at Dualcore.


Bonus Exercise #1: Use Sqoop’s Hive Import Option to Create a Table

If you have successfully finished the main exercise and still have time, feel free to continue with this bonus exercise.

You used Sqoop in an earlier exercise to import data from MySQL into HDFS. Sqoop can also create a Hive/Impala table with the same fields as the source table in addition to importing the records, which saves you from having to write a CREATE TABLE statement.

32. In a terminal window, execute the following command to import the suppliers table from MySQL as a new managed table:

$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--table suppliers \
--hive-import

33. It is always a good idea to validate data after adding it. Execute the following query to count the number of suppliers in Texas. You may use either the Impala shell or the Hue Impala Query Editor. Remember to invalidate the metadata cache so that Impala can find the new table.

INVALIDATE METADATA;

SELECT COUNT(*) FROM suppliers WHERE state='TX';

The query should show that nine records match.

Bonus Exercise #2: Alter a Table

If you have successfully finished the main exercise and still have time, feel free to continue with this bonus exercise. You can compare your work against the files found in the bonus_02/sample_solution/ subdirectory.


In this exercise you will modify the suppliers table you imported using Sqoop in the previous exercise. You may complete these exercises using either the Impala shell or the Impala query editor in Hue.

34. Use ALTER TABLE to rename the company column to name.
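
One possible statement, assuming the company column was imported as a STRING (the column type is restated when renaming a column):

ALTER TABLE suppliers CHANGE company name STRING;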

35. Use the DESCRIBE command on the suppliers table to verify the change.

36. Use ALTER TABLE to rename the entire table to vendors.
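
One possible statement for renaming the table:

ALTER TABLE suppliers RENAME TO vendors;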

37. Although the ALTER TABLE command often requires that we make a corresponding change to the data in HDFS, renaming a table or column does not. You can verify this by running a query on the table using the new names, e.g.:

SELECT supp_id, name FROM vendors LIMIT 10;

This is the end of the Exercise


Hands-On Exercise: Relational Analysis

Exercise directory: $ADIR/exercises/relational_analysis

In this exercise you will write queries to analyze data in tables that have been populated with data you imported to HDFS using Sqoop in the "Data Ingest" exercise.

IMPORTANT: In order to prepare the data for this exercise, you must run the following command before continuing:

$ ~/scripts/analyst/catchup.sh

Several analysis questions are described below and you will need to write the SQL code to answer them. You can use whichever tool you prefer – Impala or Hive – using whichever method you like best, including shell, script, or the Hue Query Editor, to run your queries.

Step #1: Calculate Top N Products

• Which top three products has Dualcore sold more of than any other? Hint: Remember that if you use a GROUP BY clause, you must group by all fields listed in the SELECT clause that are not part of an aggregate function.
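
One possible approach is sketched below; essentially the same query appears later in the "Working with Impala" exercise, joining products to order_details and counting how often each product was ordered:

SELECT brand, name, COUNT(p.prod_id) AS sold
FROM products p
JOIN order_details d
ON (p.prod_id = d.prod_id)
GROUP BY brand, name, p.prod_id
ORDER BY sold DESC
LIMIT 3;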

Step #2: Calculate Order Total

• Which orders had the highest total?
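
One possible query, assuming each order's total is the sum of the prices of the products referenced by its order_details records:

SELECT d.order_id, SUM(p.price) AS total
FROM order_details d
JOIN products p
ON (d.prod_id = p.prod_id)
GROUP BY d.order_id
ORDER BY total DESC
LIMIT 1;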

Step #3: Calculate Revenue and Profit

• Write a query to show Dualcore's revenue (total price of products sold) and profit (price minus cost) by date.

o Hint: The order_date column in the orders table is of type TIMESTAMP. Use TO_DATE to get just the date portion of the value.
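
One possible query is sketched below; it assumes price and cost columns in the products table and an order_date column in the orders table, as shown in the Data Model Reference:

SELECT TO_DATE(o.order_date) AS order_dt,
SUM(p.price) AS revenue,
SUM(p.price - p.cost) AS profit
FROM orders o
JOIN order_details d
ON (o.order_id = d.order_id)
JOIN products p
ON (d.prod_id = p.prod_id)
GROUP BY TO_DATE(o.order_date);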


There are several ways you could write these queries. One possible solution for each is in the sample_solution/ directory.

Bonus Exercise #1: Rank Daily Profits by Month

If you have successfully finished the earlier steps and still have time, feel free to continue with this optional bonus exercise.

• Write a query to show how each day's profit ranks compared to other days within the same year and month.

o Hint: Use the previous exercise's solution as a sub-query; find the ROW_NUMBER of the results within each year and month.
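
A rough sketch of one approach is shown below, using the daily-profit query from the previous exercise as a sub-query; exact date-function syntax differs slightly between Hive and Impala, so treat this as an outline rather than a finished solution:

SELECT order_dt, profit,
ROW_NUMBER() OVER (PARTITION BY YEAR(order_dt), MONTH(order_dt)
ORDER BY profit DESC) AS day_rank
FROM (SELECT TO_DATE(o.order_date) AS order_dt,
SUM(p.price - p.cost) AS profit
FROM orders o
JOIN order_details d ON (o.order_id = d.order_id)
JOIN products p ON (d.prod_id = p.prod_id)
GROUP BY TO_DATE(o.order_date)) daily;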

There are several ways you could write this query. One possible solution is in the bonus_01/sample_solution/ directory.

This is the end of the Exercise


Hands-On Exercise: Working with Impala

In this exercise you will explore the query execution plan for various types of queries in Impala.

IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:

$ ~/scripts/analyst/catchup.sh

Step #1: Review Query Execution Plans

1. Review the execution plan for the following query. You may use either the Impala Query Editor in Hue or the Impala shell command line tool.

SELECT * FROM products;

2. Note that the query explanation includes a warning that table and column stats are not available for the products table. Compute the stats by executing:

COMPUTE STATS products;

3. Now view the query plan again, this time without the warning.

4. The previous query was a very simple query against a single table. Try reviewing the query plan of a more complex query. The following query returns the top 3 products sold. Before EXPLAINing the query, compute stats on the tables to be queried.


SELECT brand, name, COUNT(p.prod_id) AS sold
FROM products p
JOIN order_details d
ON (p.prod_id = d.prod_id)
GROUP BY brand, name, p.prod_id
ORDER BY sold DESC
LIMIT 3;

Questions: How many stages are there in this query? What are the estimated per-host memory requirements for this query? What is the total size of all partitions to be scanned?

5. The tables in the queries above all have only a single partition. Try reviewing the query plan for a partitioned table. Recall that in the "Data Storage and Performance" exercise, you created an ads table partitioned on the network column. Compare the query plans for the following two queries. The first calculates the total cost of clicked ads for each ad campaign; the second does the same, but only for ads on one of the ad networks.

SELECT campaign_id, SUM(cpc)
FROM ads
WHERE was_clicked=1
GROUP BY campaign_id
ORDER BY campaign_id;

SELECT campaign_id, SUM(cpc)
FROM ads
WHERE network=1
GROUP BY campaign_id
ORDER BY campaign_id;

Questions: What are the estimated per-host memory requirements for the two queries? What explains the difference?


Bonus Exercise #1: Review the Query Summary

If you have successfully finished the earlier steps and still have time, feel free to continue with this optional bonus exercise.

This exercise must be completed in the Impala Shell command line tool, because it uses features not yet available in Hue. Refer to the "Running Queries from the Shell, Scripts, and Hue" exercise for how to use the shell if needed.

6. Try executing one of the queries you examined above, e.g.:

SELECT brand, name, COUNT(p.prod_id) AS sold
FROM products p
JOIN order_details d
ON (p.prod_id = d.prod_id)
GROUP BY brand, name, p.prod_id
ORDER BY sold DESC
LIMIT 3;

7. After the query completes, execute the SUMMARY command:

SUMMARY;

8. Questions: Which stage took the longest average time to complete? Which took the most memory?

This is the end of the Exercise


Hands-On Exercise: Analyzing Text and Complex Data With Hive

Exercise directory: $ADIR/exercises/complex_data

In this exercise, you will:

• Use Hive's ability to store complex data to work with data from a customer loyalty program

• Use a RegexSerDe to load web log data into Hive

• Use Hive's text processing features to analyze customers' comments and product ratings, uncover problems, and propose potential solutions

IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:

$ ~/scripts/analyst/catchup.sh

Step #1: Create, Load and Query a Table with Complex Data

Dualcore recently started a loyalty program to reward our best customers. A colleague has already provided us with a sample of the data that contains information about customers who have signed up for the program, including their phone numbers (as a map), a list of past order IDs (as an array), and a struct that summarizes the minimum, maximum, average, and total value of past orders. You will create the table, populate it with the provided data, and then run a few queries to practice referencing these types of fields.

You may use either the Beeline shell or Hue's Hive Query Editor to complete these exercises.

1. Create a table with the following characteristics:


Name: loyalty_program
Type: EXTERNAL
Columns:

Field Name    Field Type
cust_id       STRING
fname         STRING
lname         STRING
email         STRING
level         STRING
phone         MAP<STRING,STRING>
order_ids     ARRAY<INT>
order_value   STRUCT<min:INT, max:INT, avg:INT, total:INT>

Field terminator: | (vertical bar)
Collection item terminator: , (comma)
Map key terminator: : (colon)
Location: /dualcore/loyalty_program
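
A possible CREATE TABLE statement matching the specification above, using Hive's complex-type and delimiter syntax:

CREATE EXTERNAL TABLE loyalty_program (
cust_id STRING,
fname STRING,
lname STRING,
email STRING,
level STRING,
phone MAP<STRING,STRING>,
order_ids ARRAY<INT>,
order_value STRUCT<min:INT, max:INT, avg:INT, total:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION '/dualcore/loyalty_program';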

2. Examine the data in $ADIR/data/loyalty_data.txt to see how it corresponds to the fields in the table.

3. Load the data file by placing it into the HDFS data warehouse directory for the new table. You can use either the Hue File Browser, or the hdfs command:

$ hdfs dfs -put $ADIR/data/loyalty_data.txt \

/dualcore/loyalty_program/

4. Run a query to select the HOME phone number (hint: map keys are case-sensitive) for customer ID 1200866. You should see 408-555-4914 as the result.

5. Select the third element from the order_ids array for customer ID 1200866 (hint: elements are indexed from zero). The query should return 5278505.

6. Select the total attribute from the order_value struct for customer ID 1200866. The query should return 401874.
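
Possible queries for steps 4 through 6, using Hive's bracket and dot syntax for map, array, and struct fields (cust_id is a STRING, so the ID is quoted):

SELECT phone['HOME'] FROM loyalty_program WHERE cust_id = '1200866';

SELECT order_ids[2] FROM loyalty_program WHERE cust_id = '1200866';

SELECT order_value.total FROM loyalty_program WHERE cust_id = '1200866';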


Step #2: Create and Populate the Web Logs Table

Many interesting analyses can be done on data from the usage of a website. The first step is to load the semi-structured data in the web log files into a Hive table. Typical log file formats are not delimited, so you will need to use the RegexSerDe and specify a pattern Hive can use to parse lines into individual fields you can then query.

7. Examine the create_web_logs.hql script to get an idea of how it uses a RegexSerDe to parse lines in the log file (an example log line is shown in the comment at the top of the file). When you have examined the script, run it to create the table. You can paste the code into the Hive Query Editor, or use HCatalog:

$ hcat -f $ADIR/exercises/complex_data/create_web_logs.hql

8. Populate the table by adding the log file to the table's directory in HDFS:

$ hdfs dfs -put $ADIR/data/access.log /dualcore/web_logs/

9. Verify that the data is loaded correctly by running this query to show the top three items users searched for on our Web site:

SELECT term, COUNT(term) AS num FROM
(SELECT LOWER(REGEXP_EXTRACT(request,
'/search\\?phrase=(\\S+)', 1)) AS term
FROM web_logs
WHERE request REGEXP '/search\\?phrase=') terms
GROUP BY term
ORDER BY num DESC
LIMIT 3;

You should see that it returns tablet (303), ram (153) and wifi (148).

Note: The REGEXP operator, which is available in some SQL dialects, is similar to LIKE, but uses regular expressions for more powerful pattern matching. The REGEXP operator is synonymous with the RLIKE operator.


Bonus Exercise #1: Analyze Numeric Product Ratings

If you have successfully finished the earlier steps and still have time, feel free to continue with this optional bonus exercise.

Customer ratings and feedback are great sources of information for both customers and retailers like Dualcore.

However, customer comments are typically free-form text and must be handled differently. Fortunately, Hive provides extensive support for text processing.

Before delving into text processing, you'll begin by analyzing the numeric ratings customers have assigned to various products. In the next bonus exercise, you will use these results in doing text analysis.

10. Review the ratings table structure using the Hive Query Editor or using the DESCRIBE command in the Beeline shell.

11. We want to find the product that customers like most, but must guard against being misled by products that have few ratings assigned. Run the following query to find the product with the highest average among all those with at least 50 ratings:

SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
COUNT(*) AS num
FROM ratings
GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating DESC
LIMIT 1;

12. Rewrite, and then execute, the query above to find the product with the lowest average among products with at least 50 ratings. You should see that the result is product ID 1274673 with an average rating of 1.10.
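
One way to do this is simply to reverse the sort order of the previous query:

SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
COUNT(*) AS num
FROM ratings
GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating ASC
LIMIT 1;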


Bonus Exercise #2: Analyze Rating Comments

We observed earlier that customers are very dissatisfied with one of the products we sell. Although numeric ratings can help identify which product that is, they don't tell us why customers don't like the product. Although we could simply read through all the comments associated with that product to learn this information, that approach doesn't scale. Next, you will use Hive's text processing support to analyze the comments.

13. The following query normalizes all comments on that product to lowercase, breaks them into individual words using the SENTENCES function, and passes those to the NGRAMS function to find the five most common bigrams (two-word combinations). Run the query:

SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 2, 5))
AS bigrams
FROM ratings
WHERE prod_id = 1274673;

14. Most of these words are too common to provide much insight, though the word "expensive" does stand out in the list. Modify the previous query to find the five most common trigrams (three-word combinations), and then run that query in Hive.
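
Changing the second argument of NGRAMS from 2 to 3 is all that is required:

SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 3, 5))
AS trigrams
FROM ratings
WHERE prod_id = 1274673;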

15. Among the patterns you see in the results is the phrase "ten times more." This might be related to the complaints that the product is too expensive. Now that you've identified a specific phrase, look at a few comments that contain it by running this query:

SELECT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%ten times more%'
LIMIT 3;

You should see three comments that say, "Why does the red one cost ten times more than the others?"


16. We can infer that customers are complaining about the price of this item, but the comment alone doesn't provide enough detail. One of the words ("red") in that comment was also found in the list of trigrams from the earlier query. Write and execute a query that will find all distinct comments containing the word "red" that are associated with product ID 1274673.
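
One possible query; a simple LIKE pattern is enough here, though it would also match words that merely contain "red":

SELECT DISTINCT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%red%';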

17. The previous step should have displayed two comments:

18. "What is so special about red?"

19. "Why does the red one cost ten times more than the others?"

The second comment implies that this product is overpriced relative to similar products. Write and run a query that will display the record for product ID 1274673 in the products table.
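
A simple lookup by product ID is sufficient here:

SELECT * FROM products WHERE prod_id = 1274673;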

20. Your query should have shown that the product was a "16 GB USB Flash Drive (Red)" from the "Orion" brand. Next, run this query to identify similar products:

SELECT *
FROM products
WHERE name LIKE '%16 GB USB Flash Drive%'
AND brand='Orion';

The query results show that we have three almost identical products, but the product with the negative reviews (the red one) costs about ten times as much as the others, just as some of the comments said.

Based on the cost and price columns, it appears that doing text processing on the product ratings has helped us uncover a pricing error.

This is the end of the Exercise


Hands-On Exercise: Data Transformation with Hive

Exercise directory: $ADIR/exercises/transform

In this exercise you will explore the data from Dualcore's Web server that you loaded in the "Analyzing Text and Complex Data" exercise. Queries on that data will reveal that many customers abandon their shopping carts before completing the checkout process. You will create several additional tables, using data from a TRANSFORM script and a supplied UDF, which you will use later to analyze how Dualcore could turn this problem into an opportunity.

IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:

$ ~/scripts/analyst/catchup.sh

Step #1: Analyze Customer Checkouts

As on many Web sites, Dualcore's customers add products to their shopping carts and then follow a "checkout" process to complete their purchase. We want to figure out if customers who start the checkout process are completing it. Since each part of the four-step checkout process can be identified by its URL in the logs, we can use a regular expression to identify them:

Step  Request URL                          Description
1     /cart/checkout/step1-viewcart        View list of items added to cart
2     /cart/checkout/step2-shippingcost    Notify customer of shipping cost
3     /cart/checkout/step3-payment         Gather payment information
4     /cart/checkout/step4-receipt         Show receipt for completed order


Note: Because the web_logs table uses a RegexSerDe, which is a feature not supported by Impala, this step must be completed in Hive. You may use either the Beeline shell or the Hive Query Editor in Hue.

1. Run the following query in Hive to show the number of requests for each step of the checkout process:

SELECT COUNT(*), request
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY request;

The results of this query highlight a major problem. About one out of every three customers abandons their cart after the second step. This might mean millions of dollars in lost revenue, so let's see if we can determine the cause.

2. The log file's cookie field stores a value that uniquely identifies each user session. Since not all sessions involve checkouts at all, create a new table containing the session ID and number of checkout steps completed for just those sessions that do:

CREATE TABLE checkout_sessions AS
SELECT cookie, ip_address, COUNT(request) AS steps_completed
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY cookie, ip_address;

3. Run this query to show the number of people who abandoned their cart after each step:

SELECT steps_completed, COUNT(cookie) AS num
FROM checkout_sessions
GROUP BY steps_completed;

You should see that most customers who abandoned their order did so after the second step, which is when they first learn how much it will cost to ship their order.


4. Optional: Because the new checkout_sessions table does not use a RegexSerDe, it can be queried in Impala. Try running the same query as in the previous step in Impala. What happens?

Step #2: Use TRANSFORM for IP Geolocation

Based on what you've just seen, it seems likely that customers abandon their carts due to high shipping costs. The shipping cost is based on the customer's location and the weight of the items they've ordered. Although this information isn't in the database (since the order wasn't completed), we can gather enough data from the logs to estimate it.

We don't have the customer's address, but we can use a process known as "IP geolocation" to map the computer's IP address in the log file to an approximate physical location. Since this isn't a built-in capability of Hive, you'll use a provided Python script to TRANSFORM the ip_address field from the checkout_sessions table to a ZIP code, as part of a HiveQL statement that creates a new table called cart_zipcodes.

Regarding TRANSFORM and UDF Examples in this Exercise

During this exercise, you will use a Python script for IP geolocation and a UDF to calculate shipping costs. Both are implemented merely as a simulation – compatible with the fictitious data we use in class and intended to work even when Internet access is unavailable. The focus of these exercises is on how to use external scripts and UDFs, rather than how the code for the examples works internally.

5. Examine the create_cart_zipcodes.hql script and observe the following:

a. It creates a new table called cart_zipcodes based on a SELECT statement.

b. That SELECT statement transforms the ip_address, cookie, and steps_completed fields from the checkout_sessions table using a Python script.

c. The new table contains the ZIP code instead of an IP address, plus the other two fields from the original table.
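
For reference, the general shape of such a statement looks like the sketch below; the actual create_cart_zipcodes.hql script may differ in details such as how the script is made available and the column types used:

ADD FILE hdfs:/dualcore/ipgeolocator.py;

CREATE TABLE cart_zipcodes AS
SELECT TRANSFORM (ip_address, cookie, steps_completed)
USING 'ipgeolocator.py'
AS (zipcode STRING, cookie STRING, steps_completed INT)
FROM checkout_sessions;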

6. Examine the ipgeolocator.py script and observe the following:


a. Records are read from Hive on standard input.

b. The script splits them into individual fields using a tab delimiter.

c. The ip_addr field is converted to zipcode, but the cookie and steps_completed fields are passed through unmodified.

d. The three fields in each output record are delimited with tabs and printed to standard output.

7. Copy the Python file to HDFS so that the HiveServer can access it. You may use the Hue File Browser or the hdfs command:

$ hdfs dfs -put $ADIR/exercises/transform/ipgeolocator.py \

/dualcore/

8. Run the script to create the cart_zipcodes table. You can either paste the code into the Hive Query Editor, or use Beeline in a terminal window:

$ beeline -u jdbc:hive2://localhost:10000 \

-f $ADIR/exercises/transform/create_cart_zipcodes.hql

Step #3: Extract List of Products Added to Each Cart

As described earlier, estimating the shipping cost also requires a list of items in the customer's cart. You can identify products added to the cart since the request URL looks like this (only the product ID changes from one record to the next): /cart/additem?productid=1234567

9. Write a HiveQL statement to create a table called cart_items with two fields, cookie and prod_id, based on data selected from the web_logs table. Keep the following in mind when writing your statement:

a. The prod_id field should contain only the seven-digit product ID (hint: use the REGEXP_EXTRACT function)


b. Use a WHERE clause with REGEXP using the same regular expression as above, so that you only include records where customers are adding items to the cart.

c. If you need a hint on how to write the statement, look at the create_cart_items.hql file in the exercise's sample_solution directory; a rough sketch also appears below.
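
A possible sketch of one approach; the exact regular expression and any type casting may differ from the official create_cart_items.hql solution:

CREATE TABLE cart_items AS
SELECT cookie,
REGEXP_EXTRACT(request, 'productid=([0-9]{7})', 1) AS prod_id
FROM web_logs
WHERE request REGEXP '/cart/additem\\?productid=';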

10. Verify the contents of the new table by running this query:

SELECT COUNT(DISTINCT cookie) FROM cart_items
WHERE prod_id=1273905;

If this doesn't return 47, then compare your statement to the create_cart_items.hql file, make the necessary corrections, and then re-run your statement (after dropping the cart_items table).

Step #4: Create Tables to Join Web Logs with Product Data

YounowhavetablesrepresentingtheZIPcodesandproductsassociatedwithcheckoutsessions,butyou'llneedtojointhesewiththeproductstabletogettheweightoftheseitemsbeforeyoucanestimateshippingcosts.Inordertodosomemoreanalysislater,we’llalsoincludetotalsellingpriceandtotalwholesalecostinadditiontothetotalshippingweightforallitemsinthecart.

11. Run the following HiveQL to create a table called cart_orders with the information:


CREATE TABLE cart_orders AS

SELECT z.cookie, steps_completed, zipcode,

SUM(shipping_wt) AS total_weight,

SUM(price) AS total_price,

SUM(cost) AS total_cost

FROM cart_zipcodes z

JOIN cart_items i

ON (z.cookie = i.cookie)

JOIN products p

ON (i.prod_id = p.prod_id)

GROUP BY z.cookie, zipcode, steps_completed;

Step #5: Create a Table Using a UDF to Estimate Shipping Cost

We finally have all the information we need to estimate the shipping cost for each abandoned order. One of the developers on our team has already written, compiled, and packaged a Hive UDF that will calculate the shipping cost given a ZIP code and the total weight of all items in the order.

12. Before you can use a UDF, you must make it available to Hive. First, copy the file to HDFS so that the Hive Server can access it. You may use the Hue File Browser or the hdfs command:


$ hdfs dfs -put \

$ADIR/exercises/transform/geolocation_udf.jar \

/dualcore/

13. Next, register the function with Hive and provide the name of the UDF class as well as the alias you want to use for the function. Run the Hive command below to associate the UDF with the alias CALC_SHIPPING_COST:

CREATE TEMPORARY FUNCTION CALC_SHIPPING_COST

AS 'com.cloudera.hive.udf.UDFCalcShippingCost'

USING JAR 'hdfs:/dualcore/geolocation_udf.jar';

14. Now create a new table called cart_shipping that will contain the session ID, number of steps completed, total retail price, total wholesale cost, and the estimated shipping cost for each order based on data from the cart_orders table:

CREATE TABLE cart_shipping AS

SELECT cookie, steps_completed, total_price, total_cost,

CALC_SHIPPING_COST(zipcode, total_weight) AS shipping_cost

FROM cart_orders;

15. Finally, verify your table by running the following query to check a record:

SELECT * FROM cart_shipping WHERE cookie='100002920697';

This should show that session as having two completed steps, a total retail price of $263.77, a total wholesale cost of $236.98, and a shipping cost of $9.09.

Note: The total_price, total_cost, and shipping_cost columns in the cart_shipping table contain the number of cents as integers. Be sure to divide results containing monetary amounts by 100 to get dollars and cents.


You may now shut down the Data Analyst VM and launch the Spark VM. Run the following script:

$ ~/scripts/sparkdev/training_setup_sparkdev.sh

This is the end of the Exercise


Hands-On Exercise: View the Spark Documentation

In this exercise, you will familiarize yourself with the Spark documentation.

You must now shut down the Data Analyst VM and launch the Spark VM, if you have not already done so. IMPORTANT: In order to prepare for this exercise, you must run the following command before continuing:

$ ~/scripts/sparkdev/training_setup_sparkdev.sh

1. Start Firefox in your Virtual Machine and visit the Spark documentation on your local machine, using the provided bookmark or opening the URL file:/usr/lib/spark/docs/_site/index.html

2. From the Programming Guides menu, select the Spark Programming Guide. Briefly review the guide. You may wish to bookmark the page for later review.

3. From the API Docs menu, select either Scala or Python, depending on your language preference. Bookmark the API page for use during class. Later exercises will refer you to this documentation.

This is the end of the Exercise


Hands-On Exercise: Use the Spark Shell

In this exercise, you will start the Spark Shell and view the SparkContext object.

You may choose to do this exercise using either Scala or Python. Follow the instructions below for Python, or skip to the next section for Scala.

Most of the later exercises assume you are using Python, but Scala solutions are provided on your virtual machine, so you should feel free to use Scala if you prefer.

Using the Python Spark Shell

1. In a terminal window, start the pyspark shell:

$ pyspark

You may get several INFO and WARNING messages, which you can disregard. If you don't see the In [n]> prompt after a few seconds, hit Return a few times to clear the screen output.

Note: Your environment is set up to use IPython shell by default. If you would prefer to

use the regular Python shell, set IPYTHON=0 before starting pyspark.

4. Spark creates a SparkContext object for you called sc. Make sure the object exists:

pyspark> sc

Pyspark will display information about the sc object such as:

<pyspark.context.SparkContext at 0x2724490>

5. Using command completion, you can see all the available SparkContext methods: type sc. (sc followed by a dot) and then the [TAB] key.

6. You can exit the shell by hitting Ctrl-D or by typing exit.


Using the Scala Spark Shell

7. In a terminal window, start the Scala Spark Shell:

$ spark-shell

You may get several INFO and WARNING messages, which you can disregard. If you don't see the scala> prompt after a few seconds, hit Enter a few times to clear the screen output.

8. Spark creates a SparkContext object for you called sc. Make sure the object exists:

scala> sc

Scala will display information about the sc object such as:

res0: org.apache.spark.SparkContext =

org.apache.spark.SparkContext@2f0301fa

9. Using command completion, you can see all the available SparkContext methods: type sc. (sc followed by a dot) and then the [TAB] key.

10. You can exit the shell by hitting Ctrl-D or typing exit.

This is the end of the Exercise


Hands-On Exercise: Use RDDs to Transform a Dataset

Files and Data Used in This Exercise:

Data files (local):

~/training_materials/sparkdev/data/frostroad.txt

~/training_materials/sparkdev/data/weblogs/2013-09-15.log

Solutions:

~/training_materials/sparkdev/solutions/LogIPs.pyspark

~/training_materials/sparkdev/solutions/LogIPs.scalaspark

In this exercise you will practice using RDDs in the Spark Shell.

You will start by reading a simple text file. Then you will use Spark to explore and transform the Apache web server output logs of the customer service site of a fictional mobile phone service provider called Loudacre.

Loading and Viewing a Text File

1. Review the simple text file we will be using by viewing (without editing) the file in a text editor. The file is located at: ~/training_materials/sparkdev/data/frostroad.txt

2. Start the Spark Shell if you exited it from the previous exercise. You may use either Scala (spark-shell) or Python (pyspark).

3. Define an RDD to be created by reading in a simple test file. For Python, enter:

pyspark> mydata = sc.textFile(\

"file:/home/training/training_materials/sparkdev/\

data/frostroad.txt")

Or for Scala, enter:


scala> val mydata = sc.textFile(
"file:/home/training/training_materials/sparkdev/data/frostroad.txt")

• Note: For the remainder of the Hands-On Exercises, note the color coding and prompt in exercise text snippets to follow the instructions for whichever language you are using.

4. Note that Spark has not yet read the file. It will not do so until you perform an operation on the RDD. Try counting the number of lines in the dataset:

pyspark> mydata.count()

scala> mydata.count()

The count operation causes the RDD to be materialized (created and populated), after which the result (23) is displayed. The example below shows the output for Pyspark (Scala produces the same result, but the output format will differ slightly):

Out[2]: 23

5. Try executing the collect operation to display the data in the RDD. Note that this returns and displays the entire dataset. This is convenient for very small RDDs like this one, but be careful using collect for large datasets, which are common when using Spark.

pyspark> mydata.collect()

scala> mydata.collect()

6. Using command completion, you can see all the available transformations and operations you can perform on an RDD. Type mydata. and then the [TAB] key.


Exploring the Loudacre Web Log Files

In this exercise, you will be using data in ~/training_materials/sparkdev/data/weblogs. Initially you will work with the log file from a single day. Later you will work with the full dataset consisting of many days' worth of logs.

7. Review one of the .log files in the directory. Note the format of the lines, for example:

116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100] "GET /KBDOC-00031.html HTTP/1.0" 200 1388 "http://www.loudacre.com" "Loudacre CSR Browser"

Here 116.180.70.237 is the IP address, 128 is the user ID, and GET /KBDOC-00031.html HTTP/1.0 is the request.

8. In the previous example you used a local data file. In the real world, you will almost always be working with data on the HDFS cluster instead. Create an HDFS directory for the course data, then copy the weblogs dataset into it. In a separate terminal window (not your Spark shell), execute:

$ hdfs dfs -mkdir /loudacre

$ hdfs dfs -put \

~/training_materials/sparkdev/data/weblogs/ \

/loudacre/

9. In your Spark Shell, set a variable for the data file so you do not have to retype it each time.

pyspark> logfile="/loudacre/weblogs/2013-09-15.log"

scala> val logfile="/loudacre/weblogs/2013-09-15.log"

10. Create an RDD from the data file.


pyspark> logs = sc.textFile(logfile)

scala> val logs = sc.textFile(logfile)

11. Create an RDD containing only those lines that are requests for JPG files.

pyspark> jpglogs=\

logs.filter(lambda x: ".jpg" in x)

scala> val jpglogs = logs.

filter(line => line.contains(".jpg"))

12. View the first 10 lines of the data using take:

pyspark> jpglogs.take(10)

scala> jpglogs.take(10)

13. Sometimes you do not need to store intermediate data in a variable, in which case you can combine the steps into a single line of code. For instance, if all you need is to count the number of JPG requests, you can execute this in a single command:

pyspark> sc.textFile(logfile).filter(lambda x: \

".jpg" in x).count()

scala> sc.textFile(logfile).filter(line =>

line.contains(".jpg")).count()

14. Now try using the map function to define a new RDD. Start with a very simple map that returns the length of each line in the log file.


pyspark> logs.map(lambda s: len(s)).take(5)

scala> logs.map(line => line.length).take(5)

This prints out an array of five integers corresponding to the first five lines in the file.

15. That's not very useful. Instead, try mapping to an array of words for each line:

pyspark> logs.map(lambda s: s.split()).take(5)

scala> logs.map(line => line.split(' ')).take(5)

This time it prints out five arrays, each containing the words in the corresponding log file line.

16. Now that you know how map works, define a new RDD containing just the IP addresses from each line in the log file. (The IP address is the first field in each line.)

pyspark> ips = logs.map(lambda s: s.split()[0])

pyspark> ips.take(5)

scala> val ips = logs.map(line => line.split(' ')(0))

scala> ips.take(5)

17. Although take and collect are useful ways to look at data in an RDD, their output is sometimes not very readable. Fortunately, though, they return arrays, which you can iterate through:

pyspark> for x in ips.take(5): print x

scala> ips.take(5).foreach(println)


18. Finally, save the list of IP addresses as a text file:

pyspark> ips.saveAsTextFile("/loudacre/iplist")

scala> ips.saveAsTextFile("/loudacre/iplist")

19. In a terminal window, list the contents of the /loudacre/iplist directory in HDFS:

$ hdfs dfs -ls /loudacre/iplist

20. You should see multiple files. The one you care about is part-00000, which should contain the list of IP addresses. "Part" (partition) files are numbered because there may be results from multiple tasks running on the cluster; you will learn more about this later.

If You Have More Time

If you have more time, attempt the following challenges:

21. Challenge 1: As you did in the previous step, save a list of IP addresses, but this time, use the whole web log dataset (weblogs/*) instead of a single day's log.

• Tip: You can use the up-arrow to edit and execute previous commands. You should only need to modify the lines that read and save the files.

22. Challenge 2: Use RDD transformations to create a dataset consisting of the IP address and corresponding user ID for each request for an HTML file. (Disregard requests for other file types.) The user ID is the third field in each log file line.

Display the data in the form ipaddress/userid, such as:


165.32.101.206/8

100.219.90.44/102

182.4.148.56/173

246.241.6.175/45395

175.223.172.207/4115
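For reference, one possible pyspark approach is sketched below. It is a sketch only (the provided LogIPs solution files may differ), and it assumes that matching the substring ".html" is enough to identify HTML requests in this dataset:

# Sketch for Challenge 2: (ipaddress/userid) for each HTML request.
htmlreqs = sc.textFile("/loudacre/weblogs/*") \
    .filter(lambda line: ".html" in line) \
    .map(lambda line: line.split()[0] + "/" + line.split()[2])

for entry in htmlreqs.take(10):
    print entry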

This is the end of the Exercise


Hands-On Exercise: Process Data Files with Spark

Files and Data Used in This Exercise:

Data files (local):

~/training_materials/sparkdev/data/activations/*

~/training_materials/sparkdev/data/devicestatus.txt (Bonus)

Stubs: stubs/ActivationModels.pyspark

stubs/ActivationModels.scalaspark

Solutions: solutions/ActivationModels.pyspark

solutions/ActivationModels.scalaspark

solutions/DeviceStatusETL.pyspark (Bonus)

solutions/DeviceStatusETL.scalaspark (Bonus)

In this exercise you will parse a set of activation records in XML format to extract the account numbers and model names.

One of the common uses for Spark is doing data Extract/Transform/Load operations. Sometimes data is stored in line-oriented records, like the web logs in the previous exercise, but sometimes the data is in a multi-line format that must be processed as a whole file. In this exercise you will practice working with file-based instead of line-based formats.

Reviewing the API Documentation for RDD Operations

Visit the Spark API page you bookmarked previously. Follow the link at the top for the RDD class and review the list of available methods.

The Data

1. Review the data in activations (in the course data directory). Each XML file contains data for all the devices activated by customers during a specific month.


Sample input data:

<activations>

<activation timestamp="1225499258" type="phone">

<account-number>316</account-number>

<device-id>

d61b6971-33e1-42f0-bb15-aa2ae3cd8680

</device-id>

<phone-number>5108307062</phone-number>

<model>iFruit 1</model>

</activation>

</activations>

2. Copy this data to HDFS:

$ hdfs dfs -put \

~/training_materials/sparkdev/data/activations \

/loudacre/

The Task

Your code should go through a set of activation XML files, extract the account number and device model for each activation, and save the list to a file as account_number:model.

The output will look something like:


1234:iFruit 1

987:Sorrento F00L

4566:iFruit 1

3. Start with the ActivationModels stub script. (A stub is provided for Scala and Python; use whichever language you prefer.) Note that for convenience you have been provided with functions to parse the XML, as that is not the focus of this exercise. Copy the stub code into the Spark Shell.

4. Use wholeTextFiles to create an RDD from the activations dataset. The resulting RDD will consist of tuples, in which the first value is the name of the file, and the second value is the contents of the file (XML) as a string.

5. Each XML file can contain many activation records; use flatMap to map the contents of each file to a collection of XML records by calling the provided getactivations function. getactivations takes an XML string, parses it, and returns a collection of XML records; flatMap maps each record to a separate RDD element.

6. Map each activation record to a string in the format account-number:model. Use the provided getaccount and getmodel functions to find the values from the activation record.

7. Save the formatted strings to a text file in the directory /loudacre/account-models.
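Putting steps 4–7 together, a minimal pyspark sketch might look like the following. It assumes the getactivations, getaccount, and getmodel functions from the provided stub have already been pasted into the shell; the ActivationModels solution file is the authoritative version.

# Sketch only -- relies on the XML parsing helpers defined in the stub.
actfiles = sc.wholeTextFiles("/loudacre/activations")        # (filename, file contents) pairs
activations = actfiles.flatMap(lambda pair: getactivations(pair[1]))
accountmodels = activations.map(lambda record: getaccount(record) + ":" + getmodel(record))
accountmodels.saveAsTextFile("/loudacre/account-models")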

Bonus Exercise

Another common part of the ETL process is data scrubbing. In this bonus exercise, you will process data in order to get it into a standardized format for later processing.

Review the contents of the data file devicestatus.txt. This file contains data collected from mobile devices on Loudacre's network, including device ID, current status, location, and so on. Because Loudacre previously acquired other mobile providers' networks, the data from different subnetworks has a different format. Note that the records in this file


have different field delimiters: some use commas, some use pipes (|), and so on. Your tasks are to:

• Load the dataset

• Determine which delimiter to use (hint: the character at position 19 is the first use of the delimiter)

• Filter out any records which do not parse correctly (hint: each record should have exactly 14 values)

• Extract the date (first field), model (second field), device ID (third field), and latitude and longitude (13th and 14th fields respectively)

• The second field contains the device manufacturer and model name (e.g., Ronin S2). Split this field by spaces to separate the manufacturer from the model (e.g., manufacturer Ronin, model S2).

• Save the extracted data to comma-delimited text files in the /loudacre/devicestatus_etl directory on HDFS.

• Confirm that the data in the file(s) was saved correctly.
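A minimal pyspark sketch of one way to approach this bonus follows. Field positions are taken from the bullet list above; the provided DeviceStatusETL solution file is the authoritative version and may keep a slightly different set of fields.

# Sketch only -- the delimiter is the character at position 19 of each line.
def split_record(line):
    return line.split(line[19])

def extract_fields(fields):
    # date, manufacturer, model, device ID, latitude, longitude
    maker_model = fields[1].split(' ', 1)
    manufacturer = maker_model[0]
    model = maker_model[1] if len(maker_model) > 1 else ''
    return (fields[0], manufacturer, model, fields[2], fields[12], fields[13])

devstatus = sc.textFile("/loudacre/devicestatus.txt")
devstatus.map(split_record) \
    .filter(lambda fields: len(fields) == 14) \
    .map(extract_fields) \
    .map(lambda values: ','.join(values)) \
    .saveAsTextFile("/loudacre/devicestatus_etl")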

This is the end of the Exercise


Hands-On Exercise: Use Pair RDDs to Join Two Datasets

Files Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

Data files (local):

~/training_materials/sparkdev/data/accounts.csv

Solution: solutions/UserRequests.pyspark

solutions/UserRequests.scalaspark

In this exercise you will continue exploring the Loudacre web server log files, as well as the Loudacre user account data, using key-value Pair RDDs.

Exploring Web Log Files

Continue working with the web log files, as in the previous exercise.

Tip: In this exercise you will be reducing and joining large datasets, which can take a lot of time. You may wish to perform the exercises below using a smaller dataset, consisting of only a few of the web log files, rather than all of them. Remember that you can specify a wildcard; textFile("/loudacre/weblogs/*6.log") would include only filenames ending with the digit 6 and having a log file extension.

1. Using map and reduce, count the number of requests from each user.

a. Use map to create a Pair RDD with the user ID as the key, and the integer 1 as the value. (The user ID is the third field in each line.) Your data will look something like this:

(userid,1) (userid,1) (userid,1) …


b. Use reduce to sum the values for each user ID. Your RDD data will be similar to:

(userid,5) (userid,7) (userid,2) …

2. Use countByKey to determine how many users visited the site for each frequency. That is, how many users visited once, twice, three times, and so on.

a. Use map to reverse the key and value, like this:

(5,userid) (7,userid) (2,userid) …

b. Use the countByKey action to return a Map of frequency:user-count pairs.

3. Create an RDD where the user ID is the key, and the value is the list of all the IP addresses that user has connected from. (The IP address is the first field in each request line.)

• Hint: Map to (userid, ipaddress) pairs, such as:

(userid,20.1.34.55) (userid,245.33.1.1) (userid,65.50.196.141) …

and then use groupByKey to produce records like:

(userid,[20.1.34.55, 74.125.239.98]) (userid,[75.175.32.10, 245.33.1.1, 66.79.233.99]) (userid,[65.50.196.141]) …
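A condensed pyspark sketch of steps 1–3 follows (the UserRequests solution file covers the same ground and may differ in detail):

# Sketch only -- per-user hit counts, visit-frequency summary, and IP lists.
weblogs = sc.textFile("/loudacre/weblogs/*")                  # or a smaller subset, as suggested above
userreqs = weblogs.map(lambda line: (line.split()[2], 1)) \
                  .reduceByKey(lambda v1, v2: v1 + v2)        # step 1: (userid, hitcount)
freqcounts = userreqs.map(lambda pair: (pair[1], pair[0])) \
                     .countByKey()                            # step 2: {hitcount: number of users}
userips = weblogs.map(lambda line: (line.split()[2], line.split()[0])) \
                 .groupByKey()                                # step 3: (userid, [ip, ip, ...])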


Joining Web Log Data with Account Data

4. Copy the accounts.csv data file to HDFS:

$ hdfs dfs -put \

~/training_materials/sparkdev/data/accounts.csv \

/loudacre/

This dataset consists of information about Loudacre's user accounts. The first field in each line is the user ID, which corresponds to the user ID in the web server logs. The other fields include account details such as creation date, first and last name, and so on.

5. Join the accounts data with the weblog data to produce a dataset keyed by user ID which contains the user account information and the number of website hits for that user.

a. Map the accounts data to key/value-list pairs: (userid, [values…]), such as:

(userid1,[userid1,2008-11-24 10:04:08,\N,Cheryl,West,4905 Olive Street,San Francisco,CA,…]) (userid2,[userid2,2008-11-23 14:05:07,\N,Elizabeth,Kerns,4703 Eva Pearl Street,Richmond,CA,…]) (userid3,[userid3,2008-11-02 17:12:12,2013-07-18 16:42:36,Melissa,Roman,3539 James Martin Circle,Oakland,CA,…]) …

b. Join the Pair RDD with the set of userid/hit counts calculated in the first step, producing records like:

(userid1,([userid1,2008-11-24 10:04:08,\N,Cheryl,West,4905 Olive Street,San Francisco,CA,…],4)) (userid2,([userid2,2008-11-23 14:05:07,\N,Elizabeth,Kerns,4703 Eva Pearl Street,Richmond,CA,…],8)) (userid3,([userid3,2008-11-02 17:12:12,2013-07-18 16:42:36,Melissa,Roman,3539 James Martin Circle,Oakland,CA,…],1)) …


c. Display the user ID, hit count, and first name (3rd value) and last name (4th value) for the first 5 elements, e.g.:

userid1 4 Cheryl West

userid2 8 Elizabeth Kerns

userid3 1 Melissa Roman
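A sketch of step 5 in pyspark, assuming the userreqs RDD from step 1 is still defined (field positions follow the accounts.csv layout shown above):

# Sketch only -- join account details with per-user hit counts.
accounts = sc.textFile("/loudacre/accounts.csv") \
             .map(lambda line: (line.split(',')[0], line.split(',')))  # (userid, [values...])
accounthits = accounts.join(userreqs)                                  # (userid, ([values...], hitcount))

for userid, (values, hitcount) in accounthits.take(5):
    print userid, hitcount, values[3], values[4]                       # first and last name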

Bonus Exercises

If you have more time, attempt the following challenges:

6. Challenge 1: Use keyBy to create an RDD of account data with the postal code (9th field in the CSV file) as the key.

• Tip: Assign this new RDD to a variable for use in the next challenge.

7. Challenge 2: Create a pair RDD with postal code as the key and a list of names (Last Name,First Name) in that postal code as the value.

• Hint: First name and last name are the 4th and 5th fields respectively.

• Optional: Try using the mapValues operation.

8. Challenge 3: Sort the data by postal code, then for the first five postal codes, display the code and list the names in that postal zone, such as:


--- 85003

Jenkins,Thad

Rick,Edward

Lindsay,Ivy

--- 85004

Morris,Eric

Reiser,Hazel

Gregg,Alicia

Preston,Elizabeth
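One possible pyspark sketch for the three bonus challenges (field positions assume the accounts.csv layout shown earlier; the class solution may differ):

# Sketch only -- key accounts by postal code, build name lists, show the first five codes.
accountsByPCode = sc.textFile("/loudacre/accounts.csv") \
    .map(lambda line: line.split(',')) \
    .keyBy(lambda values: values[8])                        # Challenge 1: 9th field is the postal code
namesByPCode = accountsByPCode \
    .mapValues(lambda values: values[4] + ',' + values[3]) \
    .groupByKey()                                           # Challenge 2: (postalcode, [Last,First ...])
for pcode, names in namesByPCode.sortByKey().take(5):       # Challenge 3: first five postal codes
    print '---', pcode
    for name in names:
        print name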

This is the end of the Exercise


Hands-On Exercise: Write and Run a Spark Application

Files and Directories Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

Scala Project Directory: projects/countjpgs

Scala Classes: stubs.CountJPGs

solution.CountJPGs

Python Stub: stubs/CountJPGs.py

Python Solution: solutions/CountJPGs.py

In this exercise, you will write your own Spark application instead of using the interactive Spark Shell application.

Write a simple program that counts the number of JPG requests in a web log file. The name of the file should be passed into the program as an argument.

This is the same task you did earlier in the "Getting Started With RDDs" exercise. The logic is the same, but this time you will need to set up the SparkContext object yourself.

Depending on which programming language you are using, follow the appropriate set of instructions below to write a Spark program.

Before running your program, be sure to exit from the Spark Shell.


Writing a Spark Application in Python

You may use any text editor you wish. If you don’t have an editor preference, you may

wish to use gedit, which includes language-specific support for Python.

1. A simple stub file to get started has been provided: ~/training_materials/sparkdev/stubs/CountJPGs.py. This stub imports the required Spark class and sets up your main code block. Copy this stub to your work area and edit it to complete this exercise.

2. Set up a SparkContext using the following code:

sc = SparkContext()

3. In the body of the program, load the file passed into the program, count the number of JPG requests, and display the count. You may wish to refer back to the "Getting Started with RDDs" exercise for the code to do this.

4. At the end of the program, be sure to call:

sc.stop()

5. Run the program locally, passing the name of the log file to process, such as:

$ spark-submit CountJPGs.py /loudacre/weblogs/*

6. Skip the Scala instructions below and proceed to the section "Starting the Spark Standalone Cluster."
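For reference, a minimal sketch of what the finished Python program might look like (the provided stub and solution may structure it differently):

# CountJPGs.py -- sketch only: count JPG requests in the file(s) named on the command line.
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print >> sys.stderr, "Usage: CountJPGs.py <logfile>"
        sys.exit(1)

    sc = SparkContext()
    count = sc.textFile(sys.argv[1]) \
              .filter(lambda line: ".jpg" in line) \
              .count()
    print "Number of JPG requests:", count
    sc.stop()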

Writing a Spark Application in Scala

You may use any text editor you wish. If you don’t have an editor preference, you may

wish to use gedit, which includes language-specific support for Scala. If you are familiar

with the IntelliJ IDEA IDE, you may choose to use that; the provided project directories

include IntelliJ configuration.


A Maven project to get started has been provided in the projects/countjpgs directory.

7. Edit the Scala code in src/main/scala/stubs/CountJPGs.scala.

8. Set up a SparkContext using the following code:

val sc = new SparkContext()

9. In the body of the program, load the file passed into the program, count the number of JPG requests, and display the count. You may wish to refer back to the "Getting Started with RDDs" exercise for the code to do this.

10. At the end of the program, be sure to call:

sc.stop

11. From the countjpgs working directory, build your project using the following command:

$ mvn package

12. If the build is successful, it will generate a JAR file called countjpgs-1.0.jar in countjpgs/target. Run the program using the following command:

$ spark-submit \

--class stubs.CountJPGs \

target/countjpgs-1.0.jar /loudacre/weblogs/*

• Note: Use --class solution.CountJPGs to run the solution instead.

Starting the Spark Standalone Cluster

13. In a terminal window, start the Spark Master and Spark Worker daemons:


$ sudo service spark-master start

$ sudo service spark-worker start

Note: You can stop the services by replacing start with stop, or force the service to restart by using restart. You may need to do this if you suspend and restart the VM.

14. View the Spark Standalone Cluster UI: Start Firefox on your VM and visit the Spark Master UI by using the provided bookmark or visiting http://localhost:18080/.

15. You should not see any applications in the Running Applications or Completed Applications areas because you have not run any applications on the cluster yet.

16. A real-world Spark cluster would have several workers configured. In this class we have just one, running locally, which is named by the date it started, the host it is running on, and the port it is listening on.

17. Click on the worker ID link to view the Spark Worker UI and note that there are no executors currently running on the node.

18. In the previous section, you ran your application locally, because you did not specify a master when starting it. Re-run the program, specifying the cluster master in order to run it on the cluster.

For Python:

$ spark-submit --master spark://localhost:7077 \

CountJPGs.py /loudacre/weblogs/*

For Scala:


$ spark-submit \

--class stubs.CountJPGs \

--master spark://localhost:7077 \

target/countjpgs-1.0.jar /loudacre/weblogs/*

19. Visit the Standalone Spark Master UI and confirm that the program is running on the cluster.

This is the end of the Exercise


Hands-On Exercise: Configure a Spark Application

Files Used in This Exercise:

Data files (HDFS): /loudacre/weblogs

Properties files (local): spark.conf

log4j.properties

In this exercise, you will practice setting various Spark configuration options.

You will work with the CountJPGs program you wrote in the prior exercise.

Setting Configuration Options at the Command Line

1. Re-run the CountJPGs Python or Scala program you wrote in the previous exercise, this time specifying an application name. For example:

$ spark-submit --master spark://localhost:7077 \

--name 'Count JPGs' \

CountJPGs.py /loudacre/weblogs/*

$ spark-submit \

--class stubs.CountJPGs \

--master spark://localhost:7077 \

--name 'Count JPGs' \

target/countjpgs-1.0.jar /loudacre/weblogs/*

2. Visit the Standalone Spark Master UI (http://localhost:18080/) and note that the application name listed is the one specified in the command line.


3. Optional: While the application is running, visit the Spark Application UI and view the Environment tab. Take note of the spark.* properties such as master, appName, and driver properties.

Setting Configuration Options in a Configuration File

4. Change directories to your working directory. (If you are working in Scala, that is the countjpgs project directory.)

5. Using a text editor, create a file in the working directory called myspark.conf, containing settings for the properties shown below:

spark.app.name My Spark App

spark.master yarn-client

spark.executor.memory 400M

6. Re-run your application, this time using the properties file instead of using the script options to configure Spark properties:

$ spark-submit --properties-file myspark.conf \

CountJPGs.py /loudacre/weblogs/*

$ spark-submit --properties-file myspark.conf \

--class stubs.CountJPGs \

target/countjpgs-1.0.jar /loudacre/weblogs/*

7. While the application is running, view the Standalone Spark Master UI to confirm that the application name is correctly displayed as "My Spark App".


Setting Logging Levels

8. Copy the template file /etc/spark/conf/log4j.properties.template to log4j.properties in your working directory.

9. Edit log4j.properties. The first line currently reads:

log4j.rootCategory=INFO, console

Replace INFO with DEBUG:

log4j.rootCategory=DEBUG, console

10. Re-run your Spark application. Because the current directory is on the Java classpath, your log4j.properties file will set the logging level to DEBUG.

11. Notice that the output now contains both the INFO messages it did before and DEBUG messages, similar to what is shown below:

15/03/19 11:40:45 INFO MemoryStore: ensureFreeSpace(154293) called with

curMem=0, maxMem=311387750

15/03/19 11:40:45 INFO MemoryStore: Block broadcast_0 stored as values to

memory (estimated size 150.7 KB, free 296.8 MB)

15/03/19 11:40:45 DEBUG BlockManager: Put block broadcast_0 locally took

79 ms

15/03/19 11:40:45 DEBUG BlockManager: Put for block broadcast_0 without

replication took 79 ms

Debug logging can be useful when debugging, testing, or optimizing your code, but in most cases it generates unnecessarily distracting output.

12. Edit the log4j.properties file to replace DEBUG with WARN and try again. This time notice that no INFO or DEBUG messages are displayed; only WARN messages.

13. You can also set the log level for the Spark Shell by placing the log4j.properties file in your working directory before starting the shell. Try starting the shell from the directory in which you placed the file and note that only WARN messages now appear.


Note: During the rest of the exercises, you may change these settings depending on whether you find the extra logging messages helpful or distracting.

This is the end of the Exercise


Hands-On Exercise: View Jobs and Stages in the Spark Application UI

Files and Data Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

/loudacre/accounts.csv

Solutions: solutions/SparkStages.pyspark

solutions/SparkStages.scalaspark

In this exercise you will use the Spark Application UI to view the execution stages for a job.

In a previous exercise, you wrote a script in the Spark Shell to join data from the accounts dataset with the weblogs dataset, in order to determine the total number of web hits for every account. Now you will explore the stages and tasks involved in that job.

Exploring Partitioning of File-Based RDDs

1. Start (or restart, if necessary) the Spark Shell. Although you would typically run a Spark application on a cluster, your course VM cluster has only a single worker node that can support only a single executor. To simulate a more realistic multi-node cluster, run in local mode with two threads. For Python:

$ pyspark --master local[2]

or for Scala:

$ spark-shell --master local[2]


2. Create an RDD based on the accounts data file (/loudacre/accounts.csv) and then call toDebugString on the RDD, which displays the number of partitions in parentheses () before the RDD ID. How many partitions are in the resulting RDD?

pyspark> accounts=sc.textFile("/loudacre/accounts.csv")

pyspark> print accounts.toDebugString()

scala> val accounts=sc.

textFile("/loudacre/accounts.csv")

scala> accounts.toDebugString

3. Repeat this process, but specify a minimum of three partitions: textFile(filename, 3). Does the RDD correctly have three partitions?

4. Create another RDD based on all the weblogs dataset files (/loudacre/weblogs/*) and then call toDebugString on the RDD. How many partitions are in the weblogs RDD?

pyspark> weblogs=sc.textFile("/loudacre/weblogs/*")

pyspark> print weblogs.toDebugString()

scala> val weblogs=sc.textFile("/loudacre/weblogs/*")

scala> weblogs.toDebugString

How does the number of files in the dataset compare to the number of partitions in the RDD?

5. Repeat this process, but specify only a subset of the files: those for the month of October in 2013, /loudacre/weblogs/2013-10-*.log.

6. Bonus: Use foreachPartition to print out the first record of each partition.
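A possible sketch for the bonus, assuming the weblogs RDD defined above (in local mode the output appears in the shell; on a real cluster it would go to the executor logs):

# Sketch only -- print the first record of each partition.
def print_first(records):
    for record in records:
        print record
        break                      # stop after the first record in this partition

weblogs.foreachPartition(print_first)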


Setting up the Job

7. First, create an RDD of accounts, keyed by ID and with first name, last name for the value.

pyspark> accountsByID = accounts \

.map(lambda s: s.split(',')) \

.map(lambda values: \

(values[0],values[4] + ',' + values[3]))

scala> val accountsByID = accounts.

map(line => line.split(',')).

map(values => (values(0),values(4)+','+values(3)))

8. Construct an RDD with the total number of web hits for each user ID:

pyspark> userreqs = weblogs \

.map(lambda line: line.split()) \

.map(lambda words: (words[2],1)) \

.reduceByKey(lambda v1,v2: v1+v2)

scala> val userreqs = weblogs.

map(line => line.split(' ')).

map(words => (words(2),1)).

reduceByKey((v1,v2) => v1 + v2)

9. Then join the two RDDs by user ID, and construct a new RDD based on first name, last name, and total hits:

pyspark> accounthits = accountsByID.join(userreqs)\

.values()


scala> val accounthits =

accountsByID.join(userreqs).values

10. Print the results of accounthits.toDebugString and review the output. Based on this, see if you can determine:

a. How many stages are in this job?

b. Which stages are dependent on which?

c. How many tasks will each stage consist of?

Running and Reviewing the Job in the Spark Application UI

11. In your browser, visit the Spark Application UI by using the provided toolbar bookmark, or by visiting http://localhost:4040/.

12. In the Spark UI, make sure the Jobs tab is selected. No jobs are yet running so the list will be empty.

13. Return to the shell and start the job by executing an action (saveAsTextFile):

pyspark> accounthits.\

saveAsTextFile("/loudacre/userreqs")

scala> accounthits.

saveAsTextFile("/loudacre/userreqs")

14. Reload the Spark UI Jobs page in your browser. Your job will appear in the Active Jobs list until it completes, and then it will display in the Completed Jobs list.


15. Click on the job description (which is the last action in the job) to see the stages. As the job progresses you may want to refresh the page a few times.

Things to note:

a. How many stages are in the job? Does it match the number you expected from the RDD's toDebugString output?

b. The stages are numbered, but the numbers do not relate to the order of execution. Note the times the stages were submitted to determine the order. Does the order match what you expected based on RDD dependency?

c. How many tasks are in each stage? The number of tasks in the first stages corresponds to the number of partitions.

d. The Shuffle Read and Shuffle Write columns indicate how much data was copied between tasks. This is useful to know because copying too much data across the network can cause performance issues.

16. Click on a stage to view details about that stage.

Things to note:

a. The Summary Metrics area shows you how much time was spent on various steps. This can help you narrow down performance problems.

b. The Tasks area lists each task. The Locality Level column indicates whether the process ran on the same node where the partition was physically stored or not. Remember that Spark will attempt to always run tasks where the data is, but may not always be able to, if the node is busy.

c. In a real-world cluster, the executor column in the Tasks area would display the different worker nodes that ran the tasks. In this single-node cluster, all tasks run on the same host: localhost.

17. When the job is complete, return to the Jobs tab to see the final statistics for the number of tasks executed and the time the job took.

18. Optional: Try re-running the last action. You will need to either delete the saveAsTextFile output directory in HDFS, or specify a different directory name.


You will probably find that the job completes much faster, and that several stages (and the tasks in them) show as "skipped."

Bonus question: Which tasks were skipped and why?

Leave the Spark Shell running for the next exercise.

This is the end of the Exercise


Hands-On Exercise: Persist an RDD

Files and Data Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

/loudacre/accounts.csv

Job Setup: solutions/SparkStages.pyspark

solutions/SparkStages.scalaspark

In this exercise you will explore the performance effect of caching (that is, persisting to memory) an RDD.

1. Make sure the Spark Shell is still running from the last exercise. If it isn't, restart it (in local mode with two threads) and paste in the job setup code from the solution file or the previous exercise.

2. This time, to start the job, you are going to perform a slightly different action than last time: count the number of user accounts with a total hit count greater than five:

pyspark> accounthits\

.filter(lambda (firstlast,hitcount): hitcount > 5)\

.count()

scala> accounthits.filter(pair => pair._2 > 5).count()

3. Cache (persist to memory) the RDD by calling accounthits.persist().

4. In your browser, view the Spark Application UI and select the Storage tab. At this point, you have marked your RDD to be persisted, but have not yet performed an action that would cause it to be materialized and persisted, so you will not yet see any persisted RDDs.

5. In the Spark Shell, execute the count again.


6. View the RDD's toDebugString. Notice that the output indicates the persistence level selected.

7. Reload the Storage tab in your browser, and this time note that the RDD you persisted is shown. Click on the RDD ID to see details about partitions and persistence.

8. Click on the Executors tab and take note of the amount of memory used and available for your one worker node.

Note that the classroom environment has a single worker node with a small amount of memory allocated, so you may see that not all of the dataset is actually cached in memory. In the real world, for good performance a cluster will have more nodes, each with more memory, so that more of your active data can be cached.

9. Optional: Set the RDD's persistence level to StorageLevel.DISK_ONLY and compare the storage report in the Spark Application Web UI. Hint: Because you have already persisted the RDD at a different level, you will need to unpersist() first before you can set a new level.
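A sketch of the optional step in pyspark:

pyspark> from pyspark import StorageLevel
pyspark> accounthits.unpersist()
pyspark> accounthits.persist(StorageLevel.DISK_ONLY)
pyspark> accounthits.count()      # re-run an action so the RDD is materialized again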

This is the end of the Exercise


Hands-On Exercise: Implement an Iterative Algorithm

Files and Data Used in This Exercise:

Data files (HDFS): /loudacre/devicestatus_etl/

Stubs: stubs/KMeansCoords.pyspark

stubs/KMeansCoords.scalaspark

Solutions: solutions/KMeansCoords.pyspark

solutions/KMeansCoords.scalaspark

In this exercise, you will practice implementing iterative algorithms in Spark by calculating k-means for a set of points.

Reviewing the Data

In the bonus section of the "Use RDDs to Transform a Dataset" exercise, you used Spark to extract the date, maker, device ID, latitude, and longitude from the devicestatus.txt data file, and store the results in the HDFS directory /loudacre/devicestatus_etl.

If you did not have time to complete that bonus exercise, run the solution script now, following the two steps below.

• Copy ~/training_materials/sparkdev/data/devicestatus.txt to the /loudacre/ directory in HDFS.

• Run the Spark script ~/training_materials/sparkdev/solutions/DeviceStatusETL (either .pyspark or .scalaspark, depending on which language you are using).


Examine the data in the dataset. Note that the latitude and longitude are the 4th and 5th fields, respectively, such as:

2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-28f2342679af,33.6894754264,-117.543308253

2014-03-15:10:10:20,MeeToo,ef8c7564-0a1a-4650-a655-c8bbd5f8f943,37.4321088904,-121.485029632

Calculate k-means for Device Location

If you are already familiar with calculating k-means, try doing the exercise on your own. Otherwise, follow the step-by-step process below.

1. Start by copying the provided KMeansCoords stub file, which contains the following convenience functions used in calculating k-means:

• closestPoint: given a (latitude/longitude) point and an array of current center points, returns the index in the array of the center closest to the given point

• addPoints: given two points, return a point which is the sum of the two points – that is, (x1+x2, y1+y2)

• distanceSquared: given two points, returns the squared distance of the two. This is a common calculation required in graph analysis.

2. Set the variable K (the number of means to calculate). For this exercise use K = 5.

3. Set the variable convergeDist. This will be used to decide when the k-means calculation is done – when the amount the locations of the means change between iterations is less than convergeDist. A "perfect" solution would be 0; this number represents a "good enough" solution. For this exercise, use a value of 0.1.

4. Parse the input file, which is delimited by a comma character, into (latitude, longitude) pairs (the 4th and 5th fields in each line). Only include known locations (that is, filter out (0,0) locations). Be sure to persist (cache) the resulting RDD because you will access it each time through the iteration.


5. Create a K-length array called kPoints by taking a random sample of K location points from the RDD as starting means (center points); for example:

data.takeSample(False, K, 42)

6. Iteratively calculate a new set of K means until the total distance between the means calculated for this iteration and the last is smaller than convergeDist. For each iteration:

a. For each coordinate point, use the provided closestPoint function to map each point to the index in the kPoints array of the location closest to that point. The resulting RDD should be keyed by the index, and the value should be the pair (point, 1). The value "1" will later be used to count the number of points closest to a given mean; for example:

(1, ((37.43210, -121.48502), 1))

(4, ((33.11310, -111.33201), 1))

(0, ((39.36351, -119.40003), 1))

(1, ((40.00019, -116.44829), 1))

b. Reduce the result: for each center in the kPoints array, sum the latitudes and longitudes, respectively, of all the points closest to that center, and the number of closest points. For example:

(0, ((2638919.87653,-8895032.182481), 74693))

(1, ((3654635.24961,-12197518.55688), 101268))

(2, ((1863384.99784,-5839621.052003), 48620))

(3, ((4887181.82600,-14674125.94873), 126114))

(4, ((2866039.85637,-9608816.13682), 81162))

c. The reduced RDD should have (at most) K members. Map each to a new center point by calculating the average latitude and longitude for each set of closest points: that is, map (index, ((totalX, totalY), n)) to (index, (totalX/n, totalY/n)).

d. Collect these new points into a local map or array keyed by index.

e. Use the provided distanceSquared method to calculate how much each center "moved" between the current iteration and the last. That is, for each center in kPoints, calculate the distance between that point and the corresponding new point, and sum those distances. That is the delta between iterations; when the delta is less than convergeDist, stop iterating.

f. Copy the new center points to the kPoints array in preparation for the next iteration.

7. When the iteration is complete, display the final K center points.
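Tying the steps together, a condensed pyspark sketch of the iteration might look like the following. It relies on the closestPoint, addPoints, and distanceSquared functions from the provided stub, and on the field positions described above; the KMeansCoords solution file is the authoritative version.

# Sketch only -- k-means over (latitude, longitude) pairs.
K = 5
convergeDist = 0.1

# Latitude and longitude are the 4th and 5th comma-delimited fields; drop unknown (0,0) locations.
points = sc.textFile("/loudacre/devicestatus_etl/*") \
    .map(lambda line: line.split(',')) \
    .map(lambda fields: (float(fields[3]), float(fields[4]))) \
    .filter(lambda point: point != (0.0, 0.0)) \
    .persist()

kPoints = points.takeSample(False, K, 42)       # starting center points
tempDist = float("inf")

while tempDist > convergeDist:
    # a. index of the closest current center, paired with (point, 1)
    closest = points.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    # b. sum the coordinates and the counts of all points assigned to each center
    pointStats = closest.reduceByKey(
        lambda a, b: (addPoints(a[0], b[0]), a[1] + b[1]))
    # c/d. new center = average of the assigned points, collected to the driver
    newPoints = pointStats.map(
        lambda pair: (pair[0], (pair[1][0][0] / pair[1][1],
                                pair[1][0][1] / pair[1][1]))).collect()
    # e. total movement of the centers in this iteration
    tempDist = sum(distanceSquared(kPoints[index], p) for (index, p) in newPoints)
    # f. install the new centers for the next iteration
    for (index, p) in newPoints:
        kPoints[index] = p

print "Final center points:", kPoints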

This is the end of the Exercise


Hands-On Exercise: Use Broadcast Variables

Files Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

Data files (local):

~/training_materials/sparkdev/data/targetmodels.txt

Stubs: stubs/TargetModels.pyspark

stubs/TargetModels.scalaspark

Solutions: solutions/TargetModels.pyspark

solutions/TargetModels.scalaspark

In this exercise, you will filter web requests to include only those from devices included in a list of target models.

Loudacre wants to do some analysis on web traffic produced from specific devices. The list of target models is in ~/training_materials/sparkdev/data/targetmodels.txt.

Filter the web server logs to include only those requests from devices in the list. The model name of the device appears in each line of the log file. Use a broadcast variable to pass the list of target devices to the workers that will run the filter tasks.

Hint: Use the stub file for this exercise in ~/training_materials/sparkdev/stubs for the code to load in the list of target models.
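A minimal pyspark sketch of one approach follows. The stub file contains the actual code for reading the target model list; the loading step shown here is an assumption of what it might look like.

# Sketch only -- broadcast the list of target models to the workers.
targetfile = "/home/training/training_materials/sparkdev/data/targetmodels.txt"
targetlist = [line.strip() for line in open(targetfile)]     # local list of model names
targetModelsBC = sc.broadcast(targetlist)

# Keep only those requests whose log line mentions one of the target models.
targetreqs = sc.textFile("/loudacre/weblogs/*") \
    .filter(lambda line: any(model in line for model in targetModelsBC.value))
print targetreqs.count()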

This is the end of the Exercise


Hands-On Exercise: Use Accumulators

Files Used in This Exercise:

Data files (HDFS): /loudacre/weblogs/*

Solutions: solutions/RequestAccumulator.pyspark

solutions/RequestAccumulator.scalaspark

In this exercise, you will count the number of different types of files requested in a set of web server logs.

Using accumulators, count the number of each type of file (HTML, CSS, and JPG) requested in the web server log files.

Hint: Use the file extension string to determine the type of request, such as .html, .css, or .jpg.
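A minimal sketch using three accumulators (the RequestAccumulator solution files may take a different approach):

# Sketch only -- count requests for each file type with accumulators.
jpgcount = sc.accumulator(0)
htmlcount = sc.accumulator(0)
csscount = sc.accumulator(0)

def count_type(line):
    if '.jpg' in line:
        jpgcount.add(1)
    elif '.html' in line:
        htmlcount.add(1)
    elif '.css' in line:
        csscount.add(1)

sc.textFile("/loudacre/weblogs/*").foreach(count_type)
print "JPG:", jpgcount.value, "HTML:", htmlcount.value, "CSS:", csscount.value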

This is the end of the Exercise


Hands-On Exercise: Use Spark SQL for ETL

Files and Data Used in this Exercise

MySQL table: loudacre.webpage

Output HDFS directory: /loudacre/webpage_files

Solutions: solutions/SparkSQL-webpage-files.pyspark

solutions/SparkSQL-webpage-files.scalaspark

In this exercise, you will use Spark SQL to load data from MySQL, process it, and store it to HDFS.

Reviewing the Data in MySQL

Review the data currently in the MySQL loudacre.webpage table.

1. List the columns and types in the table:

$ mysql -utraining -ptraining loudacre \

-e"DESCRIBE webpage"

2. View the first few rows from the table:

$ mysql -utraining -ptraining loudacre \

-e"SELECT * FROM webpage LIMIT 5"

Note that the data in the associated_files column is a comma-delimited string. Loudacre would like to make this data available in an Impala table, but in order to perform the required analysis, the associated_files data must be extracted and normalized. Your goal in the next section is to use Spark SQL to extract the data in the column, split the string, and create a new dataset in HDFS containing each web page number and its associated files in separate rows.


Loading the Data from MySQL

3. If necessary, start the Spark Shell.

4. Import the SQLContext class definition, and define a SQL context:

pyspark> from pyspark.sql import SQLContext

pyspark> sqlCtx = SQLContext(sc)

scala> import org.apache.spark.sql.SQLContext

scala> val sqlCtx = new SQLContext(sc)

5. Create a new DataFrame based on the webpage table from the database:

pyspark> webpages = sqlCtx.load(source="jdbc", \
url="jdbc:mysql://localhost/loudacre?user=training&password=training", \
dbtable="webpage")

scala> val webpages = sqlCtx.load("jdbc",
Map("url" -> "jdbc:mysql://localhost/loudacre?user=training&password=training",
"dbtable" -> "webpage"))

6. Examine the schema of the new DataFrame by calling webpages.printSchema().

7. Create a new DataFrame by selecting the web_page_num and associated_files columns from the existing DataFrame:

pyspark> assocfiles = \
webpages.select(webpages.web_page_num,\
webpages.associated_files)


scala> val assocfiles =

webpages.select(webpages("web_page_num"),webpages("associated

_files"))

8. In order to manipulate the data using Spark, convert the DataFrame into a Pair RDD using the map method. The input into the map method is a Row object. The key is the web_page_num value (the first value in the row), and the value is the associated_files string (the second value in the row).

In Python, you can dynamically reference the column value of the row by name:

pyspark> afilesrdd = assocfiles.map(lambda row: \
             (row.web_page_num, row.associated_files))

In Scala, use the correct get method for the type of value with the column index:

scala> val afilesrdd = assocfiles.map(row =>
           (row.getInt(0), row.getString(1)))

9. Now that you have an RDD, you can use the familiar flatMapValues transformation to split and extract the file names in the associated_files column:

pyspark> afilesrdd2 = afilesrdd \
             .flatMapValues(lambda filestring: filestring.split(','))

scala> val afilesrdd2 = afilesrdd.flatMapValues(filestring =>
           filestring.split(','))

10. Create a new DataFrame from the RDD:

pyspark> afiledf = sqlCtx.createDataFrame(afilesrdd2)


scala> val afiledf = sqlCtx.createDataFrame(afilesrdd2)

11. Call printSchema on the new DataFrame. Note that Spark SQL gave the columns generic names: _1 and _2.

12. Create a new DataFrame by renaming the columns to reflect the data they hold.

In Python, use the withColumnRenamed method to rename the two columns:

pyspark> finaldf = afiledf. \
             withColumnRenamed('_1','web_page_num'). \
             withColumnRenamed('_2','associated_file')

In Scala, you can use the toDF shortcut method to create a new DataFrame based on an existing one with the columns renamed:

scala> val finaldf = afiledf.
           toDF("web_page_num","associated_file")

13. Call printSchema to confirm that the new DataFrame has the correct column names.

14. Your final DataFrame contains the processed data, so call finaldf.collect() to confirm the data is correct.

15. Optional: Save the final DataFrame in Parquet format (the default) in /loudacre/webpage_files. The code is the same in Scala and Python.

> finaldf.save("/loudacre/webpage_files")

Confirm that the data saved in HDFS is correct. Note that the data will be in Parquet file format, which is a binary format. This means that when you view the files, only some of the content will be in readable string form. This is expected behavior.
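
One way to check, assuming the standard HDFS command-line client is available on the course VM (these commands are a suggested check, not part of the provided solution scripts):

$ hdfs dfs -ls /loudacre/webpage_files

$ hdfs dfs -cat /loudacre/webpage_files/part* | head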

This is the end of the Exercise


Appendix A: Enabling iPython Notebook

iPython Notebook is installed on the VM for this course. To use it instead of the command-line version of iPython, follow these steps:

1. Open the following file for editing: /home/training/.bashrc

2. Uncomment the following line (remove the leading #).

# export PYSPARK_DRIVER_PYTHON_OPTS='notebook ……..jax'

3. Save the file.

4. Open a new terminal window. (It must be a new terminal so that it loads your edited .bashrc file.)

5. Enter pyspark in the terminal. This will cause a browser window to open and display the iPython Notebook home page.

6. On the right-hand side of the page, select Python 2 from the New menu.


7. Enter some Spark code, such as the following, and use the play button to execute it.
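
For example, a simple snippet such as this (illustrative only; the values and variable names are assumptions, not the exact code shown in the course screenshot):

rdd = sc.parallelize([1, 2, 3, 4, 5])   # build a small RDD in the notebook cell
rdd.map(lambda x: x * x).collect()      # square each element and return the results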

8. Notice the output displayed.

This is the end of the Appendix


Data Model Reference

Note that not all of the information below applies to this customized version of the course.

Tables Imported from MySQL

The following depicts the structure of the MySQL tables imported into HDFS using Sqoop. The primary key column from the database, if any, is denoted by bold text:

customers: 201,375 records (imported to /dualcore/customers)

Index  Field    Description           Example
0      cust_id  Customer ID           1846532
1      fname    First name            Sam
2      lname    Last name             Jones
3      address  Address of residence  456 Clue Road
4      city     City                  Silicon Sands
5      state    State                 CA
6      zipcode  Postal code           94306

employees: 61,712 records (imported to /dualcore/employees and later used as an external table in Hive)

Index  Field      Description              Example
0      emp_id     Employee ID              BR5331404
1      fname      First name               Betty
2      lname      Last name                Richardson
3      address    Address of residence     123 Shady Lane
4      city       City                     Anytown
5      state      State                    CA
6      zipcode    Postal code              90210
7      job_title  Employee's job title     Vice President
8      email      E-mail address           [email protected]
9      active     Is actively employed?    Y
10     salary     Annual pay (in dollars)  136900

orders: 1,662,951 records (imported to /dualcore/orders)

Index  Field       Description         Example
0      order_id    Order ID            3213254
1      cust_id     Customer ID         1846532
2      order_date  Date/time of order  2013-05-31 16:59:34

order_details: 3,333,244 records (imported to /dualcore/order_details)

Index  Field     Description  Example
0      order_id  Order ID     3213254
1      prod_id   Product ID   1754836

products: 1,114 records (imported to /dualcore/products)

Index  Field        Description                   Example
0      prod_id      Product ID                    1273641
1      brand        Brand name                    Foocorp
2      name         Name of product               4-port USB Hub
3      price        Retail sales price, in cents  1999
4      cost         Wholesale cost, in cents      1463
5      shipping_wt  Shipping weight (in pounds)   1

suppliers: 66 records (imported to /dualcore/suppliers)

Index  Field    Description          Example
0      supp_id  Supplier ID          1000
1      fname    First name           ACME Inc.
2      lname    Last name            Sally Jones
3      address  Address of office    123 Oak Street
4      city     City                 New Athens
5      state    State                IL
6      zipcode  Postal code          62264
7      phone    Office phone number  (618) 555-5914


Hive/Impala Tables

The following is a record count for tables that are created or queried during the hands-on exercises. Use the DESCRIBE tablename command to see the table structure.

Table Name         Record Count
ads                788,952
cart_items         33,812
cart_orders        12,955
cart_shipping      12,955
cart_zipcodes      12,955
checkout_sessions  12,955
customers          201,375
employees          61,712
loyalty_program    311
order_details      3,333,244
orders             1,662,951
products           1,114
ratings            21,997
web_logs           412,860


Other Data Added to HDFS

The following describes the structure of other important datasets added to HDFS.

Combined Ad Campaign Data: 788,952 records total, stored in two directories:

• /dualcore/ad_data1 (438,389 records)
• /dualcore/ad_data2 (350,563 records)

Index  Field         Description                 Example
0      campaign_id   Uniquely identifies our ad  A3
1      date          Date of ad display          05/23/2013
2      time          Time of ad display          15:39:26
3      keyword       Keyword that triggered ad   tablet
4      display_site  Domain where ad shown       news.example.com
5      placement     Location of ad on Web page  INLINE
6      was_clicked   Whether ad was clicked      1
7      cpc           Cost per click, in cents    106

access.log: 412,860 records (uploaded to /dualcore/access.log). This file is used to populate the web_logs table in Hive. Note that the RFC931 and Username fields are seldom populated in log files for modern public Web sites and are ignored in our RegexSerDe.

Index  Field/Description     Example
0      IP address            192.168.1.15
1      RFC931 (Ident)        -
2      Username              -
3      Date/Time             [22/May/2013:15:01:46 -0800]
4      Request               "GET /foo?bar=1 HTTP/1.1"
5      Status code           200
6      Bytes transferred     762
7      Referer               "http://dualcore.com/"
8      User agent (browser)  "Mozilla/4.0 [en] (WinNT; I)"
9      Cookie (session ID)   "SESSION=8763723145"


Regular Expression Reference

The following is a brief tutorial intended for the convenience of students who don't have experience using regular expressions or may need a refresher. A more complete reference can be found in the documentation for Java's Pattern class:

http://tiny.cloudera.com/regexpattern

Introduction to Regular Expressions

Regular expressions are used for pattern matching. There are two kinds of patterns in regular expressions: literals and metacharacters. Literal values are used to match precise patterns while metacharacters have special meaning; for example, a dot will match any single character. Here's the complete list of metacharacters, followed by explanations of those that are commonly used:

< ( [ { \ ^ - = $ ! | ] } ) ? * + . >

Literal characters are any characters not listed as a metacharacter. They're matched exactly, but if you want to match a metacharacter, you must escape it with a backslash. Since a backslash is itself a metacharacter, it must also be escaped with a backslash. For example, you would use the pattern \\. to match a literal dot.

Regular expressions support patterns much more flexible than simply using a dot to match any character. The following explains how to use character classes to restrict which characters are matched.

Character Classes

[057]  Matches any single digit that is either 0, 5, or 7
[0-9]  Matches any single digit between 0 and 9
[3-6]  Matches any single digit between 3 and 6
[a-z]  Matches any single lowercase letter
[C-F]  Matches any single uppercase letter between C and F

For example, the pattern [C-F][3-6] would match the string D3 or F5 but would fail to match G3 or C7.
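
If you want to experiment with these patterns, Python's re module makes a quick sandbox (an illustration only, not part of the exercises; note that in Python source you write patterns directly, without the doubled backslashes required inside Java/Hive string literals):

import re

# character class example from the text
print re.match('[C-F][3-6]', 'D3') is not None   # True
print re.match('[C-F][3-6]', 'G3') is not None   # False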

There are also some built-in character classes that are shortcuts for common sets of characters.


Predefined Character Classes

\\d  Matches any single digit
\\w  Matches any word character (letters of any case, plus digits or underscore)
\\s  Matches any whitespace character (space, tab, newline, etc.)

For example, the pattern \\d\\d\\d\\w would match the string 314d or 934X but would fail to match 93X or Z871.

Sometimes it's easier to choose what you don't want to match instead of what you do want to match. These three can be negated by using an uppercase letter instead.

Negated Predefined Character Classes

\\D  Matches any single non-digit character
\\W  Matches any non-word character
\\S  Matches any non-whitespace character

For example, the pattern \\D\\D\\W would match the string ZX# or @ P but would fail to match 93X or 36_.

The metacharacters shown above each match exactly one character. You can specify them multiple times to match more than one character, but regular expressions support the use of quantifiers to eliminate this repetition.

Matching Quantifiers

{5}    Preceding character may occur exactly five times
{0,6}  Preceding character may occur between zero and six times
?      Preceding character is optional (may occur zero or one times)
+      Preceding character may occur one or more times
*      Preceding character may occur zero or more times

By default, quantifiers try to match as many characters as possible. If you used the pattern ore.+a on the string Dualcore has a store in Florida, you might be surprised to learn that it matches ore has a store in Florida rather than ore ha or ore in Florida as you might have expected. This is because matches are "greedy" by default. Adding a question mark makes the quantifier match as few characters as possible instead, so the pattern ore.+?a on this string would match ore ha.
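
You can verify the greedy versus non-greedy behavior with Python's re module (an illustration only, not part of the exercises):

import re

s = 'Dualcore has a store in Florida'
print re.search('ore.+a', s).group()    # greedy: 'ore has a store in Florida'
print re.search('ore.+?a', s).group()   # non-greedy: 'ore ha'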


Finally, there are two special metacharacters that match zero characters. They are used to ensure that a string matches a pattern only when it occurs at the beginning or end of a string.

Boundary Matching Metacharacters

^  Matches only at the beginning of a string
$  Matches only at the end of a string

NOTE: When used inside square brackets (which denote a character class), the ^ character is interpreted differently. In that context, it negates the match. Therefore, specifying the pattern [^0-9] is equivalent to using the predefined character class \\D described earlier.