Learning Apache Mahout Classification - DropPDF1.droppdf.com/files/0ZaSP/learning-apache-mahout-classification-by... · Learning Apache Mahout Classification Credits ... , with PDF

Jul 06, 2018

phungdien

Learning Apache Mahout Classification

Table of Contents

Learning Apache Mahout Classification
Credits
About the Author
About the Reviewers
www.PacktPub.com
    Support files, eBooks, discount offers, and more
    Why subscribe?
    Free access for Packt account holders
Preface
    What this book covers
    What you need for this book
    Who this book is for
    Conventions
    Reader feedback
    Customer support
    Downloading the example code
    Downloading the color images of this book
    Errata
    Piracy
    Questions
1. Classification in Data Analysis
    Introducing the classification
    Application of the classification system
    Working of the classification system
    Classification algorithms
    Model evaluation techniques
    The confusion matrix
    The Receiver Operating Characteristics (ROC) graph
    Area under the ROC curve
    The entropy matrix
    Summary
2. Apache Mahout
    Introducing Apache Mahout
    Algorithms supported in Mahout
    Reasons for Mahout being a good choice for classification
    Installing Mahout
    Building Mahout from source using Maven
    Installing Maven
    Building Mahout code
    Setting up a development environment using Eclipse
    Setting up Mahout for a Windows user
    Summary
3. Learning Logistic Regression / SGD Using Mahout
    Introducing regression
    Understanding linear regression
    Cost function
    Gradient descent
    Logistic regression
    Stochastic Gradient Descent
    Using Mahout for logistic regression
    Summary
4. Learning the Naïve Bayes Classification Using Mahout
    Introducing conditional probability and the Bayes rule
    Understanding the Naïve Bayes algorithm
    Understanding the terms used in text classification
    Using the Naïve Bayes algorithm in Apache Mahout
    Summary
5. Learning the Hidden Markov Model Using Mahout
    Deterministic and nondeterministic patterns
    The Markov process
    Introducing the Hidden Markov Model
    Using Mahout for the Hidden Markov Model
    Summary
6. Learning Random Forest Using Mahout
    Decision tree
    Random forest
    Using Mahout for Random forest
    Steps to use the Random forest algorithm in Mahout
    Summary
7. Learning Multilayer Perceptron Using Mahout
    Neural network and neurons
    Multilayer Perceptron
    MLP implementation in Mahout
    Using Mahout for MLP
    Steps to use the MLP algorithm in Mahout
    Summary
8. Mahout Changes in the Upcoming Release
    Mahout new changes
    Mahout Scala and Spark bindings
    Apache Spark
    Using Mahout’s Spark shell
    H2O platform integration
    Summary
9. Building an E-mail Classification System Using Apache Mahout
    Spam e-mail dataset
    Creating the model using the Assassin dataset
    Program to use a classifier model
    Testing the program
    Second use case as an exercise
    The ASF e-mail dataset
    Classifiers tuning
    Summary
Index

Learning Apache Mahout Classification

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015

Production reference: 1210215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78355-495-9

www.packtpub.com

Credits

Author
Ashish Gupta

Reviewers
Siva Prakash
Tharindu Rusira
Vishnu Viswanath

Commissioning Editor
Akram Hussain

Acquisition Editor
Reshma Raman

Content Development Editor
Merwyn D’souza

Technical Editors
Monica John
Novina Kewalramani
Shruti Rawool

Copy Editors
Sarang Chari
Gladson Monteiro
Aarti Saldanha
Rashmi Sawant

Project Coordinator
Neha Bhatnagar

Proofreaders
Simran Bhogal
Steve Maguire

Indexer
Monica Ajmera Mehta

Graphics
Sheetal Aute
Abhinash Sahu

Production Coordinator
Conidon Miranda

Cover Work
Conidon Miranda

About the Author

Ashish Gupta has been working in the field of software development for the last 8 years. He has worked in different companies, such as SAP Labs and Caterpillar, as a software developer. While working for a start-up where he was responsible for predicting potential customers for new fashion apparels using social media, he developed an interest in the field of machine learning. Since then, he has worked on using big data technologies and machine learning for different industries, including retail, finance, insurance, and so on. He has a passion for learning new technologies and sharing the knowledge thus gained with others. He has organized many boot camps for the Apache Mahout and Hadoop ecosystem.

First of all, I would like to thank open source communities for their continuous efforts in developing great software for all. I would like to thank Merwyn D’Souza and Reshma Raman, my editors for this project. Special thanks to the reviewers of this book.

Nothing can be accomplished without the support of family, friends, and loved ones. I would like to thank my friends, family, and especially my wife and my son for their continuous support throughout the writing of this book.

About the Reviewers

Siva Prakash is working as a tech lead in Bangalore. He has extensive development experience in the analysis, design, development, implementation, and maintenance of various desktop, mobile, and web-based applications. He loves trekking, traveling, music, reading books, and blogging.

You can find him on LinkedIn at https://www.linkedin.com/in/techsivam.

Tharindu Rusira is currently a computer science and engineering undergraduate at the University of Moratuwa, Sri Lanka. As a student researcher, he has strong interests in machine learning, compilers, and high-performance computing.

Tharindu has also worked as a research and development software engineering intern at Zaizi Asia (Pvt) Ltd., where he first started using Apache Mahout during the implementation of an enterprise-level content management and information retrieval system.

He sees the potential of Apache Mahout as a scalable machine learning library for industry-level implementations and has even contributed to the Mahout 0.9 release, the latest stable release of Mahout.

He is available on LinkedIn at https://www.linkedin.com/in/trusira.

Vishnu Viswanath is a senior big data developer who has many years of industrial expertise in the arena of machine learning. He is a tech enthusiast and is passionate about big data and has expertise on most big-data-related technologies.

You can find him on LinkedIn at http://in.linkedin.com/in/vishnuviswanath25.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital book library. Here, you can search, access, and read Packt’s entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

Thanks to the progress made in the hardware industries, our storage capacity has increased, and because of this, there are many organizations who want to store all types of events for analytics purposes. This has given birth to a new era of machine learning. The field of machine learning is very complex and writing these algorithms is not a piece of cake. Apache Mahout provides us with readymade algorithms in the area of machine learning and saves us from the complex task of algorithm implementation.

The intention of this book is to cover classification algorithms available in Apache Mahout. Whether you have already worked on classification algorithms using some other tool or are completely new to the field, this book will help you. So, start reading this book to explore the classification algorithms in one of the most popular open source projects which enjoys strong community support: Apache Mahout.

What this book covers

Chapter 1, Classification in Data Analysis, provides an introduction to the classification concept in data analysis. This chapter will cover the basics of classification, similarity matrix, and algorithms available in this area.

Chapter 2, Apache Mahout, provides an introduction to Apache Mahout and its installation process. Further, this chapter will talk about why it is a good choice for classification.

Chapter 3, Learning Logistic Regression / SGD Using Mahout, discusses logistic regression and Stochastic Gradient Descent, and how developers can use SGD in Mahout.

Chapter 4, Learning the Naïve Bayes Classification Using Mahout, discusses the Bayes Theorem, Naïve Bayes classification, and how we can use Mahout to build a Naïve Bayes classifier.

Chapter 5, Learning the Hidden Markov Model Using Mahout, covers the HMM and how to use Mahout’s HMM algorithms.

Chapter 6, Learning Random Forest Using Mahout, discusses the Random forest algorithm in detail, and how to use Mahout’s Random forest implementation.

Chapter 7, Learning Multilayer Perceptron Using Mahout, discusses Mahout as an early level implementation of a neural network. We will discuss Multilayer Perceptron in this chapter. Further, we will use Mahout’s implementation of MLP.

Chapter 8, Mahout Changes in the Upcoming Release, discusses Mahout as a work in progress. We will discuss the new major changes in the upcoming release of Mahout.

Chapter 9, Building an E-mail Classification System Using Apache Mahout, provides two use cases of e-mail classification: spam mail classification and e-mail classification based on the project the mail belongs to. We will create the model, and use this model in a program that will simulate the real working environment.

What you need for this book

To use the examples in this book, you should have the following software installed on your system:

Java 1.6 or higher
Eclipse
Hadoop
Mahout; we will discuss the installation in Chapter 2, Apache Mahout, of this book
Maven, depending on how you install Mahout

Who this book is for

If you are a data scientist who has some experience with the Hadoop ecosystem and machine learning methods and want to try out classification on large datasets using Mahout, this book is ideal for you. Knowledge of Java is essential.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: “Extract the source code and ensure that the folder contains the pom.xml file.”

A block of code is set as follows:

    public static Map<String, Integer> readDictionary(Configuration conf,
            Path dictionaryPath) {
        Map<String, Integer> dictionary = new HashMap<String, Integer>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text,
                IntWritable>(dictionaryPath, true, conf)) {
            dictionary.put(pair.getFirst().toString(),
                    pair.getSecond().get());
        }
        return dictionary;
    }

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    public static Map<String, Integer> readDictionary(Configuration conf,
            Path dictionaryPath) {
        Map<String, Integer> dictionary = new HashMap<String, Integer>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text,
                IntWritable>(dictionaryPath, true, conf)) {
            dictionary.put(pair.getFirst().toString(),
                    pair.getSecond().get());
        }
        return dictionary;
    }

Any command-line input or output is written as follows:

    hadoop fs -mkdir /user/hue/KDDTrain
    hadoop fs -mkdir /user/hue/KDDTest
    hadoop fs -put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrain
    hadoop fs -put /tmp/KDDTest+.arff /user/hue/KDDTest

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: “Now, navigate to the location for mahout-distribution-0.9 and click on Finish.”

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book’s title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/4959OS_ColoredImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Classification in Data Analysis

In the last decade, we saw a huge growth in social networking and e-commerce sites. I am sure that you must have got information about this book on Facebook, Twitter, or some other site. Chances are also high that you are reading an e-copy of this book after ordering it on your phone or tablet.

This must give you an idea of how much data we are generating over the Internet every single day. Now, in order to obtain all necessary information from the data, we not only create data but also store this data. This data is extremely useful to get some important insights into the business. The analysis of this data can increase the customer base and create profits for the organization. Take the example of an e-commerce site. You visit the site to buy some book. You get information about books on related topics or the same topic, publisher, or writer, and this helps you to make better decisions, which also helps the site to know more about its customers. This will eventually lead to an increase in sales.

Finding related items or suggesting a new item to the user is all part of the data science in which we analyze the data and try to get useful patterns.

Data analysis is the process of inspecting historical data and creating models to get useful information that is required to help in decision making. It is helpful in many industries, such as e-commerce, banking, finance, healthcare, telecommunications, retail, oceanography, and many more.

Let’s take the example of a weather forecasting system. It is a system that can predict the state of the atmosphere at a particular location. In this process, scientists collect historical data of the atmosphere of that location and try to create a model based on it to predict how the atmosphere will evolve over a period of time.

In machine learning, classification is the automation of the decision-making process that learns from examples of the past and emulates those decisions automatically. Emulating the decisions automatically is a core concept in predictive analytics. In this chapter, we will look at the following points:

Understanding classification
Working of classification systems
Classification algorithms
Model evaluation methods

Introducing the classification

The word classification always reminds us of our biology class, where we learned about the classification of animals. We learned about different categories of animals, such as mammals, reptiles, birds, amphibians, and so on.

If you remember how these categories are defined, you will realize that there were certain properties that scientists found in existing animals, and based on these properties, they categorized a new animal.

Other real-life examples of classification could be, for instance, when you visit the doctor. He/she asks you certain questions, and based on your answers, he/she is able to identify whether you have a certain disease or not.

Classification is the categorization of potential answers, and in machine learning, we want to automate this process. Biological classification is an example of multiclass classification and finding the disease is an example of binary classification.

In data analysis, we want to use machine learning concepts. To analyze the data, we want to build a system that can help us to find out which class an individual item belongs to. Usually, these classes are mutually exclusive. A related problem in this area is finding out the probability that an individual belongs to a certain class.

Classification is a supervised learning technique. In this technique, machines learn from historical data and gain the capability to predict the unknown. In machine learning, another popular technique is unsupervised learning. In supervised learning, we already know the output categories, but in unsupervised learning, we know nothing about the output. Let’s understand this with a quick example: suppose we have a fruit basket, and we want to classify fruits. When we say classify, it means that the training data already contains the output label for each fruit along with features such as size and color. For example, if we know that a fruit is red and its size is from 2.3” to 3.7”, we will classify that fruit as an apple. Opposite to this, in unsupervised learning, we want to separate different fruits, and we do not have any output information in the training dataset, so the learning algorithm will separate different fruits based on different features present in the dataset, but it will not be able to label them. In other words, it will not be able to tell which one is an apple and which one is a banana, although it will be able to separate them.

Application of the classification system

Classification is used for prediction. In the case of e-mail categorization, it is used to classify e-mail as spam or not spam. Nowadays, Gmail is classifying e-mails as primary, social, and promotional as well. Classification is useful in predicting credit card fraud, categorizing customers for loan eligibility, and so on. It is also used to predict customer churn in the insurance and telecom industries. It is useful in the healthcare industry as well. Based on historical data, it is useful in classifying particular symptoms of a disease to predict the disease in advance. Classification can be used to classify tropical cyclones. So, it is useful across many industries.

Working of the classification system

Let’s understand the classification process in more detail. In the process of classification, with the dataset given to us, we try to find out informative variables using which we can reduce the uncertainty and categorize something. These informative variables are called explanatory variables or features.

The final categories that we are interested in are called target variables or labels. Explanatory variables can be any of the following forms:

Continuous (numeric types)
Categorical
Word-like
Text-like

Note

If numeric types are not useful for any mathematical functions, those will be counted as categorical (zip codes, street numbers, and so on).

So, for example, we have a dataset of customers’ loan applications, and we want to build a classifier to find out whether a new customer is eligible for a loan or not. In this dataset, we can have the following fields:

Customer Age
Customer Income (PA)
Customer Account Balance
Loan Granted

From these fields, Customer Age, Customer Income (PA), and Customer Account Balance will work as explanatory variables and Loan Granted will be the target variable, as shown in the following screenshot:

To understand the creation of the classifier, we need to understand a few terms, as shown in the following diagram:

Training dataset: From the given dataset, a portion of the data is used to create the training dataset (it could be 70 percent of the given data). This dataset is used to build the classifier. All the feature sets are used in this dataset.

Test dataset: The dataset that is left after the training dataset is used to test the created model. With this data, only the feature set is used and the model is used to predict the target variables or labels.

Model: This is used to understand the algorithm used to generate the target variables.
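The split into a training dataset and a test dataset can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `TrainTestSplit`, the helper `split`, and the fixed random seed are hypothetical and are not part of Mahout; the 70 percent fraction is the figure mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainTestSplit {

    /** Holds the two partitions produced by a split. */
    public static final class Split<T> {
        public final List<T> train;
        public final List<T> test;

        Split(List<T> train, List<T> test) {
            this.train = train;
            this.test = test;
        }
    }

    /**
     * Shuffles a copy of the rows and cuts it into a training part and a test part.
     * trainFraction is the share of rows used for training (for example, 0.7).
     */
    public static <T> Split<T> split(List<T> rows, double trainFraction, long seed) {
        if (trainFraction <= 0.0 || trainFraction >= 1.0) {
            throw new IllegalArgumentException("trainFraction must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(rows);
        // Shuffle so that the split does not depend on the original row order.
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        return new Split<>(
                new ArrayList<>(shuffled.subList(0, cut)),
                new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            rows.add(i); // stand-ins for real records
        }
        Split<Integer> s = split(rows, 0.7, 42L);
        System.out.println("train=" + s.train + " test=" + s.test);
    }
}
```

With ten rows and a fraction of 0.7, seven rows go to training and three are held back for testing; the model never sees the held-back labels while it is being built.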

While building a classifier, we follow these steps:

1. Collecting historical data
2. Cleaning data (a lot of activities are involved here, such as space removal, and so on)
3. Defining target variables
4. Defining explanatory variables
5. Selecting an algorithm
6. Training the model (using the training dataset)
7. Running test data
8. Evaluating the model
9. Adjusting explanatory variables
10. Rerunning the test

While preparing the model, one should take care of outlier detection. Outlier detection is a method to find out items that do not conform to an expected pattern in a dataset. Outliers in an input dataset can mislead the training process of an algorithm. This can affect the model accuracy. There are algorithms to find out these outliers in the datasets. Distance-based techniques and fuzzy-logic-based methods are mostly used to find out outliers in the dataset. Let’s talk about one example to understand the outliers.

We have a set of numbers, and we want to find out the mean of these numbers:

10, 75, 10, 15, 20, 85, 25, 30, 25

Just plot these numbers and the result will be as shown in the following screenshot:

Clearly, the numbers 75 and 85 are outliers (far away in the plot from the other numbers).

Mean = sum of values / number of values = 295 / 9 = 32.78

Mean without the outliers = 135 / 7 = 19.29

So, now you can understand how outliers can affect the results.
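The two means above can be reproduced with a short Java sketch. The class name `OutlierMean` and the hand-picked cutoff of 50 are assumptions made only for this illustration; a real system would use one of the outlier detection techniques mentioned earlier rather than a fixed cutoff.

```java
import java.util.Arrays;
import java.util.Locale;

public class OutlierMean {

    /** Arithmetic mean: sum of values divided by the number of values. */
    public static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    /** Keeps only the values that are less than or equal to the cutoff. */
    public static double[] dropAbove(double[] values, double cutoff) {
        return Arrays.stream(values).filter(v -> v <= cutoff).toArray();
    }

    public static void main(String[] args) {
        double[] data = {10, 75, 10, 15, 20, 85, 25, 30, 25};

        // Mean over all nine values, including 75 and 85.
        System.out.println(String.format(Locale.ROOT,
                "mean with outliers: %.2f", mean(data)));

        // Hypothetical cutoff of 50 removes 75 and 85, leaving seven values.
        double[] cleaned = dropAbove(data, 50);
        System.out.println(String.format(Locale.ROOT,
                "mean without outliers: %.2f", mean(cleaned)));
    }
}
```

Two values out of nine are enough to pull the mean from about 19 to almost 33, which is why outliers deserve attention before training.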

While creating the model, we can encounter two commonly occurring problems: overfitting and underfitting.

Overfitting occurs when the algorithm captures the noise of the data, and the algorithm fits the data too well. Generally, it occurs if we use all the given data to build the model using pure memorization. Instead of finding out the generalizing pattern, the model just memorizes the pattern. Usually, in the case of overfitting, the model gets more complex, and it is allowed to pick up spurious correlations. These correlations are specific to training datasets and do not represent characteristics of the whole dataset in general.

The following diagram is an example of overfitting. An outlier is present, and the algorithm considers that and creates a model that perfectly classifies the training set, but because of this, the test data is wrongly classified (both the rectangles are classified as stars in the test data):


There is no single method to avoid overfitting; however, we have some approaches, such as reducing the number of features and regularizing some of the features. Another way is to train the model with one part of the dataset and test it with the remaining part. A common method called cross-validation is used to generate multiple performance measures. In this way, a single dataset is split repeatedly into training and test portions, and each split yields its own performance measure.
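The splitting step of cross-validation can be sketched as follows. This is a minimal illustration, assuming a plain k-fold scheme without shuffling; the class `KFoldSplit` and its method are hypothetical names, not Mahout classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of k-fold splitting; not a Mahout API.
class KFoldSplit {
    // Assigns row indices 0..n-1 to k folds in round-robin order.
    // In round i, fold i is the test set and the other folds form the training set.
    static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            result.add(new ArrayList<>());
        }
        for (int row = 0; row < n; row++) {
            result.get(row % k).add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(10, 5);
        for (int i = 0; i < folds.size(); i++) {
            System.out.println("Round " + i + " tests on rows " + folds.get(i));
        }
    }
}
```

In practice the rows are usually shuffled first so that each fold reflects the overall class distribution; that step is left out here to keep the sketch short.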

Underfitting occurs when the algorithm cannot capture the patterns in the data, so the model does not fit the data well. Underfitting is also known as high bias. It means your algorithm has such a strong bias towards its hypothesis that it does not fit the data well. For an underfitting error, more data will not help; it can even increase the training error. More explanatory variables can help to deal with the underfitting problem, because they expand the hypothesis space.

Both overfitting and underfitting give poor results on new datasets.


Classification algorithms

In this book, we will discuss the following algorithms that are supported by Apache Mahout:

Logistic regression / Stochastic Gradient Descent (SGD): We usually read about regression along with classification, but actually, there is a difference between the two. Classification involves a categorical target variable, while regression involves a numeric target variable. Classification predicts whether something will happen, and regression predicts how much of something will happen. We will cover this algorithm in Chapter 3, Learning Logistic Regression / SGD Using Mahout. Mahout supports logistic regression trained via Stochastic Gradient Descent.

Naïve Bayes classification: This is a very popular algorithm for text classification. Naïve Bayes uses the concept of probability to classify new items. It is based on the Bayes theorem. We will discuss this algorithm in Chapter 4, Learning the Naïve Bayes Classification Using Mahout. In that chapter, we will see how Mahout is useful in classifying text, which is required in the data analysis field. We will discuss vectorization, bag of words, n-grams, and other terms used in text classification.

Hidden Markov Model (HMM): This is used in various fields, such as speech recognition, parts-of-speech tagging, gene prediction, time-series analysis, and so on. In HMM, we observe a sequence of emissions but do not have the sequence of states that the model uses to generate the emissions. In Chapter 5, Learning the Hidden Markov Model Using Mahout, we will discuss HMM in detail and see how Mahout supports this algorithm.

Random Forest: This is one of the most widely used algorithms in classification. Random Forest consists of a collection of simple tree predictors, each capable of producing a response when presented with a set of explanatory variables. In Chapter 6, Learning Random Forest Using Mahout, we will discuss this algorithm in detail and also talk about how to use Mahout to implement this algorithm.

Multi-layer Perceptron (MLP): In Chapter 7, Learning Multilayer Perceptron Using Mahout, we will discuss this newly implemented algorithm in Mahout. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. It is a base for the implementation of neural networks. We will discuss neural networks briefly, but only after a detailed discussion of MLP in Mahout.

We will discuss all the classification algorithms supported by Apache Mahout in this book, and we will also check the model evaluation techniques provided by Apache Mahout.


Model evaluation techniques

We cannot have a single evaluation metric that can fit all the classifier models, but we can find out some common issues in evaluation, and we have techniques to deal with them. We will discuss the following techniques that are used in Mahout:

Confusion matrix
ROC graph
AUC
Entropy matrix


The confusion matrix

The confusion matrix provides us with the number of correct and incorrect predictions made by the model compared with the actual outcomes (target values) in the data. A confusion matrix is an N × N matrix, where N is the number of labels (classes). Each column represents the instances of a predicted class, and each row represents the instances of an actual class. Using this matrix, we can find out how one class is confused with another. Let's assume that we have a classifier that classifies three fruits: strawberries, cherries, and grapes. Assuming that we have a sample of 24 fruits (7 strawberries, 8 cherries, and 9 grapes), the resulting confusion matrix will be as shown in the following table:

                 Predicted classes by model
Actual class     Strawberries   Cherries   Grapes
Strawberries     4              3          0
Cherries         2              5          1
Grapes           0              1          8

So, in this model, from the 7 strawberries, 3 were classified as cherries. From the 8 cherries, 2 were classified as strawberries, and 1 was classified as a grape. From the 9 grapes, 1 was classified as a cherry. From this matrix, we will create the table of confusion. The table of confusion has two rows and two columns that report the true positives, true negatives, false positives, and false negatives.

So, if we build this table for a particular class, let's say for strawberries, it would be as follows:

True Positive (a): 4 (actual strawberries classified correctly)
False Positive (b): 2 (cherries that were classified as strawberries)
False Negative (c): 3 (strawberries wrongly classified as cherries)
True Negative (d): 15 (all other fruits correctly not classified as strawberries)

Using this table of confusion, we can find out the following terms:

Accuracy: This is the proportion of the total number of predictions that were correct. It is calculated as (True Positive + True Negative) / (Positive + Negative). Therefore, accuracy = (a + d) / (a + b + c + d).

Precision or positive predictive value: This is the proportion of predicted positive cases that were correct. It is calculated as True Positive / (True Positive + False Positive). Therefore, precision = a / (a + b).

Negative predictive value: This is the proportion of predicted negative cases that were correct. It is calculated as True Negative / (True Negative + False Negative). Therefore, negative predictive value = d / (c + d).

Sensitivity / true positive rate / recall: This is the proportion of the actual positive cases that were correctly identified. It is calculated as True Positive / (True Positive + False Negative). Therefore, sensitivity = a / (a + c).

Specificity: This is the proportion of the actual negative cases that were correctly identified. It is calculated as True Negative / (False Positive + True Negative). Therefore, specificity = d / (b + d).

F1 score: This is a measure of a test's accuracy that combines precision and recall, and it is calculated as follows: F1 = 2 * (precision * recall) / (precision + recall).
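To make these definitions concrete, the following sketch derives the table of confusion for one class from the 3 × 3 fruit matrix and then computes each metric. The class `BinaryMetrics` and its method names are hypothetical and chosen only for this illustration; Mahout ships its own evaluation classes, which are not used here.

```java
// Hypothetical helper for the fruit example; not a Mahout API.
class BinaryMetrics {
    final int tp, fp, fn, tn;

    BinaryMetrics(int tp, int fp, int fn, int tn) {
        this.tp = tp; this.fp = fp; this.fn = fn; this.tn = tn;
    }

    // Builds the table of confusion for class k; rows are actual, columns are predicted.
    static BinaryMetrics forClass(int[][] matrix, int k) {
        int total = 0, rowSum = 0, colSum = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == k) rowSum += matrix[i][j];   // all actual members of class k
                if (j == k) colSum += matrix[i][j];   // everything predicted as class k
            }
        }
        int tp = matrix[k][k];
        int fn = rowSum - tp;
        int fp = colSum - tp;
        return new BinaryMetrics(tp, fp, fn, total - tp - fn - fp);
    }

    double accuracy()    { return (double) (tp + tn) / (tp + fp + fn + tn); }
    double precision()   { return (double) tp / (tp + fp); }
    double recall()      { return (double) tp / (tp + fn); }
    double specificity() { return (double) tn / (fp + tn); }
    double negativePredictiveValue() { return (double) tn / (tn + fn); }
    double f1() {
        double p = precision(), r = recall();
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int[][] fruits = {{4, 3, 0}, {2, 5, 1}, {0, 1, 8}};
        BinaryMetrics m = forClass(fruits, 0); // index 0 = strawberries
        System.out.printf("a=%d b=%d c=%d d=%d%n", m.tp, m.fp, m.fn, m.tn);
        System.out.printf("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f%n",
                m.accuracy(), m.precision(), m.recall(), m.f1());
    }
}
```

Passing index 1 or 2 instead of 0 gives the corresponding tables for cherries and grapes.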


The Receiver Operating Characteristics (ROC) graph

ROC is a two-dimensional plot of a classifier with the false positive rate on the x axis and the true positive rate on the y axis. The lower-left point (0, 0) in the figure represents never issuing a positive classification. The point (0, 1) represents perfect classification. The diagonal from (0, 0) to (1, 1) divides the ROC space. Points above the diagonal represent good classification results (better than random guessing), and points below the line represent poor results, as shown in the following diagram:


Area under the ROC curve

This is the area under the ROC curve and is also known as AUC. It is used to measure the quality of the classification model. In practice, most classification models have an AUC between 0.5 (no better than random guessing) and 1. The closer the value is to 1, the better your classifier is.
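One way to compute AUC without drawing the curve relies on a known equivalence: the AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses that pairwise definition; the class `AucSketch` is a hypothetical name, and this quadratic-time loop is meant only for small illustrative inputs.

```java
// Hypothetical pairwise AUC computation; not a Mahout API.
class AucSketch {
    // scores[i] is the model's score for example i; positive[i] is its true label.
    // Returns NaN if there are no positive or no negative examples.
    static double auc(double[] scores, boolean[] positive) {
        double wins = 0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!positive[i]) continue;
            for (int j = 0; j < scores.length; j++) {
                if (positive[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) wins += 1.0;        // positive ranked above negative
                else if (scores[i] == scores[j]) wins += 0.5;  // tie counts as half
            }
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.4, 0.6, 0.1};
        boolean[] labels = {true, true, false, false};
        System.out.println("AUC = " + auc(scores, labels));
    }
}
```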


The entropy matrix

Before going into the details of the entropy matrix, first we need to understand entropy. The concept of entropy in information theory was developed by Shannon.

Entropy is a measure of disorder that can be applied to a set. It is defined as:

Entropy = -p1 * log2(p1) - p2 * log2(p2) - ...

Each p is the probability of a particular property within the set. Let's revisit our customer loan application dataset. For example, assume we have a set of 10 customers, of which 6 are eligible for a loan and 4 are not. Here, we have two properties (classes): eligible or not eligible.

P(eligible) = 6 / 10 = 0.6

P(not eligible) = 4 / 10 = 0.4

So, the entropy of the dataset will be:

Entropy = -[0.6 * log2(0.6) + 0.4 * log2(0.4)]

= -[0.6 * (-0.74) + 0.4 * (-1.32)]

= 0.972 (about 0.971 if the logarithms are not rounded first)
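The same calculation can be written as a small Java method. This is a sketch with a hypothetical class name, `EntropySketch`; it uses base-2 logarithms as in the worked example and skips zero probabilities, for which the term p * log2(p) is taken to be 0.

```java
// Hypothetical helper for the loan example; not a Mahout API.
class EntropySketch {
    // Shannon entropy in bits of a discrete distribution given as probabilities.
    static double entropy(double... probabilities) {
        double h = 0.0;
        for (double p : probabilities) {
            if (p > 0) {
                h -= p * (Math.log(p) / Math.log(2)); // log2(p) via natural logs
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // 6 of 10 customers eligible, 4 not eligible.
        System.out.printf("Entropy = %.3f%n", entropy(0.6, 0.4));
    }
}
```

A perfectly mixed two-class set (0.5, 0.5) gives the maximum entropy of 1 bit, while a pure set (1.0, 0.0) gives 0.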

Entropy is the basis for calculating information gain. Information gain measures the change in entropy due to any new information being added in model creation. So, if entropy decreases because of the new information, it indicates that the model is now performing better. Information gain is calculated as:

IG(classes, subclasses) = entropy(class) - (p(subclass1) * entropy(subclass1) + p(subclass2) * entropy(subclass2) + ...)
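As an illustration of this formula, the sketch below computes the information gain of one split from class counts. The split itself is invented for the example: the 10 customers (6 eligible, 4 not) are assumed to be divided into two subgroups of 5, with 4 eligible in the first and 2 eligible in the second. The class `InfoGainSketch` is a hypothetical name, not a Mahout class.

```java
// Hypothetical information gain calculation; not a Mahout API.
class InfoGainSketch {
    // Entropy in bits of a set described by its per-class counts.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // entropy(parent) minus the size-weighted entropy of the subsets.
    static double informationGain(int[] parent, int[][] subsets) {
        int total = 0;
        for (int c : parent) total += c;
        double weighted = 0.0;
        for (int[] subset : subsets) {
            int size = 0;
            for (int c : subset) size += c;
            weighted += ((double) size / total) * entropy(subset);
        }
        return entropy(parent) - weighted;
    }

    public static void main(String[] args) {
        int[] parent = {6, 4};               // {eligible, not eligible}
        int[][] subsets = {{4, 1}, {2, 3}};  // assumed split, for illustration only
        System.out.printf("Information gain = %.4f%n", informationGain(parent, subsets));
    }
}
```

A split that leaves both subsets as mixed as the parent gives a gain of 0; a split that produces pure subsets gives a gain equal to the parent's entropy.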

The entropy matrix is basically the same as the confusion matrix defined earlier; the only difference is that the elements in the matrix are the averages of the log of the probability score for each true or estimated category combination. A good model will have small negative numbers along the diagonal and large negative numbers in the off-diagonal positions.


Summary

We have discussed classification and its applications, and also which algorithms and classifier evaluation techniques are supported by Mahout. We discussed techniques such as the confusion matrix, ROC graph, AUC, and entropy matrix.

Now, we will move to the next chapter and set up Apache Mahout and the developer environment. We will also discuss the architecture of Apache Mahout and find out why Mahout is a good choice for classification.


Chapter 2. Apache Mahout

In the previous chapter, we discussed classification and looked into the algorithms provided by Mahout in this area. Before going to those algorithms, we need to understand Mahout and its installation. In this chapter, we will explore the following topics:

What is Apache Mahout?
Algorithms supported in Mahout
Why is it a good choice for classification problems?
Setting up the system for Mahout development


Introducing Apache Mahout

A mahout is a person who rides and controls an elephant. Most of the algorithms in Apache Mahout are implemented on top of Hadoop, which is another Apache-licensed project and has the symbol of an elephant (http://hadoop.apache.org/). As Apache Mahout rides over Hadoop, this name is justified.

Apache Mahout is a project of the Apache Software Foundation that has implementations of machine learning algorithms. Mahout was started as a subproject of the Apache Lucene project in 2008. After some time, an open source project named Taste, which was developed for collaborative filtering, was absorbed into Mahout. Mahout is written in Java and provides scalable machine learning algorithms. Mahout is the default choice for machine learning problems in which the data is too large to fit into a single machine. Mahout provides Java libraries and does not provide any user interface or server. It is a framework of tools to be used and adapted by developers.

To sum it up, Mahout provides you with implementations of the most frequently used machine learning algorithms in the areas of classification, clustering, and recommendation. Instead of us spending time writing algorithms, it provides us with ready-to-consume solutions.

Mahout uses Hadoop for its algorithms, but some of the algorithms can also run without Hadoop. Currently, Mahout supports the following use cases:

Recommendation: This takes the user data and tries to predict items that the user might like. You can see this use case on all the sites that sell goods to users. Based on your previous actions, they will try to find unknown items that could be of use. One example can be this: as soon as you select some book from Amazon, the website will show you a list of other books under the title, Customers Who Bought This Item Also Bought. It also shows the title, What Other Items Do Customers Buy After Viewing This Item? Another example of recommendation is that while playing videos on YouTube, it recommends that you watch some other videos based on your selection. Mahout provides full API support to develop your own user-based or item-based recommendation engine.

Classification: As defined in the earlier chapter, classification decides how much an item belongs to one particular category. E-mail classification for filtering out spam is a classic example of classification. Mahout provides a rich set of APIs to build your own classification model. For example, Mahout can be used to build a document classifier or an e-mail classifier.

Clustering: This is a technique that tries to group items together based on some sort of similarity. Here, we find the different clusters of items based on certain properties, and we do not know the name of the cluster in advance. The main difference between clustering and classification is that in classification, we know the end class name. Clustering is useful in finding out different customer segments. Google News uses the clustering technique in order to group news. For clustering, Mahout has already implemented some of the most popular algorithms in this area, such as k-means, fuzzy k-means, canopy, and so on.

Dimensional reduction: As we discussed in the previous chapter, features are called dimensions. Dimensional reduction is the process of reducing the number of random variables under consideration. This makes data easy to use. Mahout provides algorithms for dimensional reduction. Singular value decomposition and Lanczos are examples of the algorithms that Mahout provides.

Topic modeling: Topic modeling is used to capture the abstract idea of a document. A topic model is a model that associates each document with a probability distribution over topics. Given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently. "Football" and "goal" will appear more in a document about sports. Latent Dirichlet Allocation (LDA) is a powerful learning algorithm for topic modeling. In Mahout, collapsed variational Bayes is implemented for LDA.


Algorithms supported in Mahout

The implementation of algorithms in Mahout can be categorized into two groups:

Sequential algorithms: These algorithms are executed sequentially and do not use Hadoop's scalable processing. They are usually the ones derived from Taste. For example: user-based collaborative filtering, logistic regression, Hidden Markov Model, multi-layer perceptron, and singular value decomposition.

Parallel algorithms: These algorithms can support petabytes of data using Hadoop's MapReduce parallel processing. For example: Random Forest, Naïve Bayes, canopy clustering, k-means clustering, spectral clustering, and so on.


Reasons for Mahout being a good choice for classification

In machine learning systems, the more data you use, the more accurate the system built will be. Mahout, which uses Hadoop for scalability, is way ahead of others in terms of handling huge datasets. As the number of training examples increases, Mahout's performance also increases. If the input size for training examples is from 1 million to 10 million, then Mahout is an excellent choice.

For classification problems, increased data for training is desirable as it can improve the accuracy of the model. Generally, as the size of the dataset increases, the memory requirement also increases and algorithms become slow, but Mahout's scalable and parallel algorithms work better with regard to the time taken. Each new machine added decreases the training time and provides higher performance.


Installing Mahout

Now let's try the slightly challenging part of this book: Mahout installation. Based on common experiences, I have come up with the following questions or concerns that users face before installation:

I do not know anything about Maven. How will I compile the Mahout build?
How can I set up Eclipse to write my own programs in Mahout?
How can I install Mahout on a Windows system?

So, we will install Mahout with the help of the following steps. Each step is independent of the others. You can choose any one of these:

Building Mahout code using Maven
Setting up a development environment using Eclipse
Setting up Mahout for a Windows user

Before any of the steps, some of the prerequisites are:

You should have Java installed on your system. Wikihow is a good source for this at http://www.wikihow.com/Install-Java-on-Linux
You should have Hadoop installed on your system; see http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleNodeSetup.html


Building Mahout from source using Maven

Mahout's build and release system is based on Maven.

Installing Maven

1. Create the folder /usr/local/maven, as follows:

mkdir /usr/local/maven

2. Download the distribution apache-maven-x.y.z-bin.tar.gz from the Maven site (http://maven.apache.org/download.cgi) and move this to /usr/local/maven, as follows:

mv apache-maven-x.y.z-bin.tar.gz /usr/local/maven

3. Unpack it in the location /usr/local/maven (run the command from that folder), as follows:

tar -xvf apache-maven-x.y.z-bin.tar.gz

4. Edit the .bashrc file, as follows:

export M2_HOME=/usr/local/maven/apache-maven-x.y.z

export M2=$M2_HOME/bin

export PATH=$M2:$PATH

Note

For the Eclipse IDE, go to Help and select Install New Software. Click on the Add button, and in the popup, type the name M2Eclipse, provide the link http://download.eclipse.org/technology/m2e/releases, and click on OK.

Building Mahout code

By default, Mahout assumes that Hadoop is already installed on the system. Mahout uses the HADOOP_HOME and HADOOP_CONF_DIR environment variables to access Hadoop cluster configurations. For setting up Mahout, execute the following steps:

1. Download the Mahout distribution file mahout-distribution-0.9-src.tar.gz from the location http://archive.apache.org/dist/mahout/0.9/.

2. Choose an installation directory for Mahout (/usr/local/Mahout), and place the downloaded source in the folder. Extract the source code as follows, and ensure that the extracted folder contains the pom.xml file:

tar -xvf mahout-distribution-0.9-src.tar.gz

3. Install the Mahout Maven project from the folder that contains pom.xml, and skip the test cases while installing, as follows:

mvn install -Dmaven.test.skip=true

4. Set the MAHOUT_HOME environment variable in the ~/.bashrc file, and update the PATH variable with the Mahout bin directory:

export MAHOUT_HOME=/usr/local/Mahout/mahout-distribution-0.9

export PATH=$PATH:$MAHOUT_HOME/bin

5. To test the Mahout installation, execute the command mahout. This will list the available programs within the distribution bundle, as shown in the following screenshot:


Setting up a development environment using Eclipse

For this setup, you should have Maven installed on the system, along with the Maven plugin for Eclipse. Refer to the Installing Maven step explained in the previous section. This setup can be done in the following steps:

1. Download the Mahout distribution file mahout-distribution-0.9-src.tar.gz from the location http://archive.apache.org/dist/mahout/0.9/ and unpack it:

tar xzf mahout-distribution-0.9-src.tar.gz

2. Let's create a folder named workspace under /usr/local, as follows:

mkdir /usr/local/workspace

3. Move the extracted distribution to this folder (from the downloads folder), as follows:

mv mahout-distribution-0.9 /usr/local/workspace/

4. Move to the folder /usr/local/workspace/mahout-distribution-0.9 and make an Eclipse project (this command can take up to an hour):

mvn eclipse:eclipse

5. Set the Mahout home in the .bashrc file, as explained earlier in the Building Mahout code section.

6. Now open Eclipse. Select File, then Import, then Maven, and then Existing Maven Projects. Now, navigate to the location of mahout-distribution-0.9 and click on Finish.


Setting up Mahout for a Windows user

A Windows user can use Cygwin (a large collection of GNU and open source tools that provides functionality similar to a Linux distribution on Windows) to set up their environment. There is also another way that is easy to use, as shown in the following steps:

1. Download Hortonworks Sandbox for VirtualBox on your system from the location http://hortonworks.com/products/hortonworks-sandbox/#install. Hortonworks Sandbox on your system will run Hadoop in pseudo-distributed mode.

2. Log in to the console. Use Alt + F5 or, alternatively, download Putty and provide 127.0.0.1 as the hostname and 2222 as the port, as shown in the following figure. Log in with the username root and the password hadoop.

3. Enter the following command:

yum install mahout

Now, you will see a screen like this:


4. Enter y, and Mahout will start installing. Once this is done, you can test it by typing the command mahout, and this will show you the same screen as shown in the Building Mahout code section seen earlier.


Summary

We discussed Apache Mahout in detail in this chapter. We covered the process of installing Mahout on our system, along with setting up a development environment that is ready to execute Mahout algorithms. We have also taken a look at the reasons behind Mahout being considered a good choice for classification. Now, we move to the next chapter, where we will learn about logistic regression and the process that needs to be followed to execute our first algorithm in Mahout.


Chapter 3. Learning Logistic Regression / SGD Using Mahout

Instead of jumping directly into logistic regression, let's try to understand a few of its concepts. In this chapter, we will explore the following topics:

Introducing regression
Understanding linear regression
Cost function
Gradient descent
Logistic regression
Understanding SGD
Using Mahout for logistic regression


Introducing regression

Regression analysis is used for prediction and forecasting. It is used to find out the relationship between explanatory variables and target variables. Essentially, it is a statistical model that is used to find out the relationship among variables present in the datasets. An example that you can refer to for a better understanding of this term is this: determine the earnings of workers in a particular industry. Here, we will try to find out the factors that affect a worker's salary. These factors can be age, education, years of experience, particular skill set, location, and so on. We will try to make a model that will take all these variables into consideration and try to predict the salary. In regression analysis, we characterize the variation of the target variable around the regression function, which can be described by a probability distribution that is also of interest. There are a number of regression analysis techniques available, for example, linear regression, ordinary least squares regression, logistic regression, and so on.


Understanding linear regression

In linear regression, we create a model to predict the value of a target variable with the help of an explanatory variable. To understand this better, let's look at an example.

A company X that deals in selling coffee has noticed that during the monsoon season, their sales increased to quite an extent. So they have come up with a formula to find the relation between rain (R, in mm) and their coffee sales (C, in cups), which is shown as follows:

C = 1.5R + 800

So, for 2 mm of rain, there is a demand of 1.5 * 2 + 800 = 803 cups of coffee. Now if you go into minute details, you will realize that we have the data for rainfall and per cup coffee sale, and we are trying to build a model that can predict the demand for coffee based on the rainfall. We have data in the form of (R1, C1), (R2, C2), ..., (Ri, Ci). Here, we will build the model in a manner that keeps the error between the actual and predicted values at a minimum.

Cost function

In the equation C = 1.5R + 800, the two values 1.5 and 800 are parameters, and these values affect the end result. We can write this equation as C = p0 + p1R. As we discussed earlier, our goal is to reduce the difference between the actual value and the predicted value, and this is dependent on the values of p0 and p1. Let's assume that the predicted value is Cp and the actual value is C, so that the difference will be (Cp - C). This can be written as (p0 + p1R - C). To minimize this error, we define the error function, which is also called the cost function.

The cost function can be defined with the following formula:

Here, i denotes the ith sample and N is the number of training examples. We calculate the cost for different sets of p0 and p1 and finally select the p0 and p1 that give the least cost. This is the model that will be used to make predictions for new input.
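A common way to write such a cost is the mean squared error over the N training pairs (R_i, C_i). The form below is a sketch under that assumption: the squared term matches the (p0 + p1Ri - Ci)2 expression used later in this chapter, while the symbol J and the conventional scaling factor 1/(2N) are assumptions introduced here.

```latex
J(p_0, p_1) = \frac{1}{2N} \sum_{i=1}^{N} \left( p_0 + p_1 R_i - C_i \right)^2
```

The factor 1/2 is only a convenience that cancels when the derivative is taken; it does not change which p0 and p1 give the minimum.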

Gradient descent

Gradient descent starts with an initial set of parameter values, p0 and p1, and iteratively moves towards a set of parameter values that minimizes the cost function. We can visualize this error function graphically, where width and length can be considered as the parameters p0 and p1 and height as the cost function. Our goal is to find the values for p0 and p1 in a way that our cost function will be minimal. We start the algorithm with some values of p0 and p1 and iteratively work towards the minimum value. A good way to ensure that gradient descent is working correctly is to make sure that the cost function decreases with each iteration. In this case, the cost function surface is convex, and we will try to find out the minimum value. This can be seen in the following figure:
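The procedure can be sketched in Java for the coffee example. Several things here are assumptions made for illustration: the cost is taken to be the squared error averaged over the samples and halved, the five data points are generated from C = 1.5R + 800 rather than taken from any real dataset, and the learning rate and iteration count were picked by hand. The class `GradientDescentSketch` is a hypothetical name, not a Mahout class.

```java
// Hypothetical batch gradient descent for C = p0 + p1 * R; not a Mahout API.
class GradientDescentSketch {
    // Assumed cost: sum of squared errors divided by 2N.
    static double cost(double p0, double p1, double[] r, double[] c) {
        double sum = 0.0;
        for (int i = 0; i < r.length; i++) {
            double error = p0 + p1 * r[i] - c[i];
            sum += error * error;
        }
        return sum / (2 * r.length);
    }

    // Returns {p0, p1} after the given number of full-dataset updates.
    static double[] fit(double[] r, double[] c, double learningRate, int iterations) {
        double p0 = 0.0, p1 = 0.0; // arbitrary starting point
        int n = r.length;
        for (int it = 0; it < iterations; it++) {
            double grad0 = 0.0, grad1 = 0.0;
            for (int i = 0; i < n; i++) {
                double error = p0 + p1 * r[i] - c[i];
                grad0 += error;        // derivative of the cost with respect to p0
                grad1 += error * r[i]; // derivative of the cost with respect to p1
            }
            p0 -= learningRate * grad0 / n;
            p1 -= learningRate * grad1 / n;
        }
        return new double[]{p0, p1};
    }

    public static void main(String[] args) {
        double[] rain = {0, 1, 2, 3, 4};                // synthetic rainfall values (mm)
        double[] cups = {800, 801.5, 803, 804.5, 806};  // generated from C = 1.5R + 800
        double[] p = fit(rain, cups, 0.1, 5000);
        System.out.printf("p0 = %.3f, p1 = %.3f, cost = %.6f%n",
                p[0], p[1], cost(p[0], p[1], rain, cups));
    }
}
```

If the learning rate is set too high for the data, the cost grows instead of shrinking, which is why checking the cost after each iteration is a useful sanity test.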


Logistic regression

Logistic regression is used to ascertain the probability of an event. Generally, logistic regression refers to problems where the outcome is binary, for example, building a model that is based on a customer's income, travel uses, gender, and other features to predict whether he or she will buy a particular car or not. So, the answer will be a simple yes or no. When the outcome is composed of more than two categories, this is called multinomial logistic regression.

Logistic regression is based on the sigmoid function. Predictor variables are combined with linear weights and then passed to this function, which generates an output in the range of 0 to 1. An output close to 1 indicates that an item belongs to a certain class. Let's first understand the sigmoid or logistic function. It can be defined by the following formula:

F(z) = 1 / (1 + e^(-z))

With a single explanatory variable, z will be defined as z = β0 + β1 * x. This equation is explained as follows:

z: This is called the dependent variable. This is the variable that we would like to predict. During the creation of the model, we have this variable with us in the training set, and we build the model to predict this variable. The known values of z are called observed values.

x: This is the explanatory or independent variable. These variables are used to predict the dependent variable z. For example, to predict the sales of a newly launched product at a particular location, we might include explanatory variables such as the price of the product, the average income of the people of that location, and so on.

β0: This is called the regression intercept. If all explanatory variables are zero, then this parameter is equal to the dependent variable z.

β1: This is the coefficient (weight) of the explanatory variable x; with several explanatory variables, there is one such coefficient for each of them.

The graph of the logistic function is as follows:


With a little bit of mathematics, we can change this equation as follows:

ln(F(z) / (1 - F(z))) = β0 + β1 * x
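The relationship between the logistic function and this log-odds form can be checked numerically. The sketch below uses a hypothetical class, `LogisticFunction`, with two small methods; applying the log-odds to the output of the sigmoid should return the original z.

```java
// Hypothetical helper illustrating the sigmoid and its inverse; not a Mahout API.
class LogisticFunction {
    // F(z) = 1 / (1 + e^(-z)), always strictly between 0 and 1.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Log-odds ln(p / (1 - p)), the inverse of the sigmoid for 0 < p < 1.
    static double logit(double p) {
        return Math.log(p / (1.0 - p));
    }

    public static void main(String[] args) {
        double beta0 = -1.0, beta1 = 0.5, x = 4.0; // illustrative values only
        double z = beta0 + beta1 * x;
        double probability = sigmoid(z);
        System.out.printf("z = %.2f, F(z) = %.4f, logit(F(z)) = %.2f%n",
                z, probability, logit(probability));
    }
}
```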

In the case of linear regression, the cost function graph was convex, but here, if we keep the squared error, it is not going to be convex. Finding the parameter values for which our predicted output is close to the actual one will be difficult. In the cost function for logistic regression, we will replace our Cp value of linear regression with the function F(z). To make the logistic regression cost function convex, we will replace (p0 + p1Ri - Ci)^2 with one of the following:

-log(1 / (1 + e^(-(β0 + β1 * x)))): If the actual occurrence of an event is 1, this function will represent the cost.

-log(1 - 1 / (1 + e^(-(β0 + β1 * x)))): If the actual occurrence of an event is 0, this function will represent the cost.

We will have to remember that in logistic regression, we calculate the class probability. So, if the probability of an event occurring (customer buying a car, being defrauded, and so on) is p, the probability of non-occurrence is 1 - p.
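The per-sample cost can be sketched as follows. The expression is written here with a leading minus sign so that the cost is non-negative and shrinks as the predicted probability approaches the actual outcome; the class `LogLossSketch` and the parameter values used in `main` are hypothetical.

```java
// Hypothetical per-sample logistic cost; not a Mahout API.
class LogLossSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Cost of one sample: -log(p) if the event occurred (y = 1), -log(1 - p) otherwise.
    static double cost(double beta0, double beta1, double x, int y) {
        double p = sigmoid(beta0 + beta1 * x);
        return y == 1 ? -Math.log(p) : -Math.log(1.0 - p);
    }

    public static void main(String[] args) {
        // Illustrative parameters: the model predicts a high probability for x = 2.
        System.out.printf("cost when event occurred: %.4f%n", cost(0.0, 2.0, 2.0, 1));
        System.out.printf("cost when it did not:     %.4f%n", cost(0.0, 2.0, 2.0, 0));
    }
}
```

A confident prediction that turns out to be wrong is penalized heavily, while a confident correct prediction costs almost nothing.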


Stochastic Gradient Descent

Gradient descent minimizes the cost function. For very large datasets, gradient descent is a very expensive procedure. Stochastic Gradient Descent (SGD) is a modification of the gradient descent algorithm to handle large datasets. Gradient descent computes the gradient using the whole dataset, while SGD computes the gradient using a single sample. So, gradient descent loads the full dataset, takes one step towards the minimum, and then repeats the full process again, while SGD updates the parameters after every sample, one by one. A major advantage that SGD has over gradient descent is that each update is much cheaper to compute. Large datasets generally cannot be held in RAM, as memory is limited. With SGD, the burden on the RAM is reduced: each sample or batch of samples is loaded and processed, the parameters are updated, and then the next sample or batch is loaded.
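The per-sample update can be sketched for logistic regression with one explanatory variable. This is an illustration under stated assumptions, not Mahout's implementation: the four training points are invented, samples are visited in a fixed order rather than at random, and the learning rate and epoch count are hand-picked. The class `LogisticSgdSketch` is a hypothetical name.

```java
// Hypothetical SGD training loop for one-variable logistic regression; not a Mahout API.
class LogisticSgdSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Returns {beta0, beta1}; the parameters are updated after every single sample.
    static double[] train(double[] x, int[] y, double learningRate, int epochs) {
        double beta0 = 0.0, beta1 = 0.0;
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < x.length; i++) {
                double predicted = sigmoid(beta0 + beta1 * x[i]);
                double error = y[i] - predicted;       // gradient of the log-likelihood
                beta0 += learningRate * error;
                beta1 += learningRate * error * x[i];
            }
        }
        return new double[]{beta0, beta1};
    }

    static double predict(double[] beta, double x) {
        return sigmoid(beta[0] + beta[1] * x);
    }

    public static void main(String[] args) {
        double[] x = {-2, -1, 1, 2}; // invented training inputs
        int[] y = {0, 0, 1, 1};      // invented binary outcomes
        double[] beta = train(x, y, 0.1, 200);
        System.out.printf("P(y=1 | x=1.5)  = %.3f%n", predict(beta, 1.5));
        System.out.printf("P(y=1 | x=-1.5) = %.3f%n", predict(beta, -1.5));
    }
}
```

Only one sample is needed in memory at a time, which is the property that makes this style of training attractive for large datasets.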


Using Mahout for logistic regression

Mahout has an implementation of logistic regression using SGD. It is very easy to understand and use. So let’s get started.

Dataset

We will use the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. This is a dataset for breast cancer tumors and data is available from 1995 onwards. It has 569 instances of breast tumor cases and has 30 features to predict the diagnosis, which is categorized as either benign or malignant.

Note

More details on the preceding dataset are available at http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names.

Preparing the training and test data

You can download the wdbc.data dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data.

Now, save it as a CSV file and include the following header line:

ID_Number,Diagnosis,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,ConcavePoints,Symmetry,Fractal_Dimension,RadiusStdError,TextureStdError,PerimeterStdError,AreaStdError,SmoothnessStdError,CompactnessStdError,ConcavityStdError,ConcavePointStdError,Symmetrystderror,FractalDimensionStderror,WorstRadius,worsttexture,worstperimeter,worstarea,worstsmoothness,worstcompactness,worstconcavity,worstconcavepoints,worstsymmentry,worstfractaldimensions

Now, we will have to perform the following steps to prepare this data to be used by the Mahout logistic regression algorithm:

1. We will make the target class numeric. In this case, the second field, Diagnosis, is the target variable. We will change malignant to 0 and benign to 1. Use the following code snippet to introduce the changes. We can use this strategy for small datasets, but for huge datasets, we have different strategies, which we will cover in Chapter 4, Learning the Naïve Bayes Classification Using Mahout:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public void convertTargetToInteger() throws IOException {
    // Header line for the new file
    String header = "ID_Number,Diagnosis,Radius,Texture,Perimeter,Area,Smoothness,"
        + "Compactness,Concavity,ConcavePoints,Symmetry,Fractal_Dimension,"
        + "RadiusStdError,TextureStdError,PerimeterStdError,AreaStdError,"
        + "SmoothnessStdError,CompactnessStdError,ConcavityStdError,"
        + "ConcavePointStdError,Symmetrystderror,FractalDimensionStderror,"
        + "WorstRadius,worsttexture,worstperimeter,worstarea,worstsmoothness,"
        + "worstcompactness,worstconcavity,worstconcavepoints,worstsymmentry,"
        + "worstfractaldimensions";
    // Create the file to save the resulting data
    File wdbcData = new File("<Your Destination location for file.>");
    // Read the data; try-with-resources closes both files even if an error occurs
    try (BufferedReader br = new BufferedReader(new FileReader("wdbc.csv"));
         FileWriter fw = new FileWriter(wdbcData)) {
        // We are adding the header to the new file
        fw.write(header + "\n");
        String line;
        /* In the while loop, we read line by line, check the second field,
           parts[1], and change it to a numeric value accordingly */
        while ((line = br.readLine()) != null) {
            String[] parts = line.split(",");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            if (parts[1].equals("M")) {
                parts[1] = "0"; // malignant
            } else if (parts[1].equals("B")) {
                parts[1] = "1"; // benign
            } else {
                continue; // skips the header line of the input file
            }
            fw.write(String.join(",", parts) + "\n");
        }
    }
}


2. We will have to shuffle the dataset so that the records are mixed up and then split it into training and test datasets, which can be done using the following code snippet:

import com.google.common.base.Charsets;
import com.google.common.io.Resources;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public void dataPrepration() throws Exception {
    // Read the dataset created by the earlier method, convertTargetToInteger.
    // Here we are using the Google Guava APIs.
    List<String> result =
        Resources.readLines(Resources.getResource("wdbc.csv"), Charsets.UTF_8);
    // Remove the header before the randomization process.
    // Otherwise it can appear in the middle of the dataset.
    List<String> raw = new ArrayList<>(result.subList(1, 570));
    Random random = new Random();
    // Shuffle the dataset.
    Collections.shuffle(raw, random);
    // Split the dataset into training (470) and test (99) examples.
    List<String> train = raw.subList(0, 470);
    List<String> test = raw.subList(470, 569);
    File trainingData = new File("<yourLocation>/wdbcTrain.csv");
    File testData = new File("<yourLocation>/wdbcTest.csv");
    writeCSV(train, trainingData);
    writeCSV(test, testData);
}

// This method writes the header and the list to the desired file location.
public void writeCSV(List<String> list, File file) throws IOException {
    String header = "ID_Number,Diagnosis,Radius,Texture,Perimeter,Area,Smoothness,"
        + "Compactness,Concavity,ConcavePoints,Symmetry,Fractal_Dimension,"
        + "RadiusStdError,TextureStdError,PerimeterStdError,AreaStdError,"
        + "SmoothnessStdError,CompactnessStdError,ConcavityStdError,"
        + "ConcavePointStdError,Symmetrystderror,FractalDimensionStderror,"
        + "WorstRadius,worsttexture,worstperimeter,worstarea,worstsmoothness,"
        + "worstcompactness,worstconcavity,worstconcavepoints,worstsymmentry,"
        + "worstfractaldimensions";
    try (FileWriter fw = new FileWriter(file)) {
        fw.write(header + "\n");
        for (String row : list) {
            fw.write(row + "\n");
        }
    }
}

Training the model

We will use the training dataset and train the logistic regression algorithm to prepare the model. Use the following command to create the model:

mahout trainlogistic --input /tmp/wdbcTrain.csv --output /tmp//model \
  --target Diagnosis --categories 2 \
  --predictors Radius Texture Perimeter Area Smoothness Compactness \
  Concavity ConcavePoints Symmetry Fractal_Dimension RadiusStdError \
  TextureStdError PerimeterStdError AreaStdError SmoothnessStdError \
  CompactnessStdError ConcavityStdError ConcavePointStdError \
  Symmetrystderror FractalDimensionStderror WorstRadius worsttexture \
  worstperimeter worstarea worstsmoothness worstcompactness worstconcavity \
  worstconcavepoints worstsymmentry worstfractaldimensions \
  --types numeric --features 30 --passes 90 --rate 300

This command will give you the following output:


Let’s understand the parameters used in this command:

trainlogistic: This is the algorithm that Mahout provides to build the model using your input parameters.
input: This is the location of the input file.
output: This is the location of the model file.
target: This is the name of the target variable that we want to predict from the dataset.
categories: This refers to the number of predicted classes.
predictors: These are the features in the dataset used to predict the target variable.
types: This is a list of the types of predictor variables. (Here all are numeric, but it could be word or text as well.)
features: This is the size of the feature vector used to build the model.
passes: This specifies the number of times the input data should be re-examined during training. Small input files may need to be examined dozens of times. Very large input files probably don’t even need to be completely examined.
rate: This sets the initial learning rate. This can be large if you have lots of data or use lots of passes because it decreases progressively as data is examined.

Now our model is ready to move on to the next step of evaluation. To evaluate the model further, we can use the same dataset and check the confusion matrix and the AUC. The command for this will be as follows:

mahout runlogistic --input /tmp/wdbcTrain.csv --model /tmp//model --auc --confusion

runlogistic: This is the algorithm to run the logistic regression model over an input dataset.
model: This is the location of the model file.
auc: This prints the AUC score for the model versus the input data after the data is read.
confusion: This prints the confusion matrix for a particular threshold.

The output of the previous command is shown in the following screenshot:


Now, these metrics show that the model is not bad. Having 0.88 as the value for AUC is good, but we will check this on test data as well. The confusion matrix informs us that out of 172 malignant tumors, it has correctly classified 151 instances and that 34 benign tumors are also classified as malignant. In the case of benign tumors, out of 298, it has correctly classified 264.
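To make these counts easier to compare, the following Java sketch derives accuracy, recall, and precision from the four cells of the confusion matrix. It treats malignant as the class of interest, which is a choice made for this illustration; the class name ConfusionMetrics is hypothetical and is not part of Mahout's output.

```java
class ConfusionMetrics {
    // Fraction of all cases that were classified correctly.
    static double accuracy(int tp, int fn, int fp, int tn) {
        return (double) (tp + tn) / (tp + fn + fp + tn);
    }

    // Fraction of the actual positive cases that the model found.
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // Fraction of the predicted positive cases that really are positive.
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    public static void main(String[] args) {
        int tp = 151;       // malignant, classified as malignant
        int fn = 172 - 151; // malignant, classified as benign
        int fp = 34;        // benign, classified as malignant
        int tn = 264;       // benign, classified as benign
        System.out.printf("accuracy  = %.3f%n", accuracy(tp, fn, fp, tn));
        System.out.printf("recall    = %.3f%n", recall(tp, fn));
        System.out.printf("precision = %.3f%n", precision(tp, fp));
    }
}
```

With the training counts above, this gives an accuracy of about 0.883 (415 of 470), a recall of about 0.878 for malignant tumors, and a precision of about 0.816.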

If the model does not provide good results, we have a number of options.

Change the parameters in the feature vector, increasing them if we are selecting few features. This should be done one at a time, and we should test the result again with each generated model. We should get a model where AUC is close to 1.

Let’s run the same algorithm on test data as well:

mahout runlogistic --input /tmp/wdbcTest.csv --model /tmp//model --auc --confusion

So this model works almost the same on test data as well. It has classified 34 out of the 40 malignant tumors correctly.


Summary

In this chapter, we discussed logistic regression and how we can use this algorithm available in Apache Mahout. We used the Wisconsin Diagnostic Breast Cancer dataset and randomly broke it into two datasets: one for training and the other for testing. We created the logistic regression model using Mahout and also ran test data over this model. Now, we will move on to the next chapter where you will learn about the Naïve Bayes classification and also the most frequently used classification technique: text classification.


Chapter 4. Learning the Naïve Bayes Classification Using Mahout

In this chapter, we will use the Naïve Bayes classification algorithm to classify a set of documents. Classifying text documents is a little tricky because of the data preparation steps involved. In this chapter, we will explore the following topics:

Conditional probability and the Bayes rule
Understanding the Naïve Bayes algorithm
Understanding terms used in text classification
Using the Naïve Bayes algorithm in Apache Mahout


Introducing conditional probability and the Bayes rule

Before learning the Naïve Bayes algorithm, you should have an understanding of conditional probability and the Bayes rule.

In very simple terms, conditional probability is the probability that something will happen, given that something else has already happened. It is expressed as P(A|B), which can be read as the probability of A given B, and it is the probability of the occurrence of event A once event B has already happened.

Mathematically, it is defined as follows:

P(A|B) = P(A ∩ B) / P(B)

For example, if you choose a card from a standard card deck and you are asked about the probability for the card to be a diamond, you would quickly say 13/52 or 0.25, as there are 13 diamond cards in the deck. However, if you then look at the card and declare that it is red, then we will have narrowed the possibilities for the card to 26 possible cards, and the probability that the card is a diamond now is 13/26 = 0.5. So, if we define A as a diamond card and B as a red card, then P(A|B) will be the probability of the card being a diamond, given it is red.

Sometimes, for a given pair of events, conditional probability is hard to calculate, and Bayes’ theorem helps us here by giving the relationship between two conditional probabilities.

Bayes’ theorem is defined as follows:

P(A|B) = P(B|A) * P(A) / P(B)

The terms in the formula are defined as follows:

P(A): This is called prior probability or prior
P(B|A): This is called conditional probability or likelihood
P(B): This is called marginal probability
P(A|B): This is called posterior probability or posterior

The preceding formula is derived only from the conditional probability formula. We can define P(B|A) as follows:

P(B|A) = P(A ∩ B) / P(A)


When rearranged, the formula becomes this:

P(A ∩ B) = P(B|A) * P(A)

Now, from the preceding conditional probability formula, we get the following:

P(A|B) = P(A ∩ B) / P(B) = P(B|A) * P(A) / P(B)

Let’s take an example that will help us to understand how Bayes’ theorem is applied.

A cancer test gives a positive result with a probability of 97 percent when the patient is indeed affected by cancer, while it gives a negative result with 99 percent probability when the patient is not affected by cancer. If a patient is drawn at random from a population where 0.2 percent of the individuals are affected by cancer and he or she is found to be positive, what is the probability that he or she is indeed affected by cancer? In probabilistic terms, what we know about this problem can be defined as follows:

P(positive | cancer) = 0.97
P(positive | no cancer) = 1 - 0.99 = 0.01
P(cancer) = 0.002
P(no cancer) = 1 - 0.002 = 0.998

P(positive) = P(positive | cancer) * P(cancer) + P(positive | no cancer) * P(no cancer)
= 0.97 * 0.002 + 0.01 * 0.998
= 0.01192

Now P(cancer | positive) = (0.97 * 0.002) / 0.01192 = 0.1628

So even when found positive, the probability of the patient being affected by cancer in this example is around 16 percent.
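The same calculation can be written as a small Java sketch. The class and method names (BayesRule, posterior) are illustrative only; the numbers are the ones from the cancer test example.

```java
class BayesRule {
    // Posterior P(A|B) computed from the likelihoods P(B|A) and P(B|not A)
    // and the prior P(A), using Bayes' theorem.
    static double posterior(double pBGivenA, double pBGivenNotA, double pA) {
        // Marginal probability P(B) over both cases, A and not A.
        double pB = pBGivenA * pA + pBGivenNotA * (1.0 - pA);
        return pBGivenA * pA / pB;
    }

    public static void main(String[] args) {
        // P(positive | cancer) = 0.97, P(positive | no cancer) = 0.01, P(cancer) = 0.002
        double p = posterior(0.97, 0.01, 0.002);
        System.out.printf("P(cancer | positive) = %.4f%n", p);
    }
}
```

The card example works the same way: with P(red | diamond) = 1, P(red | not diamond) = 13/39, and P(diamond) = 0.25, the posterior P(diamond | red) comes out as 0.5.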


Understanding the Naïve Bayes algorithm

In Bayes’ theorem, we have seen that the outcome is based on only one piece of evidence, but in classification problems, we have multiple pieces of evidence and we have to predict the outcome. In Naïve Bayes, we uncouple multiple pieces of evidence and treat each one of them independently. It is defined as follows:

P(outcome | multiple evidence) = P(Evidence1 | outcome) * P(Evidence2 | outcome) * P(Evidence3 | outcome) * ... * P(outcome) / P(Evidence)

Run this formula for each possible outcome. Since we are trying to classify, each outcome will be called a class. Our task is to look at the evidence (features) of an item, consider how likely the item is to belong to each class, and then assign it accordingly. The class that has the highest probability gets assigned to that combination of evidence. Let’s understand this with an example.

Let’s say that we have data on 1,000 pieces of fruit. They happen to be bananas, apples, or some other fruit. We are aware of three characteristics of each fruit:

Size: They are either long or not long
Taste: They are either sweet or not sweet
Color: They are either yellow or not yellow

Assume that we have a dataset like the following:

Fruit type   Taste – sweet   Taste – not sweet   Color – yellow   Color – not yellow   Size – long   Size – not long   Total
Banana       350             150                 450              50                   400           100               500
Apple        150             150                 100              200                  0             300               300
Other        150             50                  50               150                  100           100               200
Total        650             350                 600              400                  500           500               1000

Now let’s look at the things we have:

P(Banana) = 500/1000 = 0.5
P(Apple) = 300/1000 = 0.3
P(Other) = 200/1000 = 0.2

Let’s look at the probability of the features:

P(sweet) = 650/1000 = 0.65
P(yellow) = 600/1000 = 0.6
P(long) = 500/1000 = 0.5
P(not sweet) = 350/1000 = 0.35
P(not yellow) = 400/1000 = 0.4
P(not long) = 500/1000 = 0.5

Now we want to know what fruit we will have if it is not yellow, not long, and sweet. The probability of it being an apple is as follows:

P(Apple | sweet, not long, not yellow) = P(sweet | Apple) * P(not long | Apple) * P(not yellow | Apple) * P(Apple) / (P(sweet) * P(not long) * P(not yellow))
= 0.5 * 1 * 0.67 * 0.3 / P(Evidence)
= 0.1005 / P(Evidence)

The probability of it being a banana is this:

P(Banana | sweet, not long, not yellow) = P(sweet | Banana) * P(not long | Banana) * P(not yellow | Banana) * P(Banana) / (P(sweet) * P(not long) * P(not yellow))
= 0.7 * 0.2 * 0.1 * 0.5 / P(Evidence)
= 0.007 / P(Evidence)

The probability of it being any other fruit is as follows:

P(Other | sweet, not long, not yellow) = P(sweet | Other) * P(not long | Other) * P(not yellow | Other) * P(Other) / (P(sweet) * P(not long) * P(not yellow))
= 0.75 * 0.5 * 0.75 * 0.2 / P(Evidence)
= 0.05625 / P(Evidence)

P(Evidence) is the same in all three cases, so it is enough to compare the numerators. From the results, you can see that if the fruit is sweet, not long, and not yellow, then the highest probability is that it will be an apple. So find out the highest probability and assign the unknown item to that class.
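The comparison above can be reproduced with a short Java sketch that works directly from the counts in the fruit table. The class name FruitNaiveBayes and its layout are assumptions made for illustration; the sketch computes only the numerators, since P(Evidence) is the same for every class.

```java
class FruitNaiveBayes {
    static final String[] CLASSES = {"Banana", "Apple", "Other"};
    // Column order: sweet, not sweet, yellow, not yellow, long, not long, total.
    static final int[][] COUNTS = {
        {350, 150, 450, 50, 400, 100, 500},
        {150, 150, 100, 200, 0, 300, 300},
        {150, 50, 50, 150, 100, 100, 200}
    };
    static final int SWEET = 0, NOT_YELLOW = 3, NOT_LONG = 5, TOTAL = 6;

    // Unnormalized score: P(class) * P(f1 | class) * P(f2 | class) * ...
    static double score(int cls, int... features) {
        double total = COUNTS[cls][TOTAL];
        double s = total / 1000.0; // prior P(class); 1000 pieces of fruit overall
        for (int f : features) {
            s *= COUNTS[cls][f] / total; // likelihood P(feature | class)
        }
        return s;
    }

    // Returns the class with the highest score for the given features.
    static String classify(int... features) {
        int best = 0;
        for (int c = 1; c < CLASSES.length; c++) {
            if (score(c, features) > score(best, features)) {
                best = c;
            }
        }
        return CLASSES[best];
    }

    public static void main(String[] args) {
        for (int c = 0; c < CLASSES.length; c++) {
            System.out.println(CLASSES[c] + ": " + score(c, SWEET, NOT_LONG, NOT_YELLOW));
        }
        System.out.println("Predicted: " + classify(SWEET, NOT_LONG, NOT_YELLOW));
    }
}
```

Because the code uses the exact fraction 200/300 rather than the rounded 0.67, the apple score comes out as 0.1 instead of 0.1005; the ranking of the three classes is unchanged.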

Naïve Bayes is a very good choice for text classification. Before we move on to text classification using Naïve Bayes in Mahout, let’s understand a few terms that are really useful for text classification.


Understanding the terms used in text classification

Preparing data so that it can be used by a classifier is a complex process. From raw data, we can collect explanatory and target variables and encode them as vectors, which are the input of the classifier.

Vectors are ordered lists of values. You can take a clue from coordinate geometry: the point (3, 4) is a point in the x-y plane, that is, a vector with two dimensions. In Mahout, a vector is not limited to two dimensions. Here, a vector can have 2 dimensions, as in (3, 4), or 10,000 dimensions.

Mahout provides support for creating vectors. There are two types of vector implementations in Mahout: sparse and dense vectors. There are a few terms that we need to understand for text classification:

Bag of words: This considers each document as a collection of words. This ignores word order, grammar, and punctuation. Every word is a feature: each word of the document is represented as a token, and the feature is given the value 1 if the word is present in the document or 0 if not.

Term frequency: This considers the word count in the document instead of 0 and 1. So the importance of a word increases with the number of times it appears in the document. Consider the following example sentence:

Apple has launched iPhone and it will continue to launch such products. Other competitors are also planning to launch products similar to that of iPhone.

The following is the table that represents term frequency:

Term      Count
Apple     1
Launch    3
iPhone    2
Product   2
Plan      1

The following techniques are usually applied to come up with this type of table:

Stemming of words: With this, the suffix is removed from the word, so “launched”, “launches”, and “launch” are all considered as “launch”.
Case normalization: With this, every term is converted to lowercase.
Stop word removal: There are some words that are present in almost every document. We call these words stop words. During feature extraction from a document, these words would be counted, but they are not helpful in the overall calculation. Examples of these words are “is”, “are”, “the”, “that”, and so on. So, while extracting features, we ignore these kinds of words.

Inverse document frequency: This is considered as the boost a term gets for being rare. A term should not be too common. If a term occurs in every document, it is not good for classification. The fewer documents in which a term occurs, the more significant it is likely to be for the documents it does occur in. For a term t, inverse document frequency is calculated as follows:

IDF(t) = 1 + log(total number of documents / number of documents containing t)

Term frequency-inverse document frequency (TF-IDF): This is one of the popular representations of text. It is the product of term frequency and inverse document frequency, as follows:

TFIDF(t, d) = TF(t, d) * IDF(t)
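As a rough sketch of how these pieces fit together, the following Java code applies case normalization and stop word removal and then computes TF, IDF, and TF-IDF with the formulas above. It is plain Java rather than Mahout's vectorizer, it does no stemming, and the class name TfIdfSketch and the short stop word list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TfIdfSketch {
    // A tiny, illustrative stop word list; real lists are much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("is", "are", "the", "that", "and", "it", "to", "of"));

    // Case normalization, splitting on non-letters, and stop word removal.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // TF(t, d): number of times the term occurs in the document.
    static int tf(String term, List<String> doc) {
        return Collections.frequency(doc, term);
    }

    // IDF(t) = 1 + log(total documents / documents containing t).
    static double idf(String term, List<List<String>> docs) {
        int df = 0;
        for (List<String> d : docs) {
            if (d.contains(term)) {
                df++;
            }
        }
        if (df == 0) {
            return 0.0; // term never occurs; avoids dividing by zero
        }
        return 1.0 + Math.log((double) docs.size() / df);
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> docs) {
        return tf(term, doc) * idf(term, docs);
    }

    public static void main(String[] args) {
        List<String> d1 = tokenize("Apple has launched iPhone and it will launch products.");
        List<String> d2 = tokenize("Other competitors are planning to launch products.");
        List<List<String>> docs = Arrays.asList(d1, d2);
        System.out.println("iphone: " + tfIdf("iphone", d1, docs));
        System.out.println("launch: " + tfIdf("launch", d1, docs));
    }
}
```

In this toy corpus, "iphone" appears in only one of the two documents, so it receives a higher IDF than "launch", which appears in both.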

Each document is a feature vector, and a collection of documents is a set of these feature vectors; this set works as the input for the classification. Now that we understand the basic concepts behind the vector creation of text documents, let’s move on to the next section where we will classify text documents using the Naïve Bayes algorithm.


Using the Naïve Bayes algorithm in Apache Mahout

We will use a dataset of 20 newsgroups for this exercise. The 20 newsgroups dataset is a standard dataset commonly used for machine learning research. The data is obtained from transcripts of several months of postings made in 20 Usenet newsgroups from the early 1990s. This dataset consists of messages, one per file. Each file begins with header lines that specify things such as who sent the message, how long it is, what kind of software was used, and the subject. A blank line follows and then the message body follows as unformatted text.

Download the 20news-bydate.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/. The following steps are used to build the Naïve Bayes classifier using Mahout:

1. Create a 20newsdata directory and unzip the data here:

mkdir /tmp/20newsdata
cd /tmp/20newsdata
tar -xzvf /tmp/20news-bydate.tar.gz

2. You will see two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Now create another directory called 20newsdataall and merge both the training and test data of the 20 newsgroups.

3. Come out of the directory, move to the home directory, and execute the following:

mkdir /tmp/20newsdataall
cp -R /tmp/20newsdata/*/* /tmp/20newsdataall

4. Create a directory in Hadoop and save this data in HDFS:

hadoop fs -mkdir /user/hue/20newsdata
hadoop fs -put /tmp/20newsdataall /user/hue/20newsdata

5. Convert the raw data into a sequence file. The seqdirectory command will generate sequence files from a directory. Sequence files are used in Hadoop. A sequence file is a flat file that consists of binary key/value pairs. We are converting the files into sequence files so that they can be processed in Hadoop, which can be done using the following command:

bin/mahout seqdirectory -i /user/hue/20newsdata/20newsdataall -o /user/hue/20newsdataseq-out

The output of the preceding command can be seen in the following screenshot:


6. Convert the sequence file into a sparse vector using the following command:

bin/mahout seq2sparse -i /user/hue/20newsdataseq-out/part-m-00000 -o /user/hue/20newsdatavec -lnorm -nv -wt tfidf

The terms used in the preceding command are as follows:

lnorm: This is for the output vector to be log normalized
nv: This refers to named vectors
wt: This refers to the kind of weight to use; here, we use tfidf

The output of the preceding command on the console is shown in the following screenshot:

7. Split the set of vectors to train and test the model:

bin/mahout split -i /user/hue/20newsdatavec/tfidf-vectors --trainingOutput /user/hue/20newsdatatrain --testOutput /user/hue/20newsdatatest --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

The terms used in the preceding command are as follows:

randomSelectionPct: This is the percentage of the data that is randomly selected as the testing dataset; the rest becomes the training dataset. Here, 40 percent is for testing and 60 percent for training.
xm: This refers to the execution method to use: sequential or mapreduce. The default is mapreduce.

8. Now train the model:

bin/mahout trainnb -i /user/hue/20newsdatatrain -el -o /user/hue/model -li /user/hue/labelindex -ow -c

9. Test the model using the following command:

bin/mahout testnb -i /user/hue/20newsdatatest -m /user/hue/model/ -l /user/hue/labelindex -ow -o /user/hue/results

The output of the preceding command on the console is shown in the following screenshot:


We get the result of our Naïve Bayes classifier for the 20 newsgroups.


Summary

In this chapter, we discussed the Naïve Bayes algorithm. This algorithm is a simplistic yet highly regarded statistical model that is widely used in both industry and academia, and it produces good results on many occasions. We initially discussed conditional probability and the Bayes rule. We then saw an example of the Naïve Bayes algorithm. You learned about the approaches to convert text into a vector format, which is an input for classifiers. Finally, we used the 20 newsgroups dataset to build a classifier using the Naïve Bayes algorithm in Mahout. In the next chapter, we will continue our journey of exploring classification algorithms in Mahout with the Hidden Markov Model implementation.


Chapter 5. Learning the Hidden Markov Model Using Mahout

In this chapter, we will cover one of the most interesting topics of classification techniques: the Hidden Markov Model (HMM). To understand the HMM, we will cover the following topics in this chapter:

Deterministic and nondeterministic patterns
The Markov process
Introducing the HMM
Using Mahout for the HMM


Deterministic and nondeterministic patterns

In a deterministic system, each state is solely dependent on the state it was previously in. For example, let’s take the case of a set of traffic lights. The sequence of lights is red → green → amber → red. So, here we know what state will follow after the current state. Once the transitions are known, deterministic systems are easy to understand.

For nondeterministic patterns, consider an example of a person named Bob who has his snacks at 4:00 P.M. every day. Let’s say he has any one of the three items from the menu: ice cream, juice, or cake. We cannot say for sure what item he will have the next day, even if we know what he had today. This is an example of a nondeterministic pattern.


The Markov process

In the Markov process, the next state is dependent on the previous states. If the next state depends on the previous n states, the process is called a Markov model of order n; in the simplest case, a first-order model, the next state depends only on the current state. In the Markov process, we make the choice for the next state probabilistically. So, considering our previous example, if Bob had juice today, he can have juice, ice cream, or cake the next day. In the same way, we can reach any state in the system from the previous state. The Markov process is shown in the following diagram:

If we have n states in a process, then there are n^2 possible transitions, because any state can be followed by any state. Each transition has a probability, and hence, we will have n^2 transition probabilities. For a Markov process, we will have the following three items:

States: This refers to the states in the system. In our example, let’s say there are three states: state 1, state 2, and state 3.
Transition matrix: This will have the probabilities of moving from one state to any other state. An example of the transition matrix is shown in the following screenshot:

This matrix shows that if the system was in state 1 yesterday, then the probability of it remaining in the same state today will be 0.1.

Initial state vector: This is the vector of the initial state of the system. (Any one of the states will have a probability of 1 and the rest will have a probability of 0 in this vector.)
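Choosing the next state probabilistically can be sketched in Java as follows. The 3 x 3 transition matrix here is hypothetical: only the 0.1 probability of staying in state 1 comes from the text, and the remaining values, the class name MarkovSampling, and the random seed are made up for the illustration.

```java
import java.util.Random;

class MarkovSampling {
    // Picks the next state from one row of the transition matrix.
    // u is a uniform random number in [0, 1); the row should sum to 1.
    static int nextState(double[] transitionRow, double u) {
        double cumulative = 0.0;
        for (int j = 0; j < transitionRow.length; j++) {
            cumulative += transitionRow[j];
            if (u < cumulative) {
                return j;
            }
        }
        return transitionRow.length - 1; // guards against rounding error
    }

    public static void main(String[] args) {
        // Hypothetical matrix: row i holds P(next state = j | current state = i).
        double[][] transition = {
            {0.1, 0.6, 0.3},
            {0.4, 0.4, 0.2},
            {0.3, 0.3, 0.4}
        };
        Random random = new Random(42);
        int state = 0; // start in state 1 (index 0)
        StringBuilder path = new StringBuilder("1");
        for (int day = 0; day < 5; day++) {
            state = nextState(transition[state], random.nextDouble());
            path.append(" -> ").append(state + 1);
        }
        System.out.println(path);
    }
}
```

Each row of the matrix is a probability distribution over the next state, so the function only ever needs the row that belongs to the current state.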


Introducing the Hidden Markov Model

The Hidden Markov Model (HMM) is a classification technique to predict the states of a system by observing the outcomes without having access to the actual states themselves. It is a Markov model in which the states are hidden.

Let’s continue with Bob’s snack example we saw earlier. Now assume we have one more set of events in place that is directly observable. We know what Bob has eaten for lunch, and his snack intake is related to his lunch. So, we have an observation state, which is Bob’s lunch, and hidden states, which are his snack intake. We want to build an algorithm that can forecast what Bob’s choice of snack would be based on his lunch.

In addition to the transition probability matrix, the Hidden Markov Model has one more matrix that is called an emission matrix. This matrix contains the probability of each observable state, given a hidden state. The entries of the emission matrix are as follows:

P(observable state | hidden state)

So, a Hidden Markov Model has the following properties:

State vector: This contains the probability of the hidden model being in a particular state at the start
Transition matrix: This has the probabilities of a hidden state, given the previous hidden state
Emission matrix: Given that the hidden model is in a particular hidden state, this has the probabilities of observing a particular observable state
Hidden states: This refers to the states of the system that can be defined by the Hidden Markov Model
Observable state: The states that are visible in the process

Using the Hidden Markov Model, three types of problems can be solved. The first two are related to the pattern recognition problem and the third type of problem generates a Hidden Markov Model, given a sequence of observations. Let’s look at these three types of problems:

Evaluation: This is finding out the probability of an observed sequence, given an HMM. From a number of different HMMs that describe different systems and a sequence of observations, our goal will be to find out which HMM will most probably generate the required sequence. We use the forward algorithm to calculate the probability of an observation sequence when a particular HMM is given and find out the most probable HMM.
Decoding: This is finding the most probable sequence of hidden states from some observations. We use the Viterbi algorithm to determine the most probable sequence of hidden states when you have a sequence of observations and an HMM.
Learning: Learning is generating the HMM from a sequence of observations. So, if we have such a sequence, we may wonder which is the most likely model to generate this sequence. The forward-backward algorithm is useful in solving this problem.
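To make the evaluation problem concrete, here is a minimal Java sketch of the forward algorithm. It is plain Java, not Mahout's implementation, and the two-state model in main is entirely hypothetical (the state meanings and all probabilities are made up). The sketch assumes a non-empty observation sequence and does no scaling, so it is only suitable for short sequences.

```java
class ForwardSketch {
    // Forward algorithm: probability of an observation sequence given an HMM.
    // initial[i]       = P(first hidden state = i)
    // transition[i][j] = P(next hidden state = j | current hidden state = i)
    // emission[i][k]   = P(observation = k | hidden state = i)
    static double sequenceProbability(double[] initial, double[][] transition,
                                      double[][] emission, int[] observations) {
        int n = initial.length;
        // alpha[i] = P(observations so far, current hidden state = i)
        double[] alpha = new double[n];
        for (int i = 0; i < n; i++) {
            alpha[i] = initial[i] * emission[i][observations[0]];
        }
        for (int t = 1; t < observations.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    sum += alpha[i] * transition[i][j];
                }
                next[j] = sum * emission[j][observations[t]];
            }
            alpha = next;
        }
        double total = 0.0;
        for (double a : alpha) {
            total += a;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical model with 2 hidden states and 2 observable states.
        double[] initial = {0.6, 0.4};
        double[][] transition = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emission = {{0.9, 0.1}, {0.2, 0.8}};
        int[] observations = {0, 1};
        System.out.println(sequenceProbability(initial, transition, emission, observations));
    }
}
```

To compare several candidate HMMs, the same function can be called once per model on the same observation sequence; the model with the highest probability is the most likely source of the sequence.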

The Hidden Markov Model is used in different applications such as speech recognition, handwritten letter recognition, genome analysis, parts of speech tagging, customer behavior modeling, and so on.


Using Mahout for the Hidden Markov Model

Apache Mahout has an implementation of the Hidden Markov Model. It is available in the org.apache.mahout.classifier.sequencelearning.hmm package.

The overall implementation is provided by eight different classes:

HmmModel: This is the main class that defines the Hidden Markov Model.

HmmTrainer: This class has algorithms that are used to train the Hidden Markov Model. The main algorithms are supervised learning, unsupervised learning, and unsupervised Baum-Welch.

HmmEvaluator: This class provides different methods to evaluate an HMM model. The following use cases are covered in this class:

    Generating a sequence of output states from a model (prediction)

    Computing the likelihood that a given model will generate the given sequence of output states (model likelihood)

    Computing the most likely hidden sequence for a given model and a given observed sequence (decoding)

HmmAlgorithms: This class contains implementations of the three major HMM algorithms: forward, backward, and Viterbi.

HmmUtils: This is a utility class and provides methods to handle HMM model objects.

RandomSequenceGenerator: This is a command-line tool to generate a sequence from a given HMM.

BaumWelchTrainer: This is the class to train an HMM from the console.

ViterbiEvaluator: This is also a command-line tool, used for Viterbi evaluation.

Now, let's work with Bob's example.

The following is the given initial probability vector:

Ice cream   Cake    Juice
0.36        0.51    0.13

The following will be the state transition matrix (rows are the previous state, columns the next state):

            Ice cream   Cake    Juice
Ice cream   0.365       0.500   0.135
Cake        0.250       0.125   0.625
Juice       0.365       0.265   0.370


The following will be the emission matrix (rows are hidden states, columns are observable states):

            Spicy food   Normal food   No food
Ice cream   0.10         0.20          0.70
Cake        0.50         0.25          0.25
Juice       0.80         0.10          0.10
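To make the decoding problem concrete, the following Python sketch runs the Viterbi algorithm on the three matrices given above. It is an illustration only and not the code Mahout executes; the function name viterbi and the three-lunch observation sequence are hypothetical, and the observable states are assumed to be numbered 0 (spicy food), 1 (normal food), and 2 (no food), in the column order of the emission matrix.

```python
STATES = ["ice cream", "cake", "juice"]              # hidden states (snacks)
SYMBOLS = ["spicy food", "normal food", "no food"]   # observable states 0, 1, 2

initial = [0.36, 0.51, 0.13]
transition = [
    [0.365, 0.500, 0.135],
    [0.250, 0.125, 0.625],
    [0.365, 0.265, 0.370],
]
emission = [
    [0.10, 0.20, 0.70],
    [0.50, 0.25, 0.25],
    [0.80, 0.10, 0.10],
]


def viterbi(initial, transition, emission, observations):
    """Return (most probable hidden-state path, probability of that path)."""
    n = len(initial)
    # best[i] = probability of the best path so far that ends in state i
    best = [initial[i] * emission[i][observations[0]] for i in range(n)]
    back_pointers = []
    for obs in observations[1:]:
        pointers, scores = [], []
        for j in range(n):
            prev = max(range(n), key=lambda i: best[i] * transition[i][j])
            pointers.append(prev)
            scores.append(best[prev] * transition[prev][j] * emission[j][obs])
        back_pointers.append(pointers)
        best = scores
    # Trace back from the most probable final state.
    state = max(range(n), key=lambda i: best[i])
    probability = best[state]
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, probability


lunches = [0, 1, 2]  # hypothetical observations: spicy food, normal food, no food
path, probability = viterbi(initial, transition, emission, lunches)
print([STATES[s] for s in path], probability)
```

Each row of the transition and emission matrices sums to 1, so the values can be used as probabilities without any normalization.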

Now we will execute a command-line-based example of this problem. We have three hidden states for what Bob has eaten as a snack: ice cream, cake, or juice. Then, we have three observable states for what he is having at lunch: spicy food, normal food, or no food at all. The following are the steps to execute from the command line:

1. Create a directory with the name hmm: mkdir /tmp/hmm. Go to this directory and create the sample input file of the observed states. This will include a sequence of Bob's lunch habits: spicy food (state 0), normal food (state 1), and no food (state 2). Execute the following command:

echo "0 1 2 2 2 1 1 0 0 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 1 1 1 1 2 2 2 2 2 0 2 1 2 0 2 1 2 1 1 0 0 0 1 0 1 0 2 1 2 1 2 1 2 1 1 0 0 2 2 0 2 1 1 0" > hmm-input

2. Run the Baum-Welch algorithm to train the model using the following command:

mahout baumwelch -i /tmp/hmm/hmm-input -o /tmp/hmm/hmm-model -nh 3 -no 3 -e .0001 -m 1000

The parameters used in the preceding command are as follows:

i: This is the input file location

o: This is the output location for the model

nh: This is the number of hidden states. In our example, it is three (ice cream, juice, or cake)

no: This is the number of observable states. In our example, it is three (spicy, normal, or no food)

e: This is the epsilon number. This is the convergence threshold value

m: This is the maximum iteration number

The following screenshot shows the output on executing the previous command:


3. Now we have an HMM model that can be used to build a predicted sequence. We will run the model to predict the next 10 states of the observable sequence using the following command:

mahout hmmpredict -m /tmp/hmm/hmm-model -o /tmp/hmm/hmm-predictions -l 10

The parameters used in the preceding command are as follows:

m: This is the path for the HMM model

o: This is the output directory path

l: This is the length of the generated sequence

4. To view the prediction for the next 10 observable states, use the following command:

cat /tmp/hmm/hmm-predictions

The output of the previous command is shown in the following screenshot:

From the output, we can say that the next observable states for Bob's lunch will be spicy, spicy, spicy, normal, normal, no food, no food, no food, no food, and no food.

5. Now, we will use one more algorithm to predict the hidden states. We will use the Viterbi algorithm to predict the hidden states for a given sequence of observable states. We will first create the sequence of observable states using the following command:

echo "0 1 2 0 2 1 1 0 0 1 1 2" > /tmp/hmm/hmm-viterbi-input

6. We will use the Viterbi command-line option to generate the output along with the likelihood of generating this sequence:

mahout viterbi --input /tmp/hmm/hmm-viterbi-input --output /tmp/hmm/hmm-viterbi-output --model /tmp/hmm/hmm-model --likelihood

The parameters used in the preceding command are as follows:

input: This is the input location of the file

output: This is the output location of the Viterbi algorithm's output

model: This is the location of the HMM model that we created earlier

likelihood: This requests the computed likelihood of the observed sequence

The following screenshot shows the output on executing the previous command:

7. Predictions from the Viterbi algorithm are saved in the output file and can be seen using the cat command:

cat /tmp/hmm/hmm-viterbi-output

The following output shows the predictions for the hidden states:


Summary

In this chapter, we discussed another classification technique: the Hidden Markov Model. You learned about deterministic and nondeterministic patterns. We also touched upon the Markov process and the Hidden Markov process in general. We checked the classes implemented inside Mahout to support the Hidden Markov Model. We took up an example to create the HMM model and then used this model to predict a sequence of observable states. We used the Viterbi algorithm implemented in Mahout to predict the hidden states in the system.

In the next chapter, we will cover one more interesting algorithm used in the classification area: Random forest.


Chapter 6. Learning Random Forest Using Mahout

Random forest is one of the most popular techniques in classification. It starts with a machine learning technique called the decision tree. In this chapter, we will explore the following topics:

Decision tree

Random forest

Using Mahout for Random forest


Decision tree

A decision tree is used for classification and regression problems. In simple terms, it is a predictive model that uses binary rules to calculate the target variable. In a decision tree, we use an iterative process of splitting the data into partitions, and then we split it further on branches. As in other classification model creation processes, we start with a training dataset in which the target variables or class labels are defined. The algorithm tries to break all the records in the training dataset into two parts based on one of the explanatory variables. The partitioning is then applied to each new partition, and this process continues until no more partitioning can be done. The core of the algorithm is to find the rule that determines the initial split. There are several algorithms to create decision trees, such as Iterative Dichotomiser 3 (ID3), Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID), and so on. A good explanation of ID3 can be found at http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html.

To choose the best splitter at a node from among the explanatory variables, the algorithm considers each variable in turn. Every possible split is considered and tried, and the best split is the one that produces the largest decrease in diversity of the classification label within each partition. This is repeated for all variables, and the winner is chosen as the best splitter for that node. The process is continued at the next node until we reach a node where we can make the decision.
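The split search can be sketched as follows. The text does not name a specific diversity measure, so this sketch assumes Gini impurity, the measure commonly associated with CART; the function names and the six toy rows are hypothetical and are not taken from any real dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure partition, larger for mixed ones."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Try every (feature, threshold) pair; keep the largest impurity decrease."""
    parent = gini(labels)
    best = None  # (decrease, feature index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must produce two non-empty partitions
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            decrease = parent - weighted
            if best is None or decrease > best[0]:
                best = (decrease, feature, threshold)
    return best


# Hypothetical toy data: (petal length, petal width) in cm, values made up.
rows = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (5.9, 2.1), (6.0, 2.3)]
labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
decrease, feature, threshold = best_split(rows, labels)
print(feature, threshold, decrease)
```

Ties are resolved in favor of the first candidate found, which is a simplification; real implementations may break ties differently.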

We create a decision tree from a training dataset, so it can suffer from the overfitting problem. This behavior creates a problem with real datasets. To improve this situation, a process called pruning is used. In this process, we remove branches and leaves of the tree to improve performance. Algorithms used to build the tree work best at the starting or root node, since all the information is available there. Later on, with each split, less data is available, and towards the end of the tree a particular node can show patterns that are specific to the subset of data that reached it. These patterns create problems when we use them to predict on a real dataset. Pruning methods let the tree grow and then remove the smaller branches that fail to generalize. Now let's take an example to understand the decision tree.

Consider the iris flower dataset. This dataset is hugely popular in the machine learning field. It was introduced by Sir Ronald Fisher. It contains 50 samples from each of three species of iris flower (Iris setosa, Iris virginica, and Iris versicolor). The four explanatory variables are the length and width of the sepals and petals in centimeters, and the target variable is the class to which the flower belongs.


As you can see in the preceding diagram, all the samples were initially treated as the setosa species, and then the explanatory variable petal length was used to divide the groups. At each step, the number of misclassified items was also calculated, which shows how many items were wrongly classified. After that, the petal width variable was taken into account. Usually, items at the leaf nodes are correctly classified.


Random forest

The Random forest algorithm was developed by Leo Breiman and Adele Cutler. Random forests grow many classification trees. They are an ensemble learning method for classification and regression that constructs a number of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees.

Single decision trees show the bias-variance tradeoff, so they usually have either high variance or high bias. The two sources of error are as follows:

Bias: This is an error caused by an erroneous assumption in the learning algorithm

Variance: This is an error that arises from sensitivity to small fluctuations in the training set

Random forests attempt to mitigate this problem by averaging to find a natural balance between the two extremes. A Random forest works on the idea of bagging, which is to average noisy and unbiased models to create a model with low variance. A Random forest algorithm works as a large collection of decorrelated decision trees. To understand the idea of a Random forest algorithm, let's work with an example.

Consider a training dataset that has lots of features (explanatory variables) and target variables or classes:

We create a sample set from the given dataset:

A different set of random features is taken into account to create each random sub-dataset. Now, from these sub-datasets, different decision trees will be created. So we have actually created a forest of different decision trees. Using these different trees, we will create a ranking system for all the classifiers. To predict the class of a new unknown item, we will use all the decision trees and separately find out which class each of these trees predicts. See the following diagram for a better understanding of this concept:

Different decision trees to predict the class of an unknown item

In this particular case, we have four different decision trees. We predict the class of an unknown item with each of the trees. As per the preceding figure, the first decision tree provides class 2 as the predicted class, the second decision tree predicts class 5, the third decision tree predicts class 5, and the fourth decision tree predicts class 3. Now, the Random forest counts the votes for each class. We have one vote each for class 2 and class 3 and two votes for class 5. Therefore, it decides that the predicted class for the new unknown item is class 5. So the class that gets the most votes is chosen for the new item. A Random forest has a lot of benefits in classification, and a few of them are mentioned in the following list:

The combination of learning models increases the accuracy of the classification

It runs effectively on large datasets as well

The generated forest can be saved and used for other datasets as well

It can handle a large number of explanatory variables
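The voting step from the four-tree example, together with the sampling idea behind bagging, can be sketched in Python. This is an illustration under simple assumptions, not Mahout's implementation; the names majority_vote and bootstrap_sample and the ten-row dataset are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Votes from the four trees in the example: classes 2, 5, 5, and 3.
tree_predictions = [2, 5, 5, 3]
print(majority_vote(tree_predictions))  # prints 5

# Each tree would be trained on its own resampled copy of the data.
rng = random.Random(42)
dataset = list(range(10))  # hypothetical row identifiers
sample = bootstrap_sample(dataset, rng)
print(len(sample))  # prints 10
```

Because every tree sees a different resampled dataset (and, in a Random forest, a different random subset of features), the trees make partly independent errors, which is what makes the vote more reliable than any single tree.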

Now that we have understood the Random forest theoretically, let's move on to Mahout and use the Random forest algorithm, which is available in Apache Mahout.


Using Mahout for Random forest

Mahout has an implementation of the Random forest algorithm. It is very easy to understand and use. So let's get started.

Dataset

We will use the NSL-KDD dataset. Since 1999, KDD'99 has been the most widely used dataset for the evaluation of anomaly detection methods. This dataset was prepared by S. J. Stolfo and is built based on the data captured in the DARPA'98 IDS evaluation program (R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, and M. A. Zissman, "Evaluating intrusion detection systems: The 1998 darpa off-line intrusion detection evaluation," discex, vol. 02, p. 1012, 2000).

DARPA'98 is about 4 GB of compressed raw (binary) tcpdump data of 7 weeks of network traffic, which can be processed into about 5 million connection records, each with about 100 bytes. The two weeks of test data have around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type.

NSL-KDD is a dataset suggested to solve some of the inherent problems of the KDD'99 dataset. You can download this dataset from http://nsl.cs.unb.ca/NSL-KDD/.

We will download the KDDTrain+_20Percent.ARFF and KDDTest+.ARFF datasets.

Note

In KDDTrain+_20Percent.ARFF and KDDTest+.ARFF, remove the first 44 lines (that is, all lines starting with @attribute). If this is not done, we will not be able to generate a descriptor file.


Steps to use the Random forest algorithm in Mahout

The steps to implement the Random forest algorithm in Apache Mahout are as follows:

1. Transfer the test and training datasets to HDFS using the following commands:

hadoop fs -mkdir /user/hue/KDDTrain

hadoop fs -mkdir /user/hue/KDDTest

hadoop fs -put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrain

hadoop fs -put /tmp/KDDTest+.arff /user/hue/KDDTest

2. Generate the descriptor file. Before you build a Random forest model based on the training data in KDDTrain+_20Percent.arff, a descriptor file is required. This is because every attribute in the training dataset needs to be labeled. From the labeled dataset, the algorithm can understand which attributes are numerical and which are categorical. Use the following command to generate the descriptor file:

hadoop jar $MAHOUT_HOME/core/target/mahout-core-xyz-job.jar \
  org.apache.mahout.classifier.df.tools.Describe \
  -p /user/hue/KDDTrain/KDDTrain+_20Percent.arff \
  -f /user/hue/KDDTrain/KDDTrain+.info \
  -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

Jar: This is the Mahout core jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class Describe is used here, and it takes three parameters:

p is the path for the data to be described.

f is the location for the generated descriptor file.

d is the information about the attributes of the data. N 3 C 2 N C 4 N C 8 N 2 C 19 N L defines that the dataset starts with a numeric attribute (N), followed by three categorical attributes (3 C), and so on. At the end, L defines the label.
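To check how a compact descriptor maps onto the columns of the dataset, it can be expanded by hand or with a few lines of Python. This sketch assumes, as the description of the d parameter suggests, that a number repeats the attribute code that follows it; the function name expand_descriptor is hypothetical and is not part of Mahout.

```python
def expand_descriptor(descriptor):
    """Expand a compact descriptor such as 'N 3 C L' into one code per column."""
    columns = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)  # a number applies to the code that follows it
        else:
            columns.extend([token] * repeat)
            repeat = 1
    return columns


columns = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
print(len(columns), columns.count("N"), columns.count("C"), columns.count("L"))
# prints 42 34 7 1
```

The 42 columns correspond to the 41 features of each connection record plus the label.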

The output of the previous command is shown in the following screenshot:

3. Build the Random forest using the following command:

hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar \
  org.apache.mahout.classifier.df.mapreduce.BuildForest \
  -Dmapred.max.split.size=1874231 \
  -d /user/hue/KDDTrain/KDDTrain+_20Percent.arff \
  -ds /user/hue/KDDTrain/KDDTrain+.info \
  -sl 5 -p -t 100 -o /user/hue/nsl-forest

Jar: This is the Mahout examples jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The main class BuildForest is used to build the forest with the other arguments, which are shown in the following list:

Dmapred.max.split.size indicates to Hadoop the maximum size of each partition.

d stands for the data path.

ds stands for the location of the descriptor file.

sl is the number of variables to select randomly at each tree node. Here, each tree is built using five randomly selected attributes per node.

p uses the partial data implementation.

t stands for the number of trees to grow. Here, the command builds 100 trees using the partial implementation.

o stands for the output path that will contain the decision forest.

In the end, the process will show the following result:

4. Use this model to classify the new dataset:

hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar \
  org.apache.mahout.classifier.df.mapreduce.TestForest \
  -i /user/hue/KDDTest/KDDTest+.arff \
  -ds /user/hue/KDDTrain/KDDTrain+.info \
  -m /user/hue/nsl-forest -a -mr \
  -o /user/hue/predictions

Jar: This is the Mahout examples jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. The class to test the forest has the following parameters:

i indicates the path for the test data

ds stands for the location of the descriptor file

m stands for the location of the forest generated by the previous command

a tells the job to run the analyzer to compute the confusion matrix

mr tells Hadoop to distribute the classification

o stands for the location to store the predictions in

The job provides the following confusion matrix:

So, from the confusion matrix, it is clear that 9,396 normal instances were correctly classified and 315 normal instances were incorrectly classified as anomalies. The accuracy is 77.7635 percent (instances correctly classified by the model / all classified instances). The output file in the predictions folder contains a list of values in which 0 marks a normal record and 1 marks an anomaly.
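The accuracy formula in parentheses can be written out as a short Python function. The counts below are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the numbers produced by the TestForest job.

```python
def accuracy(confusion):
    """confusion[actual][predicted] -> fraction of correctly classified instances."""
    correct = sum(confusion[label][label] for label in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


# Hypothetical confusion matrix: rows are actual classes, columns are predictions.
hypothetical = {
    "normal": {"normal": 90, "anomaly": 10},
    "anomaly": {"normal": 30, "anomaly": 70},
}
print(accuracy(hypothetical))  # prints 0.8
```

The diagonal entries (actual class equals predicted class) are the correctly classified instances; everything off the diagonal is an error.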


Summary

In this chapter, we discussed the Random forest algorithm. We started our discussion by understanding the decision tree and continued with an understanding of the Random forest. We took up the NSL-KDD dataset, which is used to build predictive systems for cyber security. We used Mahout to build the Random forest, used it with the test dataset, and generated the confusion matrix and other statistics for the output.

In the next chapter, we will look at the final classification algorithm available in Apache Mahout. So stay tuned!


Chapter 7. Learning Multilayer Perceptron Using Mahout

To understand a Multilayer Perceptron (MLP), we will first explore one more popular machine learning technique: the neural network. In this chapter, we will explore the following topics:

Neural network and neurons

MLP

Using Mahout for MLP implementation


Neural network and neurons

The neural network is an old algorithm, and it was developed with a goal in mind: to provide the computer with a brain. The neural network is inspired by the biological structure of the human brain, where multiple neurons are connected and form columns and layers. A neuron is an electrically excitable cell that processes and transmits information through electrical and chemical signals. Perceptual input enters the neural network through our sensory organs and is then further processed at higher levels. Let's understand how neurons work in our brain.

Neurons are computational units in the brain that collect input from input nerves, which are called dendrites. They perform computation on these input messages and send the output using output nerves, which are called axons. See the following figure (http://vv.carleton.ca/~neil/neural/neuron-a.html):

Along the same lines, we develop a neural network in computers. We can represent a neuron in our algorithm as shown in the following figure:


Here, x1, x2, and x3 are the input features, and they are passed to a function f, which does the computation and provides the output. This activation function is usually chosen from the family of sigmoidal functions (as defined in Chapter 3, Learning Logistic Regression / SGD Using Mahout). In the case of classification problems, softmax activation functions are used. In classification problems, we want the output to be the probabilities of the target classes. So, it is desirable for each output to lie between 0 and 1 and for the outputs to sum to 1. The softmax function enforces these constraints. It is a generalization of the logistic function. More details on the softmax function can be found at http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html.
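The two activation functions mentioned here can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function names and the example scores are hypothetical, and subtracting the maximum score inside softmax is a common numerical-stability choice rather than something the text prescribes.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(probabilities, sum(probabilities))

# With two classes and the second score fixed at 0, softmax reduces to the sigmoid.
print(softmax([1.3, 0.0])[0], sigmoid(1.3))
```

The last line illustrates why softmax is described as a generalization of the logistic function.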


Multilayer Perceptron

A neural network or artificial neural network generally refers to an MLP network. We defined the neuron as implemented in computers in the previous section. An MLP network consists of multiple layers of these neuron units. Let's understand a perceptron network of three layers, as shown in the next figure. The first layer of the MLP represents the input and has no other purpose than routing the input to every connected unit in a feed-forward fashion. The second layer is called the hidden layer, and the last layer serves the special purpose of determining the output. The activation of a neuron in the hidden layer can be defined as a function of the weighted sum of all its inputs. Neuron 1 in layer 2 is defined as follows:

$$Y_1^{(2)} = g\left(w_{10}^{(1)} x_0 + w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3\right)$$

The first term, where $x_0 = 1$, is called the bias and can be used as an offset, independent of the input. Neuron 2 in layer 2 is defined as follows:

$$Y_2^{(2)} = g\left(w_{20}^{(1)} x_0 + w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{23}^{(1)} x_3\right)$$

Neuron 3 in layer 2 is defined as follows:

$$Y_3^{(2)} = g\left(w_{30}^{(1)} x_0 + w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + w_{33}^{(1)} x_3\right)$$

Here, g is a sigmoid function, as defined in Chapter 3, Learning Logistic Regression / SGD Using Mahout. The function is as follows:

$$g(z) = \frac{1}{1 + e^{-z}}$$
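The three hidden-neuron equations can be evaluated together with a short Python sketch. It assumes the common convention that the bias input x0 is fixed at 1, so that the first weight of each neuron acts as an offset; the weights and input values are hypothetical, and the function name layer_output is illustrative only.

```python
import math


def g(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(weights, inputs):
    """weights[j] holds (w_j0, w_j1, ..., w_jn); x0 = 1 is prepended as the bias input."""
    extended = [1.0] + list(inputs)
    return [g(sum(w * x for w, x in zip(row, extended))) for row in weights]


# Hypothetical weights: 3 hidden neurons, each with a bias weight and 3 input weights.
weights_layer1 = [
    [0.1, 0.4, -0.2, 0.3],
    [-0.3, 0.2, 0.5, -0.1],
    [0.0, -0.6, 0.1, 0.2],
]
hidden = layer_output(weights_layer1, [1.0, 2.0, 3.0])  # x1, x2, x3
print(hidden)
```

Feeding the hidden values into a second call of layer_output with the output-layer weights would complete the feed-forward pass of a three-layer network.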

In this MLP network, the output of each neuron unit in the input and hidden layers is distributed to all the nodes in the next layer, and this is why this type of network is called a fully connected network. In this network, no values are fed back to the previous layer, which is why it is called a feed-forward network. (Propagating errors backward through the network to adjust the weights during training is another strategy, known as backpropagation. Details on this can be found at http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html.)

An MLP network can have more than one hidden layer. The training process of the MLP consists of finding the values of the weights that bring the predicted value as close as possible to the actual one. To build an effective network, we consider a lot of items such as the number of hidden layers and neuron units in each layer, the cost function to minimize the error between predicted and actual values, and so on.

Now let's discuss two more important and problematic questions that arise when creating an MLP network:

How many hidden layers should one use for the network?

How many hidden units (neuron units) should one use in a hidden layer?

Zero hidden layers are required to resolve linearly separable data. Assuming your data does require separation by a non-linear technique, always start with one hidden layer. Almost certainly, that's all you will need. If your data is separable using an MLP, then this MLP probably only needs a single hidden layer. In order to select the number of units in the different layers, these are the guidelines:

Input layer: This refers to the number of explanatory variables in the model plus one for the bias node.

Output layer: In the case of classification, this refers to the number of target classes, and in the case of regression, this is one.

Hidden layer: Start your network with one hidden layer and use a number of neuron units equivalent to the units in the input layer. The best way is to train several neural networks with different numbers of hidden layers and hidden neurons and measure the performance of these networks using cross-validation. You can stick with the number that yields the best-performing network. Problems that require two hidden layers are rarely encountered. However, neural networks that have more than one hidden layer can represent functions with any kind of shape. There is currently no theory to justify the use of neural networks with more than two hidden layers. In fact, for many practical problems, there is no reason to use any more than one hidden layer. A network with no hidden layer is only capable of representing linearly separable functions. Networks with one hidden layer can approximate any function that contains a continuous mapping from one finite space to another, and networks with two hidden layers can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy (Chapter 5 of the book Introduction to Neural Networks for Java).

Number of neurons or hidden units: Use a number of neuron units equivalent to the units in the input layer. The number of hidden units should be less than twice the number of units in the input layer. Another rule to calculate this is (number of input units + number of output units) * 2/3.
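The rules of thumb for the number of hidden units are simple arithmetic and can be collected in one small Python helper. The function name and dictionary keys are hypothetical, and the sketch counts only the explanatory variables as input units (without the bias node); the values are starting points to be checked by cross-validation, not fixed answers.

```python
def hidden_unit_suggestions(n_inputs, n_outputs):
    """Collect the starting-point heuristics for sizing a single hidden layer."""
    return {
        "same_as_input": n_inputs,                           # match the input layer
        "two_thirds_rule": (n_inputs + n_outputs) * 2 / 3,   # (inputs + outputs) * 2/3
        "upper_bound_exclusive": 2 * n_inputs,               # stay below twice the input size
    }


# A dataset with 4 explanatory variables and 3 target classes.
suggestions = hidden_unit_suggestions(4, 3)
print(suggestions)
```

For four inputs and three outputs, the two-thirds rule gives 14/3, or roughly 4.7, so a hidden layer of four or five units would be a reasonable first candidate to validate.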

Test for generalization errors, training errors, bias, and variance. As nodes are added, the generalization error dips and then begins to increase again; the number of nodes just before it starts to increase is usually the best choice.


Now let's move on to the next section and explore how we can use Mahout for an MLP.


MLP implementation in Mahout

The MLP implementation is based on a more general neural network class. It is implemented to run on a single machine using Stochastic Gradient Descent, where the weights are updated using one data point at a time.

The number of layers and units per layer can be specified manually and determines the whole topology, with each unit being fully connected to the previous layer. A bias unit is automatically added to the input of every layer. A bias unit is helpful for shifting the activation function to the left or right. It is like adding a constant term to a linear function.

Currently, the logistic sigmoid is used as the squashing function in every hidden and output layer.

The command-line version does not perform multiple iterations over the data, which leads to bad results on small datasets. Another restriction is that the CLI version of the MLP only supports classification, since the labels have to be given explicitly when executing the implementation from the command line.

A learned model can be stored and updated with new training instances using the `--update` flag. The output of the classification result is saved as a .txt file and only consists of the assigned labels. Apart from the command-line interface, it is possible to construct and compile more specialized neural networks using the API and interfaces in the mrlegacy package. (The core package is renamed as mrlegacy.)

On the command line, we use the TrainMultilayerPerceptron and RunMultilayerPerceptron classes that are available in the mrlegacy package, along with three other classes: NeuralNetwork.java, NeuralNetworkFunctions.java, and MultilayerPerceptron.java. For this particular implementation, users can freely control the topology of the MLP, including the following:

The size of the input layer

The number of hidden layers

The size of each hidden layer

The size of the output layer

The cost function

The squashing function

The model is trained with an online learning approach, where the weights of the neurons in the MLP are updated incrementally using the backpropagation algorithm proposed by Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986), Learning representations by back-propagating errors. Nature, 323, 533-536.


Using Mahout for MLP

Mahout has an implementation of an MLP network. The MLP implementation is currently located in the Map-Reduce-Legacy package. As with other classification algorithms, two separate classes are implemented to train and use this classifier. For training the classifier, the org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron class is used, and for running the classifier, the org.apache.mahout.classifier.mlp.RunMultilayerPerceptron class is used. There are a number of parameters that are used with these classes, but we will discuss these parameters once we run our example on a dataset.

Dataset

In this chapter, we will train an MLP to classify the iris dataset. The iris flower dataset contains data on three flower species, where each data point consists of four features. This dataset was introduced by Sir Ronald Fisher. It consists of 50 samples from each of three species of iris. These species are Iris setosa, Iris virginica, and Iris versicolor. Four features were measured from each sample:

Sepal length

Sepal width

Petal length

Petal width

All measurements are in centimeters. You can download this dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ and save it as a .csv file, as shown in the following screenshot:

This dataset will look like the following screenshot:


Steps to use the MLP algorithm in Mahout

The steps to use the MLP algorithm in Mahout are as follows:

1. Create the MLP model.

To create the MLP model, we will use the TrainMultilayerPerceptron class. Use the following command to generate the model:

bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron \
  -i /tmp/irisdata.csv -labels Iris-setosa Iris-versicolor Iris-virginica \
  -mo /tmp/model.model -ls 4 8 3 -l 0.2 -m 0.35 -r 0.0001

You can also run this using the Mahout core jar (xyz stands for the version). If you have directly installed Mahout, it can be found under the /usr/lib/mahout folder. Execute the following command:

java -cp /usr/lib/mahout/mahout-core-xyz-job.jar \
  org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron \
  -i /tmp/irisdata.csv -labels Iris-setosa Iris-versicolor Iris-virginica \
  -mo /user/hue/mlp/model.model -ls 4 8 3 -l 0.2 -m 0.35 -r 0.0001

The TrainMultilayerPerceptron class is used here, and it takes the following parameters:

-i is the path of the input dataset. Here, we have put the dataset under the /tmp folder (local filesystem).
-labels lists the labels defined in the dataset. Here, we have the labels Iris-setosa, Iris-versicolor, and Iris-virginica.
-mo is the output location for the created model.
-ls is the number of units per layer, including the input, hidden, and output layers. This parameter specifies the topology of the network. Here, we have 4 units for the input features, 8 for the hidden layer, and 3 for the output classes.
-l is the learning rate that is used for weight updates. The default is 0.5. Neural networks are trained with algorithms that approximate gradient descent, and learning is possible by either batch or online methods. In batch training, weight changes are accumulated over an entire presentation of the training data (an epoch) before being applied, while online training updates weights after the presentation of each training example (instance). More details can be found at http://axon.cs.byu.edu/papers/Wilson.nn03.batch.pdf.
-m is the momentum weight that is used for gradient descent. This must be in the range 0 to 1.0.
-r is the regularization value for the weight vector. This must be in the range 0 to 0.1. It is used to prevent overfitting.


2. To test/run the MLP classification of the trained model, we can use the following command:

bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron \
  -i /tmp/irisdata.csv -cr 0 3 -mo /tmp/model.model -o /tmp/labelResult.txt

You can also run it using the Mahout core jar (xyz stands for the version). If you have installed Mahout directly, the jar can be found under the /usr/lib/mahout folder. Execute the following command:

java -cp /usr/lib/mahout/mahout-core-xyz-job.jar \
  org.apache.mahout.classifier.mlp.RunMultilayerPerceptron \
  -i /tmp/irisdata.csv -cr 0 3 -mo /tmp/model.model -o /tmp/labelResult.txt

The RunMultilayerPerceptron class is employed here to use the model. This class also takes different parameters, which are as follows:

-i indicates the input dataset location
-cr is the range of columns to use from the input file, starting with 0 (that is, `-cr 0 5` for including the first six columns only)
-mo is the location of the model built earlier
-o is the path to store labeled results from running the model
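The learning rate (-l), momentum (-m), and regularization (-r) values passed to the training command in step 1 all contribute to each weight update. The following is a minimal sketch of one common textbook formulation, assuming a single weight and made-up gradient values. The MomentumUpdate class and its delta method are illustrative names only, and Mahout's internal update rule may differ in detail.

```java
// Textbook update rule for illustration; not taken from Mahout's source code.
class MomentumUpdate {
    // Returns the change to apply to one weight: a learning-rate step on the
    // regularized gradient plus a momentum term that reuses the previous change.
    static double delta(double gradient, double weight, double previousDelta,
                        double learningRate, double momentum, double regularization) {
        double regularizedGradient = gradient + regularization * weight;
        return -learningRate * regularizedGradient + momentum * previousDelta;
    }

    public static void main(String[] args) {
        double weight = 0.5;
        double previousDelta = 0.0;
        double[] gradients = {0.4, 0.3, 0.1}; // hypothetical per-example gradients
        // Online training: the weight changes after every example.
        for (double gradient : gradients) {
            previousDelta = delta(gradient, weight, previousDelta, 0.2, 0.35, 0.0001);
            weight += previousDelta;
            System.out.println("weight = " + weight);
        }
    }
}
```

In batch training, the loop would instead sum the gradients over the whole epoch and call delta once at the end.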


Summary

In this chapter, we discussed one of the newly implemented algorithms in Mahout: MLP. We started our discussion by understanding neural networks and neuron units and continued our discussion further to understand the MLP network algorithm. We discussed how to choose different layer units. We then moved to Mahout and used the iris dataset to test and run an MLP algorithm implemented in Mahout. With this, we have finished our discussion on classification algorithms available in Apache Mahout.

Now we move on to the next chapter of this book, where we will discuss the new changes coming up in the new Mahout release.


Chapter 8. Mahout Changes in the Upcoming Release

Mahout is a community-driven project and its community is very strong. This community decided on some major changes for the upcoming 1.0 release. In this chapter, we will explore the upcoming changes and developments in Apache Mahout. We will look at the following topics in brief:

New changes due in Mahout 1.0
Apache Spark
H2O-platform-related work in Apache Mahout


Mahout new changes

Mahout was using the MapReduce programming model to handle large datasets. At the end of April 2014, the community decided to stop implementing new MapReduce algorithms. This decision has a valid reason: Mahout’s codebase will be moving to modern data processing systems that offer a richer programming model and more efficient execution than Hadoop’s MapReduce.

Mahout has started building its implementation on top of a Domain Specific Language (DSL) for linear algebraic operations. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark. The Scala DSL and the algebraic optimizer together form the Scala and Spark bindings for Mahout.


Mahout Scala and Spark bindings

With Mahout Scala bindings and Mahout Spark bindings for linear algebra subroutines, developers in Mahout are trying to bring semantic explicitness to Mahout’s in-core and out-of-core linear algebra subroutines. They are doing this while adding the benefits of the strong programming environment of Scala and capitalizing on the scalability benefits of Spark and GraphX. The Scala binding is used to provide support for the Scala DSL, and this will make writing machine learning programs easier.

Mahout Scala and Spark bindings are packages that aim to provide an R-like look and feel to Mahout’s in-core and out-of-core Spark-backed linear algebra. An important part of the Spark bindings is the expression optimizer. This optimizer looks at the entire expression and decides how it can be simplified and which physical operators should be picked. A high-level diagram of the binding stack is shown in the following figure (https://issues.apache.org/jira/secure/attachment/12638098/BindingsStack.jpg):

The Spark binding shell has also been implemented in Mahout 1.0. Let’s understand the Apache Spark project first and then we will revisit the Spark binding shell in Mahout.


Apache Spark

Apache Spark is an open source, in-memory, general-purpose computing system. Spark’s in-memory technique can make some workloads up to 100 times faster than disk-based processing. Instead of Hadoop-like disk-based computation, Spark uses cluster memory to load all the data into memory, and this data can then be queried repeatedly.

Apache Spark provides high-level APIs in Java, Python, and Scala and an optimized engine that supports general execution graphs. It provides the following high-level tools:

Spark SQL: This is for SQL and structured data processing.
MLlib: This is Spark’s scalable machine learning library that consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives.
GraphX: This is the new Spark API for graphs and graph-parallel computation.
Spark Streaming: This can collect data from many sources, process it using complex algorithms, and push the results to filesystems, databases, and live dashboards.

As Spark is gaining popularity among data scientists, the Mahout community is also quickly working on making Mahout algorithms run on Spark’s execution engine to speed up calculations by 10 to 100 times. Mahout provides several important building blocks to create recommendations using Spark. Spark-itemsimilarity can be used to create "other people also liked these things" kinds of recommendations and, when paired with a search engine, can personalize recommendations for individual users. Spark-rowsimilarity can provide non-personalized content-based recommendations and, when paired with a search engine, can be used to personalize content-based recommendations (http://comments.gmane.org/gmane.comp.apache.mahout.scm/6513).
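The idea behind "other people also liked these things" recommendations can be illustrated by counting how often two items appear together in user histories. The following is a deliberately simplified sketch that uses raw co-occurrence counts on made-up data. The CooccurrenceSketch class is hypothetical, and the scoring used by Mahout's own item similarity job may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified illustration only; not Mahout's item similarity implementation.
class CooccurrenceSketch {
    // Counts, for every pair of items, how many users interacted with both.
    static Map<String, Integer> cooccurrences(List<Set<String>> histories) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> history : histories) {
            List<String> items = new ArrayList<>(history);
            Collections.sort(items); // gives stable pair keys such as "a|b"
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "|" + items.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Set<String>> histories = new ArrayList<>();
        histories.add(new HashSet<>(Arrays.asList("a", "b")));      // hypothetical user 1
        histories.add(new HashSet<>(Arrays.asList("a", "b", "c"))); // hypothetical user 2
        System.out.println(cooccurrences(histories));
    }
}
```

Pairs with high counts are candidates for "people who liked this also liked that" suggestions.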


Using Mahout’s Spark shell

You can use Mahout’s Spark shell by referring to the following steps:

1. Download Spark from http://spark.apache.org/downloads.html.

2. Create a new folder with the name spark using the following commands and move the downloaded file there:

mkdir /tmp/spark
mv ~/Downloads/spark-1.1.1.tgz /tmp/spark

3. Unpack the archived file in the folder using the following commands:

cd /tmp/spark
tar xzf spark-1.1.1.tgz

4. This will unpack the file under /tmp/spark/spark-1.1.1. Now, move to the newly created folder and run the following commands:

cd spark-1.1.1
sbt/sbt assembly

This will build Spark on your system as shown in the following screenshot:

5. Now create a Mahout directory and move to it using the following commands:

mkdir /tmp/Mahout
cd /tmp/Mahout

6. Check out the master branch of Mahout from GitHub using the following command:

git clone https://github.com/apache/mahout mahout


The output of the preceding command is shown in the following screenshot:

7. Change your directory to the newly created Mahout directory and build Mahout:

cd mahout
mvn -DskipTests clean install

The output of the preceding command is shown in the following screenshot:

8. Move to the directory where you unpacked Spark and type the following command to start Spark locally:

cd /tmp/spark/spark-1.1.1
sbin/start-all.sh

The output of the preceding command is shown in the following screenshot:

9. Open a browser; point it to http://localhost:8080/ to check whether Spark has successfully started. Copy the URL of the Spark master at the top of the page (it starts with spark://).

10. Define the following environment variables:

export MAHOUT_HOME=[directory into which you checked out Mahout]
export SPARK_HOME=[directory where you unpacked Spark]
export MASTER=[url of the Spark master]

11. Finally, change to the directory where you unpacked Mahout and type bin/mahout spark-shell; you should see the shell starting and get the mahout> prompt.

Now your Mahout Spark shell is ready and you can start playing with data. For more information on this topic, see the implementation section at https://mahout.apache.org/users/sparkbindings/play-with-shell.html.


H2O platform integration

As discussed earlier, experimental work to integrate Mahout and the H2O platform is also in progress. The integration provides an H2O backend to the Mahout algebra DSL.

H2O makes Hadoop do math! H2O scales statistics, machine learning, and math over big data. It is extensible, and users can build blocks using simple math legos in the core. H2O keeps familiar interfaces such as R, Excel, and JSON so that big data enthusiasts and experts can explore, munge, model, and score datasets using a range of simple-to-advanced algorithms. Data collection is easy, while decision making is hard. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling. It also has a vision of online scoring and modeling in a single platform (http://0xdata.com/download/).

H2O is fundamentally a peer-to-peer system. H2O nodes join together to form a cloud on which high-performance distributed math can be executed. Each node joins a cloud of a given name. Multiple clouds can exist on the same network at the same time as long as their names are different. Multiple nodes can exist on the same server as well (they can even belong to the same cloud).

The Mahout H2O integration fits into this model by having N-1 worker nodes and one driver node, all belonging to the same cloud name. The default cloud name used for the integration is mah2out. Clouds have to be spun up as per their task/job.

More details can be found at https://issues.apache.org/jira/browse/MAHOUT-1500.


Summary

In this chapter, we discussed the upcoming release of Mahout 1.0 and the changes that are currently going on. We also glanced through the Scala and Spark bindings and Apache Spark. Finally, we discussed a high-level overview of the H2O Mahout integration.

Now let’s move on to the final chapter of this book, where we will develop a production-ready classifier.


Chapter 9. Building an E-mail Classification System Using Apache Mahout

In this chapter, we will create a classifier system using Mahout. In order to build this system, we will cover the following topics:

Getting the dataset
Preparation of the dataset
Preparing the model
Training the model

In this chapter, we will target the creation of two different classifiers. The first one will be an easy one because you can both create and test it on a pseudo-distributed Hadoop installation. For the second classifier, I will provide you with all the details, so you can run it using your fully distributed Hadoop installation. I will count the second one as a hands-on exercise for the readers of this book.

First of all, let’s understand the problem statement for the first use case. Nowadays, in most e-mail systems, we see that e-mails are classified as spam or not spam. E-mails that are not spam are delivered directly into our inbox, but spam e-mails are stored in a folder called Spam. Usually, an incoming e-mail is categorized as spam based on a certain pattern such as the message subject, the sender’s e-mail address, or certain keywords in the message body. We will create a classifier using Mahout, which will classify an e-mail as spam or not spam. We will use the dataset from SpamAssassin, an Apache open source project, for this task.

For the second use case, we will create a classifier that can predict the group an incoming e-mail belongs to. There are lots of open source projects under the Apache Software Foundation, such as Apache Mahout, Apache Hadoop, Apache Solr, and so on. We will take the Apache Software Foundation (ASF) e-mail dataset and use it to create and train our model so that it can classify a new incoming e-mail. So, based on certain features, we will be able to predict which group a new incoming e-mail belongs to.

In Mahout’s classification problem, we will have to identify a pattern in the dataset to help us predict the group of a new e-mail. We already have a dataset, which is separated by project names. We will use the ASF public e-mail archives dataset for this use case.

Now, let’s consider our first use case: the spam e-mail detection classifier.


Spam e-mail dataset

As I mentioned, we will be using the Apache SpamAssassin project’s dataset. Apache SpamAssassin is an open source spam filter. Download 20021010_easy_ham.tar.bz2 and 20021010_spam.tar.bz2 from http://spamassassin.apache.org/publiccorpus/, as shown in the following screenshot:


Creating the model using the Assassin dataset

We can create the model with the help of the following steps:

1. Create a folder under /tmp/assassin with the name dataset, move into the folder, and unpack the datasets using the following commands:

mkdir /tmp/assassin/dataset
cd /tmp/assassin/dataset
tar -xvf /tmp/assassin/20021010_easy_ham.tar.bz2
tar -xvf /tmp/assassin/20021010_spam.tar.bz2

This will create two folders under the dataset folder, easy_ham and spam, as shown in the following screenshot:

2. Create a folder in HDFS and move this dataset into Hadoop:

hadoop fs -mkdir /user/hue/assassin/
hadoop fs -put /tmp/assassin/dataset /user/hue/assassin

Now our data preparation is done. We have downloaded the data and moved this data into HDFS. Let’s move on to the next step.

3. Convert this data into sequence files so that we can process it using Hadoop:

bin/mahout seqdirectory -i /user/hue/assassin/dataset -o /user/hue/assassinseq-out


4. Convert the sequence file into sparse vectors (Mahout algorithms accept input in vector format, which is why we are converting the sequence file into sparse vectors) by using the following command:

bin/mahout seq2sparse -i /user/hue/assassinseq-out/part-m-00000 -o /user/hue/assassinvec -lnorm -nv -wt tfidf

The options in the preceding command are explained as follows:

lnorm: This option specifies that the output vectors are to be log normalized.
nv: This option specifies that named vectors are to be created.
wt: This option identifies the kind of weight to use. Here, we use tf-idf.

5. Split the set of vectors for training and testing the model, as follows:

bin/mahout split -i /user/hue/assassinvec/tfidf-vectors \
  --trainingOutput /user/hue/assassindatatrain \
  --testOutput /user/hue/assassindatatest \
  --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

The preceding command can be explained as follows:

The randomSelectionPct parameter specifies the percentage of the data that is randomly selected as the test dataset; the rest is used for training. In this case, it’s 20 percent for test and 80 percent for training.
The xm parameter specifies the execution method to use, which is either sequential or mapreduce. Here, the split runs sequentially.
Separately, the seq2sparse command used in step 4 accepts a maxDFSigma (xs) option. It specifies what portion of the tf (tf-idf) vectors is to be used, expressed in times the standard deviation (sigma) of the document frequencies of these vectors. It can be used to remove really high frequency terms. It is expressed as a double value. A good value to specify is 3.0. If the value is less than 0, no vectors will be filtered out.

6. Now, train the model using the following command:

bin/mahout trainnb -i /user/hue/assassindatatrain -el -o /user/hue/prodmodel -li /user/hue/prodlabelindex -ow -c

7. Now, test the model using the following command:

bin/mahout testnb -i /user/hue/assassindatatest -m /user/hue/prodmodel/ -l /user/hue/prodlabelindex -ow -o /user/hue/prodresults
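The random selection performed by the split command in step 5 can be pictured with a small sketch. The following code is a hypothetical illustration, not Mahout's implementation: it sends each item to the test set with the given percentage probability, so the resulting sizes are only approximately 20 and 80 percent. The RandomSplit class, its split method, and the seed parameter are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a random train/test split; not Mahout code.
class RandomSplit {
    // Returns two lists: index 0 is the training set, index 1 is the test set.
    // Each item lands in the test set with probability testPct / 100.
    static <T> List<List<T>> split(List<T> items, int testPct, long seed) {
        Random random = new Random(seed);
        List<T> train = new ArrayList<>();
        List<T> test = new ArrayList<>();
        for (T item : items) {
            if (random.nextInt(100) < testPct) {
                test.add(item);
            } else {
                train.add(item);
            }
        }
        List<List<T>> result = new ArrayList<>();
        result.add(train);
        result.add(test);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> vectors = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            vectors.add(i); // stand-ins for document vectors
        }
        List<List<Integer>> parts = split(vectors, 20, 42L);
        System.out.println("train: " + parts.get(0).size() + ", test: " + parts.get(1).size());
    }
}
```

Keeping the test items out of training is what makes the accuracy reported by the test step meaningful.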


The results are displayed on the console. As per the confusion matrix, the system has correctly classified 99.53 percent of the instances given.
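The percentage of correctly classified instances is derived from the confusion matrix by dividing the sum of the diagonal by the total number of instances. The following is a minimal sketch of that calculation. The counts in main are made up for illustration and are not the results of the run above, and the ConfusionMatrixAccuracy class is a hypothetical helper.

```java
// Hypothetical helper that computes accuracy from a confusion matrix.
class ConfusionMatrixAccuracy {
    // Rows are actual classes, columns are predicted classes.
    static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) {
                    correct += matrix[i][j]; // diagonal cells are correct predictions
                }
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("Confusion matrix is empty");
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Made-up counts: rows = actual (ham, spam), columns = predicted (ham, spam).
        int[][] matrix = {{90, 10}, {5, 95}};
        System.out.println("accuracy = " + accuracy(matrix));
    }
}
```

The off-diagonal cells show which classes are confused with each other, which a single accuracy figure hides.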

We can use this created model to classify new documents. To do this, we can either use a Java program or create a servlet that can be deployed on our server.

Let’s take an example of a Java program in continuation of this exercise.


Program to use a classifier model

We will create a Java program that will use our model to classify new e-mails. This program will take the model, label index, dictionary file, document frequency, and a text file as input and will generate a score for each category. The category will be decided based on the highest score.

Let’s have a look at this program step by step:

The .jar files required to compile this program are as follows:

Hadoop-core-x.y.x.jar
Mahout-core-xyz.jar
Mahout-integration-xyz.jar
Mahout-math-xyz.jar

The import statements are listed as follows. We are discussing this because there are lots of changes in the Mahout releases and people usually find it difficult to get the correct classes.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.vectorizer.TFIDF;
import org.apache.hadoop.io.*;
import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;


The supporting method to read the dictionary is as follows:

public static Map<String, Integer> readDictionary(Configuration conf, Path dictionaryPath) {
    // word => word ID
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionaryPath, true, conf)) {
        dictionary.put(pair.getFirst().toString(), pair.getSecond().get());
    }
    return dictionary;
}

The supporting method to read the document frequency is as follows:

public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
    // word ID => number of documents containing the word
    Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
    for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
        documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
    }
    return documentFrequency;
}

The first part of the main method is used to perform the following actions:

Getting the input
Loading the model
Initializing StandardNaiveBayesClassifier using our created model
Reading the label index, document frequency, and dictionary created while creating the vectors from the dataset

The following code can be used for the preceding actions:

public static void main(String[] args) throws Exception {
    if (args.length < 5) {
        System.out.println("Arguments: [model] [label index] [dictionary] [document frequency] [new file]");
        return;
    }
    String modelPath = args[0];
    String labelIndexPath = args[1];
    String dictionaryPath = args[2];
    String documentFrequencyPath = args[3];
    String newDataPath = args[4];
    Configuration configuration = new Configuration();
    // model is a matrix (wordId, labelId) => probability score
    NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);
    StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
    // labels is a map classId => label
    Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
    Map<String, Integer> dictionary = readDictionary(configuration, new Path(dictionaryPath));
    Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));

The second part of the main method is used to extract words from the e-mail:

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    int labelCount = labels.size();
    int documentCount = documentFrequency.get(-1).intValue();
    System.out.println("Number of labels: " + labelCount);
    System.out.println("Number of documents in training set: " + documentCount);
    BufferedReader reader = new BufferedReader(new FileReader(newDataPath));
    while (true) {
        String line = reader.readLine();
        if (line == null) {
            break;
        }
        ConcurrentHashMultiset<Object> words = ConcurrentHashMultiset.create();
        // extract words from mail
        TokenStream ts = analyzer.tokenStream("text", new StringReader(line));
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        int wordCount = 0;
        while (ts.incrementToken()) {
            if (termAtt.length() > 0) {
                String word = ts.getAttribute(CharTermAttribute.class).toString();
                Integer wordId = dictionary.get(word);
                // if the word is not in the dictionary, skip it
                if (wordId != null) {
                    words.add(word);
                    wordCount++;
                }
            }
        }
        ts.close();

The third part of the main method is used to create a vector that maps each word ID to its tf-idf weight:

        Vector vector = new RandomAccessSparseVector(10000);
        TFIDF tfidf = new TFIDF();
        for (Multiset.Entry entry : words.entrySet()) {
            String word = (String) entry.getElement();
            int count = entry.getCount();
            Integer wordId = dictionary.get(word);
            Long freq = documentFrequency.get(wordId);
            double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
            vector.setQuick(wordId, tfIdfValue);
        }

In the fourth part of the main method, we use the classifier to get the score for each label and assign the e-mail to the label with the highest score:

        Vector resultVector = classifier.classifyFull(vector);
        double bestScore = -Double.MAX_VALUE;
        int bestCategoryId = -1;
        for (int i = 0; i < resultVector.size(); i++) {
            Element e1 = resultVector.getElement(i);
            int categoryId = e1.index();
            double score = e1.get();
            if (score > bestScore) {
                bestScore = score;
                bestCategoryId = categoryId;
            }
            System.out.print(" " + labels.get(categoryId) + ": " + score);
        }
        System.out.println(" => " + labels.get(bestCategoryId));
    } // end of the loop over input lines
    reader.close(); // release the input file
}

Now, put all this code in one class and create the .jar file of this class. We will use this .jar file to test our new e-mails.
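The tf-idf weight computed in the third part of the program can be illustrated in isolation. The following is a minimal sketch that uses the plain textbook formula, the term count multiplied by log(N / df). This is an assumption made for illustration: Mahout's TFIDF class may scale the terms differently, and the TfIdfSketch class is a hypothetical name.

```java
// Textbook tf-idf for illustration; the weights produced by Mahout may differ.
class TfIdfSketch {
    // termCount: occurrences of the word in this document
    // docFreq:   number of training documents that contain the word
    // numDocs:   total number of training documents
    static double tfIdf(int termCount, int docFreq, int numDocs) {
        if (docFreq <= 0 || numDocs <= 0) {
            throw new IllegalArgumentException("docFreq and numDocs must be positive");
        }
        return termCount * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A word found in every document carries no weight.
        System.out.println(tfIdf(3, 10, 10));
        // A rare word that appears twice in this document gets a high weight.
        System.out.println(tfIdf(2, 1, 100));
    }
}
```

This is why words that occur in almost every e-mail contribute little to the score, while rare words contribute more.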


Testing the program

To test the program, perform the following steps:

1. Create a folder named assassinmodeltest in the local directory, as follows:

mkdir /tmp/assassinmodeltest

2. To use this model, get the following files from HDFS to /tmp/assassinmodeltest:

For the earlier created model, use the following command:

hadoop fs -get /user/hue/prodmodel /tmp/assassinmodeltest

For the label index, use the following command:

hadoop fs -get /user/hue/prodlabelindex /tmp/assassinmodeltest

For df-count from the assassinvec folder (change the name of the part-00000 file to df-count), use the following command:

hadoop fs -get /user/hue/assassinvec/df-count /tmp/assassinmodeltest

For dictionary.file-0 from the same assassinvec folder, use the following command:

hadoop fs -get /user/hue/assassinvec/dictionary.file-0 /tmp/assassinmodeltest

3. Under /tmp/assassinmodeltest, create a file with the message shown in the following screenshot:

4. Now, run the program using the following command:

java -cp /tmp/assassinmodeltest/spamclassifier.jar:/usr/lib/mahout/* \
  com.packt.spamfilter.TestClassifier /tmp/assassinmodeltest \
  /tmp/assassinmodeltest/prodlabelindex \
  /tmp/assassinmodeltest/dictionary.file-0 \
  /tmp/assassinmodeltest/df-count /tmp/assassinmodeltest/testemail

5. Now, update the test e-mail file with the message shown in the following screenshot:

6. Run the program again using the same command as given in step 4 and view the result as follows:

Now, we have a program ready that can use our classifier model and predict unknown items. Let’s move on to our second use case.


Second use case as an exercise

As discussed at the start of this chapter, we will now work on a second use case, where we will predict the category of a new e-mail.


The ASF e-mail dataset

The Apache Software Foundation e-mail dataset is partitioned by project. This e-mail dataset can be found at http://aws.amazon.com/datasets/7791434387204566.

A smaller dataset can be found at http://files.grantingersoll.com/ibm.tar.gz. (Refer to http://lucidworks.com/blog/scaling-mahout/.) Use this data to perform the following steps:

1. Move this data to the folder of your choice (/tmp/asfmail) and unpack it:

mkdir /tmp/asfmail
tar -xvf ibm.tar.gz

2. Move the dataset to HDFS:

hadoop fs -put /tmp/asfmail/ibm/content /user/hue/asfmail

3. Convert the mbox files into Hadoop’s SequenceFile format using Mahout’s SequenceFilesFromMailArchives as follows:

mahout org.apache.mahout.text.SequenceFilesFromMailArchives --charset "UTF-8" \
  --body --subject --input /user/hue/asfmail/content --output /user/hue/asfmailout


4. Convert the sequence file into sparse vectors:

mahout seq2sparse --input /user/hue/asfmailout --output /user/hue/asfmailseqsp \
  --norm 2 --weight TFIDF --namedVector --maxDFPercent 90 --minSupport 2 \
  --analyzerName org.apache.mahout.text.MailArchivesClusteringAnalyzer

5. Modify the labels:

mahout org.apache.mahout.classifier.email.PrepEmailDriver \
  --input /user/hue/asfmailseqsp --output /user/hue/asfmailseqsplabel \
  --maxItemsPerLabel 1000

Now, the next three steps are similar to the ones we performed earlier:

1. Split the dataset into training and test datasets using the following command:

mahout split --input /user/hue/asfmailseqsplabel \
  --trainingOutput /user/hue/asfmailtrain --testOutput /user/hue/asfmailtest \
  --randomSelectionPct 20 --overwrite --sequenceFiles

2. Train the model using the training dataset as follows:

mahout trainnb -i /user/hue/asfmailtrain -o /user/hue/asfmailmodel \
  --extractLabels --labelIndex /user/hue/asfmaillabels

3. Test the model using the test dataset:

mahout testnb -i /user/hue/asfmailtest -m /user/hue/asfmailmodel \
  --labelIndex /user/hue/asfmaillabels

As you may have noticed, these steps closely follow the ones we performed earlier. I leave this topic as an exercise for you to create your own classifier system using this model. You can use the hints provided for the spam filter classifier. We now move our discussion to tuning our classifier. Let’s take a brief overview of the best practices in this area.


ClassifierstuningWealreadydiscussedclassifiers’evaluationtechniquesinChapter1,ClassificationinDataAnalysis.Justasareminder,weevaluateourmodelusingtechniquessuchasconfusionmatrix,entropymatrix,areaundercurve,andsoon.

Fromtheexplanatoryvariables,wecreatethefeaturevector.Tocheckhowaparticularmodelisworking,thesefeaturevectorsneedtobeinvestigated.InMahout,thereisaclassavailableforthis,ModelDissector.Ittakesthefollowingthreeinputs:

Features:Thisclasstakesafeaturevectortouse(destructively)TraceDictionary:ThisclasstakesatracedictionarycontainingvariablesandthelocationsinthefeaturevectorthatareaffectedbythemLearner:Thisclasstakesthemodelthatweareprobingtofindweightsonfeatures

ModelDissector tweaks the feature vector and observes how the model output changes. By averaging the results over a number of examples, we can determine the effect of different explanatory variables.
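The tweak-and-observe idea can be illustrated without Mahout. The following Java sketch is not ModelDissector's implementation or API; it is a simplified stand-in that zeroes one feature at a time, measures how far the score of a hypothetical logistic model moves, and averages that change over the examples. All names here (FeatureProbe, meanImpact, logisticScore) and the sample weights are assumptions made for the illustration.

```java
import java.util.function.ToDoubleFunction;

public class FeatureProbe {

    /**
     * For every feature index, sets that feature to zero in each example,
     * records the absolute change in the model score, and returns the mean
     * change per feature across all examples.
     */
    public static double[] meanImpact(ToDoubleFunction<double[]> model, double[][] examples) {
        int numFeatures = examples[0].length;
        double[] impact = new double[numFeatures];
        for (double[] example : examples) {
            double baseline = model.applyAsDouble(example);
            for (int j = 0; j < numFeatures; j++) {
                double[] tweaked = example.clone(); // work on a copy, keep the input intact
                tweaked[j] = 0.0;
                impact[j] += Math.abs(model.applyAsDouble(tweaked) - baseline);
            }
        }
        for (int j = 0; j < numFeatures; j++) {
            impact[j] /= examples.length;
        }
        return impact;
    }

    /** Hypothetical model to probe: a logistic function over fixed weights. */
    public static double logisticScore(double[] weights, double[] x) {
        double z = 0.0;
        for (int j = 0; j < weights.length; j++) {
            z += weights[j] * x[j];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        double[] weights = {2.0, 0.0, -0.5};
        double[][] examples = {{1, 1, 1}, {0.5, 2, 1}, {1, 0, 2}};
        double[] impact = meanImpact(x -> logisticScore(weights, x), examples);
        for (int j = 0; j < impact.length; j++) {
            System.out.println("feature " + j + " mean impact = " + impact[j]);
        }
    }
}
```

In this toy setup the feature with weight 0.0 shows no impact at all, while the feature with the largest weight moves the score the most, which is the kind of ranking one would look for when checking whether a model leans on the variables it is expected to.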

ModelDissector has a summary method, which returns the most important features with their weights, most important category, and the top few categories that they affect.

The output of ModelDissector is helpful in troubleshooting problems in a wrongly created model.

More details for the code can be found at https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/sgd/ModelDissector.java

While improving the output of the classifier, one should take care with two commonly occurring problems: target leak, and broken feature extraction.

If the model is showing results that are too good to be true or an output beyond expectations, we could have a problem with target leak. This error occurs when information from the target variable is included in the explanatory variables that are used to train the classifier. In this instance, the classifier will work too well for the test dataset.
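One simple way to hunt for a target leak is to look for a single explanatory variable that predicts the target almost perfectly by itself. The Java sketch below does this for 0/1 features and 0/1 labels. It is a hypothetical screening helper, not a Mahout facility; the names LeakCheck, agreement, and suspiciousColumns, the sample data, and the 0.95 threshold are all assumptions chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LeakCheck {

    /**
     * Fraction of rows in which the 0/1 feature in the given column matches
     * the 0/1 label. An inverted copy of the label is just as suspicious, so
     * the larger of "matches" and "mismatches" is returned.
     */
    public static double agreement(int[][] rows, int[] labels, int column) {
        int same = 0;
        for (int i = 0; i < rows.length; i++) {
            if (rows[i][column] == labels[i]) {
                same++;
            }
        }
        double fraction = (double) same / rows.length;
        return Math.max(fraction, 1.0 - fraction);
    }

    /** Returns the indexes of columns whose agreement reaches the threshold. */
    public static List<Integer> suspiciousColumns(int[][] rows, int[] labels, double threshold) {
        List<Integer> suspects = new ArrayList<>();
        for (int column = 0; column < rows[0].length; column++) {
            if (agreement(rows, labels, column) >= threshold) {
                suspects.add(column);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Made-up data: column 2 is an accidental copy of the label.
        int[] labels = {1, 0, 1, 0, 1, 0};
        int[][] rows = {
            {1, 0, 1},
            {1, 1, 0},
            {0, 0, 1},
            {0, 1, 0},
            {1, 1, 1},
            {0, 0, 0}
        };
        System.out.println("suspicious columns: " + suspiciousColumns(rows, labels, 0.95));
    }
}
```

A flagged column is only a hint: a genuinely strong predictor can also score highly, so each suspect still needs to be traced back to how it was produced.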

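On the other hand, broken feature extraction occurs when the feature extraction step itself is faulty, so the feature vectors no longer represent the input data correctly. Such a classifier shows the opposite behavior from one with a target leak: the model gives poorer results than expected.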

To tune the classifier, we can add new explanatory variables, apply transformations to existing explanatory variables, or eliminate some of the variables. We should also try different learning algorithms to create the model and choose one that does well on performance, training time, and speed.

More details on tuning can be found in Chapter 16, Deploying a classifier, in the book Mahout in Action.


Summary

In this chapter, we discussed creating our own production-ready classifier model. We took up two use cases: an e-mail spam filter, and a classifier that assigns e-mails to projects. We used the Apache SpamAssassin dataset for the spam filter and the ASF e-mail dataset for the project classifier.

We also saw how to increase the performance of your model.

So you are now ready to implement classifiers using Apache Mahout for your own real-world use cases. Happy learning!


Index

A

algorithms, classification
    Logistic regression / Classification algorithms
    Stochastic Gradient Descent (SGD) / Classification algorithms
    Naïve Bayes classification / Classification algorithms
    Hidden Markov Model (HMM) / Classification algorithms
    random forest / Classification algorithms
    Multi-layer perceptron (MLP) / Classification algorithms
Apache SpamAssassin project / Spam e-mail dataset
Apache Spark
    about / Apache Spark
    Spark SQL / Apache Spark
    MLib / Apache Spark
    GraphX / Apache Spark
    Spark streaming / Apache Spark
ASF e-mail dataset
    about / The ASF e-mail dataset
    URL / The ASF e-mail dataset
Assassin dataset
    used, for creating model / Creating the model using the Assassin dataset
AUC (area under the ROC curve) / Area under the ROC curve
axons / Neural network and neurons


B

backpropagation / Multilayer Perceptron
Bag of words / Understanding the terms used in text classification
BaumWelchTrainer class / Using Mahout for the Hidden Markov Model
Bayes rule
    about / Introducing conditional probability and the Bayes rule
binding stack
    URL / Mahout Scala and Spark bindings


C

Chi-squared Automatic Interaction Detector (CHAID) / Decision tree
classification
    about / Introducing the classification, Introducing Apache Mahout
    application / Application of the classification system
    system, working / Working of the classification system
    algorithms / Classification algorithms
Classification and Regression Tree (CART) / Decision tree
classifier
    training dataset / Working of the classification system
    test dataset / Working of the classification system
    model / Working of the classification system
    building / Working of the classification system
classifier model
    using, program for / Program to use a classifier model
classifiers
    tuning / Classifiers tuning
clustering
    about / Introducing Apache Mahout
conditional probability
    about / Introducing conditional probability and the Bayes rule
confusion matrix
    about / The confusion matrix
    Accuracy / The confusion matrix
    Precision or positive predictive value / The confusion matrix
    Negative predictive value / The confusion matrix
    Sensitivity/true positive rate/recall / The confusion matrix
    Specificity / The confusion matrix
    F1 score / The confusion matrix
cost function, linear regression
    about / Cost function


D

DARPA ‘98 / Using Mahout for Random forest
data analysis
    classification / Introducing the classification
decision tree
    about / Decision tree
dendrites / Neural network and neurons
dependent variable / Logistic regression
deterministic patterns / Deterministic and nondeterministic patterns
development environment
    setting up, Eclipse used / Setting up a development environment using Eclipse
dimensional reduction
    about / Introducing Apache Mahout
Domain Specific Language (DSL) / Mahout new changes


E

Eclipse
    used, for building development environment / Setting up a development environment using Eclipse
emission matrix, HMM / Introducing the Hidden Markov Model
Entropy matrix
    about / The entropy matrix
explanatory variable / Logistic regression
explanatory variables
    about / Working of the classification system


G

gradient descent
    about / Gradient descent
    sigmoid function / Logistic regression
    logistic function / Logistic regression
GraphX / Apache Spark


H

H2O platform
    integration / H2O platform integration
    URL / H2O platform integration
Hadoop
    URL / Introducing Apache Mahout, Installing Mahout
hidden layer, MLP network / Multilayer Perceptron
hidden layers, MLP network / Multilayer Perceptron
Hidden Markov Model (HMM) / Classification algorithms
hidden states, HMM / Introducing the Hidden Markov Model
HMM
    about / Introducing the Hidden Markov Model
    properties / Introducing the Hidden Markov Model
    state vector / Introducing the Hidden Markov Model
    transition matrix / Introducing the Hidden Markov Model
    emission matrix / Introducing the Hidden Markov Model
    hidden states / Introducing the Hidden Markov Model
    observable state / Introducing the Hidden Markov Model
    Mahout used / Using Mahout for the Hidden Markov Model
    Model class / Using Mahout for the Hidden Markov Model
    HmmTrainer class / Using Mahout for the Hidden Markov Model
    HmmEvaluator class / Using Mahout for the Hidden Markov Model
    HmmAlgorithms class / Using Mahout for the Hidden Markov Model
    HmmUtils class / Using Mahout for the Hidden Markov Model
    RandomSequencerGenerator / Using Mahout for the Hidden Markov Model
    BaumWelchTrainer class / Using Mahout for the Hidden Markov Model
    ViterbiEvaluator class / Using Mahout for the Hidden Markov Model
    input command / Using Mahout for the Hidden Markov Model
    output command / Using Mahout for the Hidden Markov Model
    model command / Using Mahout for the Hidden Markov Model
    likelihood command / Using Mahout for the Hidden Markov Model
HMM, issues
    evaluation / Introducing the Hidden Markov Model
    decoding / Introducing the Hidden Markov Model
    learning / Introducing the Hidden Markov Model
HmmAlgorithms class / Using Mahout for the Hidden Markov Model
HmmEvaluator class / Using Mahout for the Hidden Markov Model
HMMModel class / Using Mahout for the Hidden Markov Model
HmmTrainer class / Using Mahout for the Hidden Markov Model
HmmUtils class / Using Mahout for the Hidden Markov Model
Hortonworks Sandbox
    URL / Setting up Mahout for a Windows user


I

Initial state vector, Markov process / The Markov process
independent variable / Logistic regression
input layer, MLP network / Multilayer Perceptron
iris dataset
    URL / Using Mahout for MLP
Iterative Dichotomiser 3 (ID3)
    URL / Decision tree


J

Java
    URL / Installing Mahout


L

labels
    about / Working of the classification system
Latent Dirichlet Allocation (LDA) / Introducing Apache Mahout
linear regression
    about / Understanding linear regression
    cost function / Cost function
    gradient descent / Gradient descent
logistic function / Logistic regression
logistic regression / Classification algorithms
    about / Logistic regression
    Mahout, using for / Using Mahout for logistic regression
    dataset / Using Mahout for logistic regression
    training and test data, preparing / Using Mahout for logistic regression
    model, training / Using Mahout for logistic regression
    trainlogistic / Using Mahout for logistic regression
    input / Using Mahout for logistic regression
    output / Using Mahout for logistic regression
    target / Using Mahout for logistic regression
    categories / Using Mahout for logistic regression
    predictors / Using Mahout for logistic regression
    types / Using Mahout for logistic regression
    features / Using Mahout for logistic regression
    passes / Using Mahout for logistic regression
    rate / Using Mahout for logistic regression
    runlogistic / Using Mahout for logistic regression
    model / Using Mahout for logistic regression
    auc / Using Mahout for logistic regression
    confusion / Using Mahout for logistic regression


M

M2Eclipse
    URL / Installing Maven
Mahout
    about / Introducing Apache Mahout
    use cases / Introducing Apache Mahout
    features / Reasons for Mahout being a good choice for classification
    installing / Installing Mahout
    prerequisites / Installing Mahout
    building from source, Maven used / Building Mahout from source using Maven
    Maven, installing / Installing Maven
    code, building / Building Mahout code
    distribution file, URL / Building Mahout code, Setting up a development environment using Eclipse
    setting up, for Windows user / Setting up Mahout for a Windows user
    used, for logistic regression / Using Mahout for logistic regression
    Naïve Bayes algorithm / Using the Naïve Bayes algorithm in Apache Mahout
    using, for HMM / Using Mahout for the Hidden Markov Model
    using, for Random forest algorithm / Using Mahout for Random forest
    Random forest algorithm, implementing / Steps to use the Random forest algorithm in Mahout
    MLP, implementing / MLP implementation in Mahout
    using, for MLP / Using Mahout for MLP
    MLP algorithm, using / Steps to use the MLP algorithm in Mahout
    updations / Mahout new changes
    Scala bindings / Mahout Scala and Spark bindings
    Spark bindings / Mahout Scala and Spark bindings
    Spark shell, using / Using Mahout’s Spark shell
    H2O platform, integration / H2O platform integration
Mahout, algorithms
    about / Algorithms supported in Mahout
    sequential algorithms / Algorithms supported in Mahout
    parallel algorithms / Algorithms supported in Mahout
Mahout, use cases
    recommendation / Introducing Apache Mahout
    classification / Introducing Apache Mahout
    clustering / Introducing Apache Mahout
    dimensional reduction / Introducing Apache Mahout
    topic modeling / Introducing Apache Mahout
Mahout Scala bindings
    about / Mahout Scala and Spark bindings
Mahout Spark bindings
    about / Mahout Scala and Spark bindings


Markov process
    about / The Markov process
    states / The Markov process
    transition matrix / The Markov process
    Transition matrix / The Markov process
    Initial state vector / The Markov process
Maven
    used, for building Mahout from source / Building Mahout from source using Maven
    installing / Installing Maven
    URL / Installing Maven
MLib / Apache Spark
MLP
    implementing, in Mahout / MLP implementation in Mahout
    Mahout used / Using Mahout for MLP
    iris dataset / Using Mahout for MLP
MLP algorithm
    using, in Mahout / Steps to use the MLP algorithm in Mahout
MLP network
    about / Multilayer Perceptron
    hidden layers / Multilayer Perceptron
    backpropagation / Multilayer Perceptron
    zero hidden layers / Multilayer Perceptron
    input layer / Multilayer Perceptron
    output layer / Multilayer Perceptron
    hidden layer / Multilayer Perceptron
    number of neurons or hidden units / Multilayer Perceptron
model
    creating, Assassin dataset used / Creating the model using the Assassin dataset
    classifier model, program for using / Program to use a classifier model
model, evaluation
    confusion matrix / The confusion matrix
    Receiver Operating Characteristics (ROC) graph / The Receiver Operating Characteristics (ROC) graph
    area under the ROC curve (AUC) / Area under the ROC curve
    Entropy matrix / The entropy matrix
model, issues
    overfitting / Working of the classification system
    underfitting / Working of the classification system
ModelDissector
    Features class / Classifiers tuning
    TraceDictionary class / Classifiers tuning
    Learner class / Classifiers tuning
    about / Classifiers tuning


Multi-layer perceptron (MLP) / Classification algorithms


N

Naïve Bayes algorithm
    about / Understanding the Naïve Bayes algorithm
    in Apache Mahout / Using the Naïve Bayes algorithm in Apache Mahout
Naïve Bayes classification / Classification algorithms
neural network
    about / Neural network and neurons
neurons
    about / Neural network and neurons
    URL / Neural network and neurons
nondeterministic patterns / Deterministic and nondeterministic patterns
NSL-KDD dataset
    URL / Using Mahout for Random forest


O

observable state, HMM / Introducing the Hidden Markov Model
outlier detection
    about / Working of the classification system
output layer, MLP network / Multilayer Perceptron
overfitting, model
    issues / Working of the classification system


P

parallel algorithms / Algorithms supported in Mahout
program
    testing / Testing the program
pruning / Decision tree


R

random forest / Classification algorithms
Random forest algorithm
    about / Random forest
    Bias parameter / Random forest
    Variance parameter / Random forest
    Mahout used / Using Mahout for Random forest
    NSL-KDD dataset / Using Mahout for Random forest
    dataset / Using Mahout for Random forest
    implementing, in Mahout / Steps to use the Random forest algorithm in Mahout
RandomSequencerGenerator / Using Mahout for the Hidden Markov Model
Receiver Operating Characteristics (ROC) graph
    about / The Receiver Operating Characteristics (ROC) graph
regression
    about / Introducing regression
    linear regression / Understanding linear regression
regression intercept / Logistic regression


S

sequential algorithms / Algorithms supported in Mahout
sigmoid function / Logistic regression
softmax function
    URL / Neural network and neurons
spam e-mail dataset classifier
    about / Spam e-mail dataset
Spark
    URL / Using Mahout’s Spark shell
    binding, URL / Using Mahout’s Spark shell
Spark-item / Apache Spark
Spark-row / Apache Spark
Spark shell
    using / Using Mahout’s Spark shell
Spark SQL / Apache Spark
Spark streaming / Apache Spark
states, Markov process / The Markov process
state vector, HMM / Introducing the Hidden Markov Model
Stochastic Gradient Descent (SGD) / Classification algorithms
    about / Stochastic Gradient Descent


T

target variables
    about / Working of the classification system
Term frequency / Understanding the terms used in text classification
term frequency
    Stemming of words / Understanding the terms used in text classification
    Case normalization / Understanding the terms used in text classification
    Stop word removal / Understanding the terms used in text classification
    Inverse document frequency / Understanding the terms used in text classification
    Term frequency and inverse term frequency / Understanding the terms used in text classification
text classification
    about / Understanding the terms used in text classification
topic modeling
    about / Introducing Apache Mahout
transition matrix, HMM / Introducing the Hidden Markov Model
transition matrix, Markov process / The Markov process


U

underfitting, model
    issues / Working of the classification system


V

vectors
    about / Understanding the terms used in text classification
ViterbiEvaluator class / Using Mahout for the Hidden Markov Model


W

Windows
    user, Mahout setting up for / Setting up Mahout for a Windows user
Wisconsin Diagnostic Breast Cancer (WDBC) dataset
    URL / Using Mahout for logistic regression