Founding a Hadoop Lab EVERYTHING YOU ALWAYS WANTED TO KNOW, BUT WERE AFRAID TO ASK, ABOUT FINDING SUCCESS WITH HADOOP IN YOUR ORGANIZATION © UTILIS TECHNOLOGY LIMITED 2017

Founding a Hadoop Lab… · My Adventures in Hadoop Lead Hadoop adoption at three Canadian banks Established a successful Hadoop COE Advisory roles on Hadoop in finance My Career

Sep 23, 2020

dariahiddleston
Transcript
Founding a Hadoop Lab

EVERYTHING YOU ALWAYS WANTED TO KNOW, BUT WERE AFRAID TO ASK, ABOUT FINDING SUCCESS WITH HADOOP IN YOUR ORGANIZATION

© UTILIS TECHNOLOGY LIMITED 2017

A Short Introduction to Your Speaker

My Adventures in Hadoop
◦ Led Hadoop adoption at three Canadian banks
◦ Established a successful Hadoop COE
◦ Advisory roles on Hadoop in finance

My Career in Finance
◦ Four banks, one stock exchange, one pension fund
◦ Capital markets, retail banking, enterprise risk roles
◦ Founder of two IT departments
◦ Technology leader in Risk Systems for 15 years:
  ◦ Architect, Enterprise Risk Systems
  ◦ Architect, Front Office Risk Systems
  ◦ Program Manager, Portfolio Management Systems
  ◦ Head of Risk Systems
  ◦ Head of Hadoop COE

Agenda

What role will your Hadoop Lab play?
◦ Defining objectives, building a team and forming partnerships
◦ Foundational work to set a path to success

What is a reasonable budget?
◦ Calculating your “room” based on industry benchmarks
◦ Capacity planning, charge-out, and the central capital account

Real-life Lessons Learned
◦ Setting up infrastructure to take advantage of Hadoop’s unique properties
◦ Creating a practice that fits your users’ work styles

Projects that Succeed
◦ Ideas for a quick win to keep everyone motivated
◦ Medium risk projects aligned to current business problems

What role will your Hadoop Lab play?

“YOU CAN’T SHRINK YOUR WAY TO GREATNESS” - TOM PETERS

What role will your Hadoop Lab play?

Will your organization’s Hadoop Lab be a control function, or a thought leader?

Control functions
◦ Operational controls, compliance and auditing
◦ Budgeting
◦ Architecture gating
◦ Data governance

Thought leadership
◦ Design patterns and solution architecture
◦ Demonstration projects and proofs-of-concept
◦ Filling up the talent pool using training, workshops and user groups
◦ Educating on best practices and success stories to motivate adoption

Foundational Work

Invest in user-friendly operational management
◦ Design a simple multi-tenancy plan based on group membership
  ◦ Include share of execution queues, directory structures and cascading permissions
◦ Set up self-serve user on-boarding through your organization’s Help Desk
◦ Implement single sign-on for Kerberos-secured clusters

Manage expectations by monitoring performance
◦ Set service level objectives for both interactive and application uses
◦ Use “showback” reporting to monitor performance against objectives

Implement access control governance as a basic service
◦ Generate access control matrix audits centrally for all grid users
  ◦ Reporting from Ranger’s database works well and is easy to build
◦ Set policy and prepare reports for periodic attestation/user account reviews
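
A group-based multi-tenancy plan can be sketched in a few lines. The following is only an illustration: the group names, the relative weights, the `/data/<group>` directory layout and the `770` permission mode are all hypothetical, and a real plan would be applied through your scheduler and HDFS tooling rather than printed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tenant:
    """One tenant of the shared grid, derived from a directory-service group."""
    group: str          # hypothetical group name
    queue_share: float  # fraction of execution capacity given to the group's queue
    home_dir: str       # root directory owned by the group
    dir_mode: str       # permission mode meant to cascade to sub-directories


def build_tenancy_plan(group_weights: Dict[str, int], root: str = "/data") -> List[Tenant]:
    """Turn relative group weights into queue shares, directories and permissions."""
    total = sum(group_weights.values())
    if total <= 0:
        raise ValueError("group weights must sum to a positive number")
    return [
        Tenant(
            group=group,
            queue_share=weight / total,
            home_dir=f"{root}/{group}",
            dir_mode="770",  # assumption: full access for group members, none for others
        )
        for group, weight in sorted(group_weights.items())
    ]


# Hypothetical groups and weights, for illustration only.
plan = build_tenancy_plan({"risk": 3, "marketing": 1, "finance": 1})
for tenant in plan:
    print(f"{tenant.group}: {tenant.queue_share:.0%} of queues, {tenant.home_dir} ({tenant.dir_mode})")
```

Driving everything from one group-to-weight mapping keeps on-boarding simple: adding a user to a group is the only step needed to grant queue, directory and permission access together.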

Maximizing Exposure to Change

Hadoop is an exceptionally fast-moving technology, and so needs a different approach
◦ Maximize your ability to deploy the changes in the Hadoop platform
  ◦ Invest in continuous integration and automated regression testing for your development teams
  ◦ Establish a better-than-quarterly release cycle
  ◦ Publish a checklist of acceptable open source licenses (or blacklist of prohibited ones)
◦ Encourage use of Hadoop as an application container
◦ Set up lab environments

Discourage practices that prevent your organization from keeping pace
◦ Avoid encapsulating Hadoop with frameworks or wrapping Hadoop inside applications
◦ Avoid proprietary add-ons – they don’t get as much collaboration in the open source community
◦ Prohibit equipment “carve outs” from your shared grid
  ◦ Include the cost of additional equipment in the business case, co-locate, and charge out accordingly
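
The license checklist is easy to automate as part of continuous integration. The sketch below assumes a hypothetical allow-list and hypothetical package names; which licenses are acceptable is a policy decision for your organization, not something this example settles.

```python
# Hypothetical allow-list of SPDX license identifiers; the real list is a policy decision.
ACCEPTABLE_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause"}


def review_licenses(dependencies):
    """Split a {package: license} mapping into approved and flagged package names."""
    approved, flagged = [], []
    for package, license_id in sorted(dependencies.items()):
        if license_id in ACCEPTABLE_LICENSES:
            approved.append(package)
        else:
            flagged.append(package)  # not on the list: needs a manual review
    return approved, flagged


# Package names below are invented for illustration.
approved, flagged = review_licenses({
    "example-ingest-lib": "Apache-2.0",
    "example-parser": "MIT",
    "example-scheduler": "AGPL-3.0-only",
})
print("approved:", approved)
print("needs review:", flagged)
```

Failing the build when `flagged` is non-empty turns the published checklist into a gate that does not slow down a better-than-quarterly release cycle.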

Building a Team

Data Engineers are the key to the successful adoption of a data lake
◦ Data engineers are a hybrid of intermediate developer and junior data scientist
◦ Good data engineering accelerates data science, and the ability to deploy data science to production

Other roles to consider
◦ A few versatile senior developers to give you the ability to execute POCs
◦ Data Librarian to manage the metadata catalogue and documentation
◦ Data Steward to manage the data governance process

Keep a few consultants on speed dial
◦ Hadoop security experts – preferably from an audit-capable firm
◦ Compliance and fair usage experts – particularly for external data from the web and social media

Fund the Hadoop and Linux administrators, but leave them in the infrastructure team
◦ They need the administrative access that these teams are allowed

Your New Best Friends

Give all of your stakeholders a chance to participate, by forming a working group
◦ Exposure to business stakeholders is particularly valuable for technology teams

Enlist the Capital Markets infrastructure team to build and manage the Hadoop grid
◦ It is worth solving the accounting problems to get their expertise

Co-opt your existing data hub’s team to operate your new Data Lake’s processes
◦ BCBS-239 projects have provided an excellent opportunity to do this

Adopting a secondary SQL on Hadoop solution helps to transfer skills as well as code
◦ IBM DB2 is available for Hadoop – great way to move over a bank’s data warehouse to the Lab
◦ Other ANSI-compliant solutions include HAWQ, Vertica, Polybase*

What is a reasonable budget?

“PRICE IS WHAT YOU PAY. VALUE IS WHAT YOU GET.” - WARREN BUFFETT

Understanding the Customers

Before setting a budget, decide who you’re going to charge for your Hadoop Lab
◦ Data producers will see Hadoop as a cost-reduction opportunity
  ◦ Most front-end systems have dozens of outbound feeds that they have to support and maintain – offer them the chance to drop off a single comprehensive feed to Hadoop so that consumers can build and manage their own outbound feeds
  ◦ Consuming systems also have support teams managing inbound feeds, so they won’t see a significant change in support costs
◦ Data consumers will see Hadoop as improving their capabilities
  ◦ Traditional data supply chain is very long: source system feeds an EDW, which feeds a data mart accessed by data scientists
  ◦ Asking for “one more field” requires source to send it, EDW to model and document it, data mart to provision it, and then finally a data scientist gets to consume it
  ◦ Giving data scientists access to the raw data makes them more efficient – even though less effort goes into providing the data!

Align the funding model to the benefits realized by the participants:
◦ One-time costs to on-board new data should come from the producer of the data
◦ On-going operating costs for the Hadoop grid should be shared by the consumers of grid services

Setting a Budget for a Hadoop Lab

Annual cost of Hadoop is widely quoted as US$1,000/TB
◦ This compares favorably to US$5K for a SAN, and US$12K for a traditional database
◦ Cost based on “balanced” reference configurations – “compute” is more, “storage” is less

Use this well-known industry benchmark to set your budget
◦ Fully loaded costs for a bank-sized Hadoop grid in a bank data centre are around US$550/TB per year
  ◦ Capital charges for infrastructure costs, including servers and dedicated network switching, are amortized over three years
  ◦ Premises costs for data centre include bare racks, power and network backbone
  ◦ On-going support subscriptions for operating systems and Hadoop, and next-day hardware replacement included
◦ This creates around US$450/TB per year of budget room for your Hadoop Lab to claim
  ◦ A typical bank-sized Hadoop grid is 2-4 PB, which yields a Lab budget of US$1MM-$2MM per year
  ◦ This budget funds a staff of 10-20 based on typical budgeting numbers of US$100K/FTE per year
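
The budget arithmetic above can be reproduced with a short calculation. The per-TB figures and the cost per FTE come from the slide; the conversion of 1 PB to 1,000 TB is an assumption. The exact results come out slightly below the rounded US$1MM-$2MM and 10-20 staff quoted above.

```python
BENCHMARK_PER_TB = 1_000    # US$/TB per year, the widely quoted Hadoop benchmark
LOADED_COST_PER_TB = 550    # US$/TB per year, fully loaded grid cost
COST_PER_FTE = 100_000      # US$ per FTE per year
TB_PER_PB = 1_000           # assumption: decimal petabytes


def lab_budget(grid_size_pb):
    """Return (budget room per TB, annual Lab budget, fundable FTEs) for a grid size in PB."""
    room_per_tb = BENCHMARK_PER_TB - LOADED_COST_PER_TB
    budget = grid_size_pb * TB_PER_PB * room_per_tb
    return room_per_tb, budget, budget / COST_PER_FTE


for size_pb in (2, 4):
    room, budget, staff = lab_budget(size_pb)
    print(f"{size_pb} PB grid: US${room}/TB room, US${budget:,} per year, about {staff:.0f} FTEs")
```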

Financing Shared Hadoop Grids

Establish a usage-driven charge-out model for consumers of the service
◦ Charging based on a blend of CPU and storage consumption will balance compute and data uses
◦ Consider charging consumers by service quality if your service agreements permit
  ◦ Service quality can be designed into your multi-tenancy solution

Create a central capital account managed by the Hadoop Lab
◦ Pre-authorize incremental expansion of the data lake to stay within service objectives
◦ Amortization of capital account will smooth out charges to avoid penalizing early adopters
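
One way to blend CPU and storage consumption into a single charge is a weighted average of each tenant's share of the two resources. This is a minimal sketch, not a prescribed model: the tenant names, usage figures, total cost and the 50/50 weighting are all hypothetical.

```python
def charge_out(usage, total_cost, cpu_weight=0.5):
    """Allocate total_cost across tenants by a blend of CPU share and storage share."""
    total_cpu = sum(u["cpu_hours"] for u in usage.values())
    total_tb = sum(u["storage_tb"] for u in usage.values())
    if total_cpu <= 0 or total_tb <= 0:
        raise ValueError("total CPU and storage usage must be positive")
    charges = {}
    for tenant, u in usage.items():
        blended_share = (
            cpu_weight * u["cpu_hours"] / total_cpu
            + (1 - cpu_weight) * u["storage_tb"] / total_tb
        )
        charges[tenant] = round(total_cost * blended_share, 2)
    return charges


# Hypothetical tenants: one compute-heavy, one storage-heavy.
usage = {
    "risk": {"cpu_hours": 6_000, "storage_tb": 100},
    "marketing": {"cpu_hours": 2_000, "storage_tb": 300},
}
charges = charge_out(usage, total_cost=1_000_000)
print(charges)
```

With equal weights, the compute-heavy and the storage-heavy tenant in this example end up paying the same amount, which is the balancing effect the blended model aims for. Shifting `cpu_weight` toward 1.0 moves cost onto compute-heavy users.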

Creative Project Financing

Management loves to approve “self-funding projects”
◦ Use the cost differential of storage on Hadoop to fund intra-year work
  ◦ Migrate historical content from operating databases to Hadoop to save on database “tier one” SAN costs
  ◦ Capture grid compute outputs to Hadoop instead of NAS devices
  ◦ Storing database back-ups on Hadoop can be cheaper than tapes

Establish an internal “venture capital” fund in your Hadoop Lab
◦ Budget “seed money” to spend with the application maintenance teams
  ◦ Most applications have “lights on” funding insufficient to support the POCs needed to explore Hadoop adoption
  ◦ Set aside funding to pay for cross-team charges for participation in a POC
  ◦ Use the POCs to support project proposals based on cost reduction
◦ Staffing the Hadoop Lab with a small team of versatile developers completes this capability
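
The case for a self-funding project rests on a simple cost differential. The sketch below uses the benchmark figures quoted earlier in this deck (US$5K/TB per year for SAN, US$1K/TB per year for Hadoop, US$100K per FTE per year); the 50 TB migration size is a hypothetical example.

```python
SAN_COST_PER_TB = 5_000      # US$/TB per year, from the benchmark quoted in this deck
HADOOP_COST_PER_TB = 1_000   # US$/TB per year, from the same benchmark
COST_PER_FTE = 100_000       # US$ per FTE per year


def annual_saving(tb_migrated, from_cost_per_tb=SAN_COST_PER_TB, to_cost_per_tb=HADOOP_COST_PER_TB):
    """Yearly storage saving from moving tb_migrated terabytes onto Hadoop."""
    return tb_migrated * (from_cost_per_tb - to_cost_per_tb)


saving = annual_saving(50)          # hypothetical: 50 TB of historical content
fte_years = saving / COST_PER_FTE   # how much project work the saving could fund
print(f"US${saving:,} per year, enough for about {fte_years:.0f} FTE-years of work")
```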

Real-Life Lessons Learned

“NOTHING IS LESS PRODUCTIVE THAN TO MAKE MORE EFFICIENT WHAT SHOULD NOT BE DONE AT ALL” - PETER DRUCKER

Save Money by Letting it Break

It’s OK if a node breaks – in fact, it is better to have a dead Hadoop node than a wounded one

Educate your infrastructure team to prevent them from over-engineering your Hadoop grids
◦ HDFS implements a RAID strategy in software – use local disks instead of SAN for data nodes
◦ YARN is clever about parallelizing work – don’t use high-speed drives when cheap ones will do
◦ Don’t pay for “critical care” hardware support when next-day will be fine

Appliances and virtualization break the economics of Hadoop
◦ Equipment failure in an appliance is all-or-nothing
  ◦ Centralizing the Hadoop grid into one appliance increases the need for expensive fault tolerance
  ◦ Unit prices increase as a result – annual costs on appliances barely stay under the $1K/TB benchmark
◦ Your virtualization farm duplicates all of the fault-tolerance in Hadoop – and slows Hadoop down
  ◦ Vendor benchmarks show that virtualization is now almost as performant as bare-metal Hadoop grids
  ◦ Virtual servers are smaller and so you end up with more node-count-driven Hadoop costs

Networks Really Matter

The quality of the network is more important than the quality of the machines
◦ MapReduce “brings compute to the data,” but Hadoop still generates lots of internal network traffic
◦ Data hub and ETL offload patterns will generate a lot of traffic into and out of the grid
◦ Legacy tools – most notably SAS – will try to pull large data sets out of Hadoop across the network

Invest in top-of-rack switching or converged infrastructure
◦ Most data centres have 1Gb backbones connecting higher speed sub-networks
◦ Bonded 40Gb uplinks within the Hadoop grid and across racks are well worth the added cost

Spend the money and time to co-locate the consuming systems within the Hadoop sub-network
◦ This will mean a “re-racking” exercise for some appliances and existing servers
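
A back-of-the-envelope calculation shows why the backbone speed matters when a tool pulls a large data set out of the grid. This is a rough sketch under stated assumptions: link speeds are read as gigabits per second, terabytes are decimal, the 10 TB data set is hypothetical, and the 70% link efficiency is a guess rather than a measured figure.

```python
def transfer_hours(dataset_tb, link_gbps, efficiency=0.7):
    """Approximate hours needed to move dataset_tb terabytes over a link_gbps link."""
    bits = dataset_tb * 8e12                          # decimal terabytes to bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second / 3600


for gbps in (1, 40):
    hours = transfer_hours(10, gbps)                  # hypothetical 10 TB extract
    print(f"10 TB over a {gbps} Gb link: about {hours:.1f} hours")
```

Under these assumptions the same extract takes more than a day on a 1 Gb backbone and under an hour on a 40 Gb uplink, which is the argument for co-locating consuming systems inside the Hadoop sub-network.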

Differing Appetites for Change

Everyone’s first idea is to have one great, shared, co-operative data lake – and it doesn’t work!
◦ The more successful you are in on-boarding data producers, the greater the difficulty of updating the Data Lake’s Hadoop distribution – the incentive to “stand pat” grows
  ◦ Even worse if you’re using third-party tools for ingestion – it creates an external stakeholder which can block change!
◦ The more successful you are in on-boarding data consumers, the greater the demand to update the Data Lake’s Hadoop distribution – data scientists always want the most current, or next, version of everything

Separate the interactive users from the applications with a federated deployment model
◦ Put all of the applications onto a Hadoop grid which is updated very infrequently
  ◦ Static workloads also allow tight management of performance against service agreements
◦ Put all of the data scientists onto their own grid that updates with the Hadoop distribution
  ◦ Self-serve data provisioning to small grids in a cloud also works really well from the consumer’s view
◦ Make sure you have a great network so that moving data between the grids is painless

Hadoop is Not a Database

Projects that attempt to replace a database server with Hadoop usually fail
◦ Avoid transactional applications
◦ Do not replace the database tier in an N-tier application with Hadoop
  ◦ Think of Hadoop as a container instead, and re-architect the application to run inside Hadoop
◦ Do not use Hadoop to host highly normalized data warehouse models
  ◦ De-normalized data models are much more efficient on Hadoop
◦ Do not create abstraction layers using layered Hive views

The best design patterns for Hadoop are often misused
◦ “ETL Off-Load” often turns into Hadoop as an FTP drop zone
◦ “Bring Compute to Data” doesn’t mean using a data node to host an application server
◦ Map/Reduce should be run with MapReduce – not using Hive to call UDFs
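
De-normalizing simply means resolving the foreign keys once, when the data is written, so that every row can be scanned on its own. The sketch below illustrates the idea in plain Python with invented customer, account and transaction records; on a real grid the same flattening would be done in an ingestion job, and the field names here are hypothetical.

```python
# Hypothetical normalized source tables, keyed by their primary keys.
customers = {1: {"name": "A. Example", "segment": "retail"}}
accounts = {10: {"customer_id": 1, "product": "chequing"}}
transactions = [
    {"account_id": 10, "date": "2017-03-01", "amount": -25.00},
    {"account_id": 10, "date": "2017-03-02", "amount": 1200.00},
]


def denormalize(transactions, accounts, customers):
    """Build wide, self-contained rows by resolving account and customer lookups up front."""
    rows = []
    for txn in transactions:
        account = accounts[txn["account_id"]]
        customer = customers[account["customer_id"]]
        rows.append({
            **txn,                                  # keep the original transaction fields
            "product": account["product"],
            "customer_id": account["customer_id"],
            "customer_name": customer["name"],
            "segment": customer["segment"],
        })
    return rows


rows = denormalize(transactions, accounts, customers)
for row in rows:
    print(row)
```

Each wide row repeats the customer and account attributes, which costs storage but lets later queries filter and aggregate without joining tables at read time.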

Internal Data is More Difficult to Access

Think of your 360° view of a customer as being 180° of transactions and 180° of interactions

Data governance, compliance, and security will inhibit the use of the transactional data
◦ Internal data sources are also usually high-cost data sources to access

Interaction data – particularly web and social media – is surprisingly easy to access
◦ Social media data is actually considered “public,” and so is entirely ungoverned
  ◦ There is a wealth of open source social media ingestion and analysis tools available
◦ IVR systems are linked to customers and capture a significant amount of customer interaction
  ◦ Major IVR systems discard their operating data after 3-4 months rather than warehousing it
◦ Call Centre recordings are a wealth of internal sentiment data
  ◦ Open source speech-to-text and natural language processing tools are available in Python
◦ Website clicks and usage can be analyzed for price optimization and used for push marketing
  ◦ Most website usage is analyzed through vendors – but setting up an inbound feed is easy

Data Science is Unstructured Work

Data scientists don’t work the way IT expects them to
◦ Traditional data warehousing patterns are the data science anti-pattern
  ◦ Data scientists don’t know what their requirements are until they’ve done their work – their job is to experiment
  ◦ Data scientists hate prepared views because they don’t know what logic creates them
◦ Don’t waste (too much) time on central data quality – they’re just going to re-do it anyway
  ◦ “Correct” data is subjective by study, so there isn’t an answer to implement centrally
  ◦ Preparing a time series includes data quality suitable to data science – regardless of how good the starting data is
◦ Data scientists probably know the data better than the data modelers

Data Science Labs

Data scientists want to develop analytics using production data – which breaks lots of policies

Support the creation of a Data Science Lab environment
◦ Lead a “once and forever” platform security review that all Hadoop users can reference
◦ Implement data governance that facilitates “window shopping” for content – even when governance will initially prohibit using the content

Invest in advanced data masking
◦ Use it to prepare production data for the data science lab
◦ Advanced data masking retains the statistical properties of the underlying data

Buy a self-serve data provisioning tool
◦ Data scientists love to “shop” for data and love to “engineer” data using query-by-example tools
  ◦ The good tools turn the “shopping trip” into deployable code that you can package for deployment or automation easily
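
To make the idea of masking that “retains the statistical properties” concrete, here is a minimal sketch of one simple technique: identifiers are replaced with keyed hashes and a sensitive column is shuffled across records. It is far simpler than what commercial masking tools do and is not a complete privacy control. The record layout, the secret key and the seed are hypothetical, and shuffling preserves the distribution of one column only, not its relationship to other columns.

```python
import hashlib
import hmac
import random
import statistics


def mask_records(records, secret_key, seed=0):
    """Pseudonymize customer IDs and shuffle balances so no value stays with its owner."""
    rng = random.Random(seed)                  # fixed seed keeps the example repeatable
    balances = [r["balance"] for r in records]
    rng.shuffle(balances)                      # same set of values, different owners
    masked = []
    for record, balance in zip(records, balances):
        token = hmac.new(secret_key, record["customer_id"].encode(), hashlib.sha256)
        masked.append({"customer_id": token.hexdigest()[:12], "balance": balance})
    return masked


# Invented production-like records, for illustration only.
records = [
    {"customer_id": "C001", "balance": 1200.0},
    {"customer_id": "C002", "balance": 85.5},
    {"customer_id": "C003", "balance": 43000.0},
    {"customer_id": "C004", "balance": 0.0},
]
secret_key = b"hypothetical-secret"
masked = mask_records(records, secret_key)

# The column's summary statistics are unchanged by the shuffle.
print(statistics.mean(r["balance"] for r in records), statistics.mean(r["balance"] for r in masked))
```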

Projects that Succeed

“RISK COMES FROM NOT KNOWING WHAT YOU’RE DOING” - WARREN BUFFETT

Quick Wins

Finding a quick win or two will keep your organization motivated to adopt Hadoop

Massively parallel back-testing of StreamBase algorithms
◦ StreamBase is a real-time workflow platform widely used in program trading
◦ MapReduce can encapsulate StreamBase in order to run hundreds of copies in parallel

Targeting ads on social media
◦ Both Twitter and Facebook have very good APIs that you can quickly use to build a feed
◦ Python-based tools can be paired with some basic data science to find “life events”

Trend Analysis on Risk Data
◦ Simulation outputs from CVA, VAR, CCR, LRM are often discarded after one day due to their size
◦ Archiving on HDFS permits trend analysis at the trade level for diagnostics and capital planning

Mid-Sized Projects

Many current focus areas in finance lend themselves to achievable Hadoop projects

Volcker Rule
◦ Volcker Rule metrics require an enormous amount of data, which is expensive to store
◦ Retention is required for five years of calendar days
◦ Computations can be implemented in SQL and will run well in Hive

Customer 360
◦ Hadoop is a natural platform to consolidate interaction records with transactional data

Daily Liquidity Management
◦ Running the calculations before pooling facilitates drill-down and analysis
◦ Tableau on Hadoop works very well for daily dashboards

Thank You for Your Time