
1597 EDW Big Data Analytics Kimball

Jun 03, 2018


Harjeet Bakshi
  • 8/12/2019 1597 EDW Big Data Analytics Kimball

    1/33

The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics

A Kimball Group White Paper

By Ralph Kimball


Table of Contents

Executive Summary
About the Author
Introduction
Data is an asset on the balance sheet
Raising the curtain on big data analytics
Use cases for big data analytics
Making sense of big data analytic use cases
Big data analytics system requirements
Extended relational database management systems
MapReduce/Hadoop systems
How MapReduce works in Hadoop
Tools for the Hadoop environment
Feature convergence in the coming decade
Reusable analytics
Complex event processing (CEP)
Data warehouse cultural changes in the coming decade
Sandboxes
Low latency
Continuous thirst for more exquisite detail
Light touch data waits for its relevance to be exposed
Simple analysis of all the data trumps sophisticated analysis of some of the data
Data structures should be declared at query time, not at data load time
The EDW supporting big data analytics must be magnetic, agile, and deep
The conflict between abstraction and control
Data warehouse organization changes in the coming decade
Technical skill sets required
New organizations required
New development paradigms required
Lessons from the early data warehousing era
Analytics in the cloud
Whither EDW?
Acknowledgements
References


Executive Summary

In this white paper, we describe the rapidly evolving landscape for designing an enterprise data warehouse (EDW) to support business analytics in the era of "big data." We describe the scope and challenges of building and evolving a very stable and successful EDW architecture to meet new business requirements. These include extreme integration, semi- and un-structured data sources, petabytes of behavioral and image data accessed through MapReduce/Hadoop as well as massively parallel relational databases, and then structuring the EDW to support advanced analytics. This paper provides detailed guidance for designing and administering the necessary processes for deployment. This white paper has been written in response to a lack of specific guidance in the industry as to how the EDW needs to respond to the big data analytics challenge, and what necessary design elements are needed to support these new requirements.

About the Author

Ralph Kimball founded the Kimball Group. Since the mid-1980s, he has been the data warehouse/business intelligence (DW/BI) industry's thought leader on the dimensional approach and has trained more than 10,000 IT professionals. Prior to working at Metaphor and founding Red Brick Systems, Ralph co-invented the Star workstation at Xerox's Palo Alto Research Center (PARC). Ralph has his Ph.D. in Electrical Engineering from Stanford University. The Kimball Group is the source for dimensional DW/BI consulting and education, consistent with our best-selling Toolkit book series, Design Tips, and award-winning articles. Visit www.kimballgroup.com for more information.


Introduction

What is big data? Its bigness is actually not the most interesting characteristic. Big data is structured, semistructured, unstructured, and raw data in many different formats, in some cases looking totally different than the clean scalar numbers and text we have stored in our data warehouses for the last 30 years. Much big data cannot be analyzed with anything that looks like SQL. But most important, big data is a paradigm shift in how we think about data assets: where we collect them, how we analyze them, and how we monetize the insights from the analysis. The big data revolution is about finding new value within and outside conventional data sources. An additional approach is needed because the software and hardware environments of the past have not been able to capture, manage, or process the new forms of data within reasonable development times or processing times. We are challenged to reorganize our information management landscape to extend a remarkably stable and successful EDW architecture to this new era of big data analytics.

In reading this white paper please bear in mind that the consistent view of this author has always been that the "data warehouse" comprises the complete ecosystem for extracting, cleaning, integrating and delivering data to decision makers, and therefore includes the extract-transform-load (ETL) and business intelligence (BI) functions considered as outside of the data warehouse by more conservative writers. This author has always taken the view that data warehousing has a very comprehensive role in capturing all forms of enterprise data, and then preparing that data for the most effective use by decision-makers all across the enterprise. This white paper takes the aggressive view that the enterprise data warehouse is on the verge of a very exciting new set of responsibilities. The scope of the EDW will increase dramatically.

Also, in this white paper, although we consistently use the term ETL to describe the movement of data within the enterprise data warehouse, the conventional use of this term does not do justice to the much larger responsibility of moving data across networks and between systems and between profoundly different processes in the world of big data analytics. ETL is a portion of a much larger technology called data integration (DI). Since we have used ETL consistently in our books and classes for many years, we will keep that terminology in this paper, bearing in mind that ETL is meant in the larger sense of DI.

This white paper stands back from the marketplace as it exists in early 2011 to highlight the clearly emerging new trends brought by the big data revolution. And a revolution it is. As James Markarian, Informatica's Executive Vice President and Chief Technology Officer, remarked: "the database market has finally gotten interesting again." Because many of the new big data tools and approaches are version 1 or even version 0 developments, the landscape will continue to change rapidly. However, there is growing awareness in the marketplace that new kinds of analysis are possible and that key competitors, especially e-commerce enterprises, are already taking advantage of the new paradigm. This white paper is intended to be a guide to help business intelligence, data warehousing and information management professionals and management teams understand and prepare for big data as a complementary extension to their current EDW architecture.

Data is an asset on the balance sheet

Enterprises increasingly recognize that data itself is an asset that should appear on the balance sheet in the same way that traditional assets from the manufacturing age such as equipment and land have always appeared. There are several ways to determine the value of the data asset, including

cost to produce the data

cost to replace the data if it is lost

revenue or profit opportunity provided by the data

revenue or profit loss if data falls into competitors' hands

legal exposure from fines and lawsuits if data is exposed to the wrong parties

But more important than the data itself, enterprises have shown that insights from data can be monetized. When an e-commerce site detects an increase in favorable click throughs from an experimental ad treatment, that insight can be taken to the bottom line immediately. This direct cause-and-effect is easily understood by management, and an analytic research group that consistently demonstrates these insights is looked upon as a strategic resource for the enterprise by the highest levels of management. This growth in business awareness of the value of data-driven insights is rapidly spreading outward from the e-commerce world to virtually every business segment.

Data warehousing, of course, has been demonstrating the value of data-driven insights for at least 20 years. But until quite recently data warehousing has been focused on historical transaction data. During the past decade from 2000 to 2009, three major seismic shifts occurred in data warehousing. The first, early in the decade, was the decisive introduction of low latency operational data into the data warehouse together with the existing historical data. Of course, many of these new operational data use cases benefited from real-time data, in some cases demanding instantaneous delivery. The second seismic shift, growing throughout the decade, was the gathering of customer behavior data, which not only included traditional transactions such as purchases and click throughs but added huge volumes of "subtransactions" that represented measurable events leading up to the transactions themselves. For example, all the web page events a customer engaged in prior to the final transaction event became a record of customer behavior. "Good paths" through these web page event histories gave lots of insight into productive (i.e., monetizable) customer behavior.

The third seismic shift, which is gathering enormous momentum as we transition into the current decade, is the extraction of product preferences and customer sentiments from social media, especially the massive quantities of machine-generated unstructured data produced by the new business paradigms of dot-com companies. It is this final seismic shift that has pushed many enterprises into looking seriously at unstructured data for the first time, and asking "how on earth do we analyze this stuff?" The point here is not that unstructured data is some new thing recently discovered, but rather that the analysis of unstructured data has gone mainstream just recently.

Raising the curtain on big data analytics

Use cases for big data analytics

Big data analytics use cases are spreading like wildfire. Here is a set of use cases reported recently, including a benchmark set of "Hadoop-able" use cases proposed by Jeff Hammerbacher, Chief Scientist for Cloudera. Following these brief descriptions is a table summarizing the salient structure and processing characteristics of each use case. Note that none of these use cases can be satisfied with scalar numeric data, nor can any be properly analyzed by simple SQL statements. All of them can be scaled into the petabyte range and beyond with appropriate business assumptions.

Search ranking. All search engines attempt to rank the relevance of a web page to a search request against all other possible web pages. Google's PageRank algorithm is, of course, the poster child for this use case.

Ad tracking. E-commerce sites typically record an enormous river of data including every page event in every user session. This allows for very short turnaround of experiments in ad placement, color, size, wording, and other features. When an experiment shows that such a feature change in an ad results in improved click through behavior, the change can be implemented virtually in real time.

Location and proximity tracking. Many use cases add precise GPS location tracking, together with frequent updates, in operational applications, security analysis, navigation, and social media. Precise location tracking opens the door for an enormous ocean of data about other locations nearby the GPS measurement. These other locations may represent opportunities for sales or services.

Causal factor discovery. Point-of-sale data has long been able to show us when the sales of a product go sharply up or down. But searching for the causal factors that explain these deviations has been, at best, a guessing game or an art form. The answers may be found in competitive pricing data, competitive promotional data including print and television media, weather, holidays, national events including disasters, and virally spread opinions found in social media. See the next use case as well.

Social CRM. This use case is one of the hottest new areas for marketing analysis. The Altimeter Group has described a very useful set of key performance indicators for social CRM that include share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocacy impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact. The calculation of these KPIs involves in-depth trolling of a huge array of data sources, especially unstructured social media.


Document similarity testing. Two documents can be compared to derive a metric of similarity. There is a large body of academic research and tested algorithms, for example latent semantic analysis, that is just now finding its way to driving monetized insights of interest to big data practitioners. For example, a single source document can be used as a kind of multifaceted template to compare against a large set of target documents. This could be used for threat discovery, sentiment analysis, and opinion polls. For example: "find all the documents that agree with my source document on global warming."
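As a minimal sketch of using one source document as a template against a set of targets, the following Python ranks target documents by bag-of-words cosine similarity. This is a deliberately simpler stand-in for latent semantic analysis, not an implementation of it, and shared vocabulary is only a rough proxy for "agreement." The function names and sample texts are illustrative assumptions.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count word occurrences (bag of words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(source, targets):
    """Return (score, text) pairs, most similar target first."""
    src = term_vector(source)
    scored = [(cosine_similarity(src, term_vector(t)), t) for t in targets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    source = "global warming is raising sea levels"
    targets = [
        "the quarterly sales report",
        "sea levels and global warming",
    ]
    for score, text in rank_targets(source, targets):
        print(f"{score:.2f}  {text}")
```

At big data scale the same comparison would be run as a distributed job over the target set; the single-process loop here only shows the shape of the computation.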

Genomics analysis: e.g., commercial seed gene sequencing. A few months ago the cotton research community was thrilled by a genome sequencing announcement that stated in part "The sequence will serve a critical role as the reference for future assembly of the larger cotton crop genome. Cotton is the most important fiber crop worldwide and this sequence information will open the way for more rapid breeding for higher yield, better fiber quality and adaptation to environmental stresses and for insect and disease resistance. Scientist Ryan Rapp stressed the importance of involving the cotton research community in analyzing the sequence, identifying genes and gene families and determining the future directions of research." (SeedQuest, Sept 22, 2010). This use case is just one example of a whole industry that is being formed to address genomics analysis broadly, beyond this example of seed gene sequencing.

Discovery of customer cohort groups. Customer cohort groups are used by many enterprises to identify common demographic trends and behavior histories. We are all familiar with Amazon's cohort groups when they say other customers who bought the same book as you have also bought the following books. Of course, if you can sell your product or service to one member of a cohort group, then all the rest may be reasonable prospects. Cohort groups are represented logically and graphically as links, and much of the analysis of cohort groups involves specialized link analysis algorithms.
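The idea of representing cohort groups as links can be sketched with a small co-purchase graph. The Python below is a toy illustration under assumed data shapes (a dict from customer id to a set of purchased items); it is not a description of any retailer's actual algorithm, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_links(purchases):
    """Count, for every pair of items, how many customers bought both.

    `purchases` maps a customer id to the set of items that customer bought.
    The result is a symmetric mapping: links[a][b] == links[b][a].
    """
    links = defaultdict(lambda: defaultdict(int))
    for items in purchases.values():
        for a, b in combinations(sorted(items), 2):
            links[a][b] += 1
            links[b][a] += 1
    return links


def also_bought(links, item, top_n=3):
    """Items most often co-purchased with `item`, strongest link first."""
    neighbors = links.get(item, {})
    # Sort by descending link weight, then by name for a stable order.
    ranked = sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Hypothetical purchase histories.
    purchases = {
        "c1": {"A", "B", "C"},
        "c2": {"A", "B"},
        "c3": {"A", "D"},
    }
    print(also_bought(build_links(purchases), "A"))
```

Real link analysis adds weighting, normalization for popular items, and multi-hop traversal, but the underlying structure is the same graph of pairwise links.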

In-flight aircraft status. This use case as well as the following two use cases are made possible by the introduction of sensor technology everywhere. In the case of aircraft systems, in-flight status of hundreds of variables on engines, fuel systems, hydraulics, and electrical systems is measured and transmitted every few milliseconds. The value of this use case is not just the engineering telemetry data that could be analyzed at some future point in time, but also that it drives real-time adaptive control, fuel usage, part failure prediction, and pilot notification.

Smart utility meters. It didn't take long for utility companies to figure out that a smart meter can be used for more than just the monthly readout that produces the customer's utility bill. By drastically cranking up the frequency of the readouts to as much as one readout per second per meter across the entire customer landscape, many useful analyses can be performed including dynamic load-balancing, failure response, adaptive pricing, and longer-term strategies for incenting customers to utilize the utility more effectively (either from the customer's point of view or the utility's point of view!).


Data bag exploration. There are many situations in commercial environments and in the research communities where large volumes of raw data are collected. One example might be data collected about structure fires. Beyond the predictable dimensions of time, place, primary cause of fire, and responding firefighters, there may be a wealth of unpredictable anecdotal data that at best can be modeled as a disorderly collection of name value pairs, such as "contributing weather=lightning." Another example would be the listing of all relevant financial assets for a defendant in a lawsuit. Again such a list is likely to be a disorderly collection of name value pairs, such as "shared real estate ownership=condominium." The list of examples like this is endless. What they have in common is the need to encapsulate the disorderly collection of name value pairs, which is generally known as a "data bag." Complex data bags may contain both name value pairs as well as embedded sub data bags. The challenge in this use case is to find a common way to approach the analysis of data bags when the content of the data may need to be discovered after the data is loaded.
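One way to picture a data bag, offered only as a sketch, is a nested Python dict whose names are not known in advance. The code below discovers the names after the bags are "loaded" and counts how often each appears. Modeling a sub data bag as a nested dict, the dotted path notation, and the sample fire records are all assumptions made for illustration.

```python
def discover_names(bag, prefix=""):
    """Yield every name in a data bag, descending into embedded sub-bags.

    A bag is modeled here as a dict; a value that is itself a dict is
    treated as an embedded sub data bag and reported with a dotted path.
    """
    for name, value in bag.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            yield from discover_names(value, prefix=f"{path}.")
        else:
            yield path


def name_frequencies(bags):
    """Count how many bags contain each discovered name."""
    counts = {}
    for bag in bags:
        for path in set(discover_names(bag)):
            counts[path] = counts.get(path, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical structure fire records with unpredictable content.
    fires = [
        {"contributing weather": "lightning",
         "structure": {"roof": "wood shake"}},
        {"contributing weather": "high wind"},
    ]
    print(name_frequencies(fires))
```

The point of the sketch is the order of operations: the data is stored first, and its structure is worked out afterward by inspecting what actually arrived.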

The final two use cases are old and venerable examples that even predate data warehousing itself. But new life has been breathed into these use cases because of the exciting potential of ultra-atomic customer behavior data.

Loan risk analysis and insurance policy underwriting. In order to evaluate the risk of a prospective loan or a prospective insurance policy, many data sources can be brought into play, ranging from payment histories and detailed credit behavior to employment data and financial asset disclosures. In some cases the collateral for a loan or the insured item may be accompanied by image data.

Customer churn analysis. Enterprises concerned with churn want to understand the predictive factors leading up to the loss of a customer, including that customer's detailed behavior as well as many external factors including the economy, life stage and other demographics of the customer, and finally real time competitive issues.

Making sense of big data analytic use cases

Certainly the purpose of developing this list of use cases is to convince the reader that the use cases come in all shapes and sizes and formats, and require many specialized approaches to analyze. Up until very recently all these use cases existed as separate endeavors, often involving special purpose built systems. But the industry awareness of the "big data analytics challenge" is motivating everyone to look for the architectural similarities and differences across all these use cases. Any given enterprise is increasingly likely to encounter one or more of these use cases. That realization is driving the interest in system architectures that address the big data analytics problem in a general way. Please study the following table.


The sheer density of this table makes it clear that systems to support big data analytics have to look very different than the classic relational database systems from the 1980s and 1990s. The original RDBMSs were not built to handle any of the requirements represented as columns in this table!

Table columns: Vector, matrix, or complex structure | Free text | Image or binary data | Data bags | Iterative logic or complex branching | Advanced analytic routines | Rapidly repeated measurements | Extreme low latency | Access to all data required

Search ranking X X X X X X

Ad tracking X X X X X X X X

Location & proximity X X X X X

Causal discovery X X X X X X X

Social CRM X X X X X X X X

Document similarity X X X X X X X

Genomic analysis X X X X X

Cohort groups X X X X X X

In-flight engine status X X X X X X

Smart utility meters X X X X X X

Building sensors X X X X X X X X

Satellite images X X X X

CAT scans X X X X X X

Financial fraud X X X X X X X X X

Hacking detection X X X X X X X X X

Game gestures X X X X X X X X

Big science X X X X X X X X X

Data bag exploration X X X X X X

Risk analysis X X X X X X X X

Churn analysis X X X X X X X


Big data analytics system requirements

Before discussing the exciting new technical and architectural developments of the 2010s, let's summarize the overall requirements for supporting big data analytics, keeping in mind that we are not requiring a single system or a single vendor's technology to provide a blanket solution for every use case. From the perspective of 2011, we have the luxury of standing back from all these use cases gathered in the last few years, and we are now in a position to surround the requirements with some confidence.

The development of big data analytics has reached a point where it needs an overall mission statement and identity independent of a list of use cases. Many of us have lived through earlier instantiations of advanced analytics that went by the names of advanced statistics, artificial intelligence and data mining. None of these earlier waves became a coherent theme that transcended the individual examples, as compelling as those examples were.

Here is an attempt to step back and define the characteristics of big data analytics at the highest levels. In the following, the term "UDF" is used in the broadest sense of any user defined function or program or algorithm that may appear anywhere in the end-to-end analysis architecture.

In the coming 2010s decade, the analysis of big data will require a technology or combination of technologies capable of:

scaling to easily support petabytes (thousands of terabytes) of data

being distributed across thousands of processors, potentially geographically unaware, and potentially heterogeneous

subsecond response time for highly constrained standard SQL queries

embedding arbitrarily complex user-defined functions (UDFs) within processing requests

implementing UDFs in a wide variety of industry-standard procedural languages

assembling extensive libraries of reusable UDFs crossing most or all use cases

executing UDFs as "relation scans" over petabyte sized data sets in a few minutes

supporting a wide variety of data types growing to include images, waveforms, arbitrarily hierarchical data structures, and data bags

loading data to be ready for analysis, at very high rates, at least gigabytes per second

integrating data from multiple sources during the load process at very high rates (GB/sec)

loading data before declaring or discovering its structure

executing certain streaming analytic queries in real time on incoming load data

updating data in place at full load speeds

joining a billion row dimension table to a trillion row fact table without pre-clustering the dimension table with the fact table

scheduling and execution of complex multi-hundred node workflows

being configured without being subject to a single point of failure

failover and process continuation when processing nodes fail

supporting extreme mixed workloads including thousands of geographically dispersed on-line users and programs executing a variety of requests ranging from ad hoc queries to strategic analysis, and while loading data in batch and streaming fashion
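To get a feel for the magnitudes implied by the scan and load requirements in the list above, a little back-of-the-envelope arithmetic helps. The Python below is only a sketch: the five-minute scan window, the 1,000-node count, and the 1 GB/sec load rate are assumed figures chosen for illustration, and decimal units (1 PB = 10^15 bytes) are assumed throughout.

```python
PETABYTE = 10**15  # bytes, decimal convention (assumption)
GIGABYTE = 10**9   # bytes


def required_scan_rate(dataset_bytes, minutes, nodes):
    """Return (aggregate GB/s, per-node GB/s) needed to scan the data set."""
    seconds = minutes * 60
    aggregate = dataset_bytes / seconds / GIGABYTE
    return aggregate, aggregate / nodes


def load_time_hours(dataset_bytes, load_gb_per_sec):
    """Hours needed to load the data set at a sustained rate."""
    return dataset_bytes / (load_gb_per_sec * GIGABYTE) / 3600


if __name__ == "__main__":
    aggregate, per_node = required_scan_rate(PETABYTE, minutes=5, nodes=1000)
    print(f"aggregate scan rate: {aggregate:,.0f} GB/s")
    print(f"per-node scan rate:  {per_node:.2f} GB/s")
    print(f"load 1 PB at 1 GB/s: {load_time_hours(PETABYTE, 1):.0f} hours")
```

Under these assumptions, scanning a petabyte in five minutes calls for roughly 3,300 GB/sec in aggregate, which is why the scan requirement only makes sense alongside distribution across thousands of processors, and loading a petabyte at a sustained 1 GB/sec takes on the order of eleven days.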

Two architectures have emerged to address big data analytics: extended RDBMS, and MapReduce/Hadoop. These architectures are being implemented as completely separate systems and in various interesting hybrid combinations involving both architectures. We will start by discussing the architectures separately.

Extended relational database management systems

All of the major relational database management system vendors are adding features to address big data analytics from a solid relational perspective. The two most significant architectural developments have been the overtaking of the high end of the market with massively parallel processing (MPP), and the growing adoption of columnar storage. When MPP and columnar storage techniques are combined, a number of the system requirements in the above list can start to be addressed, including:

scaling to support exabytes (thousands of petabytes) of data

being distributed across tens of thousands of geographically dispersed processors

subsecond response time for highly constrained standard SQL queries

updating data in place at full load speeds

being configured without being subject to a single point of failure

failover and process continuation when processing nodes fail

Additionally, RDBMS vendors are adding some complex user-defined functions (UDFs) to their syntax, but the kind of general purpose procedural language computing required by big data analytics is not being satisfied in relational environments at this time.

In a similar vein, RDBMS vendors are allowing complex data structures to be stored in individual fields. These kinds of embedded complex data structures have been known as "blobs" for many years. It's important to understand that relational databases have a hard time providing general support for interpreting blobs since blobs do not fit the relational paradigm. An RDBMS indeed provides some value by hosting the blobs in a structured framework, but much of the complex interpretation and computation on the blobs must be done with specially crafted UDFs, or BI application layer clients. Blobs are related to data bags discussed elsewhere in this paper. See the section entitled "Data structures should be declared at query time."

MPP implementations have never satisfactorily addressed the "big join" issue, where an attempt is made to join a billion row dimension table to a trillion row fact table without resorting to clustered storage. The big join crisis occurs when an ad hoc constraint is placed against the dimension table, resulting in a potentially very large set of dimension keys that must be physically downloaded into every one of the physical segments of the trillion row fact table stored separately in the MPP system. Since the dimension keys are scattered randomly across the separate segments of the trillion row fact table, it is very hard to avoid a lengthy download step of the very large dimension table to every one of the fact table storage partitions. To be fair, the MapReduce/Hadoop architecture has not been able to address the big join problem either.
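The cost pattern behind the big join can be sketched in a few lines of Python. The toy below models fact table segments as lists of rows and ships the full set of qualifying dimension keys to every segment, counting the keys sent. It is an illustration of the broadcast step described above, not any vendor's join algorithm; the `customer_key` column name and the sample data are hypothetical.

```python
def qualifying_keys(dimension, predicate):
    """Apply an ad hoc constraint to the dimension and return matching keys."""
    return {key for key, attrs in dimension.items() if predicate(attrs)}


def broadcast_join(dimension, fact_segments, predicate):
    """Join by shipping the full qualifying key set to every fact segment.

    Returns the matching fact rows and the number of keys sent over the
    network, which is len(keys) * len(fact_segments).
    """
    keys = qualifying_keys(dimension, predicate)
    matches = []
    keys_shipped = 0
    for segment in fact_segments:
        keys_shipped += len(keys)  # every segment needs its own copy
        matches.extend(row for row in segment if row["customer_key"] in keys)
    return matches, keys_shipped


if __name__ == "__main__":
    # Hypothetical three-row dimension and three fact segments.
    dimension = {1: {"state": "CA"}, 2: {"state": "NY"}, 3: {"state": "CA"}}
    fact_segments = [
        [{"customer_key": 1, "amount": 10}, {"customer_key": 2, "amount": 5}],
        [{"customer_key": 3, "amount": 7}],
        [{"customer_key": 2, "amount": 1}],
    ]
    rows, shipped = broadcast_join(
        dimension, fact_segments, lambda attrs: attrs["state"] == "CA"
    )
    print(len(rows), "matching rows;", shipped, "keys shipped")
```

Because the keys shipped grow as qualifying keys times segments, a loosely constrained billion row dimension multiplied across many fact partitions quickly dominates the query. Note that the third segment receives the key set even though it holds no matching rows, since that cannot be known in advance when keys are scattered randomly.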

Columnar data storage fits the relational paradigm, and especially dimensionally modeled databases, very well. Besides the significant advantage of high compression of sparse data, columnar databases allow a very large number of columns compared to row-oriented databases, and place little overhead on the system when columns are added to an existing schema. The most significant Achilles' heel, at least in 2011, is the slow loading speed of data into the columnar format. Although impressive load speed improvements are being announced by columnar database vendors, they have still not achieved the gigabytes-per-second requirement listed above.
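A small Python sketch can show why a column layout compresses sparse data well and why adding a column is cheap. Run-length encoding is used here as one common compression technique; it is an assumption for illustration and not a claim about how any particular columnar product stores data. The function names and sample rows are hypothetical.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    names = sorted({name for row in rows for name in row})
    return {name: [row.get(name) for row in rows] for name in names}


def run_length_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


if __name__ == "__main__":
    # Hypothetical sales rows where the promo column is mostly empty.
    rows = [
        {"id": 1, "promo": None},
        {"id": 2, "promo": None},
        {"id": 3, "promo": "X"},
        {"id": 4, "promo": None},
    ]
    columns = to_columns(rows)
    print(run_length_encode(columns["promo"]))

    # Adding a column touches no existing column storage.
    columns["region"] = ["west"] * len(rows)
    print(run_length_encode(columns["region"]))
```

The sketch also hints at the load speed problem noted above: every incoming row has to be taken apart and appended to many separate column structures, which is more work than writing the row once.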

The standard RDBMS architecture for implementing an enterprise data warehouse based on dimensional modeling principles is simple and well understood, as shown in Figure 1. Recall that throughout this white paper, the EDW is defined in the comprehensive sense to include all back room and front room processes including ETL, data presentation, and BI applications.

Figure 1. The standard RDBMS based architecture for an enterprise data warehouse
Source: The Data Warehouse Lifecycle Toolkit, 2nd edition, Kimball et al. (2008)


In this standard EDW architecture the ETL system is a major component that sits between the source systems and the presentation servers that are responsible for exposing all data to business intelligence applications. In this view, the ETL system adds significant value by cleaning, conforming, and arranging the data into a series of dimensional schemas which are then stored physically in the presentation server. A crucial element of this architecture is the preparation of conformed dimensions in the ETL system that serve as the basis of integration for the BI applications. It is the strong conviction of this author that deferring the building of the dimensional structures and the issues of integration until query time is the wrong architecture. Such a "deferred computation" approach requires an unduly expensive query optimizer to correctly query complex non-dimensional models every time a query is presented. The calculation of integration at query processing time generally requires complex application logic in the BI tools which also might have to be executed for every query.

The extended RDBMS architecture to support big data analytics preserves the standard architecture with a number of important additions, shown below in Figure 2 with large arrows:

Figure 2. The extended RDBMS based architecture for an enterprise data warehouse

The fact that the high-level enterprise data warehouse architecture is not materially changed by the introduction of new data structures, or a growing library of specially crafted user-defined functions, or powerful procedural language-based programs acting as powerful BI clients, is the charm of the extended RDBMS approach to big data analytics. The major RDBMS players are able to marshal their enormous legacy of millions of lines of code, powerful governance capabilities, and system stability built over decades of serving the marketplace.


However, it is the opinion of this author that the extended RDBMS systems cannot be the only solution for big data analytics. At some point, tacking on non-relational data structures and non-relational processing algorithms to the basic, coherent RDBMS architecture will become unwieldy and inefficient. The Swiss Army knife analogy comes to mind. Another analogy closer to the topic is the programming language PL/1. Originally designed as an overarching, multipurpose, powerful programming language for all forms of data and all applications, it ultimately became a bloated and sprawling corpus that tried to do too many things in a single language. Since the heyday of PL/1 there has been a wonderful evolution of more narrowly focused programming languages with many new concepts and features that simply couldn't be tacked on to PL/1 after a certain point. Relational database management systems do so many things so well that there is no danger of suffering the same fate as PL/1. The big data analytics space is growing so rapidly and in such exciting and unexpected new directions that a lighter weight, more flexible and more agile processing framework in addition to RDBMS systems may be a reasonable alternative.

    MapReduce/Hadoop systems

    MapReduce is a processing framework originally developed by Google in the early 2000s for performing web page searches across thousands of physically separated machines. The MapReduce approach is extremely general. Complete MapReduce systems can be implemented in a variety of languages although the most significant implementation is in Java. MapReduce is really a UDF (user-defined function) execution framework, where the "F" can be extraordinarily complex. Originally targeted to building Google's web page search index, a MapReduce job can be defined for virtually any data structure and any application. The target processors that actually perform the requested computation can be identical (a "cluster"), or can be a heterogeneous mix of processor types (a "grid"). The data in each processor upon which the ultimate computation is performed can be stored in a database, or more commonly in a file system, and can be in any digital format.

    The most significant implementation of MapReduce is Apache Hadoop, known simply as Hadoop. Hadoop is an open source, top-level Apache project, with thousands of contributors and a whole industry of diverse applications. Hadoop runs natively on its own distributed file system (HDFS) and can also read and write to Amazon S3 and others. Conventional database vendors are also implementing interfaces to allow Hadoop jobs to be run over massively distributed instances of their databases.

    As we will see when we give a brief overview of how a Hadoop job works, bandwidth between the separate processors can be a huge issue. HDFS is a so-called "rack aware" file system because the central name node knows which nodes reside on the same rack and which are connected by more than one network hop. Hadoop exploits the relationship between the central job dispatcher and HDFS to significantly optimize a massively distributed processing task by having detailed knowledge of where data actually resides. This also implies that a critical aspect of performance control is co-locating segments of data on actual physical hardware racks so that the MapReduce communication can be accomplished at backplane speeds rather than slower network speeds. Note that remote cloud-based file systems such as Amazon S3 and CloudStore are, by their nature, unable to provide the rack aware benefit. Of course, cloud-based file systems have a number of compelling advantages which we'll discuss later.
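
    To make the rack aware idea concrete, the following is a minimal sketch in Python of how a dispatcher might prefer a worker that is close to the data. It is not Hadoop's scheduler; the Node type, the pick_worker function, and the distance values (0 for the same node, 2 for the same rack, 4 for different racks) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A hypothetical processing node identified by name and rack."""
    name: str
    rack: str


def network_distance(a: Node, b: Node) -> int:
    """Illustrative distance metric (an assumption of this sketch):
    0 = same node, 2 = same rack, 4 = different racks."""
    if a == b:
        return 0
    if a.rack == b.rack:
        return 2
    return 4


def pick_worker(replica_holders, idle_workers):
    """Choose the idle worker closest to any node holding a copy of the split."""
    return min(
        idle_workers,
        key=lambda w: min(network_distance(w, r) for r in replica_holders),
    )


# The split is stored on n1 (rack A) and n5 (rack B).
n1, n2 = Node("n1", "rackA"), Node("n2", "rackA")
n5, n7 = Node("n5", "rackB"), Node("n7", "rackC")
replicas = [n1, n5]

# n2 shares a rack with a replica, n7 does not, so n2 is preferred.
chosen = pick_worker(replicas, [n7, n2])
```

    A remote file system that hides node and rack placement gives the dispatcher nothing to feed into a function like network_distance, which is why the rack aware benefit is lost in that setting.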

    How MapReduce works in Hadoop

    A MapReduce job is submitted to a centralized JobTracker, which in turn schedules parts of the job to a number of TaskTracker nodes. Although a TaskTracker may fail and have its task reassigned by the JobTracker, the JobTracker itself is a single point of failure. If the JobTracker halts, the MapReduce job must be restarted or be resumed from intermediate snapshots.

    A MapReduce job is always divided into two distinct phases, map and reduce. The overall input to a MapReduce job is divided into many equal-sized splits, each of which is assigned a map task. The map function is then applied to each record in each split. For large jobs, the JobTracker schedules these map tasks in parallel. The overall performance of a MapReduce job depends significantly on achieving a balance of enough parallel splits to keep many machines busy, but not so many parallel splits that the interprocess communication of managing all the splits bogs down the overall job. When MapReduce is run over the HDFS file system, a typical default split size is 64 MB of input data.
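
    The arithmetic behind this balance can be sketched in a few lines of Python. The function names, the 1 TiB input, and the figure of 400 concurrent map tasks are hypothetical; only the 64 MB split size comes from the text above.

```python
def count_splits(input_bytes: int, split_bytes: int = 64 * 1024**2) -> int:
    """Number of splits (and therefore map tasks), rounding up for the last partial split."""
    return (input_bytes + split_bytes - 1) // split_bytes


def count_waves(num_splits: int, concurrent_map_tasks: int) -> int:
    """Number of scheduling rounds needed if only a fixed number of map tasks can run at once."""
    return (num_splits + concurrent_map_tasks - 1) // concurrent_map_tasks


one_tib = 1024**4
splits = count_splits(one_tib)      # 16384 map tasks for 1 TiB at 64 MB per split
waves = count_waves(splits, 400)    # 41 rounds on a cluster running 400 map tasks at a time
```

    Halving the split size doubles the number of map tasks to be scheduled and tracked, which is the management overhead the paragraph above warns about.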

    As the name suggests, the map task is the first half of the MapReduce job. Each map task produces a set of intermediate result records which are written to the local disk of the machine performing the map task. The second half of the MapReduce job, the reduce task, may run on any processing node. The outputs of the mappers (nodes running map tasks) are sorted and partitioned in such a way that these outputs can be transferred to the reducers (nodes running the reduce task). The final outputs of the reducers comprise the sorted and partitioned result set of the overall MapReduce job. In MapReduce running over HDFS, the result set is written to HDFS and is replicated for reliability.

    In Figure 3, we show this task flow for a MapReduce job with three mapper nodes feeding two reducer nodes, by reproducing figure 2.3 from Tom White's book, Hadoop: The Definitive Guide, 2nd Edition (O'Reilly, 2010).


    Figure 3. An example MapReduce job

    In Tom White's book, a simple MapReduce job is described which we extend somewhat here. Suppose that the original data before the splits are applied consists of a very large number (perhaps billions) of unsorted temperature measurements, one per record. Such measurements could come from many thousands of automatic sensors located around the United States. The splits are assigned to the separate mapper nodes to equalize as much as possible the number of records going to each node. The mapper inputs take the form of key-value pairs, in this case a sequential record identifier and the full record containing the temperature measurements as well as other data. The job of each mapper is simply to parse the records presented to it and extract the year, the state, and the temperature, which become the second set of key-value pairs passed from the mapper to the reducer. The job of each reducer is to find the maximum reported temperature for each state and each distinct year in the records passed to it. Each reducer is responsible for a state, so in order to accomplish the transfer, the output of each mapper must be sorted so that the key-value pairs can be dispatched to the appropriate reducers. In this case there would be 50 reducers, one for each state. These sorted blocks are then transferred to the reducers in a step which is a critical feature of the MapReduce architecture, where it is called the "shuffle."

    Notice that the shuffle involves a true physical transfer of data between processing nodes. This makes the value of the rack aware feature more obvious, since a lot of data needs to be moved from the mappers to the reducers. The clever reader may wonder if this data transfer could be reduced by having the mapper outputs combined so that many readings from a single state and year are given to the reducer as a single key-value pair rather than many. The answer is yes, and Hadoop provides a combiner function to accomplish exactly this end.


    Each reducer receives a large number of state/year-temperature key-value pairs, and finds the maximum temperature for each year. These maximum temperatures for each year are the final output from each reducer.
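
    The temperature job described above can be imitated on a single machine with the following Python sketch. It is an illustration of the map, combine, shuffle, and reduce steps, not Hadoop code; the comma-separated record layout (year, state, temperature, other data), the function names, and the sample readings are assumptions made for the example.

```python
from collections import defaultdict


def map_record(record_id, line):
    """Mapper: parse one raw record and emit ((state, year), temperature)."""
    year, state, temp = line.split(",")[:3]
    yield (state, int(year)), float(temp)


def combine(pairs):
    """Combiner: keep only the local maximum per (state, year) to shrink the shuffle."""
    best = {}
    for key, temp in pairs:
        if key not in best or temp > best[key]:
            best[key] = temp
    return list(best.items())


def shuffle(mapper_outputs):
    """Shuffle: sort each mapper's output and route it to the reducer for its state."""
    by_state = defaultdict(list)
    for output in mapper_outputs:
        for (state, year), temp in sorted(output):
            by_state[state].append((year, temp))
    return by_state


def reduce_state(state, year_temps):
    """Reducer: maximum temperature for each year within one state."""
    maxima = {}
    for year, temp in year_temps:
        maxima[year] = max(temp, maxima.get(year, temp))
    return {(state, year): temp for year, temp in sorted(maxima.items())}


def run_job(splits):
    """Run mappers (with combiners) over each split, then shuffle and reduce."""
    mapper_outputs = []
    for split in splits:
        pairs = [kv for rec_id, line in split for kv in map_record(rec_id, line)]
        mapper_outputs.append(combine(pairs))
    results = {}
    for state, year_temps in shuffle(mapper_outputs).items():
        results.update(reduce_state(state, year_temps))
    return results


# Three splits of (record identifier, raw record) pairs; values are made up.
SAMPLE_SPLITS = [
    [(0, "2009,TX,38.5,sensor-a"), (1, "2009,TX,40.1,sensor-b")],
    [(2, "2010,TX,39.0,sensor-a"), (3, "2009,AK,21.0,sensor-c")],
    [(4, "2009,AK,25.5,sensor-d")],
]
result = run_job(SAMPLE_SPLITS)
```

    In the first split the combiner forwards one pair for Texas in 2009 instead of two, which is the reduction in shuffle traffic that the combiner is meant to provide.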

    This approach can be scaled more or less indefinitely. Really serious MapReduce jobs running on HDFS may have hundreds or thousands of mappers and reducers, processing petabytes of input data.

    At this point the appeal of the MapReduce/Hadoop approach should be clear. There are virtually no restrictions on the form of the inputs to the overall job. There only needs to be some rational basis for creating splits and reading records, in this case the record identifier in Tom White's example. Actual logic in the mappers and the reducers can be programmed in virtually any programming language and can be as simple as the above example, or much more complicated UDFs. The reader should be able to visualize how some of the more complex use cases (e.g., comparison of satellite images) described earlier in the paper could fit into this framework.

    Tools for the Hadoop environment

    What we have described thus far is the core processing component when MapReduce is run in the Hadoop environment. This is roughly equivalent to describing the inner processing loop in a relational database management system. In both cases there's a lot more to these systems to implement a complete functioning environment. The following is a brief overview of typical tools used in a MapReduce/Hadoop environment. We group these tools by overall function. Tom White's book, mentioned above, is an excellent starting point for understanding how these tools are used.

    Getting data in and getting data out

    ETL platforms -- ETL platforms, with their long history of importing and exporting data to relational databases, provide specific interfaces for moving data into and out of HDFS. The platform-based approach, as contrasted with hand coding, provides extensive support for metadata, data quality, documentation, and a visual style of system building.

    Sqoop -- Sqoop, developed by Cloudera, is an open source tool that allows importing data from a relational source to HDFS and exporting data from HDFS to a relational target. Data imported by Sqoop into HDFS can be used both by MapReduce applications and HBase applications. HBase is described below.

    Scribe -- Scribe, developed at Facebook and released as open source, is used to aggregate log data from a large number of Web servers.

    Flume -- Flume, developed by Cloudera, is a distributed reliable streaming data collection service. It uses a central configuration managed by ZooKeeper and supports tunable reliability and automatic failover and recovery.


    Programming

    Low-level MapReduce programming -- primary code for mappers and reducers can be written in a number of languages. Hadoop's native language is Java but Hadoop exposes APIs for writing code in other languages such as Ruby and Python. An interface to C++ is provided, which is named Hadoop Pipes. Programming MapReduce at the lowest level obviously provides the most potential power, but this level of programming is very much like assembly language programming. It can be very laborious, especially when attempting to do conceptually simple tasks like joining two data sets.

    High-level MapReduce programming -- Apache Pig, or simply Pig, is a client-side open-source application providing a high-level programming language for processing large data sets in MapReduce. The programming language itself is called Pig Latin. Hive is an alternative application designed to look much more like SQL, and is used for data warehousing use cases. When employed for the appropriate use cases, Pig and Hive provide enormous programming productivity benefits over low-level MapReduce programming, often by a factor of 10 or more. Pig and Hive lift the application developer's perspective up from managing the detailed mapper and reducer processes to more of an applications focus.

    Integrated development environment -- MapReduce/Hadoop development needs to move decisively away from bare hand coding to be adopted by mainstream IT shops. An integrated development environment for MapReduce/Hadoop needs to include editors for source code, compilers, tools for automating system builds, debuggers, and a version control system.

    Integrated application environment -- an even higher layer above an integrated development environment could be called an integrated application environment, where complex reusable analytic routines are assembled into complete applications via a graphical user interface. This kind of environment might be able to use open source algorithms such as those provided by the Apache Mahout project, which distributes machine learning algorithms on the Hadoop platform.

    Cascading -- Cascading is another tool that is an abstraction layer for writing complex MapReduce applications. It is best described as a thin Java library typically invoked from the command line to be used as a query API and process scheduler. It is not intended to be a comprehensive alternative to Pig or Hive.

    HBase -- HBase is an open-source, nonrelational, column-oriented database that runs directly on Hadoop. It is not a MapReduce implementation. A principal differentiator of HBase from Pig or Hive (MapReduce implementations) is the ability to provide real-time read and write random access to very large data sets.

    Oozie -- Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop jobs, such as MapReduce, Pig, Hive, Sqoop, HDFS operations, and sub-workflows.

    ZooKeeper -- ZooKeeper is a centralized configuration manager for distributed applications. ZooKeeper can be used independently of Hadoop as well.
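
    Returning to the remark under low-level MapReduce programming that joining two data sets is laborious, the sketch below shows in Python what a hand-written "reduce-side" join involves: each mapper tags its records with their source, the shuffle groups them by join key, and the reducer pairs them up. The customer and order data, the tags, and all function names are hypothetical; this is one common technique, not the only way to write a join.

```python
from collections import defaultdict


def map_customers(rows):
    """Mapper for the customer data set: emit (join key, tagged value)."""
    for customer_id, name in rows:
        yield customer_id, ("C", name)


def map_orders(rows):
    """Mapper for the order data set: emit (join key, tagged value)."""
    for order_id, customer_id, amount in rows:
        yield customer_id, ("O", (order_id, amount))


def shuffle(*mapper_outputs):
    """Group all tagged values from all mappers by join key."""
    groups = defaultdict(list)
    for output in mapper_outputs:
        for key, tagged in output:
            groups[key].append(tagged)
    return groups


def reduce_join(key, tagged_values):
    """Reducer: pair every customer record with every order sharing the key."""
    names = [value for tag, value in tagged_values if tag == "C"]
    orders = [value for tag, value in tagged_values if tag == "O"]
    for name in names:
        for order_id, amount in orders:
            yield (key, name, order_id, amount)


def join(customers, orders):
    """Inner join of customers and orders on customer_id."""
    groups = shuffle(map_customers(customers), map_orders(orders))
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined


customers = [(1, "Ann"), (2, "Bob")]
orders = [(100, 1, 9.5), (101, 1, 3.0), (102, 3, 7.0)]
joined = join(customers, orders)
```

    The same result is a single JOIN clause in a SQL-like language such as Hive's, which gives a sense of the productivity gap described above.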

    Administering

    Embedded Hadoop admin features -- Hadoop supports a comprehensive runtime environment including edit log, safe mode operation, audit logging, filesystem check, data node block verifier, data node block distribution balancer, performance monitor, comprehensive log files, metrics for administrators, counters for MapReduce users, metadata backup, data backup, filesystem balancer, commissioning and decommissioning nodes.

    Java management extensions -- a standard Java API for monitoring and managing applications

    GangliaContext -- an open source distributed monitoring system for very large clusters

    Feature convergence in the coming decade

    It is safe to say that relational database management systems and MapReduce/Hadoop systems will increasingly find ways to coexist gracefully in the coming decade. But the systems have distinct characteristics, as depicted in the following table:

    In the upcoming decade RDBMSs will extend their support for hosting complex data types as "blobs," and will extend APIs for arbitrary analytic routines to operate on the contents of records. MapReduce/Hadoop systems, especially Hive, will deepen their support for SQL interfaces and move toward fuller support of the complete SQL language. But neither will take over the market for big data analytics exclusively. As remarked earlier, RDBMSs cannot provide "relational" semantics for many of the complex use cases required by big data analytics. At best, RDBMSs will provide relational structure surrounding the complex payloads.

    Relational DBMSs                                     MapReduce/Hadoop

    Proprietary, mostly                                  Open source
    Expensive                                            Less expensive
    Data requires structuring                            Data does not require structuring
    Great for speedy indexed lookups                     Great for massive full data scans
    Deep support for relational semantics                Indirect support for relational semantics, e.g. Hive
    Indirect support for complex data structures         Deep support for complex data structures
    Indirect support for iteration, complex branching    Deep support for iteration, complex branching
    Deep support for transaction processing              Little or no support for transaction processing


    Similarly, MapReduce/Hadoop systems will never take over ACID-compliant transaction processing, or become superior to RDBMSs for indexed queries on row and column oriented tables.

    As this paper is being written, significant advances are being made in developing hybrid systems using both relational database technology and MapReduce/Hadoop technology. Figure 4 illustrates two primary alternatives. The first alternative delivers the data directly into a MapReduce/Hadoop configuration for primary non-relational analysis. As we have described, this analysis can range the full gamut from complex analytical routines to simple sorting that looks like a conventional ETL step. When the MapReduce/Hadoop step is complete, the results are loaded into an RDBMS for conventional structured querying with SQL.

    The second alternative configuration loads the data directly to an RDBMS, even when the primary data payloads are not conventional scalar measurements. At that point two analysis modes are possible. The data can be analyzed with specially crafted user-defined functions, effectively from the BI layer, or passed to a downstream MapReduce/Hadoop application.

    In the future even more complex combinations will tie these architectures more closely together, including MapReduce systems whose mappers and reducers are actually relational databases, and relational database systems whose underlying storage consists of HDFS files.

    Figure 4. Alternative hybrid architectures using both RDBMS and Hadoop.

    It will probably be difficult for IT organizations to sort out the claims of vendors, who will almost certainly assert that their systems do everything. In some cases these claims are "objection removers," which means that they are claims that have a grain of truth to them, and are made to make you feel good, but do not stand up to scrutiny in a competitive and practical environment. Buyer beware!

    Reusable analytics

    Up to this point we have deferred the question of where all the special analytic software comes from. Big data analytics will never prosper if every instance is a custom coded solution. Both the RDBMS and the open-source communities recognize this and two main development themes have emerged. High-end statistical analysis vendors, such as SAS, have developed extensive and proprietary reusable libraries for a wide range of analytic applications, including advanced statistics, data mining, predictive analytics, feature detection, linear models, discriminant analysis, and many others. The open source community has a number of initiatives, the most notable of which are Hadoop-ML and Apache Mahout. Quoting from Hadoop-ML's website:

    Hadoop-ML (is) an infrastructure to facilitate the implementation of parallel machine learning/data mining (ML/DM) algorithms on Hadoop. Hadoop-ML has been designed to allow for the specification of both task-parallel and data-parallel ML/DM algorithms. Furthermore, it supports the composition of parallel ML/DM algorithms using both serial as well as parallel building blocks -- this allows one to write reusable parallel code. The proposed abstraction eases the implementation process by requiring the user to only specify computations and their dependencies, without worrying about scheduling, data management, and communication. As a consequence, the codes are portable in that the user never needs to write Hadoop-specific code. This potentially allows one to leverage future parallelization platforms without rewriting one's code.

    Apache Mahout provides free implementations of machine learning algorithms on the Hadoop platform.

    Complex event processing (CEP)

    Complex event processing (CEP) consists of processing events happening inside and outside an organization to identify meaningful patterns in order to take subsequent action in real time. For example, CEP is used in utility networks (electrical, gas and water) to identify possible issues before they become detrimental. These CEP deployments allow for real-time intervention for critical network or infrastructure situations. The combination of deep DW analytics and CEP can be applied in retail customer settings to analyze behavior and identify situations where a company may lose a customer or be able to sell them additional products or services at the time of their direct engagement. In banking, sophisticated analytics might help to identify the 10 most common patterns of fraud and CEP can then be used to watch for those patterns so they may be thwarted before a loss.

    At the time of this white paper, CEP is not generally thought of as part of the EDW, but this author believes that technical advances in continuous query processing will cause CEP and EDW to share data and work more closely together in the coming decade.

    Data warehouse cultural changes in the coming decade

    The enterprise data warehouse must absolutely stay relevant to the business. As the value and the visibility of big data analytics grow, the data warehouse must encompass the new culture, skills, techniques, and systems required for big data analytics.

    Sandboxes

    For example, big data analysis encourages exploratory sandboxes for experimentation. These sandboxes are copies or segments of the massive data sets being sourced by the organization. Individual analysts or very small groups are encouraged to analyze the data with a very wide variety of tools, ranging from serious statistical tools like SAS, Matlab or R, to predictive models, and many forms of ad hoc querying and visualization through advanced BI graphical interfaces. The analyst responsible for a given sandbox is allowed to do anything with the data, using any tool they want, even if the tools they use are not corporate standards. The sandbox phenomenon has enormous energy but it carries a significant risk to the IT organization and EDW architecture because it could create isolated and incompatible stovepipes of data. This point is amplified in the section on organizational changes, below.

    Exploratory sandboxes usually have a limited time duration, lasting weeks or at most a few months. Their data can be a frozen snapshot, or a window on a certain segment of incoming data. The analyst may have permission to run an experiment changing a feature on the product or service in the marketplace, and then performing A/B testing to see how the change affects customer behavior. Typically, if such an experiment produces a successful result, the sandbox experiment is terminated, and the feature goes into production. At that point, tracking applications that may have been implemented in the sandbox using a quick and dirty prototyping language are usually reimplemented by other personnel in the EDW environment using corporate standard tools. In several of the e-commerce enterprises interviewed for this white paper, analytic sandboxes were extremely important, and in some cases hundreds of the sandbox experiments were ongoing simultaneously. As one interviewee commented, "newly discovered patterns have the most disruptive potential, and insights from them lead to the highest returns on investment."

    Architecturally, sandboxes should not be brute force copies of entire data sets, or even major segments of these data sets. In dimensional modeling parlance, the analyst needs much more than just a fact table to run the experiment. At a minimum the analyst also needs one or more very large dimension tables, and possibly additional fact tables for complete "drill across" analysis. If 100 analysts are creating brute force copy versions of the data for the sandboxes there will be enormous wasting of disk space and resources for all the redundant copies. Remember that the largest dimension tables, such as customer dimensions, can have 500 million rows! The recommended architecture for a serious sandbox environment is to build each sandbox using conformed (shared) dimensions which are incorporated into each sandbox as relational views, or their equivalent under Hadoop applications.
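
    A small sketch of this recommendation, using Python's built-in sqlite3 module, is shown below. The table, view, and column names (customer_dim, sandbox_a_customer, sandbox_a_clicks) and the sample rows are hypothetical. The point is only that the sandbox holds its own small fact table while the large dimension is exposed through a view rather than copied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conformed (shared) customer dimension, maintained once for the enterprise.
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY,
        name         TEXT,
        segment      TEXT
    );
    INSERT INTO customer_dim VALUES (1, 'Ann', 'retail'), (2, 'Bob', 'wholesale');

    -- Sandbox A: a private experimental fact table ...
    CREATE TABLE sandbox_a_clicks (customer_key INTEGER, clicks INTEGER);
    INSERT INTO sandbox_a_clicks VALUES (1, 5), (1, 7), (2, 3);

    -- ... plus a view on the shared dimension instead of a physical copy.
    CREATE VIEW sandbox_a_customer AS
        SELECT customer_key, name, segment FROM customer_dim;
""")


def clicks_by_segment(connection):
    """Join the sandbox fact table to the dimension view and total clicks per segment."""
    return connection.execute("""
        SELECT d.segment, SUM(f.clicks)
        FROM sandbox_a_clicks AS f
        JOIN sandbox_a_customer AS d USING (customer_key)
        GROUP BY d.segment
        ORDER BY d.segment
    """).fetchall()


before = clicks_by_segment(conn)

# A correction to the shared dimension is visible in the sandbox immediately,
# because the view stores no rows of its own.
conn.execute("UPDATE customer_dim SET segment = 'retail' WHERE customer_key = 2")
after = clicks_by_segment(conn)
```

    With 100 sandboxes built this way there is still only one stored copy of the customer dimension, and every analyst sees the same, current attribute values.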

    Low latency

    An elementary mistake when gathering business requirements during the design of a data warehouse is to ask the business user if they want "real time" data. Users are likely to say "of course!" Although perhaps this answer has been somewhat gratuitous in the past, a good business case can now be made in many situations that more frequent updates of data delivered to the business with lower and lower latencies are justified. Both RDBMSs and MapReduce/Hadoop systems struggle with loading gigantic amounts of data and making that data available within seconds of that data being created. But the marketplace wants this, and regardless of a technologist's doubt about the requirement, the requirement is real and over the next decade it must be addressed.

    An interesting angle on low latency data is the desire to begin serious analysis on the data as it is streaming in, but possibly far before the data collection process even terminates. There is significant interest in streaming analysis systems which allow SQL-like queries to process the data as it flows into the system. In some use cases when the results of a streaming query surpass a threshold, the analysis can be halted without running the job to the bitter end. An academic effort, known as continuous query language (CQL), has made impressive progress in defining the requirements for streaming data processing including clever semantics for dynamically moving time windows on the streaming data. Look for CQL language extensions and streaming data query capabilities in the load programs for both RDBMSs and HDFS deployed data sets. An ideal implementation would allow streaming data analysis to take place while the data is being loaded at gigabytes per second.
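
    The idea of a moving time window with an early halt can be sketched in Python as follows. This is not CQL and does not reproduce its semantics; the function names, the trailing-window definition, and the sample events are assumptions, and the events are assumed to arrive in timestamp order.

```python
from collections import deque


def windowed_sums(events, window_seconds):
    """For each (timestamp, value) event, yield the sum of values in the trailing window.

    An event stays in the window while its timestamp is greater than
    (current timestamp - window_seconds).
    """
    window = deque()
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that have slid out of the window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total


def first_breach(events, window_seconds, threshold):
    """Stop consuming the stream as soon as a windowed sum exceeds the threshold."""
    for ts, total in windowed_sums(events, window_seconds):
        if total > threshold:
            return ts
    return None


events = [(0, 1), (1, 1), (2, 5), (3, 1), (10, 1)]
sums = list(windowed_sums(events, window_seconds=3))
breach_at = first_breach(iter(events), window_seconds=3, threshold=6)
```

    Because windowed_sums is a generator, first_breach returns at timestamp 2 without reading the remaining events, which mirrors halting the analysis before the job runs to the end.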

    The availability of extremely frequent and extremely detailed event measurements can drive interactive intervention. The use cases where this intervention is important span many situations ranging from online gaming to product offer suggestions to financial account fraud responses to the stability of networks.

    Continuous thirst for more exquisite detail

    Analysts are forever thirsting for more detail in every marketplace observation, especially of customer behavior. For example every webpage event (a page being painted on a user's screen) spawns hundreds of records describing every object on the page. In online games, where every gesture enters the data stream, as many as 100 descriptors are attached to each of these gesture micro-events. For instance, in a hypothetical online baseball game, when the batter swings at a pitch, everything describing the position of the players, the score, runners on the bases, and even the characteristics of the pitch, are all stored with that individual record. In both of these examples, the complete context must be captured within the current record, because it is impractical to compute this detailed context after the fact from separate data sources. The lesson for the coming decade is that this thirst for exquisite detail will only grow. It is possible to imagine thousands of attributes being attached to some micro-events, and the categories and names of these attributes will grow in unpredictable ways. This makes the data bag approach discussed earlier in the paper much more important. It means that positionally dependent schemas, with the keys (names of the data) pre-declared as column names, are an unworkable design.

    Finally, a perfect historical reconstruction of interesting events such as webpage exposures needs to be more than just a list of attributes on the webpage when it was displayed, even if that list is enormously detailed. A perfect historical reconstruction of the webpage needs to be seen through a multimedia user interface, i.e., a browser.

    Light touch data waits for its relevance to be exposed

    Light touch data is an aspect of the exquisite detail data described in the previous section. For example, if a customer browses a website extensively before making a purchase, a great deal of micro-context is stored in all the webpage events prior to the purchase. When the purchase is made, some of that micro-context suddenly becomes much more important, and is elevated from "light touch data" to real data. At that point the sequence of exposures to the selected product or to competitive products in the same space can be sessionized. These micro-events are pretty much meaningless before the purchase event, because there are so many conceivable and irrelevant threads that would be dead ends for analysis. This requires oceans of light touch data to be stored, waiting for the relevance of selected threads of these micro-events to eventually be exposed. Conventional seasonality thinking suggests that at least five quarters (15 months) of this light touch data needs to be kept online. This is one instance of a remark made consistently during interviews for this white paper that analysts want "longer tails," which means that they want more significant histories than they currently get.

    Simple analysis of all the data trumps sophisticated analysis of some of the data

    Although data sampling has never been a popular technique in data warehousing, surprisingly the arrival of enormous petabyte-sized data sets has not increased the interest in analyzing a subset of the data. On the contrary, a number of analysts point out that monetizable insights can be derived from very small populations that could be missed by only sampling some of the data. Of course this is a somewhat controversial point, since the same analysts admit that if you have 1 trillion behavior observation records, you may be able to find any behavior pattern if you look hard enough.

    Another somewhat controversial point raised by some analysts is their concern that any form of data cleaning on the incoming data could erase interesting low-frequency "edge cases." Ultimately both the cases of misleading rare behavior patterns and misleading corrupted data need to be gently filtered out of the data.

    Assuming that the behavior insights from very small populations are valid, there is widespread recognition that micro-marketing to the small populations is possible, and doing enough of this can build a sustainable strategic advantage.


    A final argument in favor of analyzing complete data sets is that these "relation scans" do not require indexes or aggregations to be computed in advance of the analysis. This approach fits well with the basic MapReduce distributed analysis architecture.

    Data structures should be declared at query time, not at data load time

    A number of analysts interviewed for this white paper said that the enormous data sets they were trying to analyze needed to be loaded in a queryable state before the structure and content of the data sets were completely understood. Again, think of the data bag kind of marketplace observation where, within a well-structured dimensional measurement process, the actual observation is a disorderly and potentially unpredictable set of key-value pairs. The structure of this data bag may need to be discovered, and alternate interpretations of the structure may need to be possible without reloading the database. One respondent remarked that "yesterday's fringe data is tomorrow's well-structured data," implying that we need exceptional flexibility as we explore new kinds of data sources.

    A key differentiator between the RDBMS approach and the MapReduce/Hadoop approach is the deferral of the data structure declaration until query time in the MapReduce/Hadoop systems. An objection from the RDBMS community is that forcing every MapReduce job to declare the target data structure promotes a kind of chaos because every analyst can do their own thing. But that objection seems to miss the point that a standard data structure declaration can easily be published as a library module that can be picked up by every analyst when they are implementing their application.
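
    As a rough illustration of a structure declaration published as a shared module and applied only when the data is read, consider the Python sketch below. The JSON-lines storage format, the field names, and the read_with_schema helper are assumptions for the example and are not taken from any particular Hadoop tool.

```python
import json

# Hypothetical shared declaration that analysts would import from a common module:
# field name -> function that parses the raw string value.
PAGE_EVENT_SCHEMA = {"customer_id": int, "page": str, "dwell_ms": int}


def read_with_schema(raw_lines, schema):
    """Apply a structure declaration at read time.

    Declared fields are parsed and promoted to top-level keys; everything
    else stays in an untyped 'bag' of key-value pairs.
    """
    for line in raw_lines:
        bag = json.loads(line)
        row = {name: parse(bag.pop(name)) for name, parse in schema.items() if name in bag}
        row["bag"] = bag
        yield row


# Raw data is loaded once, as-is, with no structure imposed.
RAW = [
    '{"customer_id": "42", "page": "/home", "dwell_ms": "1500", "promo": "spring"}',
    '{"customer_id": "7", "page": "/cart"}',
]

standard_rows = list(read_with_schema(RAW, PAGE_EVENT_SCHEMA))

# A different interpretation of the same stored data, with no reload.
promo_rows = list(read_with_schema(RAW, {"promo": str}))
```

    Two analysts who import the same declaration get the same structure, while a third can apply an alternate declaration to the identical stored records.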

    The EDW supporting big data analytics must be magnetic, agile, and deep

    Cohen and Dolan in their seminal but somewhat controversial paper on big data analytics argue that EDWs must shed some old orthodoxies in order to be magnetic, agile, and deep. A magnetic environment places the fewest impediments on the incorporation of new, unexpected, and potentially dirty data sources. Specifically, this supports the need to defer declaration of data structures until after the data is loaded. According to Cohen and Dolan, an agile environment eschews long-range careful design and planning! And a deep environment allows running sophisticated analytic algorithms on massive data sets without sampling, or perhaps even cleaning. We have made these points elsewhere in this white paper but Cohen and Dolan's paper is a particularly potent, if unusual, argument. Read this paper to get some provocative perspectives! A link to Cohen and Dolan's paper is provided in the references section at the end of this white paper.

    The conflict between abstraction and control

    In the MapReduce/Hadoop world, Pig and Hive are widely regarded as valuable abstractions that allow the programmer to focus on database semantics rather than programming directly in Java. But several analysts interviewed for this paper remarked that too much abstraction and too much distancing from where the data actually is stored can be disastrously inefficient. This seems like a reasonable concern when dealing with the very largest data sets, where a bad algorithm could result in run times measured in days. For the breaking wave of the biggest data sets, programming tools will need to allow considerable control over the storage strategy, and the processing approaches, but without requiring programming using the lowest level code.

    Data warehouse organization changes in the coming decade

    The growing importance of big data analytics amounts to something between a midcourse correction and a revolution for enterprise data warehousing. New skill sets, new organizations, new development paradigms, and new technology will need to be absorbed by many enterprises, especially those facing the use cases described in this paper. Not every enterprise needs to jump into the petabyte ocean, but it is this author's prediction that the upcoming decade will see a steady growth in the percentage of large enterprises recognizing the value of big data analytics.

    Most observers would agree that big data analytics falls within "information management," but the same observers may quibble about whether this affects the "data warehouse." Rather than worrying about whether the box on the organization chart labeled EDW has responsibility for big data analytics, we take the perspective that enterprise data warehousing without the capital letters absolutely encompasses big data analytics. Having said that, there will be many different organizational structures and management perspectives as industries expand their information management. This kind of tinkering and adjusting to the new paradigm is normal and expected. We went through a very similar phase in the mid 1980s when data warehousing itself was a new paradigm for IT and the business. Many of the most successful early data warehousing initiatives started in the business organizations and were eventually incorporated into those IT organizations that then made major commitments to being business relevant. It is likely the same evolution will take place with big data analytics.

    The challenge before information managers in large enterprises is how to encourage three separate data warehouse endeavors: conventional RDBMS applications, MapReduce/Hadoop applications, and advanced analytics.

    Technical skill sets required

    It is worth repeating here the message of the very first sentence of this white paper. Petabyte scale data sets are of course a big challenge, but big data analysis is often about difficulties other than data volume. You can have fast arriving data or complex data or complex analyses which are very challenging even if all you have are terabytes of data!

    The care and feeding of RDBMS-oriented data warehouses involves a comprehensive set of skills that is pretty well understood: SQL programming, ETL platform expertise, database modeling, task scheduling, system building and maintenance skills, one or more scripting languages such as Python or Perl, UNIX or Windows operating system skills, and business intelligence tools skills. SQL, which is at the core of an RDBMS implementation, is a declarative language, which contrasts with the mindset of the procedural language skills needed for MapReduce/Hadoop programming, at least in Java. The data warehouse team also needs to have a good partnership with other areas of IT including storage management, security, networking, and support of mobile devices. Finally, good data warehousing also requires an extensive involvement with the business community, and with the cognitive psychology of end-users!
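
    The contrast between the declarative and procedural mindsets can be made concrete with a small sketch. The same regional total is computed twice: once by stating the desired result in SQL, and once by spelling out map, shuffle, and reduce steps by hand. The sample data and function names are hypothetical, and the second half is an in-process imitation of the MapReduce pattern in Python, not the Hadoop Java API.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, sale amount)
SALES = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", 1.0)]


def declarative_totals(rows):
    """Declarative: describe the result; the database decides how to compute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(
        conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    )
    conn.close()
    return result


def map_phase(rows):
    """Procedural step 1: emit a (key, value) pair for each input record."""
    for region, amount in rows:
        yield region, amount


def shuffle(pairs):
    """Procedural step 2: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Procedural step 3: collapse each group of values to a single total."""
    return {key: sum(values) for key, values in groups.items()}


declarative = declarative_totals(SALES)
procedural = reduce_phase(shuffle(map_phase(SALES)))
```

    Both paths produce the same totals. In the first, the programmer never says how grouping happens; in the second, every step is the programmer's responsibility.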

    The care and feeding of MapReduce/Hadoop data warehouses, including any of the big data analytics use cases described in this paper, involves a set of skills that only partially overlap traditional RDBMS data warehouse skills. Therein lies a significant challenge. These new skills include lower-level programming languages such as Java, C++, Ruby, Python, and MapReduce interfaces most commonly available via Java. Although the requirement to program via procedural based lower-level programming languages will be reduced significantly during the upcoming decade in favor of Pig, Hive, and HBase, it may be easier to recruit MapReduce/Hadoop application developers from the programming community rather than the data warehouse community, if the data warehouse job applicants lack programming and UNIX skills. If MapReduce/Hadoop data warehouses are managed exclusively with open source tools, then Zookeeper and Oozie skills will be needed too. Keep in mind that the open-source community innovates quickly. Hive, Pig and HBase are not the last word in high-level interfaces to Hadoop for analysis. It is likely that we will see much more innovation in this decade including entirely new interfaces.

    ETL platform providers have a big opportunity to provide much of the glue that will tie together the big data sources, MapReduce/Hadoop applications, and existing relational databases. Developers with ETL platform skills will be able to leverage a great deal of their experience and instincts in system building when they incorporate MapReduce/Hadoop applications.

    Finally, the analysts whom we have described as often working in sandbox environments will arrive with an eclectic and unpredictable set of skills starting with deep analytic expertise. For these people it is probably more important to be conversant in SAS, Matlab, or R than to have specific programming language or operating system skills. Such individuals typically will arrive with UNIX skills and some reasonable programming proficiency, and most of these people are extremely tolerant of learning new complex technical environments. Perhaps the biggest challenge with traditional analysts is getting them to rely on the other resources available to them within IT, rather than building their own extract and data delivery pipelines. This is a tricky balance because you want to give the analysts unusual freedom, but you need to look over their shoulders to make sure that they are not wasting their time.

    New organizations required

    At this early stage of the big data analytics revolution, there is no question that the analysts must be part of the business organization, both to understand the microscopic workings of the business and to be able to conduct the kind of rapid turnaround experiments and investigations we have described in this paper. As we have described, these analysts must be heavily supported in a technical sense, with potentially massive compute power and data transfer bandwidth. So although the analysts may reside in the business organizations, this is a great opportunity for IT to gain credibility and presence with the business. It would be a significant mistake and a lost opportunity for the analysts and their sandboxes to exist as rogue technical outposts in the business world without recognizing and taking advantage of their deep dependence on the traditional IT world.

    In some organizations we interviewed for this white paper, we saw separate analytic groups embedded within different business organizations, but without very much cross communication or common identity established among the analytic groups. In some noteworthy cases, this lack of an "analytic community" led to lost opportunities to leverage each other's work, to multiple groups reinventing the same approaches, and to duplicated programming efforts and infrastructure demands as the groups made separate copies of the same data.

    We recommend that a cross divisional analytics community be established, mimicking some of the successful data warehouse community building efforts we have seen in the past decade. Such a community should have regular cross divisional meetings, as well as a kind of private LinkedIn application to promote awareness of all the contacts and perspectives and resources that these individuals collect in their own investigations, and a private web portal where information and news events are shared. Periodic talks can be given, hopefully inviting members of the business community as well, and above all the analytics community needs T-shirts and mugs!

    New development paradigms required

    Even before the arrival of big data analytics, data warehousing has been transforming itself to provide more rapid response to new opportunities and to be more in touch with the business community. Some of the practices of the agile software development movement have been successfully adopted by the data warehouse community, although realistically this has not been a highly visible transformation. But, in particular, the agile development approach supports the data warehouse by being organized around small teams driven by the business, not typically by IT. An agile development effort also produces frequent tangible deliveries, deemphasizes documentation and formal development methodologies, and tolerates midcourse correction and the incremental acceptance of new requirements. The most sensitive ingredient for success of agile development projects is the personality and skills of the business leader who ultimately is in charge. The agile business leader needs to be a thoughtful and sophisticated observer of the development process and the realities of the information world. Hopefully the agile business leader is a pretty good manager as well.

    Big data analytics certainly opens the door to business involvement since the central analysis is probably done in the business environment directly. But it is unlikely that the professional analyst is the right person to be the overall agile data warehouse project leader. The agile project leader needs to be well skilled in facilitating short effective meetings, resolving issues and development choices, determining the truth of progress reports from individual developers, communicating with the rest of the organization, and getting funding for initiatives.

    Traditional data warehouse development has discovered the attractiveness of building incrementally from a modest start, but with a good architectural foundation that provides a blueprint for where future development will go. This author has described in many papers the techniques for "graceful modification" of dimensional data warehouse schemas. In a dimensionally modeled data warehouse, new measurement facts, new dimensional attributes, and even new dimensions can be added to existing data warehouse applications without changing, invalidating, or rolling over existing information delivery pipelines to the end users. Many of the use cases we have described in this paper for big data analytics suggest that new facts, new attributes, and new dimensions will routinely become available.
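
    A minimal sketch of graceful modification follows, using Python's built-in sqlite3 module. The table names, column names, and sample values are hypothetical and chosen only for illustration. An existing report query is run, a new dimensional attribute and a new measurement fact are added, and the same query is run again unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one dimension table and one fact table.
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales_fact (product_key INTEGER, quantity INTEGER);
INSERT INTO product_dim VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO sales_fact VALUES (1, 3), (1, 4), (2, 5);
""")

# An existing report that end users already depend on.
EXISTING_QUERY = """
SELECT d.name, SUM(f.quantity)
FROM sales_fact f
JOIN product_dim d ON d.product_key = f.product_key
GROUP BY d.name
ORDER BY d.name
"""
before = conn.execute(EXISTING_QUERY).fetchall()

# Graceful modification: add a new dimensional attribute and a new fact column.
conn.execute("ALTER TABLE product_dim ADD COLUMN category TEXT")
conn.execute("ALTER TABLE sales_fact ADD COLUMN sentiment_score REAL")
conn.execute("UPDATE product_dim SET category = 'hardware'")
conn.execute("UPDATE sales_fact SET sentiment_score = 0.5")

# The existing report runs unchanged and returns the same answer.
after = conn.execute(EXISTING_QUERY).fetchall()

# A new report can use the new attribute immediately.
by_category = conn.execute("""
SELECT d.category, SUM(f.quantity)
FROM sales_fact f
JOIN product_dim d ON d.product_key = f.product_key
GROUP BY d.category
""").fetchall()
```

    Because the existing query names only the columns it needs, the added columns do not disturb it. Adding a whole new dimension works the same way, by adding a new foreign key column to the fact table.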

    Integration of new data sources into a data warehouse has always been a significant challenge, since often these new data sources arrive without any thought to integration with existing data sources. This will certainly be the case with big data analytics. Again for dimensionally modeled data warehouses, this author has described techniques for incremental integration, where "enterprise dimensional attributes" are defined and planted in the dimensions of the separate data sources. We call these conformed dimensions. The development and deployment of conformed dimensions fits the agile development approach beautifully, since this kind of integration can be implemented one data source at a time, and one dimensional attribute at a time, again in a way that is nondestructive to existing applications. Please see the references section at the end of this white paper for more information on conformed dimensions.
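
    The following sketch illustrates how a single conformed attribute lets two otherwise unrelated sources be reported side by side. Everything here is an assumption made for illustration: the two sources, their keys, the attribute name region, and the helpers summarize and drill_across are hypothetical.

```python
from collections import defaultdict

# Source 1 (hypothetical order system): its own customer keys and dimension.
orders_customer_dim = {101: {"region": "EMEA"}, 102: {"region": "APAC"}}
orders_fact = [(101, 250.0), (102, 100.0), (101, 50.0)]  # (customer key, revenue)

# Source 2 (hypothetical web logs): different keys, but the same conformed
# "region" attribute, with the same domain of values, planted in its dimension.
web_visitor_dim = {
    "v-9": {"region": "EMEA"},
    "v-7": {"region": "APAC"},
    "v-3": {"region": "EMEA"},
}
web_fact = [("v-9", 12), ("v-7", 30), ("v-3", 8)]  # (visitor key, page views)


def summarize(fact_rows, dim, attribute):
    """Aggregate one source's measure by a dimension attribute."""
    totals = defaultdict(float)
    for key, measure in fact_rows:
        totals[dim[key][attribute]] += measure
    return dict(totals)


def drill_across(left, right):
    """Line up two separately computed answer sets on the conformed attribute."""
    labels = sorted(set(left) | set(right))
    return [(label, left.get(label), right.get(label)) for label in labels]


report = drill_across(
    summarize(orders_fact, orders_customer_dim, "region"),
    summarize(web_fact, web_visitor_dim, "region"),
)
```

    Each source is queried on its own and the results are merged only on the shared attribute, so a further source can be integrated later by planting the same attribute in its dimension, without touching the first two.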

    Finally, at least one organization interviewed for this white paper has taken agility to its logical extreme. Individual developers are given complete end-to-end responsibility for a project, all the way from original sourcing of the data, through experimental analysis, re-implementing the project for production use, and working with the end-users and their BI tools in supportive mode. Although this development approach remains an experiment, early results are very interesting because these developers feel a significant sense of responsibility and pride for their projects.

    Lessons from the early data warehousing era

    It took most of the 1990s for organizations to understand what a data warehouse was and how to build and manage those kinds of systems. Interestingly, at the end of the 1990s, data warehousing was effectively relabeled as business intelligence. This was a very positive development because it reflected the need for the business to own and take responsibility for the uses of data.

    The earliest data warehouse pioneers had no choice but to do their own systems integration, assembling best-of-breed components and coping with the inevitable incompatibilities that come with dealing with multiple vendors. By the end of the 1990s, the best-of-breed approach gave way to vendor stacks of integrated products, a trend which continues today. At this point, there are only a few independent vendors in the data warehouse space, and those vendors have succeeded by interfacing with nearly every conceivable format and interface, thereby providing bridges between the more limited proprietary vendor stacks.

    With the benefit of hindsight gained from the traditional data warehouse experience, the big data analytics version of data warehousing is likely to consolidate quite quickly. Only the bravest organizations with very strong software development skills should consider rolling their own big data analytics applications directly on raw MapReduce/Hadoop. For information management organizations wishing to focus on the business issues rather than on the breaking wave of software development, a packaged Hadoop distribution (e.g., Cloudera) makes a lot of sense. The leading ETL platform vendors likely will also introduce packaged environments for handling many of the phases of MapReduce/Hadoop development.

    Analytics in the cloud

    This white paper has not discussed cloud implementations of big data analytics. Most of the enterprises interviewed for this white paper were not using public cloud implementations for their production analytics. Nevertheless, cloud implementations may be very attractive in the startup phase for an analytics effort. A cloud service can provide instant scalability during this startup phase, without committing to a massive legacy investment in hardware. Data analysis projects can be turned on and turned off on short notice. Recall that typical analytic environments may involve hundreds of separate sandboxes and parallel experiments.

    Many of the organizations interviewed for this paper stated that mature analytics should be brought in-house, perhaps implemented technically as a cloud but within the confines of the organization. Of course, such an in-house cloud may reduce fears of security and privacy breaches (fairly or not).

    A remote cloud implementation raises issues of network bandwidth, especially in a broadly integrated application with multiple very large data sets in different locations. Imagine solving the big join problem where your trillion row fact table is out on the cloud, and your billion row dimension table is located in-house.

    Although the best performing systems try to achieve a three-way balance among CPU, disk speed, and bandwidth, most organizations interviewed for this paper predicted that bandwidth would emerge as the number one limiting factor for big data analytics system performance.
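
    A back-of-the-envelope calculation shows why bandwidth dominates the remote join scenario. The row counts come from the example above; the 100-byte row width and the 10 Gbit/s link speed are assumptions chosen only for illustration, and the estimate ignores compression, protocol overhead, and contention.

```python
def transfer_hours(rows, bytes_per_row, link_gbit_per_s):
    """Idealized time to move a table across a network link, in hours."""
    total_bits = rows * bytes_per_row * 8
    seconds = total_bits / (link_gbit_per_s * 1e9)
    return seconds / 3600


# Assumed values (not from the paper): 100-byte rows, 10 Gbit/s sustained link.
BYTES_PER_ROW = 100
LINK_GBIT_PER_S = 10

# Option A: ship the trillion-row fact table from the cloud to the data center.
fact_hours = transfer_hours(10**12, BYTES_PER_ROW, LINK_GBIT_PER_S)

# Option B: ship the billion-row dimension table out to the cloud instead.
dim_hours = transfer_hours(10**9, BYTES_PER_ROW, LINK_GBIT_PER_S)
```

    Under these assumptions, moving the fact table takes roughly 22 hours while moving the dimension table takes about 80 seconds, which is why the placement of the join matters as much as the join algorithm.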

    Whither EDW?

    The enterprise data warehouse must expand to encompass big data analytics as part of overall information management. The mission of the data warehouse has always been to collect the data assets of the organization and structure them in a way that is most useful to decision-makers. Although some organizations may persist with a box on the org chart labeled EDW that is restricted to traditional reporting activities on transactional data, the scope of the EDW should grow to reflect these new big data developments. In some sense there are only two functions of IT: getting the data in (transaction processing), and getting the data out. The EDW is getting the data out.

    The big choice facing shops with growing big data analytics investments is whether to choose an RDBMS-only solution, or a dual RDBMS and MapReduce/Hadoop solution. This author predicts that the dual solution will dominate, and in many cases the two architectures will not exist as separate islands but rather will have rich data pipelines going in both directions. It is safe to say that both architectures will evolve hugely over the next decade, but this author predicts that both architectures will share the big data analytics marketplace at the end of the decade.

    Sometimes when an exciting new technology arrives, there is a tendency to close the door on older technologies as if they were going to go away. Data warehousing has built an enormous legacy of experience, best practices, supporting structures, technical expertise, and credibility with the business world. This will be the foundation for information management in the upcoming decade as data warehousing expands to include big data analytics.

    Acknowledgments

    This author is grateful for Informatica's sponsoring of this white paper and for providing absolutely no "vendor bias." The opinions in this white paper are solely the responsibility of the author.

    A number of smart and knowledgeable big data practitioners made themselves available during the research phase of the white paper for interviews. These individuals provided many useful insights. In alphabetic order by organization, we thank

    Amr Awadallah, Mike Olson, Cloudera

    Brian Dolan, Discovix

    Oliver Ratzesberger, eBay

    Alex Ignatius, Electronic Arts

    William Schmarzo, EMC

    Ashish Thusoo, Facebook

    Julianna DeLua, John Haddad, Sanjay Krishnamurthi, Ron Lunasin, Informatica

    Nicholas Wakefield, LinkedIn

    Dan Graham, Dilip Krishna, Ron Kunze, Teradata

    Professor Michael Franklin, Computer Science Department, U.C. Berkeley

    Raymie Stata, Yahoo!

    Dan McCaffrey, Ken Rudin, Zynga

    References

    An Architecture for Data Quality, a Kimball Group White Paper: http://vip.informatica.com/?elqPURLPage=8784

    Essential Steps for the Integrated EDW, a Kimball Group White Paper: http://vip.informatica.com/?elqPURLPage=8785

    Hadoop, The Definitive Guide, 2nd Edition, Tom White, O'Reilly (2011)

    Hadoop-ML website: http://videolectures.net/nipsworkshops09_pednault_hmli/

    MAD Skills: New Analysis Practices for Big Data, Cohen, Dolan et al, http://db.cs.berkeley.edu/jmh/papers/madskills-032009.pdf