Page 1: Hadoop DI Benchmark

Hadoop Integration Benchmark

Product Profile and Evaluation: Talend and Informatica By William McKnight and Jake Dolezal October 2015

Sponsored by


MCG Global Services Hadoop Integration Benchmark

© MCG Global Services 2015 www.mcknightcg.com Page 2

Executive Summary

Hadoop has become increasingly prevalent on the information management landscape over the past decade by empowering digital strategies with large-scale possibilities. While Hadoop helps data-driven organizations (or those who desire to be) overcome the storage and processing scale and cost obstacles of ever-increasing volumes of data, it does not, in and of itself, solve the challenge of high-performance data integration: an activity that consumes an estimated 80% or more of the time spent getting actionable insights from big data. Modern data integration tools were built in a world abounding with structured data, relational databases, and data warehouses. The big data and Hadoop paradigm shift has changed and disrupted some of the ways we derive business value from data. Unfortunately, the data integration tool landscape has lagged behind in this shift. Early adopters of big data for their enterprise architecture have only recently found some variety and choice in data integration tools and capabilities to accompany their increased data storage capabilities.

Hadoop has evolved greatly since its early days of merely batch-processing large volumes of data in an affordable and scalable way. Numerous approaches and tools have arisen to elevate Hadoop's capabilities and potential use cases. One of the more exciting new tools is Spark, a multi-stage, in-memory processing engine that offers high performance and much faster speeds than its disk-based predecessors.

Even while reaching out to grasp all these exciting capabilities, companies still have their feet firmly planted in the old paradigm of relational, structured, OLTP systems that run their day-in, day-out business. That world is, and will be, around for a long time. The key, then, is to marry capabilities and bring these two worlds together. Data integration is that key: bringing together the transactional and master data from traditional SQL-based relational databases and the big data from a vast array and variety of sources. Many data integration vendors have recognized this and have stepped up to the plate by introducing big data and Hadoop capabilities to their toolsets. The idea is to give data integration specialists the ability to harness these tools just as they would the traditional sources and transformations they are used to. With many vendors throwing their hat into the big data arena, it will be increasingly challenging to identify and select the right tool. The key differentiators to watch will be the depth to which a tool leverages Hadoop and the performance of integration jobs. As the volume of data to be integrated expands, so too will the processing times of integration jobs. This could spell the difference between a "just-in-time" answer to a business question and a "too-little-too-late" result.


The benchmark presented here focuses on these two differentiators. The vendors chosen for this benchmark are Talend and Informatica. The benchmark was designed to simulate a set of basic scenarios that answer fundamental business questions an organization from nearly any industry sector might encounter and ask, as opposed to some niche scenario or an "academic exercise" use case. In doing so, we hoped to identify not only performance but utility as well. The results of the benchmark show Spark to be the real winner, evidenced by the nearly 8x performance advantage in the largest-scale scenario. Based on the trends, we also anticipate the gap will widen even further with larger data sets and larger cluster environments. In addition, the benchmark reveals the advantage of leveraging Spark directly through the integration tool, as opposed to attempting to do so through another medium (Hive, in this case), which proved futile due to lack of support even in enterprise distributions of Hadoop.


About the Authors

William is President of McKnight Consulting Group Global Services (www.mcknightcg.com). He is an internationally recognized authority in information management. His consulting work has included many of the Global 2000 and numerous midmarket companies. His teams have won several best practice competitions for their implementations, and many of his clients have gone public with their success stories. His strategies form the information management plan for leading companies in various industries. William is author of the book "Information Management: Strategies for Gaining a Competitive Advantage with Data". William is a very popular speaker worldwide and a prolific writer with hundreds of articles and white papers published. William is a distinguished entrepreneur, and a former Fortune 50 technology executive and software engineer. He provides clients with strategies, architectures, platform and tool selection, and complete programs to manage information.

Jake Dolezal has over 16 years of experience in the information management field, with expertise in business intelligence, analytics, data warehousing, statistics, data modeling and integration, data visualization, master data management, and data quality. Jake has experience across a broad array of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and gaming. Jake earned his doctorate in information management from Syracuse University, consistently the #1-ranked graduate school for information systems by U.S. News and World Report. He is also a Certified Business Intelligence Professional through TDWI with an emphasis in data analysis. In addition, he is a certified leadership coach and has helped clients accelerate their careers and earn several executive promotions. Jake is Practice Lead at McKnight Consulting Group Global Services.


About MCG Global Services

With a client list that is the "A list" of complex, politically sustainable, and successful information management, MCG has broad information management market touchpoints. Our advice is an infusion of the latest best practices culled from recent, personal experience. It is practical, not theoretical. We anticipate our customers' needs well into the future with our full lifecycle approach. Our focused, experienced teams generate efficient, economic, timely, and politically sustainable results for our clients.

• We take a keen focus on business justification.
• We take a program, not a project, approach.
• We believe in a model of blending with client staff, and we take a focus on knowledge transfer.
• We engineer client workforces and processes to carry forward.
• We're vendor neutral, so you can rest assured that our advice is completely client oriented.
• We know, define, judge, and promote best practices.
• We have encountered and overcome most conceivable information management challenges.
• We ensure business results are delivered early and often.

MCG services span strategy, implementation, and training for turning information into the asset it needs to be for your organization. We strategize, design, and deploy in the disciplines of Master Data Management, Big Data, Data Warehousing, Analytic Databases, and Business Intelligence.

Hadoop and Data Integration

This benchmark is part of research into the performance of loads on Hadoop clusters, an increasingly important platform for storing the data powering digital strategies. Hadoop usage has evolved since the early days, when the technology was invented to make batch-processing big data affordable and scalable. Today, with a lively community of open-source contributors and vendors innovating a plethora of tools that natively support Hadoop components, usage and data are expanding. Data lakes embody the idea of all data stored natively in Hadoop.

Traditionally, data preparation has consumed an estimated 80% of analytic development efforts. One of the most common uses of Hadoop is to drive this analytic overhead down. Data preparation can be accomplished through a traditional ETL process: extracting data from sources, transforming it (cleansing, normalizing, integrating) to meet the requirements of the data warehouse or downstream repositories and apps, and loading it into those destinations. As in the relational database world, many organizations prefer ELT processes, where higher performance is achieved by performing transformations after loading. Instead of burdening the data warehouse with this processing, however, you do the transforms in Hadoop. This yields high-performance, fault-tolerant, elastic processing without detracting from query speeds.

In Hadoop environments, you also need massive processing power because transformations often involve integrating very different types of data from a multitude of sources. Your analyses might encompass data from ERP and CRM systems, in-memory analytic environments, and internal and external apps via APIs. You might want to blend and distill data from customer master files with clickstream data stored in clouds and social media data from your own NoSQL databases or accessed from third-party aggregation services. You might want to examine vast quantities of historical transactional data along with data streaming in real time from consumer transactions or machines, sensors, and chips.

Hadoop can encompass all of this structured, unstructured, semi-structured, and multi-structured data because it allows schema-less storage and processing. When bringing data into Hadoop, there is no need to declare a data model or make associations with target applications. Instead, loose binding is used to apply or discover a data schema after transformation, when the data is written to the target production application or specialized downstream analytic repository. This separation of data integration processes from the run-time engine makes it possible to:

• Share and reuse transforms across data integration projects
• View data in different ways and expose new data elements by applying different schema
• Enhance data sets by adding new dimensions as new information becomes available or additional questions emerge

There are innumerable sources of big data, with more springing up all the time. Since being limited to low-performing open-source Sqoop, Flume, command-line HDFS, and Hive in the early days, numerous approaches and tools have arisen to meet the Hadoop data integration challenge. While much has been studied on querying the data in Hadoop, the same cannot be said for getting the data into Hadoop clusters. Many enterprises are bringing the features and functions they are used to in the relational world into the Hadoop world, as well as the non-functional requirements, developer productivity, metadata management, documentation, and business rules management you get from a modern data integration tool. The vendors benchmarked for this report are Talend and Informatica.
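The loose-binding idea described above can be sketched in a few lines: raw log lines are stored untouched, and a schema is applied only at read time, so the same stored bytes can back different views. This is a minimal Python illustration with made-up sample entries, not any vendor's implementation.

```python
import re

# Raw events land in storage untouched -- no schema is declared at load time.
raw_events = [
    '249.225.125.203 - anonymous [01/Jan/2015:16:02:10 -0700] "GET /index.php HTTP/1.0" 200 2326',
    '154.3.64.53 - anonymous [02/Jan/2015:06:03:09 -0700] "GET /product.php HTTP/1.0" 404 512',
]

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

def read_with_schema(lines, fields):
    """Apply a schema at read time: project only the fields this view needs."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            yield {f: m.group(f) for f in fields}

# Two different views bound late over the same stored bytes:
traffic_view = list(read_with_schema(raw_events, ("ip", "path")))
error_view = list(read_with_schema(raw_events, ("status", "path")))
```

The point of the sketch is that no "view" existed when the data was written; each projection is discovered at read time by the consumer that needs it.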

Page 7: Hadoop DI Benchmark

MCG Global Services Hadoop Integration Benchmark

© MCG Global Services 2015 www.mcknightcg.com Page 7

Talend Product Profile

Company Profile

Product Name: Talend Big Data Integration
Initial Launch: 2012
Current Release and Date: Version 6.0, shipped September 2015
Key Features: Native support for Apache Spark and Spark Streaming leveraging over 100 Spark components, one-click conversion of MapReduce jobs to Spark, support for Continuous Delivery, Master Data Management (MDM) REST API and Query Language support, Data Masking, and Semantic Analytics
Hadoop DI Competitors: Informatica, Pentaho, Syncsort
Company Founded: 2006
Focus: Highly scalable integration solutions addressing Big Data, Application Integration, Data Integration, Data Quality, MDM, BPM
Employees: 500
Headquarters: Redwood City, CA
Ownership: Private


Informatica Product Profile

Company Profile

Product Name: Informatica Big Data Edition
Initial Launch: 2012
Current Release and Date: Version 9.6.1, shipped 2015
Key Features: All functions on Hadoop, Universal Data Access, High-Speed Data Ingestion and Extraction, Real-Time Data Collection and Streaming, Data Discovery on Hadoop, Natural Language Processing on Hadoop, Attunity Replicate fully embedded; additional GUI improvements; more target platforms
Hadoop DI Competitors: Talend, Pentaho, Syncsort, Tervela
Founded: 1998
Focus: World's number one independent provider of data integration software.
Employees: 5,500+
Headquarters: Redwood City, CA
Ownership: Private (previously public)


Benchmark Overview

The intent of the benchmark's design was to simulate a set of basic scenarios to answer some fundamental business questions that an organization from nearly any industry sector might encounter and ask. The common business questions formulated for the benchmark are as follows:

• What impact do customers' views of pages and products on our website have on sales? How many page views occur before they make a purchase decision (whether online or in-store)?

• How do our coupon promotional campaigns impact our product sales or service utilization? Do customers who view or receive our coupon promotion come to our website and buy more or additional products they might not otherwise buy without the coupon?

• How much does our recommendation engine influence or drive product sales? Do customers tend to buy additional products based on these recommendations?

Our experience working with clients from many industry sectors over the past decade tells us these are not niche use cases, but common, everyday business questions shared by many companies. The benchmark was designed to demonstrate how a company might approach answering these business questions by bringing different sources of information into play. We have also taken the opportunity to show how Hadoop can be leveraged, because some of the data of interest in this analytical case is likely of a large volume and non-relational or semi- to unstructured in nature. In these cases, using Hadoop would likely be the course of action we would recommend to a client seeking to answer these questions. The benchmark was also set up as a data integration benchmark, because it is highly probable that the required data resides in different sources. Some of these sources are also probably not being consumed and aggregated into an enterprise data warehouse, due to the high volume and difficulty of integrating voluminous amounts of semi-structured data into a traditional data warehouse. Thus, the benchmark was designed to mimic a common scenario and the challenges faced by organizations seeking to integrate data to address these and similar business questions.


Benchmark Setup

The benchmark was executed using the following setup, environment, standards, and configurations.

Server Environment

Figure 1: Server Environment and Setup

Hadoop Distribution: Cloudera CDH 5.4.7
EC2 Instance: Memory-optimized r3.xlarge (4 vCPUs, 30.5 GB memory, 200 GB storage)
OS: Ubuntu LTS 14.04 (Trusty Tahr)
Source Data Types: Text-based log files and a relational database
Data Volume: 20 GB (log files) and 12,000,000 rows (RDBMS)
TPC-H Scale Factor: 0.125x to 2x
RDBMS: MySQL 5.6
Java Version: 1.6.0_87


The benchmark was set up using Amazon Web Services (AWS) EC2 instances deployed into an AWS Virtual Private Cloud (VPC) within the same placement group. According to Amazon, all instances launched within a placement group have low-latency, full-bisection, 10 Gigabit per second bandwidth between instances. The first EC2 instance was a memory-optimized r3.xlarge with four vCPUs and 30.5 GB of RAM. A storage volume of 200 GB was attached and mounted on the instance. This Linux instance was running Ubuntu LTS 14.04 (nicknamed Trusty Tahr). On this instance, we deployed Cloudera CDH 5.4.7 as our Hadoop distribution, along with MySQL 5.6 to serve as a source relational database. The second EC2 instance was an r3.xlarge machine running Windows Server 2012. The Windows instance was installed with the Talend Studio and Informatica Big Data Edition (BDE) developer suites.

NOTE: In order to run Informatica BDE, a third EC2 instance was used to run the domain and repository services required for Informatica. It was a Red Hat Linux machine and was used only to serve as the Informatica repository service. It was not involved in processing of data or execution of the benchmark.

The relational source for the benchmark (stored in MySQL 5.6 on the Linux instance) was constructed using the Transaction Processing Performance Council TPC Benchmark H (TPC-H) Revision 2.17.1 Standard Specification. The TPC-H database was constructed to mimic a real-life point-of-sale system according to the entity-relationship diagram and the data type and scale specifications provided by the TPC-H. The database was populated by scripts that were seeded with random numbers to create the mock data set. The TPC-H specifications have a scale factor by which the record count for each table is derived. For this benchmark, we selected a maximum scale factor of 2. In this case, the TPC-H database contained 3 million records in the ORDERS table and 12 million records in the LINEITEM table.
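As a rough illustration of how a seeded script can populate such a mock data set, the sketch below derives per-table record counts from the chosen scale factor (using the counts reported later in Table 1) and emits repeatable random line items. The generator and key format are hypothetical stand-ins for the benchmark's actual scripts.

```python
import random

# Base row counts at TPC-H scale factor 1.0, per Table 1 of this benchmark.
BASE_ROWS = {"CUSTOMERS": 150_000, "ORDERS": 1_500_000, "LINEITEM": 6_000_000}

def row_counts(scale_factor):
    """Derive per-table record counts from the chosen scale factor."""
    return {table: int(n * scale_factor) for table, n in BASE_ROWS.items()}

def mock_lineitems(scale_factor, seed=42):
    """Yield (orderkey, partkey) pairs from a seeded RNG, so runs repeat exactly."""
    rng = random.Random(seed)
    n = row_counts(scale_factor)["LINEITEM"]
    for orderkey in range(n):
        yield orderkey, f"Q{rng.randint(10_000, 99_999)}"

counts = row_counts(2.0)  # the benchmark's maximum scale factor
```

Seeding the generator is what makes the data set reproducible across the five scale factors: the same seed yields the same "random" rows every run.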


Relational Source

Figure 2: TPC-H ER Diagram © 1993-2014 Transaction Processing Performance Council

Log File Sources

The other three data sources used in the integration routines for the benchmark were created as ASCII text-based, semi-structured log files. Three log files were created to mimic real-life log files for three cases:

1. Web-click log

2. Coupon log

3. Recommendation engine log

Web-click Log

A web-click log was generated in the same fashion as a standard Apache web server log file. The log file was generated using scripts to simulate two types of entries: 1) completely random page views (seeded by random numbers) and 2) web-clicks that correspond to actual page views of ordered products (seeded by random records in the TPC-H ORDERS and LINEITEM tables). The "dummy" or "noise" web log entries appeared in a variety of possibilities but followed a format consistent with an Apache web-click log entry. All data were randomly selected. The dates were restricted to the year 2015. For example:

249.225.125.203 - anonymous [01/Jan/2015:16:02:10 -0700] "GET /images/footer-basement.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/index.php" "Windows NT 6.0"
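The noise/signal split described above can be sketched as follows; the helper names and field choices are illustrative, not the benchmark's actual generator. A "noise" entry is random throughout, while a "signal" entry embeds a real part key on the order date.

```python
import random

rng = random.Random(7)

NOISE_PATHS = ["/images/footer-basement.png", "/images/side-ad.png", "/index.php"]

def random_ip():
    """A random dotted-quad address for either kind of entry."""
    return ".".join(str(rng.randint(1, 254)) for _ in range(4))

def noise_entry():
    """A 'dummy' entry: every field is random, tied to no order."""
    return (f'{random_ip()} - anonymous [01/Jan/2015:16:02:10 -0700] '
            f'"GET {rng.choice(NOISE_PATHS)} HTTP/1.0" 200 2326')

def signal_entry(partkey, orderdate):
    """A 'signal' entry: the referrer carries a real LINEITEM part key,
    timestamped on that order's O_ORDERDATE."""
    return (f'{random_ip()} - anonymous [{orderdate}:06:03:09 -0700] '
            f'"GET /images/side-ad.png HTTP/1.0" 200 2326 '
            f'"http://www.acmecompany.com/product-search.php?partkey={partkey}"')

line = signal_entry("Q44271", "02/Jan/2015")
```

Only the embedded part key and date distinguish signal from noise, which is exactly what the integration routines later have to recover.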

The "signal" web log entries that corresponded to (and were seeded with) actual ORDERS and LINEITEM records had the same randomness as the "dummy" entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records representing a page view of an actual ordered item on the same day as the order. The highlighted segments represent those that potentially correspond to actual orders:

154.3.64.53 - anonymous [02/Jan/2015:06:03:09 -0700] "GET /images/side-ad.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/product-search.php?partkey=Q44271" "Android 4.1.2"

The web-click log file contained 26,009,338 entries and was 5.1 GB in size. There was one web-click entry for a random LINEITEM record for each and every ORDERS record (3,000,000 total). Therefore, 12% of the web-click log entries corresponded to orders. The rest of the entries were random.

Coupon Log

A coupon log was generated in the same fashion as a customized Apache web server log file. The coupon log was designed to mimic a special-case log file generated whenever a potential customer viewed an item because of a click-through from a coupon ad campaign. Again, the log file was generated using scripts to simulate two types of entries: 1) completely random page views (seeded by random numbers) and 2) page views that correspond to actual page views of ordered products by actual customers via the coupon ad campaign (seeded by random records in the TPC-H ORDERS, LINEITEM, and CUSTOMERS tables). The "dummy" or "noise" coupon log entry data were randomly selected. The dates were restricted to the year 2015. The "signal" coupon log entries that corresponded to (and were seeded with) actual ORDERS and LINEITEM records had the same randomness as the "dummy" entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records representing a page view of an actual ordered item on the same day as the order. The highlighted segments represent those that potentially correspond to actual orders:

49.243.50.31 - anonymous [01/Jan/2015:18:28:14 -0700] "GET /images/header-logo.png HTTP/1.0" 200 75422 "http://www.acmecompany.com/product-view.php?partkey=S22211" "https://www.coupontracker.com/campaignlog.php?couponid=LATEWINTER2015&customerid=C019713&trackingsnippet=LDGU-EOEF-LONX-WRTQ" "Windows Phone OS 7.5"


The coupon log file contained 20,567,665 entries and was 6.6 GB in size. There was one coupon entry for a random LINEITEM record for each and every ORDERS record (3,000,000 total). Therefore, 15% of the coupon log entries corresponded to orders. The rest of the entries were random.

Recommendation Log

A recommendation engine log was generated as an XML-tagged output log file. The recommendation log was designed to mimic whenever a potential customer viewed an item because of a click-through from a recommended item on the current item page. In this case, the company's website recommended 5 additional or related items for every item viewed. Again, the log file was generated using scripts to simulate two types of entries: 1) "dummy" or completely random page views and random recommendations (seeded by random numbers) and 2) page views that correspond to actual page views of ordered products via the recommendations (seeded by random records in the TPC-H ORDERS and LINEITEM tables). The "dummy" or "noise" recommendation log entry data were randomly selected. The dates were restricted to the year 2015. The "signal" recommendation log entries that corresponded to (and were seeded with) actual ORDERS and LINEITEM records had the same randomness as the "dummy" entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records representing a page view of an actual ordered item on the same day as the order. The highlighted segments represent those that potentially correspond to actual orders:

<itemview>
  <cartid>326586002341098</cartid>
  <timestamp>01/01/2015:16:53:07</timestamp>
  <itemid>Q48769</itemid>
  <referreditems>
    <referreditem1>
      <referreditemid1>R33870</referreditemid1>
    </referreditem1>
    <referreditem2>
      <referreditemid2>P19289</referreditemid2>
    </referreditem2>
    <referreditem3>
      <referreditemid3>R28666</referreditemid3>
    </referreditem3>
    <referreditem4>
      <referreditemid4>P72469</referreditemid4>
    </referreditem4>
    <referreditem5>
      <referreditemid5>S23533</referreditemid5>
    </referreditem5>
  </referreditems>
  <sessionid>CGYB-KWEH-YXQO-XXUG</sessionid>
</itemview>

The recommendation log file contained 13,168,336 entries and was 8.3 GB in size. There was one recommendation entry for a random LINEITEM record for each and every ORDERS record (3,000,000 total). Therefore, 23% of the recommendation log entries corresponded to orders. The rest of the entries were random.
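Downstream, each itemview entry has to be flattened before it can be joined to the TPC-H data. A minimal sketch using Python's standard ElementTree parser (the benchmark tools do this with their own XML components):

```python
import xml.etree.ElementTree as ET

# One recommendation-log entry, in the XML shape shown above (trimmed to two
# recommended items for brevity).
entry = """<itemview>
  <cartid>326586002341098</cartid>
  <timestamp>01/01/2015:16:53:07</timestamp>
  <itemid>Q48769</itemid>
  <referreditems>
    <referreditem1><referreditemid1>R33870</referreditemid1></referreditem1>
    <referreditem2><referreditemid2>P19289</referreditemid2></referreditem2>
  </referreditems>
  <sessionid>CGYB-KWEH-YXQO-XXUG</sessionid>
</itemview>"""

def parse_itemview(xml_text):
    """Flatten one <itemview> into the viewed item plus its recommended items."""
    root = ET.fromstring(xml_text)
    recommended = [el.text for el in root.find("referreditems").iter()
                   if el.tag.startswith("referreditemid")]
    return root.findtext("itemid"), recommended

item, recs = parse_itemview(entry)
```

Each recommended item id becomes a candidate join key against LINEITEM.L_PARTKEY, which is why Use Case 3 performs the part-key lookup five times per entry.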


Scale Factor

Each of the data sources described above was also scaled to different scale factors, so the integration routines (described in the next section) could be executed against data sources of various sizes. The TPC-H database and log files were scaled in the following manner:

Scale Factor                 2.0         1.0         0.5        0.25       0.125
LINEITEM records             12,000,000  6,000,000   3,000,000  1,500,000  750,000
ORDERS records               3,000,000   1,500,000   750,000    375,000    187,500
CUSTOMERS records            300,000     150,000     75,000     37,500     18,750
Log Size Total               20 GB       10 GB       5 GB       2.5 GB     1.25 GB
Web-click Log Size           5.1 GB      2.6 GB      1.3 GB     638 MB     319 MB
Web-click Log Entries        26,009,338  13,004,669  6,502,335  3,251,167  1,625,584
Coupon Log Size              6.6 GB      3.3 GB      1.7 GB     825 MB     413 MB
Coupon Log Entries           20,567,665  10,283,833  5,141,916  2,570,958  1,285,479
Recommendation Log Size      8.3 GB      4.2 GB      2.1 GB     1.0 GB     519 MB
Recommendation Log Entries   13,168,336  6,584,168   3,292,084  1,646,042  823,021

Table 1: TPC-H Database Record Counts and Log Files at Different Scale Factors

Integration Routines

The big data use case of the benchmark was designed to demonstrate real-life scenarios where companies desire to integrate data from their transactional systems with unstructured and semi-structured big data. The benchmark demonstrates this by executing routines that integrate the TPC-H relational source data with the individual log files described above. The following integration routines were created for the benchmark. In both cases, best practices were observed to optimize the performance of each job. For example, there were no data type conversions, dates were treated as strings, and the sizes of data types were minimized.

Informatica Workflow

The following diagram represents the workflow that was created in Informatica Developer.


Figure 3: Overview of Informatica Workflow

Informatica Big Data Edition offers two ways of interacting with Hadoop: via Hive or directly with the Hadoop File System (HDFS). The first part was the extraction of the MySQL data using the exact same query against MySQL as with the Talend job. This data was stored in Hadoop using Hive. Next, the web, coupon, and recommendation log files were parsed using the same regular expressions as with the Talend jobs and loaded directly into HDFS. The third part consisted of the integration of the TPC-H data with the log files using Informatica's Lookup transformation. The resulting delimited output files were also stored directly on HDFS. These routines are described in greater detail in the next section.

Note: At the time of the benchmark, Informatica BDE did not have a MySQL connector included in the package we installed. A generic ODBC connector was used instead. This did not seem to have an overall impact on the performance of the integration job.


Talend Job Design

The following diagram represents the job design that was created in Talend Studio.

Figure 4: Overview of Talend Job Design

Talend Studio 6 offers a new and powerful capability of interacting with Spark on Hadoop, in addition to the conventional approach via Hive or directly with the Hadoop File System (HDFS). The first part was the extraction of the MySQL data using the following query against MySQL:

SELECT L_ORDERKEY, L_PARTKEY, O_CUSTKEY, O_ORDERDATE
FROM LINEITEM
LEFT OUTER JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY;
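The extraction step can be reproduced at toy scale; the sketch below runs the same query against an in-memory SQLite stand-in for the MySQL source, with the tables reduced to just the columns the query touches.

```python
import sqlite3

# Minimal TPC-H-shaped tables (SQLite stand-in for the MySQL source).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ORDERS (O_ORDERKEY INTEGER, O_CUSTKEY INTEGER, O_ORDERDATE TEXT);
CREATE TABLE LINEITEM (L_ORDERKEY INTEGER, L_PARTKEY TEXT);
INSERT INTO ORDERS VALUES (1, 101, '2015-01-02');
INSERT INTO LINEITEM VALUES (1, 'Q44271'), (1, 'S22211');
""")

# The benchmark's extraction query: every line item joined (left outer)
# to its parent order, carrying the customer key and order date along.
rows = con.execute("""
    SELECT L_ORDERKEY, L_PARTKEY, O_CUSTKEY, O_ORDERDATE
    FROM LINEITEM LEFT OUTER JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY
""").fetchall()
```

The left outer join keeps every line item even if an order row were missing, so the extract is exactly one row per LINEITEM record.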

This data was stored in Spark. Second, the web, coupon, and recommendation log files were parsed using regular expressions and loaded directly into Spark. The third part consisted of the integration of the TPC-H data with the log files using Talend's tMap transformation. The resulting delimited files were stored directly on HDFS.

Informatica and Spark

We considered and even attempted running the benchmark using Informatica BDE on Spark via Hive. We were able to configure Hive to leverage Spark for execution using the Cloudera CDH administration tools. However, there were significant inconsistencies in execution, making this usage unstable. Cloudera provides the following disclaimer regarding the use of Hive on Spark:

Important: Hive on Spark is included in CDH 5.4 but is not currently supported or recommended for production use. If you are interested in this feature, try it out in a test environment until we address the issues and limitations needed for production-readiness.


Therefore, this method was deemed unstable and an unsuitable use case at the time of the benchmark.

Benchmark Results

Use Case 1: E-Influence

The goal of the first use case for the benchmark was to prepare a data set that correlates products ordered with page views on the e-commerce website. The integration jobs were written to map the page views to products ordered. The following diagram is a conceptual mapping of the integration. In Informatica, the Lookup transformation was used to link the web-click log with records in the TPC-H database using the field PARTKEY. In Talend, the tMap transformation was used.

Figure 5: E-Influence Mapping

Execution Times

The following table lists the execution times of the E-Influence (web-click) mapping in seconds.

Web-click (Use Case 1)    SF 2.0  SF 1.0  SF 0.5  SF 0.25  SF 0.125
Informatica - MapReduce   5,221   1,977   622     308      111
Talend - Spark            579     383     257     148      88

Table 2: E-Influence Mapping Execution Times
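Conceptually, both the Lookup and tMap steps behave like a hash lookup keyed on PARTKEY: cache the TPC-H extract, probe it with each parsed log row, and keep only the matches. A minimal in-memory sketch, with illustrative data and field names rather than either vendor's actual structures:

```python
# Hypothetical stand-in for the Lookup/tMap step: the TPC-H extract is keyed
# by part key, and each parsed web-click row probes that cache.
tpch_extract = {
    "Q44271": {"orderkey": 1, "custkey": 101, "orderdate": "02/Jan/2015"},
    "S22211": {"orderkey": 2, "custkey": 102, "orderdate": "05/Feb/2015"},
}

parsed_clicks = [
    {"partkey": "Q44271", "date": "02/Jan/2015"},  # signal: matches an order
    {"partkey": "Z99999", "date": "03/Jan/2015"},  # noise: no matching part
]

def e_influence(clicks, lookup):
    """Join page views to orders on PARTKEY; drop the noise entries."""
    for click in clicks:
        order = lookup.get(click["partkey"])
        if order is not None:
            yield {**click, **order}

matched = list(e_influence(parsed_clicks, tpch_extract))
```

At benchmark scale the lookup side holds millions of keys, which is why the conclusion notes that MapReduce's lookup data and index caches become the bottleneck.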


Use Case 2: Coupon Influence

The objective of the second use case for the benchmark was to prepare a data set that correlates products ordered with a coupon advertisement campaign. The integration jobs were written to map the coupon-ad-related page views and customers to products ordered. The following diagram is a conceptual mapping of the integration. In Informatica, the Lookup transformation was used to link the coupon log with records in the TPC-H database using the fields PARTKEY and CUSTKEY. In Talend, the tMap transformation was used.

Figure 6: Coupon Influence Mapping

Execution Times

The following table lists the execution times of the Coupon Influence mapping in seconds.

Coupon (Use Case 2)       SF 2.0  SF 1.0  SF 0.5  SF 0.25  SF 0.125
Informatica - MapReduce   8,573   3,197   1,210   693      396
Talend - Spark            882     478     331     202      169

Table 3: Coupon Mapping Execution Times


Use Case 3: Recommendation Influence

The intent of the third use case for the benchmark was to prepare a data set that correlates products ordered with the influence of a recommendation engine. The integration jobs were written to map the recommended products to products ordered. The following diagram is a conceptual mapping of the integration. In Informatica, the Lookup transformation was used to link the recommendation log with records in the TPC-H database using the field PARTKEY five times. In Talend, the tMap transformation was used.

Figure 7: Recommendation Influence Mapping

Execution Times

The following table lists the execution times of the Recommendation Influence mapping in seconds.

Recommendation (Use Case 3)  SF 2.0  SF 1.0  SF 0.5  SF 0.25  SF 0.125
Informatica - MapReduce      14,973  6,544   2,802   1,272    590
Talend - Spark               1,313   701     473     236      188

Table 4: Recommendation Mapping Execution Times


Results Summary

The following tables present the complete result set. All execution times are displayed in seconds (unless otherwise noted).

Scale Factor Summary     SF 2.0      SF 1.0     SF 0.5     SF 0.25    SF 0.125
TPC-H LINEITEM Records   12,000,000  6,000,000  3,000,000  1,500,000  750,000
Log File Size (Total)    20 GB       10 GB      5 GB       2.5 GB     1.25 GB

Informatica - MapReduce Execution Times

                                     SF 2.0  SF 1.0  SF 0.5  SF 0.25  SF 0.125
Load TPC-H (MySQL) into Hadoop       859     711     349     164      95
Load Web Log into Hadoop             388     360     76      43       22
Load Coupon Log into Hadoop          504     404     111     58       32
Load Recommendation Log into Hadoop  578     442     129     65       33
Web-Click (Use Case 1)               5,221   1,977   622     308      111
Coupon (Use Case 2)                  8,573   3,197   1,210   693      396
Recommendation (Use Case 3)          14,973  6,544   2,802   1,272    590
TOTAL                                31,096  13,635  5,299   2,603    1,279
TOTAL (Minutes)                      518.3   227.3   88.3    43.4     21.3
TOTAL (Hours)                        8.6     3.8     1.5     0.7      0.4

Talend - Spark Execution Times

                                     SF 2.0  SF 1.0  SF 0.5  SF 0.25  SF 0.125
Load TPC-H (MySQL) into Spark        748     642     302     143      82
Load Web Log into Spark              158     79      33      21       12
Load Coupon Log into Spark           213     123     59      33       17
Load Recommendation Log into Spark   204     117     55      31       16
Web-Click (Use Case 1)               579     383     257     148      88
Coupon (Use Case 2)                  882     478     331     202      169
Recommendation (Use Case 3)          1,313   701     473     236      188
TOTAL                                4,097   2,523   1,510   814      572
TOTAL (Minutes)                      68.3    42.1    25.2    13.6     9.5
TOTAL (Hours)                        1.1     0.7     0.4     0.2      0.2

Table 5: Execution Summary


The following chart compares the total execution times of the Informatica workflow and the Talend jobs to complete all loading, transformations, and file generation. Regression lines were added.

              SF 0.125  SF 0.25  SF 0.5  SF 1.0  SF 2.0
Times Faster  2.2x      3.2x     3.5x    5.4x    7.6x

Figure 8: Comparison of Total Execution Times
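The "times faster" figures follow directly from the totals in Table 5: dividing the MapReduce total by the Spark total at each scale factor reproduces the chart's ratios.

```python
# Total execution times in seconds, from Table 5.
informatica_mapreduce = {0.125: 1_279, 0.25: 2_603, 0.5: 5_299,
                         1.0: 13_635, 2.0: 31_096}
talend_spark = {0.125: 572, 0.25: 814, 0.5: 1_510,
                1.0: 2_523, 2.0: 4_097}

# Speedup of Spark over MapReduce at each scale factor, rounded as in Figure 8.
speedup = {sf: round(informatica_mapreduce[sf] / talend_spark[sf], 1)
           for sf in informatica_mapreduce}
```

The ratio grows monotonically with scale factor, which is the widening-gap trend the report projects onto still larger data sets.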

Conclusion

Leveraging Spark directly gives a clear advantage in the ability to process increasingly larger data sets. When the size of the data sets was relatively small (i.e., less than 3 million transactional records and 5 GB of log files), the difference in execution time was negligible, because both tools were able to complete the jobs in less than an hour. However, once the scale factor of 0.5 was reached, the difference became impactful. At a scale factor of 2.0, the Informatica job took over 8 hours to complete. Increasing execution times will be frustrating for analysts and data scientists who are eager to conduct their data experiments and advanced analytics, but are left to wait a day for their target data sets to be integrated. The addition of complexities in the mappings (multiple lookup keys) seemed to further exacerbate the concern. By using MapReduce, the Lookup transformation digs a large performance hole for itself as the lookup data cache and index cache get overwhelmed.

By leveraging the in-memory capabilities of Spark, users can integrate data sets at much faster rates. Spark uses fast Remote Procedure Calls for efficient task dispatching and scheduling. It also leverages a thread pool for execution of tasks, rather than a pool of Java Virtual Machine processes. This enables Spark to schedule and execute tasks at rates measured in milliseconds, whereas MapReduce scheduling takes seconds, and sometimes minutes, in busy clusters.

Spark through Hive would not be an approach for achieving the Spark results shown here; execution results will be inconsistent at best. For a vendor, developing a good use of Spark is not dissimilar to developing the product initially: it takes years and millions of dollars. Utilizing the full capability of Spark for data integration with Hadoop is a winning approach.