Hadoop Data Integration Benchmark Product Profile and Evaluation: RedPoint Data Management for Hadoop By William McKnight and Jake Dolezal August 2016 Sponsored by RedPoint Global Inc.

Hadoop Data Integration Benchmark

Product Profile and Evaluation:

RedPoint Data Management for Hadoop

By William McKnight and Jake Dolezal August 2016 Sponsored by RedPoint Global Inc.


MCG Global Services Hadoop Integration Benchmark

© MCG Global Services 2016 www.mcknightcg.com Page 2

Table of Contents

Executive Summary

Hadoop in the Enterprise

The Evolution of Hadoop Data Integration

RedPoint Product Profile
    Company Profile

Benchmark Overview

Benchmark Setup
    Virtual Server Environment
    RedPoint Instances
    Relational Database Instance
    Hadoop Cluster
    Source Data
    Relational Data Source
    Web-Click Log
    Coupon Log
    Name and Address CSV File
    Data Volume
    Data Management Jobs
    Web-Coupon Log on Hadoop Join with Orders Job Design
    Address Standardization Job Design
    Name Matching Job Design

Benchmark Results
    Use Case 1: Web-Coupon Log on Hadoop Join with Orders
    Execution Time and Actual-Versus-Expected Results
    Vendor Comparison
    Use Cases 2 and 3: Address Standardization and Name Matching
    Execution Time and Actual-Versus-Expected Results
    Perceived Usability Assessment

Conclusion

About MCG Global Services

About RedPoint Global


Executive Summary

This benchmark is part of research into the performance of loads on Hadoop clusters, an increasingly important platform for storing the data that powers corporate strategies. The intent of the benchmark's design is to simulate a set of basic load scenarios to answer some fundamental business questions that organizations from nearly any industry might encounter and ask.

For a growing industry, there is a surprising variety of approaches and vendor architectures for Hadoop-loading products (such as MapReduce, Spark, Spark through Hive, YARN, NiFi, Sqoop, Sqoop interfaces, Flume interfaces, and interfaces to command-line HDFS). Based on the differences in the results we've found, this architectural foundation greatly influences performance.

RedPoint Data Management for Hadoop is based on YARN, a resource negotiator (a.k.a. operating system) that is the foundation of Hadoop 2.0. In the case of our queries, RedPoint was able to complete workloads in a very short time frame, well within enterprise requirements and faster than what we imagined possible. Compared to a previous benchmark, one workload ran 550% faster than a product using Spark and 1900% faster than a product using MapReduce. RedPoint's platform, continually fine-tuned for over a decade, has achieved unparalleled high performance in utilizing YARN without the overhead of other Hadoop components. This paper further explores and investigates these results.


Hadoop in the Enterprise

Companies are clamoring to capture as much data as possible and harness that data as meaningful information to drive their businesses. Today, this information, or "big data," would include all data generated by a company's digital strategy. It would also include all data that past technologies were unable to record and analyze for business use. Big data is not only controllable today, but its implementation is also essential in conducting business.

Machines are primarily responsible for big data. Machine data contains critical insights; it allows us to conduct unprecedented triangulation of physical objects. Unlike traditional structured data (for example, data stored in a traditional relational database for batch reporting), machine data is non-standard, highly diverse, dynamic, and high-volume. We can build a comprehensive picture of activity when we correlate and visualize the related events across disparate sources. The challenge is in bringing the data together.

Companies that can capture and harness this data will benefit accordingly. In other words, the more companies store and process data, the more success they can tap into. Businesses across industries show clear, upward trends in spending on big data, and it is projected to be the top budget item in many sectors.

Hadoop is a technology that was formed in 2006 to meet the needs of the Silicon Valley data elite. Previously, these companies had data needs that far surpassed budgets for the database management systems (DBMS) out there. The scale they were using was another order of magnitude away from the target for the DBMS. And the timing of the scale was not certain, given the variability of the data.

Hadoop is quickly being adopted by businesses from start-up companies to the Fortune 1000 because it scales very well and relatively cheaply. This means you do not have to accurately predict the data size at the outset.

Hadoop is a great fit for many types of data in an organization. Sensor data, clickstream data, social data, server logs, smart grid data, electronic medical records, video and pictures, unstructured text, geolocation data, high-volume data, and "cold" enterprise data are all a great fit in the Hadoop open-source software framework for storing data on clusters of commodity hardware.

Scale-out file systems that may be lacking in functionality, but can handle modern levels of complex data, are here to stay. Hadoop is the epitome of that idea, and an ecosystem is building up around it. While there used to be little overlap between reasonable selection of Hadoop and reasonable selection of a DBMS, that has changed. Hadoop has withstood the test of time and has grown to the point where quite a few applications architected on a DBMS will be moved to Hadoop. The cost savings, combined with the ability to execute the complete application, will be persuasive.

It is especially useful as a collection point for post-operational data across the enterprise, not all of which may be destined for a relational data warehouse. This "data lake" can be left at low refinement, which is just fine for the emerging class of data scientists and others in need of deep insight.

Traditionally, data preparation has consumed an estimated 80 percent of legacy data development efforts. Loading Hadoop clusters will continue this tradition as a top job at a range of companies. Luckily, it is possible to lessen the cost and risk of this work with a robust data integration tool.

The Evolution of Hadoop Data Integration

In the early days, low-performing, open source vendor architectures like Sqoop, Flume, command-line HDFS, and Hive were limiting. Since then, numerous approaches and tools have arisen to meet the Hadoop data integration challenge.

MapReduce was the original (and in Hadoop 1.0, the only) data-processing engine for Hadoop. However, it has proved unwieldy and unable to meet increasingly complex workloads, suffering from issues such as an inability to scale index-based lookups.

Spark emerged as a replacement for MapReduce. By utilizing a pool of persistent "executor services," it can nearly eliminate inter-stage startup costs, one of MapReduce's big weaknesses. In addition, Spark uses Resilient Distributed Datasets (RDDs) for inter-stage storage. RDDs are a form of HDFS-backed memory images that combine the fast access of memory with the fault tolerance of HDFS. Spark can be used to achieve very fast throughput for certain workloads.

Spark is also being leveraged to improve the performance of Hive processing, specifically HQL queries. So-called "Hive on Spark" has the ability to accelerate Hive itself, but doesn't serve as a general data-integration platform.

But even Spark has its limitations. The amount of memory required to process a data set can be an order of magnitude larger than the input data set size. If less memory is available due to various factors (such as cluster load, node downtime, or unexpected data scale), Spark's performance degradation curve can be severely non-linear, even becoming a "cliff" beyond which jobs simply fail. It is increasingly impossible to expect a "reserved" cluster for Hadoop activity, which means a cluster's memory resources are increasingly limited and unpredictable. Still, Spark would be the number one choice for most workloads if these were your only options.


However, by applying engineering to the cluster to achieve higher-performing results with true commodity nodes (without the added memory), some have improved upon preceding models. For example, RedPoint uses a native engine on top of YARN, a resource negotiator and operating system, which is the foundation of Hadoop 2.0. It is the layer that integrates and manages resources, including storage resources, CPU, I/O, and memory.

RedPoint is based around YARN, which runs in the cluster. By leveraging YARN, it can run in massive parallelism without the assumption that all the data must fit into memory. Workload performance is more predictable as well, given its lack of dependency on memory. Additionally, the degradation curve when faced with limited resources is more gentle.

RedPoint Data Management™ for Hadoop leverages RedPoint's 10-year legacy with the high-performance RedPoint Data Management data integration tool. It uses a visual-design data flow model, allowing non-programmers to create complex data transformations. Organizations with existing data staff should find this technology to have a faster and more affordable adoption curve than when hiring for Spark.

RedPoint Product Profile

Company Profile

Product Name: RedPoint Data Management for Hadoop
Initial Launch: 2013
Current Release and Date: 7.3.1, June 2016
Key Features: Based on YARN; company with a 10-year legacy with the high-performance RedPoint Data Management data integration and data quality tool; predictable high performance
Hadoop DI Competitors: Informatica, Pentaho, Syncsort, Talend
Company Founded: 2006
Focus: Empower data-driven organizations by unlocking the full value of their data to drive consumer engagement and profitable, sustained growth.
Employees: 120
Headquarters: Wellesley Hills, MA
Ownership: Private

Benchmark Overview

The intent of the benchmark's design is to simulate a set of basic scenarios to address some fundamental business problems that an organization from nearly any industry sector might encounter. The common business questions below were formulated for the benchmark from our experience working with a range of clients over the past decade:

• What impact do customers' views of pages and products on our website have on sales? What is the average number of page views before customers make a purchase decision (online or in-store)?

• How do our coupon promotional campaigns impact our product sales or service utilization? Do our customers who view or receive our coupon promotions come to our website and buy more or additional products than they may have otherwise purchased?

• How can we identify and remove potential duplicates from a customer data source of questionable data quality?

• How can we standardize customer mailing addresses to improve the quality of our geographic data for same-household recognition and for the efficacy of our mail-marketing campaigns?

The benchmark was designed to demonstrate how a company might approach addressing these business problems by bringing different sources of information into play. We also have taken the opportunity to show how Hadoop can be leveraged, because some of the data of interest in these data management cases are likely of a large volume and non-relational or semi-structured to unstructured in nature. In these cases, using Hadoop would be the best course of action for clients seeking to answer these questions.

Since it is highly probable that the data required resides in different sources, the benchmark was also set up for data integration. Some of these sources are also probably not being consumed and aggregated into an enterprise data warehouse due to their high volume and the difficulty in integrating voluminous amounts of semi-structured data into a traditional data warehouse. Thus, the benchmark was designed to mimic common scenarios and the challenges faced by organizations seeking to integrate data to address these and similar business problems.


Benchmark Setup

The benchmark was executed using the following setup, environment, standards, and configurations.

Virtual Server Environment

Feature               Selection
Hadoop Distribution   Hortonworks Data Platform 2.4.2 (HDFS, MapReduce2, YARN, Tez, Hive, Pig, ZooKeeper, and Ambari installed)
EC2 Instance          General purpose m4.xlarge (4 vCPUs, 16 GB memory)
OS                    CentOS 6.7
Source Data Types     Text-based log files, a relational database, and comma-separated value (CSV) files
Data Volume           20 GB (log files); 7,500,000 rows (RDBMS); and 10,000,000 lines (CSV)
TPC-H Scale Factor    1x
RDBMS                 PostgreSQL 9.4
Java Version          1.8.0_91

Figure 1 and Table 1: Server Environment and Setup


The benchmark was set up using Amazon Web Services (AWS) EC2 instances deployed into an AWS Virtual Private Cloud (VPC) within the same Placement Group. According to Amazon, all instances launched within a Placement Group have low-latency, full-bisection, 10 Gigabits-per-second bandwidth between instances.

RedPoint Instances

The RedPoint Client EC2 instance was a general-purpose t2.large with 2 vCPUs and 8 GB of RAM. This Windows instance ran Microsoft Windows Server 2012. On this instance, we installed the RedPoint Data Management for Hadoop Client version 7.3.1.

The RedPoint Execution and Site Server EC2 instance was a general-purpose m4.xlarge machine with 4 vCPUs and 16 GB of RAM running CentOS 6.7. On this instance, we installed the RedPoint Data Management Execution and Site Servers version 7.3.1.

Relational Database Instance

The relational source for the benchmark was an m4.xlarge EC2 instance running CentOS 6.7. We installed PostgreSQL 9.4 on this server.

Hadoop Cluster

The Hadoop cluster for the benchmark consisted of 3 identical nodes, each an m4.xlarge EC2 instance running CentOS 6.7. We installed the Hortonworks Data Platform Hadoop distribution. Using Ambari, we installed the following Hadoop services: HDFS, MapReduce2, YARN, Tez, Hive, Pig, and ZooKeeper. This is a minimum viable product (MVP) setup.

Source Data

We created the data sources used in the benchmark to mimic real-life use cases:

• Relational data
• Web-click log
• Coupon log
• Customer names and addresses

Relational Data Source

The relational source for the benchmark (stored in PostgreSQL) was constructed using the Transaction Processing Performance Council TPC Benchmark H (TPC-H) Revision 2.17.1 Standard Specification. The TPC-H database was constructed to mimic a real-life point-of-sale system according to the entity-relationship diagram and the data type and scale specifications provided by the TPC-H. We populated the database with scripts that were seeded with random numbers to create the mock data set. The TPC-H specifications have a scale factor by which the record count for each table is derived. For this benchmark, we selected a scale factor of 1. In this case, the TPC-H database contained 1.5 million records in the ORDERS table and 6 million records in the LINEITEM table.

Figure 2: TPC-H ER Diagram © 1993-2014 Transaction Processing Performance Council
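The scale-factor rule can be sketched in a few lines of Python. This is only an illustration: the base counts are the rounded scale-factor-1 figures quoted in the text, the linear scaling is the usual TPC-H convention for these two tables, and the function name `expected_rows` is made up for this example.

```python
# Rounded row counts at scale factor 1, as quoted in the text
# (the LINEITEM figure is approximate).
BASE_ROWS = {"ORDERS": 1_500_000, "LINEITEM": 6_000_000}


def expected_rows(table: str, scale_factor: float) -> int:
    """Approximate row count of a TPC-H table at a given scale factor."""
    return round(BASE_ROWS[table] * scale_factor)


# Scale factor 1 was used for this benchmark.
rows_sf1 = {table: expected_rows(table, 1) for table in BASE_ROWS}
print(rows_sf1)
print(sum(rows_sf1.values()))  # combined ORDERS + LINEITEM rows
```

The combined figure matches the 7,500,000 RDBMS rows listed in the server environment table.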

Web-Click Log

A web-click log was generated in the same format as a standard Apache web server log file. The log file was generated using scripts to simulate two types of entries: (1) completely random page views (seeded by random numbers), and (2) web clicks that correspond to actual page views of ordered products (seeded by random records in the TPC-H ORDERS and LINEITEM tables).

The "dummy" or "noise" web-log entries appeared in a variety of possibilities but followed the same format, consistent with an Apache web-click log entry. All data were randomly selected. For example:

249.225.125.203 - anonymous [01/Jan/2015:16:02:10 -0700] "GET /images/footer-basement.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/index.php" "Windows NT 6.0"

The "signal" web-log entries that corresponded to (and were seeded with) actual ORDERS and LINEITEM records had the same randomness as the "dummy" entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records that represent a page view of an actual ordered item on the same day as the order. The entry below is an example of one that potentially corresponds to an actual order:

154.3.64.53 - anonymous [02/Jan/2015:06:03:09 -0700] "GET /images/side-ad.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/product-search.php?partkey=Q44271" "Android 4.1.2"

The web-click log file contained 64,000,000 lines and was 5.4 GB in size. There were randomly inserted web-click entries that corresponded to certain LINEITEM and ORDERS records. Approximately 1 in 1,000 of the web-click log entries corresponded to orders. The rest of the entries were random.
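To make the "signal" versus "noise" distinction concrete, here is a minimal Python sketch that parses the two sample entries and pulls out the part key when one is present. The regular expression, the field names, and the `parse_click` helper are assumptions made for illustration; the benchmark itself read the logs with RedPoint tools, not with this code.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed layout: ip, ident, user, [timestamp], "request", status, size, "url".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<url>[^"]*)"'
)


def parse_click(line):
    """Return the parsed fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # A "signal" entry carries a partkey query parameter; "noise" does not.
    query = parse_qs(urlparse(fields["url"]).query)
    fields["partkey"] = query.get("partkey", [None])[0]
    return fields


noise = ('249.225.125.203 - anonymous [01/Jan/2015:16:02:10 -0700] '
         '"GET /images/footer-basement.png HTTP/1.0" 200 2326 '
         '"http://www.acmecompany.com/index.php" "Windows NT 6.0"')
signal = ('154.3.64.53 - anonymous [02/Jan/2015:06:03:09 -0700] '
          '"GET /images/side-ad.png HTTP/1.0" 200 2326 '
          '"http://www.acmecompany.com/product-search.php?partkey=Q44271" '
          '"Android 4.1.2"')

print(parse_click(noise)["partkey"])
print(parse_click(signal)["partkey"])
```

Only entries with a non-empty part key can ever join to a LINEITEM record, which is why roughly 999 in 1,000 lines drop out of the join.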

Coupon Log

A coupon log was generated in the same format as a customized Apache web server log file. The coupon log was designed to mimic a special-case log file generated whenever a potential customer viewed an item because of a click-through from a coupon ad campaign. Again, the log file was generated using scripts to simulate two types of entries: (1) completely random page views (seeded by random numbers), and (2) page views that correspond to actual page views of ordered products by actual customers via the coupon ad campaign (seeded by random records in the TPC-H ORDERS and LINEITEM tables).

The "dummy" or "noise" coupon log-entry data were randomly selected. The "signal" coupon log entries that corresponded to, and were seeded with, actual ORDERS and LINEITEM records had the same randomness as the "dummy" entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records that represent a page view of an actual ordered item on the same day as the order. The entry below is an example of one that potentially corresponds to an actual order:

49.243.50.31 - anonymous [01/Jan/2015:18:28:14 -0700] "GET /images/header-logo.png HTTP/1.0" 200 75422 "http://www.acmecompany.com/product-view.php?partkey=S22211" "https://www.coupontracker.com/campaignlog.php?couponid=LATEWINTER2015&customerid=C019713&trackingsnippet=LDGU-EOEF-LONX-WRTQ" "Windows Phone OS 7.5"

The coupon log file contained 16,000,000 entries and was 14.3 GB in size. There were randomly inserted coupon entries that corresponded to certain LINEITEM and ORDERS records. Approximately 1 in 1,000 of the coupon log entries corresponded to orders. The rest of the entries were random.
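The coupon entry carries its identifiers as URL query parameters spread over two quoted fields. The following Python sketch, which assumes nothing beyond the sample line shown in the text, collects every query parameter found in the quoted URLs; the helper name `coupon_fields` is hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs


def coupon_fields(line):
    """Collect query parameters from every quoted URL in a coupon log line."""
    params = {}
    for value in re.findall(r'"([^"]*)"', line):
        if value.startswith(("http://", "https://")):
            for key, values in parse_qs(urlparse(value).query).items():
                params[key] = values[0]
    return params


sample = ('49.243.50.31 - anonymous [01/Jan/2015:18:28:14 -0700] '
          '"GET /images/header-logo.png HTTP/1.0" 200 75422 '
          '"http://www.acmecompany.com/product-view.php?partkey=S22211" '
          '"https://www.coupontracker.com/campaignlog.php?couponid=LATEWINTER2015'
          '&customerid=C019713&trackingsnippet=LDGU-EOEF-LONX-WRTQ" '
          '"Windows Phone OS 7.5"')

fields = coupon_fields(sample)
print(fields["partkey"], fields["couponid"], fields["customerid"])
```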

Name and Address CSV File

The customer name and address data was in a comma-separated value file format and stored in the Hadoop Distributed File System on our cluster. The layout of the file is demonstrated by the first few lines of the 10 million rows:

"NAME","ADDRESS","CITY","STATE","ZIP","PHONE","ID"
CELESTE A ZIENUK,125 MINOT AVE,EAST WAREHAM,MA,02538,,100000022
SEBASTIAO C BARBOSA,15 HOOSAC ST,ADAMS,MA,01220,,100000064
GREG S STURGEON,1640 ALVIN LN,BROOKFIELD,WI,53045,,100000075
RENAE BATTISTELLA,15 COMMOMWEALTH AVE,QUINCY,MA,02169,,100000080

The names were randomly generated from a generic name database. The addresses are real addresses. However, just over 2 million of the addresses were "dirty," i.e., not up to USPS standards. The job therefore had to correct and match US street addresses for these 2 million entries, which RedPoint does with a CASS (Coding Accuracy Support System) standardization module validated by the United States Postal Service (USPS).
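As a small illustration of the file layout, the sample rows can be read with Python's standard `csv` module. Keeping every field as text matters here, because a numeric ZIP column would lose the leading zero of Massachusetts ZIP codes such as 02538. The variable names are made up for this sketch.

```python
import csv
import io

SAMPLE = '''"NAME","ADDRESS","CITY","STATE","ZIP","PHONE","ID"
CELESTE A ZIENUK,125 MINOT AVE,EAST WAREHAM,MA,02538,,100000022
SEBASTIAO C BARBOSA,15 HOOSAC ST,ADAMS,MA,01220,,100000064
GREG S STURGEON,1640 ALVIN LN,BROOKFIELD,WI,53045,,100000075
RENAE BATTISTELLA,15 COMMOMWEALTH AVE,QUINCY,MA,02169,,100000080
'''

# DictReader takes the field names from the header row and keeps values as strings.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

for row in rows:
    print(row["ID"], row["ZIP"], row["ADDRESS"])
```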

Data Volume

Data Set              Type         Location     Rows         Size on Disk
Web Log               Apache Log   HDFS         64,000,000   5.5 GB
Coupon Log            Apache Log   HDFS         16,000,000   14.3 GB
Orders                RDBMS        PostgreSQL   1,500,000    N/A
Line Items            RDBMS        PostgreSQL   6,000,000    N/A
Names and Addresses   CSV          HDFS         10,000,000   0.6 GB

Table 2: Benchmark source data volumes

Each of the data sources (the TPC-H database, log files, and customer address CSV file) was also scaled to different scale factors, so that the integration routines (described in the next section) could be executed against data sources of various sizes.

Data Management Jobs

The use case of the benchmark was designed to demonstrate real-life data management scenarios where companies desire to integrate data from their transactional systems with unstructured and semi-structured data. The benchmark demonstrates this by executing routines that:

• Integrate the TPC-H relational source data with the individual log files
• Standardize customer addresses
• Identify duplicate customer records

The following data management and integration routines were created for the benchmark. In all cases, best practices were observed to optimize the performance of each job.

Web-Coupon Log on Hadoop Join with Orders Job Design

The purpose of the Web-Coupon Log on Hadoop Join with Orders job was to test the capability of the vendor software to efficiently combine a variety of data from multiple sources, both on and off Hadoop. Figure 3 represents the job design that was created in the RedPoint Data Management Client. RedPoint offers a Parallel Section tool with inputs that define all the splittable data available to the Parallel Section transforms. Splittable data is then divided up among a set of tasks to be processed in parallel. Input tools within the Parallel Section tool's processing area read their entire input data in each task and are used to define and drive data parallelism.

Within the Hadoop Parallel Section, two CSV input sources were read: Web Log and Coupon Log.

The Number Records tool was used to generate a sequence of numeric identifiers for individual records in each CSV input row.

The Calculate tool was used to convert the string Apache log date to a date format with the RedPoint ScanDateTime function:

ScanDateTime(Trim(DATESTR, "[ "), "DD/Mmm/YYYY:HH:mm:ss")
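For readers without access to the RedPoint expression language, a rough Python analogue of that conversion looks like the sketch below. It assumes that DATESTR holds the timestamp token with its leading bracket (for example "[01/Jan/2015:16:02:10") and that the UTC offset sits in a separate field; both are assumptions, as is the function name. It is not RedPoint's implementation.

```python
from datetime import datetime


def scan_apache_datetime(datestr: str) -> datetime:
    """Strip the leading '[' and spaces, then parse DD/Mmm/YYYY:HH:mm:ss."""
    cleaned = datestr.strip("[ ")
    # %b expects English month abbreviations under the default C locale.
    return datetime.strptime(cleaned, "%d/%b/%Y:%H:%M:%S")


parsed = scan_apache_datetime("[01/Jan/2015:16:02:10")
print(parsed.isoformat())
```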

Figure 3: The Web-Coupon Log on Hadoop Join with Orders Job Design

The Select tool is similar to the SQL SELECT clause. We used this tool to select only a few necessary fields from the log inputs. The selected set of fields was used for the join and the output table.

The Join tool accepts two inputs, Left and Right, and matches records from both inputs on a single key field or column. We used the Cartesian Join option to combine the matched Left (Web Log) and Right (Coupon Log) records into a single "wide" record containing all fields from both inputs. This function is similar to an SQL join.

Web Log    Coupon Log    Join    Output
IP         IP            Yes     Yes
PARTKEY    PARTKEY       Yes     Yes
DATE       DATE          Yes     Yes
           COUPONID      No      Yes
           CUSTOMERID    No      Yes

Table 3: Fields selected from the Web and Coupon logs used for the Join and output
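The field selection in Table 3 amounts to an equi-join on IP, PARTKEY, and DATE that carries COUPONID and CUSTOMERID through to the output. A minimal Python sketch of that logic follows; it uses a simple in-memory hash join and made-up sample rows, and it says nothing about how RedPoint's Join tool is implemented.

```python
from collections import defaultdict

JOIN_KEYS = ("IP", "PARTKEY", "DATE")


def join_logs(web_rows, coupon_rows):
    """Match web-log and coupon-log rows on the join keys; emit wide records."""
    index = defaultdict(list)
    for coupon in coupon_rows:
        index[tuple(coupon[k] for k in JOIN_KEYS)].append(coupon)

    joined = []
    for web in web_rows:
        for coupon in index.get(tuple(web[k] for k in JOIN_KEYS), []):
            joined.append({**web,
                           "COUPONID": coupon["COUPONID"],
                           "CUSTOMERID": coupon["CUSTOMERID"]})
    return joined


# Hypothetical rows, for illustration only.
web_rows = [
    {"IP": "154.3.64.53", "PARTKEY": "Q44271", "DATE": "2015-01-02"},
    {"IP": "249.225.125.203", "PARTKEY": None, "DATE": "2015-01-01"},
]
coupon_rows = [
    {"IP": "154.3.64.53", "PARTKEY": "Q44271", "DATE": "2015-01-02",
     "COUPONID": "LATEWINTER2015", "CUSTOMERID": "C019713"},
]

joined = join_logs(web_rows, coupon_rows)
print(joined)
```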

The resulting output completed the preceding Parallel Section within Hadoop. However, while these parallel tasks were processing, the RedPoint Execution Server was also processing the RDBMS input task.

We used the RDBMS Input tool to read data from the PostgreSQL TPC-H database and tables by executing the following query:

SELECT L_ORDERKEY, L_PARTKEY, O_CUSTKEY, O_ORDERDATE
FROM LINEITEM
LEFT OUTER JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY;
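The shape of that query's result can be explored without a PostgreSQL server. The sketch below runs the same SELECT against an in-memory SQLite database from Python; the table definitions are trimmed to the four columns the query touches, the rows are made up, and an ORDER BY is added only so the printed order is stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ORDERS (O_ORDERKEY INTEGER PRIMARY KEY,
                         O_CUSTKEY INTEGER,
                         O_ORDERDATE TEXT);
    CREATE TABLE LINEITEM (L_ORDERKEY INTEGER, L_PARTKEY INTEGER);
    -- Hypothetical rows; line item 99999 has no matching order.
    INSERT INTO ORDERS VALUES (1, 501, '2015-01-02');
    INSERT INTO LINEITEM VALUES (1, 44271), (1, 22211), (2, 99999);
""")

rows = conn.execute("""
    SELECT L_ORDERKEY, L_PARTKEY, O_CUSTKEY, O_ORDERDATE
    FROM LINEITEM
    LEFT OUTER JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY
    ORDER BY L_ORDERKEY, L_PARTKEY
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Because the join is a LEFT OUTER JOIN from LINEITEM, every line item appears in the result, and a line item without a matching order would come back with empty order columns.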

We attached a Data Viewer to the output of the final Join between the joined Web-Coupon log Hadoop output and the RDBMS to inspect the resultant data set. The resulting execution times and expected output are discussed in the next section.

Address Standardization Job Design

The purpose of the Address Standardization job was to assess the ability of the RedPoint platform to quickly and accurately detect and correct malformed US postal addresses in a single source of data on Hadoop. Figure 4 represents the job design that was created in RedPoint Data Management Client.

Again, the RedPoint Parallel Processing Container was used to take advantage of the multiple thread capacity of our Hadoop cluster.

The 10-million-item customer name and address CSV file was used as the primary input. For this job, we set the workload to be split by partition and used the ZIP Code as the partition field. This made the standardization more efficient by organizing the records. We also set the Partition Mode to Segment, because a Segment partition is faster than one based on a sort, according to the vendor's documentation.
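The idea of splitting a workload by a partition field can be sketched as follows: every record with the same ZIP Code is routed to the same task, so records that belong together are processed together. This Python example uses a CRC32 hash to pick the task, which is an assumption chosen for illustration; it does not describe RedPoint's Segment mode.

```python
import zlib
from collections import defaultdict


def partition_by_zip(records, num_tasks):
    """Route each record to a task number derived from its ZIP Code."""
    tasks = defaultdict(list)
    for record in records:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        task = zlib.crc32(record["ZIP"].encode("utf-8")) % num_tasks
        tasks[task].append(record)
    return dict(tasks)


# Hypothetical records, for illustration only.
sample = [
    {"ID": "1", "ZIP": "02538"},
    {"ID": "2", "ZIP": "02538"},
    {"ID": "3", "ZIP": "53045"},
    {"ID": "4", "ZIP": "01220"},
]

for task, recs in sorted(partition_by_zip(sample, 3).items()):
    print(task, [r["ID"] for r in recs])
```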

We used the RedPoint AO Address Quality tool to provide the address correction, parsing, and standardization. You can enable geocode assignment with a single option. For this workload, we loaded the USPS CASS-certified compressed tar file (tgz) right onto HDFS, and the RedPoint Execution Server was able to bring it directly into the Parallel processing segment of the job. The tool went through the data set and standardized the CSV file.

Next, we used the Filter tool to select only those addresses that were standardized and changed.
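The "standardize, then keep only what changed" pattern can be shown with a toy example. The standardizer below merely upper-cases the text and abbreviates a handful of street suffixes; it is a stand-in invented for this sketch and is in no way comparable to a CASS-certified engine such as the one used in the benchmark.

```python
# Toy suffix table; a real CASS engine validates against USPS reference data.
SUFFIXES = {"AVENUE": "AVE", "STREET": "ST", "LANE": "LN", "ROAD": "RD"}


def toy_standardize(address: str) -> str:
    """Upper-case the address and abbreviate a few common street suffixes."""
    words = address.upper().split()
    return " ".join(SUFFIXES.get(word, word) for word in words)


def changed_only(addresses):
    """Return (original, standardized) pairs for addresses that were altered."""
    changed = []
    for original in addresses:
        standardized = toy_standardize(original)
        if standardized != original:
            changed.append((original, standardized))
    return changed


# Hypothetical inputs, for illustration only.
inputs = ["125 MINOT AVE", "15 Hoosac Street", "1640 ALVIN LANE"]
for original, standardized in changed_only(inputs):
    print(original, "->", standardized)
```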

Again, we attached a Data Viewer to the output of the parallel Hadoop process to inspect the resultant data set. The resulting execution times and actual-versus-expected output are discussed in the next section.

Figure 4: The Address Standardization Job Design

Name Matching Job Design

The purpose of the Name Matching job was to assess the ability of the platform to quickly and accurately detect potential duplicate customer records by name and address within a single source of data on Hadoop. Figure 5 represents the job design created in the RedPoint Data Management Client. Once again, the RedPoint Parallel Processing Container was used to take advantage of the multiple thread capacity of our Hadoop cluster.

The 10-million-item customer name and address CSV file (the same one used in the Address Standardization job) was used as the primary input. For this job, we set the workload to be split by partition and used the ZIP Code as the partition field. Since the address is important to identifying matches, the ZIP was an efficient means of getting potential matches grouped closer together, instead of in random order. We also set the Partition Mode to Segment for performance purposes, just as we did in the Address Standardization job.

We used the AO Consumer Match macro to match individuals using name and address information; in this case, we set the segmentation to ZIP + address parts. The AO Consumer Match can also be used to match the individual (full name), the family (last name only), or by address (no name components). It even has additional parameters designed to match female individuals who may have changed their surnames. We used the default scores produced by the matching algorithm and did not fine-tune them in any way.
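To give a feel for segmented matching, the Python sketch below compares records only within the same ZIP Code and flags pairs whose name and address text is highly similar. The similarity measure (`difflib.SequenceMatcher`), the 0.85 threshold, and the sample records are all assumptions for illustration; the AO Consumer Match macro uses its own scoring, which is not described in this paper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def likely_duplicates(records, threshold=0.85):
    """Return ID pairs in the same ZIP whose name+address text is similar."""
    pairs = []
    for a, b in combinations(records, 2):
        if a["ZIP"] != b["ZIP"]:
            continue  # segmentation: never compare across ZIP Codes
        text_a = a["NAME"] + " " + a["ADDRESS"]
        text_b = b["NAME"] + " " + b["ADDRESS"]
        if SequenceMatcher(None, text_a, text_b).ratio() >= threshold:
            pairs.append((a["ID"], b["ID"]))
    return pairs


# Hypothetical records, for illustration only.
records = [
    {"ID": "A1", "NAME": "GREG S STURGEON", "ADDRESS": "1640 ALVIN LN", "ZIP": "53045"},
    {"ID": "A2", "NAME": "GREG STURGEON", "ADDRESS": "1640 ALVIN LN", "ZIP": "53045"},
    {"ID": "A3", "NAME": "AL WU", "ADDRESS": "9 OAK CT", "ZIP": "53045"},
    {"ID": "A4", "NAME": "GREG S STURGEON", "ADDRESS": "1640 ALVIN LN", "ZIP": "02169"},
]

print(likely_duplicates(records))
```

Restricting comparisons to one segment keeps the number of pairwise checks manageable, which is the same reason the job partitions by ZIP Code.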

Next, we used the Filter tool to remove unmatched records from the data output.

Then, we used the Calculate tool to offset the group identifier produced by the AO Consumer Match tool by task number. This made the identifiers globally unique.
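One common way to make per-task identifiers globally unique is to give each task its own non-overlapping numeric range. The Python sketch below shows that arithmetic; the stride value and the function name are assumptions for illustration, and the paper does not state the exact expression used in the Calculate tool.

```python
# Assumed upper bound on match groups per task (hypothetical value).
ID_STRIDE = 10_000_000


def global_group_id(task_number: int, local_group_id: int) -> int:
    """Shift a task-local group ID into a range reserved for that task."""
    if not 0 <= local_group_id < ID_STRIDE:
        raise ValueError("local_group_id outside the assumed per-task range")
    return task_number * ID_STRIDE + local_group_id


# Group 5 in task 0 and group 5 in task 2 no longer collide.
print(global_group_id(0, 5), global_group_id(2, 5))
```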

As the final task in the Parallel Section, we sorted the data set by the group identifier, so we could see matches adjacent to each other.

Finally, we attached a Data Viewer to the output of the parallel Hadoop process to inspect the resultant data set. The resulting execution times and actual-versus-expected output are discussed in the next section.

Figure 5: The Name Matching Job Design

Benchmark Results

Use Case 1: Web-Coupon Log on Hadoop Join with Orders

The goal of the first use case for the benchmark was to prepare a data set that correlates products ordered with the page views and coupon campaign click-throughs on an e-commerce website. The integration job was written to map the page views and coupons to products ordered. Figure 6 is a conceptual mapping of this integration.

Figure 6: Web-Coupon Log on Hadoop Join with Orders Mapping

Execution Time and Actual-Versus-Expected Results

Table 4 lists the median execution times of the Web-Coupon Log on Hadoop Join with Orders job.

Job                                         Trials   Median Run Time   Output Rows
Web-Coupon Log on Hadoop Join with Orders   5        3m 47s            160,176

Table 4: Web-Coupon Log on Hadoop Join with Orders Benchmark Results

Vendor Comparison

As a comparison with the rest of the data management industry, the results of this benchmark were compared against a benchmark run by MCG Global Services in late 2015, comparing Talend and Informatica (footnote 1). Hadoop MapReduce, Apache Spark, and YARN represent a critical architectural choice that many information management professionals must make. Thus, the results of the previous benchmark are valuable when evaluating RedPoint's performance and capabilities. The Web-Coupon Log on Hadoop Join with Orders job created in RedPoint used the same data volume and variety, a nearly identical job design, and comparable EC2 instances to achieve the benchmark workload output as the previous benchmark.

Vendor Platform                  Execution Time
Hadoop MapReduce                 1h 11m 52s
Apache Spark                     20m 43s
RedPoint on Hadoop (YARN only)   3m 47s

Table 5: RedPoint performance compared to a previous benchmark

RedPoint was able to complete the same workload 550% faster than Talend using Spark and 1900% faster than Informatica using Hadoop MapReduce; measured by elapsed time, the Spark and MapReduce runs took about 5.5 and 19 times as long as the RedPoint run. This result reflects a platform that RedPoint has designed for performance, continually tuned for over a decade, and built to utilize YARN.
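The percentages quoted here follow directly from the execution times in Table 5. The short Python check below converts each time to seconds and divides by the RedPoint time; it reads the MapReduce entry as 1 hour, 11 minutes, 52 seconds, which is an interpretation of the table rather than a separately confirmed figure.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


TIMES = {
    "Hadoop MapReduce": to_seconds(1, 11, 52),
    "Apache Spark": to_seconds(0, 20, 43),
    "RedPoint on Hadoop (YARN only)": to_seconds(0, 3, 47),
}

baseline = TIMES["RedPoint on Hadoop (YARN only)"]
ratios = {name: elapsed / baseline for name, elapsed in TIMES.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the RedPoint run time")
```

The ratios come out at roughly 19.0 for MapReduce and 5.5 for Spark, which is where the "1900%" and "550%" figures originate.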

1. "Hadoop Integration Benchmark," Product Profile and Evaluation: Talend and Informatica, available at: https://info.talend.com/hadoopintegrationinformatica.html.

Use Cases 2 and 3: Address Standardization and Name Matching

The goal of the second and third use cases for the benchmark was to prepare data sets of sanitized customer addresses and matching customer duplicates. The data quality jobs were written to make use of and assess RedPoint's toolset.

Execution Time and Actual-Versus-Expected Results

Table 6 lists the median execution times of the Address Standardization and Name Matching jobs.

Job                       Trials   Median Run Time   Output Rows
Address Standardization   5        0:02:30           2,005,055
Name Matching             5        0:02:52           6,367,507

Table 6: Address Standardization and Name Matching Benchmark Results

The benchmark produced very satisfactory data quality output within a range we expected based on the original source data generated. What was impressive was RedPoint's performance. While we have no other previous benchmark with which to compare these results, the Address Standardization workload processed 10 million records at a rate of 66,667 records per second, and the Name Matching was achieved at 58,140 records per second. These results are a testament to the power of RedPoint's ability to leverage the Hadoop cluster for parallel processing via YARN with minimal overhead.
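The throughput figures can be reproduced from Table 6 by dividing the 10 million input records by each median run time. The Python sketch below does only that arithmetic; the helper name is made up.

```python
INPUT_RECORDS = 10_000_000

# Median run times from Table 6, as (minutes, seconds).
RUN_TIMES = {
    "Address Standardization": (2, 30),
    "Name Matching": (2, 52),
}


def records_per_second(minutes: int, seconds: int) -> int:
    """Input records divided by elapsed seconds, rounded to a whole number."""
    return round(INPUT_RECORDS / (minutes * 60 + seconds))


throughput = {job: records_per_second(*t) for job, t in RUN_TIMES.items()}
print(throughput)
```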

Perceived Usability Assessment

Important, but often overlooked, considerations when benchmarking and evaluating data management tools are product usability and maturity. In previous benchmarks and client engagements, we have seen tools that rank highly for how easy they are to install, configure, understand, and use. We have also seen some that are quite difficult to use. Additionally, we have evaluated RedPoint's perceived ease of use. For this assessment, we used the rubric in Table 7 (which is based on an ISO/IEC 9126-4 approach to usability metrics) and evaluated the RedPoint Data Management tool accordingly.

Measure: Efficiency (ease of installation, setup, and configuration)

• Using the vendor's documentation, how much effort (in person-hours) was required to install and set up the software once the target instance(s) were available?

• How much effort (in person-hours) was required to configure the necessary Hadoop components to get the jobs to execute?

Result: The installation and setup of RedPoint Data Management Site and Execution Servers and Client tool took less than 1.5 person-hours. The configuration of Hadoop tools took less than 0.5 person-hours.

Measure: Effectiveness (job execution completion rate)

• Once a data management/integration job is created and runs successfully on a test set of data, how many benchmark jobs failed to complete due to problems with the vendor software or Hadoop?

Result: No failures. RedPoint Data Management successfully completed every benchmark test after we confirmed the job was properly formed by running a test data set.

Measure: Satisfaction (user interface)

• On a scale from very difficult to very easy, how did we find our experience building the data integration/management jobs?

Result: Very easy. The user interface is intuitive. Data integration/management components are clearly identified and configuration options were easy to set. We only referred to the documentation and in-tool help content (which was very thorough) to confirm our usage and settings of components.

In our experience, most other vendor tools rate from easy to moderately difficult.

Table 7: RedPoint's perceived usability tests

Conclusion

There are multiple ways to integrate data into Hadoop. There are vast differences in the architectures of the vendors, wrapping open source tools like MapReduce and Spark. You cannot be satisfied with the functionality of a Hadoop load alone; you must also be concerned with performance. Ensure the wind is in your sails with your tool selection by leaving yourself room for experimentation, error, and growth. Performance will be there for the vast cycles of development, testing, quality assurance and, of course, production.

Ultimately, the proof is in the testing outcomes. Our benchmark results were beyond what we thought possible. Vendor architecture is important in integrating data with Hadoop, yet the differences are vast. RedPoint is based on a foundation of YARN, which has proven to be a good choice.


About MCG Global Services

William McKnight is President of McKnight Consulting Group (MCG) Global Services (http://www.mcknightcg.com). He is an internationally recognized authority in information management. His consulting work has included many of the Global 2000 and numerous midmarket companies. His teams have won several best practice competitions for their implementations, and many of his clients have gone public with their success stories. McKnight's strategies form the information management plan for leading companies in various industries.

Jake Dolezal has over 17 years of experience in the Information Management field with expertise in business intelligence, analytics, data warehousing, statistics, data modeling and integration, data visualization, master data management, and data quality. Dolezal has experience across a broad array of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and gaming.

With an A-list of clients representing complex and highly successful information management, MCG has a broad catalogue of experience. Our advice is a combination of the latest best practices with our personal experience and expertise. It is practical, not theoretical.

• We take a keen focus on business justification.
• We take a programmatic, not a project-based, approach.
• We believe in integrating with client staff and prioritize knowledge transfer.
• We engineer client workforces and processes to carry you forward.
• We're vendor neutral so you can rest assured that our advice is completely client oriented.
• We know, define, judge, and promote best practices.
• We have encountered and overcome most conceivable information management challenges.
• We ensure business results are delivered early and often.

We anticipate our customers' needs well into the future with our full lifecycle approach. Our focused, experienced teams generate efficient, economic, timely, and sustainable results for our clients.


About RedPoint Global

RedPoint Global offers a comprehensive set of world-class ETL, data quality, and data integration applications that operate in and across both traditional and Hadoop 2.0/YARN environments. The company also offers data-driven customer engagement solutions that help companies derive insights from customer behaviors and create consistent, relevant, and precise messaging across any and all channels. All RedPoint applications offer a unique visual user interface that eliminates the need for programming skills. This allows enterprises to utilize all data to achieve their strategic business goals. For more information, visit www.redpoint.net or email: [email protected].

ARMCGUS0816-01