Hadoop Data Integration Benchmark

Product Profile and Evaluation:

RedPoint Data Management for Hadoop

By William McKnight and Jake Dolezal
August 2016
Sponsored by RedPoint Global Inc.

MCG Global Services Hadoop Integration Benchmark

© MCG Global Services 2016 www.mcknightcg.com

Table of Contents

Executive Summary

Hadoop in the Enterprise

The Evolution of Hadoop Data Integration

RedPoint Product Profile
    Company Profile

Benchmark Overview

Benchmark Setup
    Virtual Server Environment
    RedPoint Instances
    Relational Database Instance
    Hadoop Cluster
    Source Data
    Relational Data Source
    Web-Click Log
    Coupon Log
    Name and Address CSV File
    Data Volume
    Data Management Jobs
    Web-Coupon Log on Hadoop Join with Orders Job Design
    Address Standardization Job Design
    Name Matching Job Design

Benchmark Results
    Use Case 1: Web-Coupon Log on Hadoop Join with Orders
    Execution Time and Actual-Versus-Expected Results
    Vendor Comparison
    Use Cases 2 and 3: Address Standardization and Name Matching
    Execution Time and Actual-Versus-Expected Results
    Perceived Usability Assessment

Conclusion

About MCG Global Services

About RedPoint Global



Executive Summary

This benchmark is part of research into the performance of loads on Hadoop clusters, an increasingly important platform for storing the data that powers corporate strategies. The intent of the benchmark’s design is to simulate a set of basic load scenarios to answer some fundamental business questions that organizations from nearly any industry might encounter and ask.

For a growing industry, there is a surprising variety of approaches and vendor architectures for Hadoop-loading products (such as MapReduce, Spark, Spark through Hive, YARN, NiFi, Sqoop, Sqoop interfaces, Flume interfaces, and interfaces to command-line HDFS). Based on the differences in the results we’ve found, this architectural foundation greatly influences performance.

RedPoint Data Management for Hadoop is based on YARN, a resource negotiator (a.k.a. operating system) that is the foundation of Hadoop 2.0. In the case of our queries, RedPoint was able to complete workloads in a very short time frame, well within enterprise requirements and faster than what we imagined possible. Compared to a previous benchmark, one workload ran 550% faster than a product using Spark and 1900% faster than a product using MapReduce. RedPoint’s platform, continually fine-tuned for over a decade, has achieved unparalleled high performance in utilizing YARN without the overhead of other Hadoop components. This paper further explores and investigates these results.



Hadoop in the Enterprise

Companies are clamoring to capture as much data as possible and harness that data as meaningful information to drive their businesses. Today, this information, or “big data,” would include all data generated by a company’s digital strategy. It would also include all data that past technologies were unable to record and analyze for business use. Big data is not only controllable today, but its implementation is also essential in conducting business.

Machines are primarily responsible for big data. Machine data contains critical insights; it allows us to conduct unprecedented triangulation of physical objects. Unlike traditional structured data (for example, data stored in a traditional relational database for batch reporting), machine data is non-standard, highly diverse, dynamic, and high-volume. We can build a comprehensive picture of activity when we correlate and visualize the related events across disparate sources. The challenge is in bringing the data together.

Companies that can capture and harness this data will benefit accordingly. In other words, the more companies store and process data, the more success they can tap into. Businesses across industries show clear, upward trends in spending on big data, and it is projected to be the top budget item in many sectors.

Hadoop is a technology that was formed in 2006 to meet the needs of the Silicon Valley data elite. Previously, these companies had data needs that far surpassed budgets for the database management systems (DBMS) out there. The scale they were using was another order of magnitude away from the target for the DBMS. And the timing of the scale was not certain, given the variability of the data.

Hadoop is quickly being adopted by businesses from start-up companies to the Fortune 1000 because it scales very well and relatively cheaply. This means you do not have to accurately predict the data size at the outset.

Hadoop is a great fit for many types of data in an organization. Sensor data, clickstream data, social data, server logs, smart grid data, electronic medical records, video and pictures, unstructured text, geolocation data, high-volume data, and “cold” enterprise data are all a great fit in the Hadoop open-source software framework for storing data on clusters of commodity hardware.

Scale-out file systems that may be lacking in functionality but can handle modern levels of complex data are here to stay. Hadoop is the epitome of that idea, and an ecosystem is building up around it.

While there used to be little overlap between reasonable selection of Hadoop and reasonable selection of a DBMS, that has changed. Hadoop has withstood the test of time and has grown to the point where quite a few applications architected on a DBMS will be moved to Hadoop. The cost savings, combined with the ability to execute the complete application, will be persuasive.

It is especially useful as a collection point for post-operational data across the enterprise, not all of which may be destined for a relational data warehouse. This “data lake” can be left at low refinement, which is just fine for the emerging class of data scientists and others in need of deep insight.

Traditionally, data preparation has consumed an estimated 80 percent of legacy data development efforts. Loading Hadoop clusters will continue this tradition as a top job at a range of companies. Luckily, it is possible to lessen the cost and risk of this work with a robust data integration tool.

The Evolution of Hadoop Data Integration

In the early days, low-performing, open source vendor architectures like Sqoop, Flume, command-line HDFS and Hive were limiting. Since then, numerous approaches and tools have arisen to meet the Hadoop data integration challenge.

MapReduce was the original (and in Hadoop 1.0, the only) data-processing engine for Hadoop. However, it has proved unwieldy and unable to meet increasingly complex workloads, suffering from issues such as an inability to scale index-based lookups.

Spark emerged as a replacement for MapReduce. By utilizing a pool of persistent "executor services" it can nearly eliminate inter-stage startup costs, one of MapReduce's big weaknesses. In addition, Spark uses Resilient Distributed Datasets (RDDs) for inter-stage storage. RDDs are a form of HDFS-backed memory images that combine the fast access of memory with the fault tolerance of HDFS. Spark can be used to achieve very fast throughput for certain workloads.

Spark is also being leveraged to improve the performance of Hive processing, specifically HQL queries. So-called "Hive on Spark" has the ability to accelerate Hive itself, but doesn't serve as a general data-integration platform.

But even Spark has its limitations. The amount of memory required to process a data set can be an order of magnitude larger than the input data set size. If less memory is available due to various factors (such as cluster load, node downtime, or unexpected data scale), Spark's performance degradation curve can be severely non-linear, even becoming a "cliff" beyond which jobs simply fail. It is increasingly unrealistic to expect a “reserved” cluster for Hadoop activity, which means a cluster’s memory resources are increasingly limited and unpredictable. Still, Spark would be the number one choice for most workloads if these were your only options.

However, by applying engineering to the cluster to achieve higher-performing results with true commodity nodes (without the added memory), some have improved upon preceding models. For example, RedPoint uses a native engine on top of YARN, a resource negotiator and operating system, which is the foundation of Hadoop 2.0. It is the layer that integrates and manages resources, including storage resources, CPU, I/O and memory.

RedPoint is based around YARN, which runs in the cluster. By leveraging YARN, it can run in massive parallelism without the assumption that all the data must fit into memory. Workload performance is more predictable as well, given its lack of dependency on memory. Additionally, the degradation curve when faced with limited resources is more gentle.

RedPoint Data Management™ for Hadoop leverages RedPoint’s 10-year legacy with the high-performance RedPoint Data Management data integration tool. It uses a visual-design dataflow model, allowing non-programmers to create complex data transformations. Organizations with existing data staff should find this technology to have a faster and more affordable adoption curve than when hiring for Spark.

RedPoint Product Profile

Company Profile

Product Name              RedPoint Data Management for Hadoop
Initial Launch            2013
Current Release and Date  7.3.1, June 2016
Key Features              Based on YARN; company with 10-year legacy with the high-performance RedPoint Data Management data integration and data quality tool; predictable high performance
Hadoop DI Competitors     Informatica, Pentaho, Syncsort, Talend
Company Founded           2006
Focus                     Empower data-driven organizations by unlocking the full value of their data to drive consumer engagement and profitable, sustained growth.
Employees                 120
Headquarters              Wellesley Hills, MA
Ownership                 Private



Benchmark Overview

The intent of the benchmark’s design is to simulate a set of basic scenarios to address some fundamental business problems that an organization from nearly any industry sector might encounter. These common business questions, formulated for the benchmark from our experience working with a range of clients over the past decade, are:

• What impact do customers’ views of pages and products on our website have on sales? What is the average number of page views before customers make a purchase decision (online or in-store)?

• How do our coupon promotional campaigns impact our product sales or service utilization? Do our customers who view or receive our coupon promotions come to our website and buy more or additional products than they may have otherwise purchased?

• How can we identify and remove potential duplicates from a customer data source of questionable data quality?

• How can we standardize customer mailing addresses to improve the quality of our geographic data for same-household recognition and for the efficacy of our mail-marketing campaigns?

The benchmark was designed to demonstrate how a company might approach addressing these business problems by bringing different sources of information into play. We also have taken the opportunity to show how Hadoop can be leveraged, because some of the data of interest in these data management cases are likely of a large volume and non-relational or semi- to un-structured in nature. In these cases, using Hadoop would be the best course of action for clients seeking to answer these questions.

Since it is highly probable that the data required resides in different sources, the benchmark was also set up for data integration. Some of these sources are also probably not being consumed and aggregated into an enterprise data warehouse due to their high volume and the difficulty in integrating voluminous amounts of semi-structured data into a traditional data warehouse. Thus, the benchmark was designed to mimic common scenarios and the challenges faced by organizations seeking to integrate data to address these and similar business problems.



Benchmark Setup

The benchmark was executed using the following setup, environment, standards, and configurations.

Virtual Server Environment

Feature              Selection
Hadoop Distribution  Hortonworks Data Platform 2.4.2 (HDFS, MapReduce2, YARN, Tez, Hive, Pig, ZooKeeper, and Ambari installed)
EC2 Instance         Memory optimized m3.xlarge (4 vCPUs, 16 GB Memory)
OS                   CentOS 6.7
Source Data Types    Text-based log files, a relational database, and comma-separated value (CSV) files
Data Volume          20 GB (log files); 7,500,000 rows (RDBMS); and 10,000,000 lines (CSV)
TPC-H Scale Factor   1x
RDBMS                PostgreSQL 9.4
Java Version         1.8.0_91

Figure 1 and Table 1: Server Environment and Setup



The benchmark was set up using Amazon Web Services (AWS) EC2 instances deployed into an AWS Virtual Private Cloud (VPC) within the same Placement Group. According to Amazon, all instances launched within a Placement Group have low-latency, full-bisection, 10 Gigabits per second bandwidth between instances.

RedPoint Instances

The RedPoint Client EC2 instance was a general-purpose t2.large with 2 vCPUs and 8 GB of RAM. This Windows instance ran Microsoft Windows Server 2012. On this instance, we installed the RedPoint Data Management for Hadoop Client version 7.3.1.

The RedPoint Execution and Site Server EC2 instance was a general-purpose m4.xlarge machine with 4 vCPUs and 16 GB of RAM running CentOS 6.7. On this instance, we installed the RedPoint Data Management Execution and Site Servers version 7.3.1.

Relational Database Instance

The relational source for the benchmark was an m4.xlarge EC2 instance running CentOS 6.7. We installed PostgreSQL 9.4 on this server.

Hadoop Cluster

The Hadoop cluster for the benchmark consisted of 3 identical nodes, each an m4.xlarge EC2 instance running CentOS 6.7. We installed the Hortonworks Data Platform Hadoop distribution. Using Ambari, we installed the following Hadoop services: HDFS, MapReduce2, YARN, Tez, Hive, Pig, and ZooKeeper. This is a minimum viable product (MVP) setup.

Source Data

We created the data sources used in the benchmark to mimic real-life use cases:

• Relational data

• Web-click log

• Coupon log

• Customer names and addresses

Relational Data Source

The relational source for the benchmark (stored in PostgreSQL) was constructed using the Transaction Processing Performance Council TPC Benchmark H (TPC-H) Revision 2.17.1 Standard Specification. The TPC-H database was constructed to mimic a real-life point-of-sale system according to the entity-relationship diagram and the data type and scale specifications provided by the TPC-H.

Figure 2: TPC-H ER Diagram © 1993-2014 Transaction Processing Performance Council

We populated the database with scripts that were seeded with random numbers to create the mock data set. The TPC-H specifications have a scale factor by which the record count for each table is derived. For this benchmark, we selected a scale factor of 1. In this case, the TPC-H database contained 1.5 million records in the ORDERS table and 6 million records in the LINEITEM table.

Web-Click Log

A web-click log was generated in the same fashion as a standard Apache web server log file. The log file was generated using scripts to simulate two types of entries: (1) completely random page views (seeded by random numbers) and (2) web clicks that correspond to actual page views of ordered products (seeded by random records in the TPC-H ORDERS and LINEITEM tables).

The “dummy” or “noise” web-log entries appeared in a variety of possibilities but followed the same format, consistent with an Apache web-click log entry. All data were randomly selected. For example:

249.225.125.203 - anonymous [01/Jan/2015:16:02:10 -0700] "GET /images/footer-basement.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/index.php" "Windows NT 6.0"

The “signal” web-log entries that corresponded to (and were seeded with) actual ORDERS and LINEITEM records had the same randomness as the “dummy” entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records representing a page view of an actual ordered item on the same day as the order. The entry below is an example of those that potentially correspond to actual orders:

154.3.64.53 - anonymous [02/Jan/2015:06:03:09 -0700] "GET /images/side-ad.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/product-search.php?partkey=Q44271" "Android 4.1.2"

The web-click log file contained 64,000,000 lines and was 5.4 GB in size. There were randomly inserted web-click entries that corresponded to certain LINEITEM and ORDERS records. Approximately 1 in 1,000 of the web-click log entries corresponded to orders. The rest of the entries were random.
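As a rough illustration of how an entry in this format could be split into fields, the sketch below uses a regular expression and pulls the part key out of the referrer URL of a “signal” entry. It is not part of the benchmark tooling: the pattern, the function name parse_web_click, and the field names are assumptions based only on the two sample entries shown above.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed layout, inferred from the sample entries:
# ip - user [timestamp] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_web_click(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # A "signal" entry carries the part key in the referrer query string;
    # a "noise" entry has no partkey parameter, so the value is None.
    query = parse_qs(urlparse(fields["referrer"]).query)
    fields["partkey"] = query.get("partkey", [None])[0]
    return fields
```

Entries whose referrer carries no partkey parameter come back with partkey set to None, which gives a simple way to separate candidate “signal” rows from “noise” before any join.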

Coupon Log

A coupon log was generated in the same fashion as a customized Apache web server log file. The coupon log was designed to mimic a special-case log file generated whenever a potential customer viewed an item because of a click-through from a coupon ad campaign. Again, the log file was generated using scripts to simulate two types of entries: (1) completely random page views (seeded by random numbers) and (2) page views that correspond to actual page views of ordered products by actual customers via the coupon ad campaign (seeded by random records in the TPC-H ORDERS and LINEITEM tables).

The “dummy” or “noise” coupon log-entry data were randomly selected. The “signal” coupon log entries that corresponded to, and were seeded with, actual ORDERS and LINEITEM records had the same randomness as the “dummy” entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records representing a page view of an actual ordered item on the same day as the order. The entry below is an example of those that potentially correspond to actual orders:

49.243.50.31 - anonymous [01/Jan/2015:18:28:14 -0700] "GET /images/header-logo.png HTTP/1.0" 200 75422 "http://www.acmecompany.com/product-view.php?partkey=S22211" "https://www.coupontracker.com/campaignlog.php?couponid=LATEWINTER2015&customerid=C019713&trackingsnippet=LDGU-EOEF-LONX-WRTQ" "Windows Phone OS 7.5"

The coupon log file contained 16,000,000 entries and was 14.3 GB in size. There were randomly inserted coupon entries that corresponded to certain LINEITEM and ORDERS records. Approximately 1 in 1,000 of the coupon log entries corresponded to orders. The rest of the entries were random.
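The coupon entry differs from the web-click entry by one extra quoted URL that carries the coupon and customer identifiers. A minimal sketch of extracting those values is shown below. It assumes every “signal” line has exactly four quoted strings in the order request, product URL, coupon tracker URL, user agent, as in the sample above; the function name parse_coupon_entry and the output field names are hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs


def parse_coupon_entry(line):
    """Extract IP, part key, coupon ID and customer ID from a coupon log line.

    Returns None when the line does not have the four quoted strings
    seen in the sample entry (an assumption about the format).
    """
    quoted = re.findall(r'"([^"]*)"', line)
    if len(quoted) != 4:
        return None
    _request, product_url, coupon_url, _agent = quoted
    product_query = parse_qs(urlparse(product_url).query)
    coupon_query = parse_qs(urlparse(coupon_url).query)
    return {
        "ip": line.split(" ", 1)[0],
        "partkey": product_query.get("partkey", [None])[0],
        "couponid": coupon_query.get("couponid", [None])[0],
        "customerid": coupon_query.get("customerid", [None])[0],
    }
```

Using a URL parser rather than string slicing keeps the extraction independent of the order of the query parameters.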

Name and Address CSV File

The customer name and address data was in a comma-separated value file format and stored in the Hadoop Distributed File System on our cluster. The layout of the file is demonstrated by the first few lines of the 10 million rows:

"NAME","ADDRESS","CITY","STATE","ZIP","PHONE","ID"
CELESTE A ZIENUK,125 MINOT AVE,EAST WAREHAM,MA,02538,,100000022
SEBASTIAO C BARBOSA,15 HOOSAC ST,ADAMS,MA,01220,,100000064
GREG S STURGEON,1640 ALVIN LN,BROOKFIELD,WI,53045,,100000075
RENAE BATTISTELLA,15 COMMOMWEALTH AVE,QUINCY,MA,02169,,100000080

The names were randomly generated from a generic name database. The addresses are real addresses. However, just over 2 million of the addresses were “dirty,” i.e., not up to USPS standards. Since RedPoint uses a CASS (Coding Accuracy Support System) standardization module validated by the United States Postal Service (USPS), it was necessary to correct and match US street addresses for these 2 million entries.
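For readers who want to inspect a file with this layout outside the RedPoint tool, the following sketch reads it with Python's standard csv module. The sample text is copied from the first rows shown above; the function name read_customers is hypothetical.

```python
import csv
import io

# First rows of the file, as shown in the layout above.
SAMPLE = '''"NAME","ADDRESS","CITY","STATE","ZIP","PHONE","ID"
CELESTE A ZIENUK,125 MINOT AVE,EAST WAREHAM,MA,02538,,100000022
SEBASTIAO C BARBOSA,15 HOOSAC ST,ADAMS,MA,01220,,100000064
'''


def read_customers(text_stream):
    """Read customer rows as dicts keyed by the header names."""
    # csv.DictReader returns every value as a string, so ZIP codes
    # such as "02538" keep their leading zero.
    return list(csv.DictReader(text_stream))


customers = read_customers(io.StringIO(SAMPLE))
```

Keeping ZIP as text matters here because many Massachusetts ZIP codes in the sample begin with 0 and would be altered by a numeric conversion.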

Data Volume

Data Set             Type        Location    Rows        Size on Disk
Web Log              Apache Log  HDFS        64,000,000  5.5 GB
Coupon Log           Apache Log  HDFS        16,000,000  14.3 GB
Orders               RDBMS       PostgreSQL  1,500,000   N/A
Line Items           RDBMS       PostgreSQL  6,000,000   N/A
Names and Addresses  CSV         HDFS        10,000,000  0.6 GB

Table 2: Benchmark source data volumes

Each of the data sources (the TPC-H database, log files, and customer address CSV file) was also scaled to different scale factors, so that the integration routines (described in the next section) could be executed against data sources of various sizes.

Data Management Jobs

The use case of the benchmark was designed to demonstrate real-life data management scenarios where companies desire to integrate data from their transactional systems with unstructured and semi-structured data. The benchmark demonstrates this by executing routines that:

• Integrate the TPC-H relational source data with the individual log files

• Standardize customer addresses

• Identify duplicate customer records

The following data management and integration routines were created for the benchmark. In all cases, best practices were observed to optimize the performance of each job.

Web-Coupon Log on Hadoop Join with Orders Job Design

The purpose of the Web-Coupon Log on Hadoop Join with Orders job was to test the capability of the vendor software to efficiently combine a variety of data from multiple sources, both on and off Hadoop. Figure 3 represents the job design that was created in the RedPoint Data Management Client. RedPoint offers a Parallel Section tool with inputs that define all the splittable data available to the Parallel Section transforms. Splittable data is then divided up among a set of tasks to be processed in parallel. Input tools within the Parallel Section tool's processing area read their entire input data in each task and are used to define and drive data parallelism.

Within the Hadoop Parallel Section, two CSV input sources were read: Web Log and Coupon Log.

The Number Records tool was used to generate a sequence of numeric identifiers for individual records in each CSV input row.

The Calculate tool was used to convert the string Apache log date to a date format with the RedPoint ScanDateTime function:

ScanDateTime(Trim(DATESTR, "[ "), "DD/Mmm/YYYY:HH:mm:ss")
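For comparison, the same conversion can be sketched in Python. This is only an analogue of the expression above, not RedPoint code. It assumes DATESTR holds the date portion of the bracketed timestamp with the leading bracket still attached (for example "[01/Jan/2015:16:02:10"), and the function name scan_apache_datetime is hypothetical.

```python
from datetime import datetime


def scan_apache_datetime(datestr):
    """Convert an Apache-style log date string to a datetime.

    Mirrors the two steps of the expression above: trim "[" and spaces
    from both ends, then parse day/month-name/year:hour:minute:second.
    """
    cleaned = datestr.strip("[ ")
    # %b matches abbreviated month names ("Jan") in the default C/English locale.
    return datetime.strptime(cleaned, "%d/%b/%Y:%H:%M:%S")
```

Parsing to a real date type before the join lets log dates be compared with ORDERS.O_ORDERDATE as dates rather than as differently formatted strings.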

Figure 3: The Web-Coupon Log on Hadoop Join with Orders Job Design



The Select tool is similar to the SQL SELECT clause. We used this tool to select only a few, necessary fields from the log inputs. The selected set of fields was used for the join and the output table.

The Join tool accepts two inputs, Left and Right, and matches records from both inputs on a single key field or column. We used the Cartesian Join option to combine the matched Left (Web Log) and Right (Coupon Log) records into a single "wide" record containing all fields from both inputs. This function is similar to an SQL join.

Web Log   Coupon Log   Join  Output
IP        IP           Yes   Yes
PARTKEY   PARTKEY      Yes   Yes
DATE      DATE         Yes   Yes
-         COUPONID     No    Yes
-         CUSTOMERID   No    Yes

Table 3: Fields selected from the Web and Coupon logs used for the Join and output
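The effect of this step can be approximated with an ordinary in-memory hash join. The sketch below is an assumption-laden illustration rather than the RedPoint Join tool: it reads Table 3 as meaning that IP, PARTKEY and DATE together form the match key, and it emits every matching pair of web and coupon records as one wide record. The function name join_logs is hypothetical.

```python
from collections import defaultdict

# Assumed join key, read from the "Join" column of Table 3.
JOIN_KEYS = ("IP", "PARTKEY", "DATE")


def join_logs(web_rows, coupon_rows):
    """Inner-join web and coupon records on the shared key fields."""
    # Build a lookup of coupon records grouped by key.
    index = defaultdict(list)
    for row in coupon_rows:
        index[tuple(row[k] for k in JOIN_KEYS)].append(row)

    output = []
    for web in web_rows:
        key = tuple(web[k] for k in JOIN_KEYS)
        # Every matching coupon record produces one wide output record.
        for coupon in index.get(key, []):
            wide = {k: web[k] for k in JOIN_KEYS}
            wide["COUPONID"] = coupon["COUPONID"]
            wide["CUSTOMERID"] = coupon["CUSTOMERID"]
            output.append(wide)
    return output
```

Indexing the smaller input (the coupon log has a quarter as many rows as the web log) keeps the lookup table small while the larger input is streamed past it.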

The resulting output completed the preceding Parallel Section within Hadoop. However, while these parallel tasks were processing, the RedPoint Execution Server was also processing the RDBMS input task.

We used the RDBMS Input tool to read data from the PostgreSQL TPC-H database and tables by executing the following query:

SELECT L_ORDERKEY, L_PARTKEY, O_CUSTKEY, O_ORDERDATE
FROM LINEITEM
LEFT OUTER JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY;
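To see what this query returns, it can be run against a toy schema. The sketch below uses Python's built-in sqlite3 module with an in-memory database instead of PostgreSQL, and the three sample rows are invented for illustration only. Only the columns referenced by the query are created.

```python
import sqlite3

# The query from the benchmark, unchanged.
QUERY = """
SELECT L_ORDERKEY, L_PARTKEY, O_CUSTKEY, O_ORDERDATE
FROM LINEITEM
LEFT OUTER JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY;
"""


def run_demo():
    """Run the query on a tiny in-memory stand-in for the TPC-H tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ORDERS (O_ORDERKEY INTEGER, O_CUSTKEY INTEGER, O_ORDERDATE TEXT)"
    )
    conn.execute("CREATE TABLE LINEITEM (L_ORDERKEY INTEGER, L_PARTKEY INTEGER)")
    # Invented rows: order 1 exists, order 2 has a line item but no order row.
    conn.execute("INSERT INTO ORDERS VALUES (1, 100, '2015-01-02')")
    conn.executemany("INSERT INTO LINEITEM VALUES (?, ?)", [(1, 10), (2, 20)])
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    # Sort by order key so the result order is deterministic.
    return sorted(rows, key=lambda row: row[0])
```

Because the join is a LEFT OUTER JOIN from LINEITEM, a line item with no matching order is kept, with empty order columns. In a consistent TPC-H database every line item has an order, so the result has one row per line item.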

We attached a Data Viewer to the output of the final Join between the joined Web-Coupon log Hadoop output and the RDBMS to inspect the resultant data set. The resulting execution times and expected output are discussed in the next section.

Address Standardization Job Design

The purpose of the Address Standardization job was to assess the ability of the RedPoint platform to quickly and accurately detect and correct malformed US postal addresses in a single source of data on Hadoop. Figure 4 represents the job design that was created in the RedPoint Data Management Client.



Again, the RedPoint Parallel Processing Container was used to take advantage of the multiple thread capacity of our Hadoop cluster.

The 10-million-item customer name and address CSV file was used as the primary input. For this job, we set the workload to be split by partition and used the ZIP Code as the partition field. This made the standardization more efficient by organizing the records. We also set the Partition Mode to Segment, because a Segment partition is faster than one based on a sort, according to the vendor’s documentation.
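The idea of splitting work by a partition field can be sketched as follows. This is a generic hash-partitioning illustration, not RedPoint's Segment mode, whose internals are not described here. The function names and the choice of CRC32 as the hash are assumptions made for the example.

```python
import zlib


def task_for_zip(zip_code, num_tasks):
    """Map a ZIP code to a task number in a deterministic way."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(zip_code.encode("utf-8")) % num_tasks


def partition_by_zip(records, num_tasks):
    """Group records so that all rows with the same ZIP go to the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for record in records:
        tasks[task_for_zip(record["ZIP"], num_tasks)].append(record)
    return tasks
```

No sort is needed for this kind of assignment: each record is routed in one pass, which is consistent with the general observation that partitioning without sorting is cheaper.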

We used the RedPoint AO Address Quality tool to provide the address correction, parsing, and standardization. You can enable geocode assignment with a single option. For this workload, we loaded the USPS CASS-certified compressed tar file (tgz) right onto HDFS, and the RedPoint Execution Server was able to bring it directly into the Parallel processing segment of the job. The tool went through the data set and standardized the CSV file.

Next, we used the Filter tool to select only those addresses that were standardized and changed.

Again, we attached a Data Viewer to the output of the parallel Hadoop process to inspect the resultant data set. The resulting execution times and actual-versus-expected output are discussed in the next section.

Figure 4: The Address Standardization Job Design

Name Matching Job Design

The purpose of the Name Matching job was to assess the ability of the platform to quickly and accurately detect potential duplicate customer records by name and address within a single source of data on Hadoop. Figure 5 represents the job design created in the RedPoint Data Management Client. Once again, the RedPoint Parallel Processing Container was used to take advantage of the multiple thread capacity of our Hadoop cluster.

The 10-million-item customer name and address CSV file (the same one used in the Address Standardization job) was used as the primary input. For this job, we set the workload to be split by partition and used the ZIP Code as the partition field. Since the address is important to identifying matches, the ZIP was an efficient means of getting potential matches grouped closer together, instead of in random order. We also set the Partition Mode to Segment for performance purposes, just as we did in the Address Standardization job.

We used the AO Consumer Match macro to match individuals using name and address information; in this case, we set the segmentation to ZIP + address parts. The AO Consumer Match can also be used to match the individual (full name), the family (last name only) or by address (no name components). It even has additional parameters designed to match female individuals who may have changed their surnames. We used the default scores produced by the matching algorithm and did not fine-tune them in any way.

Next, we used the Filter tool to remove unmatched records from the data output.

Then, we used the Calculate tool to offset the group identifier produced by the AO Consumer Match tool by task number. This made the group identifiers globally unique.
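One common way to make per-task identifiers globally unique is to give each task its own non-overlapping range. The sketch below illustrates that approach. The exact expression used in the benchmark job is not given, so the range size ids_per_task and the function name make_global_group_id are assumptions.

```python
def make_global_group_id(task_number, local_group_id, ids_per_task=10_000_000):
    """Combine a task number and a task-local group ID into one unique ID.

    Each task owns the range [task_number * ids_per_task,
    (task_number + 1) * ids_per_task), so IDs from different tasks
    can never collide as long as local IDs stay below ids_per_task.
    """
    if task_number < 0:
        raise ValueError("task_number must be non-negative")
    if not 0 <= local_group_id < ids_per_task:
        raise ValueError("local_group_id is outside the per-task range")
    return task_number * ids_per_task + local_group_id
```

With 10 million input rows, no task can produce more than 10 million groups, so a range of that size per task is a safe illustrative choice.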

As the final task in the Parallel Section, we sorted the data set by the group identifier, so we could see matches adjacent to each other.

Finally, we attached a Data Viewer to the output of the parallel Hadoop process to inspect the resultant data set. The resulting execution times and actual-versus-expected output are discussed in the next section.

Figure 5: The Name Matching Job Design



Benchmark Results

Use Case 1: Web-Coupon Log on Hadoop Join with Orders

The goal of the first use case for the benchmark was to prepare a data set that correlates products ordered with the page views and coupon campaign click-throughs on an e-commerce website. The integration job was written to map the page views and coupons to products ordered. Figure 6 is a conceptual mapping of this integration.

Figure 6: Web-Coupon Log on Hadoop Join with Orders Mapping

Execution Time and Actual-Versus-Expected Results

Table 4 lists the median execution times of the Web-Coupon Log on Hadoop Join with Orders job.

Job                                        Trials  Median Run Time  Output Rows
Web-Coupon Log on Hadoop Join with Orders  5       3m 47s           160,176

Table 4: Web-Coupon Log on Hadoop Join with Orders Benchmark Results



Vendor Comparison

As a comparison with the rest of the data management industry, the results of this benchmark were compared against a benchmark run by MCG Global Services in late 2015, comparing Talend and Informatica.¹ Hadoop MapReduce, Apache Spark, and YARN represent a critical architectural choice that many information management professionals must make. Thus, the results of the previous benchmark are valuable when evaluating RedPoint’s performance and capabilities. The Web-Coupon Log on Hadoop Join with Orders job created in RedPoint used the same data volume and variety, a nearly identical job design, and comparable EC2 instances to achieve the same benchmark workload output as the previous benchmark.

Vendor Platform                 Execution Time
Hadoop MapReduce                1h 11m 52s
Apache Spark                    20m 43s
RedPoint on Hadoop (YARN only)  3m 47s

Table 5: RedPoint performance compared to a previous benchmark

RedPoint was able to complete the same workload 550% faster than Talend using Spark and 1900% faster than Informatica using Hadoop MapReduce (execution times roughly 5.5 and 19 times shorter, respectively). This result reflects a platform that RedPoint has continually tuned for over a decade and that runs directly on YARN.
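The percentages above can be checked against the execution times in Table 5. The short calculation below reads the MapReduce time as 1 hour, 11 minutes, 52 seconds and expresses each comparison as a ratio of execution times; the helper name to_seconds is hypothetical.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


redpoint = to_seconds(minutes=3, seconds=47)              # 227 s
spark = to_seconds(minutes=20, seconds=43)                # 1,243 s
mapreduce = to_seconds(hours=1, minutes=11, seconds=52)   # 4,312 s

# Ratio of each platform's execution time to RedPoint's.
spark_ratio = spark / redpoint
mapreduce_ratio = mapreduce / redpoint
```

The ratios come out at about 5.5 and 19.0, which correspond to the 550% and 1900% figures quoted in the text.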

¹ “Hadoop Integration Benchmark,” Product Profile and Evaluation: Talend and Informatica, available at: https://info.talend.com/hadoopintegrationinformatica.html.



Use Cases 2 and 3: Address Standardization and Name Matching

The goal of the second and third use cases for the benchmark was to prepare data sets of sanitized customer addresses and matching customer duplicates. The data quality jobs were written to make use of and assess RedPoint’s toolset.

Execution Time and Actual-Versus-Expected Results

Table 6 lists the median execution times of the Address Standardization and Name Matching jobs.

Job                      Trials  Median Run Time  Output Rows
Address Standardization  5       0:02:30          2,005,055
Name Matching            5       0:02:52          6,367,507

Table 6: Address Standardization and Name Matching Benchmark Results

The benchmark produced very satisfactory data quality output within a range we expected based on the original source data generated. What was impressive was RedPoint’s performance. While we have no previous benchmark with which to compare these results, the Address Standardization workload processed 10 million records at a rate of 66,667 records per second, and the Name Matching workload ran at 58,140 records per second. These results are a testament to RedPoint’s ability to leverage the Hadoop cluster for parallel processing via YARN with minimal overhead.
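The throughput figures follow directly from the 10 million input records and the median run times in Table 6, as this small calculation shows. The function name records_per_second is hypothetical.

```python
INPUT_RECORDS = 10_000_000


def records_per_second(records, minutes, seconds):
    """Average throughput for a job with the given run time."""
    return records / (minutes * 60 + seconds)


# Median run times from Table 6.
address_rate = records_per_second(INPUT_RECORDS, minutes=2, seconds=30)  # 150 s
matching_rate = records_per_second(INPUT_RECORDS, minutes=2, seconds=52)  # 172 s
```

Rounded to whole records, the two rates are 66,667 and 58,140 per second. Both are computed on input rows, not on the output row counts in Table 6.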

Perceived Usability Assessment

Important, but often overlooked, considerations when benchmarking and evaluating data management tools are product usability and maturity. In previous benchmarks and client engagements, we have seen tools that rank highly for how easy they are to install, configure, understand, and use. We have also seen some that are quite difficult to use. For this benchmark, we also evaluated RedPoint’s perceived ease of use. For this assessment, we used the rubric in Table 7 (which is based on an ISO/IEC 9126-4 approach to usability metrics) and evaluated the RedPoint Data Management tool accordingly.

Measure: Efficiency (ease of installation, setup, and configuration)

• Using the vendor’s documentation, how much effort (in person-hours) was required to install and set up the software once the target instance(s) were available?

• How much effort (in person-hours) was required to configure the necessary Hadoop components to get the jobs to execute?

Result: The installation and setup of the RedPoint Data Management Site and Execution Servers and Client tool took less than 1.5 person-hours. The configuration of Hadoop tools took less than 0.5 person-hours.

Measure: Effectiveness (job execution completion rate)

• Once a data management/integration job is created and runs successfully on a test set of data, how many benchmark jobs failed to complete due to problems with the vendor software or Hadoop?

Result: No failures. RedPoint Data Management successfully completed every benchmark test after we confirmed the job was properly formed by running a test data set.

Measure: Satisfaction (user interface)

• On a scale from very difficult to very easy, how did we find our experience building the data integration/management jobs?

Result: Very easy. The user interface is intuitive. Data integration/management components are clearly identified and configuration options were easy to set. We only referred to the documentation and in-tool help content (which was very thorough) to confirm our usage and settings of components. In our experience, most other vendor tools rate from easy to moderately difficult.

Table 7: RedPoint’s perceived usability tests

Conclusion

There are multiple ways to integrate data into Hadoop, and there are vast differences in the architectures of the vendors, which wrap open source tools like MapReduce and Spark. You cannot be satisfied with the functionality of a Hadoop load alone; you must also be concerned with performance. Ensure the wind is in your sails with your tool selection by leaving yourself room for experimentation, error, and growth. Performance will matter throughout the vast cycles of development, testing, quality assurance and, of course, production.

Ultimately, the proof is in the testing outcomes. Our benchmark results were beyond what we thought possible. Vendor architecture is important in integrating data with Hadoop, and the differences between architectures are vast. RedPoint is based on a foundation of YARN, which has proven to be a good choice.



About MCG Global Services

William McKnight is President of McKnight Consulting Group (MCG) Global Services (http://www.mcknightcg.com). He is an internationally recognized authority in information management. His consulting work has included many of the Global 2000 and numerous midmarket companies. His teams have won several best practice competitions for their implementations, and many of his clients have gone public with their success stories. McKnight’s strategies form the information management plan for leading companies in various industries.

Jake Dolezal has over 17 years of experience in the Information Management field with expertise in business intelligence, analytics, data warehousing, statistics, data modeling and integration, data visualization, master data management, and data quality. Dolezal has experience across a broad array of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and gaming.

With an A-list of clients representing complex and highly successful information management, MCG has a broad catalogue of experience. Our advice is a combination of the latest best practices with our personal experience and expertise. It is practical, not theoretical.

• We take a keen focus on business justification.

• We take a programmatic, not a project-based, approach.

• We believe in integrating with client staff and prioritize knowledge transfer.

• We engineer client workforces and processes to carry you forward.

• We’re vendor neutral so you can rest assured that our advice is completely client oriented.

• We know, define, judge, and promote best practices.

• We have encountered and overcome most conceivable information management challenges.

• We ensure business results are delivered early and often.

We anticipate our customers’ needs well into the future with our full life cycle approach. Our focused, experienced teams generate efficient, economic, timely, and sustainable results for our clients.



About RedPoint Global

RedPoint Global offers a comprehensive set of world-class ETL, data quality, and data integration applications that operate in and across both traditional and Hadoop 2.0/YARN environments. The company also offers data-driven customer engagement solutions that help companies derive insights from customer behaviors and create consistent, relevant, and precise messaging across any and all channels. All RedPoint applications offer a unique visual user interface that eliminates the need for programming skills. This allows enterprises to utilize all data to achieve their strategic business goals. For more information, visit www.redpoint.net or email: [email protected].

ARMCGUS0816-01
