STORAGE-BASED CONVERGENCE BETWEEN HPC AND CLOUD TO HANDLE BIG DATA

Deliverable number: D2.1
Deliverable title: Intermediate Report: Big Data processing: state of the art
WP2: DATA SCIENCE
Editor: Gabriel Antoniu (Inria)
Main Authors: Alvaro Brandon (UPM), Ovidiu Marcu (Inria), Pierre Matri (UPM)
Grant Agreement number: 642963
Project ref. no: MSCA-ITN-2014-ETN-642963
Project acronym: BigStorage
Project full name: BigStorage: Storage-based convergence between HPC and Cloud to handle Big Data
Starting date (dur.): 1/1/2015 (48 months)
Ending date: 31/12/2018
Project website: http://www.bigstorage-project.eu
Coordinator: María S. Pérez
Address: Campus de Montegancedo sn. 28660 Boadilla del Monte, Madrid, Spain
Reply to: [email protected]
Phone: +34-91-336-7380
This report presents an overview of each of these topics, including a brief analysis of the state of the art in each case and a preliminary presentation of the specific problems to be addressed in the project.
Keywords: Data processing, data science, Big Data processing, MapReduce, streaming, workflow processing, Spark, Flink
Version | Modification(s) | Date | Author(s)
0.1 | Initial template and structure | 22.09.2016 | Gabriel Antoniu, Inria
0.2 | Sections from all authors | 14.10.2016 | All authors
0.3 | Internal version for review | 7.12.2016 | Gabriel Antoniu, Inria
model by Google in 2004 [Dea04]. Among the motivations that triggered the proposal of the MapReduce model, there were two important challenges at that time: 1) provide effective scalability (hard to implement efficiently in practice with traditional database-oriented techniques); 2) ensure that a
particular data analysis request; a Reduce stage, consisting in aggregating the intermediate data to produce the final result. Each stage is (potentially) highly parallel: the initial input dataset is split into an arbitrarily large number of subsets, each of which is typically processed by a separate Map task; the intermediate data produced by the Map tasks is shuffled and sorted, then processed by the Reduce tasks, with full parallelism again. Typically, both the input and the output data of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
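The two stages described above can be sketched in plain Python. This is a toy, single-process simulation of the model, not a Hadoop API; the function names are illustrative only:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit an intermediate (word, 1) pair for every word of every input split.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group intermediate values by key before the Reduce stage.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped intermediate values into the final result.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data storage", "big data processing"]
result = reduce_phase(shuffle(map_phase(docs)))
# result == {"big": 2, "data": 2, "storage": 1, "processing": 1}
```

In a real deployment each `map_phase` call would run as a separate Map task and the shuffle would move data across the network, but the dataflow is the same.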
Google MapReduce is the first reference implementation of the MapReduce model. Over the years it has been updated with multiple features and optimizations [Aki16a], proving its efficiency for example
was born and became the de facto standard for MapReduce processing, through its adoption by the main cloud computing providers. Hadoop currently consists of three main projects:
• Hadoop Distributed File System (HDFS), providing high-throughput access to application data;
The MapReduce model and its open-source implementation Hadoop MapReduce were quickly and widely adopted by both industry and academia, mostly because of their simple yet powerful programming model. To understand how MapReduce works, its main components and its architecture, we can follow the simple tutorial from [Hadb], briefly summarized below.
In a typical implementation of MapReduce such as Hadoop, the compute nodes and the storage nodes are the same; that is, the MapReduce framework and the Hadoop Distributed File System (see the HDFS Architecture Guide) run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration.
The Hadoop job client then submits the job (jar/executable, etc.) and configuration to a ResourceManager, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
Maps are individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records.
3 Going faster: in-memory MapReduce processing

The very simple API of MapReduce comes with the important caveat that users are forced to express
applications in terms of map and reduce functions. However, more and more applications were developed with requirements that do not fit this model. For instance, iterative
Spark and Flink facilitate the development of multi-step data pipelines using directed acyclic graph (DAG) patterns. At a higher level, both engines implement a driver program that describes the high-level
For instance, Spark's reduceByKey operator (called on a dataset of key-value pairs to return a new dataset of key-value pairs where the value of each key is aggregated using the given reduce function) is equivalent to Flink's groupBy followed by the aggregate operators sum or reduce.
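The equivalence of the two operator styles can be illustrated with a toy, single-process simulation in plain Python (the helper names are ours, not Spark or Flink APIs):

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# Spark-style reduceByKey: values are combined pairwise, per key, as they arrive.
def reduce_by_key(data, fn):
    acc = {}
    for k, v in data:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc

# Flink-style groupBy: first collect all values per key, then aggregate (here: sum).
def group_by(data):
    groups = defaultdict(list)
    for k, v in data:
        groups[k].append(v)
    return groups

spark_like = reduce_by_key(pairs, lambda x, y: x + y)
flink_like = {k: sum(vs) for k, vs in group_by(pairs).items()}
assert spark_like == flink_like == {"a": 4, "b": 6}
```

The end results are identical; the difference that matters in practice is that pairwise reduction can combine values eagerly on the map side, while group-then-aggregate materializes the whole group first.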
3.1 Spark
Apache Spark [Spark] introduced Resilient Distributed Datasets (RDDs) [Zah12a], a set of in-memory
data structures able to cache intermediate data across a set of nodes, in order to efficiently support iterative algorithms. RDDs (read-only, resilient collections of objects partitioned across multiple nodes) hold provenance information (referred to as lineage) and can be rebuilt in case of failures by partial recomputation. By default, an RDD is kept in memory only transiently (once materialized, it will be discarded from memory after its use). However, since RDDs might be repeatedly needed during computations, the user can explicitly mark them as persistent, which moves them into a dedicated cache for persistent objects.
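The lineage idea can be sketched as follows. This is a deliberately minimal, hypothetical class (not Spark's RDD implementation): a partition stores the chain of transformations rather than a replica of the data, so a lost partition can be recomputed from the source.

```python
class ToyRDD:
    """Toy illustration of lineage-based recovery."""
    def __init__(self, source, lineage=()):
        self.source = source      # the original input partition
        self.lineage = lineage    # ordered chain of transformations
        self.cache = None         # materialized data; may be lost on failure

    def map(self, fn):
        # A transformation extends the lineage; nothing is computed yet.
        return ToyRDD(self.source, self.lineage + (fn,))

    def compute(self):
        # On a cache miss, rebuild the partition by replaying the lineage.
        if self.cache is None:
            data = self.source
            for fn in self.lineage:
                data = [fn(x) for x in data]
            self.cache = data
        return self.cache

rdd = ToyRDD([1, 2, 3]).map(lambda x: x * 2).map(lambda x: x + 1)
assert rdd.compute() == [3, 5, 7]
rdd.cache = None                   # simulate losing the materialized partition
assert rdd.compute() == [3, 5, 7]  # transparently recomputed from the lineage
```

This is why RDD recovery avoids the cost of replication: only the (small) lineage must survive, not the data itself.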
Recently, the new DataFrame API was developed for Spark [Data]. It is an extension of the RDD API, in which data is now organized into named columns. It is very similar to a relational database or to the widely used model of data frames in R/Python. Its benefits come from knowing the structure of the data, which allows Spark to do optimizations under the hood. Spark's Catalyst optimizer compiles operations into JVM bytecode, with plans that involve intelligent actions like broadcasts or skipping reading irrelevant
the single high-level API for Spark [Datb], consisting of:
• An untyped API, where a DataFrame is considered as a Dataset of a generic untyped object "Row" (in other words, a Dataset[Row]);
• A strongly-typed API, where a Dataset is a collection of strongly typed JVM objects (in other words, Dataset[T]).
The benefit of a Dataset over a DataFrame is that it is statically typed: errors are checked at compile time rather than surfacing at runtime. It also gives users a high-level abstraction and a view of the
Flink is built on top of DataSets (collections of elements of a specific type on which operations with an implicit type parameter are defined), Job Graphs and Parallelization Contracts (PACTs) [War09]. Job Graphs represent parallel data flows with arbitrary tasks that consume and produce data streams. PACTs are second-order functions that define properties on the input/output data of their associated user-defined (first-order) functions (UDFs); these properties are further used to parallelize the execution
over the storage approach of intermediate data proves to be very useful for applications with varying I/O requirements.

Iteration handling. Another important difference relates to the handling of iterations. Spark implements iterations as regular for-loops and executes them by loop unrolling. This means that for each iteration a new set of tasks/operators is scheduled and executed. Each iteration operates on the result of the previous iteration, which is held in memory.
In [Shi15] the authors analyze three major architectural components (shuffle, execution model and caching) in Hadoop and Spark, and show that although Spark is generally more efficient than Hadoop, there is one case, a Sort workload, where Hadoop MapReduce is twice as fast as Spark, because of a more efficient execution model for shuffling data: MapReduce can overlap the shuffle stage with the map stage in order to effectively hide the network overhead.
Spark vs. Flink
In the framework of our BigStorage project we performed another comparative study [Mar16], which evaluates the performance of Spark versus Flink in order to identify and explain the impact of the different architectural choices and parameter configurations on the perceived end-to-end

important role in the behavior of a Big Data framework: memory management, pipelined execution, optimizations, and ease of parameter configuration. What draws our attention is that a streaming engine (i.e., Flink) delivers in many benchmarks better performance than a batch-based engine (i.e., Spark).
4 Beyond MapReduce: more generic programming models for data analytics
4.1 Supporting SQL-like queries
Hive
Hive is a data-warehouse solution built on top of the Hadoop ecosystem [Ash09]. It allows users to build SQL queries that extract data from several databases and file systems that integrate with Hadoop. These queries are later translated into MapReduce, Tez or Spark jobs. It was initially developed by Facebook, is now used by several companies, and is offered in services like Amazon Web Services. In fact, for a long time it was part of the core of Spark SQL.
• Partitions: denote the distribution of the data inside the directory. This is done through subdirectories that are named with the specific values of the partition columns; e.g., for a partition on columns date and country, data with the particular values 20091112 and US will be stored in the subdirectory /date=20091112/country=US.
• Buckets: data in partitions are further split into files based on the hash of a column in the table.
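The two layout rules above can be sketched in a few lines of Python. This is a toy illustration: the path layout follows the example in the text, while the hash is a stand-in for Hive's own bucketing hash function, and the table name is hypothetical.

```python
def partition_path(table_dir, partitions):
    # A partition maps to one subdirectory per (column, value) pair, in order.
    return table_dir + "".join(f"/{col}={val}" for col, val in partitions)

def bucket_for(value, num_buckets):
    # A bucket is chosen by hashing the bucketing column modulo the bucket count,
    # so equal values always land in the same file.
    return hash(value) % num_buckets

path = partition_path("/warehouse/page_views",
                      [("date", "20091112"), ("country", "US")])
assert path == "/warehouse/page_views/date=20091112/country=US"
assert 0 <= bucket_for("user_42", 32) < 32
```

Partition pruning then becomes a directory listing (only matching subdirectories are read), while bucketing supports efficient sampling and joins on the bucketed column.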
The query language supports select, project, join, aggregate, union and subqueries. It also supports UDFs and user-defined aggregations (UDAFs) implemented in Java. Users can also create tables with specific serialisation, partitioning and bucketing options, and use load and insert operations. Moreover, users can run map-reduce scripts on the rows of the tables.
with the Hadoop ecosystem;
• Compiler: invoked by the driver to transform the query into an execution plan;
• Optimiser: transforms the execution plan to get the best DAG to be executed in the Hadoop
with machine learning. Moreover, it provides support for semi-structured data sources like JSON or Parquet and infers characteristics of the data automatically. It also uses Catalyst [Arm15], an optimizer based on programming-language features of Scala like pattern matching and quasiquotes. This engine analyses the user's query to first create an optimal logical plan. This is then translated into several physical plans that are evaluated thanks to a cost-based model. The best physical plan is finally
The architecture of the system relies on optimizations made with Apache Calcite. The queries are validated against tables registered by the user and then converted to a Calcite logical plan. This plan is
The main characteristic of BlinkDB is to allow fast queries over large amounts of data by using data samples [Aga13]. Different stratified samples of the data can be created based on frequently used queries. These samples are later selected by a strategy that depends on a confidence interval and a time limit for the query, both defined by the user. The goal is to achieve fast query times by losing a small part
Big Data stream processing started more consistently with the first ideas that were gathered inside a project that would later become Apache Storm. To represent a stream as a distributed abstraction (to produce and process streams in parallel), two concepts were introduced [Nat05]: a spout produces new streams, and a bolt takes streams as input and produces other streams as output; spouts and bolts are handled in parallel just like mappers and reducers in MapReduce; finally, a topology is a network of spouts and bolts.

A Storm cluster is responsible for the distribution and execution of a topology. A master coordinates a set of worker nodes, which run one or more worker processes. Each worker process is mapped to a single topology and starts a JVM in which one or more executors run one or more tasks.
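The spout/bolt/topology concepts can be rendered as a toy single-process pipeline in Python. The names mirror the concepts described above, not Storm's actual API, and in a real cluster each stage would run as many parallel tasks:

```python
def sentence_spout():
    # A spout produces a new stream of tuples (here: sentences).
    yield "big data"
    yield "data streams"

def split_bolt(stream):
    # A bolt consumes a stream and emits another stream (here: words).
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # A terminal bolt aggregating the stream into word counts.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# A topology is a network of spouts and bolts chained by streams.
topology = count_bolt(split_bolt(sentence_spout()))
assert topology == {"big": 1, "data": 2, "streams": 1}
```

Because spouts and bolts only communicate through streams, the runtime is free to place and replicate them across workers, exactly as mappers and reducers are placed in MapReduce.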
Pressed by the increased scale of data processing and the diversity of new use cases, along with the many limitations of Storm [Kul15], Twitter developed Heron, a new stream processing framework which is

One of the mechanisms that Heron implements and which was missing in Storm is the backpressure mechanism (handling spikes and congestion): if the receiver component is unable to handle incoming data/tuples, in Storm the sender will simply drop tuples. Heron's tuple processing semantics are similar to those of Storm; hence Heron does not implement exactly-once guarantees either, although the authors claim that the design allows it and they are considering its implementation [Kul15]. Finally, the authors mention that at Twitter, Storm was replaced by Heron, showing large improvements in both resource
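The contrast between dropping tuples and applying backpressure can be sketched with a bounded buffer. This is a toy model of the two policies described above (the class and policy names are ours, not Storm or Heron code):

```python
from collections import deque

class Receiver:
    """An overloaded consumer with a bounded buffer and a receive policy."""
    def __init__(self, capacity, policy):
        self.buffer = deque()
        self.capacity = capacity
        self.policy = policy      # "drop" (Storm-like) or "backpressure" (Heron-like)
        self.dropped = 0

    def offer(self, item):
        if len(self.buffer) < self.capacity:
            self.buffer.append(item)
            return True           # accepted
        if self.policy == "drop":
            self.dropped += 1
            return True           # tuple silently lost, sender keeps going
        return False              # backpressure: sender must slow down and retry

storm_like = Receiver(2, "drop")
heron_like = Receiver(2, "backpressure")
for i in range(5):
    storm_like.offer(i)
accepted = [heron_like.offer(i) for i in range(5)]
assert storm_like.dropped == 3                        # data loss under load
assert accepted == [True, True, False, False, False]  # sender gets throttled
```

The trade-off is visible even in this sketch: dropping preserves throughput at the cost of data loss, while backpressure preserves the data at the cost of propagating the slowdown upstream.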
Spark Streaming came into play with its discretized streams, an efficient and fault-tolerant model for stream processing [Zah12]. Spark Streaming's takeaways are the high-level functional programming API, strong consistency (exactly-once semantics) and efficient fault recovery (avoiding traditional replication or upstream backup, where messages sent are buffered and replayed on a copy of the failed downstream node). It introduced the concept of D-Streams [Aki15]: "The key idea behind D-Streams is to treat a streaming computation as a series of deterministic batch computations on small time intervals", and it is based on RDDs. Spark Streaming's capabilities are limited to in-order stream processing, with windowing semantics limited to tuple- or processing-time-based windows.

Samza
Apache Samza [Samza] is a distributed stream processing framework. It uses Apache Kafka for messaging and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
(operations that remember information across individual events, e.g., window operators).

Flink's basic building blocks are streams (intermediate results) and transformations (operations that take one or more streams as input, the sources, and compute one or more result streams from them as output, the sinks). When executed, Flink programs are mapped to streaming dataflows (which may resemble arbitrary directed acyclic graphs, DAGs), consisting of streams and transformation operators. Each
Parallelism in Flink. Programs in Flink are inherently parallel and distributed. Streams are split into stream partitions and operators are split into operator subtasks. The operator subtasks execute independently from each other, in different threads and on different machines or containers. The number of operator subtasks is the parallelism of that particular operator. The parallelism of a stream is always that of its producing operator. Different operators of the same program may have different parallelism.
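The parallelism rules above can be sketched with a toy simulation (plain Python, not Flink's API): a stream is split into one partition per subtask of its producing operator, and each subtask processes its partition independently.

```python
def split_into_partitions(stream, parallelism):
    # Round-robin split of a stream into `parallelism` stream partitions.
    partitions = [[] for _ in range(parallelism)]
    for i, record in enumerate(stream):
        partitions[i % parallelism].append(record)
    return partitions

def run_operator(fn, partitions):
    # Each operator subtask processes its own partition independently;
    # in Flink these would be separate threads, machines or containers.
    return [[fn(r) for r in part] for part in partitions]

source = range(6)
partitions = split_into_partitions(source, 3)  # producing operator parallelism = 3
assert len(partitions) == 3                    # parallelism of the stream = 3
doubled = run_operator(lambda x: x * 2, partitions)
assert doubled == [[0, 6], [2, 8], [4, 10]]
```

A downstream operator with a different parallelism would simply repartition the stream before consuming it, which is where network shuffles appear in the dataflow.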
One of the fundamental challenges of distributed stateful stream processing is providing processingguarantees under failures. To overcome existing approaches, which rely on periodic global state
tasks, coordinate checkpoints, coordinate recovery on failures, etc. (high-availability setups include multiple master processes, one of which is always the leader while the others remain in standby).
• The worker processes (also called TaskManagers, at least one) execute the subtasks of a dataflow, and buffer and exchange the data streams. Each worker is a JVM process and is split into task slots, each task slot representing a fixed subset of the worker's resources.
Apache Kafka is a distributed streaming platform that provides durability and publish/subscribe functionality for data streams (making streaming data available to multiple consumers). It is the de facto open-source solution (for data durability and availability) used in end-to-end pipelines with streaming engines like Spark or Flink, which in turn provide data movement and computation.

A Kafka cluster is a set of one or more servers that store streams of records in categories called topics. Each topic can be split into multiple partitions that are managed by a Kafka broker, a service co-located with the node storing the partition, which further enables consumers and producers to access the topic's
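The topic/partition model can be sketched with a toy class (not Kafka's API; `hash` stands in for the producer's partitioner): each partition is an append-only log, and records with the same key are routed to the same partition, which is what preserves per-key ordering.

```python
class ToyTopic:
    """Toy model of a Kafka topic as a set of partitioned append-only logs."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Keyed records are routed by hashing the key, so one key
        # always lands in the same partition (ordered within it).
        idx = hash(key) % len(self.partitions)
        self.partitions[idx].append((key, value))
        return idx

topic = ToyTopic(num_partitions=4)
p1 = topic.produce("sensor-1", 20.5)
p2 = topic.produce("sensor-1", 21.0)
assert p1 == p2                                              # same key, same partition
assert [v for _, v in topic.partitions[p1]] == [20.5, 21.0]  # order preserved
```

Consumers in the same group would then divide the partitions among themselves, which is how Kafka parallelizes consumption without breaking per-key order.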
- MLlib: Apache Spark's distributed machine learning library. It inherits from the previous ones, but it is now considered a library of its own, outside the MLBase bounds. It is widely supported by the community and the most commonly used.
- ML optimiser: optimises the whole machine learning pipeline.
- MLI: an API that facilitates feature extraction and algorithm development through an easy-to-use interface.
In addition, internal optimizations are applied to adapt these algorithms to a distributed setting in Spark. Some examples are optimizing Java garbage collection times, using feature discretization to reduce communication costs, and using specialized C++ libraries in the worker nodes [Men16].
the Flink engine. The main objective was to provide not only algorithms but also the means to extract information from datasets and build maintainable machine learning pipelines. At the current version (1.2), it only allows applying these algorithms over batch data sources. Also, there are fewer algorithms available than in Spark's MLlib.
Conclusion: Spark not only has more algorithms and features, but it also unifies the streaming, SQL and machine learning capabilities of the framework so they can be used together. The reason is that Spark always
like Spark [GraphX]. It is a thin layer that sits on top of Spark and provides implementations of the most important graph algorithms, like PageRank and Triangle Count, to name a few. It makes it easy to build graphs from tabular or unstructured data and view them as both a normal collection and a graph. The API provides graph operators like subgraph, joinVertices or aggregateMessages that facilitate operations for the user. It also partitions the data of the graph into a pair of vertex and edge collections that are RDDs [Gon14]. The collections provide special indexing and partitioning adapted to the graph. To achieve

perform queries about the graph vertices and edges, and even search for structural patterns in the data, like triplets of nodes A, B, C where A is connected with B and B is connected with C, but A is not connected with C. For that reason GraphX can be easily adapted to many scenarios and by users with a lot of domain knowledge.
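The structural-pattern query just described can be sketched over a plain edge list in Python. This is an illustration of the pattern itself, not of GraphX's triplet API: find "open" triplets (A, B, C) where A→B and B→C exist but A→C does not.

```python
def open_triplets(edges):
    # Build an adjacency map from the directed edge list.
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
    # For every two-hop path A -> B -> C, keep it only if A -> C is absent.
    result = []
    for a, outs in adjacency.items():
        for b in outs:
            for c in adjacency.get(b, ()):
                if c != a and c not in outs:
                    result.append((a, b, c))
    return result

edges = [("A", "B"), ("B", "C"), ("X", "B"), ("X", "C")]
assert open_triplets(edges) == [("A", "B", "C")]  # X->B->C is closed by X->C
```

In GraphX the same idea would be expressed over the distributed vertex and edge RDDs, letting the engine parallelize the neighborhood joins.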
Gelly on Flink
Gelly is an API for Flink [Gelly] to create graphs from a set of vertices and edges read from files or collections. This Graph abstraction has several properties and functions that allow the user to get information about the graph and apply operations like map, translate or filter. As in GraphX, it provides implementations of popular algorithms like PageRank or Triangle Count.
triggering models.

Tyler Akidau, one of the authors of the Dataflow model, gives in [Bey16a] and [Bey16b] a high-level survey of modern data processing concepts: streaming, unbounded data (processing), event time vs. processing time, windows, sessions, watermarks, triggers, and accumulation.
Apache Beam
Apache Beam is the open-source extension of the Google Dataflow model, still a work in progress with no stable releases yet, that aims to propose a unified programming model in order to build pipelines that can be further executed by multiple runners, such as Apex, Flink, Spark or Google Cloud Dataflow

efficient (un)bounded data processing. This work is highly influenced by Google, which is again reshaping the world of modern Big Data processing, allowing the final user not only to choose the trade-offs of interest between correctness, latency, and cost, but also to leverage various engine runners.
5 Big Data processing: what requirements for storage?
Big Data systems share the goals of traditional data warehousing systems (extract value from the analysis of data), while aiming to cope with their limitations. Both systems significantly diverge in the analytics processes and the organization of the source data. In practice, traditional data warehouses used to organize the data in repositories, collecting them from other source databases such as enterprise management systems, analytics engines, etc. Warehousing systems are poor at organizing
• Ultimately, design next-generation data processing models, independent of any programming model, with a focus on general data orchestration for workflows and stream data processing (WP2 T2.1).
7 References
[Aga13] Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., & Stoica, I. (2013). BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys '13, 29–42.
[Arm15] Armbrust, M., Ghodsi, A., Zaharia, M., Xin, R. S., Lian, C., Huai, Y., … Franklin, M. J. (2015). Spark SQL. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15), 1383–1394. http://doi.org/10.1145/2723372.2742797
[Ash09] Thusoo, A., Sarma, J. Sen, Jain, N., Shao, Z., Chakka, P., Anthony, S., … Murthy, R. (2009). Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2), 1626–1629. http://doi.org/10.1109/ICDE.2010.5447738
[Bdb] Olson, M. A., Bostic, K., & Seltzer, M. (1999). Berkeley DB. In Proceedings of the annual
[Car15] Carbone, P., Fora, G., Ewen, S., Haridi, S., & Tzoumas, K. (2015). Lightweight Asynchronous Snapshots for Distributed Dataflows. https://arxiv.org/pdf/1506.08603.pdf
[Cass] Lakshman, A., & Malik, P. (2010). Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2), 35–40. http://dx.doi.org/10.1145/1773912.1773922
[Ceph] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., & Maltzahn, C. (2006). Ceph: a scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06). USENIX Association, Berkeley, CA, USA, 307–320.
[Couch] Anderson, J. C., Lehnardt, J., & Slater, N. CouchDB: The Definitive Guide. O'Reilly, first edition
[Gon14] Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., & Stoica, I. (2014). GraphX: Graph Processing in a Distributed Dataflow Framework. 11th USENIX Symposium on Operating Systems Design and Implementation, 599–613. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/gonzalez
[Hao14] Li, H., Ghodsi, A., Zaharia, M., Shenker, S., & Stoica, I. (2014). Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. SoCC 2014, Seattle, USA, 6:1–6:15.
Storm @Twitter. SIGMOD 2014, Snowbird, Utah, USA, 147–156. http://doi.acm.org/10.1145/2588555.2595641
[Tud14] Tudoran, R., Costan, A., Nano, O., Santos, I., Soncu, H., & Antoniu, G. JetStream: Enabling high-throughput live event streaming on multi-site clouds. Future Generation Computer Systems,
[Xin13] Xin, R. S., Rosen, J., Zaharia, M., Franklin, M. J., Shenker, S., & Stoica, I. (2013). Shark: SQL and rich analytics at scale. Proceedings of the 2013 International Conference on Management of Data -
Franklin, M. J., Shenker, S., & Stoica, I. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012, San Jose. http://dl.acm.org/citation.cfm?id=2228298.2228301
[Zah12] Zaharia, M., Das, T., Li, H., Shenker, S., & Stoica, I. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. HotCloud, June 2012, Boston, MA,