A Glimpse of the Hadoop Ecosystem

A Glimpse of the Hadoop Echosystem - GitHub Pagesstg-tud.github.io/ctbd/2017/CTBD_07_echosystem.pdf · YARN •Support for paradigms other than MapReduce (Multi tenancy) •HBaseon

Jun 27, 2020


dariahiddleston

Hadoop Ecosystem

• A cluster is shared among several users in an organization
• Different services
• HDFS and MapReduce provide the lower layers of the infrastructure
• Other systems "plug" on top of these
  • Easier way to program applications
  • MapReduce and HDFS are "low level"

HBase

• Hadoop database for random read/write access
• HBase is an open-source, non-relational, distributed "database"
  • Modeled after Google's BigTable.
  • It runs on top of Hadoop and HDFS, providing BigTable-like capabilities for Hadoop.
• In terms of Eric Brewer's CAP theorem, HBase is a CP-type system: it favors consistency and partition tolerance over availability.
  • CAP: consistency, availability, partition tolerance.

When to use HBase

• Real big data: billions of rows x millions of columns
  • The data cannot be stored on a single node.
• Random read/write access
  • Thousands of operations on big data
• No need for extra RDBMS features such as typed columns, secondary indexes, transactions, advanced query languages, etc.

HDFS | HBase
Good for storing large files | Built on top of HDFS. Good for hosting very large tables, e.g., billions of rows x millions of columns
Write once. Append to files is available in some recent versions but is not commonly used | Read/write many
No random read/write | Random read/write
No individual record lookup; all data is read instead | Fast record lookup (and update)

HBase

• Type of NoSQL database
  • HBase is really more a "data store" than a "database". It lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
• Strongly consistent reads and writes
• Automatic sharding (i.e., "horizontal partitioning")
  • HBase tables are distributed on the cluster via regions, and regions are automatically split and redistributed as data grows
• Automatic RegionServer failover
• Hadoop/HDFS integration
• Massively parallelized processing via MapReduce, using HBase as both source and sink.
• Java API for programmatic access, REST for non-Java front-ends.
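The automatic sharding bullet can be made concrete with a small sketch. This is a toy model in Python, not HBase's implementation: the class name `RegionedTable`, the `max_rows_per_region` threshold, and the split-at-the-median rule are all assumptions chosen for illustration. It only shows the idea that rows are partitioned by sorted row-key ranges and that a range splits once it grows too large.

```python
from bisect import bisect_right


class RegionedTable:
    """Toy range-partitioned table (hypothetical; not the HBase API)."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region
        self.start_keys = [""]   # sorted start key of each region
        self.regions = [{}]      # one dict of row_key -> value per region

    def _region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key, value):
        i = self._region_index(row_key)
        self.regions[i][row_key] = value
        if len(self.regions[i]) > self.max_rows:
            self._split(i)

    def get(self, row_key):
        return self.regions[self._region_index(row_key)].get(row_key)

    def _split(self, i):
        # Split the region at its median row key into a lower and an upper half.
        keys = sorted(self.regions[i])
        mid = keys[len(keys) // 2]
        lower = {k: v for k, v in self.regions[i].items() if k < mid}
        upper = {k: v for k, v in self.regions[i].items() if k >= mid}
        self.regions[i] = lower
        self.start_keys.insert(i + 1, mid)
        self.regions.insert(i + 1, upper)
```

In a real cluster the resulting regions would also be reassigned across RegionServers; the sketch leaves that step out.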

# Get all the data for the row
hbase> get '/user/user01/customer', 'jsmith'

# Limit this to only one column family
hbase> get '/user/user01/customer', 'jsmith', {COLUMNS => ['addr']}

# Limit this to a specific column
hbase> get '/user/user01/customer', 'jsmith', {COLUMNS => ['order:numb']}

# Scan all rows of table 't1'
hbase> scan 't1'

# Specify a time range
hbase> scan 't1', {TIMERANGE => [1303668804, 1303668904]}

# Specify a start row, limit the result to 10 rows, and only return selected columns
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
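To see what the shell commands above select, here is a rough in-memory model in Python. It is not the HBase client API: the `table` contents, the helper names `get`, `scan`, and `matches`, and the sample values are all made up for illustration. It assumes cells are addressed as `family:qualifier`, rows are visited in sorted row-key order, and the start row is inclusive.

```python
# Hypothetical sample data: row key -> {"family:qualifier": value}.
table = {
    "adams":  {"addr:city": "Denver", "order:numb": "17"},
    "jsmith": {"addr:city": "Berlin", "addr:state": "BE", "order:numb": "42"},
    "zhang":  {"addr:city": "Austin", "order:numb": "3"},
}


def matches(column, selectors):
    """A selector names a whole family ('addr') or a single column ('order:numb')."""
    family = column.split(":", 1)[0]
    return any(sel == column or sel == family for sel in selectors)


def get(row_key, columns=None):
    """Return one row, optionally restricted to the selected families/columns."""
    row = table.get(row_key, {})
    if columns is None:
        return dict(row)
    return {c: v for c, v in row.items() if matches(c, columns)}


def scan(columns=None, start_row="", limit=None):
    """Walk rows in sorted key order, starting at start_row (inclusive)."""
    results = []
    for key in sorted(table):
        if key < start_row:
            continue
        if limit is not None and len(results) >= limit:
            break
        results.append((key, get(key, columns)))
    return results
```

For example, `get("jsmith", ["addr"])` keeps only the `addr` family, and `scan(start_row="b", limit=1)` returns the first row whose key sorts at or after `"b"`.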

Hive

"The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive."

Hive

• An SQL-like interface to Hadoop.
• Data warehouse infrastructure built on top of Hadoop
• Provides data summarization, query, and analysis
• Query execution via MapReduce
  • The Hive interpreter transparently converts queries to MapReduce.
  • Other backends are also supported, e.g., Spark
• Open source, developed by Facebook
  • Also used by Netflix, CNET, Digg, eHarmony, etc.

SELECT customerId, max(total_cost)
FROM hive_purchases
GROUP BY customerId
HAVING count(*) > 3;
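One way to picture how such a query could be turned into a MapReduce job is the Python sketch below. It is conceptual only: the sample `purchases` rows and the function names are hypothetical, and the plan Hive actually generates may look different (for example, it may add combiners or extra stages). The map phase emits the grouping key, the shuffle collects values per key, and the reduce phase applies `max` together with the `HAVING count(*) > 3` filter.

```python
from collections import defaultdict

# Hypothetical rows of hive_purchases: (customerId, total_cost).
purchases = [
    ("c1", 10.0), ("c1", 25.0), ("c1", 5.0), ("c1", 40.0),
    ("c2", 99.0), ("c2", 1.0),
]


def map_phase(rows):
    # Emit (group key, value) pairs, as a mapper for GROUP BY customerId would.
    for customer_id, total_cost in rows:
        yield customer_id, total_cost


def shuffle(pairs):
    # Collect all values that share a key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, min_count=3):
    # max(total_cost) per group, keeping only groups with count(*) > min_count.
    return {k: max(v) for k, v in groups.items() if len(v) > min_count}


result = reduce_phase(shuffle(map_phase(purchases)))
```

With the sample rows, only customer `c1` has more than three purchases, so `result` holds a single entry for `c1`.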

• Word count in Hive
  • Just a curiosity, probably not the typical kind of query

https://en.wikipedia.org/wiki/Apache_Hive

DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

YARN

• Yet Another Resource Negotiator
• YARN Application Resource Negotiator (recursive acronym)
• Remedies the scalability shortcomings of "classic" MapReduce
• A general-purpose framework. MapReduce is one application.

MapReduce Limitations

• Scalability
  • Maximum cluster size: 4000 nodes
  • Maximum concurrent tasks: 40000
  • Coarse synchronization in JobTracker
• Single point of failure
  • Failure kills all queued and running jobs
  • Jobs need to be resubmitted by users
  • Restart is tricky due to complex state

For a (short) introduction: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

• YARN splits up the major functions of the JobTracker:
  • The ResourceManager has two components: Scheduler and ApplicationsManager.
    • Scheduler: performs no monitoring or tracking of status for the application.
      • Gives no guarantees about restarting failed tasks, whether they fail due to application failure or hardware failure.
      • Performs its scheduling function based on the resource requirements of the applications
      • Uses the abstract notion of a resource Container (memory, CPU, disk, network, etc.)
    • The ApplicationsManager is responsible for accepting job submissions and negotiating the first container for executing the application-specific ApplicationMaster
      • Provides the service for restarting the ApplicationMaster container on failure.
  • ApplicationMaster (one per application)
    • Negotiates appropriate resource containers from the Scheduler
    • Tracks their status and monitors progress.
    • Runs as a normal container.
    • Framework-specific library
    • Works with the NodeManager(s) to execute and monitor the tasks.
  • NodeManager (NM)
    • A new per-node slave, responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting to the ResourceManager.
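The division of labor described above can be sketched as a toy model in Python. None of this is YARN's API: the class bodies, the first-fit placement rule, and the memory-only container are simplifying assumptions (real YARN schedulers are pluggable and containers cover more than memory). The sketch shows only the flow in which an ApplicationMaster requests containers, the Scheduler places them purely by resource requirement, and NodeManagers launch them.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent that launches containers (toy model, memory only)."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, memory_mb):
        self.free_memory_mb -= memory_mb
        self.containers.append((app_id, memory_mb))


class Scheduler:
    """Places containers from resource requests alone; it does no monitoring."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        # First-fit placement is an assumption made for this sketch.
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb:
                node.launch(app_id, memory_mb)
                return node.name
        return None  # the request cannot be satisfied right now


class ApplicationMaster:
    """One per application: negotiates containers and tracks where they run."""

    def __init__(self, app_id, scheduler):
        self.app_id = app_id
        self.scheduler = scheduler
        self.placements = []

    def request_containers(self, count, memory_mb):
        for _ in range(count):
            node_name = self.scheduler.allocate(self.app_id, memory_mb)
            if node_name is not None:
                self.placements.append(node_name)
        return self.placements
```

A possible use: create two `NodeManager` objects with 2048 MB and 1024 MB free, wrap them in a `Scheduler`, and let an `ApplicationMaster` request three 1024 MB containers; a further request would then return `None` because no node has memory left.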

YARN

• Fault tolerance and availability
  • ResourceManager
    • No single point of failure: state saved in ZooKeeper
    • ApplicationMasters are restarted automatically
      • Optional failover via application-specific checkpoint
      • MapReduce applications pick up where they left off via state saved in HDFS
• Scalability
  • 6000-10000 nodes
  • 100000+ concurrent tasks
  • 10000+ jobs
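The idea of failover via an application-specific checkpoint can be illustrated with a minimal Python sketch. It is an analogy, not YARN or MapReduce code: a local JSON file stands in for state saved in HDFS, and `run_job`, `checkpoint_path`, and `fail_at` are hypothetical names. The job records its progress after every item, so a restarted run continues from the last saved position instead of starting over.

```python
import json
import os


def run_job(items, checkpoint_path, fail_at=None):
    """Sum items, saving progress after each one and resuming if a checkpoint exists."""
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += items[i]
        state["next_index"] = i + 1
        # Persist progress; in the slide's setting this state would live in HDFS.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)

    return state["total"]
```

Calling `run_job` once with `fail_at` set and then again without it, using the same checkpoint path, processes each item exactly once across the two runs.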

YARN

• Support for paradigms other than MapReduce (multi-tenancy)
  • HBase on YARN (HOYA), machine learning: Spark, graph processing: Giraph, real-time processing: Storm
• Enabled by allowing the use of a paradigm-specific ApplicationMaster
• Run all on the same Hadoop cluster!

Sources

• Hadoop 2.0 and YARN - Subash D'Souza
• https://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
• https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
• http://hbase.apache.org/book.html#arch.overview