Combine Query Language and Data Flow Language for Data Science
Post on 27-May-2020
© Hortonworks Inc. 2011–2016. All Rights Reserved
Spark SQL + Pig-Latin
Combine Query Language and Data Flow Language for Data Science
Jeff Zhang (zjffdu@apache.org)
May 16, 2017
Who am I
• ASF Member, working in the ASF for almost 8 years
• Committer of Apache Tez, Pig & Zeppelin
• Works at Hortonworks
Data Science
Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
• Describe what happens
• Explain what happens
• Predict what will happen
Data Science
[Diagram: Collect Data → Data Munging → Data Analysis → Insight → Product, with parts of the flow labeled "online" and "offline"]
Data Munging
§ Collect and Transform Server Log
  • User Agent Normalization
  • Robot Detection
  • Sessionize
§ Move data from Database to HDFS
§ Collect and Transform Social Media Data
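As an illustration of the "Sessionize" step listed above, here is a minimal Python sketch. The slides do not say how sessions are defined, so the 30-minute inactivity gap, the `(user, timestamp)` event shape, and the `sessionize` function name are all assumptions made for this example.

```python
from collections import defaultdict

# Assumed inactivity threshold in seconds; not taken from the slides.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Assign a per-user session index to (user, timestamp) events.

    A new session starts whenever the time since the user's previous
    event exceeds `gap` seconds.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    result = []
    for user, stamps in by_user.items():
        session = 0
        previous = None
        for ts in sorted(stamps):
            if previous is not None and ts - previous > gap:
                session += 1  # gap too long: open a new session
            result.append((user, ts, session))
            previous = ts
    return result


# Hypothetical sample events: (user id, seconds since some start time).
events = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
sessions = sessionize(events)
```

In this sample, the third event of `u1` arrives 3400 seconds after the second, so it would fall into a second session under the assumed threshold.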
Data Munging
[Images: Before Data Munging / After Data Munging]
Data Analysis
• Combine different sources of data and apply statistical methods and BI tools to get insight
  – Web Traffic Metrics
  – User Segmentation Analysis
  – A/B Test
Data Munging vs Data Analysis

              Data Munging                    Data Analysis
Data Source   Messy                           Clean
              Structured/Unstructured         Structured
              Unorganized                     Organized
Stability     Regular, Stable                 Ad-hoc
Tools         Python, Spark, Hadoop, etc.     R, Python, SQL, etc.
Do you have to be a full-stack big data engineer to do data science?
What if you are a data analyst without much programming skill?
Data Science Infrastructure
What is Spark
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads.
What is Apache Pig
• Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its jobs in MapReduce, Apache Tez, or Apache Spark.
  • Ease of programming
  • Optimization opportunities
  • Extensibility
Word Count
[Diagram: Load → ForEach → Group → ForEach → Order → Store]
Using SQL?
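The Load, ForEach, Group, ForEach, Order, Store flow on this slide can be mirrored step by step in plain Python, which may help show why a data flow language reads naturally for this task. This is only a sketch: the function names and sample text are made up for illustration, and the Pig Latin statements in the comments are a typical word-count script, not one taken from the slides.

```python
from collections import Counter


def load(text):
    # Pig Latin (roughly): lines = LOAD 'input.txt' AS (line:chararray);
    return text.splitlines()


def tokenize(lines):
    # Pig Latin (roughly):
    #   words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    return [word for line in lines for word in line.split()]


def group_and_count(words):
    # Pig Latin (roughly):
    #   grouped = GROUP words BY word;
    #   counts  = FOREACH grouped GENERATE group, COUNT(words);
    return list(Counter(words).items())


def order(counts):
    # Pig Latin (roughly): ordered = ORDER counts BY $1 DESC;
    # Ties are broken alphabetically here so the result is deterministic;
    # that tie-break is an addition for this sketch.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


def store(ordered):
    # Pig Latin (roughly): STORE ordered INTO 'output';
    # Here the result is returned as tab-separated text instead of written out.
    return "\n".join(f"{word}\t{count}" for word, count in ordered)


# Hypothetical input text.
text = "pig spark pig\nsql spark pig"
output = store(order(group_and_count(tokenize(load(text)))))
```

Each function corresponds to one box in the diagram, so the script reads in the same order as the data moves.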
Pig-Latin vs SQL

                SQL                                Pig-Latin
Language Type   Query Language                     Data Flow Language
                • de facto standard                • more readable for long scripts
                • unreadable for long scripts
Data Source     Structured Data                    Structured/Unstructured
Integration     Integrated with most BI tools      Very few BI tools integrated with Pig-Latin

Conclusion
• Pig-Latin for Data Munging
• SQL for Data Analysis
Pig-Latin + Spark SQL
[Diagram labels: Load, Store, Spark DataFrame Table, Spark SQL; Pig-Latin covers Data Munging, Spark SQL covers Data Analysis]
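The division of labor on this slide (a data flow step that cleans raw data and stores it as a table, followed by SQL over that table) can be imitated with nothing but the Python standard library. This is an analogy, not the Pig and Spark integration itself: the `munge` function stands in for a Pig Latin script, an in-memory SQLite table stands in for the Spark table, and the log format, the table name `page_views`, and the robot check are all assumptions made for the example.

```python
import sqlite3

# Hypothetical raw server log lines: date, time, method, path, user agent.
raw_log = [
    "2017-05-16 10:00:01 GET /home Mozilla/5.0",
    "2017-05-16 10:00:02 GET /home Googlebot/2.1",
    "malformed line",
    "2017-05-16 10:00:05 GET /cart Mozilla/5.0",
]


def munge(lines):
    """Data munging step: drop malformed lines and obvious robot traffic."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # skip lines that do not match the assumed format
        date, _time, _method, path, agent = parts
        if "bot" in agent.lower():
            continue  # naive robot detection, for illustration only
        rows.append((date, path, agent))
    return rows


# "Store" the cleaned rows as a table (SQLite stands in for a Spark table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (day TEXT, path TEXT, agent TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", munge(raw_log))

# Data analysis step: an ad-hoc SQL query over the cleaned table.
result = conn.execute(
    "SELECT path, COUNT(*) AS views FROM page_views GROUP BY path ORDER BY path"
).fetchall()
```

The point of the split is that the messy, procedural part stays in the data flow step, while the analyst only needs SQL once the table exists.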
Spark Table (bank)
Pig Latin
SQL
Integrate Spark into Pig
[Diagram: Pig-Latin → Logical Plan → Physical Plan → Execution Plan → Execution Engine]
Where to run Pig-Latin & Spark SQL (Zeppelin)
Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
Zeppelin Architecture
[Diagram: three JVMs: Zeppelin Server; Pig Interpreter Group (Pig-Latin, Spark SQL); Spark Interpreter Group (Scala, Python, R)]
Demo
Data Science Infrastructure (Recap)
Current Status & What's Next
• Status
  – PIG-5080 (Support store alias as spark table)
  – ZEPPELIN-2232 (Support Spark SQL for Pig Interpreter)
• Next
  – Integrate Spark MLlib in Pig
  – Use DataFrame API instead of RDD API to integrate Spark with Pig
  – Integrate Pig with other Spark APIs, like R, Python
Q&A
Thank You