Combine Query Language and Data Flow Language for Data Science

Post on 27-May-2020

© Hortonworks Inc. 2011–2016. All Rights Reserved

Spark SQL + Pig-Latin

Combine Query Language and Data Flow Language for Data Science

Jeff Zhang (zjffdu@apache.org)

May 16, 2017

Who am I

• ASF Member, working in the ASF for almost 8 years

• Committer of Apache Tez, Pig & Zeppelin

• Works at Hortonworks

Data Science

Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.

• Describe what happens

• Explain what happens

• Predict what will happen

Data Science

[Diagram: Collect Data, Data Munging, Data Analysis, Insight, Product; stage labels: online, offline]

Data Munging

• Collect and transform server logs
  – User agent normalization
  – Robot detection
  – Sessionize

• Move data from database to HDFS

• Collect and transform social media data
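As a rough illustration of the "Sessionize" item above, the sketch below assigns each log event to a per-user session and starts a new session after a period of inactivity. The 30-minute timeout, the `(user_id, timestamp)` record shape, and the name `sessionize` are assumptions made for this example; the slides do not specify how sessions are cut.

```python
from datetime import datetime, timedelta


def sessionize(events, gap=timedelta(minutes=30)):
    """Tag each (user_id, timestamp) event with a session id.

    A new session starts when a user's event arrives more than `gap`
    after that user's previous event. The 30-minute default is an
    assumed value, not one taken from the slides.
    """
    sessions = []
    last_seen = {}      # user_id -> timestamp of that user's previous event
    session_index = {}  # user_id -> number of the current session
    # Sorting by (user_id, timestamp) puts each user's events in time order.
    for user, ts in sorted(events):
        if user not in last_seen or ts - last_seen[user] > gap:
            session_index[user] = session_index.get(user, 0) + 1
        last_seen[user] = ts
        sessions.append((user, ts, f"{user}-{session_index[user]}"))
    return sessions
```

In a Pig Latin script this kind of logic would more likely live in a UDF applied per user group; the plain Python version only shows the rule itself.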

Data Munging

[Figure panels: Before Data Munging, After Data Munging]

Data Analysis

• Combine different sources of data and apply statistical methods and BI tools to get insight
  – Web Traffic Metrics
  – User Segmentation Analysis
  – A/B Test

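Data Munging vs Data Analysis

|             | Data Munging                                | Data Analysis                |
|-------------|---------------------------------------------|------------------------------|
| Data Source | Messy, structured/unstructured, unorganized | Clean, structured, organized |
| Stability   | Regular, stable                             | Ad-hoc                       |
| Tools       | Python, Spark, Hadoop, etc.                 | R, Python, SQL, etc.         |

Do you have to be a full-stack big data engineer to do data science?

What if you are a data analyst without much programming skill?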

Data Science Infrastructure

What is Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads.

What is Apache Pig

• Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its jobs in MapReduce, Apache Tez, or Apache Spark.
  – Ease of programming
  – Optimization opportunities
  – Extensibility

Word Count

[Diagram: Load → ForEach → Group → ForEach → Order → Store]

Using SQL?
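To make the word-count data flow above concrete, here is a minimal sketch that walks through the same stages in plain Python, one statement per stage. It is not Pig Latin and does not use Spark; the function names `word_count` and `store`, whitespace tokenization, lowercasing, and the tie-breaking order are all assumptions made for this illustration.

```python
from itertools import groupby


def word_count(lines):
    """Mirror the Load, ForEach, Group, ForEach, Order stages of the slide."""
    # Load: `lines` stands in for the loaded relation, one record per line.
    # ForEach: flatten every line into individual lowercase words.
    words = [word.lower() for line in lines for word in line.split()]
    # Group: bring identical words together (groupby needs sorted input).
    grouped = groupby(sorted(words))
    # ForEach: emit one (word, count) record per group.
    counts = [(word, sum(1 for _ in group)) for word, group in grouped]
    # Order: highest count first, ties broken alphabetically.
    return sorted(counts, key=lambda record: (-record[1], record[0]))


def store(records):
    """Store: render the records as tab-separated text, one per line."""
    return "\n".join(f"{word}\t{count}" for word, count in records)
```

Each intermediate result has a name and the stages read top to bottom, which is the readability argument usually made for data flow languages over a single nested SQL statement.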

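Pig-Latin vs SQL

|               | SQL                                                               | Pig-Latin                                              |
|---------------|-------------------------------------------------------------------|--------------------------------------------------------|
| Language Type | Query language: de facto standard, but unreadable in long scripts | Data flow language: more readable for long scripts     |
| Data Source   | Structured data                                                   | Structured/unstructured                                |
| Integration   | Integrated with most BI tools                                     | Very few BI tools integrated with Pig-Latin            |

Conclusion

• Pig-Latin for data munging

• SQL for data analysis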

Pig-Latin + Spark SQL

[Diagram labels: Load, Store, Data Munging; Spark DataFrame Table; Spark SQL, Data Analysis]

Spark Table (bank)

Pig Latin

SQL
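The division of labour on this slide (a data-flow step that produces a table, then SQL over that table) can be sketched without a cluster. The example below is only an analogy: it uses Python's standard `sqlite3` module as a stand-in for a Spark table, and the column layout `(age, job, balance)`, the `;` delimiter, the sample rows, and the function names `munge` and `analyze` are all hypothetical. Only the table name `bank` comes from the slide.

```python
import sqlite3

# Hypothetical raw input: semicolon-separated, with one malformed record.
RAW = [
    "30; unemployed ;1787",
    "33;services;4789",
    "bad line",
    "35;Services;1350",
]


def munge(lines):
    """Data-flow style cleanup: split, drop malformed rows, normalize fields."""
    rows = (line.split(";") for line in lines)
    rows = (row for row in rows if len(row) == 3)
    return [(int(age), job.strip().lower(), int(balance))
            for age, job, balance in rows]


def analyze(records):
    """Store the cleaned records as table `bank`, then query it with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bank (age INTEGER, job TEXT, balance INTEGER)")
    conn.executemany("INSERT INTO bank VALUES (?, ?, ?)", records)
    query = "SELECT job, AVG(balance) FROM bank GROUP BY job ORDER BY job"
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

Calling `analyze(munge(RAW))` runs the cleanup first and the aggregate query second, so the messy input never has to be handled in SQL and the analysis never has to be written as procedural code.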

Integrate Spark into Pig

[Diagram: Pig-Latin → Logical Plan → Physical Plan → Execution Plan → Execution Engine]

Where to run Pig-Latin & Spark SQL (Zeppelin)

Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

Zeppelin Architecture

[Diagram: Zeppelin Server (JVM); Pig Interpreter Group (JVM): Pig-Latin, Spark SQL; Spark Interpreter Group (JVM): Scala, Python, R]

Demo

Data Science Infrastructure (Recap)

Current Status & What's Next

• Status
  – PIG-5080 (Support store alias as Spark table)
  – ZEPPELIN-2232 (Support Spark SQL for Pig interpreter)

• Next
  – Integrate Spark MLlib in Pig
  – Use the DataFrame API instead of the RDD API to integrate Spark with Pig
  – Integrate Pig with other Spark APIs, like R, Python

Q&A

Thank You
