Page 1: Combine Query Language and Data Flow Language for Data Science

© Hortonworks Inc. 2011–2016. All Rights Reserved

Spark SQL + Pig-Latin: Combine Query Language and Data Flow Language for Data Science

Jeff Zhang ([email protected]), May 16, 2017

Page 2

Who am I

• ASF Member, working in the ASF for almost 8 years

• Committer of Apache Tez, Pig & Zeppelin

• Works at Hortonworks

Page 3

Data Science

Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.

• Describe what happens

• Explain what happens

• Predict what will happen

Page 4

Data Science

Collect Data → Data Munging → Data Analysis → Insight → Product

online offline

Page 5

Data Munging

• Collect and transform server logs
  – User agent normalization
  – Robot detection
  – Sessionize

• Move data from database to HDFS

• Collect and transform social media data
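As a rough illustration of the "Sessionize" step, here is a minimal Python sketch. It is not taken from the talk: the `sessionize` function, the `(user, timestamp)` event shape and the 30-minute inactivity gap are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumption: a session ends after 30 minutes without activity.
# This threshold is a common convention, not something the slides specify.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(
    events: Iterable[Tuple[str, int]], gap: int = SESSION_GAP_SECONDS
) -> Dict[str, List[List[int]]]:
    """Group (user, unix_timestamp) events into per-user sessions."""
    by_user: Dict[str, List[int]] = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions: Dict[str, List[List[int]]] = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            # Start a new session when the pause since the last event exceeds the gap.
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])
            else:
                user_sessions[-1].append(ts)
        sessions[user] = user_sessions
    return sessions
```

Calling `sessionize([("a", 0), ("a", 100), ("a", 5000)])` splits user `a` into two sessions, because the pause before the third event is longer than the assumed gap.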

Page 6

Data Munging

Before Data Munging / After Data Munging

Page 7

Data Analysis

• Combine different sources of data and apply statistical methods and BI tools to get insight
  – Web traffic metrics
  – User segmentation analysis
  – A/B test

Page 8

Data Munging vs Data Analysis

              Data Munging                                    Data Analysis
Data Source   Messy, structured/unstructured, unorganized     Clean, structured, organized
Stability     Regular, stable                                 Ad hoc
Tools         Python, Spark, Hadoop, etc.                     R, Python, SQL, etc.

Do you have to be a full-stack big data engineer to do data science?

What if you are a data analyst without much programming skill?

Page 9

Data Science Infrastructure

Page 10

What is Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads.

Page 11

What is Apache Pig

• Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its jobs in MapReduce, Apache Tez, or Apache Spark.
  – Ease of programming
  – Optimization opportunities
  – Extensibility

Page 12

WordCount

Load → ForEach → Group → ForEach → Order → Store

Using SQL?
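The data flow on this slide can be sketched in plain Python, one statement per stage. This is only an illustration of the step-by-step style, not the Pig-Latin script shown in the talk; the `word_count` function, the lowercasing and the tie-breaking by word are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def word_count(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Word count written as an explicit sequence of data flow stages."""
    # Load: the caller supplies the lines (a file, a list, any iterable).
    # ForEach: split every line into words and flatten the result.
    # Assumption: words are lowercased and separated by whitespace.
    words = (word for line in lines for word in line.lower().split())
    # Group + ForEach: group identical words and count each group.
    counts = Counter(words)
    # Order: most frequent first, ties broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Store: here the result is simply returned to the caller.
    return ordered
```

Each stage maps to one named intermediate value, which is the property that tends to keep long data flow scripts readable.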

Page 13

Pig-Latin vs SQL

                SQL                                  Pig-Latin
Language Type   Query language                       Data flow language
                • de facto standard                  • more readable for long scripts
                • unreadable for long scripts
Data Source     Structured data                      Structured/unstructured
Integration     Integrated with most BI tools        Very few BI tools integrated with Pig-Latin

Conclusion
• Pig-Latin for data munging
• SQL for data analysis

Page 14

Pig-Latin + Spark SQL

Diagram labels: Load, Store, Spark DataFrame / Table, Spark SQL, Data Munging, Data Analysis
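The division of labour on this slide (a data flow step cleans the data and stores it as a table, then SQL queries that table) can be sketched as follows. This is a stand-in, not the Pig and Spark integration itself: Python's built-in `sqlite3` plays the role of the Spark table, and the log format, the `munge` and `analyze` functions and the `visits` table are all hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical raw input: "username,path,status", with some noise.
RAW_LOG = [
    "alice,/home,200",
    "bob,/cart,500",
    "  alice,/cart,200 ",
    "malformed line",
]


def munge(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Data munging: trim whitespace, drop malformed rows, cast the status."""
    rows = []
    for line in lines:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip rows that do not match the expected shape
        rows.append((parts[0], parts[1], int(parts[2])))
    return rows


def analyze(rows: List[Tuple[str, str, int]]) -> List[Tuple[str, int]]:
    """Data analysis: store the clean rows as a table and query it with SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for a Spark table
    try:
        conn.execute(
            "CREATE TABLE visits (username TEXT, path TEXT, status INTEGER)"
        )
        conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
        return conn.execute(
            "SELECT username, COUNT(*) AS hits FROM visits "
            "WHERE status = 200 GROUP BY username ORDER BY hits DESC, username"
        ).fetchall()
    finally:
        conn.close()
```

With the sample input, `analyze(munge(RAW_LOG))` counts successful requests per user once the noisy rows have been filtered out.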

Page 15

Spark Table (bank)

Pig Latin

SQL

Page 16

Integrate Spark into Pig

Pig-Latin → Logical Plan → Physical Plan → Execution Plan → Execution Engine

Page 17

Where to run Pig-Latin & Spark SQL (Zeppelin)

Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

Page 18

Zeppelin Architecture

• Zeppelin Server (JVM)

• Pig Interpreter Group (JVM): Pig-Latin, Spark SQL

• Spark Interpreter Group (JVM): Scala, Python, R

Page 19

Demo

Page 20

Data Science Infrastructure (Recap)

Page 21

Current Status & What's Next

• Status
  – PIG-5080 (Support store alias as Spark table)
  – ZEPPELIN-2232 (Support Spark SQL for Pig interpreter)

• Next
  – Integrate Spark MLlib in Pig
  – Use the DataFrame API instead of the RDD API to integrate Spark with Pig
  – Integrate Pig with other Spark APIs, like R and Python

Page 22

Q&A

Page 23

Thank You

