Spark SQL + Pig-Latin: Combine Query Language and Data Flow Language for Data Science
Jeff Zhang (zjff[email protected]), May 16, 2017
© Hortonworks Inc. 2011 – 2016. All Rights Reserved

May 27, 2020



Who am I

• ASF Member, has worked in the ASF for almost 8 years
• Committer of Apache Tez, Pig & Zeppelin
• Works at Hortonworks


Data Science

Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.

• Describe what happens
• Explain what happens
• Predict what will happen


Data Science

[Diagram: Collect Data, Data Munging, Data Analysis, Insight, Product; labelled "online" and "offline"]


Data Munging

• Collect and Transform Server Log
  – User Agent Normalization
  – Robot Detection
  – Sessionize
• Move data from Database to HDFS
• Collect and Transform Social Media Data
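To make the "Sessionize" step concrete, here is a minimal sketch in plain Python. It is not the speaker's implementation: the 30-minute inactivity threshold, the `sessionize` helper and the sample records are all assumptions chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Assumption: a session ends after 30 minutes of inactivity (not from the slides).
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Assign a per-user session number to (user_id, timestamp) events.

    `events` is an iterable of (user_id, unix_timestamp) pairs in any order.
    Returns (user_id, unix_timestamp, session_index) tuples sorted by user
    and time; session_index starts at 0 for each user.
    """
    result = []
    ordered = sorted(events, key=itemgetter(0, 1))
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        session_index = 0
        previous_ts = None
        for _, ts in user_events:
            # Start a new session when the gap since the last hit is too long.
            if previous_ts is not None and ts - previous_ts > gap:
                session_index += 1
            result.append((user_id, ts, session_index))
            previous_ts = ts
    return result


# Hypothetical log records: (user_id, unix_timestamp)
sample_log = [
    ("alice", 1000),
    ("bob", 1200),
    ("alice", 1600),  # 10 minutes after alice's first hit: same session
    ("alice", 5000),  # more than 30 minutes later: new session
]
sessions = sessionize(sample_log)
```

On a real cluster the same sort-then-scan logic would typically run per user group inside the processing engine rather than in a single Python process.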


Data Munging

[Figure: before data munging vs. after data munging]


Data Analysis

• Combine different sources of data and apply statistical methods and BI tools to get insight
  – Web Traffic Metrics
  – User Segmentation Analysis
  – A/B Test
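As one possible illustration of the "A/B Test" item, the sketch below compares two conversion rates with a two-proportion z-test using only the Python standard library. The choice of test, the function name `two_proportion_z_test` and the visitor counts are assumptions; the slides do not specify a method.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p_value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: conversions and visitors for variants A and B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
```

A small p-value suggests the difference between the variants is unlikely to be noise, subject to the usual assumptions of the test (independent visitors, reasonably large samples).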


Data Munging vs Data Analysis

              Data Munging                    Data Analysis
Data Source   Messy                           Clean
              Structured/Unstructured         Structured
              Unorganized                     Organized
Stability     Regular, Stable                 Ad-hoc
Tools         Python, Spark, Hadoop, etc.     R, Python, SQL, etc.

Do you have to be a full-stack big data engineer to do data science?

What if you are a data analyst without much programming skill?


Data Science Infrastructure


What is Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads.


What is Apache Pig

• Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its jobs in MapReduce, Apache Tez, or Apache Spark.
  – Ease of programming
  – Optimization opportunities
  – Extensibility


Word Count

Load → ForEach → Group → ForEach → Order → Store

Using SQL?
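The word-count data flow can be mirrored step by step in plain Python. This is only a sketch of the same sequence of operators, not Pig Latin itself: the tokenizer regex, the tie-breaking rule and the input lines are assumptions.

```python
import re
from collections import Counter


def word_count(lines):
    """Mirror ForEach -> Group -> ForEach -> Order as plain Python steps."""
    # ForEach: split every line into lowercase words (tokenizer is an assumption).
    words = (w for line in lines for w in re.findall(r"[a-z0-9']+", line.lower()))
    # Group + ForEach: group identical words and count each group.
    counts = Counter(words)
    # Order: highest count first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Load: a hypothetical in-memory input instead of a file on HDFS.
lines = ["Pig Latin is a data flow language", "SQL is a query language"]
# Store: here the ordered result is simply kept in a variable.
result = word_count(lines)
```

Each statement corresponds to one stage of the flow, which is the readability argument made for data flow languages: the script reads top to bottom in execution order.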


Pig-Latin vs SQL

                SQL                                Pig-Latin
Language Type   Query Language                     Data Flow Language
                • de facto standard                • more readable for long scripts
                • unreadable for long scripts
Data Source     Structured Data                    Structured/Unstructured
Integration     Integrated with most BI tools      Very few BI tools integrated with Pig-Latin

Conclusion
• Pig-Latin for Data Munging
• SQL for Data Analysis


Pig-Latin + Spark SQL

[Diagram labels: Load, Store (Data Munging); Spark DataFrame / Table; Spark SQL (Data Analysis)]
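The division of labour in this slide (data-flow code for munging, a table as the hand-off point, SQL for analysis) can be sketched with the Python standard library. Note the stand-ins: `sqlite3` plays the role of the Spark table and a generator plays the role of the Pig Latin script. The table name `bank` is borrowed from the deck; its columns and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw records "name;age;balance", including noise to clean up.
raw_rows = ["alice;34;1200", "bob;n/a;300", "  carol;41;2500 ", "", "dave;29;-50"]


def munge(rows):
    """Data-flow style cleanup: trim fields, drop malformed rows, convert types."""
    for row in rows:
        parts = [p.strip() for p in row.split(";")]
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skip empty or malformed records
        try:
            balance = int(parts[2])
        except ValueError:
            continue
        yield parts[0], int(parts[1]), balance


# "Store" the cleaned relation as a table; sqlite3 stands in for a Spark table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (name TEXT, age INTEGER, balance INTEGER)")
conn.executemany("INSERT INTO bank VALUES (?, ?, ?)", munge(raw_rows))

# Data analysis: an ad-hoc SQL query over the cleaned table.
query = "SELECT COUNT(*), AVG(balance) FROM bank WHERE age >= 30"
row_count, avg_balance = conn.execute(query).fetchone()
```

The analyst only needs the table name and SQL; the messy parsing stays in the munging step.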


Spark Table (bank)

[Screenshots: Pig Latin, SQL]


Integrate Spark into Pig

[Diagram: Pig-Latin → Logical Plan → Physical Plan → Execution Plan → Execution Engine]


Where to run Pig-Latin & Spark SQL (Zeppelin)

Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.


Zeppelin Architecture

[Diagram: three JVM processes]
• JVM: Zeppelin Server
• JVM: Pig Interpreter Group (Pig Latin, Spark SQL)
• JVM: Spark Interpreter Group (Scala, Python, R)


Demo


Data Science Infrastructure (Recap)


Current Status & What's Next

• Status
  – PIG-5080 (Support store alias as Spark table)
  – ZEPPELIN-2232 (Support Spark SQL for Pig Interpreter)
• Next
  – Integrate Spark MLlib in Pig
  – Use DataFrame API instead of RDD API to integrate Spark with Pig
  – Integrate Pig with other Spark APIs, like R, Python


Q&A


Thank You