Combine Query Language and Data Flow Language for Data Science · Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data

© Hortonworks Inc. 2011–2016. All Rights Reserved

Spark SQL + Pig-Latin: Combine Query Language and Data Flow Language for Data Science

Jeff Zhang ([email protected]), May 16, 2017


Who am I

•  ASF Member, has worked in the ASF for almost 8 years
•  Committer of Apache Tez, Pig & Zeppelin
•  Works at Hortonworks


Data Science

Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.

•  Describe what happens
•  Explain what happens
•  Predict what will happen


Data Science

Collect Data → Data Munging → Data Analysis → Insight → Product

online / offline


Data Munging

•  Collect and transform server logs
   –  User agent normalization
   –  Robot detection
   –  Sessionize
•  Move data from database to HDFS
•  Collect and transform social media data
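
The "Sessionize" step above can be illustrated with a minimal sketch in plain Python. It is not taken from the talk: the 30-minute inactivity threshold, the `sessionize` function name and the `(user_id, timestamp)` event shape are all assumptions chosen for the example.

```python
from collections import defaultdict

# Assumption: a session ends after 30 minutes of inactivity; the slide does
# not state a threshold.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp) events into per-user sessions.

    Returns a dict mapping each user to a list of sessions, where each
    session is a list of timestamps in ascending order.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])  # gap too long: start a new session
            else:
                user_sessions[-1].append(ts)
        sessions[user_id] = user_sessions
    return sessions


# Hypothetical log events: (user_id, unix_timestamp)
events = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
result = sessionize(events)
```

On real server logs the same grouping would run on a distributed engine; the sketch only shows the logic of splitting one user's clicks wherever the gap between them exceeds the threshold.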


Data Munging

Before Data Munging / After Data Munging


Data Analysis

•  Combine different sources of data and apply statistical methods and BI tools to get insight
   –  Web traffic metrics
   –  User segmentation analysis
   –  A/B test


Data Munging vs Data Analysis

             Data Munging                   Data Analysis
Data Source  Messy                          Clean
             Structured/Unstructured        Structured
             Unorganized                    Organized
Stability    Regular, Stable                Ad hoc
Tools        Python, Spark, Hadoop, etc.    R, Python, SQL, etc.

Do you have to be a full-stack big data engineer to do data science?

What if you are a data analyst without much programming skill?


Data Science Infrastructure


What is Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads.


What is Apache Pig

•  Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its jobs in MapReduce, Apache Tez, or Apache Spark.
   –  Ease of programming
   –  Optimization opportunities
   –  Extensibility


Word Count

Load → ForEach → Group → ForEach → Order → Store

Using SQL?
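
The Load, ForEach, Group, ForEach, Order, Store sequence above can be traced step by step in a short sketch. This is plain Python rather than Pig-Latin, so it only mirrors the shape of the data flow; the `word_count` function and the sample input are hypothetical, and the alphabetical tie-break in the ordering is an assumption.

```python
from collections import defaultdict


def word_count(lines):
    # Load: `lines` stands in for the loaded input, one string per record.
    # ForEach: split every line into words (one output record per word).
    words = [word for line in lines for word in line.split()]

    # Group: collect identical words together.
    groups = defaultdict(list)
    for word in words:
        groups[word].append(word)

    # ForEach: replace each group with a (word, count) pair.
    counts = [(word, len(members)) for word, members in groups.items()]

    # Order: highest count first, ties broken alphabetically (an assumption).
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical input. A Store step would write `ordered` out instead of
# keeping it in memory.
ordered = word_count(["pig and spark", "spark sql and pig", "spark"])
```

Each intermediate result has a name, which is what makes a long data flow script easy to read and debug one step at a time.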


Pig-Latin vs SQL

               SQL                               Pig-Latin
Language Type  Query language                    Data flow language
               –  de facto standard              –  more readable for long scripts
               –  unreadable for long scripts
Data Source    Structured data                   Structured/Unstructured
Integration    Integrated with most BI tools     Very few BI tools integrated with Pig-Latin

Conclusion
•  Pig-Latin for data munging
•  SQL for data analysis
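
To make the query-language side of this comparison concrete, here is a minimal sketch of a word count expressed as one declarative SQL statement. It uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, and the `words` table and its contents are hypothetical. Note that the input already has one clean word per row, which is exactly the state that data munging is meant to produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, already-cleaned input: one word per row.
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [(w,) for w in "pig and spark spark sql and pig spark".split()],
)

# The whole analysis is a single statement: no named intermediate steps.
rows = conn.execute(
    "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word ORDER BY cnt DESC, word"
).fetchall()
conn.close()
```

The statement is compact for a short ad hoc question, but every additional cleaning step would have to be nested inside it, which is where long SQL scripts become hard to read.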


Pig-Latin + Spark SQL

[Diagram labels: Load, Store, Spark DataFrame Table, Spark SQL, Data Munging, Data Analysis]
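
The hand-off this slide describes (munge the data in a data flow step, store the result as a table, then analyze it with SQL) can be sketched without a cluster. The example below is an illustration under stated assumptions, not the Pig or Spark integration itself: a Python generator plays the role of the munging script, an in-memory SQLite table plays the role of the Spark table, and the record layout (`age;job;balance`) is hypothetical.

```python
import sqlite3

# Hypothetical raw records, loosely modelled on a "bank" table: age;job;balance
raw_lines = [
    "30;admin;1200",
    " 45 ; TECHNICIAN ; 300 ",
    "bad record",
    "52;admin;800",
]


def munge(lines):
    """Data munging step: parse, normalise, and drop malformed records."""
    for line in lines:
        parts = [p.strip() for p in line.split(";")]
        if len(parts) != 3:
            continue  # skip records that do not have exactly three fields
        age, job, balance = parts
        yield int(age), job.lower(), int(balance)


# "Store" the cleaned records as a table that SQL can see.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (age INTEGER, job TEXT, balance INTEGER)")
conn.executemany("INSERT INTO bank VALUES (?, ?, ?)", munge(raw_lines))

# Data analysis step: an ad hoc SQL query over the cleaned table.
avg_balance = conn.execute(
    "SELECT job, AVG(balance) FROM bank GROUP BY job ORDER BY job"
).fetchall()
conn.close()
```

The point of the split is that the analyst writing the final query never sees the messy input: the table is the contract between the two stages.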


Spark Table (bank)

Pig Latin

SQL


Integrate Spark into Pig

Pig-Latin → Logical Plan → Physical Plan → Execution Plan → Execution Engine


Where to run Pig-Latin & Spark SQL (Zeppelin)

Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.


Zeppelin Architecture

•  Zeppelin Server (JVM)
•  Pig Interpreter Group (JVM): Pig-Latin, Spark SQL
•  Spark Interpreter Group (JVM): Scala, Python, R


Demo


Data Science Infrastructure (Recap)


Current Status & What's Next

•  Status
   –  PIG-5080 (Support store alias as Spark table)
   –  ZEPPELIN-2232 (Support Spark SQL for Pig interpreter)
•  Next
   –  Integrate Spark MLlib in Pig
   –  Use DataFrame API instead of RDD API to integrate Spark with Pig
   –  Integrate Pig with other Spark APIs, like R, Python


Q & A


Thank You
