Processing Big Data with Pentaho
Rakesh Saha, Pentaho Senior Product Manager, Hitachi Vantara

1 Processing Big Data - PentahoWorld 2017 · Pentaho’s Latest and Upcoming Features for Processing Big Data –Batch or Real-time. ... Kafka, JMS, MQTT Machine Learning R, Python

Apr 22, 2018

Page 1

Processing Big Data with Pentaho

Rakesh Saha, Pentaho Senior Product Manager, Hitachi Vantara

Page 2

Agenda

• Process big data visually in a future-proof way – Demo

• Combine stream data processing with batch – Demo

Pentaho’s Latest and Upcoming Features for Processing Big Data – Batch or Real-time

Page 3

Big Data Processing is HARD

“Through 2018, 70% of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges.” – GARTNER (1)

(1) Gartner Analyst, Nick Heudecker; infoworld.com, Sept 2015

1 New Skills Necessary

2 High Effort and Risk

3 Continuous Change

Page 4

Big Data Integration and Analytics Workflow with Pentaho

Big Data Challenges

• Processing semi/un/structured data

• Blending big data with traditional data

• Maintaining security, governance of data

• Processing streaming data in real time and historically

• Enabling and operationalizing data science

Diagram labels: Sensor; Big or small data; Msg Queue (Kafka, JMS, MQTT); Stream; Pentaho Data Integration; Data Lake; Analytic Database; Machine Learning (R, Python); Pentaho Analyzer; Pentaho Reporting; LOB Applications; Embedded; Pentaho Data Integration; Feedback Loop

Page 5

Process Big Data Visually in a Future Proof Way

Page 6

Visual Big Data Processing with Pentaho

• What: Visually ingest and process Big Data at enterprise scale

• What's Special: Visually develop once and execute on any engine with Adaptive Execution Layer (AEL)

• Why
  – Difficult to find qualified developers
  – Difficult to keep up with new technologies

• Available since Pentaho 7.1

Page 7

Adaptive Execution of Big Data

Build Once, Execute on Any Engine

Challenge: With rapidly changing big data technology, coding on various engines can be time-consuming or impossible with existing resources

Solution: Future-proof data integration and analytics development in a drag-and-drop visual development environment, eliminating the need for specialized coding and API knowledge. Seamlessly switch between execution engines to fit data volume and transformation complexity

Diagram labels: PDI; Pentaho Kettle
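
The "build once, execute on any engine" idea can be illustrated with a small sketch. This is not AEL or any Pentaho API: the step functions, the two toy engines, and the `execute` helper are all hypothetical names invented for the example. It only shows the general pattern of keeping the pipeline definition separate from the engine that runs it, so the engine can be swapped at run time without touching the logic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# A "transformation" here is just an ordered list of row-level steps.
# All names are illustrative and are not part of any Pentaho API.
Step = Callable[[dict], dict]


def add_total(row: dict) -> dict:
    """Derive a total from quantity and unit price."""
    return {**row, "total": row["qty"] * row["price"]}


def tag_large(row: dict) -> dict:
    """Flag orders whose total is at least 100."""
    return {**row, "large": row["total"] >= 100}


PIPELINE: List[Step] = [add_total, tag_large]


def apply_steps(row: dict, steps: Iterable[Step]) -> dict:
    """Run one row through every step in order."""
    for step in steps:
        row = step(row)
    return row


def run_local(rows: List[dict], steps: List[Step]) -> List[dict]:
    """Single-threaded engine: process rows one after another."""
    return [apply_steps(row, steps) for row in rows]


def run_parallel(rows: List[dict], steps: List[Step], workers: int = 4) -> List[dict]:
    """Parallel engine: same steps, rows spread across a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(lambda row: apply_steps(row, steps), rows))


ENGINES = {"local": run_local, "parallel": run_parallel}


def execute(engine: str, rows: List[dict], steps: List[Step] = PIPELINE) -> List[dict]:
    """Pick an engine by name at run time; the pipeline itself is unchanged."""
    return ENGINES[engine](rows, steps)
```

Calling `execute("local", rows)` during development and `execute("parallel", rows)` later yields the same rows, which is the property the slide describes for switching engines.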

Page 8

Adaptive Execution for Spark

Process Big Data Faster on Spark Without Any Coding

Challenge: Finding the talent and time to work with Spark and newer big data technologies

Solution: More easily develop big data applications in PDI using adaptive execution to ingest, process and blend data from a range of big data sources and scale on Spark clusters

Diagram labels: PDI; Pentaho Kettle

Page 9

Upcoming Enhanced Adaptive Execution Layer

• Simplified Setup
  – Fewer steps to set up
  – Easy to configure fail-over, load-balancing

• Development productivity
  – Robust transformation error and status reporting
  – Customization of Spark jobs

• Robust Enterprise Security
  – Client to AEL connection can be secured
  – End-to-end Kerberos impersonation from client tool to cluster

Diagram labels: PDI Client; AEL-Spark Daemon (Edge Nodes); AEL-Spark Engine (Spark Driver); Spark Executors; Spark/Hadoop Processing Nodes; HADOOP CLUSTER; Hadoop/Spark Compatible Storage Cluster (HDFS, Azure Storage, Amazon S3, etc.)

Page 10

Upcoming Big Data File Format Handling

Big Data platforms introduced various data formats to improve performance, compression, and interoperability

What:

• Visual handling of data files with Big Data formats Parquet and Avro
  – Reading and writing files with specific steps
  – Natively execute in Spark via AEL

Why:

• Ease of development of Big Data processing

• Performance improvement due to avoidance of intermediate formats
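
The slide does not explain how these formats achieve better compression. As background, Parquet stores data column by column (Avro, by contrast, is row-oriented), and a column of similar values tends to compress well. The toy sketch below illustrates that one idea with a column pivot and run-length encoding. It is a simplified illustration using hypothetical helper names, not how Parquet is actually encoded on disk and not how the PDI steps read or write files.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def rows_to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot row records into one list per field (a columnar layout)."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def run_length_decode(pairs: List[Tuple[object, int]]) -> list:
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]
```

With web-log rows sorted by store, a `store` column such as `["A", "A", "B"]` encodes to two pairs instead of three values, and decoding restores the original column exactly.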

Page 11

Demonstration

Page 12

Retail Web Log Data Processing with Pentaho

• Run within Spoon via Pentaho during development and then use Spark cluster for production

• Lookups, sort, Parquet file in/out, and other steps to test parallel and serial processing within Spark cluster

Page 13

Combine Stream Processing with Batch Processing

Page 14

What is Stream Data Processing? And Why?

• Batch data processing is useful, but sometimes businesses need to obtain crucial insights faster and act on them

• Many use cases must consider data 2+ times: on the wire, and then subsequently as historical data

• Get crucial time-sensitive insights
  – React to customer interactions on a website or mobile app
  – Predict risk of equipment breakdown before it happens

Former POV “secure data in DW, then OLAP ASAP afterward” gives way to

Current POV “analyze on the wire, write behind”
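
A minimal sketch of "analyze on the wire, write behind", assuming a stream of sensor temperature readings. The event shape, the `TEMP_LIMIT` threshold, and both function names are hypothetical and chosen only for illustration. Each reading is looked at twice: once as it arrives, and again later as part of the stored history.

```python
from typing import Iterable, List, Tuple

# Hypothetical threshold and event shape, chosen only for this sketch.
TEMP_LIMIT = 90.0


def analyze_on_the_wire(
    readings: Iterable[Tuple[str, float]],
    history: List[Tuple[str, float]],
) -> List[str]:
    """First look: react to each reading as it arrives, then write it behind."""
    alerts: List[str] = []
    for sensor, temperature in readings:
        if temperature > TEMP_LIMIT:
            alerts.append(sensor)              # act immediately
        history.append((sensor, temperature))  # keep for later batch analysis
    return alerts


def average_by_sensor(history: List[Tuple[str, float]]) -> dict:
    """Second look: batch analysis over the stored history."""
    sums: dict = {}
    counts: dict = {}
    for sensor, temperature in history:
        sums[sensor] = sums.get(sensor, 0.0) + temperature
        counts[sensor] = counts.get(sensor, 0) + 1
    return {sensor: sums[sensor] / counts[sensor] for sensor in sums}
```

In a real deployment the in-memory `history` list would be replaced by a durable store; the list is used here only to keep the sketch self-contained.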

Page 15

NEW Stream Data Processing with Pentaho

• Visually ingest and produce data from/to Kafka using NEW steps

• Process micro-batch chunks of data using either a time-based or a message size-based window

• Switch processing engines between Spark (Streaming) and Native Kettle

• Harden stream processing libraries and steps to process data from traditional message queues

• Benefits:
  – Lower the bar to build streaming applications
  – Enable combining batch and stream data processing
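
The micro-batch windowing mentioned above can be sketched as a generator that closes a batch when either a message-count limit or a time limit is reached. This is an illustration of the general technique, not Pentaho's implementation: the function name and the `max_messages`, `max_seconds`, and `clock` parameters are assumptions made for the example.

```python
import time
from typing import Callable, Iterable, Iterator, List


def micro_batches(
    messages: Iterable[str],
    max_messages: int = 100,
    max_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[List[str]]:
    """Yield lists of messages, closing a batch on whichever limit hits first.

    The time limit is only checked when a message arrives, so an idle stream
    does not flush a partial batch until the next message or the end of input.
    """
    batch: List[str] = []
    started = clock()
    for message in messages:
        if not batch:
            started = clock()  # the window opens with its first message
        batch.append(message)
        if len(batch) >= max_messages or clock() - started >= max_seconds:
            yield batch
            batch = []
    if batch:
        yield batch  # flush whatever is left when the stream ends
```

The `clock` parameter exists so the time-based window can be exercised deterministically with a fake clock; in normal use the default monotonic clock is enough.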

Page 16

How to Process Stream Data in Pentaho

• Steps for Kafka ingestion and publish
  – Kafka Consumer
  – Kafka Producer

• Steps for stream processing
  – Get records from stream

• Ingest and process continuous stream of data in near real-time in parent transformation

• Process micro-batch of stream data in separate child transformation
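
The parent/child split described above might look like the following sketch. Everything here is a stand-in: in-memory queues play the role of Kafka topics, plain Python functions play the role of PDI transformations, and the `SENTINEL` end-of-stream marker exists only so the example terminates. None of these names come from Pentaho.

```python
import queue
from typing import Callable, List

# Hypothetical stand-ins: in-memory queues play the role of Kafka topics,
# and plain functions play the role of PDI transformations.
SENTINEL = None  # marks end of stream in this sketch only


def child_transformation(batch: List[dict]) -> dict:
    """Process one micro-batch: total sales amount per store."""
    totals: dict = {}
    for event in batch:
        totals[event["store"]] = totals.get(event["store"], 0) + event["amount"]
    return totals


def parent_transformation(
    source: "queue.Queue",
    sink: "queue.Queue",
    child: Callable[[List[dict]], dict],
    batch_size: int = 2,
) -> None:
    """Consume events continuously, hand each full batch to the child,
    and publish the child's result."""
    batch: List[dict] = []
    while True:
        event = source.get()
        if event is SENTINEL:
            break
        batch.append(event)
        if len(batch) == batch_size:
            sink.put(child(batch))
            batch = []
    if batch:
        sink.put(child(batch))  # flush the final partial batch
```

Keeping the per-batch logic in its own function mirrors the slide's separation: the parent only deals with the stream and the window, while the child sees a finite set of rows and can be written like an ordinary batch job.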

Page 17

Combined Data Processing Using Spark & Pentaho

Stages: Ingest; Process; Publish; Reporting

DATA SOURCES: Web Clickstream and Other Logs; Traditional DB/DW and NoSQL Datastores; Traditional Message Bus; IoT Data; Kafka Cluster

HADOOP/SPARK CLUSTER: RT Data Processors; Batch Data Processors; Hadoop MR; HDFS; Data Store

Other diagram labels: Data Collector; Data Publisher; Micro Services; Kafka Cluster; Analytical Databases; Pentaho Analytics

Callouts (Pentaho DI):

• PDI collects data from sources including Kafka clusters

• PDI can process streaming data using Spark and Spark Streaming or Kettle engine in a completely visual way

• PDI can retrieve processed or blended data from Hadoop/Spark and publish to Kafka clusters or external databases

Page 18

Demonstration

Page 19

Retail Store Event Processing

• Can be run within Spoon via Pentaho or within AEL-Spark engine

• Utilizes Kafka in/out, Parquet out, and other steps to demonstrate stream data ingestion, window processing, and much more…

Page 20

Availability and Roadmap

Page 21

Availability

• Adaptive Execution Layer (AEL) and Spark-AEL available in Pentaho 7.1
  – Secure Spark integration, high availability, and security of AEL are EE only
  – Supported Hadoop distros: Cloudera CDH in Pentaho 7.1; Cloudera CDH and Hortonworks HDP in Pentaho 8.0

• Kafka steps and stream data processing available in Pentaho 8.0
  – Kafka from Cloudera and Hortonworks to be supported

Page 22

Roadmap

• Extending AEL to support other Spark distros and other data processing engines

• Advanced stream processing with other real-time messaging protocols and windowing mechanisms

• Enabling Big Data driven machine learning on batch or stream data

• Integrated with broader Hitachi Vantara portfolio

Page 23

SUMMARY: Visual Future-Proof Big Data Processing with Pentaho

Visually build stream data processing pipelines for different streaming engines

• Configure stream data processing logic

• Execute logic in multiple stream processing engines without rework

• Connect to streaming data sources

NEW in Pentaho
✓ Native Streaming in PDI
✓ Spark Streaming via AEL
✓ Kafka Connectivity

Leverage the power of Adaptive Execution to future-proof data processing pipelines

• Configure logic without coding

• Switch processing engines without rework

• Handle Big Data formats more efficiently

NEW in Pentaho
✓ Adaptive Execution Layer
✓ Visual Spark via AEL
✓ Native Big Data Format Handling

Page 24

Next Steps

Want to learn more?

• Meet-the-Experts:
  – Anthony DeShazor
  – Luke Nazarro
  – Carlo Russo

• Recommended Breakout Sessions:
  – Jonathan Jarvis: Understanding Parallelism with PDI and Adaptive Execution with Spark
  – Mark Burnette: Understanding the Big Data Technology Ecosystem
