Apr 13, 2017
Dataflow with Apache NiFi
Aldrin Piri - @aldrinpiri
Apache NiFi Crash Course
Hadoop Summit 2016 San Jose
29 June 2016
© Hortonworks Inc. 2011-2016. All Rights Reserved
Key: 'Apache NiFi' Value: 'PMC Member'
Key: 'Work' Value: 'Sr. Member of Technical Staff @ Hortonworks'
Key: 'Working with NiFi Since' Value: '2010'
Agenda
What is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
Let's Connect A to B
Producers, a.k.a. Things: Anything AND Everything
Internet!
Consumers: User, Storage, System, More Things
Moving data effectively is hard
Standards: http://xkcd.com/927/
Why is moving data effectively hard?
Standards, Formats, Exactly Once Delivery, Protocols, Veracity of Information, Validity of Information, Ensuring Security, Overcoming Security, Compliance, Schemas, Consumers Change, Credential Management, That [person|team|group], Network, Exactly Once Delivery
Let's Connect Lots of As to Bs to As to Cs to Bs to s to Cs to s
Let's consider the needs of a courier service
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Core Data Center at HQ
Server Cluster
On Delivery Routes
Trucks
Deliverers
Image credits:
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/
Deliverer: Rigo Peter, https://thenounproject.com/rigo/
Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/
Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
Great! I am collecting all this data! Let's use it!
Finding our needles in the haystack
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Kafka
Core Data Center at HQ
Server Cluster
Others
Storm / Spark / Flink / Apex
Kafka
Storm / Spark / Flink / Apex
On Delivery Routes
Trucks
Deliverers
Why is moving data effectively hard when scoped internally?
Standards, Formats, Exactly Once Delivery, Protocols, Veracity of Information, Validity of Information, Ensuring Security, Overcoming Security, Compliance, Schemas, Consumers Change, Credential Management, That [person|team|group], Network, Exactly Once Delivery
Let's Connect Lots of As to Bs to As to Cs to Bs to s to Cs to s
Oh, that courier service is global
Why is moving data effectively hard when scoped globally?
Standards, Formats, Exactly Once Delivery, Protocols, Veracity of Information, Validity of Information, Ensuring Security, Overcoming Security, Compliance, Schemas, Consumers Change, Credential Management, That [person|team|group], Network, Exactly Once Delivery
The Unassuming Line: A Case Study
We've seen a few lines show up in the wild thus far
Internet!
Inter- & intra-connections in our global courier enterprise
Spotlight: Arthur Lacte, https://thenounproject.com/turo/
Dataflow Line Anatomy 101
Let's dissect what this line typically represents
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
Script or Application
Script or Application
Data
Data
Disparate Transport Mechanisms
Dataflow Line Anatomy 201
Sometimes that transport is just more lines
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
Script or Application
Script or Application
Line Inception
Data
Data
Dataflow Line Anatomy 301
But those lines could also have components
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
Fig 2. Good Recursion Joke
NoSuchJokeException
footage not found
Apache NiFi
Key Features
Guaranteed delivery
Data buffering (backpressure, pressure release)
Prioritized queuing
Flow specific QoS (latency vs. throughput, loss tolerance)
Data provenance
Supports push and pull models
Recovery/recording a rolling log of fine-grained history
Visual command and control
Flow templates
Pluggable/multi-role security
Designed for extension
Clustering
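Two of the features listed above, data buffering with backpressure and prioritized queuing, can be illustrated with a small sketch. This is not NiFi code: the `Connection` class, its `backpressure_threshold` parameter, and the `offer`/`poll` method names are assumptions chosen for the example, and it only shows the general idea of a bounded buffer that refuses new work when full and hands out the most urgent item first.

```python
import heapq
import itertools


class Connection:
    """Hypothetical bounded, prioritized buffer between two processing steps."""

    def __init__(self, backpressure_threshold):
        self.backpressure_threshold = backpressure_threshold
        self._heap = []
        # Monotonic counter so items with equal priority leave in arrival order.
        self._order = itertools.count()

    def is_full(self):
        return len(self._heap) >= self.backpressure_threshold

    def offer(self, item, priority=0):
        """Queue an item; return False (backpressure) once the threshold is reached."""
        if self.is_full():
            return False
        heapq.heappush(self._heap, (priority, next(self._order), item))
        return True

    def poll(self):
        """Remove and return the most urgent item (lowest number), or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


conn = Connection(backpressure_threshold=2)
accepted = [
    conn.offer("routine telemetry", priority=5),
    conn.offer("delivery exception", priority=1),
    conn.offer("more telemetry", priority=5),  # refused: the buffer is at its threshold
]
first_out = conn.poll()
```

In this sketch the producer sees the refusal and has to decide what to do next (retry later, slow down, or drop), which is the essence of backpressure; the urgent item leaves the buffer first even though it arrived second.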
Apache NiFi Subproject: MiNiFi
Let me get the key parts of NiFi close to where data begins and provide bidirectional communication
NiFi lives in the data center. Give it an enterprise server or a cluster of them.
MiNiFi lives as close as possible to where data is born and is a guest on that device or system.
Let's revisit our courier service from the perspective of NiFi
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Kafka
Core Data Center at HQ
Server Cluster
Others
Storm / Spark / Flink / Apex
Kafka
Storm / Spark / Flink / Apex
On Delivery Routes
Trucks
Deliverers
Diagram labels: Client Libraries, MiNiFi, NiFi
Apache NiFi Managed Dataflow
SOURCES
REGIONAL INFRASTRUCTURE
CORE INFRASTRUCTURE
NiFi is based on Flow Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
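To make the FBP vocabulary concrete, here is a toy model of four of the terms: FlowFile, Processor, Connection, and Flow Controller. It is a sketch for illustration only, not how NiFi is implemented; all class and method names (`on_trigger`, `run_until_idle`, `max_queued`) and the two transform functions are assumptions of the example, and Process Groups are left out for brevity.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Information Packet: opaque content plus a map of attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Bounded Buffer: a queue linking two processors."""

    def __init__(self, max_queued=100):
        self.max_queued = max_queued
        self.queue = deque()

    def has_room(self):
        return len(self.queue) < self.max_queued


class Processor:
    """Black Box: pulls from an inbound connection, pushes to an outbound one."""

    def __init__(self, name, transform, inbound, outbound):
        self.name = name
        self.transform = transform
        self.inbound = inbound
        self.outbound = outbound

    def on_trigger(self):
        """Handle one FlowFile if input is waiting and the output has room."""
        if not self.inbound.queue or not self.outbound.has_room():
            return False
        flow_file = self.inbound.queue.popleft()
        self.outbound.queue.append(self.transform(flow_file))
        return True


class FlowController:
    """Scheduler: knows the processors and decides when each one runs."""

    def __init__(self, processors):
        self.processors = processors

    def run_until_idle(self):
        # The list comprehension triggers every processor on each pass.
        while any([p.on_trigger() for p in self.processors]):
            pass


def to_upper(ff):
    return FlowFile(ff.content.upper(), {**ff.attributes, "case": "upper"})


def tag_size(ff):
    return FlowFile(ff.content, {**ff.attributes, "size": str(len(ff.content))})


source, middle, sink = Connection(), Connection(), Connection()
source.queue.append(FlowFile(b"hello", {"filename": "a.txt"}))
controller = FlowController([
    Processor("UpperCase", to_upper, source, middle),
    Processor("TagSize", tag_size, middle, sink),
])
controller.run_until_idle()
result = sink.queue[0]
```

Because each processor only talks to its connections, the two steps can run at different rates, and either one can be swapped out without touching the other.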
FlowFiles & Data Agnosticism
NiFi is data agnostic!
But NiFi was designed with the understanding that users can care about specifics, and it provides tooling to interact with specific formats, protocols, etc.
ISO 8601 - http://xkcd.com/1179/
Robustness principle: Be conservative in what you do, be liberal in what you accept from others
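As a small illustration of the robustness principle applied to dates, the sketch below accepts a few input formats and always emits ISO 8601 in UTC. The list of accepted formats and the choice to treat timestamps without an offset as UTC are assumptions made for the example, not something the slides prescribe.

```python
from datetime import datetime, timezone

# Hypothetical set of formats that upstream producers might send.
ACCEPTED_FORMATS = (
    "%Y-%m-%dT%H:%M:%S%z",  # ISO 8601 with a numeric UTC offset
    "%Y-%m-%d %H:%M:%S",    # database style, no offset (assumed UTC)
    "%m/%d/%Y %H:%M",       # US style, no offset (assumed UTC)
)


def normalize_timestamp(raw):
    """Be liberal in what is accepted, conservative in what is emitted."""
    text = raw.strip()
    for fmt in ACCEPTED_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next known format
        if parsed.tzinfo is None:
            # Assumption: timestamps without an offset are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc).isoformat()
    raise ValueError(f"unrecognized timestamp: {raw!r}")


from_iso = normalize_timestamp("2016-06-29T09:30:00-0700")
from_db = normalize_timestamp(" 2016-06-29 09:30:00 ")
from_us = normalize_timestamp("06/29/2016 09:30")
```

Inputs that match none of the known formats are rejected with an error rather than guessed at, so downstream consumers only ever see one format.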
FlowFiles are like HTTP data
HTTP Data
Header:
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
ETag: "45b6-834-49130cc1182c0"
Accept-Ranges: bytes
Content-Length: 13
Connection: close
Content-Type: text/html
Content:
Hello world!

FlowFile
Header:
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'lineageStartDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize' Value: '23609'
FlowFile Attribute Map Content
Key: 'filename' Value: '15650246997242'
Key: 'path' Value: './'
Content:
Binary Content *
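The header/content analogy can be sketched in a few lines: split an HTTP response into its header fields and body, then carry the same information as an attribute map plus opaque content bytes. This is only an illustration of the analogy. The dictionary layout and the `http.status` and `http.content.type` attribute names are assumptions of the example, while `filename`, `path`, and `fileSize` mirror the attribute keys shown on the slide.

```python
def split_http_message(raw):
    """Split a raw HTTP response into (status_line, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return lines[0], headers, body


def to_flowfile(raw, filename):
    """Model a FlowFile as an attribute map plus opaque content bytes."""
    status_line, headers, body = split_http_message(raw)
    attributes = {
        "filename": filename,
        "path": "./",
        "fileSize": str(len(body)),
        # Hypothetical attribute names for metadata carried over from HTTP.
        "http.status": status_line,
    }
    if "Content-Type" in headers:
        attributes["http.content.type"] = headers["Content-Type"]
    # The content is never parsed here: the flow stays data agnostic.
    return {"attributes": attributes, "content": body}


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 13\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"Hello world!\n"
)
flow_file = to_flowfile(raw, "15650246997242")
```

Routing and prioritization decisions can then be made from the attribute map alone, without ever opening the content.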
Extension / Integration Points
NiFi Term | Description
FlowFile Processor | Push/Pull behavior. Custom UI
Reporting Task | Used to push data from NiFi to some external service (metrics, provenance, etc.)
Controller Service | Used to enable reusable components / shared services throughout the flow
REST API | Allows clients to connect to pull information, change behavior, etc.
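The relationship between three of these extension points can be sketched with plain classes: a shared service used by several processors, and a reporting task that pushes metrics outward. None of this is the NiFi extension API; every class and method name below is hypothetical and exists only to show how the roles fit together.

```python
class RegionLookupService:
    """Controller Service stand-in: one shared resource used by many processors."""

    def __init__(self, regions_by_store):
        self._regions = dict(regions_by_store)

    def region_for(self, store_id):
        return self._regions.get(store_id, "unknown")


class EnrichWithRegion:
    """Processor stand-in that relies on the shared lookup service."""

    def __init__(self, name, lookup):
        self.name = name
        self.lookup = lookup
        self.processed = 0

    def process(self, attributes):
        self.processed += 1
        enriched = dict(attributes)
        enriched["region"] = self.lookup.region_for(attributes.get("store"))
        return enriched


class ProcessedCountReport:
    """Reporting Task stand-in: pushes per-processor counts to an external sink."""

    def __init__(self, processors, sink):
        self.processors = processors
        self.sink = sink  # a plain list here; a metrics system in practice

    def report(self):
        for processor in self.processors:
            self.sink.append((processor.name, processor.processed))


lookup = RegionLookupService({"store-17": "us-west"})
store_flow = EnrichWithRegion("StoreEvents", lookup)
truck_flow = EnrichWithRegion("TruckEvents", lookup)
enriched = store_flow.process({"store": "store-17"})
unmatched = truck_flow.process({"truck": "t-42"})
metrics = []
ProcessedCountReport([store_flow, truck_flow], metrics).report()
```

Both processors reuse a single lookup instance, which is the point of a shared service: configure it once and reference it wherever the flow needs it.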
OS/Host
JVM
Flow Controller
Web Server
Processor 1
Extension N
FlowFile Repository
Content Repository
Provenance Repository
Local Storage
OS/Host
JVM
Flow Controller
Web Server
Processor 1
Extension N
FlowFile Repository
Conte