Apache NiFi: Better Analytics Demand Better Dataflow
Presented by: Joe Witt, Apache NiFi PPMC Member
Aug 04, 2015
@apachenifi
History
• Developed at NSA for over eight years
• Donated to the Apache Software Foundation Nov 2014
• Undergoing incubation
• Three ASF releases to date
• Closing in on 0.2.0 release
The problem space: Enterprise Dataflow
Automate the flow of data from any source
…to systems which extract meaning and insight
…and to those that store and make it available for users
Use Cases for NiFi
• Remote sensor delivery
• Inter-site / global distribution
• Intra-site distribution
• ‘Big Data’ ingest
• Data Processing (enrichment, filtering, sanitization)
The challenges we faced
• Transport / Messaging was not enough
• Needed to understand the big picture
• Needed the ability to make *immediate* changes
• Must maintain chain of custody for data
• Rigorous security and compliance requirements
Why were transport and messaging not enough?
• Data access exceeded resources to transport
• Decoupling systems is about more than the connectivity
• Message sizes ranged from B to GB
• Not all data is created equal
• Needed precise security controls
  • SSL and topic-level authorization insufficient
Apache NiFi Foundational Concepts
1. The basic building blocks
2. Real-time Command and Control
3. The Power of Provenance
Flow File
• Header: UUID, Name, Size, Entry Time
• Attributes: map of [Key | Value] pairs
• Content
  • Types: Events, Objects, Files, Messages, Media
  • Formats: JSON, Avro, Text, MP4, Proprietary
  • Sizes: Bytes to GBs
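The Flow File layout above (header, attribute map, content) can be sketched as a small data structure. This is an illustrative Python model only, not NiFi's Java API; the class name, field names, and the `with_attribute` helper are assumptions chosen to mirror the slide.

```python
import time
from dataclasses import dataclass, field, replace
from uuid import uuid4


@dataclass(frozen=True)
class FlowFile:
    """Hypothetical model of a flow file: header fields, attributes, content."""
    content: bytes = b""
    name: str = "unnamed"
    attributes: dict = field(default_factory=dict)
    # Header fields filled in when the flow file is created.
    uuid: str = field(default_factory=lambda: str(uuid4()))
    entry_time: float = field(default_factory=time.time)

    @property
    def size(self) -> int:
        # Size is derived from the content, whatever its type or format.
        return len(self.content)

    def with_attribute(self, key: str, value: str) -> "FlowFile":
        # Return a new flow file; identity and content are carried over unchanged.
        return replace(self, attributes={**self.attributes, key: value})


ff = FlowFile(content=b'{"sensor": 7}', name="reading.json")
tagged = ff.with_attribute("mime.type", "application/json")
```

Keeping the content opaque (`bytes`) is what lets the same structure carry events, files, or media of any format.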
Flow File Processor
• Routing
  • Context
  • Content
• Transformation
  • Enrich
  • Obfuscate
  • Filter
  • Convert
  • Analyze
  • Split
  • Aggregate
• Mediation
  • Push / Pull
  • …
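As a rough illustration of routing on context and a simple enrichment step, the sketch below treats a flow file as a plain dictionary. The `route` and `enrich` functions, the rule format, and the relationship names are hypothetical and are not taken from NiFi.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical flow file shape for this sketch: attributes plus raw content.
FlowFile = Dict[str, object]
Rule = Tuple[str, Callable[[Dict[str, str]], bool]]


def route(flow_file: FlowFile, rules: List[Rule]) -> str:
    """Routing on context: first relationship whose predicate matches the attributes."""
    for relationship, predicate in rules:
        if predicate(flow_file["attributes"]):
            return relationship
    return "unmatched"


def enrich(flow_file: FlowFile, extra: Dict[str, str]) -> FlowFile:
    """Transformation: new flow file with added attributes; content is untouched."""
    return {"attributes": {**flow_file["attributes"], **extra},
            "content": flow_file["content"]}


rules: List[Rule] = [
    ("json", lambda attrs: attrs.get("mime.type") == "application/json"),
    ("large", lambda attrs: int(attrs.get("size", "0")) > 1_000_000),
]

reading = {"attributes": {"size": "13"}, "content": b'{"sensor": 7}'}
tagged = enrich(reading, {"mime.type": "application/json"})
```

Here `route(reading, rules)` falls through to `"unmatched"`, while the enriched copy is routed to `"json"`.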
Connections
• Queuing
  • Back Pressure
  • Expiration
• Prioritize
• Swapping
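The queuing behaviors listed above can be sketched as a bounded priority queue. This is a minimal model under stated assumptions, not NiFi's implementation: the `Connection` class, its parameters, and the injectable `clock` are hypothetical, and swapping to disk is not modeled.

```python
import heapq
import itertools
import time


class Connection:
    """Hypothetical bounded, prioritized queue between two processors."""

    def __init__(self, max_queued, max_age_seconds,
                 priority=lambda item: 0, clock=time.monotonic):
        self.max_queued = max_queued
        self.max_age_seconds = max_age_seconds
        self.priority = priority          # lower value is served first
        self.clock = clock
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority

    def offer(self, item) -> bool:
        """Enqueue, or return False so the producer backs off (back pressure)."""
        if len(self._heap) >= self.max_queued:
            return False
        entry = (self.priority(item), next(self._counter), self.clock(), item)
        heapq.heappush(self._heap, entry)
        return True

    def poll(self):
        """Return the best unexpired item, dropping expired ones; None if empty."""
        while self._heap:
            _, _, enqueued_at, item = heapq.heappop(self._heap)
            if self.clock() - enqueued_at <= self.max_age_seconds:
                return item
        return None
```

A producer that sees `offer` return `False` is expected to pause rather than drop data; that is the essence of back pressure in this sketch.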
Flow Controller
NiFi Architecture
NiFi Clustering Model
Real-time command and control
• Tighten the feedback loop
  • Changes have consequences (good or bad)
  • And you see them as they occur
• Continuous Improvement
  • Compare real-time vs. historical statistics
  • View data provenance
  • View Content at any stage
• Intuitive user experience
  • Visual programming
  • Logical flow graph
The Power of Provenance – Chain of custody for data
• Latency Optimization
  • Intra process
  • Inter process
  • End-to-end
• Compliance
  • Prove handling
  • Assess impact
• Understanding
  • Step through time
  • View content
  • View Context
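One way to picture a chain of custody is an append-only event log that can be walked back through parent links. The sketch below is an assumption-laden illustration, not NiFi's provenance repository: `ProvenanceEvent`, `ProvenanceLog`, the event type labels, and the component names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    timestamp: float
    event_type: str                  # illustrative labels, e.g. "RECEIVE", "FORK", "SEND"
    flowfile_id: str
    component: str                   # which step handled the data
    parent_id: Optional[str] = None  # set when this flow file was derived from another


class ProvenanceLog:
    """Hypothetical append-only log of what happened to each flow file."""

    def __init__(self):
        self._events: List[ProvenanceEvent] = []

    def record(self, event: ProvenanceEvent) -> None:
        self._events.append(event)

    def lineage(self, flowfile_id: str) -> List[ProvenanceEvent]:
        """Events for this flow file and all of its ancestors, in time order."""
        ids = {flowfile_id}
        changed = True
        while changed:  # follow parent links until no new ancestor appears
            changed = False
            for e in self._events:
                if e.flowfile_id in ids and e.parent_id and e.parent_id not in ids:
                    ids.add(e.parent_id)
                    changed = True
        return sorted((e for e in self._events if e.flowfile_id in ids),
                      key=lambda e: e.timestamp)

    def end_to_end_latency(self, flowfile_id: str) -> float:
        events = self.lineage(flowfile_id)
        return events[-1].timestamp - events[0].timestamp


log = ProvenanceLog()
log.record(ProvenanceEvent(0.0, "RECEIVE", "a", "listener"))
log.record(ProvenanceEvent(1.0, "FORK", "b", "splitter", parent_id="a"))
log.record(ProvenanceEvent(3.0, "SEND", "b", "sender"))
```

The same log serves both uses named on the slide: `lineage` supports proving how data was handled, and `end_to_end_latency` supports latency analysis.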
Demo
How fast is it and why?
• Flow File Repo – Write Ahead Log
• Content Repo
• Add more partitions
• Input/Output Streams
• Copy on Write
• Pass by Reference
• Allow tradeoffs of latency vs throughput
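Copy on write and pass by reference can be illustrated with a toy content store. The slides do not show how NiFi's repositories are implemented, so `ContentRepo`, `ContentClaim`, and their methods below are hypothetical names for a minimal in-memory sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ContentClaim:
    """Hypothetical reference to stored content; cheap to pass around."""
    claim_id: int


class ContentRepo:
    """Toy in-memory content store; stored bytes are never changed in place."""

    def __init__(self):
        self._blobs: Dict[int, bytes] = {}
        self._next_id = 0

    def write(self, data: bytes) -> ContentClaim:
        claim = ContentClaim(self._next_id)
        self._blobs[claim.claim_id] = data
        self._next_id += 1
        return claim

    def read(self, claim: ContentClaim) -> bytes:
        return self._blobs[claim.claim_id]

    def modify(self, claim: ContentClaim,
               transform: Callable[[bytes], bytes]) -> ContentClaim:
        # Copy on write: the result goes to a new claim, the old one stays readable.
        return self.write(transform(self.read(claim)))


repo = ContentRepo()
original = repo.write(b"hello")
clone = original                      # pass by reference: no bytes are copied
shouted = repo.modify(original, bytes.upper)
```

Steps that only route or clone data move a small reference, and only steps that change content pay for a write.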
How does it deal with security?
- User to System and System to System
- Authentication (2-Way SSL, more coming…)
- Authorization (pluggable)
- Authorize a specific piece of data to a specific system
- Data provenance
  - Prove you have done the right thing
  - Recover when you have not
How can I monitor this at runtime?
• Web UI
• Push API: Reporting Tasks (ganglia, graphite, etc…)
• Pull API: REST API
What are the points of extension?
• Flow File Processors
• Advanced UI
• Flow File Prioritizer
• Reporting Tasks
• Controller Services
• Build Clients against our REST API
Status and direction for NiFi
Existing Strengths
• Efficient use of each node
  • 100s of MB/s per node
  • 100Ks of transactions/s per node
• Simple / Effective scaling model
• Runtime Command and Control
• Data Provenance
Roadmap Highlights
• Distributed durability of data
  • Maybe Kafka backed queues
• High Availability Cluster Manager
• Live / Rolling Upgrades
• Provenance Query Language / Reporting
• A complete user experience enabled by provenance
Learn more about Apache NiFi
• Apache NiFi (incubating) site: http://nifi.incubator.apache.org
• Subscribe to and collaborate at: [email protected], [email protected]
• Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI
• @apachenifi