Apache NiFi Better Analytics Demand Better Dataflow Presented by: Joe Witt Apache NiFi PPMC Member
Page 1: Joe witt may2015_kafka_nyc_apachenifi-overview

Apache NiFi Better Analytics Demand Better Dataflow

Presented by: Joe Witt Apache NiFi PPMC Member

Page 2: Joe witt may2015_kafka_nyc_apachenifi-overview

#nifidemo

Page 3: Joe witt may2015_kafka_nyc_apachenifi-overview

History

•  Developed at NSA for over eight years

•  Donated to the Apache Software Foundation Nov 2014

•  Undergoing incubation

•  Three ASF releases to date
   •  0.1.0 out last night!

 

Page 4: Joe witt may2015_kafka_nyc_apachenifi-overview

The problem space: Enterprise Dataflow

Automate the flow of data from any source

…to systems which extract meaning and insight

…and to those that store and make it available for users

Page 5: Joe witt may2015_kafka_nyc_apachenifi-overview

The challenges we faced

•  Transport / Messaging was not enough

•  Needed to understand the big picture

•  Needed the ability to make *immediate* changes

•  Must maintain chain of custody for data

•  Rigorous security and compliance requirements

Page 6: Joe witt may2015_kafka_nyc_apachenifi-overview

Why were transport and messaging not enough?

•  Data access exceeded resources to transport

•  Decoupling systems is about more than the connectivity

•  Message sizes ranged from B to GB

•  Not all data is created equal

•  Needed precise security controls
   •  SSL and topic level authorization insufficient

 

Page 7: Joe witt may2015_kafka_nyc_apachenifi-overview

Apache NiFi Foundational Concepts

1  The basic building blocks

2  Real-time Command and Control

3  The Power of Provenance

Page 8: Joe witt may2015_kafka_nyc_apachenifi-overview

Flow File

HEADER - UUID - Name - Size - Entry Time

Attributes Map [[Key | Value]]

CONTENT

•  Types
   •  Events
   •  Objects
   •  Files
   •  Messages
   •  Media

•  Formats
   •  JSON
   •  Avro
   •  Text
   •  Mp4
   •  Proprietary

•  Sizes
   •  Bytes to GBs
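The flow file layout above (header fields, an attribute map, and opaque content) can be sketched as a small data structure. This is a minimal illustration only: the class name, field names, and the `mime.type` attribute key are assumptions chosen for the example, not NiFi's actual API.

```python
import time
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass(frozen=True)
class FlowFile:
    """Hypothetical model of a flow file: header, attribute map, content."""

    name: str
    content: bytes = b""  # opaque payload: JSON, Avro, text, media, ...
    attributes: dict = field(default_factory=dict)  # key/value context
    uuid: str = field(default_factory=lambda: str(uuid4()))
    entry_time: float = field(default_factory=time.time)

    @property
    def size(self) -> int:
        # Size is derived from the content rather than stored separately.
        return len(self.content)


# Example: a small JSON event carried as a flow file.
ff = FlowFile(
    name="event.json",
    content=b'{"k": 1}',
    attributes={"mime.type": "application/json"},  # assumed attribute key
)
```

Because the content is just bytes, the same structure covers a few-byte message and a multi-gigabyte file; a real system would hold a reference to stored content rather than the bytes themselves.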

Page 9: Joe witt may2015_kafka_nyc_apachenifi-overview

Flow File Processor

•  Routing
   •  Context
   •  Content

•  Transformation
   •  Enrich
   •  Obfuscate
   •  Filter
   •  Convert
   •  Analyze
   •  Split
   •  Aggregate

•  Mediation
   •  Push / Pull
   •  …
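To make the routing idea concrete, here is a minimal sketch of a processor that routes on context (attributes) or on content. The `FlowFile` tuple, the `route` function, the rule list, and the relationship names are all hypothetical and are not taken from NiFi.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class FlowFile(NamedTuple):
    """Hypothetical minimal flow file: attribute map plus content bytes."""

    attributes: Dict[str, str]
    content: bytes


Rule = Tuple[str, Callable[[FlowFile], bool]]


def route(flow_file: FlowFile, rules: List[Rule]) -> str:
    """Return the first relationship whose predicate matches the flow file."""
    for relationship, predicate in rules:
        if predicate(flow_file):
            return relationship
    return "unmatched"


rules: List[Rule] = [
    # Routing on context: only the attribute map is inspected.
    ("json", lambda ff: ff.attributes.get("format") == "json"),
    # Routing on content: the payload itself is inspected.
    ("alerts", lambda ff: b"ERROR" in ff.content),
]
```

Routing on attributes is cheap because the content never has to be read; routing on content costs more but needs no prior tagging.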

Page 10: Joe witt may2015_kafka_nyc_apachenifi-overview

Connections

•  Queuing

•  Back Pressure

•  Expiration

•  Prioritize

•  Swapping
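The connection behaviors listed above can be sketched as a bounded priority queue. This is an illustrative model under stated assumptions: the `Connection` class, its parameter names, and the `urgency` field are invented for the example, and swapping to disk is not modeled.

```python
import heapq
import itertools
import time


class Connection:
    """Hypothetical queue between two processors."""

    def __init__(self, max_queued, max_age_seconds, priority=lambda item: 0):
        self.max_queued = max_queued            # back pressure threshold
        self.max_age_seconds = max_age_seconds  # expiration limit
        self.priority = priority                # lower value is served first
        self._heap = []
        self._counter = itertools.count()       # tie-breaker keeps FIFO order

    def is_full(self):
        return len(self._heap) >= self.max_queued

    def offer(self, item, now=None):
        """Enqueue an item; return False when back pressure applies."""
        if self.is_full():
            return False
        now = time.time() if now is None else now
        entry = (self.priority(item), next(self._counter), now, item)
        heapq.heappush(self._heap, entry)
        return True

    def poll(self, now=None):
        """Return the highest-priority unexpired item, or None."""
        now = time.time() if now is None else now
        while self._heap:
            _, _, enqueued_at, item = heapq.heappop(self._heap)
            if now - enqueued_at <= self.max_age_seconds:
                return item
            # Expired items are dropped silently in this sketch.
        return None


conn = Connection(max_queued=2, max_age_seconds=10,
                  priority=lambda item: -item["urgency"])
```

When `offer` returns False, the upstream component is expected to stop producing until the queue drains, which is the essence of back pressure.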

Page 11: Joe witt may2015_kafka_nyc_apachenifi-overview

Flow Controller

Page 12: Joe witt may2015_kafka_nyc_apachenifi-overview

NiFi Architecture

Page 13: Joe witt may2015_kafka_nyc_apachenifi-overview

NiFi Clustering Model

Page 14: Joe witt may2015_kafka_nyc_apachenifi-overview

2  Real-time command and control

Tighten the feedback loop
•  Changes have consequences (good or bad)
•  And you see them as they occur

Continuous Improvement
•  Compare real-time vs. historical statistics
•  View data provenance
•  View Content at any stage

Intuitive user experience
•  Visual programming
•  Logical flow graph

Page 15: Joe witt may2015_kafka_nyc_apachenifi-overview

3  The Power of Provenance – Chain of custody for data

Latency Optimization
•  Intra process
•  Inter process
•  End-to-end

Compliance
•  Prove handling
•  Assess impact

Understanding
•  Step through time
•  View content
•  View Context
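One way to picture chain of custody is an append-only log of events per flow file, from which lineage and end-to-end latency can be read back. The sketch below is a simplified assumption, not NiFi's provenance repository: the class names, the event labels, and the component names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    flow_file_id: str
    event_type: str   # illustrative labels such as "RECEIVE", "ROUTE", "SEND"
    component: str    # which step handled the data
    timestamp: float


class ProvenanceLog:
    """Hypothetical append-only record of what happened to each flow file."""

    def __init__(self):
        self._events = []

    def record(self, flow_file_id, event_type, component, timestamp):
        self._events.append(
            ProvenanceEvent(flow_file_id, event_type, component, timestamp)
        )

    def lineage(self, flow_file_id):
        """All events for one flow file, in time order (step through time)."""
        matching = [e for e in self._events if e.flow_file_id == flow_file_id]
        return sorted(matching, key=lambda e: e.timestamp)

    def end_to_end_latency(self, flow_file_id):
        """Time between the first and last recorded event."""
        events = self.lineage(flow_file_id)
        return events[-1].timestamp - events[0].timestamp if events else 0.0


log = ProvenanceLog()
log.record("ff-1", "RECEIVE", "ingest", 0.0)
log.record("ff-1", "ROUTE", "router", 0.5)
log.record("ff-2", "RECEIVE", "ingest", 0.7)
log.record("ff-1", "SEND", "publish", 2.0)
```

The same log supports the compliance questions on the slide: `lineage` shows how a piece of data was handled, and scanning events by component shows which data a faulty step touched.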

Page 16: Joe witt may2015_kafka_nyc_apachenifi-overview

Demo

Page 17: Joe witt may2015_kafka_nyc_apachenifi-overview

How fast is it and why?

•  Flow File Repo – Write Ahead Log
•  Content Repo
•  Add more partitions
•  Input/Output Streams
•  Copy on Write
•  Pass by Reference
•  Allow tradeoffs of latency vs throughput
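Copy on write and pass by reference can be illustrated with a toy content store in which flow files hold a claim on stored bytes rather than the bytes themselves. Everything here is an assumption made for illustration (the `ContentRepository` class, the dictionary-based flow files, the `clone` and `transform` helpers); it is not how NiFi's content repository is implemented.

```python
class ContentRepository:
    """Hypothetical append-only store; content is never modified in place."""

    def __init__(self):
        self._claims = {}

    def write(self, data):
        claim_id = len(self._claims)  # next unused claim number
        self._claims[claim_id] = bytes(data)
        return claim_id

    def read(self, claim_id):
        return self._claims[claim_id]


def clone(flow_file):
    """Pass by reference: the copy points at the same content claim."""
    return {"attributes": dict(flow_file["attributes"]),
            "claim": flow_file["claim"]}


def transform(repo, flow_file, fn):
    """Copy on write: changed content goes to a new claim."""
    new_claim = repo.write(fn(repo.read(flow_file["claim"])))
    return {"attributes": dict(flow_file["attributes"]), "claim": new_claim}


repo = ContentRepository()
original = {"attributes": {"filename": "a.txt"},
            "claim": repo.write(b"hello")}
copy = clone(original)
upper = transform(repo, original, bytes.upper)
```

Routing or cloning a large flow file then costs only a small record, and the original content stays available for replay after a transformation.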

Page 18: Joe witt may2015_kafka_nyc_apachenifi-overview

How does it deal with security?

-  User to System and System to System
   -  Authentication (2-Way SSL)
   -  Authorization (pluggable)

-  Authorize a specific piece of data to a specific system

-  Data provenance
   -  Prove you have done the right thing
   -  Recover when you have not

Page 19: Joe witt may2015_kafka_nyc_apachenifi-overview

How can I monitor this at runtime?

•  Web UI
•  Push API: Reporting Tasks (ganglia, graphite, etc…)
•  Pull API: REST API

Page 20: Joe witt may2015_kafka_nyc_apachenifi-overview

What are the points of extension?

•  Flow File Processors
•  Advanced UI
•  Flow File Prioritizer
•  Reporting Tasks
•  Controller Services
•  Build Clients against our REST API

Page 21: Joe witt may2015_kafka_nyc_apachenifi-overview

Status and direction for NiFi

Existing Strengths
•  Efficient use of each node
   -  100s of MB/s per node
   -  100Ks transactions/s per node
•  Simple / Effective scaling model
•  Runtime Command and Control
•  Data Provenance

Roadmap Highlights
•  Distributed durability of data
   -  Maybe Kafka backed queues
•  High Availability Cluster Manager
•  Live / Rolling Upgrades
•  Provenance Query Language / Reporting
•  A complete user experience enabled by provenance

Page 22: Joe witt may2015_kafka_nyc_apachenifi-overview

Learn more about Apache NiFi

Apache NiFi (incubating) site: http://nifi.incubator.apache.org
Subscribe to and collaborate at: [email protected] [email protected]
Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI
@apachenifi