Thomas Weise <[email protected]>, Dec 2nd, 2015. Introduction to Open Source Unified Streaming and Fast Batch Platform: Apache Apex (incubating)

DataTorrent Presentation @ Big Data Application Meetup

Mar 21, 2017

Transcript
Page 1: DataTorrent Presentation @ Big Data Application Meetup

Thomas Weise <[email protected]>, Dec 2nd, 2015

Introduction to Open Source Unified Streaming and Fast Batch Platform: Apache Apex (incubating)

Page 2: DataTorrent Presentation @ Big Data Application Meetup

© 2015 DataTorrent

Apex Platform Overview

Page 3: DataTorrent Presentation @ Big Data Application Meetup


Apache Malhar Library

Page 4: DataTorrent Presentation @ Big Data Application Meetup


Native Hadoop Integration

• YARN is the resource manager

• HDFS used for storing any persistent state

Page 5: DataTorrent Presentation @ Big Data Application Meetup


Application Programming Model

• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations & emits one or more output streams
  – Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
  – An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams

[Figure: Directed Acyclic Graph (DAG). Operators connected by streams of tuples, labeled Filtered Stream, Enriched Stream and Output Stream.]
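The programming model on this slide can be illustrated without the platform itself. The following is a minimal sketch in plain Java, not the Apex API: the names `MiniDag`, `Operator`, `filter`, `enrich`, `connect` and `run` are all hypothetical and chosen only for this illustration. It wires two operators into a linear pipeline, the simplest form of a DAG, and pushes tuples through it one at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these types belong to the Apex API.
class MiniDag {

    /** An operator consumes one input tuple at a time and may emit output tuples. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> emit);
    }

    /** Operator that drops tuples failing the predicate (produces a filtered stream). */
    static <T> Operator<T, T> filter(Predicate<T> keep) {
        return (tuple, emit) -> {
            if (keep.test(tuple)) {
                emit.accept(tuple);
            }
        };
    }

    /** Operator that transforms each tuple (produces an enriched stream). */
    static <I, O> Operator<I, O> enrich(Function<I, O> fn) {
        return (tuple, emit) -> emit.accept(fn.apply(tuple));
    }

    /** Connects two operators with a stream: the output of `up` becomes the input of `down`. */
    static <A, B, C> Operator<A, C> connect(Operator<A, B> up, Operator<B, C> down) {
        return (tuple, emit) -> up.process(tuple, mid -> down.process(mid, emit));
    }

    /** Pushes every input tuple through the pipeline and collects the output stream. */
    static <I, O> List<O> run(List<I> input, Operator<I, O> pipeline) {
        List<O> out = new ArrayList<>();
        for (I tuple : input) {
            pipeline.process(tuple, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        Operator<Integer, Integer> evens = filter(n -> n % 2 == 0);
        Operator<Integer, String> label = enrich(n -> "even:" + n);
        System.out.println(run(List.of(1, 2, 3, 4), connect(evens, label))); // [even:2, even:4]
    }
}
```

Each operator here runs in the caller's thread, which mirrors the single-threaded-instance point above; running several instances in parallel is a separate concern.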

Page 6: DataTorrent Presentation @ Big Data Application Meetup


Application Specification

Page 7: DataTorrent Presentation @ Big Data Application Meetup


Partitioning and Scaling Out

• Operators can be dynamically scaled
• Flexible Streams split
• Parallel partitioning
• MxN partitioning
• Unifiers
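To make the roles of a stream split and a unifier concrete, here is a minimal sketch in plain Java. It is not the Apex partitioning API; `PartitionSketch`, `partitionFor`, `countPerPartition` and `unify` are hypothetical names, and routing by key hash is an assumption made for the example. The partitions are simulated sequentially, whereas in a real deployment each would be a separate operator instance running in parallel.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Apex partitioning API.
class PartitionSketch {

    /** Routes a key to one of n partitions; equal keys always land in the same partition. */
    static int partitionFor(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    /** Splits the stream across n partitions; each partition counts only the keys it receives. */
    static List<Map<String, Integer>> countPerPartition(List<String> stream, int n) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            partials.add(new HashMap<>());
        }
        for (String key : stream) {
            partials.get(partitionFor(key, n)).merge(key, 1, Integer::sum);
        }
        return partials;
    }

    /** Unifier: merges the partial results of all partitions into a single result. */
    static Map<String, Integer> unify(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, count) -> merged.merge(key, count, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> stream = List.of("a", "b", "a", "c", "a");
        System.out.println(unify(countPerPartition(stream, 3)));
    }
}
```

The merged counts do not depend on the number of partitions, which is what allows the partition count to change without changing the result.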

Page 8: DataTorrent Presentation @ Big Data Application Meetup


Advanced Windowing Support

• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
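The difference between tumbling and sliding windows can be shown with a small sketch in plain Java. It assumes the input has already been cut into equal time slices with one aggregated value per slice; `WindowSketch`, `tumblingSums` and `slidingSums` are hypothetical names and not part of the Apex API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Apex windowing API.
class WindowSketch {

    /** Sums the `size` values starting at index `start`. */
    private static int sum(int[] values, int start, int size) {
        int total = 0;
        for (int i = start; i < start + size; i++) {
            total += values[i];
        }
        return total;
    }

    /** Tumbling window: consecutive, non-overlapping groups; an incomplete last group is dropped. */
    static List<Integer> tumblingSums(int[] values, int size) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + size <= values.length; start += size) {
            sums.add(sum(values, start, size));
        }
        return sums;
    }

    /** Sliding window: the window advances by one slice, so neighboring windows overlap. */
    static List<Integer> slidingSums(int[] values, int size) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + size <= values.length; start++) {
            sums.add(sum(values, start, size));
        }
        return sums;
    }

    public static void main(String[] args) {
        int[] perSlice = {1, 2, 3, 4, 5, 6};
        System.out.println(tumblingSums(perSlice, 3)); // [6, 15]
        System.out.println(slidingSums(perSlice, 3));  // [6, 9, 12, 15]
    }
}
```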

Page 9: DataTorrent Presentation @ Big Data Application Meetup


Guarantees and Performance

Stateful Fault Tolerance
• Supported out of the box
  – Application state
  – Application master state
  – No data loss
• Automatic recovery
• Lunch test
• Buffer server

Processing Semantics
• At least once
• At most once
• Exactly once

Data Locality
• Stream locality for placement of operators
  – Rack local: distributed deployment
  – Node local: data does not traverse NIC
  – Container local: data doesn’t need to be serialized
  – Thread local: operators run in same thread
• Data locality
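The practical difference between at-least-once and exactly-once output can be sketched in plain Java. This shows one common technique, an idempotent sink keyed by window id; it is an assumption made for illustration and is not claimed to be how Apex implements its guarantees. `ReplaySketch`, `NaiveSink` and `IdempotentSink` are hypothetical names. The scenario: windows 0 to 2 are processed, a failure occurs, and recovery replays windows 1 and 2 from a checkpoint taken after window 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not how Apex is implemented and not its API.
class ReplaySketch {

    /** Sink with at-least-once behavior: replayed windows are written a second time. */
    static class NaiveSink {
        final List<String> written = new ArrayList<>();

        void write(long windowId, List<String> tuples) {
            written.addAll(tuples);
        }
    }

    /**
     * Sink that remembers the last window it wrote and skips replays of older windows.
     * The marker is assumed to be stored with the output, so it survives the failure.
     */
    static class IdempotentSink {
        final List<String> written = new ArrayList<>();
        long lastWrittenWindow = -1;

        void write(long windowId, List<String> tuples) {
            if (windowId <= lastWrittenWindow) {
                return; // this window was already written before the failure
            }
            written.addAll(tuples);
            lastWrittenWindow = windowId;
        }
    }

    /** Tuples per window id. */
    static final List<List<String>> WINDOWS = List.of(List.of("a"), List.of("b"), List.of("c"));

    /** Delivery order: windows 0..2, then a failure, then a replay of windows 1..2. */
    static final long[] DELIVERY = {0, 1, 2, 1, 2};

    public static void main(String[] args) {
        NaiveSink naive = new NaiveSink();
        IdempotentSink idempotent = new IdempotentSink();
        for (long windowId : DELIVERY) {
            List<String> tuples = WINDOWS.get((int) windowId);
            naive.write(windowId, tuples);
            idempotent.write(windowId, tuples);
        }
        System.out.println(naive.written);      // [a, b, c, b, c]
        System.out.println(idempotent.written); // [a, b, c]
    }
}
```

In both cases no data is lost; the sinks differ only in whether the replayed windows become visible twice.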

Page 10: DataTorrent Presentation @ Big Data Application Meetup


Dynamic Updates

• Dynamic topology updates
  – Properties of operators can be changed
  – New operators can be added

Page 11: DataTorrent Presentation @ Big Data Application Meetup


Data Processing Pipeline Example: App Builder

Page 12: DataTorrent Presentation @ Big Data Application Meetup


Data Processing Pipeline Example: Logical Plan

Page 13: DataTorrent Presentation @ Big Data Application Meetup


Data Processing Pipeline Example: Physical Plan

Page 14: DataTorrent Presentation @ Big Data Application Meetup


Data Processing Pipeline Example: Real Time Visualization

Page 15: DataTorrent Presentation @ Big Data Application Meetup


Resources

• Apache Apex Community Page - http://apex.incubator.apache.org/
• Apache Apex LinkedIn Group

Page 16: DataTorrent Presentation @ Big Data Application Meetup

End

Thank You!