Apache Storm Tutorial

Jan 06, 2017

Davide Mazza
Transcript
Page 1: Apache Storm Tutorial

INTRODUCTION TO APACHE STORM

Sapienza University of Rome, Data Mining Class, A.Y. 2016-2017

Page 2: Apache Storm Tutorial

Team


Riccardo Di Stefano

Roberto Gaudenzi

Davide Mazza

Lorenzo Rutigliano

Sara Veterini

Federico Croce

Apache Storm - Sapienza University of Rome - Data Mining Class - A.Y. 2016-2017

https://it.linkedin.com/in/lorenzo-rutigliano-00a007135/it

https://it.linkedin.com/in/sara-veterini-667684116

https://it.linkedin.com/in/roberto-gaudenzi-4b0422116

https://it.linkedin.com/in/federico-croce-921a19134/it

https://it.linkedin.com/in/riccardo-di-stefano-439a11134

https://it.linkedin.com/in/davide-mazza-33a9b291

Page 3: Apache Storm Tutorial

Contacts and Links


https://github.com/davidemazza/ApacheStorm

http://www.slideshare.net/DavideMazza6/apache-storm-tutorial

[email protected]


Page 4: Apache Storm Tutorial

Introduction

Apache Storm is a free, open-source, distributed, fault-tolerant, real-time computation system that makes it easy to process unbounded streams of data.

> use cases: financial applications, network monitoring, social network analysis, online machine learning, etc.

> different from traditional batch systems, which store data first and process it later.


Page 5: Apache Storm Tutorial

Companies


Page 6: Apache Storm Tutorial

Stream

An unbounded sequence of tuples

Tuple: the core unit of data, a named list of values
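As an illustration (plain Python, not the Storm API), a tuple can be thought of as field names paired with values:

```python
# Plain-Python analogy: a Storm tuple pairs declared field names
# with their values, so values are accessed by field name.
fields = ["student", "grade"]
values = ["alice", 27]
tup = dict(zip(fields, values))

print(tup["grade"])  # → 27
```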


Page 7: Apache Storm Tutorial

Topologies

An application is defined in Storm through a topology, which describes its logic as a DAG (directed acyclic graph) of operators and streams.

Spouts are the sources of data streams. They usually read data from external sources (e.g. the Twitter API) or from disk and emit it into the topology.

Bolts process input streams and possibly produce output streams. They implement the application logic.


Page 8: Apache Storm Tutorial

Architecture

There are two kinds of nodes in a Storm cluster:

➢ The Master node runs a daemon called “Nimbus”, to which topologies are submitted. It is responsible for scheduling, job orchestration, and monitoring for failures.

➢ Each Worker (slave) node runs a daemon called “Supervisor”, which can run one or more worker processes in which applications are executed.

The coordination between these two entities is done through ZooKeeper, which is mainly used to maintain state, since Nimbus and the Supervisors themselves are stateless.


Page 9: Apache Storm Tutorial

Architecture

Three entities are involved in running a topology:

➢ Worker process: one or more per cluster; each one runs tasks of exactly one topology (a design choice made for fault tolerance and isolation).

➢ Executor: a thread within a worker process. It runs one or more tasks of the same component (spout or bolt).

➢ Task: a replica of a component.

Therefore workers provide inter-topology parallelism, executors intra-topology parallelism, and tasks intra-component parallelism.

[Diagram: a worker process containing executors, each executor running one or more tasks]


Page 10: Apache Storm Tutorial

Simple Example


Page 11: Apache Storm Tutorial

Example

We will show how to compute the average of a set of grades using a simple Storm topology.

We will use:

➢ one spout;
➢ two bolts that work in parallel;
➢ another bolt in which the previous two converge.


Page 12: Apache Storm Tutorial

Spout

This represents the spout.

Its job is to read a stream of numbers.

Our stream represents grades, so the values range from 18 to 30.


Page 13: Apache Storm Tutorial

Bolt

This represents the bolt.

We can distinguish three different bolts in our example:

1. SummationBolt: computes the sum of the numbers;
2. CounterBolt: counts the numbers;
3. AverageBolt: computes the average.
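The dataflow of this example can be sketched as a plain-Python simulation (the names mirror the slides; this is not the actual Storm API, in which each component would subclass a spout or bolt base class):

```python
import random

def grade_spout(n=10):
    """Spout: emit a stream of n random grades in the 18-30 range."""
    for _ in range(n):
        yield random.randint(18, 30)

def summation_bolt(grades):
    return sum(grades)      # SummationBolt: sum of the numbers

def counter_bolt(grades):
    return len(grades)      # CounterBolt: count of the numbers

def average_bolt(total, count):
    return total / count    # AverageBolt: where the two streams converge

grades = list(grade_spout())
avg = average_bolt(summation_bolt(grades), counter_bolt(grades))
assert 18 <= avg <= 30      # the average of grades stays in range
```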


Page 14: Apache Storm Tutorial

Topology

[Diagram, animated across pages 14-20: the grade stream flows from the spout into the SummationBolt and the CounterBolt in parallel; their streams then converge in the AverageBolt, which produces the output]

Page 21: Apache Storm Tutorial

Trident


Page 22: Apache Storm Tutorial

Trident

➢ A high-level abstraction on top of Storm

➢ Uses spouts and bolts auto-generated by Trident before execution

➢ Provides functions, filters, joins, grouping, and aggregation

➢ Processes streams as a series of small batches


Page 23: Apache Storm Tutorial

Topology

➢ Receives the input stream from a spout

➢ Applies an ordered sequence of operations (filter, aggregation, grouping, etc.) to the stream


Page 24: Apache Storm Tutorial

Tuples & Spout

➢ A TridentTuple is a named list of values.

➢ The TridentTuple interface is the data model of a Trident topology.

➢ A Trident spout is similar to a Storm spout, but with additional options.

➢ Trident ships with several sample spout implementations.


Page 25: Apache Storm Tutorial

Example of Spout


Page 26: Apache Storm Tutorial

Operations

➢ Filter

➢ Function

➢ Aggregation

➢ Grouping

➢ Merging and Joining


Page 27: Apache Storm Tutorial

Operations: Filter

➢ An object used to perform input validation.

➢ Takes a subset of the trident tuple's fields as input

➢ Returns either true or false

➢ True → the tuple is kept in the output stream

➢ False → the tuple is removed from the stream
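In plain Python, the keep/drop semantics of a filter look like this (a real Trident filter would subclass the BaseFilter class; this only mimics the behavior):

```python
def is_valid_grade(tup):
    """Return True to keep the tuple, False to drop it."""
    return 18 <= tup["grade"] <= 30

stream = [{"grade": 25}, {"grade": 42}, {"grade": 30}]
kept = [t for t in stream if is_valid_grade(t)]
print(kept)  # the invalid tuple with grade 42 is removed
```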


Page 28: Apache Storm Tutorial

Operations: Function

➢ An object used to perform a simple operation on a single trident tuple.

➢ Takes a subset of the tuple's fields as input

➢ Emits zero or more tuples; the emitted fields are appended to the original tuple.
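The emit-zero-or-more behavior can be mimicked in plain Python (a real Trident function subclasses BaseFunction; the `split_words` example here is hypothetical):

```python
def split_words(tup):
    """Emit zero or more tuples, one per word in the sentence field."""
    for word in tup["sentence"].split():
        yield {**tup, "word": word}  # new "word" field appended to the input tuple

out = list(split_words({"sentence": "hello storm"}))
print(out)  # two tuples emitted, one per word
```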


Page 29: Apache Storm Tutorial

Operations: Aggregation

An object used to perform aggregation operations on an input batch, partition, or stream.

➢ aggregate → aggregates each batch of trident tuples in isolation

➢ partitionAggregate → aggregates each partition rather than the entire batch

➢ persistentAggregate → aggregates over all trident tuples across all batches
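The difference between per-batch and cross-batch aggregation can be sketched in plain Python (the batch layout is an illustrative assumption, not the Trident API):

```python
batches = [[18, 22, 30], [25, 27]]  # the stream, processed as a series of batches

# aggregate: each batch is aggregated in isolation
per_batch_sums = [sum(b) for b in batches]
print(per_batch_sums)  # → [70, 52]

# persistentAggregate: a running total maintained across all batches
running_total = 0
for b in batches:
    running_total += sum(b)
print(running_total)  # → 122
```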


Page 30: Apache Storm Tutorial

Operations: Aggregation


Page 31: Apache Storm Tutorial

Operations: Grouping

➢ A built-in operation, invoked through the groupBy method

➢ Repartitions the stream by doing a partitionBy on the specified fields

➢ Then, within each partition, groups together tuples whose group fields are equal
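The effect of grouping followed by a per-group aggregation can be illustrated in plain Python (not the Trident groupBy API itself):

```python
from collections import defaultdict

stream = [{"lang": "en"}, {"lang": "it"}, {"lang": "en"}]

# group tuples whose group field ("lang") is equal, then count per group
groups = defaultdict(int)
for tup in stream:
    groups[tup["lang"]] += 1

print(dict(groups))  # → {'en': 2, 'it': 1}
```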


Page 32: Apache Storm Tutorial

Operations: Merging and Joining

➢ Merging combines two or more streams into a single stream

➢ Joining uses trident tuple fields from both sides to match tuples and join two streams


Page 33: Apache Storm Tutorial

State Maintenance

➢ State information can be stored in the topology itself

➢ If any tuple fails during processing, the failed tuple is retried.

➢ If the tuple failed before updating the state → retrying the tuple keeps the state stable.

➢ If the tuple failed after updating the state → retrying the same tuple would make the state unstable, because the update could be applied twice.
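This retry problem is why Trident attaches a transaction id to each batch, so state updates can be made idempotent. A minimal sketch of the idea (hypothetical names, not the Trident state API):

```python
state = {"count": 0, "last_txid": None}

def update_state(batch, txid):
    # Skip the update if this batch (txid) was already applied,
    # so a retried batch cannot increment the count twice.
    if txid == state["last_txid"]:
        return
    state["count"] += len(batch)
    state["last_txid"] = txid

update_state(["a", "b"], txid=1)
update_state(["a", "b"], txid=1)  # retry of the same batch: no double count
assert state["count"] == 2
```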


Page 34: Apache Storm Tutorial

When to use Trident?

It is difficult to achieve exactly-once processing with plain Storm.

Trident is useful for those use cases that require exactly-once processing.

Page 35: Apache Storm Tutorial

Trident Example


Page 36: Apache Storm Tutorial

Trident Demo: Twitter Languages

Which are the most used languages on Twitter?

The code is built on top of Trident and gets a stream of tweets using the twitter4j library.

For each tweet, the language is extracted.

A hashmap of counters is maintained and periodically published in a tweet by the code itself.


Page 37: Apache Storm Tutorial

Trident example setup

To set up your Twitter application:

● go to https://apps.twitter.com/ and create a new app
● fill in the form, leaving the callback URL empty
● after creating the app, go to “Keys and Access Tokens”
● copy the consumer key and consumer secret
● select “Create my access tokens” if no tokens are present, then copy the access token and access token secret
● open the project TwitterTridentExample in Eclipse, open the file twitter4j.properties in the project, and paste in your credentials

Now you are ready!


Page 38: Apache Storm Tutorial

Homework


Page 39: Apache Storm Tutorial

Homework


https://github.com/davidemazza/ApacheStorm

Folder “Homework”

Page 40: Apache Storm Tutorial

Thanks!