Apache Storm Tutorial

Jan 06, 2017

Davide Mazza
Transcript
Page 1: Apache Storm Tutorial

INTRODUCTION TO APACHE STORM

Sapienza University of Rome, Data Mining Class, A.Y. 2016-2017

Page 2: Apache Storm Tutorial

Team


Riccardo Di Stefano

Roberto Gaudenzi

Davide Mazza

Lorenzo Rutigliano

Sara Veterini

Federico Croce

Apache Storm - Sapienza University of Rome - Data Mining Class - A.Y. 2016-2017

https://it.linkedin.com/in/lorenzo-rutigliano-00a007135/it

https://it.linkedin.com/in/sara-veterini-667684116

https://it.linkedin.com/in/roberto-gaudenzi-4b0422116

https://it.linkedin.com/in/federico-croce-921a19134/it

https://it.linkedin.com/in/riccardo-di-stefano-439a11134

https://it.linkedin.com/in/davide-mazza-33a9b291

Page 3: Apache Storm Tutorial

Contacts and Links


https://github.com/davidemazza/ApacheStorm

http://www.slideshare.net/DavideMazza6/apache-storm-tutorial

[email protected]


Page 4: Apache Storm Tutorial

Introduction

Apache Storm is a free, open-source, distributed, fault-tolerant, real-time computation system that makes it easy to process unbounded streams of data.

> use cases: financial applications, network monitoring, social network analysis, online machine learning, etc.

> different from traditional batch systems, which store data first and process it later.


Page 5: Apache Storm Tutorial

Companies


Page 6: Apache Storm Tutorial

Stream

An unbounded sequence of tuples

Tuple: the core unit of data, a named list of values
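As an illustration (plain Python, not the Storm API), a tuple can be thought of as field names paired with values:

```python
# Plain-Python analogy: a Storm tuple pairs declared field names
# with their values, so values are accessed by field name.
fields = ["student", "grade"]
values = ["alice", 27]
tup = dict(zip(fields, values))

print(tup["grade"])  # → 27
```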


Page 7: Apache Storm Tutorial

Topologies

An application is defined in Storm through a topology, which describes its logic as a DAG (directed acyclic graph) of operators and streams.

Spouts are the sources of data streams. They usually read data from external sources (e.g. the Twitter API) or from disk and emit it into the topology.

Bolts process input streams and possibly produce output streams. They implement the application logic.


Page 8: Apache Storm Tutorial

Architecture

There are two kinds of nodes in a Storm cluster:

➢ The Master node runs a daemon called “Nimbus”, to which topologies are submitted. It is responsible for scheduling, job orchestration, and monitoring for failures.

➢ Each Worker (slave) node runs a daemon called “Supervisor”, which can run one or more worker processes in which applications are executed.

The coordination between these two entities is done through ZooKeeper, which is mainly used to maintain state, since Nimbus and the Supervisors themselves are stateless.


Page 9: Apache Storm Tutorial

Architecture

Three entities are involved in running a topology:

➢ Worker process: one or more per cluster; each one runs tasks of exactly one topology (a design choice made for fault tolerance and isolation).

➢ Executor: a thread within a worker process. It runs one or more tasks of the same component (spout or bolt).

➢ Task: a replica of a component.

Therefore workers provide inter-topology parallelism, executors intra-topology parallelism, and tasks intra-component parallelism.

[Diagram: a worker process containing executors, each executor running one or more tasks]


Page 10: Apache Storm Tutorial

Simple Example


Page 11: Apache Storm Tutorial

Example

We will show how to compute the average of a set of grades using a simple Storm topology.

We will use:

➢ one spout;
➢ two bolts that work in parallel;
➢ another bolt in which the previous two converge.


Page 12: Apache Storm Tutorial

Spout

This represents the spout.

Its job is to read a stream of numbers.

Our stream represents grades, so the values range from 18 to 30.


Page 13: Apache Storm Tutorial

Bolt

This represents the bolt.

We can distinguish three different bolts in our example:

1. SummationBolt: computes the sum of the numbers;
2. CounterBolt: counts the numbers;
3. AverageBolt: computes the average.
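The dataflow of this example can be sketched as a plain-Python simulation (the names mirror the slides; this is not the actual Storm API, in which each component would subclass a spout or bolt base class):

```python
import random

def grade_spout(n=10):
    """Spout: emit a stream of n random grades in the 18-30 range."""
    for _ in range(n):
        yield random.randint(18, 30)

def summation_bolt(grades):
    return sum(grades)      # SummationBolt: sum of the numbers

def counter_bolt(grades):
    return len(grades)      # CounterBolt: count of the numbers

def average_bolt(total, count):
    return total / count    # AverageBolt: where the two streams converge

grades = list(grade_spout())
avg = average_bolt(summation_bolt(grades), counter_bolt(grades))
assert 18 <= avg <= 30      # the average of grades stays in range
```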


Page 14: Apache Storm Tutorial

Topology

[Diagram, animated across pages 14-20: the grade stream flows from the spout into the SummationBolt and the CounterBolt in parallel; their streams then converge in the AverageBolt, which produces the output]

Page 21: Apache Storm Tutorial

Trident


Page 22: Apache Storm Tutorial

Trident

➢ A high-level abstraction on top of Storm

➢ Uses spouts and bolts auto-generated by Trident before execution

➢ Provides functions, filters, joins, grouping, and aggregation

➢ Processes streams as a series of small batches


Page 23: Apache Storm Tutorial

Topology

➢ Receives the input stream from a spout

➢ Applies an ordered sequence of operations (filter, aggregation, grouping, etc.) to the stream


Page 24: Apache Storm Tutorial

Tuples & Spout

➢ A TridentTuple is a named list of values.

➢ The TridentTuple interface is the data model of a Trident topology.

➢ A Trident spout is similar to a Storm spout, but with additional options.

➢ Trident ships with several sample spout implementations.


Page 25: Apache Storm Tutorial

Example of Spout


Page 26: Apache Storm Tutorial

Operations

➢ Filter

➢ Function

➢ Aggregation

➢ Grouping

➢ Merging and Joining


Page 27: Apache Storm Tutorial

Operations: Filter

➢ An object used to perform input validation.

➢ Takes a subset of the trident tuple's fields as input

➢ Returns either true or false

➢ True → the tuple is kept in the output stream

➢ False → the tuple is removed from the stream
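In plain Python, the keep/drop semantics of a filter look like this (a real Trident filter would subclass the BaseFilter class; this only mimics the behavior):

```python
def is_valid_grade(tup):
    """Return True to keep the tuple, False to drop it."""
    return 18 <= tup["grade"] <= 30

stream = [{"grade": 25}, {"grade": 42}, {"grade": 30}]
kept = [t for t in stream if is_valid_grade(t)]
print(kept)  # the invalid tuple with grade 42 is removed
```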


Page 28: Apache Storm Tutorial

Operations: Function

➢ An object used to perform a simple operation on a single trident tuple.

➢ Takes a subset of the tuple's fields as input

➢ Emits zero or more tuples; the emitted fields are appended to the original tuple.
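The emit-zero-or-more behavior can be mimicked in plain Python (a real Trident function subclasses BaseFunction; the `split_words` example here is hypothetical):

```python
def split_words(tup):
    """Emit zero or more tuples, one per word in the sentence field."""
    for word in tup["sentence"].split():
        yield {**tup, "word": word}  # new "word" field appended to the input tuple

out = list(split_words({"sentence": "hello storm"}))
print(out)  # two tuples emitted, one per word
```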


Page 29: Apache Storm Tutorial

Operations: Aggregation

An object used to perform aggregation operations on an input batch, partition, or stream.

➢ aggregate → aggregates each batch of trident tuples in isolation

➢ partitionAggregate → aggregates each partition rather than the entire batch

➢ persistentAggregate → aggregates over all trident tuples across all batches
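The difference between per-batch and cross-batch aggregation can be sketched in plain Python (the batch layout is an illustrative assumption, not the Trident API):

```python
batches = [[18, 22, 30], [25, 27]]  # the stream, processed as a series of batches

# aggregate: each batch is aggregated in isolation
per_batch_sums = [sum(b) for b in batches]
print(per_batch_sums)  # → [70, 52]

# persistentAggregate: a running total maintained across all batches
running_total = 0
for b in batches:
    running_total += sum(b)
print(running_total)  # → 122
```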


Page 30: Apache Storm Tutorial

Operations: Aggregation


Page 31: Apache Storm Tutorial

Operations: Grouping

➢ A built-in operation, invoked through the groupBy method

➢ Repartitions the stream by doing a partitionBy on the specified fields

➢ Then, within each partition, groups together tuples whose group fields are equal
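The effect of grouping followed by a per-group aggregation can be illustrated in plain Python (not the Trident groupBy API itself):

```python
from collections import defaultdict

stream = [{"lang": "en"}, {"lang": "it"}, {"lang": "en"}]

# group tuples whose group field ("lang") is equal, then count per group
groups = defaultdict(int)
for tup in stream:
    groups[tup["lang"]] += 1

print(dict(groups))  # → {'en': 2, 'it': 1}
```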


Page 32: Apache Storm Tutorial

Operations: Merging and Joining

➢ Merging combines two or more streams into a single stream

➢ Joining uses trident tuple fields from both sides to match tuples and join two streams


Page 33: Apache Storm Tutorial

State Maintenance

➢ State information can be stored in the topology itself

➢ If any tuple fails during processing, the failed tuple is retried.

➢ If the tuple failed before updating the state → retrying the tuple keeps the state stable.

➢ If the tuple failed after updating the state → retrying the same tuple would make the state unstable, because the update could be applied twice.
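This retry problem is why Trident attaches a transaction id to each batch, so state updates can be made idempotent. A minimal sketch of the idea (hypothetical names, not the Trident state API):

```python
state = {"count": 0, "last_txid": None}

def update_state(batch, txid):
    # Skip the update if this batch (txid) was already applied,
    # so a retried batch cannot increment the count twice.
    if txid == state["last_txid"]:
        return
    state["count"] += len(batch)
    state["last_txid"] = txid

update_state(["a", "b"], txid=1)
update_state(["a", "b"], txid=1)  # retry of the same batch: no double count
assert state["count"] == 2
```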


Page 34: Apache Storm Tutorial

When to use Trident?

It is difficult to achieve exactly-once processing with plain Storm.

Trident is useful for those use cases that require exactly-once processing.

Page 35: Apache Storm Tutorial

Trident Example


Page 36: Apache Storm Tutorial

Trident Demo: Twitter Languages

Which are the most used languages on Twitter?

The code is built on top of Trident and gets a stream of tweets using the twitter4j library.

For each tweet, the language is extracted.

A hashmap of counters is maintained and periodically published in a tweet by the code itself.


Page 37: Apache Storm Tutorial

Trident example setup

To set up your Twitter application:

● go to https://apps.twitter.com/ and create a new app
● fill in the form, leaving the callback URL empty
● after creating the app, go to “Keys and Access Tokens”
● copy the consumer key and consumer secret
● select “Create my access tokens” if no tokens are present, then copy the access token and access token secret
● open the project TwitterTridentExample in Eclipse, open the file twitter4j.properties in the project, and paste in your credentials

Now you are ready!


Page 38: Apache Storm Tutorial

Homework


Page 39: Apache Storm Tutorial

Homework


https://github.com/davidemazza/ApacheStorm

Folder “Homework”

Page 40: Apache Storm Tutorial

Thanks!