© 2014 SpringOne 2GX. All rights reserved. Do not distribute without permission. Scalable Big Data Stream Processing with Storm and Groovy
Page 1: Scalable Big Data Stream Processing with Storm and Groovy

Scalable Big Data Stream Processing with Storm and Groovy

Page 2: Scalable Big Data Stream Processing with Storm and Groovy

• 20 years of industry experience

• ORGANIZER OF NYC STORM USER GROUP

• ARCHITECT AT WEBMD

CONTACT ME

@edvorkin

Eugene Dvorkin

PRESENTATION FEEDBACK

#spring2gx @edvorkin

Page 3: Scalable Big Data Stream Processing with Storm and Groovy

• Stream processing

• Storm architecture

• Storm abstractions

• Parallelism

• Message Processing

• Storm cluster

• Fault-tolerance

• Trident

• Storm at WebMD

• QA

PRESENTATION SUMMARY

Page 4: Scalable Big Data Stream Processing with Storm and Groovy

I. Stream Processing

Page 5: Scalable Big Data Stream Processing with Storm and Groovy

WEBMD MEDPULSE

real-time medical news from curated Twitter feed

Page 6: Scalable Big Data Stream Processing with Storm and Groovy

Twitter streaming system

Page 7: Scalable Big Data Stream Processing with Storm and Groovy

TWITTER CHALLENGE

Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute and 500 million tweets per day.

350,000 per minute

1% = 3,500 per minute
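The figures on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation; the class and method names are hypothetical, and the only inputs taken from the slide are 6,000 tweets per second, 350,000 per minute, and the 1% fraction.

```java
// Back-of-the-envelope tweet volumes based on the slide's 6,000 tweets/second figure.
final class TweetRates {
    static final long TWEETS_PER_SECOND = 6_000L;

    // Tweets per minute at the average rate.
    static long perMinute() {
        return TWEETS_PER_SECOND * 60;
    }

    // Tweets per day at the average rate.
    static long perDay() {
        return TWEETS_PER_SECOND * 60 * 60 * 24;
    }

    // Size of a sample that keeps the given percentage of a per-minute volume.
    static long sample(long perMinute, int percent) {
        return perMinute * percent / 100;
    }

    public static void main(String[] args) {
        System.out.println(perMinute());          // 360000, i.e. "over 350,000"
        System.out.println(perDay());             // 518400000, roughly 500 million
        System.out.println(sample(350_000L, 1));  // 3500
    }
}
```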

Page 8: Scalable Big Data Stream Processing with Storm and Groovy

HADOOP

Inherently BATCH-Oriented System

Page 9: Scalable Big Data Stream Processing with Storm and Groovy

What is the need?

• Exponential rise in real-time data

• New business opportunity

Why Now?

• Economics of OSS and commodity hardware

WHY STREAM PROCESSING

Stream processing has emerged as a key use case*

*Source: Discover HDP2.1: Apache Storm for Stream Data Processing. Hortonworks. 2014

Page 10: Scalable Big Data Stream Processing with Storm and Groovy

• Detecting fraud while someone is swiping a credit card

• Placing an ad on a website while someone is reading a specific article

• Alerts on application and machine failures

• ETL – move data from one system to another in real-time

Stream Processing Use Cases

Page 11: Scalable Big Data Stream Processing with Storm and Groovy

Stream Processing Flow

Page 12: Scalable Big Data Stream Processing with Storm and Groovy

Do It Yourself

• How to scale

• How to deal with failures

• What to do with failed messages

• A lot of infrastructure concerns

• Complexity

• Tedious to build

*Image credit: nathanmarz, SlideShare: Storm

Page 13: Scalable Big Data Stream Processing with Storm and Groovy

II. Apache STORM

Page 14: Scalable Big Data Stream Processing with Storm and Groovy

STORM HISTORY

2011: Created by Nathan Marz

2011: Acquired by Twitter

2011: Open sourced

2013: Apache Incubator Project

2014: Part of Hortonworks HDP2 platform

Page 15: Scalable Big Data Stream Processing with Storm and Groovy

STORM ADOPTION

Most mature, widely adopted framework

Source: http://storm.incubator.apache.org/

Page 16: Scalable Big Data Stream Processing with Storm and Groovy

STORM PROPERTIES

• STREAM: Process endless stream of data.

• FAST: 1M+ messages / sec on a 10-15 node cluster

• SCALABLE

• FAULT TOLERANT

Page 17: Scalable Big Data Stream Processing with Storm and Groovy

STORM PROPERTIES

• RELIABLE: Guaranteed message processing

• RUNS ON JVM

• SUPPORT ANY LANGUAGE: Java, Python, C#, etc.

• EASY

Page 18: Scalable Big Data Stream Processing with Storm and Groovy

III. STORM ABSTRACTIONS: Tuples, Streams, Spouts, Bolts and Topologies

Page 19: Scalable Big Data Stream Processing with Storm and Groovy

TUPLE

STORM ABSTRACTIONS

Storm data type: an immutable list of key/value pairs of any data type

word: “Hello”

Count: 25

Frequency: 0.25
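A rough illustration of the tuple idea, written in plain Java without the Storm library. `SimpleTuple` is a hypothetical stand-in, not Storm's own tuple class; the accessor name is only modeled on the kind of by-field lookup Storm offers. It shows the two properties named on the slide: named values of any type, and immutability after creation.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Storm tuple: an ordered, read-only set of named values.
final class SimpleTuple {
    private final Map<String, Object> fields;

    SimpleTuple(Map<String, Object> values) {
        // Copy defensively, then wrap, so the tuple cannot change after creation.
        this.fields = Collections.unmodifiableMap(new LinkedHashMap<>(values));
    }

    // Look a value up by its field name; returns null for an unknown field.
    Object getValueByField(String name) {
        return fields.get(name);
    }

    int size() {
        return fields.size();
    }

    public static void main(String[] args) {
        Map<String, Object> values = new LinkedHashMap<>();
        values.put("word", "Hello");
        values.put("count", 25);
        values.put("frequency", 0.25);
        SimpleTuple tuple = new SimpleTuple(values);
        System.out.println(tuple.getValueByField("word") + " " + tuple.getValueByField("count"));
    }
}
```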

Page 20: Scalable Big Data Stream Processing with Storm and Groovy

STREAM

STORM ABSTRACTIONS

Unbounded Sequence of Tuples between nodes

Page 21: Scalable Big Data Stream Processing with Storm and Groovy

SPOUT

STORM ABSTRACTIONS

The Source of the Stream

Page 22: Scalable Big Data Stream Processing with Storm and Groovy

READ Data

Read from a stream of data: queues, web logs, API calls, databases

Emit tuples into Storm

STORM ABSTRACTIONS

Spout responsibilities

STREAM

Page 23: Scalable Big Data Stream Processing with Storm and Groovy

BOLT

STORM ABSTRACTIONS

Processing Unit

Page 24: Scalable Big Data Stream Processing with Storm and Groovy

• Process tuples and perform actions: calculations, API calls, DB calls

• Produce new output stream based on computations

STORM ABSTRACTIONS

Bolt

F(x)

Page 25: Scalable Big Data Stream Processing with Storm and Groovy

• A topology is a network of spouts and bolts

• Defines stream processing pipeline as a graph of computational nodes

TOPOLOGY

Page 26: Scalable Big Data Stream Processing with Storm and Groovy

• May have multiple spouts

TOPOLOGY

Page 27: Scalable Big Data Stream Processing with Storm and Groovy

TASK

Instance of Spout or Bolt

• Each spout and bolt may have many instances (tasks) that perform all the processing in parallel.

Page 28: Scalable Big Data Stream Processing with Storm and Groovy

Stream Grouping

How tuples are sent between instances of spouts and bolts

• Shuffle: random distribution.

• Fields: routes tuples to a bolt based on the value of the field. The same values always route to the same bolt instance (task).
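The two groupings can be sketched as functions that pick a target task index. This is a minimal sketch in plain Java, assuming a simple hash-modulo scheme for fields grouping and a uniform random pick for shuffle; `Groupings` and its methods are hypothetical names, and Storm's internal routing may differ in detail.

```java
import java.util.Random;

// Hypothetical helpers that choose which task (0..numTasks-1) receives a tuple.
final class Groupings {
    // Shuffle grouping: any task may receive the tuple.
    static int shuffle(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    // Fields grouping: hash the field value so equal values always pick the same task.
    static int fields(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (String word : new String[] {"two", "households", "two"}) {
            System.out.println(word + " -> shuffle task " + shuffle(random, 4)
                    + ", fields task " + fields(word, 4));
        }
    }
}
```

Fields grouping is what makes a per-word counter correct: every occurrence of a word reaches the task that holds that word's running count.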

Page 29: Scalable Big Data Stream Processing with Storm and Groovy

Abstractions Summary

• Tuple

• Stream

• Spout (task)

• Bolt (task)

• Topology

Page 30: Scalable Big Data Stream Processing with Storm and Groovy

IV. Let's Code

Page 31: Scalable Big Data Stream Processing with Storm and Groovy

STORM DEPENDENCIES

Gradle:

    compile 'org.apache.storm:storm-core:0.9.2-incubating'

Maven:

    <dependency>
        <groupId>org.apache.storm</groupId>
        <artifactId>storm-core</artifactId>
        <version>0.9.2-incubating</version>
    </dependency>

Page 32: Scalable Big Data Stream Processing with Storm and Groovy

DISTRIBUTED WORD COUNT TOPOLOGY

[Diagram: Spout (1) → sentence → Split Sentence bolt (2) → word → Word Count bolt (3) → word count → Printer Bolt (4)]

Sentence: "Two households, both alike in dignity"

Words: “Two”, “households”, “both”, “alike”, “in”, “dignity”

Running counts: “Two”:2, “households”:4, “both”:5, “alike”:5, “in”:4, “dignity”:7

Final count: Two 20, Households 24, Both 22, Alike 1, In 1, Dignity 10
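The data flow of this topology can be imitated in a single JVM to see what each stage does. The sketch below is plain Java without the Storm API, so it shows only the logic of the split and count stages, not how spouts and bolts are wired; `WordCountSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process imitation of the split and count stages.
final class WordCountSketch {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    // Role of the split-sentence stage: one sentence in, one word out per token.
    static List<String> split(String sentence) {
        List<String> words = new ArrayList<>();
        for (String token : sentence.split("[^A-Za-z]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Role of the word-count stage: bump and return the running count for a word.
    int count(String word) {
        return counts.merge(word, 1, Integer::sum);
    }

    public static void main(String[] args) {
        WordCountSketch counter = new WordCountSketch();
        for (String word : split("Two households, both alike in dignity")) {
            // Role of the printer stage: report each (word, count) pair.
            System.out.println(word + ": " + counter.count(word));
        }
    }
}
```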

Page 33: Scalable Big Data Stream Processing with Storm and Groovy

WORD COUNT

Data Source

Page 34: Scalable Big Data Stream Processing with Storm and Groovy

WRITING SPOUT

Page 35: Scalable Big Data Stream Processing with Storm and Groovy

WRITING BOLT

SplitSentenceBolt

Resource initialization

Page 36: Scalable Big Data Stream Processing with Storm and Groovy

WRITING BOLT

WordCountBolt

Page 37: Scalable Big Data Stream Processing with Storm and Groovy

WRITING BOLT

PrinterBolt

Page 38: Scalable Big Data Stream Processing with Storm and Groovy

TOPOLOGY

Linking it all together

Page 39: Scalable Big Data Stream Processing with Storm and Groovy

Demo

Page 40: Scalable Big Data Stream Processing with Storm and Groovy

V. STORM PARALLELISM: How to scale stream processing

Page 41: Scalable Big Data Stream Processing with Storm and Groovy

STORM PARALLELISM

Storm main components

• NODES (SERVERS): machines in a Storm cluster.

• WORKERS (JVM): JVM processes running on a node. One or more per node.

• EXECUTORS (THREADS): Java threads running within a worker JVM process.

• TASKS (BOLT/SPOUT): instances of spouts and bolts.

Page 42: Scalable Big Data Stream Processing with Storm and Groovy

Parallelism configuration

Page 43: Scalable Big Data Stream Processing with Storm and Groovy

Parallelism configuration

Page 44: Scalable Big Data Stream Processing with Storm and Groovy

Stream Grouping

How tuples are sent between instances of spouts and bolts

Page 45: Scalable Big Data Stream Processing with Storm and Groovy

VI. Guaranteed Message Processing

Page 46: Scalable Big Data Stream Processing with Storm and Groovy

Reliability API

Tuple tree

Storm retries always happen from the source
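Storm tracks a tuple tree by XOR-ing tuple ids: each id is mixed in once when the tuple is created and once when it is acked, so the running value returns to zero when every tuple in the tree has been acked. The sketch below illustrates that idea only; `AckTracker` and its methods are hypothetical and much simpler than Storm's acker, and the small ids are for readability (Storm uses random 64-bit ids, which makes an accidental zero very unlikely).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker illustrating XOR-based tuple-tree tracking.
final class AckTracker {
    // Root (spout tuple) id -> XOR of all tuple ids created or acked so far.
    private final Map<Long, Long> pending = new HashMap<>();

    // Called when the spout emits a root tuple with the given tuple id.
    void emitRoot(long rootId, long tupleId) {
        pending.put(rootId, tupleId);
    }

    // Called when a bolt acks ackedId after emitting children anchored to it.
    // Returns true once the whole tree for rootId has been processed.
    boolean ack(long rootId, long ackedId, long... childIds) {
        Long current = pending.get(rootId);
        if (current == null) {
            throw new IllegalStateException("unknown or finished root " + rootId);
        }
        long value = current ^ ackedId;
        for (long child : childIds) {
            value ^= child;
        }
        if (value == 0) {
            pending.remove(rootId);
            return true;
        }
        pending.put(rootId, value);
        return false;
    }

    // On failure or timeout the tree is dropped; the spout replays from the source.
    void fail(long rootId) {
        pending.remove(rootId);
    }

    public static void main(String[] args) {
        AckTracker tracker = new AckTracker();
        tracker.emitRoot(1L, 11L);
        System.out.println(tracker.ack(1L, 11L, 22L, 33L)); // two children still pending
        System.out.println(tracker.ack(1L, 22L));
        System.out.println(tracker.ack(1L, 33L));           // tree complete
    }
}
```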

Page 47: Scalable Big Data Stream Processing with Storm and Groovy

Reliability API (Spout)

Methods from ISpout interface

Page 48: Scalable Big Data Stream Processing with Storm and Groovy

Reliability API (Bolt)

Reliability in Bolts

Anchoring Ack Fail

Page 49: Scalable Big Data Stream Processing with Storm and Groovy

VII. TESTING: Unit testing Storm components

Page 50: Scalable Big Data Stream Processing with Storm and Groovy

TESTING STORM

TEST EACH COMPONENT SEPARATELY

USE STORM TESTING API

CHECK TestingApiDemo.java on GITHUB

Page 51: Scalable Big Data Stream Processing with Storm and Groovy

TESTING with SPOCK

BDD style of testing

Page 52: Scalable Big Data Stream Processing with Storm and Groovy

VIII. COMPONENTS OF A STORM CLUSTER

Page 53: Scalable Big Data Stream Processing with Storm and Groovy

STORM COMPONENTS

Physical View

Page 54: Scalable Big Data Stream Processing with Storm and Groovy

PACKAGE and DEPLOYMENT

deploying topology to a cluster

storm jar Spring2GX-1.0.jar com.spring2gx.storm.WordCountTopology word-count-topology

Page 55: Scalable Big Data Stream Processing with Storm and Groovy

STORM UI

Monitoring and performance tuning

Page 56: Scalable Big Data Stream Processing with Storm and Groovy

IX. STORM FAULT-TOLERANCE

Page 57: Scalable Big Data Stream Processing with Storm and Groovy

Normal operations

Page 58: Scalable Big Data Stream Processing with Storm and Groovy

NIMBUS DOWN

Run under supervision: Monit, supervisord

Page 59: Scalable Big Data Stream Processing with Storm and Groovy

WORKER NODE DOWN

Nimbus moves work to another node

Page 60: Scalable Big Data Stream Processing with Storm and Groovy

SUPERVISOR DOWN

Page 61: Scalable Big Data Stream Processing with Storm and Groovy

WORKER PROCESS DOWN

Supervisor will restart worker

Page 62: Scalable Big Data Stream Processing with Storm and Groovy

X. TRIDENT: Micro-Batch Stream Processing

Page 63: Scalable Big Data Stream Processing with Storm and Groovy

TRIDENT

• HIGH LEVEL API: similar to Pig or Cascading.

• FUNCTIONS: functions, filters, aggregations, joins, grouping.

• MICRO-BATCH: ordered batches of tuples. Batches can be partitioned.

• EXACTLY-ONCE SEMANTICS: transactional spouts.

• SUPPORT STATE: Trident has a first-class abstraction for reading from and writing to stateful sources.

Page 64: Scalable Big Data Stream Processing with Storm and Groovy

TRIDENT IS MICRO-BATCH

Stream processed in small batches

• Each batch has a unique ID which is always the same on each replay

• If one tuple failed, the whole batch is reprocessed

• Batches are ordered

Page 65: Scalable Big Data Stream Processing with Storm and Groovy

How does Trident provide exactly-once semantics?

Page 66: Scalable Big Data Stream Processing with Storm and Groovy

EXACTLY-ONCE SEMANTICS

Store the count along with BatchID

• Stored state: COUNT 100, BATCHID 1

• 10 more tuples with batchId 2: COUNT 110, BATCHID 2

• Failure: batch 2 is replayed with the same batchId (2), so the stored count is not incremented again

• Spout should replay a batch exactly as it was played before

• Trident API hides the complexity of dealing with batchIDs
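The batch-id check can be sketched as a tiny in-memory store. This is a minimal illustration, assuming batches arrive in order and a replayed batch carries the same id and contents as the original; `BatchCountStore` is a hypothetical class, not part of the Trident API, which hides this bookkeeping behind its state abstractions.

```java
// Hypothetical store that keeps a count together with the id of the last applied batch.
final class BatchCountStore {
    private long count;
    private long lastBatchId = -1; // no batch applied yet

    // Apply a batch's increment unless that batch id was already applied.
    // Returns true if the count changed, false if the batch was a replay.
    boolean apply(long batchId, long increment) {
        if (batchId == lastBatchId) {
            return false; // same batch replayed after a failure: skip the update
        }
        count += increment;
        lastBatchId = batchId;
        return true;
    }

    long count() {
        return count;
    }

    long lastBatchId() {
        return lastBatchId;
    }

    public static void main(String[] args) {
        BatchCountStore store = new BatchCountStore();
        store.apply(1, 100);               // COUNT 100, BATCHID 1
        store.apply(2, 10);                // COUNT 110, BATCHID 2
        store.apply(2, 10);                // replay of batch 2: count stays at 110
        System.out.println(store.count()); // 110
    }
}
```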

Page 67: Scalable Big Data Stream Processing with Storm and Groovy

TRIDENT EXAMPLE

Word count with Trident

Page 68: Scalable Big Data Stream Processing with Storm and Groovy

TRIDENT EXAMPLE

Word count with Trident

Page 69: Scalable Big Data Stream Processing with Storm and Groovy

TRIDENT FUNCTION

Word count with Trident

Page 70: Scalable Big Data Stream Processing with Storm and Groovy

Demo

Page 71: Scalable Big Data Stream Processing with Storm and Groovy

STORM vs TRIDENT

Style of computation: one at a time (Storm) vs. micro-batch (Trident)

Lower latency

Higher throughput

At-least-once semantics

Exactly-once semantics

Simpler programming model

Page 72: Scalable Big Data Stream Processing with Storm and Groovy

XI. TWITTER CHALLENGE

Page 73: Scalable Big Data Stream Processing with Storm and Groovy

Twitter streaming system

Page 74: Scalable Big Data Stream Processing with Storm and Groovy

WEBMD MEDPULSE

Page 75: Scalable Big Data Stream Processing with Storm and Groovy

WEBMD MEDPULSE

Enhancing Twitter feed with lead Image and Title

• Readability enhancements

• Image Scaling

• Remove duplicates

• Custom Business Logic

Page 76: Scalable Big Data Stream Processing with Storm and Groovy

Twitter streaming system

Page 77: Scalable Big Data Stream Processing with Storm and Groovy

TWITTER STREAMING API

Page 78: Scalable Big Data Stream Processing with Storm and Groovy

TWITTER STREAMING API

Status

Page 79: Scalable Big Data Stream Processing with Storm and Groovy

Use the Twitter4J Java library

CREATE SPOUT

Page 80: Scalable Big Data Stream Processing with Storm and Groovy

CREATE TWITTER SPOUT

Use an existing spout from the Storm contrib project on GitHub

Spouts exist for: Twitter, Kafka, JMS, RabbitMQ, Amazon SQS, Kinesis, MongoDB…

Page 81: Scalable Big Data Stream Processing with Storm and Groovy

MEDPULSE TOPOLOGY

• Storm takes care of scalability and fault-tolerance

• Problem: burst in traffic

Page 82: Scalable Big Data Stream Processing with Storm and Groovy

MEDPULSE TOPOLOGY

Introducing Queuing Layer with Kafka

Fast, Scalable, Durable, Distributed

Page 83: Scalable Big Data Stream Processing with Storm and Groovy

MEDPULSE TOPOLOGY

Page 84: Scalable Big Data Stream Processing with Storm and Groovy

STORM ARCHITECTURAL BLUEPRINT

Queue (Kafka)

Page 85: Scalable Big Data Stream Processing with Storm and Groovy

OTHER WEBMD USE CASES

Solr Indexing

Page 86: Scalable Big Data Stream Processing with Storm and Groovy

RULE ENGINE TOPOLOGY

Processing Groovy Rules (DSL) at scale in real-time

Page 87: Scalable Big Data Stream Processing with Storm and Groovy

XII. Tuning and Monitoring

Page 88: Scalable Big Data Stream Processing with Storm and Groovy

Synthetic Monitoring

Page 89: Scalable Big Data Stream Processing with Storm and Groovy

Use Graphite, StatsD and the Storm Metrics API

http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/

Page 90: Scalable Big Data Stream Processing with Storm and Groovy

• Use cache if you can: for example Google Guava caching utilities

• @CompileStatic

Performance Optimization
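The caching advice can be illustrated with a small memoizing lookup. The slide recommends the Google Guava caching utilities; the sketch below deliberately uses only the JDK as a stand-in, so it has no eviction or expiry, which a real cache inside a long-running bolt would need. `LookupCache` is a hypothetical class.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical JDK-only memoizing cache; unbounded, unlike a Guava cache with eviction.
final class LookupCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    // The loader stands in for an expensive call such as an API or DB lookup.
    LookupCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Return the cached value, calling the loader only on the first request for a key.
    V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        LookupCache<String, Integer> lengths = new LookupCache<>(String::length);
        System.out.println(lengths.get("storm")); // computed by the loader
        System.out.println(lengths.get("storm")); // served from the cache
    }
}
```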

Page 91: Scalable Big Data Stream Processing with Storm and Groovy

RESOURCES

• http://www.michael-noll.com

• http://www.bigdata-cookbook.com/post/72320512609/storm-metrics-how-to

• http://svendvanderveken.wordpress.com/

Page 92: Scalable Big Data Stream Processing with Storm and Groovy

RESOURCES

edvorkin/Storm_Demo_Spring2GX

Page 93: Scalable Big Data Stream Processing with Storm and Groovy

THANK YOU FOR LISTENING

Page 94: Scalable Big Data Stream Processing with Storm and Groovy

QUESTIONS AND ANSWERS

Go ahead. Ask away.

Page 95: Scalable Big Data Stream Processing with Storm and Groovy

WRITING SPOUT
