Storm at Forter
Picture https://www.flickr.com/photos/silentmind8/15865860242 by silentmind8 under CC BY 2.0 http://creativecommons.org/licenses/by/2.0/
Forter
• We detect fraud
• A lot of data is collected
• New data can introduce new data sources
• At transaction time, we do our magic. Fast.
• We deny less
What’s Storm?
• Streaming/data-pipeline infrastructure
• What’s a pipeline?
• “Topology”-driven flow, statically defined
• Runs on the JVM; also supports Python and Node.js
• Easy clustering
• Apache top level project, large community
Storm Lingo
• Tuples
• The basic data transfer object in Storm. Basically a dictionary (key->val).
• Spouts
• Entry points into the pipe. This is where data comes from.
• Bolts
• Components that can transform and route tuples
• Joins
• Points where async branches of the topology meet and merge
• Streams
• Streams allow for flow control in the topology
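A minimal sketch of the tuple concept, using a plain Python dict to stand in for Storm's tuple class (the field names and the `transform` bolt are illustrative, not from the deck):

```python
# Sketch: a Storm tuple behaves like a named-field record. A plain dict
# stands in here; field names are declared by whichever component emits.

def make_tuple(**fields):
    """Build a tuple-like record: field name -> value."""
    return dict(fields)

def transform(tup):
    """A bolt-style transform: read input fields, emit a new tuple downstream."""
    return make_tuple(tx_id=tup["tx_id"], amount_usd=tup["amount"] * tup["fx_rate"])

tx = make_tuple(tx_id="t-1", amount=100.0, fx_rate=2.0)
out = transform(tx)
```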
System challenges
• Latency should be determined by business needs - flexible per customer (from 300ms SLAs up to customers who just don’t care)
• Data dependencies in decision part can get very complex
• Getting data can be slow, especially 3rd party
• Data scientists write in Python
• Should be scalable, because we’re ever growing
• Should be very granularly monitored
Bird’s eye view
• Two systems:
• System 1: data prefetching & preparing
• System 2: decision engine, must have all available data handy at TX time
System 1: high throughput pipeline
• Stream Batching
• Prefetching / Preparing
• Common use case, lots of competitors
System 2: low latency decision
• Dedicated everything
• Complex dependency graph
• Less common, fewer players
System 1: High Throughput
Cache and cache layering
• Storm constructs make it easy to tweak caches and add enrichment steps transparently
• Different enrichment operations may require different execution power
• Each operation can be replaced by a sub-topology - layering of cache levels
• Field grouping allows the ability to maintain state in components - local cache or otherwise
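The field-grouping point above can be sketched as follows, assuming a hash of the grouping field picks the task (task count, `enrich`, and the cache layout are illustrative):

```python
# Sketch: field grouping routes every tuple with the same key to the same
# bolt task, so each task can keep a purely local cache with no coordination.
import zlib

NUM_TASKS = 4

def field_group(key, num_tasks=NUM_TASKS):
    """Deterministically map a grouping field to a task index."""
    return zlib.crc32(key.encode()) % num_tasks

# One local cache per bolt task; safe because a given key only ever
# arrives at one task.
local_caches = [dict() for _ in range(NUM_TASKS)]
calls = []

def slow_lookup(key):
    """Stands in for a slow enrichment step (e.g. a sub-topology or 3rd party)."""
    calls.append(key)
    return {"ip_reputation": "ok"}

def enrich(key):
    cache = local_caches[field_group(key)]
    if key not in cache:
        cache[key] = slow_lookup(key)
    return cache[key]

first = enrich("1.2.3.4")
second = enrich("1.2.3.4")   # served from the task-local cache
```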
Maintain a stored state
• Many events coming in; some cause a state to change
• State of a working set is saved in memory
• New/old states are fetched from an external data source
• State updates are saved immediately
• State machine is scalable - again, field grouping
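The bullets above can be sketched as a small event handler (the store, key names, and counter are hypothetical stand-ins for the real state and database):

```python
# Sketch: an in-memory working set, backed by an external store. Unknown
# keys are fetched on demand; every state change is persisted immediately.

external_store = {"user-1": {"tx_count": 5}}   # stands in for a real DB
working_set = {}                               # in-memory working set

def on_event(user_id):
    state = working_set.get(user_id)
    if state is None:
        # Cold key: fetch existing state (or a fresh one) from the store.
        state = external_store.get(user_id, {"tx_count": 0})
        working_set[user_id] = state
    state["tx_count"] += 1
    external_store[user_id] = state            # persist the update immediately
    return state["tx_count"]

count1 = on_event("user-1")   # existing state, incremented
count2 = on_event("user-2")   # fresh state
```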
And the rest…
• Batching content for writing (Storm’s tick tuples)
• Aggregating events in memory
• Throttling/Circuit-breaking external calls
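The circuit-breaking bullet can be illustrated with a minimal breaker around an external call (class and threshold are hypothetical, not Forter's implementation):

```python
# Sketch: after `threshold` consecutive failures the breaker opens and
# callers fail fast instead of waiting on a slow or down dependency.

class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args)
            self.failures = 0        # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            raise

breaker = CircuitBreaker(threshold=2)

def flaky():
    raise IOError("3rd-party timeout")

for _ in range(2):                   # two real failures open the circuit
    try:
        breaker.call(flaky)
    except IOError:
        pass

fast_failed = False
try:
    breaker.call(flaky)              # now fails fast, flaky() is not invoked
except RuntimeError:
    fast_failed = True
```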
System 2: Low Latency
Unique Challenges
• Scaling: resources need to be very dedicated; parallelizing is bad
• Join logic is much stricter, with short timeouts
• Data validity is crucial for the stream routing
• Error handling
• Component graph is immense and hard to contain mentally - especially considering the delicate time window configurations.
Scalability
• Each topology is built to handle a fixed number of parallel TXs (Storm’s max-spout-pending)
• Each topology atomically polls a queue
• We keep as much of the logic as possible in the same process to reduce network and serialization costs
• Latency is the only measure
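The max-spout-pending idea can be sketched like this (queue contents and function names are illustrative):

```python
# Sketch: the spout only emits a new transaction when the number of
# in-flight (unacked) tuples is below a fixed cap, keeping latency
# predictable under load.
from collections import deque

MAX_SPOUT_PENDING = 2
queue = deque(["tx1", "tx2", "tx3", "tx4"])
in_flight = set()
acked = []

def next_tuple():
    if len(in_flight) >= MAX_SPOUT_PENDING or not queue:
        return None                   # back-pressure: emit nothing
    tx = queue.popleft()
    in_flight.add(tx)
    return tx

def ack(tx):
    in_flight.discard(tx)
    acked.append(tx)

a = next_tuple()                      # "tx1"
b = next_tuple()                      # "tx2"
stalled = next_tuple()                # None: cap of 2 reached
ack(a)
c = next_tuple()                      # capacity freed, "tx3" is emitted
```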
Joining and errors
• Waiting is not an option
• Tick tuples are no good; they break the single-thread illusion
• Static topologies are easy to analyze, edit at runtime, and intervene in
• Fallback streams are an elegant solution to the problem, sparing developers from explicitly defining escape routes
• Also allow for “try->finally” semantics
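The fallback-stream idea can be sketched as routing a failed bolt's output onto a degraded stream so the downstream join never waits on a dead branch (names here are hypothetical):

```python
# Sketch: on error, emit a degraded tuple on a fallback stream instead of
# letting the branch go silent - "try -> finally" semantics in the topology.

def run_bolt(bolt, tup, emit, emit_fallback):
    try:
        emit(bolt(tup))
    except Exception as err:
        # The downstream join still receives a tuple for this TX.
        emit_fallback({"tx_id": tup["tx_id"], "error": str(err), "score": None})

main_out, fallback_out = [], []

def scoring_bolt(tup):
    raise ValueError("3rd-party data missing")

run_bolt(scoring_bolt, {"tx_id": "t-9"}, main_out.append, fallback_out.append)
```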
Multilang
• Storm allows running bolt processes (shell bolts) with the built-in capability of communicating through standard I/O
• Not hugely scalable, but works
• Implementations exist for Node.js (our contribution) and Python
• We use it for legacy code and to keep data scientists happy
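The stdio exchange can be sketched as JSON messages over the pipe; this toy encoder/decoder only illustrates the shape of the idea, not the exact Storm multilang wire protocol:

```python
# Sketch: the JVM worker and the shell bolt exchange JSON messages over
# stdin/stdout. Here each message is a JSON body followed by a delimiter
# line (illustrative, not the exact Storm framing).
import json

def encode_msg(msg):
    """Serialize one message for the pipe."""
    return json.dumps(msg) + "\nend\n"

def decode_msg(raw):
    """Parse one message back off the pipe."""
    body, _, _ = raw.partition("\nend\n")
    return json.loads(body)

sent = encode_msg({"command": "emit", "tuple": ["t-1", 0.93]})
received = decode_msg(sent)
```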
Data Validity
• Wrapping the bolts, we implemented contracts for outputs
• Java POJOs with Hibernate Validator
• Contracts let us “hard-type” the links in the topologies
• They also help minimize data flow, especially to shell bolts
• Check out storm-data-contracts on GitHub
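The real contracts are Java POJOs validated with Hibernate Validator; this is a Python sketch of the same idea, with an illustrative contract and field names:

```python
# Sketch: a bolt's output must satisfy a declared contract (field names and
# types) before it is allowed onto the wire between components.

CONTRACT = {          # illustrative contract for a score tuple
    "tx_id": str,
    "score": float,
}

def validate(tup, contract=CONTRACT):
    for field, ftype in contract.items():
        if field not in tup:
            raise ValueError("missing field: " + field)
        if not isinstance(tup[field], ftype):
            raise TypeError("bad type for " + field)
    return tup

ok = validate({"tx_id": "t-1", "score": 0.5})

error = None
try:
    validate({"tx_id": "t-1"})       # contract violation: no score
except ValueError as e:
    error = str(e)
```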
Managing Complexity
• The complexity of the data dependencies is managed by literally drawing it.
• Nimbus REST APIs offer access to the topology layout
• Timing complexity reduced by synchronizing the joins to a shared point-in-time. Still pretty complex.
• Proves better than our previous iterative solution
Monitoring
• Nimbus metrics give out averages - not good enough
• Riemann is used to efficiently monitor latencies for every tuple in the system
• Inherent low latency monitoring issue: CPU utilization monitoring
• More at Itai Frenkel’s lecture
Questions?
Contact info:
Re’em Bensimhon
[email protected] / [email protected]
linkedin.com/in/bensimhon
twitter: @reembs