Page 1
WHEN STORM HITS DATA.
DATA STREAMS PROCESSING IN REAL TIME.
MARCIN STANISLAWSKI
Page 2
WHO AM I?
Architect/Developer at Interia.pl
Storm and Hadoop user
GitHub: webik
Twitter: @unilama
Page 6
RUN JOB
COFFEE BREAK*
RESULTS
* - there are some solutions
Page 7
IMPALA
implemented in C++
non Map Reduce solution
Page 8
KIJI
KijiREST
HDFS/HBase/Cassandra
Page 9
BATCH PROCESSING VS. STREAMING
Page 12
STREAMING SOLUTIONS
Yahoo S4
Akka
Spark Streaming
Storm
Page 13
STORM: WHAT IS THAT?
Page 14
README.MD
Storm is a distributed realtime computation system.
Storm is simple, can be used with any programming
language, and is a lot of fun to use!
Page 15
CURRENT STATUS
Apache Incubation
Included in Hortonworks Data Platform
Contributed by Yahoo
Easy deploy to Amazon EC2
Page 18
SPOUTS
TAKE EVENTS FROM:
Kafka
Kestrel
RabbitMQ
...
AND PASS THEM TO...
Page 19
BOLTS
TUPLES ARE PROCESSED IN THE WAY THAT YOU IMPLEMENT
Page 20
EVENTS ARE TUPLES
( 1, "TEST", "ATMOSPHERE", "2014-05-20 10:00:40", ... )
OBJECTS ARE SERIALIZED USING KRYO
Page 21
WRITTEN IN JAVA & CLOJURE
TOPOLOGIES ARE DAGS
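The spout, bolt and DAG ideas from the last few slides can be illustrated with a minimal sketch in plain Python. It does not use the Storm API; `sentence_spout`, `split_bolt`, `CountBolt` and `run_topology` are hypothetical names made up for this illustration, and the whole thing runs in a single process.

```python
from collections import defaultdict


def sentence_spout():
    """Hypothetical spout: yields one-field tuples from a fixed list."""
    for sentence in ["storm hits data", "data streams"]:
        yield (sentence,)


def split_bolt(tup):
    """Bolt: emits one tuple per word of the incoming sentence."""
    return [(word,) for word in tup[0].split()]


class CountBolt:
    """Bolt: keeps a running count per word and emits (word, count)."""

    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, tup):
        self.counts[tup[0]] += 1
        return [(tup[0], self.counts[tup[0]])]


def run_topology(spout, edges, bolts, source="spout"):
    """Push every spout tuple through the DAG described by `edges`."""
    emitted = []
    for tup in spout():
        pending = [(source, tup)]
        while pending:
            node, current = pending.pop(0)
            downstream = edges.get(node, [])
            if not downstream:
                emitted.append(current)  # leaf of the DAG: final output
            for name in downstream:
                for out in bolts[name](current):
                    pending.append((name, out))
    return emitted


counter = CountBolt()
edges = {"spout": ["split"], "split": ["count"]}  # spout -> split -> count
result = run_topology(sentence_spout, edges, {"split": split_bolt, "count": counter})
```

In a real topology the nodes run as parallel tasks on different workers; the sketch only shows how tuples flow along the edges.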
Page 22
ARCHITECTURE
Nimbus
Nodes (Supervisors)
UI
DRPC
Page 23
EACH EVENT IS PROCESSED ONE OR MORE TIMES.
Page 24
ACKING FRAMEWORK
Each tuple must be acked or failed
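Why "one or more times"? A rough sketch, assuming a spout that keeps every emitted message pending until it is acked and puts it back in the queue when it is failed. `ReplayingSpout`, `run` and `flaky` are hypothetical names for this illustration, not Storm classes.

```python
from collections import deque


class ReplayingSpout:
    """Keeps emitted messages pending until they are acked or failed."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))
        self.pending = {}

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        del self.pending[msg_id]  # done, forget it

    def fail(self, msg_id):
        # failed tuples go back to the queue and will be replayed
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(spout, process):
    """Feed tuples to `process`; ack on success, fail otherwise."""
    seen = []
    while (item := spout.next_tuple()) is not None:
        msg_id, payload = item
        seen.append(payload)
        if process(payload):
            spout.ack(msg_id)
        else:
            spout.fail(msg_id)
    return seen


failed_once = set()


def flaky(payload):
    """Fails the first time it sees "b", succeeds afterwards."""
    if payload == "b" and payload not in failed_once:
        failed_once.add(payload)
        return False
    return True


seen = run(ReplayingSpout(["a", "b", "c"]), flaky)
```

Here `"b"` reaches the processing code twice, so anything the bolt does with it should be safe to repeat.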
Page 25
TUPLE TRACKING
each tuple has a random 64-bit id
the tracker keeps the xor of all tuple ids that have been created and/or acked in the tree
if that xor value equals 0, the tuple tree is fully processed
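The xor trick can be tried out in a few lines. This is a minimal sketch of the idea only, not Storm's acker code; `AckTracker` and its method names are assumptions made for the illustration. Every id is xor-ed in once when the tuple is created and once when it is acked, so the pairs cancel out.

```python
import random


class AckTracker:
    """Keeps one xor value per spout tuple tree."""

    def __init__(self):
        self.ack_vals = {}

    def created(self, root_id, tuple_id):
        # a new tuple in the tree is xor-ed in when it is created...
        self.ack_vals[root_id] = self.ack_vals.get(root_id, 0) ^ tuple_id

    def acked(self, root_id, tuple_id):
        # ...and xor-ed in again when it is acked, which cancels it out
        self.ack_vals[root_id] ^= tuple_id

    def fully_processed(self, root_id):
        return self.ack_vals[root_id] == 0


rng = random.Random(2014)  # seeded only to make the sketch repeatable


def new_id():
    return rng.getrandbits(64)


tracker = AckTracker()
root = new_id()
tracker.created(root, root)          # the spout tuple itself
children = [new_id(), new_id()]      # two tuples anchored to it by a bolt
for child in children:
    tracker.created(root, child)
tracker.acked(root, root)
in_flight = tracker.fully_processed(root)  # children are still pending
for child in children:
    tracker.acked(root, child)
done = tracker.fully_processed(root)
```

The tracker needs only one 64-bit value per tree, no matter how many tuples the tree contains. With random ids an accidental 0 before the tree is finished is possible but extremely unlikely.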
Page 26
COMMUNICATION
Between:
Tasks: LMAX Disruptor
Workers: ØMQ -> Netty
Page 27
TRIDENT
high-level abstraction
same as Cascading/Scalding in Hadoop World
Page 28
SPOUT
Key difference - producing Stream(s)
Page 29
STREAM
Batches chain with multiplication ability
Page 30
STREAM OPERATIONS
Functions
Filters
Projections
Joins
Merges
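What the five operations do to a batch can be sketched in plain Python, with tuples modelled as dicts of named fields. This is not the Trident API; the helper names (`apply_function`, `apply_filter`, `project`, `merge`, `join`) and the sample data are assumptions for the illustration.

```python
from collections import defaultdict


def apply_function(batch, fn):
    """Function: append the fields produced by fn to each tuple."""
    return [{**tup, **extra} for tup in batch for extra in fn(tup)]


def apply_filter(batch, predicate):
    """Filter: keep only the tuples the predicate accepts."""
    return [tup for tup in batch if predicate(tup)]


def project(batch, fields):
    """Projection: keep only the named fields."""
    return [{name: tup[name] for name in fields} for tup in batch]


def merge(*batches):
    """Merge: combine several streams into one."""
    return [tup for batch in batches for tup in batch]


def join(left, right, key):
    """Join: pair up tuples from two batches that share `key`."""
    index = defaultdict(list)
    for tup in right:
        index[tup[key]].append(tup)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


traffic = [{"user": "ann", "bytes": 120}, {"user": "bob", "bytes": 0}]
active = apply_filter(traffic, lambda t: t["bytes"] > 0)
with_bits = apply_function(active, lambda t: [{"bits": t["bytes"] * 8}])
slim = project(with_bits, ["user", "bits"])
profiles = [{"user": "ann", "country": "PL"}]
joined = join(slim, profiles, "user")
merged = merge(slim, profiles)
```

Each helper takes a batch and returns a new batch, which is why the operations can be chained one after another.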
Page 31
STATE
Operations:
Grouping
Aggregate
Query
Page 32
STATE TYPES
non-transactional
transactional
opaque transactional
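The difference between the two transactional flavours is easiest to see on a single counter. A sketch under stated assumptions: each batch carries a transaction id (`txid`), a transactional state stores `(txid, value)` and assumes a replayed batch has exactly the same content, while an opaque transactional state stores `(txid, previous, value)` so that a replayed batch with different content can still be applied correctly. The function names are hypothetical.

```python
def transactional_update(stored, txid, delta):
    """stored = (last_txid, value). A replayed batch is skipped."""
    last_txid, value = stored
    if last_txid == txid:
        return stored  # already applied, batch content is identical
    return (txid, value + delta)


def opaque_update(stored, txid, delta):
    """stored = (last_txid, previous, value). A replayed batch is redone."""
    last_txid, previous, value = stored
    if last_txid == txid:
        # same txid again: recompute from the value before this batch
        return (txid, previous, previous + delta)
    return (txid, value, value + delta)


# transactional: replay of txid 3 changes nothing, txid 4 adds its delta
replayed = transactional_update((3, 10), 3, 5)
advanced = transactional_update((3, 10), 4, 5)

# opaque: txid 3 first added 4 (10 -> 14), the replay adds 6 instead
redone = opaque_update((3, 10, 14), 3, 6)
moved_on = opaque_update((3, 10, 14), 4, 6)
```

A non-transactional state would simply add `delta` every time, so a replay counts the batch twice.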
Page 33
STATE
In-memory state
NoSQL databases
External systems via APIs
Page 35
DRPC TOPOLOGY
NAMED DRPC SPOUT
USES MAIN TOPOLOGY STATES
GENERATES ONE TUPLE OUTPUT
Page 36
DRPC ELEMENTS
THRIFT SERVER(S)
WITH PREDEFINED SPOUT AND BOLT
Page 37
ARE YOU PROGRAMMING IN A NON-JVM LANGUAGE?
NO PROBLEM :)
Ruby
Python
Perl
PHP
...
Page 38
STREAMING API
API defined as Thrift
JSON based communication
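The JSON side of this can be sketched as follows, assuming the multi-lang convention in which every message is a JSON document followed by a line containing only `end`, and a bolt replies with `emit` and `ack` commands. The helper names and the uppercase "bolt" are made up for the illustration, and an in-memory buffer stands in for the stdin/stdout pipes.

```python
import io
import json


def write_message(stream, message):
    """Write one JSON message followed by the "end" delimiter line."""
    stream.write(json.dumps(message) + "\nend\n")


def read_message(stream):
    """Read lines up to the "end" delimiter and decode them as JSON."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))


def handle_tuple(incoming):
    """Hypothetical bolt logic: uppercase the first field, then ack."""
    word = incoming["tuple"][0]
    return [
        {"command": "emit", "anchors": [incoming["id"]], "tuple": [word.upper()]},
        {"command": "ack", "id": incoming["id"]},
    ]


# simulate the parent process sending one tuple to the bolt
inbound = io.StringIO()
write_message(inbound, {"id": "42", "comp": "1", "stream": "default",
                        "task": 9, "tuple": ["storm"]})
inbound.seek(0)
replies = handle_tuple(read_message(inbound))
```

Because the exchange is plain text over pipes, any language that can read a line and parse JSON can implement a bolt.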
Page 39
RED STORM
Writing topologies in Ruby
Page 40
REAL TIME ALGORITHMS
Page 41
SIMPLE OPERATIONS
Sum
Count
Multiplication
Page 42
MAXIMUM AND MINIMUM
don't lose current value
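In a stream these operations have to be computed one event at a time, so the current value must be kept between events. A minimal sketch, with `RunningStats` as a hypothetical name for the state a bolt would hold:

```python
class RunningStats:
    """Running sum, count, product, minimum and maximum of a stream."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.product = 1
        self.minimum = None
        self.maximum = None

    def update(self, value):
        self.count += 1
        self.total += value
        self.product *= value
        # keep the current extreme unless the new value beats it
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value


stats = RunningStats()
for event in [3, -1, 4]:
    stats.update(event)
```

The state is a handful of numbers regardless of how many events have passed, which is what makes these operations cheap in real time.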
Page 43
USUALLY TWO TOPOLOGIES
Page 44
LEARNING
Classification
Clustering
Page 45
MODEL
Evaluator
Visualiser
Page 46
BASIC ELEMENT TABLE
Page 48
ALGORITHM EXAMPLES
k-means clustering
statistical tests (T, F, Z, Chi2)
Hidden Markov Models
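As one example of how such an algorithm can work on a stream, here is a sketch of sequential (online) k-means: each incoming point moves its nearest centroid a little towards itself, so no point has to be stored. This is an illustration of the general technique, not necessarily the variant used in the talk; `OnlineKMeans` is a hypothetical name, and the seed centroids are assumed to count as the first observation of each cluster.

```python
class OnlineKMeans:
    """Sequential k-means: update the nearest centroid per point."""

    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]
        self.counts = [1] * len(self.centroids)  # seed counts as one point

    def nearest(self, point):
        def squared_distance(centroid):
            return sum((p - c) ** 2 for p, c in zip(point, centroid))
        return min(range(len(self.centroids)),
                   key=lambda i: squared_distance(self.centroids[i]))

    def update(self, point):
        i = self.nearest(point)
        self.counts[i] += 1
        # move the centroid by 1/n of the way towards the new point,
        # which keeps it equal to the mean of the points assigned so far
        self.centroids[i] = [c + (p - c) / self.counts[i]
                             for p, c in zip(point, self.centroids[i])]
        return i


model = OnlineKMeans([(0, 0), (10, 10)])
first = model.update((2, 0))
second = model.update((10, 12))
```

The model state (centroids and counts) is exactly the kind of value a bolt would keep between tuples.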
Page 50
STORMUNIT
http://github.com/webik/StormUnit
MAVEN MOJO - COMING SOON :)
http://github.com/webik/storm-maven
Page 52
SUMMINGBIRD
Write once, run on:
Storm
Hadoop (Scalding)
Amazon Kinesis
Page 53
MAYBE BACK INTO ZOO
STORM YARN