§ Working for Trivadis for more than 18 years
§ Oracle ACE Director for Fusion Middleware and SOA
§ Co-author of several books
§ Consultant, trainer and software architect for Java, Oracle, SOA and Big Data / Fast Data
§ Member of the Trivadis Architecture Board
§ Technology Manager @ Trivadis
Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services, operating in Switzerland, Germany and Austria.
We offer our services in a number of strategic business fields; in the Operation field, Trivadis Services takes over the operation of your IT systems.
3rd December 2014 Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared
Batch Processing
• Familiar concept of processing data en masse
• Generally incurs high latency

(Event-) Stream Processing
• A one-at-a-time processing model
• A datum is processed as it arrives
• Sub-second latency
• Difficult to process state data efficiently

Micro-Batching
• A special case of batch processing with very small (tiny) batch sizes
• A nice mix between batching and streaming
• At the cost of some latency
• Enables stateful computation, making windowing an easy task
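The batching/streaming trade-off can be sketched with a toy micro-batcher (illustrative only, not any framework's API): events are buffered and only handed to the processing step once a small batch has accumulated, which adds latency but makes batch-style, stateful processing straightforward.

```python
class MicroBatcher:
    """Toy micro-batching sketch: buffers events and emits them in
    small batches instead of one at a time (hypothetical helper)."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []     # events waiting for the next batch
        self.batches = []    # completed batches, ready for processing

    def on_event(self, event):
        # Each incoming event is buffered; the batch is emitted only
        # when enough events have accumulated -- this is the latency cost.
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Emit whatever is buffered as one batch (e.g. on a timer tick).
        if self.buffer:
            self.batches.append(list(self.buffer))
            self.buffer.clear()
```

In a real system the flush would also be triggered by a timer (the batch interval), so a slow stream still makes progress.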
At most once [0..1]
• Messages may be lost
• Messages are never redelivered

At least once [1..n]
• Messages will never be lost
• But messages may be redelivered (might be OK if the consumer can handle it)

Exactly once [1]
• Messages are never lost
• Messages are never redelivered
• Perfect message delivery
• Incurs higher latency for transactional semantics
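The practical difference between these guarantees shows up on the consumer side. A minimal sketch (hypothetical `IdempotentConsumer`, not tied to any particular broker) of how a consumer can tolerate at-least-once redelivery by deduplicating on a message ID:

```python
class IdempotentConsumer:
    """Sketch of handling at-least-once delivery: redelivered messages
    are detected by ID and skipped, so each payload is processed once."""

    def __init__(self, handler):
        self.seen = set()        # IDs of messages already processed
        self.handler = handler   # the actual processing function

    def receive(self, msg_id, payload):
        if msg_id in self.seen:
            return False         # duplicate redelivery -- ignore it
        self.seen.add(msg_id)
        self.handler(payload)
        return True
```

In production the `seen` set would need to be bounded (e.g. per-partition high-water marks) and persisted, otherwise a consumer restart reintroduces duplicates.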
A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
• A highly distributed real-time computation system
• Provides general primitives to do real-time computation
• Simplifies working with queues & workers
• Scalable and fault-tolerant
• Complementary to Hadoop
• Written in Clojure; supports Java and Clojure
• Originated at BackType, acquired by Twitter in 2011
• Open-sourced in late 2011
• In the Apache Incubator since September 2013
August 2014 Unified Handling of Event Streams – Unified Log Processing Architecture
Tuple
• Core data structure in Storm
• Immutable set of key/value pairs
• You can think of Storm tuples as events
• Values must be serializable

Stream
• Key abstraction of Storm
• An unbounded sequence of tuples that can be processed in parallel by Storm
• Each stream is given an ID; bolts can produce and consume tuples from these streams on the basis of their ID
• Each stream also has an associated schema for the tuples that will flow through it
Each spout or bolt runs as N instances in parallel.
August 2014 CAS Big Data - FH Bern | Stream- and Event-Processing | Processing Event Streams - Apache Storm
[Topology diagram: a TwitterSpout emits tuples to two parallel SplitSentence bolt instances via shuffle grouping; the SplitSentence bolts emit words to two parallel WordCount bolt instances via fields grouping.]
• Shuffle grouping – tuples are distributed randomly across the bolt's tasks
• Fields grouping – tuples are partitioned by the value of the specified fields, so equal values always go to the same task
• All grouping – the stream is replicated to all tasks
• Global grouping – all tuples go to a single task
• None grouping – the bolt runs in the same thread as the bolt/spout it subscribes to
• Direct grouping – the producer (the task that emits) decides which consumer task will receive the tuple
• Local or shuffle grouping – similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
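How the groupings route a tuple to a task index can be illustrated with a toy hash-routing sketch (not Storm's actual implementation; function names are hypothetical). The key property of fields grouping is that equal field values always map to the same task:

```python
import random
import zlib

def fields_grouping(value, num_tasks):
    # Hash the grouping field so equal values always hit the same task.
    return zlib.crc32(str(value).encode("utf-8")) % num_tasks

def shuffle_grouping(num_tasks, rng=random.Random(42)):
    # Pick a task at random for an even load distribution.
    return rng.randrange(num_tasks)

def all_grouping(num_tasks):
    # Replicate: every task receives the tuple.
    return list(range(num_tasks))

def global_grouping(num_tasks):
    # All tuples go to a single task.
    return 0
```

This is why fields grouping makes the WordCount bolt correct: every occurrence of the same word is counted by the same task, so the per-word counter is never split across tasks.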
Core data model is the stream
• Processed as a series of batches (micro-batches)
• Stream is partitioned among nodes in the cluster

5 kinds of operations in Trident
• Operations that apply locally to each partition and cause no network transfer
• Repartitioning operations that don't change the contents
• Aggregation operations that do network transfer
• Operations on grouped streams
• Merges and joins
A Trident function
• takes in a set of input fields and emits zero or more tuples as output
• the fields of the output tuple are appended to the original input tuple in the stream
• if a function emits no tuples, the original input tuple is filtered out
• otherwise, the input tuple is duplicated for each output tuple
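These semantics can be modelled in a few lines (a toy model with tuples as dicts, not the real Trident API): emitted fields are merged onto the input tuple, emitting nothing filters the tuple, and emitting several times duplicates it.

```python
def apply_function(stream, fn):
    """Toy model of Trident function semantics.

    fn receives one input tuple (a dict) and returns zero or more
    dicts of output fields.  Output fields are appended to the input
    tuple; no output filters the tuple out; multiple outputs duplicate
    the input tuple once per emission."""
    result = []
    for tup in stream:
        for emitted in fn(tup):
            merged = dict(tup)     # original input fields are kept
            merged.update(emitted) # output fields are appended
            result.append(merged)
    return result
```

For example, a function that emits only for positive values both filters the stream and enriches the surviving tuples.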
30
                      Core Storm                               Storm Trident
Community             > 100 contributors                       > 100 contributors
Adoption              ***                                      *
Language Options      Java, Clojure, Scala, Python, Ruby, …    Java, Clojure, Scala
Processing Models     Event-Streaming                          Micro-Batching
Processing DSL        No                                       Yes
Stateful Ops          No                                       Yes
Distributed RPC       Yes                                      Yes
Delivery Guarantees   At most once / At least once             Exactly once
Spark Core
• General execution engine for the Spark platform
• In-memory computing capabilities deliver speed
• General execution model supports a wide variety of use cases
• DAG-based
• Ease of development – native APIs in Java, Scala and Python

Spark Streaming
• Runs a streaming computation as a series of very small, deterministic batch jobs
• Batch size as low as ½ second, latency of about 1 second
• Exactly-once semantics
• Potential for combining batch and stream processing in the same system
• Started in 2012, first alpha release in 2013
Resilient Distributed Dataset (RDD)
• Core Spark abstraction
• Collections of objects (partitions) spread across the cluster
• Partitions can be stored in-memory or on-disk (local)
• Enables parallel processing on data sets
• Built through parallel transformations
• Immutable, recomputable, fault-tolerant
• Contains the transformation history ("lineage") for the whole data set

Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDDs
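The "lineage" idea can be illustrated with a toy class (decidedly not the real Spark API): the dataset stores only the chain of transformations needed to build it, so a lost partition can be recomputed from the source rather than kept replicated.

```python
class ToyRDD:
    """Toy illustration of RDD lineage -- not the real Spark API.

    Instead of materialized data, the object holds a closure describing
    how to compute itself; collect() replays that lineage, which is how
    recomputation after a failure works conceptually."""

    def __init__(self, compute):
        self._compute = compute  # the lineage: how to (re)build the data

    @staticmethod
    def parallelize(data):
        data = list(data)
        return ToyRDD(lambda: list(data))

    def map(self, f):
        parent = self._compute   # new RDD points back at its parent
        return ToyRDD(lambda: [f(x) for x in parent()])

    def filter(self, pred):
        parent = self._compute
        return ToyRDD(lambda: [x for x in parent() if pred(x)])

    def collect(self):
        return self._compute()   # replays the full lineage
```

Because collect() recomputes from the source each time, nothing is lost if an intermediate result disappears; real Spark adds caching and partitioning on top of this idea.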
Input DStreams
• Represent the stream of raw data received from streaming sources
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP sockets, Akka actors, etc.
• Custom sources can be easily written for custom data sources

Operations
• Same as Spark Core
• Additional stateful transformations (window, reduceByWindow)
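What a stateful transformation like reduceByWindow does can be sketched over a stream of micro-batches (a conceptual toy, not Spark's API): each step reduces all values in the last N batches, so the window slides forward one batch at a time.

```python
from collections import deque
from functools import reduce

def reduce_by_window(batches, window_len, reduce_fn):
    """Toy sliding-window reduction over a stream of micro-batches.

    For each incoming batch, reduces the values of the most recent
    `window_len` batches -- conceptually what a windowed DStream
    operation computes per batch interval."""
    window = deque(maxlen=window_len)  # keeps only the last N batches
    results = []
    for batch in batches:
        window.append(batch)
        values = [v for b in window for v in b]
        results.append(reduce(reduce_fn, values))
    return results
```

With micro-batches this is cheap: the window is a fixed number of already-computed batches, which is why micro-batching "makes windowing an easy task".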
LinkedIn's motivation for Kafka was:
§ "A unified platform for handling all the real-time data feeds a large company might have."

Must-haves
§ High throughput to support high-volume event feeds.
§ Support real-time processing of these feeds to create new, derived feeds.
§ Support large data backlogs to handle periodic ingestion from offline systems.
§ Support low-latency delivery to handle more traditional messaging use cases.
§ Guarantee fault-tolerance in the presence of machine failures.