Low-Latency Streaming Data Processing in Hadoop

InSemble Inc. http://www.insemble.com

Agenda

1. Reference Architecture for Low Latency Streaming
2. Flume
3. Kafka
4. Storm
5. Demo

Hadoop Ecosystem

Source: Apache Hadoop Documentation

Cloudera Platform

Hortonworks Data Platform (HDP)

Real-Time Stream Processing Architecture with Hadoop

Flume Architecture

• Distributed system for collecting and aggregating data from multiple sources into a centralized data store

• An agent is a JVM process that hosts the Flume components (source, channel, sink)

• A channel stores events until a sink picks them up

• Different types of Flume sources

• Source and Sink are decoupled
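
To make the decoupling concrete, here is a minimal in-process sketch of the source, channel, sink pattern. It is not Flume code: `MemoryChannel`, `run_source`, and `run_sink` are hypothetical names chosen for illustration, and a real Flume agent is wired together through a properties file rather than written by hand.

```python
from collections import deque


class MemoryChannel:
    """Bounded in-memory buffer standing in for a Flume channel (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        # The source only ever talks to the channel, never to the sink.
        if len(self._events) >= self.capacity:
            raise OverflowError("channel is full")
        self._events.append(event)

    def take(self):
        # Returns None when nothing is buffered.
        return self._events.popleft() if self._events else None

    def __len__(self):
        return len(self._events)


def run_source(lines, channel):
    """Hypothetical source: wraps each input line in an event and buffers it."""
    for line in lines:
        channel.put({"body": line})


def run_sink(channel, store):
    """Hypothetical sink: drains the channel into a destination list."""
    while True:
        event = channel.take()
        if event is None:
            break
        store.append(event["body"])


channel = MemoryChannel(capacity=100)
store = []
run_source(["tweet 1", "tweet 2", "tweet 3"], channel)
buffered = len(channel)  # events wait in the channel until the sink runs
run_sink(channel, store)
```

Because the source and sink share only the channel, either side can be swapped (for example, a different sink) without touching the other.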

Consolidation Architecture

Multiplexing Architecture

Kafka Introduction

• Messaging system that is distributed, partitioned, and replicated
• Kafka brokers run as a cluster
• Producers and consumers can be written in any language

Topic

• Each partition is an ordered, immutable sequence of messages
• Messages are retained for a configured period of time
• The “offset” (the consumer's position in the partition) is controlled by the consumer
• Each partition is replicated and has one “leader” and zero or more “followers”; reads and writes are done only on the leader
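
The consumer-controlled offset can be sketched with a toy append-only log. This is an illustration only: `PartitionLog` and `Consumer` are hypothetical classes, not the Kafka client API, and retention and replication are left out.

```python
class PartitionLog:
    """Append-only list standing in for one topic partition (illustrative only)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset assigned to the new message

    def read(self, offset, max_messages=10):
        # Reading never removes messages; the log itself is immutable history.
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Hypothetical consumer that tracks its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # The consumer, not the broker, decides where to read from next.
        self.offset = offset


log = PartitionLog()
for message in ["a", "b", "c"]:
    log.append(message)

consumer = Consumer(log)
first = consumer.poll()     # reads everything from offset 0
consumer.seek(1)            # rewind to replay from offset 1
replayed = consumer.poll()
```

Rewinding with `seek` works because consuming a message does not delete it; the log keeps it until retention expires.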

Producers and Consumers

• The producer controls which partition each message goes to
• Supports both queuing and pub/sub
  – Through an abstraction called the consumer group
• Ordering is guaranteed within a partition
  – To preserve ordering for a subscriber, only one subscriber in a consumer group may read from a given partition
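
The two ideas on this slide, producer-side partition choice and consumer groups, can be sketched as follows. The code is a simplified model under stated assumptions: `choose_partition` uses CRC32 as a stand-in hash (Kafka's own partitioner differs), and `assign_partitions` is a plain round-robin rather than any of Kafka's real assignment strategies.

```python
import zlib


def choose_partition(key, num_partitions):
    """Hypothetical partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Queue-like behaviour: consumers in ONE group split the partitions between them.
group_a = assign_partitions(partitions, ["a1", "a2"])

# Pub/sub behaviour: a SECOND group independently receives every partition.
group_b = assign_partitions(partitions, ["b1"])

# Messages with the same key stay in one partition, so their order is preserved.
p1 = choose_partition("user-42", len(partitions))
p2 = choose_partition("user-42", len(partitions))
```

Since each partition has a single reader per group, that reader sees the partition's messages in the order they were written.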

Storm Introduction

• Distributed real-time computation system
  – Processes unbounded streams of data
  – Can be used with multiple programming languages
  – Scalable, fault-tolerant, and guarantees that data will be processed
• Use cases
  – Real-time analytics, online machine learning
  – Continuous computation
  – Distributed RPC
  – ETL
• Concepts
  – Topology
  – Spouts
  – Bolts

Concepts

• Storm cluster
  – Master node (Nimbus)
    • Distributes code
    • Assigns tasks to machines
    • Monitors for failures
  – Worker nodes (Supervisor)
    • Start and stop worker processes
    • Each worker process executes a subset of a topology
  – ZooKeeper
    • Coordinates between Nimbus and the Supervisors
    • Nimbus and the Supervisors are completely stateless
    • State is maintained in ZooKeeper or on local disk

Details

• Stream
  – Unbounded sequence of tuples
• Spout (user-written logic)
  – Source of a stream; emits tuples
• Bolt (user-written logic)
  – Processes streams and emits tuples
• Topology
  – DAG of spouts and bolts
  – Submitted to a Storm cluster
  – Each node runs in parallel, and the degree of parallelism is configurable
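
As a rough mental model of these terms, the sketch below chains Python generators into a tiny spout-to-bolt pipeline. It is single-threaded and finite, so it shows only the data flow, not Storm's parallelism or its API; `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical names.

```python
def sentence_spout():
    """Spout: source of the stream. A real spout is unbounded; this one is finite."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)  # each emitted item is a tuple


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count)."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# The "topology" is the wiring: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
```

In Storm each of these nodes would run as several parallel tasks; here they run one after another in a single process.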

Stream groupings

• A stream grouping tells a topology how to send tuples between two components
• Because tasks execute in parallel, the grouping controls which tasks the tuples are sent to
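
Two commonly used groupings, shuffle and fields, can be contrasted with a small routing sketch. This is an assumption-laden model, not Storm's implementation: the function names are hypothetical and CRC32 stands in for whatever hash Storm applies to the grouping fields.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng):
    """Pick a target task at random, spreading load across tasks."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, field_index, num_tasks):
    """Route by the value of one field so equal values reach the same task."""
    key = str(tup[field_index]).encode("utf-8")
    return zlib.crc32(key) % num_tasks


rng = random.Random(0)  # seeded only to make the sketch repeatable
tuples = [("#hadoop",), ("#storm",), ("#hadoop",), ("#kafka",), ("#storm",)]

fields_targets = [fields_grouping(t, 0, 4) for t in tuples]
shuffle_targets = [shuffle_grouping(t, 4, rng) for t in tuples]
```

A counting bolt needs fields grouping on the counted value; with shuffle grouping the occurrences of one hashtag would be scattered over several tasks and no single task would hold the full count.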

Demo - Twitter TopN Trending Topic

• Use the Flume Twitter Source to ingest data and publish events to a Kafka topic

• Use Storm as a real-time event processing system to calculate the TopN trending topics

• Use Redis to store the TopN result

• Use Node.js/JQuery for visualization
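
The core TopN calculation can be sketched in a few lines. This is a batch illustration on a made-up list of hashtags, not the demo's Storm bolts; `top_n` is a hypothetical helper, and the demo's actual ranking logic is not shown in the slides.

```python
from collections import Counter


def top_n(hashtags, n):
    """Return the n most frequent hashtags as (hashtag, count) pairs."""
    return Counter(hashtags).most_common(n)


# Invented sample data standing in for hashtags extracted from tweets.
stream = ["#hadoop", "#storm", "#hadoop", "#kafka", "#storm", "#hadoop"]
ranking = top_n(stream, 2)
```

In a streaming setting the counts would be updated incrementally and the ranking recomputed periodically before being written to the result store.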

Flow Chart

[Figure: Twitter → Flume Agent (Twitter Source → Mem Channel → Kafka Sink) → Kafka → Storm (Kafka Spout → Parse Twitter Bolt → Count Bolt → TopN Ranker Bolt → Report Bolt) → Redis → Node.js + JQuery]

Flume Agent — Source

Flume Agent — Channel

Flume Agent — Sink

Storm Topology Design

Submit Topology to Cluster

ParseTweetBolt Code
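
The bolt's code did not survive in this transcript, so the following is not the author's ParseTweetBolt. It is a hedged sketch of what a tweet-parsing step might do, assuming each incoming message is a JSON object with a `text` field and that hashtags are the unit being counted; `parse_tweet` and `HASHTAG` are hypothetical names.

```python
import json
import re

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def parse_tweet(raw_json):
    """Return the lower-cased hashtags found in the tweet's "text" field."""
    try:
        tweet = json.loads(raw_json)
    except json.JSONDecodeError:
        return []  # skip malformed messages instead of failing the stream
    if not isinstance(tweet, dict):
        return []
    text = tweet.get("text")
    if not isinstance(text, str):
        return []
    return [tag.lower() for tag in HASHTAG.findall(text)]


tags = parse_tweet('{"text": "Streaming with #Storm and #Kafka"}')
bad = parse_tweet("not json")
```

Lower-casing the tags means `#Storm` and `#storm` are counted as the same topic downstream.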
