Page 1: Processing Big Data in Real-Time - Yanai Franchi, Tikal

1

Processing “BIG-DATA” in Real Time

Yanai Franchi, Tikal

Page 2: Processing Big Data in Real-Time - Yanai Franchi, Tikal

2

Two years ago...

Page 3: Processing Big Data in Real-Time - Yanai Franchi, Tikal

3

Page 4: Processing Big Data in Real-Time - Yanai Franchi, Tikal

4

Vacation to Barcelona

Page 5: Processing Big Data in Real-Time - Yanai Franchi, Tikal

5

After a Long Travel Day

Page 6: Processing Big Data in Real-Time - Yanai Franchi, Tikal

6

Going to a Salsa Club

Page 7: Processing Big Data in Real-Time - Yanai Franchi, Tikal

7

Best Salsa Club NOW

● Good Music

● Crowded – Now!

Page 8: Processing Big Data in Real-Time - Yanai Franchi, Tikal

8

Same Problem in “gogobot”

Page 9: Processing Big Data in Real-Time - Yanai Franchi, Tikal

9

Page 10: Processing Big Data in Real-Time - Yanai Franchi, Tikal

10

Gogobot Checkin Heat-Map Service

Let's Develop the “Gogobot Checkins Heat-Map”

Page 11: Processing Big Data in Real-Time - Yanai Franchi, Tikal

11

Key Notes

● Collector Service - Collects checkins as text addresses

– We need to use a GeoLocation Service

● Upon elapsed interval, the last locations list will be displayed as a Heat-Map in the GUI.

● Web Scale service – tens of thousands of checkins per second from all over the world (imaginary, but let's do it for the exercise).

● Accuracy – Sample data, NOT critical data.

– Proportionately representative

– Data volume is large enough to compensate for data loss.

Page 12: Processing Big Data in Real-Time - Yanai Franchi, Tikal

12

Heat-Map Context

[Diagram: Gogobot micro services in the Gogobot System send text addresses to the Checkins Heat-Map Service. The service calls Get-GeoCode(Address) on the Geo Location Service and outputs a Heat-Map of the last interval's locations.]

Page 13: Processing Big Data in Real-Time - Yanai Franchi, Tikal

13

Plan A

[Diagram: checkins are simulated with a file (Check-in #1, Check-in #2, ...). The processing step reads each text address, GETs its geo location from the Geo Location Service, and persists checkin intervals to the database.]

Page 14: Processing Big Data in Real-Time - Yanai Franchi, Tikal

14

Tons of Addresses Arriving Every Second

Page 15: Processing Big Data in Real-Time - Yanai Franchi, Tikal

15

Architect - First Reaction...

Page 16: Processing Big Data in Real-Time - Yanai Franchi, Tikal

16

Second Reaction...

Page 17: Processing Big Data in Real-Time - Yanai Franchi, Tikal

17

Developer - First Reaction...

Page 18: Processing Big Data in Real-Time - Yanai Franchi, Tikal

18

Second Reaction...

Page 19: Processing Big Data in Real-Time - Yanai Franchi, Tikal

19

Problems?

● Tedious: You spend time configuring where to send messages, deploying workers, and deploying intermediate queues.

● Brittle: There's little fault-tolerance.

● Painful to scale: Partitioning the work across running workers is complicated.

Page 20: Processing Big Data in Real-Time - Yanai Franchi, Tikal

20

What Do We Want?

● Horizontal scalability

● Fault-tolerance

● No intermediate message brokers!

● Higher level abstraction than message passing

● “Just works”

● Guaranteed data processing (not in this case)

Page 21: Processing Big Data in Real-Time - Yanai Franchi, Tikal

21

Apache Storm

✔ Horizontal scalability

✔ Fault-tolerance

✔ No intermediate message brokers!

✔ Higher level abstraction than message passing

✔ “Just works”

✔ Guaranteed data processing

Page 22: Processing Big Data in Real-Time - Yanai Franchi, Tikal

22

Anatomy of Storm

Page 23: Processing Big Data in Real-Time - Yanai Franchi, Tikal

23

What is Storm?

● CEP - Open source, distributed realtime computation system.

– Makes it easy to reliably process unbounded streams of tuples

– Doing for realtime processing what Hadoop did for batch processing.

● Fast - 1M tuples/sec per node.

– It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Page 24: Processing Big Data in Real-Time - Yanai Franchi, Tikal

24

Streams

Tuple Tuple Tuple Tuple Tuple Tuple

Unbounded sequence of tuples

Page 25: Processing Big Data in Real-Time - Yanai Franchi, Tikal

25

Spouts

Tuple Tuple

Sources of Streams

Tuple Tuple

Page 26: Processing Big Data in Real-Time - Yanai Franchi, Tikal

26

Bolts

Tuple

TupleTuple

Processes input streams and produces new streams

TupleTupleTupleTuple

Tuple TupleTuple

Page 27: Processing Big Data in Real-Time - Yanai Franchi, Tikal

27

Storm Topology

Network of spouts and bolts

Tuple

TupleTuple

TupleTuple TupleTupleTuple TupleTupleTuple

Tuple

Tuple

TupleTuple TupleTupleTuple

Page 28: Processing Big Data in Real-Time - Yanai Franchi, Tikal

28

Guarantee for Processing

● Storm guarantees the full processing of a tuple by tracking its state

● In case of failure, Storm can re-process it.

● Source tuples with full “acked” trees are removed from the system
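As a rough illustration of how this tracking can stay cheap: Storm's documentation describes acker tasks that keep a single 64-bit value per source tuple and XOR into it the id of every tuple created or acked in the tree, so the value returns to zero once the tree is fully acked. The sketch below models only that idea in plain Java. The class and method names are hypothetical (not Storm API), and the small ids stand in for the random 64-bit ids Storm assigns.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of ack tracking: one long value per source (root) tuple.
// Hypothetical names; this is not part of the Storm API.
final class AckTrackerSketch {
    private final Map<Long, Long> ackValues = new HashMap<>();

    // The spout emitted a source tuple: start tracking its tree.
    void track(long rootId, long sourceTupleId) {
        ackValues.put(rootId, sourceTupleId);
    }

    // A bolt acked `ackedId` after emitting `emittedIds` anchored to it.
    // Returns true once every tuple in the tree has been acked.
    boolean ack(long rootId, long ackedId, long... emittedIds) {
        Long current = ackValues.get(rootId);
        if (current == null) {
            return false; // unknown or already completed tree
        }
        long value = current ^ ackedId;
        for (long id : emittedIds) {
            value ^= id; // each new tuple must be acked later to cancel out
        }
        if (value == 0L) {
            ackValues.remove(rootId); // fully acked: remove from the system
            return true;
        }
        ackValues.put(rootId, value);
        return false;
    }

    boolean isPending(long rootId) {
        return ackValues.containsKey(rootId);
    }
}
```

Each id is XORed in twice (once when created, once when acked), which is why a completed tree cancels to zero regardless of the order in which the acks arrive.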

Page 29: Processing Big Data in Real-Time - Yanai Franchi, Tikal

29

Tasks (Bolt/Spout Instance)

Spouts and bolts execute as many tasks across the cluster

Page 30: Processing Big Data in Real-Time - Yanai Franchi, Tikal

30

Stream Grouping

When a tuple is emitted, which task (instance) does it go to?

Page 31: Processing Big Data in Real-Time - Yanai Franchi, Tikal

31

Stream Grouping

● Shuffle grouping: pick a random task

● Fields grouping: hash on a subset of tuple fields, so tuples with equal values go to the same task

● All grouping: send to all tasks

● Global grouping: pick task with lowest id
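The grouping rules above can be sketched as plain functions that map a tuple to a task index. This is a simplified model under stated assumptions: the class and method names are hypothetical, and the hash-modulo rule only approximates how a fields grouping keeps equal field values on the same task.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not Storm API: picks a target task for a tuple.
final class GroupingSketch {

    // Shuffle grouping: any task, chosen at random.
    static int shuffleGroupingTask(int numTasks) {
        return ThreadLocalRandom.current().nextInt(numTasks);
    }

    // Fields grouping (simplified): hash the selected field values and
    // take the result modulo the number of tasks.
    static int fieldsGroupingTask(List<?> groupingFieldValues, int numTasks) {
        return Math.floorMod(groupingFieldValues.hashCode(), numTasks);
    }

    // Global grouping: always the task with the lowest id.
    static int globalGroupingTask(int[] taskIds) {
        int lowest = taskIds[0];
        for (int id : taskIds) {
            lowest = Math.min(lowest, id);
        }
        return lowest;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        // The same city always lands on the same task index.
        System.out.println(fieldsGroupingTask(Arrays.asList("Barcelona"), numTasks));
        System.out.println(fieldsGroupingTask(Arrays.asList("Barcelona"), numTasks));
        System.out.println(globalGroupingTask(new int[] {5, 3, 9}));
    }
}
```

Grouping by city in this way is what lets a single task accumulate all check-ins for a given city.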

Page 32: Processing Big Data in Real-Time - Yanai Franchi, Tikal

32

Tasks, Executors, Workers

[Diagram: a worker process is a JVM that runs executor threads. Each executor thread runs one or more tasks, and each task is an instance of a spout or bolt.]

Page 33: Processing Big Data in Real-Time - Yanai Franchi, Tikal

33

[Diagram: two nodes, each running a Supervisor and a worker process. Within each worker process, separate executors run Spout A, two Bolt B tasks, and two Bolt C tasks.]

Page 34: Processing Big Data in Real-Time - Yanai Franchi, Tikal

34

Storm Architecture

[Diagram: the Heat-Map topology is uploaded to, or rebalanced through, Nimbus. Nimbus and the Supervisors coordinate through the ZooKeeper nodes.]

Nimbus: master node (similar to Hadoop JobTracker). NOT critical for a running topology.

Page 35: Processing Big Data in Real-Time - Yanai Franchi, Tikal

35

Storm Architecture

[Diagram: same cluster layout with Nimbus, ZooKeeper, and the Supervisors.]

ZooKeeper: used for cluster coordination. A few nodes.

Page 36: Processing Big Data in Real-Time - Yanai Franchi, Tikal

36

Storm Architecture

[Diagram: same cluster layout with Nimbus, ZooKeeper, and the Supervisors.]

Supervisors: run worker processes.

Page 37: Processing Big Data in Real-Time - Yanai Franchi, Tikal

37

Assembling Heatmap Topology

Page 38: Processing Big Data in Real-Time - Yanai Franchi, Tikal

38

HeatMap Input/Output Tuples

● Input Tuples: Timestamp and Text Address:

– (9:00:07 PM, “287 Hudson St New York NY 10013”)

● Output Tuple: Time interval, and a list of points for it:

– (9:00:00 PM to 9:00:15 PM, List((40.719,-73.987),(40.726,-74.001),(40.719,-73.987)))
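To make the mapping from an input timestamp to an output interval concrete, here is a minimal sketch of the bucketing arithmetic. It assumes timestamps in epoch milliseconds and the 15-second interval used in the example above; the class and method names are hypothetical.

```java
// Hypothetical helper: maps a check-in timestamp to its time interval.
final class IntervalSketch {
    // Assumption: 15-second intervals, as in the 9:00:00 to 9:00:15 example.
    static final long INTERVAL_MILLIS = 15_000L;

    // Start of the interval that contains the timestamp (epoch millis).
    static long intervalStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, INTERVAL_MILLIS);
    }

    // End of that interval (exclusive).
    static long intervalEnd(long timestampMillis) {
        return intervalStart(timestampMillis) + INTERVAL_MILLIS;
    }
}
```

A check-in arriving 7 seconds into a minute therefore falls into the interval that starts at second 0 and ends at second 15 of that minute.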

Page 39: Processing Big Data in Real-Time - Yanai Franchi, Tikal

39

Heat Map Storm Topology

Checkins Spout → Geocode Lookup Bolt → Heatmap Builder Bolt → Persistor Bolt

Checkins Spout emits: (9:01 PM @ 287 Hudson st)

Geocode Lookup Bolt emits: (9:01 PM, (40.736, -74.354))

Heatmap Builder Bolt emits, upon elapsed interval: (9:00 PM – 9:15 PM, List((40.73, -74.34), (51.36, -83.33), (69.73, -34.24)))

Page 40: Processing Big Data in Real-Time - Yanai Franchi, Tikal

40

Checkins Spout

// Storm classes live in org.apache.storm.* (backtype.storm.* before Storm 1.0);
// IOUtils is org.apache.commons.io.IOUtils.
public class CheckinsSpout extends BaseRichSpout {

    // We hold state; no need for thread safety
    private List<String> sampleLocations;
    private int nextEmitIndex;
    private SpoutOutputCollector outputCollector;

    @Override
    public void open(Map map, TopologyContext topologyContext,
                     SpoutOutputCollector spoutOutputCollector) {
        this.outputCollector = spoutOutputCollector;
        this.nextEmitIndex = 0;
        try {
            sampleLocations = IOUtils.readLines(
                ClassLoader.getSystemResourceAsStream("sample-locations.txt"));
        } catch (Exception e) {
            throw new RuntimeException("Could not read sample locations", e);
        }
    }

    // Called iteratively by Storm
    @Override
    public void nextTuple() {
        String address = sampleLocations.get(nextEmitIndex);
        String checkin = new Date().getTime() + "@ADDRESS:" + address;
        outputCollector.emit(new Values(checkin));
        // Wrap around so the sample file is replayed forever
        nextEmitIndex = (nextEmitIndex + 1) % sampleLocations.size();
    }

    // Declare output fields
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("str"));
    }
}

Page 41: Processing Big Data in Real-Time - Yanai Franchi, Tikal

41

Geocode Lookup Bolt

public class GeocodeLookupBolt extends BaseBasicBolt {

    private LocatorService locatorService;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        locatorService = new GoogleLocatorService();
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        // Input format: "<epoch millis>@ADDRESS:<text address>"
        String str = tuple.getStringByField("str");
        // Limit of 2 keeps any '@' inside the address intact
        String[] parts = str.split("@", 2);
        Long time = Long.valueOf(parts[0]);
        // Note: the address still carries the "ADDRESS:" prefix added by the spout
        String address = parts[1];

        // Get geocode, create DTO
        LocationDTO locationDTO = locatorService.getLocation(address);
        String city = locationDTO.getCity();
        outputCollector.emit(new Values(city, time, locationDTO));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer fieldsDeclarer) {
        fieldsDeclarer.declare(new Fields("city", "time", "location"));
    }
}

Page 42: Processing Big Data in Real-Time - Yanai Franchi, Tikal

42

Tick Tuple – Repeating Mantra

Page 43: Processing Big Data in Real-Time - Yanai Franchi, Tikal

43

Two Streams to Heat-Map Builder

On tick tuple, we flush our Heat-Map

Checkin 1 Checkin 4 Checkin 5 Checkin 6

HeatMap-Builder Bolt

Page 44: Processing Big Data in Real-Time - Yanai Franchi, Tikal

44

Tick Tuple in Action

public class HeatMapBuilderBolt extends BaseBasicBolt {

    // Hold latest intervals
    private Map<String, List<LocationDTO>> heatmaps;

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60); // Tick interval
        return conf;
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        if (isTickTuple(tuple)) {
            // Emit accumulated intervals
        } else {
            // Add check-in info to the current interval in the Map
        }
    }

    // Tick tuples come from the system component on the system tick stream
    private boolean isTickTuple(Tuple tuple) {
        return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
            && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("time-interval", "city", "locationsList"));
    }
}
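The two placeholder branches of the bolt (accumulate on a check-in, flush on a tick) can be sketched without any Storm dependency. This is only one possible shape, under assumptions: points are grouped by city alone, a point is a {lat, lng} pair instead of the LocationDTO used on the slide, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the accumulate/flush logic of the heat-map builder.
final class HeatMapAccumulatorSketch {
    // City -> points collected since the last flush.
    private final Map<String, List<double[]>> pointsByCity = new HashMap<>();

    // Non-tick branch: add a check-in to the current interval.
    void addCheckin(String city, double lat, double lng) {
        pointsByCity.computeIfAbsent(city, k -> new ArrayList<>())
                    .add(new double[] {lat, lng});
    }

    // Tick branch: hand back everything accumulated and start a new interval.
    Map<String, List<double[]>> flush() {
        Map<String, List<double[]>> snapshot = new HashMap<>(pointsByCity);
        pointsByCity.clear();
        return snapshot;
    }
}
```

In a bolt, each entry of the flushed snapshot would become one emitted tuple; since each task runs on a single executor thread, no locking is needed around the map.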

Page 45: Processing Big Data in Real-Time - Yanai Franchi, Tikal

45

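Persistor Bolt

public class PersistorBolt extends BaseBasicBolt {

    private Jedis jedis;
    private ObjectMapper objectMapper;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        jedis = new Jedis("localhost"); // assumption: Redis runs on localhost
        objectMapper = new ObjectMapper();
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        Long timeInterval = tuple.getLongByField("time-interval");
        String city = tuple.getStringByField("city");
        String locationsList;
        try {
            locationsList = objectMapper.writeValueAsString(
                tuple.getValueByField("locationsList"));
        } catch (Exception e) {
            throw new RuntimeException("Could not serialize locations", e);
        }

        String dbKey = "checkins-" + timeInterval + "@" + city;

        // Persist in Redis for 24h
        jedis.setex(dbKey, 3600 * 24, locationsList);

        // Publish in Redis channel for debugging
        jedis.publish("location-key", dbKey);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt emits no tuples
    }
}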

Page 46: Processing Big Data in Real-Time - Yanai Franchi, Tikal

46

Transforming the Tuples

[Diagram: the Checkins Spout reads text addresses from the sample checkins file (Check-in #1, Check-in #2, ...). Tuples flow by shuffle grouping to the Geocode Lookup Bolt, which gets the geo location from the Geo Location Service; then by fields grouping on city (group by city) to the Heatmap Builder Bolt; then by shuffle grouping to the Persistor Bolt, which writes to the database.]

Page 47: Processing Big Data in Real-Time - Yanai Franchi, Tikal

47

Heat Map Topology

public class LocalTopologyRunner {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = buildTopology();
        StormSubmitter.submitTopology(
            "local-heatmap", new Config(), builder.createTopology());
    }

    private static TopologyBuilder buildTopology() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("checkins", new CheckinsSpout());

        builder.setBolt("geocode-lookup", new GeocodeLookupBolt())
            .shuffleGrouping("checkins");

        // Group by city so one task sees all check-ins of a city
        builder.setBolt("heatmap-builder", new HeatMapBuilderBolt())
            .fieldsGrouping("geocode-lookup", new Fields("city"));

        builder.setBolt("persistor", new PersistorBolt())
            .shuffleGrouping("heatmap-builder");

        return builder;
    }
}

Page 48: Processing Big Data in Real-Time - Yanai Franchi, Tikal

48

It's NOT Scaled

Page 49: Processing Big Data in Real-Time - Yanai Franchi, Tikal

49

Page 50: Processing Big Data in Real-Time - Yanai Franchi, Tikal

50

Scaling the Topology

public class LocalTopologyRunner {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = buildTopology();
        Config conf = new Config();
        conf.setNumWorkers(2); // Set no. of workers
        StormSubmitter.submitTopology(
            "local-heatmap", conf, builder.createTopology());
    }

    private static TopologyBuilder buildTopology() {
        TopologyBuilder builder = new TopologyBuilder();
        // The last argument of setSpout/setBolt is the parallelism hint
        builder.setSpout("checkins", new CheckinsSpout(), 4);

        builder.setBolt("geocode-lookup", new GeocodeLookupBolt(), 8)
            .shuffleGrouping("checkins")
            .setNumTasks(64); // Increase tasks for future

        builder.setBolt("heatmap-builder", new HeatMapBuilderBolt(), 4)
            .fieldsGrouping("geocode-lookup", new Fields("city"));

        builder.setBolt("persistor", new PersistorBolt(), 2)
            .shuffleGrouping("heatmap-builder")
            .setNumTasks(4); // Increase tasks for future

        return builder;
    }
}
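A quick way to read the numbers in this configuration: the parallelism hint sets the number of executors (threads) for a component, and setNumTasks sets how many task instances those executors share. The arithmetic below is a sketch with hypothetical helper names; it ignores any system executors Storm adds on its own.

```java
// Hypothetical helper: arithmetic behind parallelism hints and task counts.
final class ParallelismSketch {

    // Largest number of tasks a single executor has to run.
    static int maxTasksPerExecutor(int numTasks, int executors) {
        return (numTasks + executors - 1) / executors; // ceiling division
    }

    // Sum of the parallelism hints of all components.
    static int totalExecutors(int... parallelismHints) {
        int total = 0;
        for (int hint : parallelismHints) {
            total += hint;
        }
        return total;
    }

    public static void main(String[] args) {
        // geocode-lookup: 64 tasks on 8 executors
        System.out.println(maxTasksPerExecutor(64, 8));
        // persistor: 4 tasks on 2 executors
        System.out.println(maxTasksPerExecutor(4, 2));
        // spout (4) + geocode-lookup (8) + heatmap-builder (4) + persistor (2)
        System.out.println(totalExecutors(4, 8, 4, 2));
    }
}
```

Setting more tasks than executors leaves room to raise the executor count later with a rebalance, without redeploying the topology.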

Page 51: Processing Big Data in Real-Time - Yanai Franchi, Tikal

51

Recap – Plan A

[Diagram: the Storm Heat-Map Topology reads text addresses from the sample checkins file (Check-in #1, Check-in #2, ...), GETs geo locations from the Geo Location Service, and persists checkin intervals to the database.]

Page 52: Processing Big Data in Real-Time - Yanai Franchi, Tikal

52

We have something working

Page 53: Processing Big Data in Real-Time - Yanai Franchi, Tikal

53

Add Kafka Messaging

Page 54: Processing Big Data in Real-Time - Yanai Franchi, Tikal

54

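Plan B - Kafka Spout & Bolt to HeatMap

[Diagram: checkins are published to the checkin Kafka topic. The Kafka Checkins Spout reads text addresses from that topic and feeds the Geocode Lookup Bolt (which calls the Geo Location Service), the Heatmap Builder Bolt, and the Persistor Bolt (which writes to the database). A Kafka Locations Bolt then publishes to the locations topic.]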

Page 55: Processing Big Data in Real-Time - Yanai Franchi, Tikal

55

Page 56: Processing Big Data in Real-Time - Yanai Franchi, Tikal

56

They are all good, but not for all use-cases

Page 57: Processing Big Data in Real-Time - Yanai Franchi, Tikal

57

Kafka

A little introduction

Page 58: Processing Big Data in Real-Time - Yanai Franchi, Tikal

58

Page 59: Processing Big Data in Real-Time - Yanai Franchi, Tikal

59

Page 60: Processing Big Data in Real-Time - Yanai Franchi, Tikal

60

Page 61: Processing Big Data in Real-Time - Yanai Franchi, Tikal

61

Pub-Sub Messaging System

Page 62: Processing Big Data in Real-Time - Yanai Franchi, Tikal

62

Page 63: Processing Big Data in Real-Time - Yanai Franchi, Tikal

63

Page 64: Processing Big Data in Real-Time - Yanai Franchi, Tikal

64

Page 65: Processing Big Data in Real-Time - Yanai Franchi, Tikal

65

Page 66: Processing Big Data in Real-Time - Yanai Franchi, Tikal

66

Stateless Broker & Doesn't Fear the File System
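One way to picture a stateless broker: the broker only appends messages to a log and serves reads by offset, while each consumer remembers its own position. The sketch below is an in-memory model with hypothetical names, not Kafka's implementation; a real broker appends to files on disk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of one partition log.
final class PartitionLogSketch {
    private final List<String> messages = new ArrayList<>();

    // Append a message and return its offset in the log.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1L;
    }

    // Read up to maxMessages starting at the given offset.
    // The log keeps no per-consumer state: the caller supplies the offset.
    List<String> read(long offset, int maxMessages) {
        int from = (int) Math.max(0L, Math.min(offset, messages.size()));
        int to = Math.min(from + maxMessages, messages.size());
        return new ArrayList<>(messages.subList(from, to));
    }
}
```

Because reading does not remove anything, a consumer can replay older messages simply by asking again with a smaller offset.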

Page 67: Processing Big Data in Real-Time - Yanai Franchi, Tikal

67

Page 68: Processing Big Data in Real-Time - Yanai Franchi, Tikal

68

Page 69: Processing Big Data in Real-Time - Yanai Franchi, Tikal

69

Page 70: Processing Big Data in Real-Time - Yanai Franchi, Tikal

70

Topics

● Logical collections of partitions (the physical files).

● A broker contains some of the partitions for a topic

Page 71: Processing Big Data in Real-Time - Yanai Franchi, Tikal

71

A Partition is Consumed by Exactly One Consumer in Each Group
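The rule above can be illustrated with a small assignment function: every partition gets exactly one owner within a consumer group, and a consumer may own several partitions. The round-robin strategy and all names here are assumptions for illustration, not Kafka's actual assignment code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical round-robin assignment of partitions within one consumer group.
final class GroupAssignmentSketch {

    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("group has no consumers");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        // Each partition is handed to exactly one consumer of the group.
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }
}
```

When a consumer leaves the group, running the assignment again over the remaining consumers hands its partitions to the survivors; a consumer beyond the partition count would simply receive nothing.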

Page 72: Processing Big Data in Real-Time - Yanai Franchi, Tikal

72

Distributed & Fault-Tolerant

Page 73: Processing Big Data in Real-Time - Yanai Franchi, Tikal

73

Broker 1 Broker 2 Broker 3

Zoo Keeper

Consumer 1 Consumer 2

Producer 1 Producer 2

Page 74: Processing Big Data in Real-Time - Yanai Franchi, Tikal

74

Broker 1 Broker 2 Broker 3 Broker 4

Zoo Keeper

Consumer 1 Consumer 2

Producer 1 Producer 2

Page 75: Processing Big Data in Real-Time - Yanai Franchi, Tikal

75

Broker 1 Broker 2 Broker 3 Broker 4

Zoo Keeper

Consumer 1 Consumer 2

Producer 1 Producer 2

Page 76: Processing Big Data in Real-Time - Yanai Franchi, Tikal

76

Broker 1 Broker 2 Broker 3 Broker 4

Zoo Keeper

Consumer 1 Consumer 2

Producer 1 Producer 2

Page 77: Processing Big Data in Real-Time - Yanai Franchi, Tikal

77

Broker 1 Broker 2 Broker 3 Broker 4

Zoo Keeper

Consumer 1 Consumer 2

Producer 1 Producer 2

Page 78: Processing Big Data in Real-Time - Yanai Franchi, Tikal

78

Broker 1 Broker 2 Broker 3 Broker 4

Zoo Keeper

Consumer 1 Consumer 2

Producer 1 Producer 2

Page 79: Processing Big Data in Real-Time - Yanai Franchi, Tikal

79

Broker 1 Broker 2 Broker 3 Broker 4

Zoo Keeper

Consumer 1 Consumer 2

Producer 1 Producer 2

Page 80: Processing Big Data in Real-Time - Yanai Franchi, Tikal

80

Broker 1 Broker 2 Broker 3

Zoo Keeper

Consumer 1 Consumer 2

Producer 1 Producer 2

Page 81: Processing Big Data in Real-Time - Yanai Franchi, Tikal

81

Broker 1 Broker 2 Broker 3

Zoo Keeper

Consumer 1 Consumer 2

Producer 1 Producer 2

Page 82: Processing Big Data in Real-Time - Yanai Franchi, Tikal

82

Broker 1 Broker 2 Broker 3

Zoo Keeper

Consumer 1 Consumer 2

Producer 1 Producer 2

Page 83: Processing Big Data in Real-Time - Yanai Franchi, Tikal

83

Broker 1 Broker 2 Broker 3

Zoo Keeper

Consumer 1

Producer 1 Producer 2

Page 84: Processing Big Data in Real-Time - Yanai Franchi, Tikal

84

Broker 1 Broker 2 Broker 3

Zoo Keeper

Consumer 1

Producer 1 Producer 2

Page 85: Processing Big Data in Real-Time - Yanai Franchi, Tikal

85

Broker 1 Broker 2 Broker 3

Zoo Keeper

Consumer 1

Producer 1 Producer 2

Page 86: Processing Big Data in Real-Time - Yanai Franchi, Tikal

86

Performance Benchmark

3 Brokers

3 Producers

3 Consumers

Cheap Machines

Page 87: Processing Big Data in Real-Time - Yanai Franchi, Tikal

• “Up to 2 million writes/sec on 3 cheap machines”

• Using 3 producers on 3 different machines, with 3x async replication

• Only 1 producer per machine, because the NIC is already saturated

• End-to-end latency is about 10 ms for 99.9% of messages

• Throughput is sustained as stored data grows

Page 88: Processing Big Data in Real-Time - Yanai Franchi, Tikal

88

Add Kafka to our Topology

public class LocalTopologyRunner {
    ...

    private static TopologyBuilder buildTopology() {
        ...
        // Kafka Spout
        builder.setSpout("checkins", new KafkaSpout(kafkaConfig), 4);
        ...
        // Kafka Bolt
        builder.setBolt("kafkaProducer",
                new KafkaOutputBolt("localhost:9092",
                    "kafka.serializer.StringEncoder",
                    "locations-topic"))
            .shuffleGrouping("persistor");

        return builder;
    }
}

Page 89: Processing Big Data in Real-Time - Yanai Franchi, Tikal

89

[Diagram: the Checkin HTTP Reactor receives text addresses and publishes checkins to the checkin Kafka topic. The Storm Heat-Map Topology consumes the checkins, GETs geo locations from the Geo Location Service, persists checkin intervals to the database, and publishes the interval key to the locations Kafka topic.]

Page 90: Processing Big Data in Real-Time - Yanai Franchi, Tikal

90

Demo

Page 91: Processing Big Data in Real-Time - Yanai Franchi, Tikal

91

Summary

When You Go Out to a Salsa Club...

● Good Music

● Crowded

Page 92: Processing Big Data in Real-Time - Yanai Franchi, Tikal

92

More Conclusions...

● BigData – Also refers to Velocity of data (not only Volume of data)

● Storm – Great for real-time BigData processing. Complementary to Hadoop batch jobs.

● Kafka – Great messaging for logs/events data; serves as a good “source” for a Storm spout

Page 93: Processing Big Data in Real-Time - Yanai Franchi, Tikal

93

Thanks