Getting Started with Real-Time Analytics
Shawn Gandhi, Solutions Architect
@shawnagram
@edwardfagin
Aug 14, 2015
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.
{
"payerId": "Joe",
"productCode": "AmazonS3",
"clientProductCode": "AmazonS3",
"usageType": "Bandwidth",
"operation": "PUT",
"value": "22490",
"timestamp": "1216674828"
}
Metering Record
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700]
"GET /apache_pb.gif HTTP/1.0" 200 2326
Common Log Entry
<165>1 2003-10-11T22:14:15.003Z
mymachine.example.com evntslog - ID47
[exampleSDID@32473 iut="3"
eventSource="Application"
eventID="1011"][examplePriority@32473
class="high"]
Syslog Entry
"SeattlePublicWater/Kinesis/123/Realtime" – 412309129140
MQTT Record
<R,AMZN ,T,G,R1>
NASDAQ OMX Record
[Chart comparing generated data with data available for analysis over time]
Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
Abraham Wald (1902-1950)
Big data: best served fresh
Big data
• Hourly server logs: Were your systems misbehaving one hour ago?
• Weekly/monthly bill: What did you spend this billing cycle?
• Daily customer-preferences report from your website's clickstream: What deal or ad should you try next time?
• Daily fraud reports: Was there fraud yesterday?
Real-time big data
• Amazon CloudWatch metrics: What went wrong now?
• Real-time spending alerts/caps: Prevent overspending now.
• Real-time analysis: What to offer the current customer now?
• Real-time detection: Block fraudulent use now.
Talk outline
• A tour of Amazon Kinesis concepts in the context of a Twitter Trends service
• Implementing the Twitter Trends service
• Amazon Kinesis in the broader context of a big data ecosystem
• Q&A
Sending and reading data from Amazon Kinesis streams

Sending:
• HTTP POST
• AWS SDK
• Log4j
• Flume
• Fluentd

Reading:
• Get* APIs
• Amazon Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
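For the sending side, here is a minimal sketch using the AWS SDK for Java; the stream name, partition key, and payload below are illustrative, not from the talk:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class TweetProducer {
    public static void main(String[] args) {
        AmazonKinesisClient kinesis = new AmazonKinesisClient();

        PutRecordRequest request = new PutRecordRequest()
                .withStreamName("twitter-firehose")   // assumed stream name
                .withPartitionKey("#Seahawks")        // records with the same key land on the same shard
                .withData(ByteBuffer.wrap(
                        "Great game! #Seahawks".getBytes(StandardCharsets.UTF_8)));

        PutRecordResult result = kinesis.putRecord(request);
        System.out.println("Stored with sequence number: " + result.getSequenceNumber());
    }
}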
A tour of Amazon Kinesis concepts in the
context of a Twitter Trends service
bit.ly/KinesisDemo
twitter-trends.com
[Diagram: the twitter-trends.com website running on AWS Elastic Beanstalk]

Too big to handle on one box

The solution: streaming Map/Reduce
[Diagram: each worker computes a local top-10; the local top-10 lists combine into a global top-10]
Core concepts
[Diagram: the streaming Map/Reduce design labeled with Kinesis terms: data record, stream, partition key, shard (e.g., records with sequence numbers 14, 17, 18, 21, 23), worker, sequence number]

How this relates to Amazon Kinesis
[Diagram: the stream and its shards are managed by Amazon Kinesis; the workers make up an Amazon Kinesis application]
Core concepts recapped
• Data record - Tweet
• Stream - All tweets (the Twitter Firehose)
• Partition key - Twitter topic (every tweet record belongs to exactly one of these)
• Shard - All the data records belonging to a set of Twitter topics that will get grouped together
• Sequence number - Each data record gets one assigned when first ingested
• Worker - Processes the records of a shard in sequence number order
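To make "partition key determines shard" concrete: Kinesis MD5-hashes each record's partition key into a 128-bit number, and each shard owns a contiguous range of that hash space. A minimal sketch of the idea; the two-shard split and the hashtag are illustrative:

import java.math.BigInteger;
import java.security.MessageDigest;

public class ShardForKey {
    public static void main(String[] args) throws Exception {
        // MD5-hash the partition key into a positive 128-bit integer.
        BigInteger hash = new BigInteger(1,
                MessageDigest.getInstance("MD5").digest("#Seahawks".getBytes("UTF-8")));

        // With 2 shards, the 2^128 hash space is split in half.
        BigInteger half = BigInteger.ONE.shiftLeft(127);
        String shard = hash.compareTo(half) < 0 ? "shard-0" : "shard-1";
        System.out.println("#Seahawks -> " + shard);
    }
}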
Using the Amazon Kinesis API directly

iterator = getShardIterator(shardId, LATEST);
while (true) {
  [records, iterator] =
    getNextRecords(iterator, maxRecsToReturn);
  process(records);
}

process(records): {
  for (record in records) {
    updateLocalTop10(record);
  }
  if (timeToDoOutput()) {
    writeLocalTop10ToDDB();
  }
}

while (true) {
  localTop10Lists = scanDDBTable();
  updateGlobalTop10List(localTop10Lists);
  sleep(10);
}
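For reference, the pseudocode above maps roughly onto the 2015-era AWS SDK for Java calls (GetShardIterator, then GetRecords in a loop); the stream and shard names here are assumptions:

import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.GetRecordsRequest;
import com.amazonaws.services.kinesis.model.GetRecordsResult;
import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
import com.amazonaws.services.kinesis.model.Record;

public class DirectShardReader {
    public static void main(String[] args) throws InterruptedException {
        AmazonKinesisClient kinesis = new AmazonKinesisClient();

        // Start reading at the tip of the shard (LATEST), as in the pseudocode.
        String iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
                .withStreamName("twitter-firehose")      // assumed stream name
                .withShardId("shardId-000000000000")     // assumed shard ID
                .withShardIteratorType("LATEST"))
                .getShardIterator();

        while (iterator != null) {
            GetRecordsResult result = kinesis.getRecords(new GetRecordsRequest()
                    .withShardIterator(iterator)
                    .withLimit(1000));                   // maxRecsToReturn
            for (Record r : result.getRecords()) {
                String tweet = new String(r.getData().array(), StandardCharsets.UTF_8);
                // updateLocalTop10(tweet) would go here
            }
            iterator = result.getNextShardIterator();
            Thread.sleep(1000);                          // respect per-shard read limits
        }
    }
}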
Challenges with using the Amazon Kinesis API directly
• Manual creation of workers and assignment to shards
• How many workers per EC2 instance?
• How many EC2 instances?
Using the Amazon Kinesis Client Library
[Diagram: the Kinesis application's KCL workers coordinate shard assignments through a shard management table]
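With the KCL, the application shrinks to handing a record-processor factory to a Worker, which then handles shard leases, load balancing, and the shard management table. A minimal bootstrap sketch against the v1-era KCL; the application and stream names are illustrative:

import java.util.UUID;

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorFactory;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

public class TwitterTrendsApp {
    public static void main(String[] args) {
        KinesisClientLibConfiguration config = new KinesisClientLibConfiguration(
                "twitter-trends",                       // application (and lease table) name
                "twitter-firehose",                     // assumed stream name
                new DefaultAWSCredentialsProviderChain(),
                UUID.randomUUID().toString());          // unique worker ID

        // The factory creates one processor per assigned shard.
        IRecordProcessorFactory factory = TwitterTrendsShardProcessor::new;
        new Worker(factory, config).run();              // blocks, processing assigned shards
    }
}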
Elastic scale and load balancing
[Diagram: workers run in an Auto Scaling group driven by Amazon CloudWatch; shard leases rebalance through the shard management table]
Shard management
[Diagram: the shard management table, Auto Scaling group, Amazon CloudWatch, Amazon SNS, and the AWS Console cooperating to manage shards]
The challenges of fault tolerance
[Diagram: failed workers and hosts (marked X) leave their shards unprocessed]
Fault tolerance support in KCL
[Diagram: when a worker in Availability Zone 1 fails, the shard management table lets a worker in Availability Zone 3 take over its shards]
Checkpoint, replay design pattern
[Diagram: Shard-i holds records with sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23. The shard management table maps each shard ID to its lock holder (Host A, taken over by Host B) and last checkpointed sequence number; a state table maps each shard ID to its local top-10, e.g. {#Obama: 10235, #Seahawks: 9835, …}. After failover, the new host replays the shard from the last checkpoint to rebuild the local top-10.]
Catching up from delayed processing
[Diagram: Host A fails partway through Shard-i; Host B resumes from the last checkpoint and works through the backlog]
How long until we've caught up?
Shard throughput SLA:
- 1 MB/sec in
- 2 MB/sec out
=> Catch-up time ~ down time
(While catching up you can read at 2 MB/sec while new data keeps arriving at 1 MB/sec, so the backlog drains at 1 MB/sec: each minute of downtime costs about a minute of catch-up.)
Unless your application can't process 2 MB/sec on a single host
=> Provision more shards
Concurrent processing applications
[Diagram: twitter-trends.com and spot-the-revolution.org read the same shard concurrently, each tracking its own position in the stream]
Why not combine applications?
[Diagram: twitter-trends.com outputs state and checkpoints every 10 seconds; twitter-stats.com outputs state and checkpoints every hour]
Implementing twitter-trends.com

Creating an Amazon Kinesis stream
How many shards?
Additional considerations:
- How much processing is needed per data record/data byte?
- How quickly do you need to catch up if one or more of your workers stalls or fails?
You can always dynamically reshard to adjust to changing workloads.
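Creating the stream itself is a single SDK call; a minimal sketch, with an illustrative stream name and shard count:

import com.amazonaws.services.kinesis.AmazonKinesisClient;

public class CreateStream {
    public static void main(String[] args) {
        AmazonKinesisClient kinesis = new AmazonKinesisClient();
        // 2 shards => up to 2 MB/sec (or 2,000 records/sec) in and 4 MB/sec out.
        kinesis.createStream("twitter-firehose", 2);
    }
}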
Twitter Trends shard processing code
class TwitterTrendsShardProcessor implements IRecordProcessor {
    public TwitterTrendsShardProcessor() { … }

    @Override
    public void initialize(String shardId) { … }

    @Override
    public void processRecords(List<Record> records,
            IRecordProcessorCheckpointer checkpointer) { … }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer,
            ShutdownReason reason) { … }
}
Twitter Trends shard processing code
class TwitterTrendsShardProcessor implements IRecordProcessor {
    private Map<String, AtomicInteger> hashCount = new HashMap<>();
    private long tweetsProcessed = 0;

    @Override
    public void processRecords(List<Record> records,
            IRecordProcessorCheckpointer checkpointer) {
        computeLocalTop10(records);
        tweetsProcessed += records.size(); // count tweets, not batches
        if (tweetsProcessed >= 2000) {
            emitToDynamoDB(checkpointer);
        }
    }
}
Twitter Trends shard processing code
private void computeLocalTop10(List<Record> records) {
    for (Record r : records) {
        String tweet = new String(r.getData().array());
        String[] words = tweet.split("[ \t]+"); // split on spaces and tabs
        for (String word : words) {
            if (word.startsWith("#")) {
                if (!hashCount.containsKey(word))
                    hashCount.put(word, new AtomicInteger(0));
                hashCount.get(word).incrementAndGet();
            }
        }
    }
}

private void emitToDynamoDB(IRecordProcessorCheckpointer checkpointer) {
    persistMapToDynamoDB(hashCount);
    try {
        checkpointer.checkpoint();
    } catch (IOException | KinesisClientLibDependencyException | InvalidStateException
            | ThrottlingException | ShutdownException e) {
        // Error handling
    }
    hashCount = new HashMap<>();
    tweetsProcessed = 0;
}
Amazon Kinesis as a gateway into the big data ecosystem

Big data has a lifecycle
[Diagram: the Twitter trends processing application feeds twitter-trends.com, Twitter trends statistics, and a Twitter trends anonymized archive]
Connector framework
public interface IKinesisConnectorPipeline<T> {
    IEmitter<T> getEmitter(KinesisConnectorConfiguration configuration);
    IBuffer<T> getBuffer(KinesisConnectorConfiguration configuration);
    ITransformer<T> getTransformer(KinesisConnectorConfiguration configuration);
    IFilter<T> getFilter(KinesisConnectorConfiguration configuration);
}
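To show how the four pieces fit together, here is a sketch implementing the simplified interface above for the anonymized-archive use case. Tweet, S3TweetEmitter, InMemoryTweetBuffer, JsonTweetTransformer, and AnonymizingTweetFilter are hypothetical names, not classes shipped with the connector library:

// Hypothetical pipeline wiring for an S3 archive connector; every concrete
// class below is a placeholder standing in for a real implementation.
public class TweetArchivePipeline implements IKinesisConnectorPipeline<Tweet> {
    @Override
    public IEmitter<Tweet> getEmitter(KinesisConnectorConfiguration configuration) {
        return new S3TweetEmitter(configuration);      // writes buffered batches to Amazon S3
    }

    @Override
    public IBuffer<Tweet> getBuffer(KinesisConnectorConfiguration configuration) {
        return new InMemoryTweetBuffer(configuration); // accumulates records until size/time limits
    }

    @Override
    public ITransformer<Tweet> getTransformer(KinesisConnectorConfiguration configuration) {
        return new JsonTweetTransformer();             // Kinesis record bytes -> Tweet objects
    }

    @Override
    public IFilter<Tweet> getFilter(KinesisConnectorConfiguration configuration) {
        return new AnonymizingTweetFilter();           // scrub user identifiers before archiving
    }
}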
Example Amazon Redshift connector
[Diagram: Amazon Kinesis -> Amazon S3 -> Amazon Redshift]
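The slide doesn't show the loading step, but the standard mechanics of the S3-to-Redshift hop are: the emitter writes a batch file to S3, then a Redshift COPY loads it. A sketch over JDBC; the cluster endpoint, credentials, table, file, and IAM role below are all hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RedshiftLoader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:redshift://example-cluster.us-east-1.redshift.amazonaws.com:5439/analytics",
                "admin", "password");
             Statement stmt = conn.createStatement()) {
            // Load the batch file the emitter just wrote to S3.
            stmt.execute(
                "COPY tweets FROM 's3://twitter-trends-archive/batch-000123.gz' " +
                "CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy' " +
                "GZIP DELIMITER '\\t'");
        }
    }
}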
Example Amazon Redshift connector v2
[Diagram: Amazon Kinesis -> Amazon S3 -> Amazon Redshift, with a second Amazon Kinesis stream in the pipeline]
MediaMath's Data Revolution with Amazon Kinesis
Edward Fagin, VP Engineering, MediaMath
MediaMath firehose
[Diagram: Bidder Data (wins), Site Events, and Third-Party Segments flow as single events into the Firehose (Amazon Kinesis), which feeds Event Log Files, Real-time Analytics, Decisioning & Optimization, and an Archive Warehouse (Amazon S3)]
Bits & bytes (lots of them)
• 151 billion bid opportunities analyzed per day
• 120 terabytes of data analyzed per day
• 9 POPs + cloud, with thousands of servers across the globe
• 40 ms average response time
• Transactional/financial data (so every record counts)
The problem: batch log shipping
[Diagram: Bidder Data (wins), Site Events, and Third-Party Segments are written to separate Impressions, Event/Click, and Segment log files, then batch-shipped to the warehouse (analytics, decisioning, optimization, archive)]
The solution: Streaming architecture
• Real-time data access
• Record-by-record routing and processing
• Multiple consumers (push and pull)
• Terabyte scale and beyond
MediaMath firehose
[Diagram: reprise of the streaming architecture shown above]
MediaMath firehose
Multiple Amazon Kinesis streams
• Hundreds of shards
• Terabytes of daily data
• Topic-level streams
Seamless integration with existing producers and consumers
Consumers built on the Amazon Kinesis Client Library (KCL)
• KCL-based consumers on (mostly) c3.4xlarge instances
• Connectors to Amazon S3, Amazon EMR, Amazon Kinesis, and more
Opportunities unlocked
• Innovation speed
• Reliability of data pipeline
• Real-time analytics
• Customer value
Your feedback is important to AWS. Please complete the session evaluation. Tell us what you think!
@shawnagram
@edwardfagin