©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Getting Started with Real-Time Analytics Shawn Gandhi, Solutions Architect @shawnagram @edwardfagin
Transcript
Page 1: Getting Started with Real-Time Analytics

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Getting Started with Real-Time Analytics

Shawn Gandhi, Solutions Architect

@shawnagram

@edwardfagin

Page 2: Getting Started with Real-Time Analytics

{

"payerId": "Joe",

"productCode": "AmazonS3",

"clientProductCode": "AmazonS3",

"usageType": "Bandwidth",

"operation": "PUT",

"value": "22490",

"timestamp": "1216674828"

}

Metering Record

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Common Log Entry

<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"]

Syslog Entry

“SeattlePublicWater/Kinesis/123/Realtime” – 412309129140

MQTT Record

<R,AMZN ,T,G,R1>

NASDAQ OMX Record

Page 3: Getting Started with Real-Time Analytics

Generated data

Available for analysis

Data volume

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 4: Getting Started with Real-Time Analytics

Abraham Wald (1902-1950)

Page 5: Getting Started with Real-Time Analytics
Page 6: Getting Started with Real-Time Analytics
Page 7: Getting Started with Real-Time Analytics

Big data: best served fresh

Big data

• Hourly server logs: Were your systems misbehaving one hour ago?

• Weekly/monthly bill: What did you spend this billing cycle?

• Daily customer-preferences report from your website’s clickstream: What deal or ad should you try next time?

• Daily fraud reports: Was there fraud yesterday?

Real-time big data

• Amazon CloudWatch metrics: What went wrong now?

• Real-time spending alerts/caps: Prevent overspending now.

• Real-time analysis: What to offer the current customer now?

• Real-time detection: Block fraudulent use now.

Page 8: Getting Started with Real-Time Analytics

Talk outline

• A tour of Amazon Kinesis concepts in the context of a Twitter Trends service

• Implementing the Twitter Trends service

• Amazon Kinesis in the broader context of a big data ecosystem

• Q&A

Page 9: Getting Started with Real-Time Analytics

Sending and reading data from Amazon Kinesis streams

Sending: HTTP POST, AWS SDK, Log4j, Flume, Fluentd

Reading: Get* APIs, Amazon Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce

Page 10: Getting Started with Real-Time Analytics

A tour of Amazon Kinesis concepts in the context of a Twitter Trends service

Page 11: Getting Started with Real-Time Analytics

bit.ly/KinesisDemo

Page 12: Getting Started with Real-Time Analytics

twitter-trends.com

AWS Elastic Beanstalk

twitter-trends.com

twitter-trends.com website

Page 13: Getting Started with Real-Time Analytics

twitter-trends.com

Too big to handle on one box

Page 14: Getting Started with Real-Time Analytics

twitter-trends.com

The solution: streaming Map/Reduce

My top-10

My top-10

My top-10

Global top-10
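
The local-to-global step on this slide can be sketched in plain Java. This is a minimal illustration, not the demo's actual code: the class name `TopTopics` and its methods are hypothetical, and the counts are sample values. Each worker contributes a map of topic counts, the maps are merged, and the highest counts are kept.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merges per-worker topic counts into a global ranking.
final class TopTopics {
    // Return the n highest-count topics, highest first (ties broken alphabetically).
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    // "Reduce" step: combine per-worker counts into one map, summing any overlaps.
    static Map<String, Integer> merge(List<Map<String, Integer>> localCounts) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> local : localCounts) {
            local.forEach((topic, count) -> global.merge(topic, count, Integer::sum));
        }
        return global;
    }

    public static void main(String[] args) {
        Map<String, Integer> workerA = Map.of("#Obama", 10235, "#Seahawks", 9835);
        Map<String, Integer> workerB = Map.of("#Congress", 9910, "#Rain", 12);
        System.out.println(topN(merge(List.of(workerA, workerB)), 3));
    }
}
```

Merging truncated top-10 lists gives an exact global answer only when every topic is counted by a single worker, which is what partitioning by topic is meant to provide.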

Page 15: Getting Started with Real-Time Analytics

twitter-trends.com

Core concepts

My top-10

My top-10

My top-10

Global top-10

Data record

Stream

Partition key

Shard

Worker

Shard: 14 17 18 21 23

Data record

Sequence number

Page 16: Getting Started with Real-Time Analytics

twitter-trends.com

How this relates to Amazon Kinesis

Amazon Kinesis

Amazon Kinesis application

Page 17: Getting Started with Real-Time Analytics

Core concepts recapped

• Data record - Tweet

• Stream - All tweets (the Twitter Firehose)

• Partition key - Twitter topic (every tweet record belongs to exactly one of these)

• Shard - All the data records belonging to a set of Twitter topics that will get grouped together

• Sequence number - Each data record gets one assigned when first ingested

• Worker - Processes the records of a shard in sequence number order
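
As a rough illustration of how a partition key determines a shard: Amazon Kinesis hashes each partition key with MD5 to a 128-bit integer, and each shard owns a range of that hash space. The sketch below assumes the shards split the space into equal contiguous ranges, which is a simplification; the class name `ShardRouter` and its methods are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: maps a partition key to a shard index.
final class ShardRouter {
    // The hash key space is 128 bits wide: [0, 2^128 - 1].
    private static final BigInteger HASH_SPACE = BigInteger.ONE.shiftLeft(128);

    // Hash a partition key to an unsigned 128-bit integer using MD5.
    static BigInteger hashKey(String partitionKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1 keeps the value non-negative
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Assumption: shardCount shards split the hash space into equal contiguous ranges.
    static int shardFor(String partitionKey, int shardCount) {
        BigInteger rangeSize = HASH_SPACE.divide(BigInteger.valueOf(shardCount));
        int index = hashKey(partitionKey).divide(rangeSize).intValue();
        return Math.min(index, shardCount - 1); // clamp the remainder at the top of the space
    }

    public static void main(String[] args) {
        for (String topic : new String[] {"#Obama", "#Seahawks", "#Congress"}) {
            System.out.println(topic + " -> shard " + shardFor(topic, 3));
        }
    }
}
```

Because the mapping depends only on the key, every tweet for a given topic lands in the same shard and is seen by the same worker.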

Page 18: Getting Started with Real-Time Analytics

twitter-trends.com

Using the Amazon Kinesis API directly

Amazon Kinesis

iterator = getShardIterator(shardId, LATEST);
while (true) {
    [records, iterator] = getNextRecords(iterator, maxRecsToReturn);
    process(records);
}

process(records): {
    for (record in records) {
        updateLocalTop10(record);
    }
    if (timeToDoOutput()) {
        writeLocalTop10ToDDB();
    }
}

while (true) {
    localTop10Lists = scanDDBTable();
    updateGlobalTop10List(localTop10Lists);
    sleep(10);
}

Page 19: Getting Started with Real-Time Analytics

twitter-trends.com

Challenges with using the Amazon Kinesis API directly

Amazon Kinesis

Kinesis application

Manual creation of workers and assignment to shards

How many workers per EC2 instance?

How many EC2 instances?

Page 20: Getting Started with Real-Time Analytics

twitter-trends.com

Using the Amazon Kinesis Client Library

Amazon Kinesis

Kinesis application

Shard mgmt table

Page 21: Getting Started with Real-Time Analytics

twitter-trends.com

Elastic scale and load balancing

Amazon Kinesis

Shard mgmt table

Auto Scaling group

Amazon CloudWatch

Page 22: Getting Started with Real-Time Analytics

twitter-trends.com

Shard management

Amazon Kinesis

Shard mgmt table

Auto Scaling group

Amazon CloudWatch

Amazon SNS

AWS Console

Page 23: Getting Started with Real-Time Analytics

twitter-trends.com

The challenges of fault tolerance

Amazon Kinesis

X X X

Shard mgmt table

Page 24: Getting Started with Real-Time Analytics

twitter-trends.com

Fault tolerance support in KCL

Amazon Kinesis

Shard mgmt table

X

Availability Zone 1

Availability Zone 3

Page 25: Getting Started with Real-Time Analytics

Checkpoint, replay design pattern

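[Diagram: Shard-i in Amazon Kinesis holds records with sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23. A shard management table (columns: Shard ID, Lock, Seq num) records which host (Host A or Host B) holds the lock on Shard-i and the checkpointed sequence number. A second table (columns: Shard ID, Local top-10) stores the shard's local top-10, for example {#Obama: 10235, #Seahawks: 9835, …} and {#Obama: 10235, #Congress: 9910, …}. Host A is marked failed (X) and Host B takes over the shard.]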
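
The checkpoint-and-replay pattern can be simulated without any AWS dependency. The sketch below is an illustration under stated assumptions, not how the KCL is implemented: the class `CheckpointReplay`, its static fields (standing in for the durable tables), and the "count records" application state are all hypothetical. A host saves its state and then its checkpoint every few records; after a crash, the next host reloads the saved state and replays from the checkpoint.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of checkpoint and replay on a single shard.
final class CheckpointReplay {
    // Durable state shared by all hosts (stands in for the tables in the diagram).
    static long checkpointSeq = 0;  // last sequence number covered by saved state
    static int persistedCount = 0;  // application state saved with the checkpoint

    // Process records after the checkpoint; return early when failAfterSeq is reached.
    // Returns the sequence numbers this host processed.
    static List<Long> runHost(long[] shard, int checkpointEvery, long failAfterSeq) {
        List<Long> processed = new ArrayList<>();
        int count = persistedCount;  // reload state as of the checkpoint
        int sinceCheckpoint = 0;
        for (long seq : shard) {
            if (seq <= checkpointSeq) continue;  // already covered by the checkpoint
            count++;                             // "process" the record
            processed.add(seq);
            if (++sinceCheckpoint == checkpointEvery) {
                persistedCount = count;  // save state first...
                checkpointSeq = seq;     // ...then advance the checkpoint
                sinceCheckpoint = 0;
            }
            if (seq == failAfterSeq) return processed;  // simulated crash: in-memory state is lost
        }
        return processed;
    }

    public static void main(String[] args) {
        long[] shard = {2, 3, 5, 8, 10, 14, 17, 18, 21, 23};
        System.out.println("Host A processed " + runHost(shard, 5, 18));
        System.out.println("Host B processed " + runHost(shard, 5, Long.MAX_VALUE));
        System.out.println("Records counted: " + persistedCount);
    }
}
```

Host A checkpoints at sequence number 10 and crashes at 18, so Host B reprocesses 14, 17, and 18. Because the counts Host A held in memory for those records were never saved, the replay does not double count them.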

Page 26: Getting Started with Real-Time Analytics

Catching up from delayed processing

[Diagram: Shard-i in Amazon Kinesis holds records with sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23. Host A is marked failed (X) and Host B processes the shard.]

How long until we’ve caught up?

Shard throughput SLA:

- 1 MBps in

- 2 MBps out

=> Catch-up time ~ down time

Unless your application can’t process 2 MBps on a single host

=> Provision more shards
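
The "catch-up time ~ down time" estimate follows from simple arithmetic: the backlog grows at the write rate while the consumer is down, and shrinks at the read rate minus the write rate once it is back. A small sketch, where the class and method names are hypothetical and the rates are the per-shard figures from this slide:

```java
// Hypothetical helper: estimates how long a consumer needs to drain its backlog.
final class CatchUp {
    // While catching up, new data keeps arriving, so the backlog
    // shrinks at (readMBps - writeMBps).
    static double catchUpSeconds(double downSeconds, double writeMBps, double readMBps) {
        if (readMBps <= writeMBps) {
            return Double.POSITIVE_INFINITY; // the consumer never catches up
        }
        double backlogMB = downSeconds * writeMBps;
        return backlogMB / (readMBps - writeMBps);
    }

    public static void main(String[] args) {
        // 1 MBps in, 2 MBps out, 10 minutes of down time.
        System.out.println(catchUpSeconds(600, 1.0, 2.0)); // prints 600.0
    }
}
```

If a single host can only process 1.5 MBps, the same outage takes 1200 seconds to recover from, which is why the slide suggests provisioning more shards in that case.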

Page 27: Getting Started with Real-Time Analytics

Concurrent processing applications

[Diagram: two applications, twitter-trends.com and spot-the-revolution.org, read the same Shard-i in Amazon Kinesis (sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23) concurrently, each at its own position in the shard.]

Page 28: Getting Started with Real-Time Analytics

Why not combine applications?

Amazon Kinesis

twitter-trends.com: Output state and checkpoint every 10 seconds

twitter-stats.com: Output state and checkpoint every hour

Page 29: Getting Started with Real-Time Analytics

Implementing twitter-trends.com

Page 30: Getting Started with Real-Time Analytics

Creating an Amazon Kinesis stream

Page 31: Getting Started with Real-Time Analytics

How many shards?

Additional considerations:

- How much processing is needed per data record/data byte?

- How quickly do you need to catch up if one or more of your workers stalls or fails?

You can always dynamically reshard to adjust to changing workloads
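
A first-pass shard estimate can be derived from throughput alone. The sketch below uses the per-shard figures quoted earlier in the talk (1 MBps in, 2 MBps out) and assumes that every consuming application reads the full stream and shares the shard's read capacity. The class name and parameters are hypothetical, and the result is a lower bound that ignores the processing-cost and catch-up considerations listed above.

```java
// Hypothetical helper: lower-bound shard count from throughput figures.
final class ShardSizing {
    static final double WRITE_MBPS_PER_SHARD = 1.0; // per-shard ingest figure from the talk
    static final double READ_MBPS_PER_SHARD = 2.0;  // per-shard read figure from the talk

    // Assumption: each consuming application reads every byte that is written.
    static int shardsNeeded(double ingestMBps, int consumerApps) {
        double egressMBps = ingestMBps * consumerApps;
        int forWrites = (int) Math.ceil(ingestMBps / WRITE_MBPS_PER_SHARD);
        int forReads = (int) Math.ceil(egressMBps / READ_MBPS_PER_SHARD);
        return Math.max(1, Math.max(forWrites, forReads));
    }

    public static void main(String[] args) {
        System.out.println(shardsNeeded(5.0, 2)); // writes need 5 shards, reads need 5
        System.out.println(shardsNeeded(5.0, 3)); // reads need ceil(15 / 2) = 8
    }
}
```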

Page 32: Getting Started with Real-Time Analytics

Twitter Trends shard processing code

class TwitterTrendsShardProcessor implements IRecordProcessor {
    public TwitterTrendsShardProcessor() { … }

    @Override
    public void initialize(String shardId) { … }

    @Override
    public void processRecords(List<Record> records,
            IRecordProcessorCheckpointer checkpointer) { … }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer,
            ShutdownReason reason) { … }
}

Page 33: Getting Started with Real-Time Analytics

Twitter Trends shard processing code

class TwitterTrendsShardProcessor implements IRecordProcessor {
    // Hashtag counts accumulated since the last checkpoint.
    private Map<String, AtomicInteger> hashCount = new HashMap<>();
    private long tweetsProcessed = 0;

    @Override
    public void processRecords(List<Record> records,
            IRecordProcessorCheckpointer checkpointer) {
        computeLocalTop10(records);
        // Count every record in the batch, not one per call.
        tweetsProcessed += records.size();
        if (tweetsProcessed >= 2000) {
            emitToDynamoDB(checkpointer);
        }
    }
}

Page 34: Getting Started with Real-Time Analytics

Twitter Trends shard processing code

private void computeLocalTop10(List<Record> records) {
    for (Record r : records) {
        // Decode as UTF-8 (java.nio.charset.StandardCharsets), not the platform default.
        String tweet = new String(r.getData().array(), StandardCharsets.UTF_8);
        // Split on any whitespace run; " \t" only matches a space followed by a tab.
        String[] words = tweet.split("\\s+");
        for (String word : words) {
            if (word.startsWith("#")) {
                if (!hashCount.containsKey(word)) hashCount.put(word, new AtomicInteger(0));
                hashCount.get(word).incrementAndGet();
            }
        }
    }
}

private void emitToDynamoDB(IRecordProcessorCheckpointer checkpointer) {
    // Persist the counts before checkpointing.
    persistMapToDynamoDB(hashCount);
    try {
        checkpointer.checkpoint();
    } catch (IOException | KinesisClientLibDependencyException | InvalidStateException
            | ThrottlingException | ShutdownException e) {
        // Error handling
    }
    // Start a fresh counting window.
    hashCount = new HashMap<>();
    tweetsProcessed = 0;
}

Page 35: Getting Started with Real-Time Analytics

Amazon Kinesis as a gateway into the big data ecosystem

Page 36: Getting Started with Real-Time Analytics

Big data has a lifecycle

twitter-trends.com

Twitter trends processing application

Twitter trends statistics

Twitter trends anonymized archive

Page 37: Getting Started with Real-Time Analytics

Connector framework

public interface IKinesisConnectorPipeline<T> {
    IEmitter<T> getEmitter(KinesisConnectorConfiguration configuration);
    IBuffer<T> getBuffer(KinesisConnectorConfiguration configuration);
    ITransformer<T> getTransformer(KinesisConnectorConfiguration configuration);
    IFilter<T> getFilter(KinesisConnectorConfiguration configuration);
}
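
To show how the four roles in the interface above could fit together, here is a self-contained sketch that uses standard `java.util.function` types instead of the connector library. `MiniPipeline` and everything inside it are hypothetical stand-ins, not the library's classes: each raw record is transformed, filtered, buffered, and emitted as a batch once the buffer is full.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-in for a transformer -> filter -> buffer -> emitter pipeline.
final class MiniPipeline<T> {
    private final Function<String, T> transformer; // raw record -> typed item
    private final Predicate<T> filter;             // keep only matching items
    private final int bufferLimit;                 // flush when this many items are buffered
    private final Consumer<List<T>> emitter;       // writes a batch to the destination
    private final List<T> buffer = new ArrayList<>();

    MiniPipeline(Function<String, T> transformer, Predicate<T> filter,
                 int bufferLimit, Consumer<List<T>> emitter) {
        this.transformer = transformer;
        this.filter = filter;
        this.bufferLimit = bufferLimit;
        this.emitter = emitter;
    }

    // Handle one raw record from the stream.
    void accept(String rawRecord) {
        T item = transformer.apply(rawRecord);
        if (!filter.test(item)) return;
        buffer.add(item);
        if (buffer.size() >= bufferLimit) flush();
    }

    // Emit whatever is buffered as one batch (copied so the emitter owns it).
    void flush() {
        if (buffer.isEmpty()) return;
        emitter.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    public static void main(String[] args) {
        MiniPipeline<Integer> pipeline = new MiniPipeline<>(
                Integer::parseInt, n -> n % 2 == 0, 2,
                batch -> System.out.println("emit " + batch));
        for (String raw : new String[] {"1", "2", "3", "4", "6"}) pipeline.accept(raw);
        pipeline.flush();
    }
}
```

Buffering before emitting is what makes batch-oriented destinations practical, since many small writes become a few larger ones.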

Page 38: Getting Started with Real-Time Analytics

Example Amazon Redshift connector

Amazon Kinesis

Amazon Redshift

Amazon S3

Page 39: Getting Started with Real-Time Analytics

Example Amazon Redshift connector v2

Amazon Kinesis

Amazon Redshift

Amazon S3

Amazon Kinesis

Page 40: Getting Started with Real-Time Analytics

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

MediaMath’s Data Revolution with Amazon Kinesis

Edward Fagin, VP Engineering, MediaMath

Page 41: Getting Started with Real-Time Analytics

MediaMath firehose

[Diagram: Bidder Data (wins), Site Events, and Third-Party Segments each send single events to the Firehose (Amazon Kinesis). Other labels in the diagram: Event Log Files, Real-time Analytics, Decisioning & Optimization, Archive, Warehouse (Amazon S3).]

Page 42: Getting Started with Real-Time Analytics

Bits & bytes (lots of them)

151 billion bid opportunities analyzed per day

120 terabytes of data analyzed per day

9 POPs + cloud & thousands of servers across the globe

40ms average response time

Transactional/financial data (so every record counts)

Page 43: Getting Started with Real-Time Analytics

The problem: Batch log shipping

[Diagram: Bidder Data (wins), Site Events, and Third-Party Segments are written to an Impressions Log File, an Event/Click Log File, and a Segment Log File, which are shipped in batches to the Warehouse (analytics, decisioning, optimization, archive).]

Page 44: Getting Started with Real-Time Analytics
Page 45: Getting Started with Real-Time Analytics
Page 46: Getting Started with Real-Time Analytics

The solution: Streaming architecture

• Real-time data access

• Record-by-record routing and processing

• Multiple consumers (push and pull)

• Terabyte scale and beyond

Page 47: Getting Started with Real-Time Analytics

MediaMath firehose

[Diagram: Bidder Data (wins), Site Events, and Third-Party Segments each send single events to the Firehose (Amazon Kinesis). Other labels in the diagram: Event Log Files, Real-time Analytics, Decisioning & Optimization, Archive, Warehouse (Amazon S3).]

Page 48: Getting Started with Real-Time Analytics

MediaMath firehose

Multiple Amazon Kinesis streams

• Hundreds of shards

• Terabytes of daily data

• Topic-level streams

Seamless integration with existing producers and consumers

Consumers built on Amazon Kinesis Client Library (KCL)

• KCL-based consumers on (mostly) c3.4xlarge instances

• Connectors to Amazon S3, Amazon EMR, Amazon Kinesis, and more

Page 49: Getting Started with Real-Time Analytics
Page 50: Getting Started with Real-Time Analytics
Page 51: Getting Started with Real-Time Analytics

Opportunities Unlocked

• Innovation speed

• Reliability of data pipeline

• Real-time analytics

• Customer value

Page 52: Getting Started with Real-Time Analytics
Page 53: Getting Started with Real-Time Analytics

Your feedback is important to AWS

Please complete the session evaluation. Tell us what you think!

@shawnagram

@edwardfagin

Page 54: Getting Started with Real-Time Analytics

NEW YORK