Real-time Data Processing Using AWS Lambda

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ken Payne – AWS Solutions Architect

27 Oct 2016

Agenda

Why Serverless?

How can AWS Serverless Services be used for Real-time Data Processing?

The serverless compute manifesto

• Functions are the unit of deployment and scaling.

• No machines, VMs, or containers visible in the programming model.

• Permanent storage lives elsewhere.

• Scales per request. Users cannot over- or under-provision capacity.

• Never pay for idle (no cold servers/containers or their costs).

• Implicitly fault-tolerant because functions can run anywhere.

• BYOC – Bring your own code.

• Metrics and logging are a universal right.

Why Serverless?

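[Diagram: the application stack, from bottom to top]

• Hardware

• OS (configuration, updates, security/hardening)

• Dependencies (e.g. Apache, MySQL)

• Framework (e.g. PHP, Django, Ruby-on-Rails)

• Your code

With Lambda, everything below your code is “undifferentiated heavy lifting” that is handled for you.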

Batch processing: GENERATE → STORE → ANALYZE → SHARE

Stream processing: GENERATE → SHARE

Data Streaming: Architecture & Workflow

[Diagram: streaming data sources include smart devices, click streams, and log data]

AWS Lambda: Overview

Lambda functions: a piece of code with stateless execution

Triggered by events:

• Direct Sync and Async API calls

• AWS Service integrations

• 3rd party triggers

• And many more …

Makes it easy to:

• Perform data-driven auditing, analysis, and notification

• Build back-end services that perform at scale

Benefits of AWS Lambda for building a serverless data processing engine

“Productivity focused compute platform to build powerful, dynamic, modular applications in the cloud”

1. No infrastructure to manage: focus on business logic, not infrastructure. You upload code; AWS Lambda handles everything else.

2. High performance at any scale; cost-effective and efficient: pay only for what you use. Lambda automatically matches capacity to your request rate. Purchase compute in 100 ms increments.

3. Bring your own code: run code in a choice of standard languages. Use threads, processes, files, and shell scripts normally.

Amazon Kinesis: Overview

Managed services for streaming data ingestion and processing

• Amazon Kinesis Streams: build your own custom applications that process or analyze streaming data

• Amazon Kinesis Firehose: easily load massive volumes of streaming data into Amazon S3 and Redshift

• Amazon Kinesis Analytics: easily analyze data streams using standard SQL queries

Benefits of Amazon Kinesis for stream data ingestion and continuous processing

Real-time Ingest

• Highly scalable

• Durable

• Replay-able reads

Continuous Processing

• GetShardIterator and GetRecords(ShardIterator): allow checkpointing/replay and enable concurrent processing by multiple consumers

• KCL, Firehose, Analytics, Lambda: enable data movement into many stores/processing engines

• Managed service

• Low end-to-end latency
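The checkpoint/replay idea behind GetShardIterator and GetRecords can be sketched as follows. This is a minimal illustration, not a production consumer: `read_shard` and its parameters are hypothetical names, and `client` is assumed to be any object that exposes the two operations with boto3-style keyword arguments (for example a boto3 Kinesis client).

```python
def read_shard(client, stream_name, shard_id, checkpoint=None, max_calls=10):
    """Read records from one shard, resuming after `checkpoint` if given.

    Returns (records, last_sequence_number). Saving the returned sequence
    number as the next checkpoint lets a consumer resume or replay later.
    """
    if checkpoint is None:
        # No checkpoint yet: start from the oldest record still retained.
        response = client.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )
    else:
        # Resume just after the last record that was processed.
        response = client.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=checkpoint,
        )
    iterator = response["ShardIterator"]

    records = []
    for _ in range(max_calls):
        if iterator is None:  # the shard has been closed
            break
        page = client.get_records(ShardIterator=iterator, Limit=100)
        records.extend(page["Records"])
        iterator = page.get("NextShardIterator")
        if not page["Records"]:
            # Simplification for this sketch: treat an empty page as caught up.
            break

    last = records[-1]["SequenceNumber"] if records else checkpoint
    return records, last
```

Because each consumer keeps its own iterator and checkpoint, several applications can read the same shard independently.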

AWS Lambda and Amazon Kinesis integration: How it Works

Pull event source model:

▪ Kinesis mapped as Event Source in Lambda

▪ Lambda polls the stream and batches available records

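▪ Batches are passed to the Lambda function for invocation as the function's event parameter

What this means:

▪ Lambda runs a fleet of pollers based on a flavour of the KCL

▪ No resource policy is needed on the function; access to the stream comes from the function's execution role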


Event structure: the event received by the Lambda function is a collection of records from the Kinesis stream

{
  "Records": [
    {
      "kinesis": {
        "partitionKey": "partitionKey-3",
        "kinesisSchemaVersion": "1.0",
        "data": "SGVsbG8sIHRoaXMgaXMgYSB0ZXN0IDEyMy4=",
        "sequenceNumber": "49545115243490985018280067714973144582180062593244200961"
      },
      "eventSource": "aws:kinesis",
      "eventID": "shardId-000000000000:49545115243490985018280067714973144582180062593244200961",
      "invokeIdentityArn": "arn:aws:iam::account-id:role/testLEBRole",
      "eventVersion": "1.0",
      "eventName": "aws:kinesis:record",
      "eventSourceARN": "arn:aws:kinesis:us-west-2:35667example:stream/examplestream",
      "awsRegion": "us-west-2"
    }
  ]
}
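A function that consumes this event has to base64-decode the `data` field of each record. The handler below is a minimal sketch of that step; the name `handler` and the trimmed-down `sample_event` are illustrative choices, and the payload is assumed to be UTF-8 text as in the sample record above.

```python
import base64


def handler(event, context):
    """Decode every Kinesis record in the batch and return the payloads."""
    payloads = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # Kinesis record data arrives base64-encoded in the Lambda event.
        text = base64.b64decode(kinesis["data"]).decode("utf-8")
        print(f'{kinesis["partitionKey"]} {kinesis["sequenceNumber"]}: {text}')
        payloads.append(text)
    return payloads


# Trimmed copy of the sample event shown above, for local experimentation.
sample_event = {
    "Records": [
        {
            "kinesis": {
                "partitionKey": "partitionKey-3",
                "kinesisSchemaVersion": "1.0",
                "data": "SGVsbG8sIHRoaXMgaXMgYSB0ZXN0IDEyMy4=",
                "sequenceNumber": "49545115243490985018280067714973144582180062593244200961",
            },
            "eventSource": "aws:kinesis",
            "eventName": "aws:kinesis:record",
        }
    ]
}
```

Calling `handler(sample_event, None)` locally decodes the sample payload to the text "Hello, this is a test 123.".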


Synchronous invocation:

▪ Lambda is invoked synchronously (RequestResponse invocation type)

▪ Lambda honors Kinesis at-least-once semantics

▪ Each shard blocks on in-order synchronous invocation


[Diagram: streams Amazon Kinesis 1 and Amazon Kinesis 2; functions AWS Lambda 1 and AWS Lambda 2; downstream services Amazon CloudWatch, Amazon DynamoDB, and Amazon S3]

• Multiple functions can be mapped to one stream

• Multiple streams can be mapped to one Lambda function

• Each mapping is a unique pair of Kinesis stream and Lambda function

• Each mapping has unique shard iterators

DEMO

Creating a Kinesis stream

Streams

▪ Made up of Shards

▪ Each Shard supports writes up to 1MB/s

▪ Each Shard supports reads up to 2MB/s across a maximum of 5 transactions/s

Data

▪ All data is stored and replayable for 24 hours

▪ A Partition Key is supplied by the producer and used to distribute the PUTs across Shards (using an MD5 function to hash to 128-bit integers)

▪ A unique Sequence # is returned to the Producer upon a successful PUT call

▪ Make sure partition key distribution is even to optimize parallel throughput

▪ Pick a key with more groups than shards
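The effect of partition key choice on shard distribution can be explored with a small sketch. It hashes each key with MD5 into a 128-bit integer, as described above, and then assumes that the shards split the hash key space evenly; that even split, and the function names, are assumptions made for illustration (a real stream may have uneven hash key ranges after resharding).

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 produces 128-bit values


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, num_shards: int) -> int:
    """Index of the shard that owns the key, assuming evenly split ranges."""
    return hash_key(partition_key) * num_shards // HASH_SPACE


def distribution(keys, num_shards: int) -> Counter:
    """Count how many PUTs would land on each shard."""
    return Counter(shard_for(key, num_shards) for key in keys)
```

For example, `distribution((f"device-{i}" for i in range(1000)), 4)` spreads the PUTs over all shards, while a single repeated key such as `["same"] * 50` sends every PUT to one shard.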

Creating a Lambda function

Memory:

▪ CPU and disk proportional to the memory configured

▪ Increasing memory makes your code execute faster (if CPU bound)

▪ Increasing memory allows larger records to be processed

Timeout:

▪ Increasing timeout allows for longer-running functions, but a longer wait in case of errors

Retries:

▪ For Kinesis, Lambda retries until the data expires (default 24 hours)

Permission model:

▪ The execution role defined for Lambda must have permission to access the stream

Configuring the Event Source

Batch size:

▪ Max number of records that Lambda will send to one invocation

▪ Not equivalent to how many records Lambda will poll

▪ Effective batch size, evaluated every 250 ms, is:

MIN(records available, batch size, 6MB)

▪ Increasing batch size allows fewer Lambda function invocations with more data processed per invocation
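The MIN(records available, batch size, 6MB) rule can be made concrete with a short sketch. The function name is hypothetical, 6MB is interpreted here as 6 MiB, and record sizes are counted as raw bytes (encoding overhead in the event payload is ignored), all of which are simplifying assumptions.

```python
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024  # assumption: "6MB" taken as 6 MiB


def effective_batch(record_sizes, batch_size, max_payload=MAX_PAYLOAD_BYTES):
    """Number of available records that fit into one invocation.

    `record_sizes` lists the byte sizes of the records currently available,
    in stream order. The batch stops at the configured batch size or at the
    payload limit, whichever is reached first.
    """
    count = 0
    total = 0
    for size in record_sizes:
        if count == batch_size or total + size > max_payload:
            break
        count += 1
        total += size
    return count
```

With a configured batch size of 100, fifty available 1 KB records give an effective batch of 50, five hundred give 100, and ten 1 MiB records give 6 because of the payload limit.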

Configuring the Event Source

Starting Position:

▪ The position in the stream where Lambda starts reading

▪ Set to “Trim Horizon” to read from the oldest record still in the stream (all retained data)

▪ Set to “Latest” to read only records that arrive after the event source is configured (new data only)

Processing

Per Shard:

▪ Lambda calls GetRecords with the maximum limit Kinesis allows (10,000 records or 10 MB)

▪ If no records are returned, wait 250 ms

▪ Sub-batches the records held in memory and formats them into the Lambda payload

▪ Invokes the Lambda function synchronously
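The per-shard steps above can be sketched as a small loop. This is a simulation of the described behaviour, not the Lambda service's actual poller: `poll_shard`, `get_records`, and `invoke` are hypothetical names, with the two callables standing in for the Kinesis read and the synchronous function invocation.

```python
import time


def poll_shard(get_records, invoke, batch_size, max_polls,
               wait_seconds=0.25, sleep=time.sleep):
    """Poll one shard and invoke the function in order, batch by batch.

    get_records() returns a (possibly empty) list of records.
    invoke(batch) is called synchronously, so the next batch is not
    sent until the previous invocation has returned.
    Returns the number of invocations made.
    """
    invocations = 0
    for _ in range(max_polls):
        records = get_records()
        if not records:
            sleep(wait_seconds)  # nothing to process: wait 250 ms
            continue
        # Split what was read into sub-batches of at most batch_size.
        for start in range(0, len(records), batch_size):
            invoke(records[start:start + batch_size])
            invocations += 1
    return invocations
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting, for example by supplying a list's `append` method to record the requested delays.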

[Diagram: Source → Kinesis shards → Lambda poller → functions → Destination 1, Destination 2. The poller polls the shards and invokes the functions with synchronous batch invokes. Scale Kinesis by adding shards; Lambda will scale automatically.]

Processing

▪ Lambda blocks on ordered processing for each individual shard

▪ Increasing # of shards with even distribution allows increased concurrency

▪ Batch size may impact duration if the Lambda function takes longer to process more records

Processing

▪ Maximum theoretical throughput:

# shards * 2 MB/s

▪ Effective theoretical throughput:

# shards * batch size (MB) / Lambda function duration (s)

▪ If the put/ingestion rate is greater than the theoretical throughput, your processing is at risk of falling behind
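A worked example of the two formulas above, as a small sketch. The function names and the sample numbers (4 shards, 0.5 MB batches, 1 s function duration) are illustrative assumptions, not measurements.

```python
def max_throughput_mb_s(shards):
    """Maximum theoretical throughput: # shards * 2 MB/s."""
    return shards * 2.0


def effective_throughput_mb_s(shards, batch_mb, duration_s):
    """Effective throughput: # shards * batch size (MB) / duration (s)."""
    return shards * batch_mb / duration_s


def falling_behind(ingest_mb_s, shards, batch_mb, duration_s):
    """True if the ingestion rate exceeds what the functions can drain."""
    limit = min(max_throughput_mb_s(shards),
                effective_throughput_mb_s(shards, batch_mb, duration_s))
    return ingest_mb_s > limit
```

With 4 shards, the maximum is 8 MB/s, but 0.5 MB batches that take 1 s each to process give an effective 2 MB/s. An ingestion rate of 3 MB/s would therefore fall behind, while 1.5 MB/s would not.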

Common observations:

▪ Effective batch size may be less than configured during low throughput

▪ Effective batch size will increase during higher throughput

▪ Number of invokes and GetRecords calls will decrease with increased Lambda duration

Processing

▪ Retry execution failures until the record is expired

▪ Retry with exponential backoff up to 60s

▪ Throttles and errors impact duration and directly impact throughput

▪ Effective theoretical throughput:

( # shards * batch size (MB) ) / ( function duration (s) * retries until expiry )
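The retry behaviour and its effect on throughput can be sketched as below. Only the 60 s cap comes from the text above; the 1 s starting delay and the doubling factor are assumptions chosen for illustration, and the function names are hypothetical.

```python
def backoff_delays(attempts, base=1.0, factor=2.0, cap=60.0):
    """Delay before each retry: exponential growth, capped at `cap` seconds.

    base and factor are illustrative assumptions; only the 60 s cap is
    taken from the description above.
    """
    return [min(cap, base * factor ** n) for n in range(attempts)]


def throughput_with_retries(shards, batch_mb, duration_s, retries):
    """(# shards * batch size) / (function duration * retries)."""
    return shards * batch_mb / (duration_s * retries)
```

Under these assumptions, `backoff_delays(8)` yields delays of 1, 2, 4, 8, 16, 32, 60, and 60 seconds. With 4 shards, 0.5 MB batches, and a 1 s duration, needing 4 attempts per batch cuts the effective throughput from 2 MB/s to 0.5 MB/s.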

[Diagram: Source → Kinesis shards → Lambda poller → functions → Destination 1, Destination 2. The poller polls a shard; the invocation receives an error, receives an error again on retry, and finally receives success.]

Monitoring Kinesis

• GetRecords (effective throughput): bytes, latency, records, etc.

• PutRecord: bytes, latency, records, etc.

• GetRecords.IteratorAgeMilliseconds: how old your last processed records were. If high, processing is falling behind. If close to 24 hours, records are close to being dropped.
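The interpretation of GetRecords.IteratorAgeMilliseconds can be expressed as a simple check. The 24-hour retention comes from the text above; the 10% and 90% thresholds and the status labels are hypothetical values chosen for this sketch and should be tuned to the workload.

```python
RETENTION_MS = 24 * 60 * 60 * 1000  # default 24-hour retention


def iterator_age_status(age_ms, retention_ms=RETENTION_MS,
                        behind_fraction=0.1, danger_fraction=0.9):
    """Classify an iterator age reading.

    The fractions are illustrative assumptions: above 10% of retention the
    consumer is considered to be falling behind, above 90% the oldest
    unprocessed records are close to being dropped.
    """
    if age_ms >= danger_fraction * retention_ms:
        return "near-expiry"
    if age_ms >= behind_fraction * retention_ms:
        return "falling-behind"
    return "ok"
```

For example, an age of one second is "ok", six hours is "falling-behind", and 23 hours is "near-expiry" under these thresholds.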

Monitoring Lambda

• Monitoring: available in Amazon CloudWatch Metrics

  • Invocation count

  • Duration

  • Error count

  • Throttle count

• Debugging: available in Amazon CloudWatch Logs

  • All Metrics

  • Custom logs

  • RAM consumed

  • Search for log events

Best Practices

▪ Create enough shards for parallel processing

▪ Distribute load evenly across shards

▪ Monitor and address Lambda errors and throttles

▪ Monitor Kinesis limits and throttles

Best Practices

Create different Lambda functions for each task and associate them with the same Kinesis stream

[Diagram: one function logs to CloudWatch Logs; another pushes to SNS]

Get Started: Data Processing with AWS

Next Steps

1. Create your first Kinesis stream. Configure hundreds of thousands of data producers to put data into an Amazon Kinesis stream, e.g. data from social media feeds.

2. Create and test your first Lambda function. Use any third-party library, even native ones. First 1M requests each month are on us!

3. Read the Developer Guide, the AWS Lambda and Kinesis Tutorial, and resources on GitHub at AWS Labs

• http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html

• https://github.com/awslabs/lambda-streams-to-firehose
