Top Banner
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. November 30, 2016 Disrupting Big Data with Cost-effective Compute Charles Allen, Metamarkets Durga Nemani, Gaurav Agrawal, AOL Anu Sharma, Amazon EC2 CMP302
71

AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Jan 11, 2017

Download

Technology

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

November 30, 2016

Disrupting Big Data with

Cost-effective ComputeCharles Allen, Metamarkets

Durga Nemani, Gaurav Agrawal, AOL

Anu Sharma, Amazon EC2

CMP302

Page 2: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Amazon EC2 Spot instances

• Regular EC2 instances opened to the Spot market when

spare

• Prices on average 70-80% lower than On-Demand

• Best suited for workloads that can scale with compute

• Accelerate jobs 5-10 times e.g. run faster CI/CD pipelines

(case study: Yelp)

• Reduce costs by 5-10 times, scale stateless web applications

(case study: Mapbox, Ad-tech)

• Generate better business insights from your event stream

Page 3: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

In this session

• Use Case: context and history

• AOL: Separation of Compute and Storage using Amazon EMR and

EC2 Spot instances

• Architecture

• Cost Optimization

• Orchestration

• Monitoring

• Best Practices

• Metamarkets: Spark and Druid on EC2 Spot instances

• Architecture Overview: Real-time, Batch Jobs, Lambda

• Spark on Spot instances

• Druid on Spot instances

• Monitoring

Page 4: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Business Intelligence Data Set

• Event Data • Timestamp

• Dimensions/Attributes

• Measures

• Total data set is huge, billions of events per day

Page 5: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Relational Databases

Traditional Data Warehouse Star

Schema

• FACT table contains primary information

and measures to aggregate

• DIM tables contain additional attributes

about entities

• Queries involve joins between central

FACT and DIM tables

Performance degrades as data scales.

Page 6: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Key/Value Stores

Fast writes, fast lookups

• Pre-compute every possible query

• As more columns are added, query space grows exponentially

• Primary key is a hash of timestamp and dimensions

• Value is measure to aggregate

• Shuffle data from storage to computational buffer - slow

• Difficult to create intelligent indexes

Precomputation Range Scans

Page 7: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

General Compute Engines

SQL on Hadoop

• Scale with compute power

• Generate up to 5-10x faster

business insights with cheaper

compute

• Or just reduce costs by 80-90%

Page 8: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Pioneers to Settlers

Algorithmic Efficiency to Mundane Efficiency

Page 9: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Separation of Compute and Storage

Durga Nemani, System Architect, AOL

Gaurav Agrawal, Software Engineer, AOL

Big Data Processing with Amazon EMR and EC2 Spot instances

Page 10: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Architecture

Page 11: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Architecture

AWS Lambda :

Orchestration

Elastic IP

Amazon

EMR Hive

AWS IAM

Amazon

S3 : Data

Lake

Amazon Dynamo DB :

Data Validation

Amazon

EMR Hive

client

Amazon RDS : Hive Metastore

Data Processing Data Analytics

Amazon EMR

Presto

Elastic IP

Amazon

EMR Presto

client

Page 12: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Key features and advantages• Separation of compute and storage

• Scale compute and storage independently

• Separate data processing and analytics

• Hive for processing, Presto for analytics

• No data migration

• S3 Data lake

• Single source of truth

• Columnar format for performance and compression

• VPC design

• Identified by Name Tags

• AOL CIDR, VPN

• Few lines of code change vs big data migration efforts

Page 13: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Cost Optimization

Page 14: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Amazon EC2 Spot Instances

• Keep in mind

• Availability

• Spot pricing vary for

• Instance Types

• Availability Zone

• Different provisioning time

• AOL Requirement

• Major restatement - 15-20K EC2 Instances

• Data for 15+ countries

• Frequency : HLY, DLY, WLY, MTD, MLY, 28 Days

Page 15: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

EMR Deployment Setup

• Set up VPC in all regions

• Ensure Spot Limits

• Setup Hard EC2 limit per AZ

• Multiple instance types

• Define Instance Type-Core Mapping

• Data Volume

• Code Complexity

• Pay actual price not bid price!

Page 16: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Deployment Logic Diagram

Data Volume

+

Code

Complexity

Pick

Instance

Type

Sorted Spot

Price AZ

List

Number of

Cores = A

Next AZ

in List?

Open/Active

Instances =

B

A + B <

AZ Limit

Kick off

EMR

Yes

Yes

No

No

Page 17: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Average Cost Saving Graphs

**m3.xlarge Sept’2016 Cost

On Demand

Static AZ

~80%

Savings

Page 18: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Average Cost Saving Graphs

Static AZ

Cheap AZ

10-15%

Savings

**m3.xlarge Sept’2016 Cost

Page 19: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Size ( GB

)Cores Hours

Local AZ

Cost

Cheaper

AZ Cost

Transfer

Cost

Total

Cost

Cost

Savings

50 100 2 3,431 2,847 365 3,212 6%

100 300 3 20,586 17,082 730 17,812 13%

200 500 5 51,465 42,705 1,460 44,165 14%

300 700 7 109,792 91,104 2,190 93,294 15%

Why cheaper AZ matters?

• Data transfer cost

• Worst Case scenario – Cheaper AZ not in local region

• More Data => More Nodes + More Hours

Size ( GB

)Cores Hours

Local AZ

Cost

Cheaper

AZ Cost

Transfer

Cost

Total

Cost

Cost

Savings

10 25 1 429 356 73 429 0%

**m3.xlarge Sept’2016 Cost

Page 20: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

EMR Region Distribution

us-east-120%

ap-northeast-11%

sa-east-19%

ap-southeast-13%

ap-southeast-29%

us-west-226%

us-west-14%

eu-west-128%

AOL DW Sept-Oct 2016

80% times

Cheaper AZ is

not in local

region

Page 21: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Average Cost Saving Graphs9/1

/16

9/2

/16

9/3

/16

9/4

/16

9/5

/16

9/6

/16

9/7

/16

9/8

/16

9/9

/16

9/1

0/1

6

9/1

1/1

6

9/1

2/1

6

9/1

3/1

6

9/1

4/1

6

9/1

5/1

6

9/1

6/1

6

9/1

7/1

6

9/1

8/1

6

9/1

9/1

6

9/2

0/1

6

9/2

1/1

6

9/2

2/1

6

9/2

3/1

6

9/2

4/1

6

9/2

5/1

6

9/2

6/1

6

9/2

7/1

6

9/2

8/1

6

9/2

9/1

6

9/3

0/1

6

Static AZ

Cheap AZ

AZ+Data+Code

15-22%

Savings

**m3.xlarge Sept’2016 Cost

Page 22: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Orchestration: AWS Lambda

Page 23: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Process Pipeline Overview

• Multiple Stages b/w Raw Data & Final Summary

• Ensure dependencies

• Integration with Data services

• Extensible, Scalable & Reliable

• Recovery Options

• Notifications

• Directed Acyclic Graph

Page 24: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Sample DW Workflow

a

b c

e

jg h i

dOperations

Page 25: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

AOL DW Process Pipeline

Amazon S3

Amazon S3

Amazon EMR

Amazon EMR

AWS Lambda

Python Boto

AWS Lambda

Python Boto

Page 26: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Benefits & Suggestions

• Improved SLA due to Event based model

• Serverless – Zero Administration

• Millisecond response time

• Pricing – 1 million requests/month Free

• Generic utilities for Extensibility

• Built in Auto Scaling

• CloudWatch Logging

• Replaced ~2000 Autosys jobs

Page 27: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

EMR Monitoring

Page 28: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

EMR Monitoring - Prunella

• Tons of clusters/day

• EMR Failure causes

• Network Connectivity

• Bootstrap Actions

• Zero OPS Hours

• SLA improvement

• No datacenter dependency

• Notifications – Email/Slack

Page 29: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)
Page 30: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Good to have

• S3 Lifecycle based on Tags

• Terminate Long STARTING EMR Cluster

• Python 3 Lambda Support

• Lambda Code Test/Deployment

• Kappa

• Global EMR Dashboard

• Redshift External Tables

Page 31: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Recap

• Transient Spot Architecture

• S3 as Data Lake

• Cost Optimization

• Dynamic choice of Spot AZ and Number of Cores

• Server less Process Pipeline

• AWS Lambda for event driven design

• Automated EMR Monitoring

• Reduce Manual intervention for 1000s of clusters

Page 32: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Photo Credits

• Gabor Kiss - http://bit.ly/2epkQJY

• AustinPixels- http://bit.ly/2eAenqr

• Mike - http://bit.ly/2eqGx82

Page 33: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Related Sessions

• AWS re:Invent 2015 | (BDT208) A Technical Introduction

to Amazon Elastic MapReduce

• https://www.youtube.com/watch?v=WnFYoiRqEHw

• AWS re:Invent 2015 | (BDT210) Building Scalable Big

Data Solutions: Intel & AOL

• https://www.youtube.com/watch?v=2yZginBYcEo

Page 34: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Spark and Druid on EC2 Spot Instances

Charles Allen, Metamarkets

Page 35: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

About Me

Director of Platform

Druid PMC

@drcrallen

[email protected]

Special thanks to Jisoo Kim

Page 36: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Programmatic data is 100x larger than Wall

Street

Page 37: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Metamarkets

+ Industry leader in interactive analytics for

programmatic marketing

+ > 100B events / day

+ Typical peak approx 2,000,000 events / sec

+ Massaged, joined, HA replicated → 3M/s

Move fast. Think big. Be open. Have fun.

Page 38: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Metamarkets

+ Event ingestion lag down to few ms

+ Dynamic queries

+ Query latency less than 1 second

+ Specially tailored for real-time bidding

Page 39: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Current Spot Usage

Page 40: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Current Spot Usage

+ Spark

+ Druid

+ Jenkins

Page 41: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Brief Architecture Overview - Real-time

Kafk

a Druid Real-

time

Indexing

Kafka / Samza

Very Fast

● Pretty accurate

● On-time data

Page 42: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Brief Architecture Overview - BatchK

afk

a

Druid

HistoricalS3 Spark

A Few Hours Later

● Highly accurate

● Deduplicated

● Late data

Page 43: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Brief Architecture Overview (Lambda)

Real Time

Batch

Kafk

a

ΔFew HrsHistorical

User

Page 44: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Key Technologies Used

+ Kafka

+ Samza

+ Spark

+ Druid

Page 45: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Spark on Spot

Page 46: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Why Spark?

The Good:

+ No HDFS

+ Good enough partial failure recovery

+ Native Mesos, Yarn, and Stand-alone

The Bad:

+ Rough to configure multi-tenant

Page 47: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Spark

+ Between 1 and 4 PiB / day

(mem bytes spilled)

+ Between 200B and 1T events / day

+ Peak days can be up to 5x baseline

Think Big.

Page 48: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Cost Savings

SPARK

Savings vs. on-demand

Approx equal to 3-year term

>60%

Page 49: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Tradeoff

+ More complex job failure handling

+ “Did my job die because of Me, Spark, the

Data, or the Market?”

+ More random delays

+ More man-hours to manage, or

automation to build

Page 50: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Druid on Spot

Page 51: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Druid on Spot

Some of our Historical nodes run on Spot

185 TB (compressed)

state on EBS on Spot

⅕ of a petabyte can vanish… and come back in

15 minutes

Page 52: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Druid Historical Data

1 hr < EVENT_TIME < X Months

X Months < EVENT_TIME < Y Months

HOT

Y Months < EVENT_TIME < Z Years

COLD

ICY

Page 53: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Historical Tier QPS (Logscale)

Page 54: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Historical Tier QPS (Logscale)

Spot can

go here

Page 55: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Using EBS With Druid on Spot

+ Define a “pool” tag or EBS volumes

+ If EBS “pool” is “empty” (no unmounted volumes)

Create a new volume (with proper tags) and mount it

+ Otherwise, claim drive from pool

+ Sanity check on volume, discard if unrecoverable

Page 56: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Using EBS With Druid on Spot

+ Monitor spot notifications[1] to stop gracefully

+ If stop is detected, prepare to die gracefully

+ Stop applications (hook)

+ Unmount volume cleanly

+ Do not actually terminate instance; wait for death

[1] https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/

Page 57: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Terrifying to Boring

(Originally ran without EBS reattachment)

[ops] Search Alert: More than 0 results found

for "DRUID - Spot Market Fluctuations"

Now mundane.

Page 58: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Druid Tips

+ Coordinator (thing that moves state around) does

better with NO tier than with a half-tier

+ Flapping nodes can cause backpressure, better to

kill entire tier than repeatedly flap up and down.

+ Nodes usually have a burn-in time before they reach

steady-state fast queries (few minutes)

Page 59: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Druid + Spot + EBS

Accomplished by EBS re-attachment

Metamarkets is proud to Open Source this tool

Be Open.

Page 60: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Monitoring

Page 61: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Spot Price on the AWS Management Console

If only there was some tool that allowed

powerful, drill-down analytics on real-time

markets…

Page 62: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)
Page 63: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

x1.32xl price stability across zones

Page 64: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Final Thoughts

Page 65: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Spot Caveats

+ Switching from Spot to On-Demand does NOT

always work

+ Pricing strategy tuned to value of lost work

+ Scaling in a Spot market must be done SLOWLY

(tens of nodes at a time)

+ us-east-1 is crowded

Page 66: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Lessons Learned… “If I could do it all over

again”

+ Multi-homed (at least by AZ) from the very start

+ us-west

+ More ZK quorums

+ Build on cluster resource framework

Page 67: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

We Are Hiring!

Have Fun!

Page 68: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Metamarkets and Spot

+ Metamarkets has great internal tooling for Spot

market insight

+ Druid uses EBS reattachment

+ Spark works well with proper configuration

Page 69: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Thank you!

Page 70: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Remember to complete

your evaluations!

Page 71: AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Related Sessions

• AWS re:Invent 2015 | (BDT208) A Technical Introduction

to Amazon Elastic MapReduce

• https://www.youtube.com/watch?v=WnFYoiRqEHw

AWS re:Invent 2015 | (BDT210) Building Scalable Big Data

Solutions: Intel & AOL

• https://www.youtube.com/watch?v=2yZginBYcEo