Analytics on AWS
IP Expo 2013
BIG DATA
When innovation is required
to collect, store, analyze, and
manage your data
VOLUME
VELOCITY
VARIETY
Customer Needs
• Store Any Amount of Data
– Without Capacity Planning
• Perform Complex Analysis on Any Data
– Scale on Demand
• Store Data Securely
• Decrease Time to Market
– Build Environments Quickly
• Reduce Costs
– Reduce Capital Expenditure
• Enable Global Reach
Ingestion | Integration
Elastic Block Store: High-performance block storage device
1GB to 1TB in size
Mount as drives to instances, with snapshot/cloning functionality
Availability: 99.99%
Durability: 99.999999999%
Paradigm: Object store; a web store, not a file system
No single points of failure; eventually consistent
Performance: Very fast
Redundancy: Across Availability Zones
Security: Public key / private key
Pricing: $0.095/GB/month (Dublin)
Typical use case: Write once, read many
Limits: 100 buckets, unlimited storage, 5TB objects
Simple Storage Service: Highly scalable object storage for the internet
1 byte to 5TB in size
99.999999999% durability
Peak Requests: 1.2 Million / Second
Objects stored in S3 (billions): Q4 2007: 14; Q4 2008: 40; Q4 2009: 102; Q4 2010: 262; Q4 2011: 762; Q4 2012: 1,300; Today: 2,100
Amazon S3 provides near linear scalability
S3 Streaming Performance (GB/second vs. reader connections)
100 VMs: 9.6GB/s at $26/hr
350 VMs: 28.7GB/s at $90/hr
34 seconds per terabyte
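The per-terabyte figure follows directly from the aggregate rate; a quick check, using decimal units (1TB = 1000GB) to match the slide's numbers:

```python
# At an aggregate 28.7 GB/s (the 350-VM configuration), streaming one
# terabyte takes roughly the quoted 34 seconds.
seconds_per_tb = 1000 / 28.7
print(round(seconds_per_tb, 1))  # 34.8
```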
Performance & Scalability
• Spotify is an online music
service offering instant access
to over 16 million licensed
songs
• Over 15 million active users
and 4 million paying
subscribers
• Spotify adds over 20,000 tracks
a day to its catalogue
Spotify uses Amazon S3 for Music Storage
"Amazon S3 gives us confidence in our ability to expand storage quickly while also providing high data durability."
-Emil Fredriksson, Operations Director for Spotify
Amazon Glacier: Long-term object archive
Extremely low cost per gigabyte
99.999999999% durability
Durability: 99.999999999%
Paradigm: Archive store; designed for archival, not a file system
Vaults & archives; 3-5 hour retrieval time
Performance: Configurable (low)
Redundancy: Across Availability Zones
Security: Public key / private key
Pricing: $0.011/GB/month
Typical use case: Write once, read infrequently (< 10% / month)
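At the quoted per-GB prices, the gap between the two tiers is easy to quantify; a quick sketch for a hypothetical 10TB archive:

```python
# Comparing the quoted prices: S3 in Dublin at $0.095/GB/month vs
# Glacier at $0.011/GB/month. The 10 TB archive size is illustrative.
archive_gb = 10_000
s3_cost = archive_gb * 0.095
glacier_cost = archive_gb * 0.011
print(round(s3_cost, 2), round(glacier_cost, 2))  # 950.0 110.0
print(f"Glacier is ~{s3_cost / glacier_cost:.0f}x cheaper")
```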
Simple Storage Service Highly scalable object storage
1 byte to 5TB in size
99.999999999% durability
Glacier Long term object archive
Extremely low cost per gigabyte
99.999999999% durability
Storage Lifecycle Integration
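The S3-to-Glacier handoff is driven by bucket lifecycle rules. A sketch of one such rule, shaped like an S3 lifecycle-configuration document; the bucket prefix and day counts are illustrative:

```python
import json

# One lifecycle rule tying the tiers together: objects land in S3,
# transition to Glacier after 30 days, and expire after a year.
lifecycle = {
    "Rules": [{
        "ID": "archive-then-expire",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 365},
    }]
}

print(json.dumps(lifecycle, indent=2))
```

With boto3 a dict of this shape could be passed to `put_bucket_lifecycle_configuration`; treat the exact call and field names as an assumption against the current API.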
Compute Storage
AWS Global Infrastructure
Database
App Services
Deployment & Administration
Networking
NoSQL Data Capture
DynamoDB Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with EMR & Hive
RDS | DynamoDB | Redshift
• Writes
– Acknowledged (committed) once they exist in at least two physical data centers
– Persisted to SSD
• Reads
– Tunable for application requirements
– No reduction in durability or consistency in order to achieve throughput
Dynamo Consistency
Eventually Consistent Read | Strongly Consistent Read
Stale value reads possible | No stale value reads
Highest throughput | Lower potential throughput
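The throughput trade-off can be made concrete with DynamoDB's provisioned-capacity arithmetic: one read capacity unit covers one strongly consistent read per second of an item up to 4KB, while an eventually consistent read consumes half a unit. A small sketch (the function name is ours):

```python
import math

def required_rcus(reads_per_second: int, item_size_kb: float,
                  strongly_consistent: bool) -> int:
    """Read capacity units needed for a given workload."""
    units_per_read = math.ceil(item_size_kb / 4)  # rounded up to 4 KB chunks
    rcus = reads_per_second * units_per_read
    if not strongly_consistent:
        rcus /= 2  # eventually consistent reads cost half
    return math.ceil(rcus)

# Example: 1,000 reads/sec of 6 KB items
print(required_rcus(1000, 6, strongly_consistent=True))   # 2000
print(required_rcus(1000, 6, strongly_consistent=False))  # 1000
```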
• Shazam connects more than 200
million people, in more than 200
countries and 33 languages, to the
music, TV shows and brands they love
• When customers hear a song or see a
TV program or ad they like, they simply
activate the app to “tag” it
• Shazam realized it could support over
500,000 writes per second with
DynamoDB
• Also using Amazon EMR for large-
scale data analysis that can require
more than 1 million writes per second
Shazam scaled DynamoDB to 500,000 IOPS for a
Super Bowl ad
"AWS gave us the ability to bring a massive amount of capacity online in a short period of time."
-Jason Titus, Shazam CTO
Complex Data Analysis …
Parallel ETL
Elastic MapReduce: Managed, elastic Hadoop cluster
Integrates with S3 & DynamoDB
Automated installation of Hive & Pig
Support for Spot Instances
Integrated HBase NoSQL database
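EMR runs standard Hadoop jobs, so any pair of programs that read stdin and write stdout can serve as a Hadoop Streaming step. A minimal word-count sketch, with the mapper and reducer modelled as plain functions so they can be tested locally; the sample data is illustrative:

```python
from collections import Counter

# Mapper step: emit "word<TAB>1" for every token, as a streaming
# mapper would write to stdout.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

# Reducer step: sum the counts per word. On a real cluster Hadoop
# sorts mapper output by key before the reducer runs; Counter stands
# in for that grouping here.
def reducer(pairs):
    counts = Counter()
    for pair in pairs:
        word, n = pair.split("\t")
        counts[word] += int(n)
    return dict(counts)

sample = ["hadoop on emr", "emr runs hadoop"]
print(reducer(mapper(sample)))  # {'hadoop': 2, 'on': 1, 'emr': 2, 'runs': 1}
```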
Application Services
Elastic MapReduce
EMR Data Sources
Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
Scenario #1: Cost without Spot
Job flow duration: 14 hours
4 instances * 14 hrs * $0.50 = $28
Scenario #2: Cost with Spot
Job flow duration: 7 hours
4 on-demand instances * 7 hrs * $0.50 = $14
5 Spot instances * 7 hrs * $0.25 = $8.75
Total = $22.75
Time savings: 50%
Cost savings: ~20%
Other EMR + Spot use cases: run the entire cluster on Spot for the biggest cost savings; reduce the cost of application testing
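The arithmetic behind the two scenarios, as a quick sketch; prices and instance counts are the slide's:

```python
# Each entry is (instance count, hours, price per instance-hour).
def cost(instances_hourly):
    return sum(n * h * p for n, h, p in instances_hourly)

scenario_1 = cost([(4, 14, 0.50)])               # on-demand only, 14 hours
scenario_2 = cost([(4, 7, 0.50), (5, 7, 0.25)])  # add 5 Spot nodes, 7 hours

print(scenario_1)  # 28.0
print(scenario_2)  # 22.75
print(f"time saved: {1 - 7/14:.0%}, cost saved: {1 - scenario_2/scenario_1:.0%}")
```

The exact cost saving is 18.75%, which the slide rounds to ~20%.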
Compute
Elastic Compute Cloud (EC2): Basic unit of compute capacity
Vertical scaling, from $0.02/hr
Range of CPU, memory & local disk options
13 instance types available, from micro to cluster compute
Feature | Details
Flexible: Run Windows or Linux distributions
Scalable: Wide range of instance types, from micro to cluster compute
Machine Images: Configurations can be saved as machine images (AMIs) from which new instances can be created
Full control: Full root or administrator rights
Secure: Full firewall control via Security Groups
Monitoring: Publishes metrics to CloudWatch
Inexpensive: On-demand, Reserved and Spot instance types
VM Import/Export: Import and export VM images to transfer configurations in and out of EC2
Cluster Compute
EC2 Instance: 2nd-generation cluster compute instance
Cluster Compute instances implement HVM process execution
Intel® Xeon® E5-2670 processors
10 Gigabit Ethernet
Per instance: 80 EC2 Compute Units, 60GB RAM, 3TB local disk
Network placement groups: Cluster instances deployed in a 'Placement Group' enjoy low-latency, full-bisection 10Gbps bandwidth
CC2 Instance Cluster: 240 TFLOPS, making it the 72nd fastest supercomputer in the world (#42 when announced at SC'11)
(Test performed Nov 2011; benchmark published June 2012)
Cluster GPU
EC2 instance: GPU compute instances with Intel® Xeon® X5570 processors
2 x NVIDIA Tesla "Fermi" M2050 GPUs
I/O performance: Very high (10 Gigabit Ethernet)
Per instance: 33.5 EC2 Compute Units, 20GB RAM, 2x NVIDIA GPUs at >400 cores each
S&P Capital IQ Uses AWS for Big Data Processing
Provides data to 4,200+ top global investment firms
Launched Hadoop faster, learned Hadoop faster
S3 -> Hadoop cluster
Structured Data Management
Structured Data Analysis
Relational Database Service: Managed Oracle, MySQL & SQL Server
DynamoDB: Managed NoSQL database
Amazon Redshift: Massively parallel, petabyte-scale data warehouse
Structured Data Analysis
Relational Database Service: Database-as-a-Service
No need to install or manage database instances
Scalable and fault-tolerant configurations
Integration with Data Pipeline
Structured Data Analysis
Redshift: Managed, massively parallel, petabyte-scale data warehouse
Streaming backup/restore to S3
Extensive security
2TB -> 1.6PB
Redshift parallelizes and distributes everything: query, load, backup, restore, resize
Common BI tools connect via JDBC/ODBC to a leader node, which distributes work across compute nodes joined by a 10GigE mesh
Redshift lets you start small and grow big
• Extra Large Node (XL): 3 spindles, 2TB, 15GiB RAM, 2 virtual cores, 10GigE
– Single node (2TB), or cluster of 2-32 nodes (4TB-64TB)
• Eight Extra Large Node (8XL): 24 spindles, 16TB, 120GiB RAM, 16 virtual cores, 10GigE
– Cluster of 2-100 nodes (32TB-1.6PB)
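The sizing options reduce to simple multiplication; a sketch with node specs and cluster limits taken from the slide (the function name is ours):

```python
# Storage per node and allowed cluster sizes, per the slide:
# XL = 2TB/node (single node, or clusters of 2-32);
# 8XL = 16TB/node (clusters of 2-100).
NODE_TB = {"XL": 2, "8XL": 16}
NODE_LIMITS = {"XL": (1, 32), "8XL": (2, 100)}

def cluster_capacity_tb(node_type: str, nodes: int) -> int:
    lo, hi = NODE_LIMITS[node_type]
    if not lo <= nodes <= hi:
        raise ValueError(f"{node_type} clusters support {lo}-{hi} nodes")
    return NODE_TB[node_type] * nodes

print(cluster_capacity_tb("XL", 1))     # 2    (single node)
print(cluster_capacity_tb("XL", 32))    # 64   (largest XL cluster)
print(cluster_capacity_tb("8XL", 100))  # 1600 (the 1.6PB ceiling)
```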
Important Redshift Features
No Downtime Resize
Streaming Backup/Restore to S3
Automated point-in-time snapshotting
Workload Management
Support for VPC
Support for Encrypted Data Loads
Cluster SSL Only Communications
Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc.
Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.
Output Datanode: This supports all the same data sources as the input datanode, but they don't have to be the same type.
Application Services
Data Pipeline: Automatically provision EC2 & EMR resources
Manage dependencies & scheduling
Automatically retry and notify of success & failure
Sample Use Case
Input: RDS table
– Table: User-Demographics
– SQL precondition: "Select last_update from table" > #{YY-MM-DD}
Input: DynamoDB table
– Table: User-Event-Data-#{year-month}
Activity: EMR transform
– Hive query: user-metrics.hql
– Frequency: Daily
Output: S3 file
– Path: s3://trend-data/#{year-month-day}.csv
Notifications
– Success: [email protected]
– Failure: [email protected]
– Delay: [email protected]
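The sample above could be expressed as a Data Pipeline definition. The object types below (DynamoDBDataNode, HiveActivity, S3DataNode, SnsAlarm) are genuine Data Pipeline concepts, but the field names are simplified and the ARN, paths, and IDs are placeholders:

```python
import json

# A simplified sketch of the pipeline: DynamoDB events in, a daily
# Hive transform on EMR, CSV results out to S3, SNS on success.
pipeline = {
    "objects": [
        {"id": "UserEvents", "type": "DynamoDBDataNode",
         "tableName": "User-Event-Data-#{year-month}"},
        {"id": "Transform", "type": "HiveActivity",
         "scriptUri": "s3://scripts/user-metrics.hql",
         "input": {"ref": "UserEvents"},
         "output": {"ref": "TrendData"},
         "onSuccess": {"ref": "NotifyOps"}},
        {"id": "TrendData", "type": "S3DataNode",
         "filePath": "s3://trend-data/#{year-month-day}.csv"},
        {"id": "NotifyOps", "type": "SnsAlarm",
         "topicArn": "arn:aws:sns:region:account:analytics-notifications"},
    ]
}
print(json.dumps(pipeline, indent=2))
```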
Integrated Analytics
End User Reporting
Redshift | RDS | EMR