Analytics on AWS
IP Expo 2013
BIG DATA
When innovation is required
to collect, store, analyze, and
manage your data
VOLUME
VELOCITY
VARIETY
Customer Needs
• Store Any Amount of Data
– Without Capacity Planning
• Perform Complex Analysis on Any Data
– Scale on Demand
• Store Data Securely
• Decrease Time to Market
– Build Environments Quickly
• Reduce Costs
– Reduce Capital Expenditure
• Enable Global Reach
Ingestion | Integration
Elastic Block Store: High-performance block storage device
1GB to 1TB in size
Mount as drives to instances, with snapshot/cloning functionality
Availability: 99.99%
Durability: 99.999999999%
Paradigm: Object store; a web store, not a file system
No single points of failure; eventually consistent
Performance: Very fast
Redundancy: Across Availability Zones
Security: Public key / private key
Pricing: $0.095/GB/month (Dublin)
Typical use case: Write once, read many
Limits: 100 buckets, unlimited storage, 5TB objects
Simple Storage Service: Highly scalable object storage for the internet
1 byte to 5TB in size
99.999999999% durability
Peak Requests: 1.2 Million / Second
Objects stored in S3 (billions): Q4 2007: 14; Q4 2008: 40; Q4 2009: 102; Q4 2010: 262; Q4 2011: 762; Q4 2012: 1,300; Today: 2,100
Amazon S3 provides near linear scalability
S3 Streaming Performance (GB/second vs. reader connections)
100 VMs: 9.6GB/s at $26/hr
350 VMs: 28.7GB/s at $90/hr
34 seconds per terabyte
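The per-terabyte figure follows directly from the aggregate rate; a quick check, using decimal units (1TB = 1000GB) to match the slide's numbers:

```python
# At an aggregate 28.7 GB/s (the 350-VM configuration), streaming one
# terabyte takes roughly the quoted 34 seconds.
seconds_per_tb = 1000 / 28.7
print(round(seconds_per_tb, 1))  # 34.8
```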
Performance & Scalability
• Spotify is an online music
service offering instant access
to over 16 million licensed
songs
• Over 15 million active users
and 4 million paying
subscribers
• Spotify adds over 20,000 tracks
a day to its catalogue
Spotify uses Amazon S3 for Music Storage
"Amazon S3 gives us confidence in our ability to expand storage quickly while also providing high data durability."
-Emil Fredriksson, Operations Director for Spotify
Amazon Glacier: Long-term object archive
Extremely low cost per gigabyte
99.999999999% durability
Durability: 99.999999999%
Paradigm: Archive store; designed for archival, not a file system
Vaults & archives; 3-5 hour retrieval time
Performance: Configurable (low)
Redundancy: Across Availability Zones
Security: Public key / private key
Pricing: $0.011/GB/month
Typical use case: Write once, read infrequently (< 10% / month)
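At the quoted per-GB prices, the gap between the two tiers is easy to quantify; a quick sketch for a hypothetical 10TB archive:

```python
# Comparing the quoted prices: S3 in Dublin at $0.095/GB/month vs
# Glacier at $0.011/GB/month. The 10 TB archive size is illustrative.
archive_gb = 10_000
s3_cost = archive_gb * 0.095
glacier_cost = archive_gb * 0.011
print(round(s3_cost, 2), round(glacier_cost, 2))  # 950.0 110.0
print(f"Glacier is ~{s3_cost / glacier_cost:.0f}x cheaper")
```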
Simple Storage Service Highly scalable object storage
1 byte to 5TB in size
99.999999999% durability
Glacier Long term object archive
Extremely low cost per gigabyte
99.999999999% durability
Storage Lifecycle Integration
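The S3-to-Glacier handoff is driven by bucket lifecycle rules. A sketch of one such rule, shaped like an S3 lifecycle-configuration document; the bucket prefix and day counts are illustrative:

```python
import json

# One lifecycle rule tying the tiers together: objects land in S3,
# transition to Glacier after 30 days, and expire after a year.
lifecycle = {
    "Rules": [{
        "ID": "archive-then-expire",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 365},
    }]
}

print(json.dumps(lifecycle, indent=2))
```

With boto3 a dict of this shape could be passed to `put_bucket_lifecycle_configuration`; treat the exact call and field names as an assumption against the current API.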
Compute Storage
AWS Global Infrastructure
Database
App Services
Deployment & Administration
Networking
NoSQL Data Capture
DynamoDB Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with EMR & Hive
RDS | DynamoDB | Redshift
• Writes
– Acknowledged (committed) once they exist in at least two physical data centers
– Persisted to SSD
• Reads
– Tunable for application requirements
– No reduction in durability or consistency in order to achieve throughput
Dynamo Consistency
Eventually Consistent Read | Strongly Consistent Read
Stale value reads possible | No stale value reads
Highest throughput | Lower potential throughput
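The throughput trade-off can be made concrete with DynamoDB's provisioned-capacity arithmetic: one read capacity unit covers one strongly consistent read per second of an item up to 4KB, while an eventually consistent read consumes half a unit. A small sketch (the function name is ours):

```python
import math

def required_rcus(reads_per_second: int, item_size_kb: float,
                  strongly_consistent: bool) -> int:
    """Read capacity units needed for a given workload."""
    units_per_read = math.ceil(item_size_kb / 4)  # rounded up to 4 KB chunks
    rcus = reads_per_second * units_per_read
    if not strongly_consistent:
        rcus /= 2  # eventually consistent reads cost half
    return math.ceil(rcus)

# Example: 1,000 reads/sec of 6 KB items
print(required_rcus(1000, 6, strongly_consistent=True))   # 2000
print(required_rcus(1000, 6, strongly_consistent=False))  # 1000
```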
• Shazam connects more than 200
million people, in more than 200
countries and 33 languages, to the
music, TV shows and brands they love
• When customers hear a song or see a
TV program or ad they like, they simply
activate the app to “tag” it
• Shazam realized it could support over
500,000 writes per second with
DynamoDB
• Also using Amazon EMR for large-
scale data analysis that can require
more than 1 million writes per second
Shazam scaled DynamoDB to 500,000 IOPS for a
Super Bowl ad
"AWS gave us the ability to bring a massive amount of capacity online in a short period of time."
-Jason Titus, Shazam CTO
Complex Data Analysis …
Parallel ETL
Elastic MapReduce: Managed, elastic Hadoop cluster
Integrates with S3 & DynamoDB
Automated installation of Hive & Pig
Support for Spot Instances
Integrated HBase NoSQL database
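EMR runs standard Hadoop jobs, so any pair of programs that read stdin and write stdout can serve as a Hadoop Streaming step. A minimal word-count sketch, with the mapper and reducer modelled as plain functions so they can be tested locally; the sample data is illustrative:

```python
from collections import Counter

# Mapper step: emit "word<TAB>1" for every token, as a streaming
# mapper would write to stdout.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

# Reducer step: sum the counts per word. On a real cluster Hadoop
# sorts mapper output by key before the reducer runs; Counter stands
# in for that grouping here.
def reducer(pairs):
    counts = Counter()
    for pair in pairs:
        word, n = pair.split("\t")
        counts[word] += int(n)
    return dict(counts)

sample = ["hadoop on emr", "emr runs hadoop"]
print(reducer(mapper(sample)))  # {'hadoop': 2, 'on': 1, 'emr': 2, 'runs': 1}
```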
Application Services
Elastic MapReduce
EMR Data Sources
Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
Scenario #1: Cost without Spot
Job flow duration: 14 hours
4 instances * 14 hrs * $0.50 = $28
Scenario #2: Cost with Spot
Job flow duration: 7 hours
4 on-demand instances * 7 hrs * $0.50 = $14
5 Spot instances * 7 hrs * $0.25 = $8.75
Total = $22.75
Time savings: 50%
Cost savings: ~20%
Other EMR + Spot use cases: run the entire cluster on Spot for the biggest cost savings; reduce the cost of application testing
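The arithmetic behind the two scenarios, as a quick sketch; prices and instance counts are the slide's:

```python
# Each entry is (instance count, hours, price per instance-hour).
def cost(instances_hourly):
    return sum(n * h * p for n, h, p in instances_hourly)

scenario_1 = cost([(4, 14, 0.50)])               # on-demand only, 14 hours
scenario_2 = cost([(4, 7, 0.50), (5, 7, 0.25)])  # add 5 Spot nodes, 7 hours

print(scenario_1)  # 28.0
print(scenario_2)  # 22.75
print(f"time saved: {1 - 7/14:.0%}, cost saved: {1 - scenario_2/scenario_1:.0%}")
```

The exact cost saving is 18.75%, which the slide rounds to ~20%.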
Compute
Elastic Compute Cloud (EC2): Basic unit of compute capacity
Vertical scaling, from $0.02/hr
Range of CPU, memory & local disk options
13 instance types available, from micro to cluster compute
Feature | Details
Flexible: Run Windows or Linux distributions
Scalable: Wide range of instance types, from micro to cluster compute
Machine Images: Configurations can be saved as machine images (AMIs) from which new instances can be created
Full control: Full root or administrator rights
Secure: Full firewall control via Security Groups
Monitoring: Publishes metrics to CloudWatch
Inexpensive: On-demand, Reserved and Spot instance types
VM Import/Export: Import and export VM images to transfer configurations in and out of EC2
Cluster Compute
EC2 Instance: 2nd-generation cluster compute instance
Cluster Compute instances implement HVM process execution
Intel® Xeon® E5-2670 processors
10 Gigabit Ethernet
Per instance: 80 EC2 Compute Units, 60GB RAM, 3TB local disk
Network placement groups: Cluster instances deployed in a 'Placement Group' enjoy low-latency, full-bisection 10Gbps bandwidth
CC2 Instance Cluster: 240 TFLOPS, making it the 72nd fastest supercomputer in the world (#42 when announced at SC'11)
(Test performed Nov 2011; benchmark published June 2012)
Cluster GPU
EC2 instance: GPU compute instances with Intel® Xeon® X5570 processors
2 x NVIDIA Tesla "Fermi" M2050 GPUs
I/O performance: Very high (10 Gigabit Ethernet)
Per instance: 33.5 EC2 Compute Units, 20GB RAM, 2x NVIDIA GPUs at >400 cores each
S&P Capital IQ Uses AWS for Big Data Processing
Provides data to 4,200+ top global investment firms
Launched Hadoop faster, learned Hadoop faster
S3 -> Hadoop cluster
Structured Data Management
Structured Data Analysis
Relational Database Service: Managed Oracle, MySQL & SQL Server
DynamoDB: Managed NoSQL database
Amazon Redshift: Massively parallel, petabyte-scale data warehouse
Structured Data Analysis
Relational Database Service: Database-as-a-Service
No need to install or manage database instances
Scalable and fault-tolerant configurations
Integration with Data Pipeline
Structured Data Analysis
Redshift: Managed, massively parallel, petabyte-scale data warehouse
Streaming backup/restore to S3
Extensive security
2TB -> 1.6PB
Redshift parallelizes and distributes everything: query, load, backup, restore, resize
Common BI tools connect via JDBC/ODBC to a leader node, which distributes work across compute nodes joined by a 10GigE mesh
Redshift lets you start small and grow big
• Extra Large Node (XL): 3 spindles, 2TB, 15GiB RAM, 2 virtual cores, 10GigE
– Single node (2TB), or cluster of 2-32 nodes (4TB-64TB)
• Eight Extra Large Node (8XL): 24 spindles, 16TB, 120GiB RAM, 16 virtual cores, 10GigE
– Cluster of 2-100 nodes (32TB-1.6PB)
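The sizing options reduce to simple multiplication; a sketch with node specs and cluster limits taken from the slide (the function name is ours):

```python
# Storage per node and allowed cluster sizes, per the slide:
# XL = 2TB/node (single node, or clusters of 2-32);
# 8XL = 16TB/node (clusters of 2-100).
NODE_TB = {"XL": 2, "8XL": 16}
NODE_LIMITS = {"XL": (1, 32), "8XL": (2, 100)}

def cluster_capacity_tb(node_type: str, nodes: int) -> int:
    lo, hi = NODE_LIMITS[node_type]
    if not lo <= nodes <= hi:
        raise ValueError(f"{node_type} clusters support {lo}-{hi} nodes")
    return NODE_TB[node_type] * nodes

print(cluster_capacity_tb("XL", 1))     # 2    (single node)
print(cluster_capacity_tb("XL", 32))    # 64   (largest XL cluster)
print(cluster_capacity_tb("8XL", 100))  # 1600 (the 1.6PB ceiling)
```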
Important Redshift Features
No Downtime Resize
Streaming Backup/Restore to S3
Automated point-in-time snapshotting
Workload Management
Support for VPC
Support for Encrypted Data Loads
Cluster SSL Only Communications
Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc.
Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.
Output Datanode: This supports all the same data sources as the input datanode, but they don't have to be the same type.
Application Services
Data Pipeline: Automatically provision EC2 & EMR resources
Manage dependencies & scheduling
Automatically retry and notify of success & failure
Sample Use Case
Input: RDS table
– Table: User-Demographics
– SQL precondition: "Select last_update from table" > #{YY-MM-DD}
Input: DynamoDB table
– Table: User-Event-Data-#{year-month}
Activity: EMR transform
– Hive query: user-metrics.hql
– Frequency: Daily
Output: S3 file
– Path: s3://trend-data/#{year-month-day}.csv
Notifications
– Success: [email protected]
– Failure: [email protected]
– Delay: [email protected]
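The sample above could be expressed as a Data Pipeline definition. The object types below (DynamoDBDataNode, HiveActivity, S3DataNode, SnsAlarm) are genuine Data Pipeline concepts, but the field names are simplified and the ARN, paths, and IDs are placeholders:

```python
import json

# A simplified sketch of the pipeline: DynamoDB events in, a daily
# Hive transform on EMR, CSV results out to S3, SNS on success.
pipeline = {
    "objects": [
        {"id": "UserEvents", "type": "DynamoDBDataNode",
         "tableName": "User-Event-Data-#{year-month}"},
        {"id": "Transform", "type": "HiveActivity",
         "scriptUri": "s3://scripts/user-metrics.hql",
         "input": {"ref": "UserEvents"},
         "output": {"ref": "TrendData"},
         "onSuccess": {"ref": "NotifyOps"}},
        {"id": "TrendData", "type": "S3DataNode",
         "filePath": "s3://trend-data/#{year-month-day}.csv"},
        {"id": "NotifyOps", "type": "SnsAlarm",
         "topicArn": "arn:aws:sns:region:account:analytics-notifications"},
    ]
}
print(json.dumps(pipeline, indent=2))
```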
Integrated Analytics
End User Reporting
Redshift | RDS | EMR