©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Leveraging Amazon Redshift for Your Data Warehouse
Pavan Pothukuchi, Principal PM, Amazon Redshift
Dan Wagner, Founder & CEO, Civis Analytics
Aug 13, 2015
Analytics workflow
Generate Ingest Analyze
• Transactional
• Semi-structured
• Log data
• Sensor/IoT
Amazon S3
Amazon RDS
Amazon DynamoDB
Amazon Kinesis
Amazon Redshift
Amazon EMR
Data warehouse—challenges
Cost
Complexity
Performance
[Chart, 1990 to 2020: enterprise data vs. data in warehouse]
Petabyte scale; massively parallel
Relational data warehouse
Fully managed; zero admin
SSD and HDD platforms
As low as $1,000/TB/year
Amazon Redshift
a lot cheaper
a lot faster
a whole lot simpler
Clickstream analytics for Amazon.com
• Web log analysis for Amazon.com
– 1 PB+ workload, 2 TB/day, growing 67% YoY
– Largest table: 400 TB
• Understand customer behavior
• Solution
– Legacy DW: query across 1 week/hr.
– Hadoop: query across 1 month/hr.
Results with Amazon Redshift
• Query 15 months in 14 minutes
• Load 5 B rows in 10 minutes
• Join of 21 B rows with 10 B rows: 3 days to 2 hours (Hive → Amazon Redshift)
• Load pipeline: 90 hours to 8 hours (Oracle → Amazon Redshift)
• 100 node DS1.8XL
• Easy resizing
• Managed backups and restore
• Failure tolerance and recovery
• 20% time of one DBA
• Increased productivity
Common customer use cases
Traditional enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business
Companies with big data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW and SW costs by an order of magnitude
Amazon Redshift architecture
• Leader node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
• Two hardware platforms
– DS2: HDD; scale from 2 TB to 2 PB
– DC1: SSD; scale from 160 GB to 326 TB
[Architecture diagram labels: JDBC/ODBC; 10 GigE (HPC); ingestion, backup, restore]
Dramatically reduces I/O
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
 ID  | Age | State | Amount
-----+-----+-------+--------
 123 |  20 | CA    |    500
 345 |  25 | WA    |    250
 678 |  40 | FL    |    125
 957 |  37 | WA    |    375
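The sample table can be used to illustrate why a columnar layout reduces I/O. The sketch below is only a toy model in Python, not Redshift's storage engine: it lays the four rows out row-wise and column-wise and counts how many values a sum over Amount has to touch in each layout. The function and variable names are made up for this illustration.

```python
# Sample rows from the slide: (ID, Age, State, Amount)
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]

# Row layout: the values of one record sit next to each other.
row_store = [value for row in rows for value in row]

# Column layout: the values of one column sit next to each other.
names = ("id", "age", "state", "amount")
column_store = {name: [row[i] for row in rows] for i, name in enumerate(names)}


def total_amount_row_store(store, width=4, amount_index=3):
    """Sum Amount from the row layout; every stored value is scanned."""
    touched = 0
    total = 0
    for position, value in enumerate(store):
        touched += 1
        if position % width == amount_index:
            total += value
    return total, touched


def total_amount_column_store(store):
    """Sum Amount from the column layout; only one column is scanned."""
    values = store["amount"]
    return sum(values), len(values)


print(total_amount_row_store(row_store))        # (1250, 16)
print(total_amount_column_store(column_store))  # (1250, 4)
```

Both layouts give the same total, but the column layout reads 4 values instead of 16.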
Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
• COPY compresses automatically
• You can analyze and override
• Average 2–4x
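To give a feel for why an encoding such as delta helps on a column like listid, here is a minimal Python sketch of the general idea: keep the first value, then store only the difference from the previous value. It is an illustration of the technique, not Redshift's on-disk format, and the sample listid values are hypothetical.

```python
def delta_encode(values):
    """Keep the first value, then store each difference from the previous value."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded):
    """Rebuild the original values by accumulating the differences."""
    values = []
    running = 0
    for i, item in enumerate(encoded):
        running = item if i == 0 else running + item
        values.append(running)
    return values


list_ids = [100001, 100002, 100004, 100007, 100008]  # hypothetical listid values
encoded = delta_encode(list_ids)
print(encoded)  # [100001, 1, 2, 3, 1]
```

For slowly increasing IDs the differences are small numbers that fit in far fewer bytes than the raw values.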
Dramatically reduces I/O
• Column storage
• Data compression
• Direct-attached storage
• Large data block sizes
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain the data needed for a given query
• Minimize unnecessary I/O
Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (min 10, max 324)
Block 2: 375 | 393 | 417 | … | 512 | 549 | 623 (min 375, max 623)
Block 3: 637 | 712 | 809 | … | 834 | 921 | 959 (min 637, max 959)
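The block-skipping idea can be sketched in a few lines of Python. This is a simplified model, not Redshift's implementation: the three blocks hold only the values visible on the slide (the elided ones are left out), and `scan_range` is a hypothetical helper that reads a block only when its min/max range overlaps the query range.

```python
# Values visible on the slide; elided values are omitted.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]

# Zone map: one (min, max) pair per block.
zone_map = [(min(block), max(block)) for block in blocks]


def scan_range(low, high):
    """Return the matching values and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for (block_min, block_max), block in zip(zone_map, blocks):
        if block_max < low or block_min > high:
            continue  # the whole block lies outside the range, so skip it
        blocks_read += 1
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read


print(zone_map)              # [(10, 324), (375, 623), (637, 959)]
print(scan_range(400, 600))  # ([417, 512, 549], 1)
```

A query for values between 400 and 600 reads one block out of three; the other two are ruled out by their min/max entries alone.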
Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
• Use direct-attached storage to maximize throughput
• Hardware optimized for high performance data processing
• Large block sizes to make the most of each read
Built-in security
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks encrypted on disks and in Amazon S3
– HSM support
• No direct access to compute nodes
• Audit logging and AWS CloudTrail integration
• Amazon VPC support
• SOC 1/2/3, PCI-DSS Level 1, FedRAMP, others
[Network diagram labels: Customer VPC; Internal VPC; JDBC/ODBC; 10 GigE (HPC); ingestion, backup, restore]
1/10 the price of a traditional data warehouse
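 DS2 (HDD)          | Price per hour, DW1.XL single node | Effective annual price per TB (compressed)
--------------------+------------------------------------+--------------------------------------------
 On-Demand          | $0.850                             | $3,725
 1 Year Reservation | $0.500                             | $2,190
 3 Year Reservation | $0.228                             | $999

 DC1 (SSD)          | Price per hour, DW2.L single node  | Effective annual price per TB (compressed)
--------------------+------------------------------------+--------------------------------------------
 On-Demand          | $0.250                             | $13,690
 1 Year Reservation | $0.161                             | $8,795
 3 Year Reservation | $0.100                             | $5,500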
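The annual per-TB figures can be roughly reproduced from the hourly prices. The sketch below assumes a node runs all 8,760 hours of a year and that a single node stores 2 TB on the HDD platform and 160 GB on the SSD platform; those sizes are an assumption taken from the smallest cluster sizes on the architecture slide, not something the pricing slide states. The slide's figures appear to be rounded, so the results agree only to within about one percent.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def annual_price_per_tb(hourly_price, node_storage_tb):
    """Hourly node price converted to a yearly price per TB of node storage."""
    return hourly_price * HOURS_PER_YEAR / node_storage_tb


# (label, hourly price, assumed node storage in TB, annual figure on the slide)
price_rows = [
    ("HDD on-demand", 0.850, 2.0, 3725),
    ("HDD 1-year", 0.500, 2.0, 2190),
    ("HDD 3-year", 0.228, 2.0, 999),
    ("SSD on-demand", 0.250, 0.16, 13690),
    ("SSD 1-year", 0.161, 0.16, 8795),
    ("SSD 3-year", 0.100, 0.16, 5500),
]

for label, hourly, storage_tb, slide_value in price_rows:
    computed = annual_price_per_tb(hourly, storage_tb)
    print(f"{label}: computed ${computed:,.0f}, slide ${slide_value:,}")
```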
DS2—50% more performance, same cost
• 2 x the memory and compute power of DS1
• Enhanced networking and 1.5 x gain in disk throughput
• 40% to 60% performance gain over DS1
• Same cost
• Restore from snapshot to migrate from DS1 to DS2
Custom ODBC and JDBC drivers
• Up to 35% higher performance than open source drivers
• Supported by Informatica, MicroStrategy, Pentaho, Qlik, SAS, Tableau
• Will continue to support PostgreSQL open-source drivers
• Download drivers from console
User-defined functions (upcoming)
• Define your own functions
• Python 2.7
– Syntax is largely identical to PostgreSQL UDF syntax
– System and network calls within UDFs are prohibited
• Pandas, NumPy, and SciPy pre-installed
– Import your own libraries for even more flexibility
Scalar UDF example—URL parsing
CREATE FUNCTION f_hostname (url VARCHAR)
RETURNS VARCHAR
IMMUTABLE AS $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
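Because the UDF body is ordinary Python, its logic can be tried locally before the function is created. The sketch below is a Python 3 equivalent, where the Python 2 `urlparse` module lives at `urllib.parse`; the sample URLs are only examples.

```python
from urllib.parse import urlparse  # Python 3 location of the Python 2 urlparse module


def f_hostname(url):
    """Same logic as the UDF body: return the host part of a URL."""
    return urlparse(url).hostname


print(f_hostname("http://aws.amazon.com/redshift/"))  # aws.amazon.com
print(f_hostname("not a url"))                        # None
```

Input without a scheme and host yields `None`, which is worth handling in queries that call the function.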
Interleaved multi-column sort
• Compound sort keys
– Optimized for filtering data by one leading column
• Interleaved sort keys
– Optimized for filtering data by up to eight columns
– No storage overhead unlike an index
– Lower maintenance penalty compared to indexes
Compound sort keys illustrated
• Records in Amazon Redshift are stored in blocks.
• For this illustration, let’s assume that four records fill a block.
• Records with a given cust_id are all in one block.
• However, records with a given prod_id are spread across four blocks.
[Figure: table with columns cust_id, prod_id, other columns, and blocks, sorted by cust_id and then prod_id, next to a grid of [cust_id, prod_id] pairs]

 cust_id \ prod_id |   1   |   2   |   3   |   4
-------------------+-------+-------+-------+-------
         1         | [1,1] | [1,2] | [1,3] | [1,4]
         2         | [2,1] | [2,2] | [2,3] | [2,4]
         3         | [3,1] | [3,2] | [3,3] | [3,4]
         4         | [4,1] | [4,2] | [4,3] | [4,4]
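The compound-key illustration can be reproduced with a small Python sketch. It is a toy model using the slide's assumption of four records per block: it builds the 16 [cust_id, prod_id] records, sorts them on the compound key (cust_id, prod_id), and counts how many blocks a filter on each column touches. The helper names are made up for this example.

```python
from itertools import product

RECORDS_PER_BLOCK = 4  # the slide's assumption

# 16 records: every (cust_id, prod_id) pair for ids 1..4
records = list(product(range(1, 5), repeat=2))


def to_blocks(sorted_records, size=RECORDS_PER_BLOCK):
    """Cut the sorted records into fixed-size blocks."""
    return [sorted_records[i:i + size] for i in range(0, len(sorted_records), size)]


def blocks_touched(blocks, column, value):
    """Count blocks holding at least one record whose column equals value."""
    return sum(any(record[column] == value for record in block) for block in blocks)


# Compound sort key (cust_id, prod_id): plain tuple ordering.
compound_blocks = to_blocks(sorted(records))

print(blocks_touched(compound_blocks, 0, 1))  # cust_id = 1 -> 1
print(blocks_touched(compound_blocks, 1, 1))  # prod_id = 1 -> 4
```

Filtering on the leading column reads one block, while filtering on the second column reads all four.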
Interleaved sort keys illustrated
• Records with a given cust_id are spread across two blocks.
• Records with a given prod_id are also spread across two blocks.
• Data is sorted in equal measures for both keys.
[Figure: table with columns cust_id, prod_id, other columns, and blocks, with records in interleaved sort order]
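One way to obtain the "two blocks for either key" property is to order records along a Z-order (Morton) curve, which interleaves the bits of both key values. The sketch below uses that technique on the same 16 records with four records per block; treating it as a model of Redshift's internal ordering is an assumption, and the helper names are hypothetical.

```python
def interleave_bits(a, b, bits=2):
    """Interleave the bits of a and b into one sort code (Z-order / Morton code)."""
    code = 0
    for i in range(bits):
        code |= ((a >> i) & 1) << (2 * i + 1)
        code |= ((b >> i) & 1) << (2 * i)
    return code


def blocks_touched(blocks, column, value):
    """Count blocks holding at least one record whose column equals value."""
    return sum(any(record[column] == value for record in block) for block in blocks)


# 16 records: every (cust_id, prod_id) pair for ids 1..4
records = [(cust, prod) for cust in range(1, 5) for prod in range(1, 5)]

# Sort on the interleaved code of the zero-based ids, then cut into blocks of 4.
ordered = sorted(records, key=lambda r: interleave_bits(r[0] - 1, r[1] - 1))
interleaved_blocks = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]

print(blocks_touched(interleaved_blocks, 0, 1))  # cust_id = 1 -> 2
print(blocks_touched(interleaved_blocks, 1, 1))  # prod_id = 1 -> 2
```

Each block ends up holding a 2 x 2 patch of the grid, so a filter on either column reads two blocks instead of one or four.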
How to use the feature
• Use ‘INTERLEAVED’ keyword with SORTKEY
– Existing syntax will still work and behavior is unchanged
• No change needed to queries
• Benefits are significant
[ [ COMPOUND | INTERLEAVED ] SORTKEY ( column_name [, ...] ) ]
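As a small illustration, the Python helper below assembles a sort key clause for a CREATE TABLE statement. It uses the keyword order from the Redshift CREATE TABLE documentation (for example `INTERLEAVED SORTKEY (a, b)`) and enforces the eight-column limit mentioned earlier for interleaved keys. The function name and the column names are hypothetical.

```python
MAX_INTERLEAVED_COLUMNS = 8  # "up to eight columns" for interleaved sort keys


def sortkey_clause(columns, style=None):
    """Build a sort key clause; style is None, 'COMPOUND', or 'INTERLEAVED'."""
    if not columns:
        raise ValueError("at least one column is required")
    if style not in (None, "COMPOUND", "INTERLEAVED"):
        raise ValueError("style must be COMPOUND, INTERLEAVED, or None")
    if style == "INTERLEAVED" and len(columns) > MAX_INTERLEAVED_COLUMNS:
        raise ValueError("interleaved sort keys take at most eight columns")
    prefix = f"{style} " if style else ""
    return f"{prefix}SORTKEY ({', '.join(columns)})"


print(sortkey_clause(["cust_id", "prod_id"], "INTERLEAVED"))
# INTERLEAVED SORTKEY (cust_id, prod_id)
print(sortkey_clause(["cust_id"]))
# SORTKEY (cust_id)
```

Omitting the style keeps the existing syntax, which matches the slide's note that current behavior is unchanged.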
Introducing Civis:
The End-to-End Data Science
Platform Built on the Amazon Cloud
Dan Wagner, Founder & CEO—Civis Analytics
How We Leverage AWS
End-to-end data science process was the secret to driving precise, correct action at scale
• Share: insights across organization
• Unify: data in one place
• Measure
• Match: records into a single view
• Predict: outcomes to drive decisions
• Explore: data in a fast database
• Drive Action: right actions based on data
“Wagner!—What the hell is the
Vertica and why does it not work?”
2012 Obama campaign manager
(said multiple, multiple times…)
“I go downstairs and I ask for a basic report
showing daily telephone plan attrition by region,
by customer type, by acquisition channel.
And then I wait… and wait… and wait.
Thirty days later, I get a spreadsheet
on my desk for last month’s attrition.
This is honestly killing me.”
CEO of global telecommunications firm
Conventional IT infrastructure serves foundational business needs
Data Sources
Integration Layer
Analytics Applications Users
But data scientists, a new entrant, need their own tools to support decision-making…
And they need all the data to help facilitate decisions across the organization…
The universal tragedy of big data:
The science is there; you just can’t plug it in
And they need all the data to help facilitate decisions across the organization…
• A lot of latency
• Distracting custom extracts
• Limited scale
• $$$$$
• Uncoordinated unification
The challenge: Data scientists have unique end-to-end analytical and computational needs
• Share: insights across organization
• Unify: data in one place
• Measure
• Match: records into a single view
• Predict: outcomes to drive decisions
• Explore: data in a fast database
• Drive Action: right actions based on data
• Intuitive: built for data scientists
• Expandable: capacity can grow as needed
• Extensible: customizable for developers
• End-to-End: from ETL to modeling to action
• Right Cost: for any organization
The easy-to-use, end-to-end, powerful & flexible
data science platform in the Amazon cloud
Introducing Civis
Our Lego pieces
• Amazon S3
• Amazon EC2
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB
• Amazon RDS
Amazon Redshift is at the core of our platform
Technical value
• Expand with minimal interruption
• Multiple hardware options for diverse needs/costs
• As performant as Hadoop with any size data…
• …and way easier to set up
Analyst value
• Facilitates exploration: no MapReduce penalty
• Stays stable with lots of users
• Storage limit consequences are low
• Use industry-standard SQL
Our pirate ship
[Diagram: the stages Unify, Match, Explore, Predict, Measure, Drive Action, and Share, labeled with Amazon Redshift, DynamoDB, S3, EMR, and EC2]
Report out on results
5: Share & Drive Action
[Diagram services: EC2, S3, EMR, Amazon Redshift, DynamoDB, RDS]
What it has to be: Amazon Redshift variable pricing
eliminates up-front capital costs for data science IT
[Bar chart: May 2013 estimated starting cost for Amazon Redshift vs. HP Vertica; y-axis from $0 to $300,000]
And minimizes expansion time vs. on-premises hardware
(based on our experience)
[Bar chart: estimated hours required to double the size of our cluster. Amazon Redshift: 3; Vertica: 850]
Case client expanded from 5 users to 300 in
5 months, hitting 100K jobs during the period
[Chart: 2014 Client Usage. Total users (axis 0 to 500) and total job runs (axis 0 to 100,000), 5/24 through 10/21]
Totals
12 regions
305 users
161 reports posted
98,449 job runs by region
19,420 job runs by Civis
Civis went through 4 major Amazon Redshift resizes and
scaled linearly in DynamoDB/EMR with no downtime
Extensibility: each module callable through
public-facing API for data science engineers
“We would be a year behind where we
are now without Civis and the AWS
services it runs on. What we get are
Fortune-X-class data visualization,
modeling and data manipulation tools.”
CEO of clean energy startup