Leveraging Amazon Redshift for Your Data Warehouse

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Pavan Pothukuchi, Principal PM, Amazon Redshift

Dan Wagner, Founder & CEO, Civis Analytics

What is Amazon Redshift?

Analytics workflow

Generate Ingest Analyze

• Transactional

• Semi-structured

• Log data

• Sensor/IoT

Amazon S3

Amazon RDS

Amazon DynamoDB

Amazon Kinesis

Amazon Redshift

Amazon EMR

Data warehouse—challenges

Cost

Complexity

Performance

[Chart: enterprise data vs. data in the warehouse, 1990 to 2020]

Petabyte scale; massively parallel

Relational data warehouse

Fully managed; zero admin

SSD and HDD platforms

As low as $1,000/TB/year

Amazon Redshift

a lot cheaper

a lot faster

a whole lot simpler

Clickstream analytics for Amazon.com

• Web log analysis for Amazon.com

– 1 PB+ workload, 2 TB/day, growing 67% YoY

– Largest table: 400 TB

• Understand customer behavior

• Solution

– Legacy DW: query across 1 week/hr.

– Hadoop: query across 1 month/hr.

Results with Amazon Redshift

• Query 15 months in 14 minutes

• Load 5 B rows in 10 minutes

• Join of 21 B rows with 10 B rows: 3 days to 2 hours (Hive → Amazon Redshift)

• Load pipeline: 90 hours to 8 hours (Oracle → Amazon Redshift)

• 100 node DS1.8XL

• Easy resizing

• Managed backups and restore

• Failure tolerance and recovery

• 20% time of one DBA

• Increased productivity

Who Uses Amazon Redshift?

Common customer use cases

Traditional enterprise DW

• Reduce costs by extending DW rather than adding HW

• Migrate completely from existing DW systems

• Respond faster to business

Companies with big data

• Improve performance by an order of magnitude

• Make more data available for analysis

• Access business data via standard reporting tools

SaaS companies

• Add analytic functionality to applications

• Scale DW capacity as demand grows

• Reduce HW and SW costs by an order of magnitude

Selected Amazon Redshift customers

Amazon Redshift Partners

Amazon Redshift architecture

• Leader node

– SQL endpoint

– Stores metadata

– Coordinates query execution

• Compute nodes

– Local, columnar storage

– Execute queries in parallel

– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH

• Two hardware platforms

– DS2: HDD; scale from 2 TB to 2 PB

– DC1: SSD; scale from 160 GB to 326 TB

[Diagram labels: JDBC/ODBC; 10 GigE (HPC); ingestion, backup, restore]

Dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
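As a rough illustration of why column storage cuts I/O, the sketch below stores the slide's four-row table both ways and counts how many values must be touched to sum one column. The variable names and the "values read" counters are illustrative assumptions, not a description of how Amazon Redshift lays out data on disk.

```python
# The four rows from the slide: (ID, Age, State, Amount).
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]
names = ("id", "age", "state", "amount")

# Column layout: one contiguous list per column instead of one tuple per row.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# Row layout: summing one column still walks every field of every row.
values_read_row_store = sum(len(row) for row in rows)

# Column layout: only the "amount" column is touched.
values_read_column_store = len(columns["amount"])

total = sum(columns["amount"])
print(total, values_read_row_store, values_read_column_store)  # 1250 16 4
```

With four columns the saving is a factor of four; on wide tables where a query reads only a few columns, the gap grows accordingly.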



analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw

• COPY compresses automatically

• You can analyze and override

• Average compression ratio: 2–4x
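To give a feel for what an encoding such as `delta` buys, here is a minimal sketch of the general idea of delta encoding: keep the first value and then store only the differences. The sample `listids` values are hypothetical, and this is not Amazon Redshift's on-disk format.

```python
def delta_encode(values):
    """Keep the first value, then store each value as a difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


listids = [100001, 100002, 100003, 100005, 100009]  # hypothetical sample values
encoded = delta_encode(listids)
print(encoded)  # [100001, 1, 1, 2, 4]
assert delta_decode(encoded) == listids
```

Small differences fit in far fewer bits than the full values, which is why slowly increasing columns such as IDs tend to compress well this way.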






• Track the minimum and maximum value for each block

• Skip over blocks that don’t contain the data needed for a given query

• Minimize unnecessary I/O

Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (min 10, max 324)

Block 2: 375 | 393 | 417 | … | 512 | 549 | 623 (min 375, max 623)

Block 3: 637 | 712 | 809 | … | 834 | 921 | 959 (min 637, max 959)
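The block-skipping idea can be sketched in a few lines. The block contents below are the values visible on the slide (the elided values are simply left out); the `scan_between` function and its return shape are illustrative assumptions, not an Amazon Redshift API.

```python
# Three sorted blocks, using the values shown on the slide.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]

# Zone map: one (min, max) pair per block, kept separately from the data.
zone_map = [(min(b), max(b)) for b in blocks]


def scan_between(low, high):
    """Return matching values and the indexes of the blocks actually read."""
    hits, blocks_read = [], []
    for i, (block_min, block_max) in enumerate(zone_map):
        if block_max < low or block_min > high:
            continue  # the range cannot overlap this block, so skip the read
        blocks_read.append(i)
        hits.extend(v for v in blocks[i] if low <= v <= high)
    return hits, blocks_read


print(scan_between(400, 600))  # ([417, 512, 549], [1])
```

Only the middle block is read for this range; the other two are ruled out from their minimum and maximum alone.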





• Use direct-attached storage to maximize throughput

• Hardware optimized for high-performance data processing

• Large block sizes to make the most of each read

Parallelizes and distributes operations

• Query

• Load

• Backup/Restore

• Resize

Built-in security

• SSL to secure data in transit

• Encryption to secure data at rest

– AES-256; hardware accelerated

– All blocks encrypted on disks and in Amazon S3

– HSM support

• No direct access to compute nodes

• Audit logging and AWS CloudTrail integration

• Amazon VPC support

• SOC 1/2/3, PCI-DSS Level 1, FedRAMP, others

[Diagram labels: Customer VPC; Internal VPC; JDBC/ODBC; 10 GigE (HPC); ingestion, backup, restore]

1/10 the price of a traditional data warehouse

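DS2 (HDD)          | Price per hour for DW1.XL single node | Effective annual price per TB (compressed)
-------------------+---------------------------------------+-------------------------------------------
On-Demand          | $0.850                                | $3,725
1 Year Reservation | $0.500                                | $2,190
3 Year Reservation | $0.228                                | $999

DC1 (SSD)          | Price per hour for DW2.L single node  | Effective annual price per TB (compressed)
-------------------+---------------------------------------+-------------------------------------------
On-Demand          | $0.250                                | $13,690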
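The effective annual figures appear to follow from the hourly price, the hours in a year, and the storage on one node. The sketch below assumes node capacities of 2 TB (HDD) and 0.16 TB (SSD), read off the "scale from 2 TB" and "scale from 160 GB" lower bounds on the architecture slide; both the capacities and the formula are assumptions rather than a published pricing method. Under those assumptions the results land within a few dollars of the table.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; ignores leap years


def effective_annual_price_per_tb(price_per_hour, node_storage_tb):
    """Annual cost of one node divided by the storage that node provides."""
    return price_per_hour * HOURS_PER_YEAR / node_storage_tb


# Hourly prices from the table; node capacities are assumptions (see above).
hdd_rates = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}
for plan, hourly in hdd_rates.items():
    print("HDD", plan, round(effective_annual_price_per_tb(hourly, 2.0)))

print("SSD On-Demand", round(effective_annual_price_per_tb(0.250, 0.16)))
```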



Expanding Amazon Redshift Functionality

DS2—50% more performance, same cost

• 2 x the memory and compute power of DS1

• Enhanced networking and 1.5 x gain in disk throughput

• 40% to 60% performance gain over DS1

• Same cost

• Restore from snapshot to migrate from DS1 to DS2

Custom ODBC and JDBC drivers

• Up to 35% higher performance than open-source drivers

• Supported by Informatica, MicroStrategy, Pentaho, Qlik, SAS, Tableau

• Will continue to support PostgreSQL open-source drivers

• Download drivers from console

Explain plan visualization

User-defined functions (upcoming)

• Define your own functions

• Python 2.7

– Syntax is largely identical to PostgreSQL UDF syntax

– System and network calls within UDFs are prohibited

• Pandas, NumPy, and SciPy pre-installed

– Import your own libraries for even more flexibility

Scalar UDF example—URL parsing

-- Scalar UDF: return the hostname part of a URL (argument name precedes its type)
CREATE FUNCTION f_hostname (url VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
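Before registering a UDF, it can help to try the function body locally. The sketch below is a Python 3 equivalent of the body above; it is an assumption for local experimentation only, since the UDF itself runs on Python 2.7, where the module is named `urlparse` rather than `urllib.parse`. The sample URL is hypothetical.

```python
from urllib.parse import urlparse  # Python 3 home of the Python 2.7 `urlparse` module


def f_hostname(url):
    """Return the host part of a URL, or None when the URL has no host."""
    return urlparse(url).hostname


print(f_hostname("http://www.example.com/path?x=1"))  # www.example.com
```

Note that `hostname` is lowercased and excludes any port, so `https://Example.COM:8080/a` yields `example.com`.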

Interleaved multi-column sort

• Compound sort keys

– Optimized for filtering data by one leading column

• Interleaved sort keys

– Optimized for filtering data by up to eight columns

– No storage overhead unlike an index

– Lower maintenance penalty compared to indexes

Compound sort keys illustrated

• Records in Amazon Redshift are stored in blocks.

• For this illustration, let’s assume that four records fill a block.

• Records with a given cust_id are all in one block.

• However, records with a given prod_id are spread across four blocks.

[Figure: records shown as [cust_id, prod_id] pairs, one row per cust_id and one column per prod_id, next to a table with columns cust_id, prod_id, other columns, and blocks]

cust_id
1         [1,1] [1,2] [1,3] [1,4]
2         [2,1] [2,2] [2,3] [2,4]
3         [3,1] [3,2] [3,3] [3,4]
4         [4,1] [4,2] [4,3] [4,4]
prod_id     1     2     3     4

Interleaved sort keys illustrated

• Records with a given cust_id are spread across two blocks.

• Records with a given prod_id are also spread across two blocks.

• Data is sorted in equal measures for both keys.

[Figure: table with columns cust_id, prod_id, other columns, and blocks, showing the records in interleaved sort order]
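The block counts quoted on the two sort-key slides can be reproduced with a small simulation: 16 records with cust_id and prod_id each running from 1 to 4, and four records per block. The interleaved order here is produced with a Z-order (bit-interleaving) key, which is a stand-in chosen for illustration; the slides do not say which curve Amazon Redshift uses internally. All function names are hypothetical.

```python
from itertools import product

BLOCK_SIZE = 4  # the slides assume four records fill a block


def blocks_touched(records, column):
    """Map each value of `column` to the set of block numbers that hold it."""
    touched = {}
    for position, record in enumerate(records):
        touched.setdefault(record[column], set()).add(position // BLOCK_SIZE)
    return touched


def z_order(record):
    """Interleave the bits of (cust_id - 1) and (prod_id - 1), two bits each."""
    c, p = record[0] - 1, record[1] - 1
    key = 0
    for bit in (1, 0):  # most significant bit first
        key = (key << 1) | ((c >> bit) & 1)
        key = (key << 1) | ((p >> bit) & 1)
    return key


def max_blocks(order, column):
    """Worst-case number of blocks read to fetch one value of `column`."""
    return max(len(b) for b in blocks_touched(order, column).values())


records = list(product(range(1, 5), repeat=2))  # (cust_id, prod_id) pairs

compound = sorted(records)                  # cust_id first, then prod_id
interleaved = sorted(records, key=z_order)  # equal weight to both columns

print(max_blocks(compound, 0), max_blocks(compound, 1))        # 1 4
print(max_blocks(interleaved, 0), max_blocks(interleaved, 1))  # 2 2
```

The compound order is ideal for the leading column and worst for the second; the interleaved order gives up a little on the first column to make filters on either column equally cheap.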

How to use the feature

• Use ‘INTERLEAVED’ keyword with SORTKEY

– Existing syntax will still work and behavior is unchanged

• No change needed to queries

• Benefits are significant

[ SORTKEY [ COMPOUND | INTERLEAVED ] ( column_name [, ...] ) ]

Amazon Redshift

Spend time with your data, not your database…


Introducing Civis: The End-to-End Data Science Platform Built on the Amazon Cloud

Dan Wagner, Founder & CEO, Civis Analytics

How We Leverage AWS

WE ARE:

95 Employees

Software & Services

Offices in Chicago & D.C.

WE STARTED in “The Cave”

Obama Simulated Win Projection by State, November 3, 2012

End-to-end data science process was the secret to driving precise, correct action at scale

• Unify: Data in one place

• Match: Records into a single view

• Explore: Data in a fast database

• Predict: Outcomes to drive decisions

• Measure

• Drive Action: Right actions based on data

• Share: Insights across organization

“Wagner! What the hell is the Vertica and why does it not work?”

2012 Obama campaign manager

(said multiple, multiple times…)

And then we move into the real world…

…(completely broke by the way)…

…and we see the same problems

“I go downstairs and I ask for a basic report showing daily telephone plan attrition by region, by customer type, by acquisition channel. And then I wait… and wait… and wait. Thirty days later, I get a spreadsheet on my desk for last month’s attrition. This is honestly killing me.”

CEO of global telecommunications firm


“Uhhhh…I think I blew up Postgres…”

Data scientist at online publisher

Conventional IT infrastructure serves foundational business needs

[Diagram labels: Data Sources; Integration Layer; Analytics; Applications; Users]

But data scientists, a new entrant, need their own tools to support decision-making…


And they need all the data to help facilitate decisions across the organization…


The universal tragedy of big data:

The science is there; you just can’t plug it in

And they need all the data to help facilitate decisions across the organization…

• A lot of latency

• Distracting custom extracts

• Limited scale

• $$$$$

• Uncoordinated unification

The challenge: Data scientists have unique end-to-end analytical and computational needs

• Intuitive: Built for data scientists

• Expandable: Capacity can grow as needed

• Extensible: Customizable for developers

• End-to-End: From ETL to Modeling to Action

• Right Cost: For any organization

Our Solution

The easy-to-use, end-to-end, powerful & flexible data science platform in the Amazon cloud

Introducing Civis


Our Lego pieces

• Amazon S3

• Amazon EC2

• Amazon EMR

• Amazon Redshift

• Amazon DynamoDB

• Amazon RDS

Amazon Redshift is at the core of our platform

Technical Value

• Expand with minimal interruption

• Multiple hardware options for diverse needs/costs

• As performant as Hadoop with any size data…

• …and way easier to set up

Analyst Value

• Facilitates exploration: no MapReduce penalty

• Stays stable with lots of users

• Storage limit consequences are low

• Use industry-standard SQL

Our pirate ship

[Diagram: the process steps Unify, Match, Explore, Predict, Measure, Drive Action, and Share, shown with Amazon Redshift, DynamoDB, S3, EMR, and EC2]

1. Unify: Get data together in one place (EC2, S3)

2. Match: Match people between datasets (EC2, EMR, DynamoDB)

3. Explore: Transform and explore (S3, Amazon Redshift, DynamoDB)

4. Predict: Build predictions with modeling (S3, EMR, Amazon Redshift)

5. Share & Drive Action: Report out on results (EC2, S3, EMR, Amazon Redshift, DynamoDB, RDS)

What it has to be: Amazon Redshift variable pricing eliminates up-front capital costs for data science IT

[Bar chart: May 2013: Estimated Starting Cost for Amazon Redshift vs. HP Vertica; y-axis from $0 to $300,000]

And minimizes expansion time vs. on-premises hardware (based on our experience)

[Bar chart: Estimated hours required to double the size of our cluster. Amazon Redshift: 3; Vertica: 850]

Case client expanded from 5 users to 300 in 5 months, hitting 100K jobs during the period

[Chart: 2014 Client Usage, total users and total job runs, 5/24 to 10/21]

Totals

• 12 regions

• 305 users

• 161 reports posted

• 98,449 job runs by region

• 19,420 job runs by Civis

Civis went through 4 major Amazon Redshift resizes and scaled linearly in DynamoDB/EMR with no downtime

Extensibility: each module callable through public-facing API for data science engineers


“We would be a year behind where we are now without Civis and the AWS services it runs on. What we get are Fortune-X-class data visualization, modeling and data manipulation tools.”

CEO of clean energy startup

Your Feedback is Important to AWS

Please complete the session evaluation. Tell us what you think!

CHICAGO

