©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Leveraging Amazon Redshift for Your Data Warehouse Pavan Pothukuchi—Principal PM, Amazon Redshift Dan Wagner—Founder & CEO, Civis Analytics
Page 1: Leveraging Amazon Redshift for Your Data Warehouse

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Leveraging Amazon Redshift for Your Data Warehouse

Pavan Pothukuchi, Principal PM, Amazon Redshift

Dan Wagner, Founder & CEO, Civis Analytics

Page 2: Leveraging Amazon Redshift for Your Data Warehouse

What is Amazon Redshift?

Page 3: Leveraging Amazon Redshift for Your Data Warehouse

Analytics workflow

Generate Ingest Analyze

• Transactional

• Semi-structured

• Log data

• Sensor/IoT

Amazon S3

Amazon RDS

Amazon DynamoDB

Amazon Kinesis

Amazon Redshift

Amazon EMR

Page 4: Leveraging Amazon Redshift for Your Data Warehouse

Data warehouse—challenges

Cost

Complexity

Performance

[Chart: enterprise data vs. data in warehouse, 1990 to 2020]

Page 5: Leveraging Amazon Redshift for Your Data Warehouse

Petabyte scale; massively parallel

Relational data warehouse

Fully managed; zero admin

SSD and HDD platforms

As low as $1,000/TB/year

Amazon Redshift

a lot cheaper

a lot faster

a whole lot simpler

Page 6: Leveraging Amazon Redshift for Your Data Warehouse

Clickstream analytics for Amazon.com

• Web log analysis for Amazon.com

– 1 PB+ workload, 2 TB/day @ 67% YoY

– Largest table: 400 TB

• Understand customer behavior

• Solution

– Legacy DW: query across 1 week/hr.

– Hadoop: query across 1 month/hr.

Page 7: Leveraging Amazon Redshift for Your Data Warehouse

Results with Amazon Redshift

• Query 15 months in 14 minutes

• Load 5 B rows in 10 minutes

• 21 B w/ 10 B rows: 3 days to 2 hours (Hive to Amazon Redshift)

• Load pipeline: 90 hours to 8 hours (Oracle to Amazon Redshift)

• 100 node DS1.8XL

• Easy resizing

• Managed backups and restore

• Failure tolerance and recovery

• 20% time of one DBA

• Increased productivity
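
The before-and-after figures on this slide reduce to simple ratios. A minimal sketch of that arithmetic, using only the numbers quoted above; the variable names are illustrative and "B" is assumed to mean billion:

```python
# Figures quoted on the slide; "B" is assumed to mean billion.
rows_loaded = 5_000_000_000
load_minutes = 10

# Sustained load rate implied by "5 B rows in 10 minutes".
rows_per_second = rows_loaded / (load_minutes * 60)

# Speedup of the 21 B w/ 10 B row workload: 3 days down to 2 hours.
hive_speedup = (3 * 24) / 2

# Speedup of the load pipeline: 90 hours down to 8 hours.
pipeline_speedup = 90 / 8

print(f"load rate: {rows_per_second:,.0f} rows/s")
print(f"Hive workload speedup: {hive_speedup:.0f}x")
print(f"load pipeline speedup: {pipeline_speedup:.2f}x")
```

The ratios only restate the slide's claims; they say nothing about how the workloads were configured on either system.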

Page 8: Leveraging Amazon Redshift for Your Data Warehouse

Who Uses Amazon Redshift?

Page 9: Leveraging Amazon Redshift for Your Data Warehouse

Common customer use cases

Traditional enterprise DW

• Reduce costs by extending DW rather than adding HW

• Migrate completely from existing DW systems

• Respond faster to business

Companies with big data

• Improve performance by an order of magnitude

• Make more data available for analysis

• Access business data via standard reporting tools

SaaS companies

• Add analytic functionality to applications

• Scale DW capacity as demand grows

• Reduce HW and SW costs by an order of magnitude

Page 10: Leveraging Amazon Redshift for Your Data Warehouse

Selected Amazon Redshift customers

Page 11: Leveraging Amazon Redshift for Your Data Warehouse

Amazon Redshift Partners

Page 12: Leveraging Amazon Redshift for Your Data Warehouse

Amazon Redshift architecture

• Leader node

– SQL endpoint

– Stores metadata

– Coordinates query execution

• Compute nodes

– Local, columnar storage

– Execute queries in parallel

– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH

• Two hardware platforms

– DS2: HDD; scale from 2 TB to 2 PB

– DC1: SSD; scale from 160 GB to 326 TB

[Diagram labels: Ingestion/Backup, Backup, Restore, JDBC/ODBC, 10 GigE (HPC)]

Page 13: Leveraging Amazon Redshift for Your Data Warehouse

Dramatically reduces I/O

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
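
To see why storing data by column reduces I/O, the sample table above can be laid out both ways. This is a toy sketch, not Redshift's storage engine; the helper names are made up for illustration and "values read" is only a stand-in for disk I/O:

```python
# The four sample rows from the slide, stored row by row.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]
column_names = ("id", "age", "state", "amount")

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


def values_read_row_store(rows):
    """A row store touches every field of every row to aggregate one column."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns, name):
    """A column store touches only the values of the requested column."""
    return len(columns[name])


# SUM(amount) needs only the amount column.
total_amount = sum(columns["amount"])
print(total_amount, values_read_row_store(rows), values_read_column_store(columns, "amount"))
```

With four columns, the column layout reads a quarter of the values for a single-column aggregate; the gap widens as tables get wider.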

Page 15: Leveraging Amazon Redshift for Your Data Warehouse

Dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw

• COPY compresses automatically

• You can analyze and override

• Average 2–4x
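
The output above recommends a delta encoding for listid. As a rough intuition for why that helps, here is a toy delta encoder; it is not Redshift's on-disk format, and the sample ID values are hypothetical:

```python
def delta_encode(values):
    """Keep the first value, then store each value as a difference from the previous one."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded):
    """Rebuild the original values by accumulating the stored differences."""
    decoded = []
    running = 0
    for delta in encoded:
        running += delta
        decoded.append(running)
    return decoded


# Hypothetical, slowly increasing IDs: the differences are much smaller than the IDs.
list_ids = [1001, 1002, 1003, 1005, 1008]
encoded = delta_encode(list_ids)
print(encoded)
```

Small differences fit in fewer bytes than the full values, which is the property a delta-style encoding exploits on sorted or slowly changing columns.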

Page 16: Leveraging Amazon Redshift for Your Data Warehouse

Dramatically reduces I/O

• Column storage

• Data compression

• Direct-attached storage

• Large data block sizes

• Zone maps track the minimum and maximum value for each block

• Skip over blocks that don’t contain the data needed for a given query

• Minimize unnecessary I/O

Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (min 10, max 324)

Block 2: 375 | 393 | 417 | … | 512 | 549 | 623 (min 375, max 623)

Block 3: 637 | 712 | 809 | … | 834 | 921 | 959 (min 637, max 959)
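
The block-skipping idea can be sketched in a few lines. This is an illustration only, built from the values visible on the slide (the elided "…" values are left out); the function names are made up for the example:

```python
# Values visible on the slide, one list per block (elided values omitted).
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]

# The zone map keeps only (min, max) per block.
zone_map = [(min(block), max(block)) for block in blocks]


def blocks_to_scan(zone_map, low, high):
    """Return indexes of blocks whose [min, max] range overlaps [low, high]."""
    return [
        index
        for index, (block_min, block_max) in enumerate(zone_map)
        if block_max >= low and block_min <= high
    ]


def range_query(blocks, zone_map, low, high):
    """Read only the candidate blocks, then filter their values."""
    hits = []
    for index in blocks_to_scan(zone_map, low, high):
        hits.extend(value for value in blocks[index] if low <= value <= high)
    return hits


print(blocks_to_scan(zone_map, 400, 600))
print(range_query(blocks, zone_map, 400, 600))
```

A filter on 400 to 600 touches only the second block; a filter on 330 to 370 touches none, because no block's range overlaps it.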

Page 17: Leveraging Amazon Redshift for Your Data Warehouse

Dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

• Use direct-attached storage to maximize throughput

• Hardware optimized for high performance data processing

• Large block sizes to make the most of each read

Page 18: Leveraging Amazon Redshift for Your Data Warehouse

Parallelizes and distributes operations

• Query

• Load

• Backup/Restore

• Resize

Page 19: Leveraging Amazon Redshift for Your Data Warehouse

Built-in security

• SSL to secure data in transit

• Encryption to secure data at rest

– AES-256; hardware accelerated

– All blocks encrypted on disks and in Amazon S3

– HSM support

• No direct access to compute nodes

• Audit logging and AWS CloudTrail integration

• Amazon VPC support

• SOC 1/2/3, PCI-DSS Level 1, FedRAMP, others

[Diagram labels: Customer VPC, Internal VPC, JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]

Page 20: Leveraging Amazon Redshift for Your Data Warehouse

1/10 the price of a traditional data warehouse

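DS2 (HDD)          | Price per hour for DW1.XL single node | Effective annual price per TB compressed
-------------------+---------------------------------------+-----------------------------------------
On-Demand          | $0.850                                | $3,725
1 Year Reservation | $0.500                                | $2,190
3 Year Reservation | $0.228                                | $999

DC1 (SSD)          | Price per hour for DW2.L single node  | Effective annual price per TB compressed
-------------------+---------------------------------------+-----------------------------------------
On-Demand          | $0.250                                | $13,690
1 Year Reservation | $0.161                                | $8,795
3 Year Reservation | $0.100                                | $5,500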
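
The "effective annual price per TB" column appears to follow from the hourly price and the node's storage. A minimal sketch of that relationship, assuming 2 TB per DW1.XL node and 160 GB per DW2.L node (the minimum cluster sizes quoted on the architecture slide) and a 365-day year; the function name is illustrative:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

# Assumed node capacities in TB, taken from the single-node minimums
# quoted on the architecture slide (2 TB and 160 GB).
DW1_XL_TB = 2.0
DW2_L_TB = 0.16


def effective_annual_price_per_tb(price_per_hour, tb_per_node):
    """Annual cost of running one node around the clock, divided by its storage."""
    return price_per_hour * HOURS_PER_YEAR / tb_per_node


ds_on_demand = effective_annual_price_per_tb(0.850, DW1_XL_TB)
ds_one_year = effective_annual_price_per_tb(0.500, DW1_XL_TB)
ds_three_year = effective_annual_price_per_tb(0.228, DW1_XL_TB)
dc_on_demand = effective_annual_price_per_tb(0.250, DW2_L_TB)

for label, price in [
    ("DS on-demand", ds_on_demand),
    ("DS 1-year", ds_one_year),
    ("DS 3-year", ds_three_year),
    ("DC on-demand", dc_on_demand),
]:
    print(f"{label}: ${price:,.0f} per TB per year")
```

Under these assumptions the rows computed here land within a few dollars of the table; small gaps are expected because the hourly prices on the slide are rounded.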

Page 21: Leveraging Amazon Redshift for Your Data Warehouse

Expanding Amazon Redshift Functionality

Page 22: Leveraging Amazon Redshift for Your Data Warehouse

DS2—50% more performance, same cost

• 2 x the memory and compute power of DS1

• Enhanced networking and 1.5 x gain in disk throughput

• 40% to 60% performance gain over DS1

• Same cost

• Restore from snapshot to migrate from DS1 to DS2

Page 23: Leveraging Amazon Redshift for Your Data Warehouse

Custom ODBC and JDBC drivers

• Up to 35% higher performance than open-source drivers

• Supported by Informatica, MicroStrategy, Pentaho, Qlik, SAS, Tableau

• Will continue to support PostgreSQL open-source drivers

• Download drivers from console

Page 24: Leveraging Amazon Redshift for Your Data Warehouse

Explain plan visualization

Page 25: Leveraging Amazon Redshift for Your Data Warehouse

User-defined functions (upcoming)

• Define your own functions

• Python 2.7

– Syntax is largely identical to PostgreSQL UDF syntax

– System and network calls within UDFs are prohibited

• Pandas, NumPy, and SciPy pre-installed

– Import your own libraries for even more flexibility

Page 26: Leveraging Amazon Redshift for Your Data Warehouse

Scalar UDF example—URL parsing

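-- Scalar UDF that returns the host name portion of a URL.
-- Arguments are declared as name followed by type.
CREATE FUNCTION f_hostname (url VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;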
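
The body of the UDF is ordinary Python, so its logic can be tried outside the database first. A minimal sketch, assuming a Python 3 interpreter, where the Python 2.7 urlparse module lives at urllib.parse; the sample URLs are made up for illustration:

```python
from urllib.parse import urlparse  # Python 3 location of Python 2.7's urlparse module


def f_hostname(url):
    """Mirror the UDF body: return the host name of a URL, or None if there is none."""
    return urlparse(url).hostname


# Hypothetical sample inputs.
print(f_hostname("http://www.example.com/index.html?x=1"))
print(f_hostname("https://AWS.Amazon.com:443/redshift/"))
print(f_hostname("not a url"))
```

Note that hostname is lowercased and excludes the port, and that input without a scheme and host yields None, which a caller of the UDF would see as NULL.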

Page 27: Leveraging Amazon Redshift for Your Data Warehouse

Interleaved multi-column sort

• Compound sort keys

– Optimized for filtering data by one leading column

• Interleaved sort keys

– Optimized for filtering data by up to eight columns

– No storage overhead unlike an index

– Lower maintenance penalty compared to indexes

Page 28: Leveraging Amazon Redshift for Your Data Warehouse

Compound sort keys illustrated

• Records in Amazon Redshift are stored in blocks.

• For this illustration, let’s assume that four records fill a block.

• Records with a given cust_id are all in one block.

• However, records with a given prod_id are spread across four blocks.

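[Figure: records shown as cust_id, prod_id, and other columns, grouped into blocks of four, sorted by cust_id and then prod_id]

cust_id \ prod_id |   1   |   2   |   3   |   4
------------------+-------+-------+-------+------
1                 | [1,1] | [1,2] | [1,3] | [1,4]
2                 | [2,1] | [2,2] | [2,3] | [2,4]
3                 | [3,1] | [3,2] | [3,3] | [3,4]
4                 | [4,1] | [4,2] | [4,3] | [4,4]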

Page 29: Leveraging Amazon Redshift for Your Data Warehouse

Interleaved sort keys illustrated

• Records with a given cust_id are spread across two blocks.

• Records with a given prod_id are also spread across two blocks.

• Data is sorted in equal measures for both keys.

cust_id \ prod_id |   1   |   2   |   3   |   4
------------------+-------+-------+-------+------
1                 | [1,1] | [1,2] | [1,3] | [1,4]
2                 | [2,1] | [2,2] | [2,3] | [2,4]
3                 | [3,1] | [3,2] | [3,3] | [3,4]
4                 | [4,1] | [4,2] | [4,3] | [4,4]

[Figure: records shown as cust_id, prod_id, and other columns, grouped into blocks of four, in interleaved order]
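
The block counts on the two sort-key slides can be reproduced with a toy model: 16 records, one per (cust_id, prod_id) pair, four records per block. The interleaved order below uses a Z-order (bit-interleaving) key, which is an assumption made for illustration; the slides do not say how Redshift interleaves keys. All function names are hypothetical:

```python
from collections import defaultdict


def interleave_bits(a, b, bits=2):
    """Build a Z-order key by alternating the bits of a and b, high bit first."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((a >> i) & 1)
        key = (key << 1) | ((b >> i) & 1)
    return key


def blocks_per_value(records, column, block_size=4):
    """Count how many blocks hold each distinct value of the given column."""
    blocks = defaultdict(set)
    for position, record in enumerate(records):
        blocks[record[column]].add(position // block_size)
    return {value: len(ids) for value, ids in sorted(blocks.items())}


# One record for every (cust_id, prod_id) pair in 1..4 x 1..4.
records = [(cust_id, prod_id) for cust_id in range(1, 5) for prod_id in range(1, 5)]

# Compound sort key: order by cust_id, then prod_id.
compound = sorted(records)

# Interleaved sort key (assumed Z-order): both columns get equal weight.
interleaved = sorted(records, key=lambda r: interleave_bits(r[0] - 1, r[1] - 1))

print("compound    cust_id:", blocks_per_value(compound, 0))
print("compound    prod_id:", blocks_per_value(compound, 1))
print("interleaved cust_id:", blocks_per_value(interleaved, 0))
print("interleaved prod_id:", blocks_per_value(interleaved, 1))
```

With the compound key, a filter on cust_id reads one block but a filter on prod_id reads all four; with the interleaved order, either filter reads two, matching the bullets on both slides.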

Page 30: Leveraging Amazon Redshift for Your Data Warehouse

How to use the feature

• Use ‘INTERLEAVED’ keyword with SORTKEY

– Existing syntax will still work and behavior is unchanged

• No change needed to queries

• Benefits are significant

[ SORTKEY [ COMPOUND | INTERLEAVED ] ( column_name [, ...] ) ]

Page 31: Leveraging Amazon Redshift for Your Data Warehouse

Amazon Redshift

Spend time with your data, not your database….

Page 32: Leveraging Amazon Redshift for Your Data Warehouse

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Introducing Civis: The End-to-End Data Science Platform Built on the Amazon Cloud

Dan Wagner, Founder & CEO, Civis Analytics

Page 33: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

WE ARE:

95 Employees

Software & Services

Offices in Chicago & D.C.

Page 34: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

WE STARTED

in “The Cave”

Page 35: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Obama Simulated Win Projection by State, November 3, 2012

Page 36: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

End-to-end data science process was the secret to driving precise, correct action at scale

• Unify: Data in one place

• Match: Records into a single view

• Explore: Data in a fast database

• Predict: Outcomes to drive decisions

• Drive Action: Right actions based on data

• Measure

• Share: Insights across organization

Page 37: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

“Wagner! What the hell is the Vertica and why does it not work?”

2012 Obama campaign manager (said multiple, multiple times…)

Page 38: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

And then we move into the real world…

Page 39: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

…(completely broke by the way)…

Page 40: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

…and we see the same problems

Page 41: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

“I go downstairs and I ask for a basic report showing daily telephone plan attrition by region, by customer type, by acquisition channel. And then I wait… and wait… and wait. Thirty days later, I get a spreadsheet on my desk for last month’s attrition. This is honestly killing me.”

CEO of global telecommunications firm

Page 42: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

“Uhhhh…I think I blew up Postgres…”

Data scientist at online publisher

Page 43: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Conventional IT infrastructure serves foundational business needs

Data Sources

Integration Layer

Analytics Applications Users

Page 44: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

But data scientists, a new entrant, need their own tools to support decision-making…

Data Sources

Integration Layer

Analytics Applications Users

Page 45: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

And they need all the data to help facilitate decisions across the organization…

Data Sources

Integration Layer

Analytics Applications Users

Page 46: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

The universal tragedy of big data:

The science is there; you just can’t plug it in

Page 47: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Data Sources

Integration Layer

Analytics Applications Users

And they need all the data to help facilitate decisions across the organization…

• A lot of latency

• Distracting custom extracts

• Limited scale

• $$$$$

• Uncoordinated unification

Page 48: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

The challenge: Data scientists have unique end-to-end analytical and computational needs

• Unify: Data in one place

• Match: Records into a single view

• Explore: Data in a fast database

• Predict: Outcomes to drive decisions

• Drive Action: Right actions based on data

• Measure

• Share: Insights across organization

Page 49: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

The challenge: Data scientists have unique end-to-end analytical and computational needs

• Unify: Data in one place

• Match: Records into a single view

• Explore: Data in a fast database

• Predict: Outcomes to drive decisions

• Drive Action: Right actions based on data

• Measure

• Share: Insights across organization

What the platform has to be:

• Intuitive: Built for data scientists

• Expandable: Capacity can grow as needed

• Extensible: Customizable for developers

• End-to-End: From ETL to Modeling to Action

• Right Cost: For any organization

Page 50: Leveraging Amazon Redshift for Your Data Warehouse

Our Solution

Page 51: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

The easy-to-use, end-to-end, powerful & flexible data science platform in the Amazon cloud

Introducing Civis

Page 52: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Page 53: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Page 54: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Page 55: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Our Lego pieces

• Amazon S3

• Amazon EC2

• Amazon EMR

• Amazon Redshift

• Amazon DynamoDB

• Amazon RDS

Page 56: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Amazon Redshift is at the core of our platform

• Expand with minimal interruption

• Multiple hardware options for diverse needs/costs

• As performant as Hadoop with any size data…

• …and way easier to set up

• Facilitates exploration: no MapReduce penalty

• Stays stable with lots of users

• Storage limit consequences are low

• Use industry-standard SQL

• Use industry-standard SQL

Technical Value Analyst Value

Page 57: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Our pirate ship

[Diagram: the stages Unify, Match, Explore, Predict, Measure, Drive Action, and Share, mapped onto Amazon Redshift, DynamoDB, S3, EMR, and EC2]

Page 58: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Get data together in one place

Step 1: Unify (EC2, S3)

Page 59: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Page 60: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Match people between datasets

Step 2: Match (EC2, EMR, DynamoDB)

Page 61: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Page 62: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Transform and explore

Step 3: Explore (S3, Amazon Redshift, DynamoDB)

Page 63: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Page 64: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Build predictions with modeling

Step 4: Predict (S3, EMR, Amazon Redshift)

Page 65: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Page 66: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Report out on results

Step 5: Share & Drive Action (EC2, S3, EMR, Amazon Redshift, DynamoDB, RDS)

Page 67: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Page 68: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

What it has to be: Amazon Redshift variable pricing eliminates up-front capital costs for data science IT

[Chart: May 2013 estimated starting cost for Amazon Redshift vs. HP Vertica; y-axis from $0 to $300,000]

Page 69: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

And minimizes expansion time vs. on-premises hardware (based on our experience)

[Chart: estimated hours required to double the size of our cluster: Amazon Redshift 3, Vertica 850]

Page 70: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Case client expanded from 5 users to 300 in 5 months, hitting 100K jobs during the period

[Chart: 2014 Client Usage, total users and total job runs, 5/24 to 10/21]

Totals

12 regions

305 users

161 reports posted

98,449 job runs by region

19,420 job runs by Civis

Page 71: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Civis went through 4 major Amazon Redshift resizes and scaled linearly in DynamoDB/EMR with no downtime

Page 72: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

Extensibility: each module callable through public-facing API for data science engineers

Page 73: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

[Diagram: the stages Unify, Match, Explore, Predict, Measure, Drive Action, and Share, mapped onto Amazon Redshift, DynamoDB, S3, EMR, and EC2]

Page 74: Leveraging Amazon Redshift for Your Data Warehouse

How We Leverage AWS

“We would be a year behind where we are now without Civis and the AWS services it runs on. What we get are Fortune-X-class data visualization, modeling and data manipulation tools.”

CEO of clean energy startup

Page 75: Leveraging Amazon Redshift for Your Data Warehouse

Your Feedback is Important to AWS

Please complete the session evaluation. Tell us what you think!

Page 76: Leveraging Amazon Redshift for Your Data Warehouse

CHICAGO
