Transcript
Page 1: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Migrate Your Data Warehouse to Amazon Redshift

Greg Khairallah, Business Development Manager, AWS
David Giffin, VP Technology, TrueCar
Sharat Nair, Director of Data, TrueCar
Blagoy Kaloferov, Data Engineer, TrueCar

September 21, 2016

Page 2: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Agenda

• Motivation for Change and Migration

• Migration Patterns and Best Practices

• AWS Database Migration Service

• Use Case – TrueCar

• Questions and Answers

Page 3: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Relational data warehouse

Massively parallel; petabyte scale

Fully managed

HDD and SSD platforms

$1,000/TB/year; starts at $0.25/hour

Amazon Redshift

A lot faster. A lot simpler. A lot cheaper.

Page 4: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Amazon Redshift delivers performance

“[Amazon] Redshift is twenty times faster than Hive.” (5x–20x reduction in query times) link

“Queries that used to take hours came back in seconds. Our analysts are orders of magnitude more productive.” (20x–40x reduction in query times) link

“…[Amazon Redshift] performance has blown away everyone here (we generally see 50–100x speedup over Hive).” link

“Team played with [Amazon] Redshift today and concluded it is awesome. Un-indexed complex queries returning in < 10s.”

“Did I mention it's ridiculously fast? We'll be using it immediately to provide our analysts an alternative to Hadoop.”

“We saw… 2x improvement in query times.”

“We regularly process multibillion row datasets and we do that in a matter of hours.” link

Page 5: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Amazon Redshift is cost optimized

DS2 (HDD) – price per hour for a DS2.XLarge single node / effective annual price per TB compressed:

• On-Demand: $0.850 / $3,725
• 1-Year Reservation: $0.500 / $2,190
• 3-Year Reservation: $0.228 / $999

DC1 (SSD) – price per hour for a DC1.Large single node / effective annual price per TB compressed:

• On-Demand: $0.250 / $13,690
• 1-Year Reservation: $0.161 / $8,795
• 3-Year Reservation: $0.100 / $5,500

Pricing is simple: number of nodes × price per hour. There is no charge for the leader node, no upfront costs, and you pay as you go. For example, a 10-node DS2.XLarge cluster on demand costs 10 × $0.850 = $8.50 per hour.

Prices shown are for US East; other regions may vary.

Page 6: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Considerations Before You Migrate

• Data is often already being loaded into another warehouse
– an existing ETL process with investment in code and process

• The temptation is to ‘lift & shift’ the workload.

• Resist that temptation. Instead consider:
– What do I really want to do?
– What do I need?

• Some data does not lend itself to a relational schema

• A common pattern is to use Amazon EMR to:
– impose structure
– import into Amazon Redshift

Page 7: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Amazon Redshift architecture

• Leader Node

Simple SQL end point

Stores metadata

Optimizes query plan

Coordinates query execution

• Compute Nodes

Local columnar storage

Parallel/distributed execution of all queries, loads, backups, restores, and resizes

• Start at just $0.25/hour, grow to 2 PB (compressed)

DC1: SSD; scale from 160 GB to 326 TB

DS2: HDD; scale from 2 TB to 2 PB

[Architecture diagram: clients connect via JDBC/ODBC to the leader node; compute nodes communicate over 10 GigE (HPC networking); ingestion, backup, and restore go through Amazon S3]

Page 8: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

A deeper look at compute node architecture

Dense compute nodes (DC1)
• Large: 2 slices / 2 cores, 15 GB RAM, 160 GB SSD
• 8XL: 32 slices / 32 cores, 244 GB RAM, 2.56 TB SSD

Dense storage nodes (DS2)
• X-large: 2 slices / 4 cores, 31 GB RAM, 2 TB HDD
• 8XL: 16 slices / 36 cores, 244 GB RAM, 16 TB HDD

Page 9: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Amazon Redshift Migration Overview

[Migration overview diagram: source databases and logs/files in the corporate data center move to the AWS cloud over a VPN connection or AWS Direct Connect, using S3 multipart upload, Amazon Snowball, SSH from EC2 or on-premises hosts, AWS Database Migration Service, Amazon Kinesis, AWS Lambda, or AWS Data Pipeline; destinations include Amazon S3, Amazon DynamoDB, Amazon EMR (Elastic MapReduce), Amazon RDS, Amazon Redshift, and Amazon Glacier]

Page 10: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Uploading Files to Amazon S3

• Ensure that your data resides in the same region as your Redshift cluster

• Split the data into multiple files to facilitate parallel processing (e.g., Client.txt split into Client.txt.1 through Client.txt.4)

• Files should be individually compressed using GZIP or LZOP

• Optionally, you can encrypt your data using Amazon S3 Server-Side or Client-Side Encryption

Page 11: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

• Use the COPY command

• Each slice can load one file at a time

• A single input file means only one slice is ingesting data

• Instead of 100 MB/s, you’re only getting 6.25 MB/s (one slice of 16 doing all the work)

Loading – Use multiple input files to maximize throughput

Page 12: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

• Use the COPY command

• You need at least as many input files as you have slices

• With 16 input files, all slices are working, so you maximize throughput

• Get 100 MB/s per node; scale linearly as you add nodes

Loading – Use multiple input files to maximize throughput
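A minimal sketch of such a load, assuming a hypothetical orders table and split files named orders.txt.1 through orders.txt.16; COPY loads every S3 object that begins with the given prefix, so all slices ingest in parallel:

COPY orders
FROM 's3://mybucket/load/orders.txt.'  -- prefix matches orders.txt.1 ... orders.txt.16
CREDENTIALS 'aws_access_credentials'
GZIP
DELIMITER '|';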

Page 13: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Loading Data with Manifest Files

• Use a manifest to load all required files

• Supply a JSON-formatted text file that lists the files to be loaded

• Can load files from different buckets or with different prefixes

{
  "entries": [
    {"url": "s3://mybucket-alpha/2013-10-04-custdata", "mandatory": true},
    {"url": "s3://mybucket-alpha/2013-10-05-custdata", "mandatory": true},
    {"url": "s3://mybucket-beta/2013-10-04-custdata", "mandatory": true},
    {"url": "s3://mybucket-beta/2013-10-05-custdata", "mandatory": true}
  ]
}
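To use the manifest, point COPY at the manifest file itself and add the MANIFEST option. A minimal sketch, assuming a hypothetical custdata table and that the manifest above was saved as cust.manifest:

COPY custdata
FROM 's3://mybucket-alpha/cust.manifest'
CREDENTIALS 'aws_access_credentials'
MANIFEST;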

Page 14: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Redshift COPY Command

• Loads data into a table from data files in S3 or from an Amazon DynamoDB table.

• The COPY command requires only three parameters: – Table name

– Data Source

– Credentials

COPY table_name FROM data_source
CREDENTIALS 'aws_access_credentials';

• Optional parameters include:
– Column mapping options – mapping source columns to target columns

– Data format parameters – FORMAT, CSV, DELIMITER, FIXEDWIDTH, AVRO, JSON, BZIP2, GZIP, LZOP

– Data conversion parameters – data type conversion between source and target

– Data load operations – troubleshoot or reduce load times with parameters like COMPROWS, COMPUPDATE, MAXERROR, NOLOAD, STATUPDATE
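A fuller sketch combining several of these options, with hypothetical table and bucket names:

COPY sales
FROM 's3://mybucket/sales/part'
CREDENTIALS 'aws_access_credentials'
GZIP
DELIMITER '|'
MAXERROR 10      -- tolerate up to 10 bad rows before failing the load
COMPUPDATE ON    -- compute optimal column encodings when loading an empty table
STATUPDATE ON;   -- refresh optimizer statistics after the load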

Page 15: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Loading JSON Data

• COPY uses a jsonpaths text file to parse JSON data

• JSONPath expressions specify the path to JSON name elements

• Each JSONPath expression corresponds to a column in the Amazon Redshift target table

Suppose you want to load the VENUE table with the following content:

{"id": 15, "name": "Gillette Stadium", "location": ["Foxborough", "MA"], "seats": 68756}
{"id": 16, "name": "McAfee Coliseum", "location": ["Oakland", "CA"], "seats": 63026}

You would use the following jsonpaths file to parse the JSON data:

{
  "jsonpaths": [
    "$['id']",
    "$['name']",
    "$['location'][0]",
    "$['location'][1]",
    "$['seats']"
  ]
}

Page 16: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Loading Data in Avro Format

• Avro is a data serialization protocol. An Avro source file includes a schema that defines the structure of the data. The Avro schema type must be record.

• COPY uses an avro_option to parse Avro data. Valid values for avro_option are as follows:

– 'auto' (default) – COPY automatically maps the data elements in the Avro source data to the columns in the target table by matching field names in the Avro schema to column names in the target table.

– 's3://jsonpaths_file' – to explicitly map Avro data elements to columns, you can use a JSONPaths file.

Avro Schema

{
  "name": "person",
  "type": "record",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "guid", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "address", "type": "string"}
  ]
}
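A matching load using the default 'auto' mapping; a sketch assuming a hypothetical person table whose column names match the Avro field names:

COPY person
FROM 's3://mybucket/person.avro'
CREDENTIALS 'aws_access_credentials'
FORMAT AS AVRO 'auto';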

Page 17: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Elasticsearch

• Zero administration: capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure.

• Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.

• Seamless elasticity: seamlessly scales to match data throughput without intervention.

Capture and submit streaming data; Firehose loads it continuously into Amazon S3, Amazon Redshift, and Elasticsearch; analyze the data using your favorite BI tools.

Page 18: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Best Practices for Loading Data

• Use a COPY command to load data

• Use a single COPY command per table

• Split your data into multiple files

• Compress your data files with GZIP or LZOP

• Use multi-row inserts whenever possible (see the sketch after this list)

• Bulk insert operations (INSERT INTO…SELECT and CREATE TABLE AS) provide high-performance data insertion

• Use Amazon Kinesis Firehose for streaming data loaded directly to S3 and/or Redshift
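A minimal sketch of the multi-row and bulk insert patterns, with hypothetical table names:

-- Multi-row insert: one statement, many rows
INSERT INTO category VALUES
  (1, 'Sports'),
  (2, 'Concerts'),
  (3, 'Theatre');

-- Bulk inserts: high-performance insertion from existing tables
INSERT INTO category_stage SELECT * FROM category;
CREATE TABLE category_backup AS SELECT * FROM category;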

Page 19: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Best Practices for Loading Data Continued

• Load your data in sort key order to avoid needing to vacuum

• Organize your data as a sequence of time-series tables, where each table is identical but contains data for different time ranges

• Use staging tables to perform an upsert (see the sketch after this list)

• Run the VACUUM command whenever you add, delete, or modify a large number of rows, unless you load your data in sort key order

• Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count

• Run the ANALYZE command whenever you’ve made a non-trivial number of changes to your data to ensure your table statistics are current
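A sketch of the staging-table upsert, with hypothetical table and column names; the slot-count setting applies only to the current session:

-- Give this session's COPY more memory from the WLM queue
SET wlm_query_slot_count TO 3;

BEGIN;
COPY sales_staging FROM 's3://mybucket/sales/'
  CREDENTIALS 'aws_access_credentials' GZIP DELIMITER '|';
-- Upsert: delete rows that will be replaced, then append from staging
DELETE FROM sales USING sales_staging
  WHERE sales.sale_id = sales_staging.sale_id;
INSERT INTO sales SELECT * FROM sales_staging;
COMMIT;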

Page 20: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Amazon Partner ETL

• Amazon Redshift is supported by a variety of ETL vendors

• Many simplify the process of data loading

• Visit http://aws.amazon.com/redshift/partners

• A variety of vendors offer a free trial of their products, allowing you to evaluate and choose the one that suits your needs.

Page 21: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

AWS Database Migration Service (DMS)

Benefits:

• Start your first migration in 10 minutes or less

• Keep your apps running during the migration

• Replicate within, to, or from Amazon EC2 or RDS

• Move data from commercial database engines to open source engines

• Or…move data to the same database engine

• Consolidate databases and/or tables

Page 22: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Sources and Targets for AWS DMS

Sources:

On-premises and Amazon EC2 instance databases:
• Oracle Database 10g – 12c
• Microsoft SQL Server 2005 – 2014
• MySQL 5.5 – 5.7
• MariaDB (MySQL-compatible data source)
• PostgreSQL 9.4 – 9.5
• SAP ASE 15.7+

RDS instance databases:
• Oracle Database 11g – 12c
• Microsoft SQL Server 2008 R2 – 2014 (CDC operations are not supported yet)
• MySQL 5.5 – 5.7
• MariaDB (MySQL-compatible data source)
• PostgreSQL 9.4 – 9.5 (CDC operations are not supported yet)
• Amazon Aurora (MySQL-compatible data source)

Targets:

On-premises and EC2 instance databases:
• Oracle Database 10g – 12c
• Microsoft SQL Server 2005 – 2014
• MySQL 5.5 – 5.7
• MariaDB (MySQL-compatible data source)
• PostgreSQL 9.3 – 9.5
• SAP ASE 15.7+

RDS instance databases:
• Oracle Database 11g – 12c
• Microsoft SQL Server 2008 R2 – 2014
• MySQL 5.5 – 5.7
• MariaDB (MySQL-compatible data source)
• PostgreSQL 9.3 – 9.5
• Amazon Aurora (MySQL-compatible data source)

Amazon Redshift

Page 23: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

AWS Database Migration Service Pricing

• T2 instances for development and periodic data migration tasks

• C4 instances for large databases and for minimizing migration time

• T2 pricing starts at $0.018 per hour for T2.micro

• C4 pricing starts at $0.154 per hour for C4.large

• 50 GB GP2 storage included with T2 instances

• 100 GB GP2 storage included with C4 instances

• Data transfer inbound and within AZ is free

• Data transfer across AZs starts at $0.01 per GB

Page 24: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

AWS Schema Conversion Tool

Page 26: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series


Amazon Redshift at TrueCar

Sep 21, 2016

Page 27: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series


● About TrueCar

● David Giffin – VP Technology

● Sharat Nair – Director of Data

● Blagoy Kaloferov – Data Engineer

About us


Page 28: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series


● Amazon Redshift use case overview

● Architecture and migration process

● Tips and lessons learned

Agenda


Page 29: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Amazon Redshift at TrueCar


Page 30: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series


● Datasets that flow into Amazon Redshift
● Clickstream
● Transactions
● Sales
● Inventory
● Dealer
● Leads

● How we do analytics and reporting
● Redshift is our data store for BI tools and ad-hoc queries
● Data that is loaded into Amazon Redshift is already processed

Amazon Redshift at TrueCar

Page 31: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Architecture

[Architecture diagram: Leads, Dealer, Transactions, Sales, Inventory, and Clickstream datasets land in HDFS; ETL (MapReduce, Hive, Pig, Oozie, Talend) performs the data processing and staging/DWH ETL; results are published to Postgres]

Page 32: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Architecture

[Architecture diagram: the same pipeline now also writes to S3; Leads, Dealer, Transactions, Sales, Inventory, and Clickstream datasets land in HDFS, ETL (MapReduce, Hive, Pig, Oozie, Talend) performs the data processing and staging/DWH ETL into Postgres, and a loading utility moves data from S3 into Amazon Redshift, which serves reporting (MSTR, Tableau) and ad hoc queries]

Page 33: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

● Schemas
● Our datasets are in a read-only schema for ad-hoc and scheduled reporting
● Ad-hoc and user tables live in separate schemas
● This makes it easy to separate final data from user-created data (see the sketch after this slide)

● Simple table naming conventions
● F_ – facts
● D_ – dimensions
● AGG_ – aggregates
● V_ – views

Schema design
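A minimal sketch of such a layout, assuming hypothetical schema and group names; the reporting schema is read-only for analysts while the ad-hoc schema is writable:

CREATE GROUP analysts;
CREATE SCHEMA dwh;    -- final, read-only datasets
CREATE SCHEMA adhoc;  -- user-created and ad-hoc tables
GRANT USAGE ON SCHEMA dwh TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA dwh TO GROUP analysts;
GRANT ALL ON SCHEMA adhoc TO GROUP analysts;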

Page 34: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series


Amazon Redshift learnings

Page 35: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

● ETL is orchestrated through Talend and Oozie

● Processing tools: Talend, Hive, Pig, and MapReduce push data into HDFS and S3

● We built our own Amazon Redshift loading utility

● It handles all of our loading use cases (sketched after this slide):
● Load
● TruncateLoad
● DeleteAppend
● Upsert

Redshift loading process
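The utility itself is internal to TrueCar, but here is a sketch of the SQL such modes typically issue, with hypothetical table and bucket names (Upsert follows the staging pattern shown earlier):

-- TruncateLoad: full refresh of a table
TRUNCATE f_sales;
COPY f_sales FROM 's3://mybucket/f_sales/'
  CREDENTIALS 'aws_access_credentials' GZIP;

-- DeleteAppend: delete one time range, then append its reload
BEGIN;
DELETE FROM f_sales WHERE sale_date >= '2016-09-01';
COPY f_sales FROM 's3://mybucket/f_sales/2016-09/'
  CREDENTIALS 'aws_access_credentials' GZIP;
COMMIT;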

Page 36: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

● Train developers on table design and Redshift best practices

● Compress columns with appropriate encodings
● Run ANALYZE COMPRESSION; it makes a significant difference in space usage (see the sketch after this slide)

● Choose sort and distribution keys deliberately

● Plan a workload management (WLM) strategy
● As usage of the Redshift cluster grows, you need to ensure that critical jobs get bandwidth

Table design considerations
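A sketch of both steps, with a hypothetical clickstream fact table; ANALYZE COMPRESSION recommends an encoding per column, which you can then bake into the DDL along with sort and distribution keys:

ANALYZE COMPRESSION f_clickstream;

CREATE TABLE f_clickstream_new (
  event_time TIMESTAMP    ENCODE delta    SORTKEY,  -- queries scan by time range
  user_id    BIGINT       ENCODE mostly32 DISTKEY,  -- distribute on the join key
  page_url   VARCHAR(512) ENCODE lzo
);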

Page 37: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

● Retain pre-“COPY” data in S3
● It can easily be used by other tools (Spark, Pig, MapReduce)

● Offload historical datasets into separate tables on a rolling basis (see the sketch after this slide)

● Pre-aggregate data when possible to reduce load on the system

Space considerations
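One way to offload a historical range, assuming a hypothetical f_sales table; UNLOAD writes the rows back to S3 before they are removed from the cluster:

UNLOAD ('SELECT * FROM f_sales WHERE sale_date < ''2015-01-01''')
TO 's3://mybucket/archive/f_sales_pre2015_'
CREDENTIALS 'aws_access_credentials'
GZIP;

DELETE FROM f_sales WHERE sale_date < '2015-01-01';
VACUUM f_sales;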

Page 38: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

● Have a cluster resize strategy

● Use reserved instances for cost savings

● Plan on having enough space for long-term growth

● Plan your maintenance
● Vacuuming

● System tables are your friends (see the sketch after this slide)

● Useful collection of utilities: https://github.com/awslabs/amazon-redshift-utils/

Long-term usage tips
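For example, a quick health check against the svv_table_info system view, listing the largest tables and how unsorted they are (good VACUUM candidates):

SELECT "table", size AS size_mb, tbl_rows, unsorted AS pct_unsorted
FROM svv_table_info
ORDER BY size DESC
LIMIT 10;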

Page 39: Migrate your Data Warehouse to Amazon Redshift - September Webinar Series

Thanks! Questions?