Redshift spike

Jan 21, 2017

Page 1: Redshift spike

What is Redshift?

Secure

Fully Managed

MPP Data Warehouse

Based on PostgreSQL 8.0.2

SQL interface

3000$ Data Warehouse

Analytics on Large scale Datasets

Petabyte scale

Demo: RDS vs Redshift

Table Name    Rows
LINEORDER     600,037,902
PART          1,400,000
CUSTOMER      3,000,000
SUPPLIER      1,000,000
DWDATE        2,556

How is Redshift so fast?

1. Distributed Computing

2. Columnar Storage

3. Data Compression

Raw: default storage (no compression)

ByteDict: external dictionary of unique values

Delta: difference between consecutive values

Runlength: count of consecutive repeated values

Text: repeated words in long text

4. Large Block Storage

Block size of 1 MB, as opposed to the smaller block sizes in a typical RDBMS
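The Runlength, Delta and ByteDict ideas listed above can be sketched in a few lines of Python. This is only an illustration of the concepts on tiny in-memory lists, not Redshift's on-disk format; the function names and the sample columns (`status`, `order_ids`) are made up for the example.

```python
from itertools import groupby


def runlength_encode(values):
    """Collapse each run of consecutive equal values into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def bytedict_encode(values):
    """Replace each value with a small index into a dictionary of unique values."""
    dictionary = []   # unique values in first-seen order
    index_of = {}     # value -> position in dictionary
    codes = []        # one small integer per input value
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


# Hypothetical sample columns, chosen only to show each encoding at work.
status = ["open", "open", "open", "closed", "closed", "open"]
order_ids = [1000, 1001, 1003, 1006, 1010]

rle = runlength_encode(status)
deltas = delta_encode(order_ids)
dictionary, codes = bytedict_encode(status)

print(rle)                 # [('open', 3), ('closed', 2), ('open', 1)]
print(deltas)              # [1000, 1, 2, 3, 4]
print(dictionary, codes)   # ['open', 'closed'] [0, 0, 0, 1, 1, 0]
```

Each encoding pays off on a different shape of column: long runs of one value suit Runlength, slowly increasing numbers suit Delta, and columns with few distinct values suit ByteDict.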

Tuning

1. Sort Keys and Zone Maps

● Amazon Redshift stores columnar data in 1 MB disk blocks

● Min and max values for each block are stored as part of the metadata (zone maps)

● Zone maps help a query skip over large numbers of blocks during table scans
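A minimal sketch of how min/max metadata lets a scan skip blocks, assuming a column that is already sorted and a toy block size of 4 values in place of 1 MB. The names `build_zone_map` and `scan_range` are hypothetical helpers for this example, not Redshift APIs.

```python
BLOCK_ROWS = 4  # toy stand-in for a 1 MB block


def build_zone_map(values, block_rows=BLOCK_ROWS):
    """Split a column into fixed-size blocks and record (min, max) per block."""
    blocks = [values[i:i + block_rows] for i in range(0, len(values), block_rows)]
    zone_map = [(min(block), max(block)) for block in blocks]
    return blocks, zone_map


def scan_range(blocks, zone_map, low, high):
    """Return values in [low, high], reading only blocks whose range overlaps."""
    matches = []
    blocks_read = 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        if block_max < low or block_min > high:
            continue  # the zone map proves this block holds no matching value
        blocks_read += 1
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read


values = list(range(1, 21))  # a sorted column holding 1..20
blocks, zone_map = build_zone_map(values)
matches, blocks_read = scan_range(blocks, zone_map, 6, 9)

print(zone_map)      # [(1, 4), (5, 8), (9, 12), (13, 16), (17, 20)]
print(matches)       # [6, 7, 8, 9]
print(blocks_read)   # 2 of 5 blocks
```

The skipping only works well when the column is sorted. On unsorted data every block's min/max range tends to overlap the filter, which is why the sort key choice matters.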

Compound Sort Key

Interleaved Sort Key

Choosing a Sort Key

● Timestamp Column

● Range or Equality filtering on specific column

● Join Column

2. Distribution Key

Effect of distribution on query performance

Distribution Styles
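The slide itself does not list the styles. As a rough sketch, assuming the EVEN, KEY and ALL styles from the Redshift documentation, row placement could be modelled as below. The CRC32 hash, the slice count and the sample tables are stand-ins chosen for the example; Redshift's real hash function and node/slice layout are not shown here.

```python
import zlib


def distribute_even(rows, num_slices):
    """EVEN: hand rows out round-robin, regardless of their values."""
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)
    return slices


def distribute_key(rows, num_slices, key):
    """KEY: rows with the same key value always land on the same slice."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        # CRC32 is only a deterministic stand-in for the real hash function.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        slices[h % num_slices].append(row)
    return slices


def distribute_all(rows, num_slices):
    """ALL: every location keeps a full copy (simplified to one copy per slice)."""
    return [list(rows) for _ in range(num_slices)]


# Hypothetical sample tables.
orders = [{"order_id": i, "customer_id": i % 5} for i in range(20)]
customers = [{"customer_id": c} for c in range(5)]

even_orders = distribute_even(orders, 4)
key_orders = distribute_key(orders, 4, "customer_id")
key_customers = distribute_key(customers, 4, "customer_id")
all_customers = distribute_all(customers, 4)

# With both tables distributed on customer_id, each slice can join locally:
for order_slice, customer_slice in zip(key_orders, key_customers):
    local_ids = {c["customer_id"] for c in customer_slice}
    assert all(o["customer_id"] in local_ids for o in order_slice)
```

The final loop shows the effect on query performance: when two tables share a distribution key, matching rows are already on the same slice, so a join on that key needs no data movement between slices.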

Integration with other tools

● EMR

● RDS

● Spark

● ETL: SnapLogic, Treasure Data

Where not to use it?

● Using it for something other than analytics

○ It is not for OLTP operations

● When you have small queries like select * from <table>

● When you depend on enforced constraints, indexes or table partitioning

● When you have many concurrent users accessing the same cluster

○ Limit of 50 concurrent queries

Gotchas

● Data loading into Redshift

○ Parallel loads are supported only from S3, DynamoDB, EMR and Data Pipeline

● Concurrent queries degrade performance

○ More load, more latency

● Cluster resizing

○ Read-only mode while resizing

○ Does not easily scale up for dynamic workloads

● Multi-AZ deployments not supported

○ Cluster unavailable until power and network access to the AZ are restored

● Frequent Deletes & Updates

○ Every UPDATE = DELETE + INSERT

○ Vacuum to the rescue

● Cannot apply compression after the table is created
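The "UPDATE = DELETE + INSERT" point and the role of vacuum can be illustrated with a toy in-memory table. This is a simplified model, assuming deleted rows are only flagged until a vacuum rewrites the table; the `Table` class and its methods are hypothetical and not how Redshift is implemented internally.

```python
class Table:
    """Toy append-only table: deletes only flag rows, vacuum rewrites storage."""

    def __init__(self):
        self.entries = []  # each entry is [row_dict, deleted_flag]

    def insert(self, row):
        self.entries.append([dict(row), False])

    def delete(self, predicate):
        for entry in self.entries:
            if not entry[1] and predicate(entry[0]):
                entry[1] = True  # flag the row; its space is not reclaimed yet

    def update(self, predicate, changes):
        # UPDATE = DELETE + INSERT: flag the old version, append the new one.
        new_rows = []
        for entry in self.entries:
            if not entry[1] and predicate(entry[0]):
                entry[1] = True
                new_rows.append({**entry[0], **changes})
        for row in new_rows:
            self.insert(row)

    def vacuum(self, sort_key):
        # Drop flagged rows and restore sort order in a single rewrite.
        live = [row for row, deleted in self.entries if not deleted]
        live.sort(key=lambda row: row[sort_key])
        self.entries = [[row, False] for row in live]

    def live_rows(self):
        return [row for row, deleted in self.entries if not deleted]

    def stored_rows(self):
        return len(self.entries)


table = Table()
for i in (1, 2, 3):
    table.insert({"id": i, "status": "new"})

table.update(lambda row: row["id"] == 1, {"status": "done"})
stored_before = table.stored_rows()   # 4: three live rows plus one flagged row
live_before = table.live_rows()       # the updated row now sits at the end

table.vacuum("id")
stored_after = table.stored_rows()    # 3: the flagged row is gone
live_after = table.live_rows()        # back in id order
```

Frequent updates therefore grow the table and leave it unsorted until the next vacuum, which is why the slide lists them as a gotcha.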

References

● http://docs.aws.amazon.com/redshift/latest/dg/welcome.html

● https://www.periscopedata.com/blog/redshift-and-rds-postgres-benchmarked.html

● http://blog.blazeclan.com/what-is-amazon-redshift-11-key-points-remember/

● https://redshiftuser.wordpress.com/

● http://engineering.thinknear.com/blog/2015/01/21/understanding-the-decision-to-move-from-aws-emr-slash-hive-to-redshift/

● http://www.eshioji.co.uk/2013/07/a-simplistic-redshift-trouble-shooting.html