Introduction to Amazon Redshift May, 2014 / Abdullah Cetin CAVDAR @accavdar
Introduction to Amazon Redshift

May 10, 2015

This presentation summarizes the Amazon Redshift data warehouse service, its architecture, and best practices for application development with Amazon Redshift.
Transcript
Page 1: Introduction to Amazon Redshift

Introduction to Amazon Redshift

May, 2014 / Abdullah Cetin CAVDAR @accavdar

Page 2: Introduction to Amazon Redshift

What's Amazon Redshift?

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud

https://aws.amazon.com/redshift/

Page 3: Introduction to Amazon Redshift

Features

- Petabyte scale, massively parallel
- Relational data warehouse
- Fully managed, zero admin
- SSD and HDD platforms
- $999/TB/Year

Page 4: Introduction to Amazon Redshift

Architecture

Page 5: Introduction to Amazon Redshift

Client Applications

- Integrates with various data loading and ETL (Extract, Transform, and Load) tools and business intelligence (BI) reporting, data mining, and analytics tools
- Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes

Page 6: Introduction to Amazon Redshift

Connections

- Redshift communicates with client applications by using industry-standard PostgreSQL JDBC and ODBC drivers
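Because clients connect through PostgreSQL drivers, a connection string looks like an ordinary PostgreSQL one. The sketch below only builds the strings; it does not open a connection. The host name is a made-up placeholder, `jdbc_url` and `libpq_dsn` are hypothetical helper names, and 5439 is assumed as the port because it is the usual Redshift default.

```python
from urllib.parse import quote

# Assumption: 5439 is the usual default port of a Redshift cluster.
REDSHIFT_DEFAULT_PORT = 5439


def jdbc_url(host: str, database: str,
             port: int = REDSHIFT_DEFAULT_PORT, ssl: bool = True) -> str:
    """Build a PostgreSQL-style JDBC URL for a cluster endpoint."""
    url = f"jdbc:postgresql://{host}:{port}/{quote(database)}"
    if ssl:
        # Ask the PostgreSQL JDBC driver to encrypt data in transit.
        url += "?ssl=true"
    return url


def libpq_dsn(host: str, database: str, user: str,
              port: int = REDSHIFT_DEFAULT_PORT) -> str:
    """Build a keyword/value DSN as accepted by libpq-based clients."""
    return (f"host={host} port={port} dbname={database} "
            f"user={user} sslmode=require")


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    endpoint = "examplecluster.example.us-west-2.redshift.amazonaws.com"
    print(jdbc_url(endpoint, "dev"))
    print(libpq_dsn(endpoint, "dev", "analyst"))
```

The same strings can be handed to any JDBC-based tool or libpq-based client; credentials are deliberately left out of the sketch.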

Page 7: Introduction to Amazon Redshift

Clusters

- A cluster is composed of one or more compute nodes
- The leader node coordinates the compute nodes and handles external communication

Page 8: Introduction to Amazon Redshift

Leader Node

- Manage communications with client programs and communications with compute nodes
- Store metadata
- Coordinate query execution

Page 9: Introduction to Amazon Redshift

Compute Nodes

- Execute the compiled code and send intermediate results back to the leader node for final aggregation
- Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type
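The split of work between compute nodes and the leader node can be illustrated with a toy average. This is a conceptual sketch in plain Python, not Redshift's actual execution engine; `Partial`, `compute_node_partial`, and `leader_final_average` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Intermediate result that one compute node sends to the leader."""
    total: float
    count: int


def compute_node_partial(rows):
    """Work done on one compute node: aggregate only its local rows."""
    return Partial(total=sum(rows), count=len(rows))


def leader_final_average(partials):
    """Work done on the leader node: merge the intermediate results."""
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    return total / count if count else None


if __name__ == "__main__":
    # Each inner list stands for the rows stored on one compute node.
    node_data = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
    partials = [compute_node_partial(rows) for rows in node_data]
    print(leader_final_average(partials))
```

Each node sends back only a sum and a count, so the leader merges a few small records instead of every row.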

Page 10: Introduction to Amazon Redshift

Databases

- A cluster contains one or more databases
- User data is stored on the compute nodes
- Amazon Redshift is a Relational Database Management System (RDBMS)
- Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets
- Amazon Redshift is based on PostgreSQL

Page 11: Introduction to Amazon Redshift

Redshift reduces I/O

- Column storage - read only the data you need
- Data compression - analyzes and compresses your data
- Zone Map
  - Keep track of minimum and maximum value for each block
  - Skip over blocks that don't contain data needed for a given query
  - Minimize unnecessary I/O
- Direct attached storage
  - Hardware optimized for high performance data processing
- Large data block sizes
  - Large block sizes to make the most of each read
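The zone map idea from this slide can be sketched in a few lines: keep the minimum and maximum value of every block, and read a block only if its range overlaps the query's range. This is a simplified illustration, not Redshift's on-disk format; `Block`, `build_blocks`, and `scan_range` are hypothetical names and the block size is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Block:
    values: list
    min_value: int  # zone map entry: smallest value in the block
    max_value: int  # zone map entry: largest value in the block


def build_blocks(values, block_size):
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks, low, high):
    """Return values in [low, high] and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        # Skip blocks whose min/max range cannot contain a matching value.
        if block.max_value < low or block.min_value > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


if __name__ == "__main__":
    column = list(range(1, 101))  # a sorted column of 100 values
    blocks = build_blocks(column, block_size=10)
    matches, blocks_read = scan_range(blocks, 42, 47)
    print(matches, f"read {blocks_read} of {len(blocks)} blocks")
```

The skip is effective here because the column is sorted, so each block covers a narrow, non-overlapping range. On unsorted data most blocks would overlap the query range and would still be read.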

Page 12: Introduction to Amazon Redshift

Redshift runs on optimized hardware

- Optimized for I/O intensive workloads
- High disk density
- Runs in HPC - fast network

Page 13: Introduction to Amazon Redshift

Redshift parallelizes and distributes everything

- Query
- Load
- Backup/Restore
- Resize

Page 14: Introduction to Amazon Redshift

Redshift is easy to use

- Provision in minutes
- Monitor query performance
- Point and click resize
- Built in security
- Automatic backups

Page 15: Introduction to Amazon Redshift

Redshift has security built-in

- SSL to secure data in transit
- Encryption to secure data at rest
  - AES 256 - hardware accelerated
  - All blocks on disk and in Amazon S3 encrypted
- No direct access to compute nodes
- Amazon VPC support

Page 16: Introduction to Amazon Redshift

Redshift backs up your data and recovers from failures

- Replication within the cluster and backup to Amazon S3
- Backups to Amazon S3 are continuous, automatic, and incremental
- Continuous monitoring and automated recovery from failures
- Able to restore snapshots to any Availability Zone

Page 17: Introduction to Amazon Redshift

Use Cases

Page 18: Introduction to Amazon Redshift

Traditional Enterprise DW

- Reduce costs by extending DW rather than adding HW
- Migrate completely from existing DW systems
- Respond faster to business

Page 19: Introduction to Amazon Redshift

Companies with Big Data

- Improve performance by an order of magnitude
- Make more data available for analysis
- Access business data via standard reporting tools

Page 20: Introduction to Amazon Redshift

SaaS Companies

- Add analytic functionality to applications
- Scale DW capacity as demand grows
- Reduce HW and SW costs by an order of magnitude

Page 21: Introduction to Amazon Redshift

Use Case: SkillPages

Page 22: Introduction to Amazon Redshift

Data Architecture

Page 23: Introduction to Amazon Redshift

Redshift Implementation

- High Storage Extra Large (XL) DW Node
- ETL Activities
  - Approx. 90 minutes including exports from RDBMS, copying to S3, loading stage tables, loading target tables, vacuuming and analysing tables
- Schema
- Compression
- Retention
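The warehouse-side part of the ETL activities above (load stage tables, load target tables, vacuum, analyse) could be sketched as an ordered list of SQL statements. This is a hedged illustration, not the SkillPages implementation: the table names, S3 prefix, and IAM role are hypothetical, the delete-then-insert step is one common way to load a target from a stage table, and `IAM_ROLE` is assumed as the authorization clause for `COPY`.

```python
def etl_statements(stage_table, target_table, key_column, s3_prefix, iam_role):
    """Return the SQL statements for one load cycle, in execution order.

    Values are interpolated directly for readability, so this is only
    suitable for trusted, hard-coded identifiers.
    """
    return [
        # 1. Empty the stage table left over from the previous run.
        f"TRUNCATE {stage_table};",
        # 2. Bulk-load the exported files from S3 into the stage table.
        f"COPY {stage_table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV GZIP;",
        # 3. Remove target rows that the new batch replaces (assumed pattern).
        f"DELETE FROM {target_table} USING {stage_table} "
        f"WHERE {target_table}.{key_column} = {stage_table}.{key_column};",
        # 4. Load the target table from the stage table.
        f"INSERT INTO {target_table} SELECT * FROM {stage_table};",
        # 5. Reclaim space and re-sort rows, then refresh planner statistics.
        f"VACUUM {target_table};",
        f"ANALYZE {target_table};",
    ]


if __name__ == "__main__":
    for statement in etl_statements(
        stage_table="stage_events",           # hypothetical
        target_table="events",                # hypothetical
        key_column="event_id",                # hypothetical
        s3_prefix="s3://example-bucket/exports/events/",           # hypothetical
        iam_role="arn:aws:iam::123456789012:role/ExampleLoadRole", # hypothetical
    ):
        print(statement)
```

The export from the source RDBMS and the upload to S3 happen before these statements and are outside the sketch.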

Page 24: Introduction to Amazon Redshift

DW Anatomy

Page 25: Introduction to Amazon Redshift

Why Redshift works for SkillPages?

- Scale - MPP
- Performance - Columnar data access and compression
- Platform Integration - S3, DynamoDB
- Operational Advantages
- Ease of Access
- Cost

Page 26: Introduction to Amazon Redshift

Best Practices

- Avoid large numbers of singleton Data Manipulation Language (DML) statements if possible
- Use COPY for uploading large datasets
- Choose SORT and DISTRIBUTION keys with care
- Encode date and time with the TIMESTAMP data type
- Experiment with WLM (Workload Manager) settings
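Two of these practices can be made concrete with a small sketch. The table definition declares distribution and sort keys and stores the event time as TIMESTAMP. The helper folds many single-row inserts into one multi-row INSERT for cases where COPY is not an option. The `events` table, its columns, and `multi_row_insert` are hypothetical, and the `%s` placeholders assume a driver that uses that parameter style, such as psycopg2.

```python
# Hypothetical table: rows are distributed by user_id and sorted by event_time.
CREATE_EVENTS = """
CREATE TABLE events (
    user_id    BIGINT    NOT NULL,
    event_time TIMESTAMP NOT NULL,
    event_type VARCHAR(32)
)
DISTKEY (user_id)
SORTKEY (event_time);
"""


def multi_row_insert(table, columns, rows):
    """Combine many rows into one INSERT statement with %s placeholders.

    Returns the SQL text and a flat parameter list for the driver.
    """
    if not rows:
        raise ValueError("at least one row is required")
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([row_placeholder] * len(rows))
        + ";"
    )
    params = [value for row in rows for value in row]
    return sql, params


if __name__ == "__main__":
    sql, params = multi_row_insert(
        "events",
        ["user_id", "event_time", "event_type"],
        [(1, "2014-05-01 10:00:00", "login"),
         (2, "2014-05-01 10:00:05", "search")],
    )
    print(sql)
    print(params)
```

One statement carrying many rows replaces many singleton statements; for large datasets COPY remains the recommended path.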

Page 27: Introduction to Amazon Redshift

Slides

https://github.com/accavdar/AmazonRedshift

Page 28: Introduction to Amazon Redshift

THE END

by Abdullah Cetin CAVDAR / @accavdar