1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. May 2014. Uses and Best Practices for Amazon Redshift. Eric Ferreira, Sr. Database Engineer.
2. Amazon Redshift: fast, simple, petabyte-scale data warehousing for less than $1,000/TB/year.
3. [Diagram: Collect (AWS Direct Connect, Amazon Kinesis) -> Store (Amazon S3, Amazon DynamoDB, Amazon Glacier) -> Analyze (Amazon Redshift, Amazon EMR, Amazon EC2).]
4. Amazon Redshift: petabyte scale; massively parallel; relational data warehouse; fully managed, zero admin. A lot faster, a lot cheaper, a whole lot simpler.
5. Common Customer Use Cases
  - Traditional Enterprise DW: reduce costs by extending the DW rather than adding hardware; migrate completely from existing DW systems; respond faster to the business.
  - Companies with Big Data: improve performance by an order of magnitude; make more data available for analysis; access business data via standard reporting tools.
  - SaaS Companies: add analytic functionality to applications; scale DW capacity as demand grows; reduce hardware and software costs by an order of magnitude.
6. Amazon Redshift Customers [customer logos]
7. Growing Ecosystem [partner logos]
8. Data Loading Options: parallel upload to Amazon S3; AWS Direct Connect; AWS Import/Export; Amazon Kinesis; data integration partners and systems integrators.
9. Amazon Redshift Architecture
  - Leader node: SQL endpoint (JDBC/ODBC); stores metadata; coordinates query execution.
  - Compute nodes: local, columnar storage; execute queries in parallel; load, backup, and restore via Amazon S3; load from Amazon DynamoDB or SSH; linked by a 10 GigE (HPC) interconnect.
  - Two hardware platforms, optimized for data processing: DW1 (HDD), scaling from 2 TB to 1.6 PB; DW2 (SSD), scaling from 160 GB to 256 TB.
10. Amazon Redshift Node Types
  - DW1 (HDD): optimized for I/O-intensive workloads; high disk density; on demand at $0.85/hour; as low as $1,000/TB/year; scales from 2 TB to 1.6 PB.
      DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage.
      DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed storage, 2 GB/sec scan rate.
  - DW2 (SSD): high performance at smaller storage sizes; high compute and memory density; on demand at $0.25/hour; as low as $5,500/TB/year; scales from 160 GB to 256 TB.
      DW2.L (new): 16 GB RAM, 2 cores, 160 GB compressed SSD storage.
      DW2.8XL (new): 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage.
11. Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps, direct-attached storage. With row storage you do unnecessary I/O: to get the total of the Amount column, you have to read every row in full.

    ID  | Age | State | Amount
    ----+-----+-------+-------
    123 | 20  | CA    | 500
    345 | 25  | WA    | 250
    678 | 40  | FL    | 125
    957 | 37  | WA    | 375

12. With column storage, you only read the data you need: the same total reads only the blocks of the Amount column.
13. Data compression: COPY compresses automatically; you can analyze and override (see the sketch below); more performance, less cost.

    analyze compression listing;

    Table   | Column         | Encoding
    --------+----------------+----------
    listing | listid         | delta
    listing | sellerid       | delta32k
    listing | eventid        | delta32k
    listing | dateid         | bytedict
    listing | numtickets     | bytedict
    listing | priceperticket | delta32k
    listing | totalprice     | mostly32
    listing | listtime       | raw
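The encodings suggested above can also be applied explicitly when a table is created. A minimal sketch; the column types are assumptions, since the slide shows only the encodings:

    -- Explicit per-column encodings, matching the analyze compression output.
    create table listing (
      listid         integer      encode delta,
      sellerid       integer      encode delta32k,
      eventid        integer      encode delta32k,
      dateid         smallint     encode bytedict,
      numtickets     smallint     encode bytedict,
      priceperticket decimal(8,2) encode delta32k,
      totalprice     decimal(8,2) encode mostly32,
      listtime       timestamp    encode raw
    );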
14. Zone maps: track the minimum and maximum value for each block, and skip over blocks that don't contain relevant data. [Diagram: sorted blocks holding values 10..324, 375..623, and 637..959; a predicate outside a block's min/max range never reads that block.]
15. Direct-attached storage: use local storage for performance; maximize scan rates; automatic replication and continuous backup; HDD and SSD platforms.
16. Amazon Redshift parallelizes and distributes everything: query, load, backup/restore, resize.
17. Load: in parallel from Amazon S3, Amazon DynamoDB, or any SSH connection; data is automatically distributed and sorted according to the DDL; scales linearly with the number of nodes.
18. Backup/Restore: backups to Amazon S3 are automatic, continuous, and incremental, with a configurable system snapshot retention period; take user snapshots on demand; cross-region backups for disaster recovery; streaming restores let you resume querying faster.
19. Resize: resize while remaining online; a new cluster is provisioned in the background, and data is copied in parallel from node to node; you are only charged for the source cluster.
20. Resize (cont'd): automatic SQL endpoint switchover via DNS; the source cluster is then decommissioned; a simple operation via the Console or API.
21. Amazon Redshift is priced to let you analyze all your data: number of nodes x cost per hour; no charge for the leader node; no upfront costs; pay as you go.

    DW1 (HDD)          | Price per hour (DW1.XL single node) | Effective annual price per TB
    -------------------+-------------------------------------+------------------------------
    On-Demand          | $0.850                              | $3,723
    1 Year Reservation | $0.500                              | $2,190
    3 Year Reservation | $0.228                              | $999

    DW2 (SSD)          | Price per hour (DW2.L single node)  | Effective annual price per TB
    -------------------+-------------------------------------+------------------------------
    On-Demand          | $0.250                              | $13,688
    1 Year Reservation | $0.161                              | $8,794
    3 Year Reservation | $0.100                              | $5,498
22. Amazon Redshift has security built in
  - SSL to secure data in transit (ECDHE for perfect forward secrecy); load encrypted files from S3.
  - Encryption to secure data at rest: AES-256, hardware accelerated; all blocks on disk and in Amazon S3 encrypted; on-premises HSM and AWS CloudHSM support.
  - No direct access to compute nodes; audit logging and AWS CloudTrail integration; Amazon VPC support.
  - SOC 1/2/3, PCI-DSS Level 1, FedRAMP.
  [Diagram: customer VPC with JDBC/ODBC access to the leader node; internal VPC with the 10 GigE (HPC) interconnect, ingestion, and backup/restore paths.]
23. Amazon Redshift continuously backs up your data and recovers from failures
  - Replication within the cluster and backup to Amazon S3 maintain multiple copies of the data at all times.
  - Backups to Amazon S3 are continuous, automatic, and incremental; designed for eleven nines of durability.
  - Continuous monitoring and automated recovery from failures of drives and nodes.
  - Snapshots can be restored to any Availability Zone within a region; backups to a second region for disaster recovery are easily enabled.
24. 50+ new features since launch in Feb 2013
  - Regions: N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney.
  - Certifications: SOC 1/2/3, PCI-DSS Level 1, FedRAMP, others.
  - Security: load/unload of encrypted files, resource-level IAM, temporary credentials, HSM, ECDHE for perfect forward secrecy.
  - Manageability: snapshot sharing; backup, restore, and resize progress indicators; cross-region backups.
  - Query: regex, cursors, MD5, SHA1, time zones, workload queue timeout, HyperLogLog, concurrency up to 50 slots.
  - Ingestion: S3 manifest files, LZOP/LZO, JSON built-ins, 4-byte UTF-8, invalid-character substitution, CSV, automatic datetime format detection, epoch, ingest from SSH, JSON, and EMR.
25. MicroStrategy: the industry's best BI, web, and mobile applications, on demand in the cloud. May 2014.
26. You've Got the Data. Now What?
27. Deriving Big Insights from Big Data: trends in business analytics; popular use cases; agile business intelligence; governed self-service; information-driven mobile apps; Redshift certified; customer success.
28. Popular Use Cases (retail)
  - Customer Analytics: information-driven purchasing (reviews, searches, pricing comparisons, social networks, recommendations); omni-channel; Customer 360.
  - Sales Enablement: improve sales efficiency and effectiveness; data blending (sales, product, marketing); CRM integration; mobility (BYOD, caching); information-driven experience (interaction, videos, documents).
  - Retail: in-store apps (analytics, customer 360, personal shopper); store of the future; real-time decisions; inventory management.
29. Self-Service Analytics Revolutionizes Traditional BI
  - Boost user satisfaction while massively increasing productivity: more productive (more content per creator, 5-10x more content); more producers (more users can create content, 5-10x more content creators); more collaborative (peer-to-peer sharing, 5-10x more sharing). Over 100x more content creation and consumption.
30. Governed Self-Service
31. World-Class Information-Driven Mobile Apps
  - The future of dashboards: more than graphs; multimedia; transaction-enabled; live, comprehensive data; intuitive and easy to use; guided workflows for a consistent user experience; personalized for each user.
32. Business User Access to 1000s of Data Sources
  - Faster access to your data, with an enterprise-certified single version of the truth (MicroStrategy modeled data):
      Personal or departmental: spreadsheets, Access databases, CSV, public data downloads, etc.
      Cloud-based data: Salesforce.com, NetSuite, Facebook, Eloqua, Google Docs, etc.
      Relational databases: Redshift, Oracle, SQL Server, MySQL, Teradata, Netezza, etc.
      Big Data and Hadoop: EMR, MapR, Cloudera, Hortonworks, etc.
      Enterprise applications: SAP, Oracle e-Business, Siebel, PeopleSoft, etc.
  - Redshift Certified: Redshift integration.
33. Customer Success: Netflix has deployed the MicroStrategy business intelligence platform on top of Amazon Elastic MapReduce (Amazon EMR) for interactive insights. Netflix analysts use advanced visualizations to explore the performance of its streaming service close to recorded time, directly accessing Hadoop data on an ad hoc basis and without writing code.
34. MicroStrategy Analytics: enterprise business agility with trusted, governed data
  - Big Data Analytics: transform your growing Big Data resources into insight and profit.
  - Self-Service Analytics: see and understand data in minutes; no IT needed.
  - Enterprise-Grade Business Intelligence: produce and publish trusted analytics to improve your business operations.
  - AWS Partner; comprehensive analytics platform; #1 in mobile analytics.
35. Experience MicroStrategy on AWS Today!
36. Ingestion Best Practices
  - Goal: with 1 leader node and n compute nodes, leverage all the compute nodes and minimize overhead.
  - Preferred method: COPY from S3. It loads data in sorted order through the compute nodes. Use a single COPY command and split the data into multiple files (see the sketch after this slide). We strongly recommend that you gzip large datasets:

    copy time
    from 's3://mybucket/data/timerows.gz'
    credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    gzip
    delimiter '|';

  - If you must ingest through SQL, use multi-row inserts, and avoid large numbers of singleton insert/update/delete operations. To copy from another table, use CREATE TABLE AS or INSERT INTO ... SELECT.

    insert into category_stage values
    (default, default, default, default),
    (20, default, 'Country', default),
    (21, 'Concerts', 'Rock', default);
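To illustrate the split-files advice above: COPY loads every S3 object whose key starts with the given prefix, so one command ingests all the parts in parallel. A minimal sketch, with a hypothetical bucket and a table split into four files (venue.txt.1 through venue.txt.4):

    -- All objects under the prefix s3://mybucket/data/venue.txt are loaded,
    -- i.e. venue.txt.1 ... venue.txt.4; ideally the file count is a
    -- multiple of the number of slices in the cluster.
    copy venue
    from 's3://mybucket/data/venue.txt'
    credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    delimiter '|';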
37. Ingestion Best Practices (cont'd)
  - Verify the load data files: in US East, S3 provides eventual consistency, so confirm your files are present in S3 by listing object keys. Query Amazon Redshift after the load; this query returns one entry per file loaded into the TICKIT database tables:

    select query, trim(filename), curtime, status
    from stl_load_commits
    where filename like '%tickit%'
    order by query;

    query | btrim                      | curtime                    | status
    ------+----------------------------+----------------------------+--------
    22475 | tickit/allusers_pipe.txt   | 2013-02-08 20:58:23.274186 | 1
    22478 | tickit/venue_pipe.txt      | 2013-02-08 20:58:25.070604 | 1
    22480 | tickit/category_pipe.txt   | 2013-02-08 20:58:27.333472 | 1
    22482 | tickit/date2008_pipe.txt   | 2013-02-08 20:58:28.608305 | 1
    22485 | tickit/allevents_pipe.txt  | 2013-02-08 20:58:29.99489  | 1
    22487 | tickit/listings_pipe.txt   | 2013-02-08 20:58:37.632939 | 1
    22593 | tickit/allusers_pipe.txt   | 2013-02-08 21:04:08.400491 | 1
    22596 | tickit/venue_pipe.txt      | 2013-02-08 21:04:10.056055 | 1
    22598 | tickit/category_pipe.txt   | 2013-02-08 21:04:11.465049 | 1
    22600 | tickit/date2008_pipe.txt   | 2013-02-08 21:04:12.461502 | 1
    22603 | tickit/allevents_pipe.txt  | 2013-02-08 21:04:14.785124 | 1
38. Ingestion Best Practices (cont'd)
  - Amazon Redshift does not support an UPSERT statement. Use a staging table to perform an upsert: join the staging table to the target, UPDATE the matching rows, then INSERT the new ones (see the sketch below).
  - Primary key constraints are NOT enforced: if you COPY the same data twice, you get a duplicate copy.
  - Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count:

    set wlm_query_slot_count to 3;

  - Run the ANALYZE command whenever you've made a non-trivial number of changes to your data, to keep table statistics current.
  - The STL_LOAD_ERRORS system table is helpful in troubleshooting data load issues: it records the errors that occurred during specific loads. Adjust MAXERROR as needed.
  - Check the character set: only UTF-8 is supported.
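A minimal sketch of the staging-table upsert described above, assuming hypothetical target and staging tables keyed on id:

    begin;

    -- Update rows that already exist in the target.
    update target
    set value = s.value
    from staging s
    where target.id = s.id;

    -- Insert rows that do not exist in the target yet.
    insert into target
    select s.*
    from staging s
    left join target t on s.id = t.id
    where t.id is null;

    end;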
39. Choose a Sort Key
  - Goal: skip over data blocks to minimize I/O.
  - Best practice: sort on columns used in range or equality predicates (the WHERE clause); if you access recent data frequently, sort on a TIMESTAMP column (see the sketch below).
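For example, a table that is mostly queried by recency could be declared as follows; a sketch, with a hypothetical table and columns:

    -- Range predicates on created_at can skip whole blocks via zone maps.
    create table events (
      event_id   bigint,
      event_type varchar(32),
      created_at timestamp
    )
    sortkey (created_at);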
40. Choose a Distribution Key
  - Goal: distribute data evenly across the nodes; minimize data movement among nodes (co-located joins and co-located aggregates).
  - Best practice: consider using the join key as the distribution key (the JOIN clause); with multiple joins, use the foreign key of the largest dimension as the distribution key; consider using a GROUP BY column as the distribution key. Avoid keys used as equality filters as your distribution key. For de-normalized tables with no aggregates, do not specify a distribution key; Redshift will use round robin. A sketch follows this list.
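A sketch of co-locating a join under these rules, with hypothetical fact and dimension tables distributed on the join key:

    -- Rows with the same customer_id land on the same node,
    -- so the join below needs no data movement.
    create table customers (
      customer_id bigint distkey,
      name        varchar(64)
    );

    create table orders (
      order_id    bigint,
      customer_id bigint distkey,   -- same distribution key as customers
      total       decimal(12,2),
      order_date  timestamp
    )
    sortkey (order_date);

    select c.name, sum(o.total)
    from orders o
    join customers c on o.customer_id = c.customer_id
    group by c.name;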
41. Query Performance Best Practices
  - Encode date and time using the TIMESTAMP data type instead of CHAR.
  - Specify constraints: Redshift does not enforce constraints (primary key, foreign key, unique values), but the optimizer uses them; loading and/or applications need to keep them valid.
  - Specify redundant predicates on the sort columns:

    select * from tab1, tab2
    where tab1.key = tab2.key
    and tab1.timestamp > '1/1/2013'
    and tab2.timestamp > '1/1/2013';
42. Workload Manager
  - Lets you manage and adjust query concurrency. WLM allows you to increase query concurrency up to 50, define user groups and query groups (see the sketch below), segregate short- and long-running queries, and help improve the performance of individual queries.
  - Be aware that a query's work is distributed to every compute node, so increasing concurrency may not always help, due to resource contention (CPU, memory, and I/O). Total throughput may increase by letting one query complete first while other queries wait.
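As an illustration, a session can steer its queries to a specific WLM queue by setting a query group; the group name below is hypothetical and must match a queue defined in the cluster's WLM configuration:

    -- Route the following queries to the queue configured for 'short_queries'.
    set query_group to 'short_queries';

    select count(*) from orders;

    reset query_group;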
43. Workload Best Practices
  - Organize and keep your load files in S3; this allows re-runs or scenario testing as you evolve your workflow on the platform. Keep them in S3 or Glacier for fiscal/legal reasons.
  - For data that is updated in the short term, consider a short-term version of the table for staging and a long-term version once the data stabilizes.
  - Use round-robin distribution (no distribution key) when you don't have a good distribution key; check Part 1 for the query that checks for distribution skew, and weigh the trade-off against co-located joins.
  - When loading the target (final) table, use a chronological date/timestamp column as the first sort key; VACUUM is then needed less often and runs faster. When the first sort column has low cardinality/resolution (e.g., date instead of timestamp), subsequent sort columns should match common filters and/or grouping columns.
44. Workload Best Practices (cont'd)
  - Use the UNLOAD command to archive data that is not needed for business reasons; data that needs to exist only for fiscal/legal reasons can be re-loaded as needed (see the sketch after this slide).
  - Consider applying retention policies less often than the regular workflow, e.g. a weekly/monthly process during a less busy time.
  - Make space provision for the data growth.
  - Make sure all queries have date/timestamp range filters (> and <).
  - ... Snapshot -> Spin up query clusters -> Tear down. High ratio: consider performance above space needs when choosing the number of nodes.
  - Normalization rule of thumb: de-normalize only to avoid large non-collocated joins. Slowly changing dimensions (type II): keep them normalized and match the distkey with the fact table.
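A sketch of the UNLOAD archival pattern above, with a hypothetical table, retention date, and bucket:

    -- Archive rows outside the retention window to S3 (gzip), then drop them.
    unload ('select * from orders where order_date < ''2013-01-01''')
    to 's3://mybucket/archive/orders_2012_'
    credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    gzip;

    delete from orders where order_date < '2013-01-01';

    -- Reclaim the space and re-sort the remaining rows.
    vacuum orders;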
46. Space Management
  - Redshift has a single pool of space used for tables and temporary segments. Loads need 2.5 times the space of the data being loaded if the table has a sort key, and VACUUM may need 2.5 times the size of the table.
  - Monitor the free space via the Performance tab in the console, CloudWatch alarms, or SQL.
47. Space Management (cont'd)
  - Table sizes:

    select trim(pgdb.datname) as "database",
           trim(pgn.nspname) as "schema",
           trim(a.name) as "table",
           b.mbytes,
           a.rows
    from (select db_id, id, name, sum(rows) as rows
          from stv_tbl_perm a
          group by db_id, id, name) as a
    join pg_class as pgc on pgc.oid = a.id
    join pg_namespace as pgn on pgn.oid = pgc.relnamespace
    join pg_database as pgdb on pgdb.oid = a.db_id
    join (select tbl, count(*) as mbytes
          from stv_blocklist
          group by tbl) as b on a.id = b.tbl
    order by mbytes desc, a.db_id, a.name;

  - Free space:

    select sum(capacity)/1024 as capacity_gbytes,
           sum(used)/1024 as used_gbytes,
           (sum(capacity) - sum(used))/1024 as free_gbytes
    from stv_partitions
    where part_begin = 0;

  - Redshift allows you to resize your cluster up and down and across node types, online (read-only access during the resize).
48. For more best practices, search YouTube for "Amazon Redshift Best Practices". Thank you!