Tom Johnston, S3 Product Management, AWS
Tom Fuller, Senior Solutions Architect, AWS
John Elliott, Infrastructure Engineering, Pinterest
April 19, 2017
Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
Cloud Data Migration
• Direct Connect
• Snow* data transport family
• 3rd Party Connectors
• Transfer Acceleration
• Storage Gateway
• Amazon Kinesis Firehose
The AWS Storage Portfolio
• Object: Amazon S3, Amazon Glacier
• Block: Amazon EBS (persistent), Amazon EC2 Instance Store (ephemeral)
• File: Amazon EFS
What to Expect from the Session
• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage
Data transfer into Amazon S3
• AWS Direct Connect
• AWS Snowball
• AWS Snowball Edge
• AWS Snowmobile
• ISV Connectors
• Amazon Kinesis Firehose
• S3 Transfer Acceleration
• AWS Storage Gateway
Amazon Storage Partner Solutions
• Primary Storage: solutions that leverage file, block, object, and streamed data formats as an extension to on-premises storage
• Backup and Recovery: solutions that leverage Amazon S3 for durable data backup
• Archive: solutions that leverage Amazon Glacier for durable and cost-effective long-term data backup
Note: Represents a sample of storage partners. See aws.amazon.com/backup-recovery/partner-solutions/
Choice of storage classes on S3
• Standard: active data
• Standard - Infrequent Access: infrequently accessed data
• Amazon Glacier: archive data
Storage classes designed for your use case
S3 Standard
• Big data analysis
• Content distribution
• Static website hosting
Standard - IA
• Backup & archive
• Disaster recovery
• File sync & share
• Long-retained data
Amazon Glacier
• Long term archives
• Digital preservation
• Magnetic tape replacement
When should you move to Standard-IA?
S3 Analytics - storage class analysis
• Visualize the access pattern on your data over time
• Measure the object age at which data becomes infrequently accessed
• Dive deep by bucket, prefix, or specific object tag
• Easily create a lifecycle policy based on the analysis
Visualize access pattern on your data
Export S3 Analytics to the tools of your choice
Topics
• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage
Automate data management: Lifecycle policies
• Automatic tiering and cost controls
• Includes two possible actions:
  • Transition: moves objects to Standard - IA or Amazon Glacier based on the object age you specify
  • Expiration: deletes objects after a specified time
• Actions can be combined
• Set policies by bucket, prefix, or tags
• Set policies for current versions or non-current versions
Lifecycle policies
Set up a lifecycle policy on the AWS Management Console
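The same kind of policy can also be set programmatically. The following is a minimal sketch, assuming the boto3 SDK is used; the prefix, day counts, and function names are illustrative choices, not values from the session. It builds one rule that combines both actions (transition and expiration) and also expires non-current versions.

```python
def build_lifecycle_configuration(prefix, ia_after_days, glacier_after_days,
                                  expire_after_days, noncurrent_expire_days):
    """Return a lifecycle configuration dict in the shape boto3 expects."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError(
            "expected ia_after_days < glacier_after_days < expire_after_days")
    rule = {
        "ID": "tier-and-expire",            # hypothetical rule name
        "Filter": {"Prefix": prefix},       # policy scoped by prefix
        "Status": "Enabled",
        # Transition action: tier by object age.
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        # Expiration action: delete current versions after a set time.
        "Expiration": {"Days": expire_after_days},
        # Separate setting for non-current versions.
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_expire_days},
    }
    return {"Rules": [rule]}


def apply_lifecycle(bucket, configuration):
    """Send the configuration to S3 (needs boto3 and AWS credentials)."""
    import boto3
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration)


config = build_lifecycle_configuration("logs/", 30, 90, 365, 30)
```

Calling `apply_lifecycle("example-bucket", config)` would replace any lifecycle configuration already on that bucket, so existing rules should be merged in first.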
Protect your data from accidental deletes: Versioning (Best Practice)
• Protects from unintended user deletes or application logic failures
• New version with every upload
• Easy retrieval of deleted objects and rollback to previous versions
Easily recover from unintended deletes (Best Practice)
Tip: Create a recycle bin for your storage
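On a versioned bucket, a plain delete adds a delete marker rather than erasing data, so recovery amounts to removing that marker. Below is a rough sketch of that idea, assuming boto3; the helper names are hypothetical, and the listing call is not paginated here, which a real tool would need to handle.

```python
def latest_delete_markers(listing, key=None):
    """Pick the delete markers that currently hide an object.

    `listing` is a dict shaped like a list_object_versions response.
    """
    return [
        {"Key": m["Key"], "VersionId": m["VersionId"]}
        for m in listing.get("DeleteMarkers", [])
        if m["IsLatest"] and (key is None or m["Key"] == key)
    ]


def undelete(bucket, key):
    """Remove the latest delete marker so the previous version is current again."""
    import boto3  # needs AWS credentials; first page of results only
    s3 = boto3.client("s3")
    listing = s3.list_object_versions(Bucket=bucket, Prefix=key)
    for marker in latest_delete_markers(listing, key):
        s3.delete_object(Bucket=bucket, Key=marker["Key"],
                         VersionId=marker["VersionId"])


# Hand-written sample listing, for illustration only.
sample_listing = {
    "Versions": [{"Key": "report.csv", "VersionId": "v1", "IsLatest": False}],
    "DeleteMarkers": [
        {"Key": "report.csv", "VersionId": "dm2", "IsLatest": True},
        {"Key": "old.csv", "VersionId": "dm1", "IsLatest": False},
    ],
}
to_remove = latest_delete_markers(sample_listing)
```

One way to get the "recycle bin" behavior is to pair versioning with a lifecycle rule that expires non-current versions after a retention window, so deleted data stays recoverable for a bounded time.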
Automate with trigger-based workflow: Amazon S3 event notifications
Events are delivered to an SNS topic, an SQS queue, or a Lambda function
• Notification when objects are created via PUT, POST, Copy, or Multipart Upload, or removed via DELETE
• Filter on prefixes and suffixes
• Trigger workflow with Amazon SNS, Amazon SQS, and AWS Lambda functions
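To make the Lambda path concrete, here is a small sketch of a handler that reads the bucket, key, and event name out of each notification record. The handler name, the returned summary shape, and the sample event are illustrative assumptions; object keys arrive URL-encoded in the event, so they are decoded before use.

```python
from urllib.parse import unquote_plus


def matches(key, prefix="", suffix=""):
    """Client-side equivalent of the prefix/suffix filter on a notification."""
    return key.startswith(prefix) and key.endswith(suffix)


def handler(event, context):
    """Summarize each S3 record in the notification."""
    results = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        results.append({
            "event": record["eventName"],
            "bucket": s3["bucket"]["name"],
            "key": unquote_plus(s3["object"]["key"]),  # keys are URL-encoded
            "size": s3["object"].get("size"),          # absent on delete events
        })
    return results


# Trimmed, hand-written sample event for local experimentation.
sample_event = {
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "example-bucket"},
            "object": {"key": "images/my+photo.jpg", "size": 2048},
        },
    }]
}
summary = handler(sample_event, None)
```

In practice the prefix and suffix filter is usually configured on the bucket notification itself, so the function is only invoked for matching keys; `matches` is included just to show the rule.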
Cross-region replication
Automated, fast, and reliable asynchronous replication of data across AWS Regions
Use cases:
• Compliance - store data hundreds of miles apart
• Lower latency - distribute data to regional customers
• Security - create remote replicas managed by separate AWS accounts
How it works:
• Only replicates new PUTs. Once configured, all new uploads into the source bucket will be replicated
• Entire bucket or prefix based
• 1:1 replication between any 2 regions
• Versioning required
• Deletes and lifecycle actions are not replicated
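As a rough illustration of the "how it works" points, the sketch below builds a replication configuration for boto3's `put_bucket_replication` and enables versioning on the source bucket first. The role ARN, bucket names, and rule ID are placeholders, and the prefix-based rule shape shown is the older, simpler schema; newer configurations may use a different rule layout.

```python
def build_replication_configuration(role_arn, destination_bucket, prefix=""):
    """One rule: replicate new objects under `prefix` to one destination bucket.

    An empty prefix means the entire bucket.
    """
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate on your behalf
        "Rules": [{
            "ID": "replicate-new-objects",   # hypothetical rule name
            "Prefix": prefix,
            "Status": "Enabled",
            "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
        }],
    }


def enable_replication(source_bucket, configuration):
    """Turn on versioning (required), then attach the replication rules."""
    import boto3  # needs AWS credentials
    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket=source_bucket,
        VersioningConfiguration={"Status": "Enabled"})
    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=configuration)


replication = build_replication_configuration(
    "arn:aws:iam::123456789012:role/example-replication-role",  # placeholder
    "example-destination-bucket",
    prefix="invoices/",
)
```

The destination bucket must also have versioning enabled and must live in a different Region for this to be cross-region replication.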
Summary – automate management tasks
• Automate transition and expiration with lifecycle policies
• Easily recover from accidental delete with versioning
• Trigger-based workflow with event notification
• Cross-region replication
Topics
• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage
Faster upload over long distances: S3 Transfer Acceleration
Uploader to AWS Edge Location to S3 Bucket, with optimized throughput
• Change your endpoint, not your code
• No firewall changes or client software
• Longer distance, larger files, more benefit
• Faster or free
• 68 global edge locations
• Try it at S3speedtest.com
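"Change your endpoint, not your code" can be illustrated with a short sketch. It assumes acceleration has already been enabled on the bucket; the function names are illustrative. The first helper derives the accelerate hostname for a bucket, and the second shows the boto3 client option that routes requests through that endpoint.

```python
def accelerate_endpoint(bucket):
    """Return the Transfer Acceleration endpoint URL for a bucket."""
    if "." in bucket:
        # Accelerated buckets need DNS-compliant names without periods.
        raise ValueError("Transfer Acceleration bucket names must not contain periods")
    return f"https://{bucket}.s3-accelerate.amazonaws.com"


def accelerated_client():
    """Build an S3 client that uses the accelerate endpoint (needs boto3)."""
    import boto3
    from botocore.config import Config
    return boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))


endpoint = accelerate_endpoint("example-bucket")
```

Existing upload code can stay the same; only the client construction changes.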
Faster upload of large objects: Parallelize PUTs with multipart uploads (Best Practice)
• Increase aggregate throughput by parallelizing PUTs on high-bandwidth networks
• Move the bottleneck to the network, where it belongs
• Increase resiliency to network errors; fewer large restarts on error-prone networks
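The first step in a multipart upload is deciding how to split the object. Here is a minimal sketch of that planning step; the 8 MiB default part size and the function name are assumptions for illustration. Each planned part can then be uploaded independently, and a failed part is retried alone rather than restarting the whole object.

```python
MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 limit on parts per upload


def plan_parts(total_size, part_size=8 * 1024 * 1024):
    """Split an object into (part_number, offset, length) tuples."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    parts = []
    offset = 0
    number = 1  # part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts


MIB = 1024 * 1024
plan = plan_parts(20 * MIB)
```

With boto3, `upload_file` together with a `TransferConfig` (`multipart_chunksize`, `max_concurrency`) handles this splitting and the parallel PUTs for you, so hand-rolled planning is mainly useful for custom uploaders.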
Faster download: You can parallelize GETs as well as PUTs
  GET /example-object HTTP/1.1
  Host: example-bucket.s3.amazonaws.com
  x-amz-date: Thu, 28 Jan 2016 21:32:02 GMT
  Range: bytes=0-9
  Authorization: AWS AKIAIOSFODNN7EXAMPLE:Yxg83MZaEgh3OZ3l0rLo5RTX11o=
For large objects, use range-based GETs and align your GET ranges with your parts
For content distribution, enable Amazon CloudFront
• Caches objects at the edge
• Low latency data transfer to end user
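The range-based GET pattern can be sketched as follows. The code computes inclusive byte ranges, fetches them concurrently, and reassembles them in order. The `fetch_range` callable is injected so the logic can be tried without AWS; `s3_fetcher` is one possible boto3-backed implementation, and all names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(total_size, part_size):
    """Inclusive (start, end) pairs, matching the HTTP Range header format."""
    return [(start, min(start + part_size, total_size) - 1)
            for start in range(0, total_size, part_size)]


def parallel_get(fetch_range, total_size, part_size, workers=4):
    """Fetch all ranges concurrently and join them in the original order."""
    ranges = byte_ranges(total_size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order even if fetches finish out of order.
        chunks = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(chunks)


def s3_fetcher(bucket, key):
    """Return a fetch_range function backed by S3 (needs boto3 and credentials)."""
    import boto3
    s3 = boto3.client("s3")

    def fetch_range(start, end):
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        return resp["Body"].read()

    return fetch_range


# Local stand-in for an object, for illustration only.
data = bytes(range(256)) * 10
result = parallel_get(lambda s, e: data[s:e + 1], len(data), part_size=100)
```

Choosing `part_size` equal to the part size used at upload time is what aligns the GET ranges with the object's parts.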
SQL Query on S3
Amazon Athena
• No loading of data
• Serverless
• Supports text, CSV, TSV, JSON, AVRO, and columnar formats such as Apache ORC and Apache Parquet
• Faster upload over long distances with S3 Transfer Acceleration
• Faster upload for large objects with S3 multipart upload
• Optimize GET performance with Range GET and CloudFront
• SQL Query on S3 with Athena
• Distribute key names for high-TPS workloads
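The session does not spell out how to distribute key names. One common approach, shown here purely as an assumption, is to prepend a short hash of the key so that sequential names (timestamps, counters) no longer share a leading prefix. The function name and the four-character prefix length are illustrative.

```python
import hashlib


def distributed_key(key, prefix_length=4):
    """Prepend a short, deterministic hash so keys spread across prefixes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_length]}/{key}"


# Sequential names get unrelated leading characters.
keys = [distributed_key(f"2017-04-19/log-{n:04d}.gz") for n in range(3)]
```

The trade-off is that listing by the original prefix is no longer possible, so this pattern suits workloads that look objects up by exact key.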
Topics
• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage
Organize your data with object tags
Manage data based on what it is as opposed to where it's located
• Classify your data, up to 10 tags per object
• Tag your objects with key-value pairs
• Write policies once based on the type of data
• Put object with tag or add tag to existing objects
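Both tagging paths can be sketched with boto3's parameter shapes: `put_object` takes tags as a URL-encoded query string, while `put_object_tagging` takes a tag set for an existing object. The helper names and sample tags are illustrative, and the sketch enforces the 10-tag limit mentioned above.

```python
from urllib.parse import urlencode

MAX_TAGS_PER_OBJECT = 10


def _check(tags):
    if len(tags) > MAX_TAGS_PER_OBJECT:
        raise ValueError("an object can carry at most 10 tags")


def tagging_query(tags):
    """Tags for a new object: put_object(..., Tagging=tagging_query(tags))."""
    _check(tags)
    return urlencode(tags)


def tag_set(tags):
    """Tags for an existing object: put_object_tagging(..., Tagging=tag_set(tags))."""
    _check(tags)
    return {"TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]}


tags = {"project": "blue", "classification": "PHI data"}  # sample values
query = tagging_query(tags)
payload = tag_set(tags)
```

Note that `put_object_tagging` replaces the object's whole tag set, so existing tags need to be read and merged if they should be kept.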
Audit and monitor access: AWS CloudTrail data events
Use cases:
• Perform security analysis
• Meet your IT auditing and compliance needs
• Take immediate action on activity
How it works:
• Capture S3 object-level requests
• Enable at the bucket level
• Logs delivered to your S3 bucket
• $0.10 per 100,000 data events
Monitor performance and operation: Amazon CloudWatch metrics for S3
• Generate metrics for data of your choice
• Entire bucket, prefixes, and tags
• Up to 1,000 groups per bucket
• 1-minute CloudWatch metrics
• Alert and alarm on metrics
• $0.30 per metric per month
CloudWatch Metrics for S3
Metric Name            Value
AllRequests            Count
PutRequests            Count
GetRequests            Count
ListRequests           Count
DeleteRequests         Count
HeadRequests           Count
PostRequests           Count
BytesDownloaded        MB
BytesUploaded          MB
4xxErrors              Count
5xxErrors              Count
FirstByteLatency       ms
TotalRequestLatency    ms
Example
S3 Inventory
• Save time
• Daily or weekly delivery
• Delivery to S3 bucket
• CSV file output
Use case: trigger business workflows and applications such as secondary index garbage collection, data auditing, and offline analytics
• More information about your objects than provided by the LIST API, such as replication status, multipart upload flag, and delete marker
• Simple pricing: $0.0025 per million objects listed
S3 Inventory
Eventually consistent rolling snapshot
• New objects may not be listed
• Removed objects may still be included

Name             Value Type   Description
Bucket           String       Bucket name. UTF-8 encoded.
Key              String       Object key name. UTF-8 encoded.
Version Id       String       Version ID of the object
Is Latest        boolean      true if object is the latest version (current version) of a versioned object, otherwise false
Delete Marker    boolean      true if object is a delete marker of a versioned object, otherwise false
Size             long         Object size in bytes
Last Modified    String       Last modified timestamp in ISO 8601 format: YYYY-MM-DDTHH:mm:ss.SSSZ
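For offline analytics, an inventory CSV file can be read with the standard library. The sketch below assumes the file has no header row and that its columns appear in the order of the fields listed above; the actual column order depends on which fields the inventory configuration selects. The field names and sample row are illustrative.

```python
import csv
import io
from datetime import datetime

# Assumed column order, following the field list in the slide.
FIELDS = ["bucket", "key", "version_id", "is_latest",
          "is_delete_marker", "size", "last_modified"]


def parse_inventory(text):
    """Turn inventory CSV text into a list of typed dicts."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        row = dict(zip(FIELDS, raw))
        row["is_latest"] = row["is_latest"].lower() == "true"
        row["is_delete_marker"] = row["is_delete_marker"].lower() == "true"
        # Delete markers have no size, so an empty value maps to None.
        row["size"] = int(row["size"]) if row["size"] else None
        row["last_modified"] = datetime.strptime(
            row["last_modified"], "%Y-%m-%dT%H:%M:%S.%fZ")
        rows.append(row)
    return rows


# Hand-written sample rows, not real inventory output.
sample = (
    '"example-bucket","photos/cat.jpg","v1","true","false","1024","2017-04-19T12:30:00.000Z"\n'
    '"example-bucket","photos/dog.jpg","v2","true","true","","2017-04-18T08:00:00.000Z"\n'
)
rows = parse_inventory(sample)
live_bytes = sum(r["size"] for r in rows if not r["is_delete_marker"])
```

Because the inventory is an eventually consistent snapshot, totals computed this way describe the bucket as of the report, not necessarily its current state.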