AWS offers many data services, each optimized for a specific set of structure, size, latency, and concurrency requirements. Making the best use of these specialized services has historically required custom, error-prone data transformation and transport. Now you can use the AWS Data Pipeline service to orchestrate data flows between Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, and on-premises data stores, seamlessly and efficiently applying EC2 instances and EMR clusters to process and transform data. In this session, we demonstrate how you can use AWS Data Pipeline to coordinate your Big Data workflows, applying the optimal data storage technology to each part of your data integration architecture. Swipely's Head of Engineering shows how Swipely uses AWS Data Pipeline to build batch analytics, backfilling all of its data while using resources efficiently. As a result, Swipely launches novel product features with less development time and less operational complexity.
Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline
Jon Einkauf (Sr. Product Manager, AWS)
What are some of the challenges in dealing with data?
1. Data is stored in different formats and locations, making it hard to integrate
[Diagram: data spread across Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS, and on-premises stores]
2. Data workflows require complex dependencies
[Diagram: "Input data ready?" check: Yes → run the step; No → keep waiting]
• For example, a data processing step may depend on:
  • Input data being ready
  • Prior step completing
  • Time of day
  • Etc.
3. Things go wrong: you must handle exceptions
• For example, do you want to:
• Retry in the case of failure?
• Wait if a dependent step is taking longer than expected?
• Be notified if something goes wrong?
4. Existing tools are not a good fit
• Expensive upfront licenses
• Scaling issues
• Don't support scheduling
• Not designed for the cloud
• Don't support newer data stores (e.g., Amazon DynamoDB)
Introducing AWS Data Pipeline
A simple pipeline
• Input DataNode with PreCondition check
• Activity with failure & delay notifications
• Output DataNode
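As a sketch of how this three-object pipeline could be defined programmatically, the following uses boto3's datapipeline client (a newer SDK than existed at this talk; the underlying JSON object model is the same one the console emits). Bucket paths, object IDs, and the SNS topic ARN are hypothetical:

```python
import boto3

dp = boto3.client("datapipeline")

# Register an empty pipeline, then push object definitions into it.
pipeline_id = dp.create_pipeline(
    name="simple-pipeline", uniqueId="simple-pipeline-v1"
)["pipelineId"]

definition = [
    # Input DataNode, gated by a precondition check.
    {"id": "InputNode", "name": "InputNode", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-bucket/input/"},
        {"key": "precondition", "refValue": "InputReady"},
    ]},
    {"id": "InputReady", "name": "InputReady", "fields": [
        {"key": "type", "stringValue": "S3PrefixNotEmpty"},
        {"key": "s3Prefix", "stringValue": "s3://example-bucket/input/"},
    ]},
    # Activity with failure & delay notifications.
    {"id": "CopyStep", "name": "CopyStep", "fields": [
        {"key": "type", "stringValue": "CopyActivity"},
        {"key": "input", "refValue": "InputNode"},
        {"key": "output", "refValue": "OutputNode"},
        {"key": "onFail", "refValue": "Alarm"},
        {"key": "onLateAction", "refValue": "Alarm"},
        {"key": "workerGroup", "stringValue": "default"},  # or runsOn -> Ec2Resource
    ]},
    {"id": "Alarm", "name": "Alarm", "fields": [
        {"key": "type", "stringValue": "SnsAlarm"},
        {"key": "topicArn",
         "stringValue": "arn:aws:sns:us-east-1:111122223333:pipeline-alerts"},
        {"key": "subject", "stringValue": "Pipeline alert"},
        {"key": "message", "stringValue": "CopyStep failed or is running late."},
    ]},
    # Output DataNode.
    {"id": "OutputNode", "name": "OutputNode", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-bucket/output/"},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=definition)
dp.activate_pipeline(pipelineId=pipeline_id)
```

A real definition also needs a Default object (schedule type, IAM roles), omitted here for brevity.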
Manages scheduled data movement and processing across AWS services
[Diagram: Activities move data among Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB, and Amazon RDS]
Preconditions gate whether work proceeds:
• Amazon DynamoDB table exists / has data
• Amazon S3 key exists
• Amazon S3 prefix is not empty
• Success of custom Unix/Linux shell command
• Success of other pipeline tasks
[Diagram: "S3 key exists?" precondition: Yes → the copy runs; No → keep waiting]
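As a sketch, the "S3 key exists" check from the diagram, plus a custom shell check, could be expressed as precondition objects like these (paths and IDs hypothetical); the other precondition types differ only in the type field and its arguments:

```python
# Hypothetical S3KeyExists precondition; attach it to a node or activity
# with a field like {"key": "precondition", "refValue": "LogsReady"}.
logs_ready = {
    "id": "LogsReady", "name": "LogsReady",
    "fields": [
        {"key": "type", "stringValue": "S3KeyExists"},
        {"key": "s3Key", "stringValue": "s3://example-bucket/logs/2013-11-15.gz"},
    ],
}

# A custom Unix/Linux shell check gates on the command's exit status.
custom_check = {
    "id": "CustomCheck", "name": "CustomCheck",
    "fields": [
        {"key": "type", "stringValue": "ShellCommandPrecondition"},
        {"key": "command", "stringValue": "test -s /tmp/input_ready"},
    ],
}
```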
Alerting and exception handling
• Notification
  • On failure
  • On delay
• Automatic retry logic
[Diagram: Task 1 → on success → Task 2; on failure, each task raises an alert]
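A sketch of how these hooks could attach to an activity (alarm IDs and timings hypothetical): onFail and onLateAction reference SnsAlarm objects, and maximumRetries controls the automatic retry logic:

```python
# Hypothetical fields on an activity wiring in alerting and retries.
task_fields = [
    {"key": "type", "stringValue": "CopyActivity"},
    {"key": "maximumRetries", "stringValue": "5"},          # retry before marking FAILED
    {"key": "onFail", "refValue": "FailureAlarm"},          # SnsAlarm: notify on failure
    {"key": "onLateAction", "refValue": "DelayAlarm"},      # SnsAlarm: notify on delay
    {"key": "lateAfterTimeout", "stringValue": "1 hours"},  # how long counts as "late"
]
```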
Flexible scheduling
• Choose a schedule
  • Run every: 15 minutes, hour, day, week, etc.
  • User defined
• Backfill support
  • Start pipeline on past date
  • Rapidly backfills to present day
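A schedule is itself a pipeline object that activities reference. A sketch of an hourly schedule whose start date lies in the past, which is what triggers a backfill (dates and IDs hypothetical):

```python
# Hypothetical hourly schedule; a startDateTime in the past makes
# Data Pipeline rapidly backfill runs from that date to the present.
hourly = {
    "id": "HourlySchedule", "name": "HourlySchedule",
    "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 hours"},
        {"key": "startDateTime", "stringValue": "2013-11-01T00:00:00"},
    ],
}
# An activity references it with {"key": "schedule", "refValue": "HourlySchedule"}.
```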
Massively scalable
• Creates and terminates AWS resources (Amazon EC2 and Amazon EMR) to process data
• Manages resources in multiple regions
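A sketch of the two resource types (instance types, counts, and regions hypothetical); an activity points at one via runsOn, and the service launches the resource for each run and terminates it afterward:

```python
# Hypothetical Ec2Resource: an EC2 instance created per scheduled run.
ec2_worker = {
    "id": "Worker", "name": "Worker",
    "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "instanceType", "stringValue": "m1.small"},
        {"key": "terminateAfter", "stringValue": "1 hours"},  # safety cutoff
        {"key": "region", "stringValue": "us-west-2"},        # can differ per resource
    ],
}

# Hypothetical EmrCluster: a transient Hadoop cluster for bigger jobs.
emr_cluster = {
    "id": "Cluster", "name": "Cluster",
    "fields": [
        {"key": "type", "stringValue": "EmrCluster"},
        {"key": "masterInstanceType", "stringValue": "m1.small"},
        {"key": "coreInstanceType", "stringValue": "m1.small"},
        {"key": "coreInstanceCount", "stringValue": "2"},
        {"key": "terminateAfter", "stringValue": "2 hours"},
    ],
}
# An activity selects its compute with {"key": "runsOn", "refValue": "Worker"}.
```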
Easy to get started
• Templates for common use cases
• Graphical interface
• Natively understands CSV and TSV
• Automatically configures Amazon EMR clusters
Inexpensive
• Free tier
• Pay per activity/precondition
• No commitment
• Simple pricing
An ETL example (1 of 2)
• Combine logs in Amazon S3 with customer data in Amazon RDS
• Process using Hive on Amazon EMR
• Put output in Amazon S3
• Load into Amazon Redshift
• Run SQL query and load table for BI tools
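The Hive and Redshift-load steps above could look roughly like the following pipeline objects (script location, IDs, and referenced nodes are hypothetical; the RDS copy, the DataNodes, the Redshift database, and the compute resources are omitted for brevity):

```python
# Hypothetical Hive step on EMR: join the S3 logs with customer data
# (the RDS table would be staged to S3 by a prior copy activity).
hive_step = {
    "id": "JoinLogs", "name": "JoinLogs",
    "fields": [
        {"key": "type", "stringValue": "HiveActivity"},
        {"key": "input", "refValue": "S3Logs"},        # S3DataNode: raw logs
        {"key": "output", "refValue": "S3Joined"},     # S3DataNode: Hive output
        {"key": "scriptUri", "stringValue": "s3://example-bucket/scripts/join.q"},
        {"key": "runsOn", "refValue": "Cluster"},      # EmrCluster object
    ],
}

# Hypothetical load step: copy the Hive output from S3 into Redshift.
load_step = {
    "id": "LoadWarehouse", "name": "LoadWarehouse",
    "fields": [
        {"key": "type", "stringValue": "RedshiftCopyActivity"},
        {"key": "input", "refValue": "S3Joined"},
        {"key": "output", "refValue": "RedshiftTable"},  # RedshiftDataNode
        {"key": "insertMode", "stringValue": "KEEP_EXISTING"},
        {"key": "runsOn", "refValue": "Worker"},         # Ec2Resource object
        {"key": "dependsOn", "refValue": "JoinLogs"},    # run after the Hive step
    ],
}
```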
An ETL example (2 of 2)
• Run on a schedule (e.g., hourly)
• Use a precondition to make the Hive activity depend on Amazon S3 logs being available
• Set up Amazon SNS notification on failure
• Change default retry logic
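Those four operational choices could attach to the Hive step from the previous sketch roughly like this (all referenced object IDs are the hypothetical ones introduced earlier):

```python
# Hypothetical: wire schedule, precondition, alerting, and retry policy
# onto the JoinLogs activity defined in the previous sketch.
hive_step["fields"] += [
    {"key": "schedule", "refValue": "HourlySchedule"},  # run hourly
    {"key": "precondition", "refValue": "LogsReady"},   # wait for the S3 logs
    {"key": "onFail", "refValue": "FailureAlarm"},      # SnsAlarm on failure
    {"key": "maximumRetries", "stringValue": "1"},      # change default retry logic
]
```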
Swipely
1 TB
How big is your data?
Do you have a big data problem?
Don't use Hadoop: your data isn't that big.
Keep your data small and manageable.
Get ahead of your Big Data: don't wait for data to become a problem
Build novel product features with a batch architecture
Decrease development time by easily backfilling data