© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
October 31, 2017 | 10:00 AM PT
Architecting an Open Data
Lake for the Enterprise
Today’s Presenters
Pratap Ramamurthy, Solutions Architect, Amazon Web Services
Ashwin Viswanath, Director, Cloud Product Marketing, Talend
Eric Anderson, Executive Director, Data, Beachbody
Today’s Agenda
• An overview of AWS and AWS Marketplace, with an
emphasis on AWS data lake solutions and Talend
• Overview of the Talend solutions featured in our story
• Challenges faced by Beachbody
• The Beachbody success story with AWS and Talend
• Q&A/Discussion
Learning Objectives:
1. How to migrate a variety of structured and unstructured data sources
to a data lake
2. How to shorten development and testing cycles
3. How to mitigate complex deployment challenges common to real-time
data
4. How to take advantage of Spark and Hadoop by generating native
code
The Data Lake and AWS
Drive business value with any type of data
Legacy Data Warehouses & RDBMS
• Complex to set up and manage
• Do not scale
• Take months to add new
data sources
• Queries take too long
• Cost $MM upfront
Should I Build a Data Lake?
Starting by amassing "all your data" and dumping
into a large repository for the data gurus to start
finding "insights" is like trying to win the lottery by
buying all the tickets
Rethink How to Become a Data-driven Business
• Business outcomes - start with the insights and actions
you want to drive, then work backwards to a streamlined
design
• Experimentation - start small, test many ideas, keep
the good ones and scale those up, paying only for what
you consume
• Agile and timely - deploy data processing infrastructure
in minutes, not months. Take advantage of a rich platform
of services to respond quickly to changing business
needs
Business Case Determines Platform Design
[Diagram: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights]
START HERE WITH A BUSINESS CASE
Experiment and Scale Based on Your Business Needs
MATCH AVAILABLE DATA
[Diagram: available data sources (Metrics and Monitoring, Workflow, Logs, ERP, Transactions) feeding Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights]
Business Outcomes on a Modern Data Architecture
Outcome 1: Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2: Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3: Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4: Automate for expansive reach
• Automation of business processes and physical infrastructure
Use an Optimal Combination of Highly
Interoperable Services
Why Amazon S3 for Modern Data Architecture?
Durable
• Designed for 11 9s of durability
Available
• Designed for 99.99% availability
High performance
• Multiple upload
• Range GET
Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments
Integrated
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB
• Amazon Athena
Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies
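The "Range GET" item above refers to requesting part of an object with an HTTP Range header, which lets a client fetch a large object as several pieces, possibly in parallel. A minimal sketch of how the header values might be computed follows. The helper name `byte_ranges` and the part size are assumptions for illustration and do not come from the slides.

```python
from typing import List


def byte_ranges(object_size: int, part_size: int) -> List[str]:
    """Return HTTP Range header values that together cover the whole object.

    Hypothetical helper: both arguments are in bytes. HTTP byte ranges
    are inclusive on both ends, hence the "- 1" on the end offset.
    """
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, object_size, part_size):
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


if __name__ == "__main__":
    # A 10-byte object split into parts of at most 4 bytes.
    for header in byte_ranges(10, 4):
        print(header)
```

Each returned string could then be sent as the Range header of a separate GET request for the same object, and the parts concatenated in order.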
Decouple Storage and Compute
• Legacy design was large databases or
data warehouses with integrated
hardware
• Big Data architectures often benefit
from decoupling storage and compute
Improving Data Agility with Talend
Ashwin Viswanath, Director of Cloud Product Marketing, Talend
Data Lakes: The First Phase and the Future
First phase: Capture and store raw data of many
different types at scale
Next phase: Augment enterprise data warehousing
strategies
Why Data Lake Projects Fail
• No DevOps Practices for Scalability & Testing
• Lack of Expertise
• Siloed Operating Model
• Poor Data Governance
• Poor Architectural Design & Integration
Foundational Elements of a Data Lake
[Diagram: foundational elements surrounding the Data Lake]
• Data Preparation
• Self-service Data Ingest
• Metadata Management
• Data Classification
• Data Governance
• Data Lineage
• Security
• Data Profiling
Streamline DevOps Process for Big Data
Custom Cluster Configuration
• Retrieve Hadoop configuration
data from job server
• Upload configuration files to
different clusters based on role:
dev/test/prod
• Enforce uniform security
standards
• Available for Spark and Spark
Streaming jobs
Portable integration jobs across your environment
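The per-role configuration idea above can be pictured as a lookup from environment role to cluster settings, so that one job definition runs unchanged in dev, test, and prod. The sketch below is illustrative only and is not Talend's API. `ClusterConfig`, `CLUSTER_CONFIGS`, `config_for`, the host names, and the settings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical subset of the settings a job needs to reach a cluster."""
    name_node: str
    kerberos_enabled: bool


# Hypothetical clusters, one per role. The host names are placeholders.
CLUSTER_CONFIGS = {
    "dev": ClusterConfig("hdfs://dev-cluster.example.internal:8020", True),
    "test": ClusterConfig("hdfs://test-cluster.example.internal:8020", True),
    "prod": ClusterConfig("hdfs://prod-cluster.example.internal:8020", True),
}


def config_for(role: str) -> ClusterConfig:
    """Pick the cluster settings for a role; the job code itself never changes."""
    try:
        return CLUSTER_CONFIGS[role]
    except KeyError:
        known = ", ".join(sorted(CLUSTER_CONFIGS))
        raise ValueError(f"unknown role {role!r}; expected one of: {known}") from None


def security_is_uniform() -> bool:
    """Check that every role uses the same security setting."""
    return len({cfg.kerberos_enabled for cfg in CLUSTER_CONFIGS.values()}) == 1


if __name__ == "__main__":
    print(config_for("test").name_node)
    print(security_is_uniform())
```

Keeping the role-to-cluster mapping outside the job is what makes the job portable: promoting from test to prod changes the role argument, not the job.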
Big Data Matching and Machine Learning on
Spark
• New Data Stewardship interface simplifies matching process
• Improved performance through continuous matching, speeding time to insight
Harmonize data at scale by learning from your key experts
Faster and Better Real-time Insights with Spark
Enterprise-class robustness and intelligent integration
• New Spark Support
• Production-ready with Spark 2.1
• Toggle between Spark 1.X and 2.X
• Easily upgrade to Spark 2.X
• Natural Language Processing with Spark
• Data Preparation for Spark Streaming
• Talend Data Mapper runs with Spark Streaming
• Spark Streaming support for Kerberized Kafka 0.10
Big Data Governance
Complete End-to-end Data Lineage
• Understand more about your unstructured data with new cloud and big data metadata bridges
• Save time by automatically harvesting data structures to build a data lake inventory
• Manage change with version control and notifications
Metadata bridges: S3, Hadoop HDFS, Hive, MongoDB, Couchbase, Cassandra, Apache Atlas
File systems: Amazon S3, Hadoop HDFS, Unix, Windows, Linux
File formats: CSV, Excel, JSON, Avro, Parquet
Know Your Data for Increased Data Protection, Accessibility and Compliance
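The idea of automatically harvesting data structures to build a data lake inventory can be illustrated with a small sketch that reads column names out of CSV and JSON content. This is not how Talend's metadata bridges work internally. The function names, file names, and the rule of taking the first JSON record are assumptions made for illustration.

```python
import csv
import io
import json
from typing import Dict, List


def harvest_columns(file_name: str, text: str) -> List[str]:
    """Return the field names found in one file's content.

    Hypothetical rules: a CSV file contributes its header row; a JSON
    file contributes the sorted keys of its first record.
    """
    if file_name.endswith(".csv"):
        reader = csv.reader(io.StringIO(text))
        return next(reader, [])
    if file_name.endswith(".json"):
        data = json.loads(text)
        record = data[0] if isinstance(data, list) and data else data
        return sorted(record) if isinstance(record, dict) else []
    raise ValueError(f"unsupported file type: {file_name}")


def build_inventory(files: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each file name to the field names harvested from it."""
    return {name: harvest_columns(name, text) for name, text in files.items()}


if __name__ == "__main__":
    sample = {
        "orders.csv": "order_id,amount\n1,9.99\n",
        "events.json": '[{"user": "a", "event": "click"}]',
    }
    for name, columns in build_inventory(sample).items():
        print(name, columns)
```

A real inventory would also record types, locations, and lineage; the sketch only shows the harvesting step.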
Beachbody – Fitness Goes Big Data
Driving innovation with Talend on AWS
About Beachbody
• A leading provider of fitness, nutrition, and weight-loss
programs
• Creator of P90X® Series, INSANITY®, FOCUS T25®, 21
Day Fix®, Body Beast®, PiYo®, and Hip Hop Abs®
• Empowers over 23 million customers
• Supports 350K+ independent “Coach” distributors
• Operates with 800+ employees
• Sees 5 million+ monthly unique visits across digital
platforms
• Reached $1 billion in gross sales in 2015
The Challenge – Do More, Better, Faster, Cheaper
Project Vision – Open Enterprise Data Lake
• Build an OPEN Enterprise Data Platform
• Open Source Technology: Bring Your Own Tool
• Decentralized Data Ownership: Many teams can publish
• Centralized people, processes, and tools available
• Capture All Data as close to real time as possible
• Access to All raw + processed data by Authorized Users
• HIPAA/PII data encrypted or masked for compliance
• Shift Time from Collecting Data -> Analyzing Data
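Masking sensitive fields before data reaches general users can be sketched as replacing each sensitive value with a keyed hash, so records stay joinable on the masked value while the original is hidden. This is a minimal sketch under stated assumptions, not Beachbody's implementation. The function name `mask_record`, the field names, and the key are hypothetical, and a real deployment would load the key from a secrets store.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable


def mask_record(record: Dict[str, Any],
                sensitive_fields: Iterable[str],
                secret_key: bytes) -> Dict[str, Any]:
    """Return a copy of record with sensitive fields replaced by HMAC-SHA256 digests.

    The same input and key always produce the same digest, so masked
    values can still be used as join keys across datasets.
    """
    masked = dict(record)
    for field in sensitive_fields:
        if field in masked:
            value = str(masked[field]).encode("utf-8")
            masked[field] = hmac.new(secret_key, value, hashlib.sha256).hexdigest()
    return masked


if __name__ == "__main__":
    # Hypothetical record and key, for illustration only.
    row = {"customer_id": 42, "email": "pat@example.com", "plan": "annual"}
    print(mask_record(row, ["email"], b"demo-key-not-for-production"))
```

Masking with a keyed hash is one-way. Where authorized users must recover the original value, encryption with managed keys is the alternative the slide mentions.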
Data Lake Architecture
[Diagram: Amazon S3 data lake folder structure on AWS]
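The actual folder structure from the diagram did not survive in this text, so the following is only a sketch of one common way to lay out object keys in an S3 data lake. The zone names echo the "raw + processed" wording of the project vision. The function name `object_key`, the date-partition naming, and the source and file names are assumptions.

```python
from datetime import date

# Hypothetical zones, named after the "raw + processed" data mentioned earlier.
ZONES = ("raw", "processed")


def object_key(zone: str, source: str, day: date, file_name: str) -> str:
    """Build an object key of the form zone/source/year=YYYY/month=MM/day=DD/file."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone {zone!r}; expected one of {ZONES}")
    return (
        f"{zone}/{source}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/"
        f"{file_name}"
    )


if __name__ == "__main__":
    print(object_key("raw", "orders", date(2017, 10, 31), "part-0001.json"))
```

Partitioning keys by date in this style lets query engines that read from S3 skip whole prefixes when a query filters on a date range.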
Data Lake Component – Storage
Data Lake Component – Data Pipeline
Data Lake Component – RDBMS
Data Lake Component – Compute
Data Lake Component – Analytics
Business Benefits
• Reduced Data Acquisition Time by 5x
• Improved Marketing Campaigns
• Reduced Site Tagging Costs
• Improved Employee Retention and Satisfaction
• Automated Customer Self-Service Order Status
• Identified Web Funnel Conversion Opportunities (testing now)
Next steps and further information
• Data Lake solution on AWS:
https://aws.amazon.com/big-data/data-lake-on-aws/
• Take a Free 30-Day Trial of Talend Integration Cloud:
https://iam.integrationcloud.talend.com/idp/federation/up/login
• Try AWS for free:
https://aws.amazon.com/