Building Serverless Data Infrastructure in the AWS Cloud Ryan Plant @ryan_plant November 10, 2017
WHAT WE’LL COVER
The New Data Economy
Reference Architecture
Using the AWS Cloud
The world’s most valuable resource is no longer oil, but data…
May 6th, 2017
Data => Revenue (but extraction, refinement, packaging, and distribution needed)
Traditional Data Warehousing (DW)
Volume, variety, and velocity…
Advanced analytics…
Artificial intelligence…
“What got us here won’t (entirely) get us there…”
Mostly proprietary…
Costly and complex to scale…
Next Generation Data Infrastructure
(i.e. the “data lake”)
James “Data Lake” Dixon
If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state…
From Data Warehouses to Lakes
A data pond, lake, or ocean is not a product; it’s an architecture… (and architecture is a principled and pattern-oriented approach to building systems)
Any and all data…
Any source and format…
Any time…
Reference Architecture
APPS & SOURCES
STORAGE AND PROCESSING LAYER
• Ingestion
• Storage
• Catalog
• Processing
• Analytics & Artificial Intelligence
SERVING LAYER
• Models & Marts
• API
• Search
DATA OPS
• Security
• Config
• Telemetry
• Cost Mgmt
Data Ingestion Pipelines
Sources: SERVICES, MONOLITHS
Channels: Change Data Capture (CDC), STREAMS, MESSAGING, FILE EXTRACTS
Writes to STORAGE: append, append, PUT
STORAGE: source data aggregated, stored indefinitely; many supported formats
Security: segregation & encryption
Storage and Catalog
[Diagram labels: data, Ingestion, STORAGE (RAW, REFINED), Catalog]
Catalog
• Register source and schema
• Data attribute inventory
• Relationships and dependencies
• Etc…
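The catalog responsibilities listed above can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the names CatalogEntry, Catalog, register, attribute_inventory, and dependents_of are hypothetical and do not come from the slides or from any AWS service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One registered dataset: its source, schema, and upstream dependencies."""
    name: str
    source: str
    schema: Dict[str, str]  # attribute name -> type name
    depends_on: List[str] = field(default_factory=list)


class Catalog:
    """Hypothetical in-memory catalog; a real one would be a managed service."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Refuse dependencies on datasets that were never registered.
        missing = [d for d in entry.depends_on if d not in self._entries]
        if missing:
            raise KeyError(f"unregistered dependencies: {missing}")
        self._entries[entry.name] = entry

    def attribute_inventory(self) -> Dict[str, List[str]]:
        """Map each attribute name to the datasets that carry it."""
        inventory: Dict[str, List[str]] = {}
        for entry in self._entries.values():
            for attribute in entry.schema:
                inventory.setdefault(attribute, []).append(entry.name)
        return inventory

    def dependents_of(self, name: str) -> List[str]:
        """List the datasets that were derived from the named dataset."""
        return sorted(
            e.name for e in self._entries.values() if name in e.depends_on
        )
```

Registering a raw dataset first and then a refined dataset that depends on it gives both the attribute inventory and the dependency lookup something to answer from.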
Raw to Refined Processing Pipelines
• Preserve RAW data; enrich only
• Apply transforms to create new, REFINED datasets (e.g. customer partitioned views)
• Catalog new datasets
• Enable new use cases:
  • Reporting/Analytical views
  • Machine/Deep Learning
[Diagram labels: data, Ingestion, STORAGE (RAW, REFINED), Processing Pipelines, Catalog; refined datasets C1, C2, C3, C..n, ALL DATA, X, Y, Z]
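A minimal sketch of the raw-to-refined idea, assuming records are plain dictionaries: the raw records are left untouched, and a new customer-partitioned dataset is derived from copies. The field names (customer, quantity, unit_price, total) and the function name refine_by_customer are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def refine_by_customer(raw_records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Build customer-partitioned views without mutating the raw records."""
    refined: Dict[str, List[dict]] = defaultdict(list)
    for record in raw_records:
        # Copy, then enrich: the raw record itself is never modified.
        enriched = dict(record)
        enriched["total"] = record["quantity"] * record["unit_price"]
        refined[record["customer"]].append(enriched)
    return dict(refined)
```

Because the function only reads from its input, the same raw data can be reprocessed later with different transforms to support new use cases.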
Analytics and AI
[Diagram labels: data, Ingestion, STORAGE (RAW, REFINED), Analytics and Artificial Intelligence; refined datasets C1, C2, C3, C..n, ALL DATA, X, Y, Z]
Curation and Serving
[Diagram labels: data, Ingestion, STORAGE (RAW, REFINED), Processing Pipelines, Catalog, Analytics and Artificial Intelligence, Models and Marts, Search, API; refined datasets C1, C2, C3, C..n, ALL DATA, X, Y, Z]
Using the AWS Cloud
Lots of software, hardware, etc.
TRADITIONAL INVESTMENT IN NEXT GENERATION DATA
CAPITAL AND RISK BARRIERS
acquire/write and maintain software
procure, install, and maintain hardware
get commercial real estate license
PUBLIC CLOUD ECONOMIES OF SCALE
CLOUD OPTIMIZATION
Infrastructure as a Service
Someone else’s hardware and real estate
Your software, your (virtual) servers
Platform as a Service
Someone else’s software, servers, hardware and real estate
Your custom application software
Software as a Service
Someone else’s application software, you provide the data
(everything else doesn’t matter)
Cycle Time, Capital Optimization, Differentiation Focus: High (Infrastructure as a Service), Higher (Platform as a Service), Highest (Software as a Service)
Go Serverless! (as much as possible)
Principles for event-driven, reactive data infrastructure primed for serverless architectures
• everything is an event: messages, log entries, file I/Os, clock alarms, etc.
• listen for events: trigger a handler with an event
• stateless event handling: avoid state, persist as event source, handoff as soon as possible
• automation through orchestration and coordination
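The principles above can be sketched in a few lines, assuming events are plain dictionaries with a "type" field. Everything here is hypothetical and for illustration only: the on decorator, the dispatch function, and the "file.created" event type are not part of any AWS API, and a real system would let a managed service do the listening and triggering.

```python
from typing import Callable, Dict, List

Event = dict
Handler = Callable[[Event], List[Event]]

# Registry of listeners: event type -> handlers triggered by that type.
_handlers: Dict[str, List[Handler]] = {}


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Register a handler to be triggered by events of the given type."""
    def decorator(handler: Handler) -> Handler:
        _handlers.setdefault(event_type, []).append(handler)
        return handler
    return decorator


@on("file.created")
def validate_file(event: Event) -> List[Event]:
    # Stateless: everything the handler needs arrives in the event, and
    # the result is handed off as a new event rather than kept in memory.
    status = "accepted" if event["size"] > 0 else "rejected"
    return [{"type": f"file.{status}", "path": event["path"]}]


def dispatch(event: Event) -> List[Event]:
    """Trigger every handler listening for this event; collect hand-offs."""
    emitted: List[Event] = []
    for handler in _handlers.get(event["type"], []):
        emitted.extend(handler(event))
    return emitted
```

The events returned by dispatch can be fed back in as new inputs, which is one simple way to chain handlers without any of them holding state.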
Ingestion and Storage
Event triggers: SQS, SNS, Kinesis, DynamoDB/RDS
Event handlers (AWS Lambda): y = f(x), y = f(x, y), y = f([x, y])
AWS Step Functions (coordinated state)
AWS S3 (ready), or to S3 direct
/{source}-raw/{key}/YYYY-MM-DD
/{source}-refined/{key}/YYYY-MM-DD
lifecycle policies
AWS Glacier (archival)
KMS (encryption)
IAM + Directory (access control)
CloudWatch/Trail
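The storage path layout shown on this slide, /{source}-raw/{key}/YYYY-MM-DD and /{source}-refined/{key}/YYYY-MM-DD, can be generated by a small helper. This is a sketch under assumptions: the function names object_key and refined_path_for are hypothetical, and the slide does not say whether the first path segment is a bucket name or a key prefix, so the helper only builds the string.

```python
from datetime import date


def object_key(source: str, stage: str, key: str, day: date) -> str:
    """Build a path following /{source}-{stage}/{key}/YYYY-MM-DD."""
    if stage not in ("raw", "refined"):
        raise ValueError(f"unknown stage: {stage!r}")
    return f"/{source}-{stage}/{key}/{day.isoformat()}"


def refined_path_for(raw_path: str) -> str:
    """Map a raw path to the refined path for the same source, key, and day."""
    prefix, rest = raw_path.lstrip("/").split("/", 1)
    if not prefix.endswith("-raw"):
        raise ValueError(f"not a raw path: {raw_path!r}")
    return f"/{prefix[:-len('-raw')]}-refined/{rest}"
```

Keeping raw and refined paths parallel means a processing step can derive its output location from its input location alone, with no lookup table.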
Catalog
AWS Glue (serverless ETL/ELT)
• source crawlers
• classifiers
• metadata
• trigger: doSomething(…) {…}
Processing Pipelines
• jobs and job runner
• To Targets
[Diagram labels: Sources, Ingestion, Storage]
Processing Pipelines
AWS Glue (serverless ETL/ELT)
AWS EMR (Managed Hadoop)
Streaming: Kinesis
Batch: AWS Batch
[Diagram labels: Sources & Targets, Ingestion, Storage, Catalog]
Serving Layer
AWS ElasticSearch (managed ES)
AWS Redshift Spectrum (Parallel DW)
AWS Athena (Ad-hoc Query)
[Diagram labels: Sources, Ingestion, Storage, Catalog, Processing Pipelines (AWS Glue)]
Serving Layer
AWS API Gateway (serverless APIs)
AWS QuickSight (visualization)
AWS Cognito (Web/Mobile Identity and SSO)
[Diagram labels: Sources, Ingestion, Storage, Catalog, Processing Pipelines]
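One way a serverless API in the serving layer might look is a function handler behind API Gateway. The sketch below assumes the Lambda proxy integration, where the handler receives the request as a dictionary and returns a dictionary with statusCode, headers, and body; the slides do not state which integration is used. The CUSTOMER_TOTALS mart and the customer path parameter are hypothetical.

```python
import json

# Hypothetical curated mart; in practice this would be read from a store.
CUSTOMER_TOTALS = {"C1": 14, "C2": 3}


def handler(event: dict, context: object) -> dict:
    """Serve one customer's total from the mart as a JSON API response."""
    # pathParameters may be absent or None when the route has no parameters.
    customer = (event.get("pathParameters") or {}).get("customer")
    if customer not in CUSTOMER_TOTALS:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(
            {"customer": customer, "total": CUSTOMER_TOTALS[customer]}
        ),
    }
```

The handler can be exercised locally by calling it with a dictionary such as {"pathParameters": {"customer": "C1"}} and None for the context.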
CLOUD OPTIMIZATION
Infrastructure as a Service
Someone else’s hardware and real estate
Your software, your (virtual) servers
Platform as a Service
Someone else’s software, servers, hardware and real estate
Your custom application software
Software as a Service
Someone else’s application software, you provide the data
(everything else doesn’t matter)
You are likely here…
Aim here…
TBD
Opportunity!
Public Cloud R&D Investment
SERVERLESS: USE CAUTION
The floor is wet (and is constantly getting mopped!)
The edges are sharp:
• Development, Test, Debug tools and experience
• Configuration and Deployment challenges
• Variable, non-deterministic performance
Extremely new (but inevitable) paradigm…