Page 1
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
MED402: Building a Scalable Video / Digital Asset
Management (DAM) Platform in the Cloud
Michael Limcaco – Enterprise Solutions Architect (AWS)
Jonathan Rivers – Director, Technical Operations (PBS)
November 15, 2013
Page 2
Agenda
• The big picture
• Architecture
• Build-out exercise
• Customer case study (PBS)
• Observations and summary
Page 3
Big Picture: Enterprise Media Architecture
[Diagram. Labels: Camera; Physical Media; Live Stream (RTMP, MPEG-TS, HD-SDI); Media Files; Transcoders; "Store output profile and file" (shown twice); Content Management, Discovery & Delivery; Integrated Workflow]
Page 4
Big Picture: Digital Asset Management (DAM)
[Diagram. Same labels as the previous slide, with "DAM" added]
Page 5
[Diagram. Labels: Ingest; Processing; Discovery & Delivery; Workflow Management; Storage]
Page 7
Key DAM Requirements
• Ingest
• Metadata extraction
• Create renditions
• Build the catalog
• Enable rich search
• Manage storage lifecycle
• Provide efficient delivery of media assets
Page 10
Why Scalable?
• Increasing volume, variety, velocity
– Collectors, cameras, sensors and sources
  • Ex: UGC, raw source, mezzanine, B-roll, creative collateral
  • Final content
– Formats and standards
  • Transport, containers, codecs, metadata
  • SD, HD, 4K … 8K
– Devices and user expectations
• Opportunities through cloud enablement
– Media platform as a service
– Multitenancy
Page 11
What about Search? Ugh …
• Core elements
– Project, keyword, asset name, tags, date/time capture, timecode range, subject, format, size
• Extended structured search
– Dublin Core, XMP, MPEG-7, IPTC, EXIF, FCXML, SMPTE, MISB
• Unstructured search
– Comments, notes, transcript, closed captioning
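To make the three search tiers concrete, here is a minimal sketch of how one catalog record might carry all of them. Every field name and value is hypothetical; the point is only that core and extended fields are matched exactly, while unstructured text needs free-text matching.

```javascript
// Hypothetical catalog record; all field names and values are illustrative.
const record = {
  core: { project: 'demo-project', assetName: 'clip_0001.mp4', tags: ['soccer', 'b-roll'], format: 'MPEG-4' },
  extended: { 'dc:creator': 'Example Crew', 'exif:Model': 'Example Camera' },
  unstructured: { comments: 'Good wide shot of the goal', transcript: 'And the crowd goes wild' }
};

// Core and extended tiers: exact match on a named field (or array membership).
function matchesField(rec, tier, field, value) {
  const v = rec[tier] && rec[tier][field];
  return Array.isArray(v) ? v.includes(value) : v === value;
}

// Unstructured tier: a naive case-insensitive substring match over free text.
function matchesText(rec, phrase) {
  const needle = phrase.toLowerCase();
  return Object.values(rec.unstructured).some(t => t.toLowerCase().includes(needle));
}
```

A real search engine replaces `matchesText` with tokenized, ranked matching; the substring test here only stands in for that idea.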
Page 12
Enough Theory …
Let’s Build a DAM in the Cloud!
Page 13
The User Experience
(Notional Reference Client)
(Demo)
Page 15
[Architecture diagram. Logical components: DAM Interface; DAM Web Service; Event Handler; Mailbox (×2); Metadata Processing; Rendition Processing; Catalog; DAM Storage & Archive; Delivery Cache. AWS building blocks: AWS Elastic Beanstalk; EC2 workers in Auto Scaling groups (×2); DynamoDB; Amazon CloudSearch; S3 buckets for renditions and metadata sidecar files]
Page 22
Tools Available to Us
Need            Description                                  AWS Service
Ingest          Integrate w/ existing file-based workflows   Amazon S3
Metadata        Process inline and sidecar files             EC2 / Elastic Beanstalk
Renditions      Autogenerate thumbnails and proxies          Amazon Elastic Transcoder
Catalog part 1  Administrative entities, simple retrieval    Amazon DynamoDB
Catalog part 2  Field and free-form search                   Amazon CloudSearch
Storage         Nearline, online, offline infinite storage   Amazon S3, Amazon Glacier
Delivery        Global caching and streaming footprint       Amazon CloudFront
Page 23
Catalog: A Word on Why DynamoDB
[Diagram. Media containers: Container-A (Header, Layer-1, Layer-2); Container-B (Header, Layer-1, Layer-2); Container-C (Header)]
NoSQL Data Model
Core Elem1   Core Elem2   Elem from A   Elem from B
Name_A       Size         Some_Field
Name_B       Size                       Some_Field
Name_C       Size
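The slide's point can be sketched in a few lines: items in a schemaless table carry only the attributes their source container provides. The attribute names mirror the slide's placeholders, and which container-specific column each `Some_Field` belongs to is an assumption read off the layout.

```javascript
// Hypothetical items: every item has the core attributes, plus whatever
// container-specific attributes were found in its source file.
const items = [
  { name: 'Name_A', size: 1024, elemFromA: 'Some_Field' },
  { name: 'Name_B', size: 2048, elemFromB: 'Some_Field' },
  { name: 'Name_C', size: 512 }
];

// Union of attribute names across all items: the column set a fixed schema would need.
function allAttributes(list) {
  const names = new Set();
  list.forEach(item => Object.keys(item).forEach(k => names.add(k)));
  return Array.from(names);
}

// Number of cells a fixed-column table would leave empty for these items.
function emptyCells(list) {
  const width = allAttributes(list).length;
  return list.reduce((n, item) => n + (width - Object.keys(item).length), 0);
}
```

As more container formats are ingested, the union of attributes grows while each item stays small, which is the case a sparse NoSQL model handles without schema changes.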
Page 24
Catalog: A Word on Why CloudSearch
• Video and text
– Header fields with textual descriptions, synopsis, comments
– Tracks with speech to text, closed caption data
– Links to scripts
• Video and structured elements
– XMP dynamic media
– Sidecar files
• A managed search engine dedicated to these kinds of problems
– Case folding, stemming, stopword removal, synonyms
– Also accent normalization, UTF-8 normalization, etc.
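For a feel of what those text-processing steps do, here is a toy analyzer. It is not how Amazon CloudSearch is implemented; the stopword list and the one-rule "stemmer" are assumptions chosen to keep the sketch short.

```javascript
// Toy analyzer, for illustration only.
const STOPWORDS = new Set(['a', 'an', 'and', 'of', 'the', 'to']); // assumed list

// Accent normalization: decompose characters, then drop combining marks.
function stripAccents(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Crude stand-in for stemming: drop a trailing "s" from longer tokens.
function crudeStem(token) {
  return token.length > 3 ? token.replace(/s$/, '') : token;
}

function analyze(text) {
  return stripAccents(text)
    .toLowerCase()                        // case folding
    .split(/[^a-z0-9]+/)                  // tokenize on non-alphanumerics
    .filter(t => t && !STOPWORDS.has(t))  // stopword removal
    .map(crudeStem);
}
```

Running the same `analyze` over both documents and queries is what lets "Cafés" in a synopsis match a query for "cafe".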
Page 25
Other Goodies
• Back-end services
– AWS CLI
– Open source decode utilities
  • ExifTool
  • MediaInfo
– ETL support
  • Talend (representative)
• Front-end services
– Node.js + AWS Node SDK
Page 27
[Architecture diagram. Labels: Media Content; EC2 Crawler; Amazon S3 Storage (for source, renditions, metadata sidecar files); Amazon SNS Topic; Amazon SQS Queue (Rendition Jobs); Amazon SQS Queue (Metadata Processing Jobs); Rendition Workers (EC2 ASG); Metadata Workers (EC2 ASG); Elastic Transcoder (proxy / thumbnail generation); DAM Catalog (Amazon DynamoDB, Amazon CloudSearch); DAM Web Service (AWS Elastic Beanstalk); CloudFront Download Distribution]
Page 28
Walkthrough
(Dual Screen)
Page 29
Setup
• Amazon Simple Storage Service (S3) buckets ready to go
– External staging locations
– Internal working locations
• Amazon Simple Notification Service (SNS) + Amazon Simple Queue Service (SQS) wired up
• Catalog data models established
– Amazon DynamoDB table “catalog” created
– Amazon CloudSearch search domain “catalog” created
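One part of "wired up" that is easy to miss: each SQS queue needs an access policy that lets the SNS topic deliver to it. The sketch below builds such a policy document; the ARNs, account ID, and queue names are hypothetical placeholders, not values from the demo.

```javascript
// Sketch only: builds an SQS access policy allowing one SNS topic to send to a queue.
function buildQueuePolicy(queueArn, topicArn) {
  return {
    Version: '2012-10-17',
    Statement: [{
      Sid: 'AllowTopicToSend',
      Effect: 'Allow',
      Principal: { Service: 'sns.amazonaws.com' },
      Action: 'sqs:SendMessage',
      Resource: queueArn,
      // Only messages originating from this topic are accepted
      Condition: { ArnEquals: { 'aws:SourceArn': topicArn } }
    }]
  };
}

// Hypothetical ARNs for one topic fanning out to two job queues.
const topicArn = 'arn:aws:sns:us-east-1:123456789012:dam-new-data';
const queueArns = [
  'arn:aws:sqs:us-east-1:123456789012:dam-metadata-jobs',
  'arn:aws:sqs:us-east-1:123456789012:dam-rendition-jobs'
];
const policies = queueArns.map(q => buildQueuePolicy(q, topicArn));
```

Each policy would be serialized with `JSON.stringify` and set as the queue's `Policy` attribute before subscribing the queue to the topic.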
Page 30
1. Ingest, Crawl, Notify
a. End user initiates data copy
b. EC2 worker scans Amazon S3 staging bucket
c. EC2 worker copies or moves content
d. EC2 worker broadcasts “NEW DATA” event
(SNS)
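Steps b and d can be sketched as two small pure functions: one decides which staged keys are new, the other builds the parameters for an SNS Publish call. The message body layout and the way "seen" keys are tracked are assumptions; the actual bucket listing and publish calls are left to the AWS SDK.

```javascript
// Keys present in the staging listing that the crawler has not handled yet.
function findNewKeys(listedKeys, seenKeys) {
  const seen = new Set(seenKeys);
  return listedKeys.filter(k => !seen.has(k));
}

// Parameters for an SNS Publish call; the JSON message layout is hypothetical.
function buildNewDataEvent(topicArn, bucket, keys) {
  return {
    TopicArn: topicArn,
    Subject: 'NEW DATA',
    Message: JSON.stringify({ event: 'NEW DATA', bucket: bucket, keys: keys })
  };
}
```

Publishing once per batch of keys, rather than once per file, keeps the number of downstream queue messages proportional to crawl passes instead of file count; either choice works with the fan-out shown in the architecture.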
Page 37
2. Metadata Extraction
a. EC2 worker polls inbox (SQS)
b. EC2 worker pulls down media asset from Amazon S3
c. EC2 worker parses media files
d. EC2 worker pumps metadata through ETL flow to prepare for catalog insertion
e. EC2 worker inserts into catalog (Amazon DynamoDB)
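A minimal sketch of steps a and c, under two assumptions: the queue receives messages via an SNS subscription without raw delivery (so the body is a JSON envelope with a `Message` property), and the decode utility emits plain "Key : value" report lines. Function names are hypothetical.

```javascript
// Unwraps the SNS envelope found in an SQS message body and parses the
// JSON payload the crawler published.
function unwrapSnsMessage(sqsBody) {
  const envelope = JSON.parse(sqsBody);
  return JSON.parse(envelope.Message);
}

// Turns "Key : value" report lines into a flat object with UPPER_SNAKE keys.
function parseReport(text) {
  const out = {};
  text.split('\n').forEach(line => {
    const i = line.indexOf(':');
    if (i < 0) return; // section headings carry no colon
    const key = line.slice(0, i).trim().toUpperCase().replace(/[^A-Z0-9]+/g, '_');
    const value = line.slice(i + 1).trim();
    if (key && value) out[key] = value;
  });
  return out;
}
```

Splitting on the first colon only keeps values such as timestamps intact.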
Page 41
Preparing for Amazon DynamoDB Insert
{
"COMPLETE_NAME" :
{ "S" : "01_01_SoccerF_05_A.mp4" },
"FORMAT" :
{ "S" : "MPEG-4" },
"CODEC_ID" :
{ "S" : "mp42" }
}
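The typed layout above can be produced from a flat metadata object with a small helper. This is a sketch, not the ETL job from the demo: it handles only strings and numbers, and skipping empty values is a choice made here.

```javascript
// Converts a flat object into DynamoDB's typed attribute-value layout.
// Strings become { S: ... }; numbers become { N: "..." } (numbers travel as strings).
function toDynamoItem(flat) {
  const item = {};
  Object.keys(flat).forEach(name => {
    const value = flat[name];
    if (value === undefined || value === null || value === '') return; // skip empty values
    item[name] = typeof value === 'number' ? { N: String(value) } : { S: String(value) };
  });
  return item;
}

const item = toDynamoItem({
  COMPLETE_NAME: '01_01_SoccerF_05_A.mp4',
  FORMAT: 'MPEG-4',
  CODEC_ID: 'mp42'
});
```

The resulting `item` has the same shape as the JSON on this slide and can be passed as the `Item` of a PutItem request.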
Page 42
Model It and Deploy to EC2! (Talend)
Page 43
3. Catalog Processing
a. Store metadata record in Amazon DynamoDB
b. Reflect searchable subset to Amazon CloudSearch
c. Go crazy (HTTP GET)
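Step b might look like the sketch below, which turns a DynamoDB item into an "add" document for a CloudSearch document batch. It assumes the 2011-02-01 batch layout (`type`, `id`, `version`, `lang`, `fields`) and lowercase field names; deriving the ID from `COMPLETE_NAME` and the choice of searchable attributes are hypothetical.

```javascript
// Builds one "add" operation for a CloudSearch document batch from a
// DynamoDB item in attribute-value form. Only attributes listed in
// `searchable` are copied, with names lowercased.
function toSdfAdd(item, searchable, version) {
  const fields = {};
  searchable.forEach(name => {
    const attr = item[name];
    if (attr) fields[name.toLowerCase()] = attr.S !== undefined ? attr.S : Number(attr.N);
  });
  // Hypothetical ID rule: lowercase the file name and replace other characters with "_".
  const id = item.COMPLETE_NAME.S.toLowerCase().replace(/[^a-z0-9_]/g, '_');
  return { type: 'add', id: id, version: version, lang: 'en', fields: fields };
}
```

A batch is an array of such operations posted to the domain's document endpoint; a higher `version` for the same `id` supersedes earlier ones, so a timestamp is a common choice.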
Page 46
Querying the Catalog (Amazon CloudSearch)
• In Node.js
var optionsget = {
  host : 'cloudsearch.demo.aws.com', // host name only, no scheme or path
  port : 80,
  // Path is concatenated so that each string literal stays on one line
  path : '/2011-02-01/search?bq=complete_name:\'-STRAWBERRY\'' +
         '&return-fields=complete_name,text_relevance,codec_id_info,' +
         'duration,file_size,encoded_date',
  method : 'GET'
};
• http://cloudsearch.demo.aws.com/2011-02-01/search?bq=complete_name : …<field=value>
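The same request can be parameterized with a small helper. This sketch assumes the 2011-02-01 `bq` and `return-fields` parameters used on this slide; the function names are hypothetical, and terms containing single quotes are not handled.

```javascript
// Builds the search path for a single field:'term' boolean query.
function buildSearchPath(field, term, returnFields) {
  const bq = field + ":'" + term + "'";
  return '/2011-02-01/search?bq=' + encodeURIComponent(bq) +
    '&return-fields=' + returnFields.join(',');
}

// Wraps the path in an options object suitable for Node's http.request.
function searchOptions(host, field, term, returnFields) {
  return { host: host, port: 80, path: buildSearchPath(field, term, returnFields), method: 'GET' };
}
```

Passing the result of `searchOptions('cloudsearch.demo.aws.com', 'complete_name', 'soccer', ['complete_name', 'duration'])` to `http.request` would issue the query; the response is JSON that the caller parses.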
Page 47
Customer Case Study (PBS)
Page 48
Merlin: PBS CMS/DAM
• Code name Merlin
• Structured metadata
• 200+ web object records daily
– 29,046 web objects
• 150+ Video objects daily
– 91,436 videos
• Users from over 150 stations and 30 national producers
– Frontline
– Downton Abbey
– PBS NewsHour
Page 49
What’s It Do?
• Large multitenant system
– 1200 registered users
• 250 million streams per month
• 20 million unique viewers
• 8 PB of video delivered monthly
Page 50
Getting Data In
• 33 ingestible web feeds
– Content editors
– Web page listings
• Batch video ingest API
– Video content editors
– External workflow integration
• Manually entered videos
– Video content editors from all 50 states
– Number of user accounts
Page 51
System Overview
[System diagram. Labels: RSS; Ingest API; User Input; DAM (Merlin); Workflow Service; Amazon SWF; Amazon S3; Amazon RDS (×2); Content API; Search Util; Amazon CloudSearch; CDN]
Page 52
Basic Workflow
• Object registered with Merlin
• Images registered and processed with ITS
– Stored in CDN-fronted Amazon S3 bucket
• Videos registered with VTS
– Jobs sent to Zencoder for processing
– Video stored in CDN-fronted Amazon S3 bucket
• Objects ready for clients
– Objects rendered for consumption in Amazon S3
– Objects registered with APIs
– Objects discoverable
Page 53
Making It Discoverable
• Search util service
• Runs every hour
– Re-indexes last several hours each time
• Polls APIs
– Content API
– Modified time
• Updates Amazon CloudSearch index
– 2 primary indexes
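The hourly job's selection step can be sketched as a filter on modified time. The object shape, the `modified` field name, and the three-hour look-back are assumptions for illustration.

```javascript
// Returns the objects whose modified time falls inside the look-back window.
// Windows overlap from run to run, so index updates must be safe to repeat.
function selectForReindex(objects, nowMs, lookbackHours) {
  const cutoff = nowMs - lookbackHours * 60 * 60 * 1000;
  return objects.filter(o => Date.parse(o.modified) >= cutoff);
}

// Hypothetical API results and a run at 10:00 UTC with a 3-hour window.
const apiObjects = [
  { id: 'a', modified: '2013-11-15T09:30:00Z' },
  { id: 'b', modified: '2013-11-15T02:00:00Z' }
];
const due = selectForReindex(apiObjects, Date.parse('2013-11-15T10:00:00Z'), 3);
```

Re-indexing several hours on an hourly schedule means a failed or late run is covered by the next one without any bookkeeping.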
Page 54
Search Considerations
• Hidden objects
• Rights management
• Partitioned search
– Local station search
– Results by geo
– Restrict results for international customers
• Unify and normalize existing APIs
– Flatten data model
• Users looking for programs
– Specific searches
– Suitable for structured data
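The first three considerations amount to filtering on flattened fields. The sketch below applies them to a list of hits in plain JavaScript; the field names (`hidden`, `station`, `countries`) and the rule that a `null` station means national content are hypothetical, not Merlin's actual data model.

```javascript
// Keeps only the hits a given viewer is allowed to see.
function visibleHits(hits, viewer) {
  return hits.filter(h =>
    !h.hidden &&                                            // hidden objects never surface
    (h.station === null || h.station === viewer.station) && // null = national content
    h.countries.includes(viewer.country));                  // rights restricted by country
}

const sampleHits = [
  { id: 1, hidden: false, station: null, countries: ['US', 'CA'] },
  { id: 2, hidden: true, station: null, countries: ['US'] },
  { id: 3, hidden: false, station: 'station-x', countries: ['US'] },
  { id: 4, hidden: false, station: 'station-y', countries: ['US'] }
];
```

With a flattened data model the same conditions can be expressed as terms in the search query itself, so filtered-out objects never leave the index.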
Page 55
Challenges
• No native time field
– Convert dates to integers
– Epoch time
• Versioning of documents
– Epoch for versioning
• Exposing two versions of most fields
– Text searchable
– Facets (copy of text version)
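All three workarounds fit in one small mapping step. This is a sketch under assumptions: the source object shape and the field names (`title`, `title_facet`, `air_date`) are hypothetical.

```javascript
// Dates become whole seconds since the Unix epoch, stored as integers.
function toEpochSeconds(isoDate) {
  return Math.floor(Date.parse(isoDate) / 1000);
}

// Maps a source object to a search document:
// - version comes from the current epoch time, so later writes win
// - title is duplicated: one copy for text search, one for faceting
function toSearchDoc(obj, nowMs) {
  return {
    version: Math.floor(nowMs / 1000),
    fields: {
      title: obj.title,
      title_facet: obj.title,
      air_date: toEpochSeconds(obj.airDate)
    }
  };
}
```

Storing dates as integers also makes range queries over them straightforward.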
Page 56
Search Consumers (PBS.org)
Site Search
Page 57
Search Consumers (Video Portal)
Site Search Programs A-Z
Page 60
Summary
• Build an enterprise-scale DAM platform now
– Managed storage and archive (Amazon S3, Amazon Glacier)
– Managed database for catalog processing (Amazon DynamoDB, Amazon Relational Database Service [RDS])
– Managed search (Amazon CloudSearch)
• Application development accelerators
– Elastic Beanstalk harness (web, API, and worker roles)
– Reduced effort with the AWS CLI
• (Almost) fire and forget
Page 61
AWS Marketplace Can Help
• AWS online software store
– Customer can find, research, buy software
– Simple pricing, aligns with EC2 usage model
– 1-click launch in minutes
– Marketplace billing integrated into your AWS account
– 1,000+ products across 24 categories
• Digital asset management related options include:
– WebDAM – centralize, store, manage and distribute collateral
– Digital asset management cloud – web-based open source DAM
– Widen – manage and distribute digital media and brand assets with user roles and permissions
– Adobe Experience Manager – unified asset management including mobile
Learn more at: http://aws.amazon.com/marketplace