Presentation from AWS re:Invent 2013. See session video here: http://www.youtube.com/watch?v=MjZdiDotRU8 Presentation is in two parts: (1) Introduction to moving workloads to the cloud, (2) deep dive on how the BBC moved their playout to the cloud.
• The result of 1 year of development by ~18 engineers
And here they are!
Why Did We Build Video Factory?
• Old system
  – Monolithic
  – Slow
  – Couldn’t cope with spikes
  – Mixed ownership with third party
• Video Factory
  – Highly scalable, reliable
  – Completely elastic transcode resource
  – Complete ownership
Why Use the Cloud?
• Background of 6 channels, spikes up to 24 channels, 6 days a week
• A perfect pattern for an elastic architecture
[Chart: Off-Air Transcode Requests for 1 Week]
Video Factory – Architecture
• Entirely message driven
  – Amazon Simple Queue Service (SQS)
    • Some Amazon Simple Notification Service (SNS)
  – We use lots of classic message patterns
• ~20 small components
  – Singular responsibility: “Do one thing, and do it well”
    • Share libraries if components do things that are alike
    • Control bloat
  – Components have contracts of behavior
    • Easy to test
Video Factory – Workflow
[Diagram: end-to-end workflow. Inputs: SDI Broadcast Video Feed (x 24), Playout Data Feed. Components: Broadcast Encoder, RTP Chunker, Live Ingest Logic, Transcode Abstraction Layer, Amazon Elastic Transcoder, Elemental Cloud, DRM, QC, Editorial Clipping, MAM. Stores: Amazon S3 Mezzanine (Time Addressable Media Store), Amazon S3 Distribution Renditions. Flow labels: Mezzanine, Playout Video, Transcoded Video, Metadata, SMPTE Timecode.]
Mezzanine Video Capture
Detail
• Mezzanine video capture
• Transcode abstraction
• Eventing demonstration
Mezzanine Video Capture
Mezzanine Capture
[Diagram: mezzanine capture. Components: SDI Broadcast Video Feed (x 24), Broadcast Grade Encoder, RTP Chunker, Chunk Uploader, Amazon S3 Mezzanine Chunks, Chunk Concatenator, Amazon S3 Mezzanine. Labels: MPEG2 Transport Stream (H.264) on RTP Multicast, 30 MB HD/10 MB SD; MPEG2 Transport Stream (H.264) Chunks; 3 GB HD/1 GB SD; Control Messages; SMPTE Timecode.]
Concatenating Chunks
• Build file using Amazon S3 multipart requests
  – 10 GB Mezzanine file constructed in under 10 seconds
• Amazon S3 multipart APIs are very helpful
  – Component only makes REST API calls
    • Small instances; still gives very high performance
• Be careful: Amazon S3 isn’t immediately consistent when dealing with multipart built files
  – Mitigated with rollback logic in message-based applications
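The concatenation step can be sketched as a planning function: each uploaded chunk becomes one part of an S3 multipart upload, which the service assembles server-side (for example through the UploadPartCopy operation), so the component never has to move the video bytes itself. This is a minimal illustration, not the BBC's implementation. The `Chunk` type, its field names, and the use of a frame number as a stand-in for SMPTE timecode are all assumptions. The function only orders the chunks, assigns part numbers, and checks the documented S3 multipart limits (at most 10,000 parts, and every part except the last at least 5 MiB).

```python
from dataclasses import dataclass
from typing import List, Tuple

MIN_PART_BYTES = 5 * 1024 * 1024  # S3 minimum size for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per multipart upload


@dataclass(frozen=True)
class Chunk:
    """Hypothetical description of one uploaded chunk object."""
    key: str          # object key of the chunk (illustrative)
    start_frame: int  # ordering field, a stand-in for SMPTE timecode
    size_bytes: int


def plan_multipart(chunks: List[Chunk]) -> List[Tuple[int, str]]:
    """Return (part_number, source_key) pairs in capture order.

    Raises ValueError if the plan would violate S3 multipart limits.
    """
    if not chunks:
        raise ValueError("no chunks to concatenate")
    ordered = sorted(chunks, key=lambda c: c.start_frame)
    if len(ordered) > MAX_PARTS:
        raise ValueError("too many chunks for a single multipart upload")
    for chunk in ordered[:-1]:
        if chunk.size_bytes < MIN_PART_BYTES:
            raise ValueError(f"chunk {chunk.key} is below the minimum part size")
    # Part numbers start at 1 in the S3 multipart API.
    return [(number, chunk.key) for number, chunk in enumerate(ordered, start=1)]
```

A real component would pass each pair to a copy-part request and then complete the upload; if completion fails, the message-level rollback mentioned above would return the request to its queue.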
By Numbers – Mezzanine Capture
• 24 channels
  – 6 HD, 18 SD
  – 16 TB of Mezzanine data every day per capture
• 200,000 chunks every day
  – And Amazon S3 has never lost one
  – That’s ~2 (UK) billion RTP packets every day… per capture
• Broadcast grade resiliency
  – Several data centers / 2 copies each
Transcode Abstraction
Transcode Abstraction
• Abstract away from single supplier
  – Avoid vendor lock in
  – Choose suppliers based on performance and quality and broadcaster-friendly feature sets
  – BBC: Elemental Cloud (GPU), Amazon Elastic Transcoder, in-house for subtitles
• Smart routing & smart bundling
  – Save money on non-time-critical transcode
  – Save time & money by bundling together “like” outputs
• Hybrid cloud friendly
  – Route a baseline of transcode to local encoders, and spike to cloud
• Who has the next game changer?
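The routing and bundling ideas can be illustrated with a small in-memory router. This is a sketch under stated assumptions, not the Video Factory code: the request fields, the `Backend` attributes, and the ranking scheme (cheapest backend for non-time-critical work, fastest otherwise) are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class TranscodeRequest:
    """Hypothetical request shape; field names are illustrative only."""
    media_id: str
    output_profile: str
    time_critical: bool


@dataclass(frozen=True)
class Backend:
    """Hypothetical backend description."""
    name: str
    profiles: FrozenSet[str]  # output profiles this backend can produce
    cost_rank: int            # lower is cheaper
    speed_rank: int           # lower is faster


class TranscodeRouter:
    def __init__(self, backends: List[Backend]) -> None:
        self._backends = list(backends)

    def route(self, request: TranscodeRequest) -> Backend:
        """Pick the fastest capable backend if time critical, else the cheapest."""
        capable = [b for b in self._backends if request.output_profile in b.profiles]
        if not capable:
            raise LookupError(f"no backend supports {request.output_profile}")
        if request.time_critical:
            return min(capable, key=lambda b: b.speed_rank)
        return min(capable, key=lambda b: b.cost_rank)

    def bundle(self, requests: List[TranscodeRequest]) -> Dict[Tuple[str, str], List[str]]:
        """Group outputs of the same media routed to the same backend."""
        bundles: Dict[Tuple[str, str], List[str]] = {}
        for request in requests:
            backend = self.route(request)
            bundles.setdefault((request.media_id, backend.name), []).append(
                request.output_profile
            )
        return bundles
```

Because callers only see the router, adding a new supplier means registering one more `Backend` entry rather than changing any requester.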
Transcode Abstraction
[Diagram: Transcode Request, Transcode Router, Amazon Elastic Transcoder Backend, Elemental Backend, Subtitle Extraction Backend, Amazon Elastic Transcoder, Elemental Cloud, Amazon S3 Mezzanine, Amazon S3 Distribution Renditions. Connection labels: REST, SQS.]
Transcode Abstraction - Future
[Diagram: Transcode Request, Transcode Router, Amazon Elastic Transcoder Backend, Elemental Backend, Subtitle Extraction Backend, Unknown Future Backend X (“?”), Amazon Elastic Transcoder, Elemental Cloud, Amazon S3 Mezzanine, Amazon S3 Distribution Renditions. Connection labels: REST, SQS.]
Example – A Simple Elastic Transcoder Backend
[Diagram, within one SQS message transaction: XML Transcode Request → Get Message from Queue → Unmarshal and Validate Message → Initialize Transcode → Wait for SNS Callback over HTTP → XML Transcode Status Message. Amazon Elastic Transcoder is reached by POST and calls back by POST (via SNS).]
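The “unmarshal and validate” step can be sketched as follows. The XML element names (`transcodeRequest`, `mediaId`, `sourceKey`, `profile`) and the `XmlTranscodeRequest` type are assumptions invented for the example; the real message schema is not shown in the slides. The point of the sketch is that both failure modes, malformed XML and well-formed XML that breaks the validation rules, surface as one distinct exception type.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


class BadMessage(Exception):
    """Raised when a message cannot be unmarshalled or fails validation."""


@dataclass(frozen=True)
class XmlTranscodeRequest:
    """Hypothetical unmarshalled form of the XML request."""
    media_id: str
    source_key: str
    profile: str


def unmarshal_and_validate(body: str) -> XmlTranscodeRequest:
    """Parse the XML body and enforce that all required elements are present."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        raise BadMessage(f"not well-formed XML: {exc}") from exc
    if root.tag != "transcodeRequest":
        raise BadMessage(f"unexpected root element: {root.tag}")
    fields = {}
    for name in ("mediaId", "sourceKey", "profile"):
        text = (root.findtext(name) or "").strip()
        if not text:
            raise BadMessage(f"missing or empty element: {name}")
        fields[name] = text
    return XmlTranscodeRequest(fields["mediaId"], fields["sourceKey"], fields["profile"])
```

Only a request that passes this step would go on to initialize a transcode.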
Example – Add Error Handling
[Diagram, within one SQS message transaction: XML Transcode Request → Get Message from Queue → Unmarshal and Validate Message → Initialize Transcode → Wait for SNS Callback over HTTP → XML Transcode Status Message. Amazon Elastic Transcoder is reached by POST and calls back by POST (via SNS). Error outputs: Bad Message Queue, Fail Queue, Dead Letter Queue.]
Example – Add Monitoring Eventing
[Diagram, within one SQS message transaction: XML Transcode Request → Get Message from Queue → Unmarshal and Validate Message → Initialize Transcode → Wait for SNS Callback over HTTP → XML Transcode Status Message. Amazon Elastic Transcoder is reached by POST and calls back by POST (via SNS). Error outputs: Bad Message Queue, Fail Queue, Dead Letter Queue. Monitoring Events are emitted at four points in the flow.]
BBC eventing framework
• Key-value pairs pushed into Splunk
  – Business-level events, e.g.:
    • Message consumed
    • Transcode started
  – System-level events, e.g.:
    • HTTP call returned status 404
    • Application’s heap size
    • Unhandled exception
• Fixed model for “context” data
  – Identifiable workflows, grouping of events; transactions
  – Saves us a LOT of time diagnosing failures
Component Development – General Development & Architecture
• Java applications
  – Run inside Apache Tomcat on m1.small EC2 instances
  – Run at least 3 of everything
  – Autoscale on queue depth
• Built on top of the Apache Camel framework
  – A platform for building message-driven applications
  – Reliable, well-tested SQS backend
  – Camel route builders Java DSL
    • Full of messaging patterns
• Developed with Behavior-Driven Development (BDD) & Test-Driven Development (TDD)
  – Cucumber
• Deployed continuously
  – Many times a day, 5 days a week
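“Autoscale on queue depth” combined with “run at least 3 of everything” reduces to a small sizing rule, sketched here. The per-instance throughput figure and the upper bound are hypothetical parameters; in practice the decision would be made by an autoscaling policy driven by a queue-depth metric rather than by application code.

```python
import math


def desired_instances(
    queue_depth: int,
    messages_per_instance: int,
    minimum: int = 3,   # "run at least 3 of everything"
    maximum: int = 20,  # hypothetical ceiling to cap spend
) -> int:
    """Return how many instances should run for the current backlog."""
    if queue_depth < 0:
        raise ValueError("queue_depth must not be negative")
    if messages_per_instance <= 0:
        raise ValueError("messages_per_instance must be positive")
    needed = math.ceil(queue_depth / messages_per_instance)
    return max(minimum, min(maximum, needed))
```

With an empty queue the rule still keeps three instances running, and a deep backlog is capped at the configured ceiling.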
Error Handling Messaging Patterns
• We use several message patterns
  – Bad message queue
  – Dead letter queue
  – Fail queue
• Key concept
  – Never lose a message
  – Message is either in-flight, done, or in an error queue somewhere
• All require human intervention for the workflow to continue
  – Not necessarily a bad thing
Message Patterns – Bad Message Queue
The message doesn’t unmarshal to the object it should, OR we could unmarshal the object, but it doesn’t meet our validation rules.
• Wrapped in a message wrapper which contains context
• Never retried
• Very rare in production systems
• Implemented as an exception handler on the route builder
Message Patterns – Dead Letter Queue
We tried processing the message a number of times, and something we weren’t expecting went wrong each time.
• Message is an exact copy of the input message
• Retried several times before being put on the DLQ
• Can be common, even in production systems
• Implemented as a bean in the route builder for SQS
Message Patterns – Fail Queue
• Wrapped in a message wrapper that contains context
• Requires some level of knowledge of the system to be retried
• Often evolve from understanding the causes of DLQ’d messages
• Implemented as an exception handler on the route builder
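The three error queues can be illustrated together with an in-memory consumer loop. This is a simplified stand-in, not the Camel route builders the components actually use: the exception names, the `Queues` container, and the retry count are assumptions, and plain Python lists take the place of SQS queues. The invariant it demonstrates is the key concept above: every message ends up done or in exactly one error queue.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


class BadMessage(Exception):
    """Message cannot be unmarshalled or fails validation; never retried."""


class KnownFailure(Exception):
    """An understood failure that needs an informed human to retry."""


@dataclass
class Queues:
    bad: List[dict] = field(default_factory=list)         # wrapped with context
    fail: List[dict] = field(default_factory=list)        # wrapped with context
    dead_letter: List[str] = field(default_factory=list)  # exact input copies
    done: List[str] = field(default_factory=list)


def consume(
    messages: Iterable[str],
    handler: Callable[[str], None],
    queues: Queues,
    max_attempts: int = 3,
) -> None:
    """Process each message, routing failures so that none is ever lost."""
    pending = deque((body, 1) for body in messages)
    while pending:
        body, attempt = pending.popleft()
        try:
            handler(body)
        except BadMessage as exc:
            queues.bad.append({"body": body, "reason": str(exc)})
        except KnownFailure as exc:
            queues.fail.append({"body": body, "reason": str(exc)})
        except Exception:
            # Unexpected error: retry, then park an exact copy on the DLQ.
            if attempt < max_attempts:
                pending.append((body, attempt + 1))
            else:
                queues.dead_letter.append(body)
        else:
            queues.done.append(body)
```

Only the unexpected-error branch retries; bad messages and known failures go straight to their queues because retrying them unchanged cannot succeed.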