Code Shaming; Anti Patterns at Work
Silicon Valley Code Camp – October 2014
Mark Simms (@mabsimms)
Principal Group Program Manager
Windows Azure Customer Advisory Team
Nov 27, 2014
Designing resilient large-scale services requires careful design and architecture choices.
In this session we will explore key scenarios extracted from customer engagements, and what happens @ big scale.
Azure Customer Advisory Team (CAT): works with internal and external customers to build out some of the largest applications on Azure
Get our hands dirty on all aspects of delivery; design, implementation and all too often firefighting
This is meant to be an interactive discussion – if you don’t ask questions, we will!
This session will be customer stories, patterns & code.
We will get deeply nerdy with .NET and Azure services.
Setting the stage
A large web site, processing asynchronous work
Azure Cloud Service
Web Role
100k+ connected devices publishing activity reports
Target end to end latency (including cellular link) – 8 seconds
Target throughput 5000 messages / second
Connected device(s) service, asynchronous processing
Azure Cloud Service
Web Role Worker
Service Bus
Batch receiving messages for throughput
Flag completion for individual messages
Connected device(s) service, asynchronous processing
Serialized processing – increasing latency
Batching receive for chunky communication – needed to meet throughput goals
Processing messages in sequence drives up latency
[Diagram: Service Bus Queue -> Message Batch -> Process Messages -> Process Message, Process Message, ... (in sequence)]
Switch to parallel processing
[Diagram: Service Bus Queue -> Message Batch -> Process Messages -> Process Message, Process Message, ... (in parallel)]
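The move from sequential to parallel processing of a received batch can be sketched roughly as follows. This is a minimal Python illustration, not the .NET code from the session: process_message, complete and the max_workers value are hypothetical stand-ins for the message handler, the per-message completion call and a concurrency bound.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(messages, process_message, complete, max_workers=8):
    """Process one received batch in parallel with bounded concurrency.

    process_message and complete are hypothetical callables standing in for
    the message handler and the per-message completion call.
    Returns the messages that failed, so the caller can decide what to do.
    """
    failed = []
    # max_workers bounds concurrency so a large batch cannot flood the host.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_message, m): m for m in messages}
        for future in as_completed(futures):
            message = futures[future]
            try:
                future.result()
            except Exception:
                # Leave the message uncompleted so the queue can redeliver it.
                failed.append(message)
            else:
                # Flag completion for this individual message only.
                complete(message)
    return failed
```

Receiving stays batched for throughput, while each message is completed on its own, so one slow or failing message does not hold back the rest of the batch.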
Initial performance very smooth
App quickly spikes to 100% CPU on all cores
Execution time spikes to minutes!
Something isn’t right
Most threads blocked in FindEntry of Dictionary
Using a Dictionary to look up the message handlers
What does windbg say?
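The slides do not show the fix, but one common way to avoid contention or corruption in a shared handler lookup is to build the table once, before any worker threads start, and treat it as read-only afterwards. A minimal Python sketch of that idea follows; the handler names and message types are hypothetical.

```python
from types import MappingProxyType


def handle_status(payload):
    return f"status:{payload}"


def handle_alert(payload):
    return f"alert:{payload}"


# Built once at startup, before any worker threads run, and never mutated.
# MappingProxyType is a read-only view, so accidental writes fail loudly.
HANDLERS = MappingProxyType({
    "status": handle_status,
    "alert": handle_alert,
})


def dispatch(message_type, payload):
    """Look up the handler in the fixed table and invoke it."""
    try:
        handler = HANDLERS[message_type]
    except KeyError:
        raise ValueError(f"no handler for message type {message_type!r}")
    return handler(payload)
```

Because nothing writes to the table after startup, concurrent readers never observe it mid-update.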
Large variations in avg/max latency
After time, processing rate drops to ~5 msg / second
CPU at ~ 0%
Something still isn’t right
[Chart: Variation in Message Processing; Avg, Min and Max processing time for Message Types 1 through 8, time axis from 00:00.0 to 00:30.2]
What does perf view have to say?
http://channel9.msdn.com/Series/PerfView-Tutorial/Tutorial-12-Wall-Clock-Time-Investigation-Basics
System.Core!System.Dynamic.Utils.TypeExtensions.GetParametersCached
Looks simple enough…
Required messaging exchange patterns for queuing (pub/sub, competing consumer)
Partitioning and load balancing (affinity) for queue resources
Latency vs. throughput – batching
Resources vs. latency – bounding concurrency of task execution
Message dispatch – dynamic vs. fixed function tables
Poison messages, retries
Idempotent processing
Asynchronous & queue based processing
Cloud Service Boundary
Load Balancer
Web Servers
Database
App Servers
Azure Queue(s)
(Very) Large scale website, backed by 500 Azure SQL databases
Physically collapsed web/app tiers to reduce latency
What can happen during periods of extreme success?
Large website, scale-out relational data storage
Azure Cloud Service
Web Role
500 databases
Each cloud service has a single public IP (VIP)
Each Azure SQL Database cluster also has a single public IP
120 web role instances, 500 databases
Connection pool default size = 100
What’s the limit?
Large website, scale-out relational data storage
Azure Load Balancer
DB1 DB2 DB3
SrcIp SrcPort DestIp DestPort
A.B.C.D 1 E.F.G.H 1433
A.B.C.D 2 E.F.G.H 1433
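The table above hints at the limit: with one source IP, one destination IP and one destination port, only the source port varies. The arithmetic below is a rough sketch using the numbers from the slides. It assumes the worst case where every instance fills a pool for every database and all of that traffic targets a single cluster IP, which the slides do not state.

```python
WEB_ROLE_INSTANCES = 120
DATABASES = 500
POOL_SIZE = 100  # default connection pool size

# Worst case (assumption): every instance fills a pool for every database.
potential_connections = WEB_ROLE_INSTANCES * DATABASES * POOL_SIZE

# A TCP port is a 16-bit number, so one (SrcIp, DestIp, DestPort) triple
# can carry at most 65,535 distinct source ports (port 0 is reserved).
MAX_SOURCE_PORTS = 2 ** 16 - 1

print(potential_connections)                     # 6000000
print(potential_connections / MAX_SOURCE_PORTS)  # about 91.6
```

Even a small fraction of the potential pool capacity is enough to exhaust the available source ports between the two public IPs.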
(Very) Large scale website, leveraging an external service for content moderation
Protected the external service dependency with a retry policy
On average called in 0.5% of service calls
Large website, leveraging external services
Azure Cloud Service
Web Role
500 databases
Content moderation
service
Too much trust in downstream services and client proxies
Not bounding non-deterministic calls
Blocking synchronous operations
No load shedding
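The failure modes listed above suggest wrapping every downstream call with a concurrency cap, a timeout and a fast rejection path. The following is a minimal Python sketch under those assumptions; BoundedCaller, max_in_flight, timeout_s and fallback are hypothetical names, not part of any Azure or .NET API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class BoundedCaller:
    """Wrap a downstream call with a concurrency cap and a timeout."""

    def __init__(self, call, max_in_flight=4, timeout_s=0.5):
        self._call = call
        self._timeout_s = timeout_s
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)

    def invoke(self, request, fallback=None):
        # Load shedding: if every slot is busy, answer immediately with the
        # fallback instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            return fallback
        future = self._pool.submit(self._call, request)
        # The slot is freed only when the downstream call really finishes,
        # even if the caller stopped waiting for it earlier.
        future.add_done_callback(lambda _: self._slots.release())
        try:
            # Bound the wait so a hung dependency cannot hold the request.
            return future.result(timeout=self._timeout_s)
        except FutureTimeout:
            return fallback
        except Exception:
            # Do not trust the downstream service: treat errors as a miss.
            return fallback
```

With this shape, a dependency that is called on only 0.5% of requests can still slow down, but it can occupy at most max_in_flight workers rather than every request thread.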
Unintended consequences
[Chart: Web Request Response Latency. Series: Avg Latency. Y-axis: Response Latency (seconds), 0 to 450. X-axis: 1 to 29.]
Rich clients (mobile and desktop) publishing documents for processing
Using Shared Access Signature (SAS) tokens for direct writes to storage
Looks like a good design…
Large website, asynchronous document processing
Azure Cloud Service
Web Role Worker
Azure Storage Account
Blob
Queue
Storage account URI is “hard coded” into the client application
Need to update all 100k+ client applications to change storage account
Large website, asynchronous document processing
Design Choices & Challenges
Devices and Services workload – connected embedded devices and applications streaming data to the cloud
100k+ devices, growing 50k / month
Regional affinity (North America only)
Optimize for the most stringent case
Simplicity is king
No one, true solution
Exploration – Data Design
Query | Throughput | Latency | Reach
Every 30 seconds, each device publishes a status update (location, health, etc.) | 4k – 100k msgs/sec | 2000 – 5000 ms | Single device
Every 10 minutes, a batch job retrieves all of the status updates delivered in the past 10 minutes | 2M msgs / 10 minutes | 2 minutes | All devices
On an ad-hoc basis, a user may request the current status and recent history of all of their devices | 15 requests / second | 500 ms | Limited device set
On an ad-hoc basis, a user may request a historical time range of all of their devices | 5 requests / second | 750 ms | Limited device set
Cannot fulfill with a single database
• Exceeds transactional throughput limit
• Data growth will exceed practical size limits
Insert heavy workload
• Pressure on transaction log
Partitioning keys?
• Device ID, User account?
Partitioning approach
• Bucket, range, lookup?
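The three partitioning approaches named above (bucket, range, lookup) can be contrasted in a few lines. This is a Python sketch only; the shard names, range bounds and lookup entries are hypothetical.

```python
import bisect
import hashlib

DATABASES = ["db0", "db1", "db2", "db3"]  # hypothetical shard names


def bucket_partition(device_id):
    """Bucket (hash): a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return DATABASES[int.from_bytes(digest[:8], "big") % len(DATABASES)]


# Range: each shard owns the keys below its upper bound (hypothetical bounds);
# the last shard takes everything at or above the final bound.
RANGE_UPPER_BOUNDS = [25_000, 50_000, 75_000]


def range_partition(device_number):
    """Range: binary search over the sorted upper bounds."""
    return DATABASES[bisect.bisect_right(RANGE_UPPER_BOUNDS, device_number)]


# Lookup: an explicit map, so moving a tenant is a data change, not a rehash.
LOOKUP = {"account-a": "db2", "account-b": "db0"}


def lookup_partition(account_id, default="db1"):
    """Lookup: consult the map, falling back to a default shard."""
    return LOOKUP.get(account_id, default)
```

Bucket partitioning spreads load evenly but makes rebalancing expensive, range partitioning keeps neighbouring keys together, and lookup partitioning trades an extra hop for the freedom to move individual keys.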
Option 1: Relational – Considerations and Challenges
Periodic query spike on bulk reporting
• Impact to online operations (30M+ rows)
Rebalancing
• Moving data between partitions / databases
Distribution of reference data (relational model)
• Keeping in sync
Impact of noisy neighbors (Azure SQL DB)
• Variable latency, pushback under heavy load
Cost of management (SQL IaaS)
• Cost of automation for patching, maintenance
Option 1: Relational – Considerations and Challenges
Inserting large volumes of streaming data into a data store
• Data store is governed on number of operations (transactions)
Trade consistency for throughput – enqueue, batch and publish
• Get: increased throughput, shift work to “cheap” resource (app memory)
• Give up: full durability (potential data loss)
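The enqueue, batch and publish trade-off can be sketched as a small in-memory buffer. This is an illustrative Python sketch; BatchingWriter, publish_batch and max_batch are hypothetical names, and publish_batch stands in for whatever bulk-insert call the data store offers.

```python
import threading


class BatchingWriter:
    """Buffer rows in app memory and publish them to the store in batches."""

    def __init__(self, publish_batch, max_batch=100):
        self._publish_batch = publish_batch  # hypothetical bulk-insert callable
        self._max_batch = max_batch
        self._buffer = []
        self._lock = threading.Lock()

    def enqueue(self, row):
        """Add a row; publish one batch when the buffer reaches max_batch."""
        with self._lock:
            self._buffer.append(row)
            if len(self._buffer) < self._max_batch:
                return
            batch, self._buffer = self._buffer, []
        # Publish outside the lock so producers are not blocked on the store.
        self._publish_batch(batch)

    def flush(self):
        """Publish whatever is buffered (call on a timer and at shutdown)."""
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._publish_batch(batch)
```

Rows that are buffered but not yet published are lost if the process dies, which is exactly the durability given up in exchange for fewer, larger operations against the store.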
Tackling the Insert Challenge
Challenge: know that your site is having issues before Twitter does
• This is not a randomly chosen anecdote.
Instrument, collect, analyze - react
• Best: buy your way to victory (AppDynamics, New Relic, etc.)
• Also need to instrument application effectively for “contextual” data (aka, logging)
Tackling the Insight Challenge
Instrument for production logging
• If you didn’t log & capture it, it didn’t happen
Implement inter-service monitoring and alerting
• Nothing interesting happens on a single instance
Run-time configurable logging
• Enable activation (capture or delivery) of additional channels at run-time
Getting logging right
• All logging must be asynchronous
• Buffer and filter before pushing to remote service or store
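The session demonstrates these points with .NET Event Source; as a stand-in, the same three ideas (asynchronous, buffered and filtered, run-time configurable) can be sketched with Python's standard logging module. build_async_logger and its parameters are hypothetical names; QueueHandler and QueueListener are part of the standard library.

```python
import logging
import logging.handlers
import queue


def build_async_logger(name, sink_handler, level=logging.WARNING, capacity=10_000):
    """Return (logger, listener); records reach sink_handler on another thread."""
    # Bounded in-memory buffer between the application and the slow sink.
    buffer = queue.Queue(maxsize=capacity)
    logger = logging.getLogger(name)
    # Filter on the calling thread, before anything is buffered.
    logger.setLevel(level)
    logger.propagate = False
    # QueueHandler enqueues without blocking; if the buffer is full the
    # record is dropped via logging's error handling instead of stalling.
    logger.addHandler(logging.handlers.QueueHandler(buffer))
    # QueueListener drains the buffer on its own thread into the real sink.
    listener = logging.handlers.QueueListener(buffer, sink_handler)
    listener.start()
    return logger, listener
```

Calling logger.setLevel(logging.DEBUG) later turns on the extra detail at run time without a redeploy, and listener.stop() drains the buffer at shutdown.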
Instrumenting Applications
Bringing down a production system with logging…
Demo: Instrumenting Applications with Event Source
Option 2: Compositional Azure Storage
This isn’t a relational workload
• Per-device insert and lookup
• Periodic batch transfer
Per-device lookup
• Natural fit for table storage
• Device ID = Pk
• Data type = Rk
Periodic batch transfer
• Natural fit for blob storage
• Instance + Timestamp = blob id
• Buffer and write into blocks
• Roll over on time interval (10 min)
[Diagram: incoming data passes through a time/space buffer and is written to Table Storage (Pk={Device;Day}, Rk={Timestamp}, Payload={fields}) and to Blob Storage (Uri={Minute;Instance}, Payload={JSON Data})]
Querying by device
• By time - direct { Pk, Rk } lookup
• By day - direct { Pk }, max of 2880 records per partition
Batch transfer by time frame
• Parallel download of all blobs matching timeframe pattern
Adding scale capacity
• 20k operations per storage account,
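The key layout described above can be made concrete with a small sketch. The separators and timestamp formats below are assumptions chosen for illustration; the slides only give the shape Pk={Device;Day}, Rk={Timestamp} and Uri={Minute;Instance}.

```python
from datetime import datetime


def table_keys(device_id, timestamp):
    """Pk={Device;Day}, Rk={Timestamp} for one status update."""
    partition_key = f"{device_id};{timestamp:%Y%m%d}"
    row_key = f"{timestamp:%H%M%S}"
    return partition_key, row_key


def blob_name(instance_id, timestamp, interval_minutes=10):
    """Uri={Minute;Instance}: one blob per instance per roll-over interval."""
    # Round the minute down to the start of its roll-over interval.
    minute = timestamp.minute - timestamp.minute % interval_minutes
    bucket = timestamp.replace(minute=minute, second=0, microsecond=0)
    return f"{bucket:%Y%m%d%H%M};{instance_id}"


# One update every 30 seconds gives 2880 rows per {Device;Day} partition.
UPDATES_PER_DAY = 24 * 60 * 60 // 30
```

A lookup by time needs both keys, a lookup by day needs only the partition key, and the batch job can list every blob whose name starts with the interval it wants.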
Azure Storage Account - Blob
Max blob size (block) 200 GB (50k blocks)
Max block size 4 MB
Max blob size (page) 1 TB
Max page size 512 bytes
Max bandwidth / blob 480 Mbps
Latency bounds (per operation) 100 ms nominal; 1-3 sec during load balancing
Scale-out unit Blob
Scale-out impedance Low
Use the appropriate blob type
• Prefer block blobs with immutable / append-only data
Use the largest practical block size
• Note: network performance may require smaller blocks for “long-haul”
For partial reads use 64 KB block size to maximize throughput
Scale
Use Async Copy API for copying blobs between accounts, providers, etc.
Azure Storage Account - Table
Max operations / second per partition 5000
Max row size (names + data) 1 MB
Max column size (byte[] or string) 64 KB
Maximum number of rows N/A (up to storage account size limit)
Scale-out unit Table partition
Scale-out impedance Low
• Use appropriate partition keys to co-locate data (for query or batch operations) or break data into more partitions (for throughput)
• Avoid use of table storage for applications requiring non-trivial aggregation or function projection
• Store multiple types in same table for normalized queries (do not denormalize table storage schema!)
• Avoid large scans (can be very expensive!); explore use of separate (partially consistent) index table
Scale
• Leverage multiple storage accounts (not multiple tables) to increase operations/second
Azure Storage Account - Queues
Max messages in a queue N/A (up to storage account size limit)
Max lifetime of a message 1 week (auto purged)
Max message size 64 KB
Max throughput 2000 messages / second
Scale-out unit Queue
Scale-out impedance Medium
• Optimize storage format to reduce message size / avoid 64 KB limit (for larger messages leverage Service Bus or Queues + Blob)
• Retrieve messages in batches to increase throughput
• Use dequeue count on message for poison messages
Scale
• Leverage multiple queues to increase messages / second
• Vertical partitioning: split queues by function
• Horizontal partitioning: split messages between queues (round robin/direct assignment)
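The horizontal partitioning and poison message bullets above can be sketched as follows. This is a Python illustration with hypothetical names (QueuePartitioner, MAX_DEQUEUE_COUNT); the byte-sum hash is deliberately simple and only meant to show a deterministic key-to-queue mapping.

```python
import itertools


class QueuePartitioner:
    """Horizontal partitioning: spread messages across several queues."""

    def __init__(self, queue_names):
        self._names = list(queue_names)
        self._round_robin = itertools.cycle(self._names)

    def next_queue(self):
        """Round robin: each call returns the next queue in turn."""
        return next(self._round_robin)

    def queue_for(self, key):
        """Direct assignment: the same key always maps to the same queue."""
        return self._names[sum(key.encode("utf-8")) % len(self._names)]


MAX_DEQUEUE_COUNT = 5  # hypothetical threshold


def is_poison(dequeue_count, max_dequeue_count=MAX_DEQUEUE_COUNT):
    """Treat a message as poison once it has been dequeued too many times."""
    return dequeue_count > max_dequeue_count
```

Round robin maximizes spread across queues, while direct assignment keeps related messages together; a consumer would check is_poison before processing and move such messages aside instead of retrying them forever.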
Services site for mobile device applications
• 1M+ users at launch, 1M+ users added per month
• Front ended by Android, iOS, Windows Phone
Personalized information feeds and data sets
• Examples: browsing history, shopping cart
• Assuming up to 30% of user base can be online at any point in time
• Maximum response latency 250 ms @ 99th percentile
User centric web application
Where are the scalability bottlenecks?
Where are the availability and failure points?
Where are the key insight and instrumentation points?
Tearing apart the architecture
[Diagram: a Cloud Service containing a Front End Web Role (four instances), a Caching Role (two instances) and a Worker Role (one instance), backed by Databases (four DBs) and Storage (two storage accounts)]
Demo: Implementing an information publishing site
Recap
Know the numbers – platform scalability targets
• Compute, storage, networking and platform services
• Scalability == capacity * efficiency
Watch out for shared resources and contention points
• At high load and concurrency “interesting” things happen
• Default to asynchronous, bound all calls
Insight is power – measuring and observation of behavior
• Without rich telemetry and instrumentation – down to the call level – apps are running blind
• Buy your way to victory, leverage asynchronous and structured logging
Resources
Failsafe: Building scalable, resilient cloud services
http://channel9.msdn.com/Series/FailSafe
Cloud Service Fundamentals - Reference code for Azure
http://code.msdn.microsoft.com/windowsazure/ContosoSocial-in-Windows-8dd9052c