Shopify’s Architecture to Handle 80K RPS Celebrity Sales
Simon Eskildsen (@Sirupsen), Production Engineering Lead, Shopify
Shopify is handling some of the largest sales in the world from Kylie Jenner, Kanye, Superbowl, and others
“We learned to absorb these shocks and become stronger as a result. [..] The school of hard knocks has taught us well.”
Tobi Lütke, CEO, in an internal essay on why we optimize for flash sales
500K merchants powered
$5.8B processed, Q2 2017
80K peak RPS
40+ daily deploys
Ruby on Rails since 2006
2000+ employees
[Diagram: a Traffic layer in front of Region A and Region B; each region has an Application layer and a Data layer.]
• Global Routing
• OpenResty
• Bots
• Cache hits
• Checkout Throttling
Traffic
[Diagram: ten ISPs connect to Region A and Region B. Both regions BGP ANNOUNCE 23.227.38.0/24. walrusser.myshopify.com resolves to 23.227.38.64.]
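Because both regions announce the same prefix, an address inside it is reachable through either region. A minimal check, using only the prefix and address shown on the slide, that the shop's address falls inside the announced block (how traffic is then steered between regions is not stated in the slides):

```python
import ipaddress

# Prefix announced by both Region A and Region B on the slide.
announced = ipaddress.ip_network("23.227.38.0/24")
# Address that walrusser.myshopify.com resolves to on the slide.
shop_ip = ipaddress.ip_address("23.227.38.64")

# True when the shop address is covered by the announced prefix.
covered = shop_ip in announced
```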
OpenResty allows Lua scripting of your load balancers. It has been one of the most impactful additions to our stack in recent memory.
https://github.com/openresty/openresty
Nginx with OpenResty modules:
• Rule Banner
• Kafka Logging
• Edgecache
• Checkout Throttle
worker_processes 1;
error_log logs/error.log;
events {
    worker_connections 1024;
}
http {
    server {
        listen 8080;
        location / {
            default_type text/html;
            content_by_lua '
                ngx.say("<p>hello, world</p>")
            ';
        }
    }
}
Bot squasher analyzes the Kafka stream of incoming requests to ban bots with a rule banner module
[Diagram: a POST /checkout reaches Nginx with OpenResty; the Kafka Logger writes it to Kafka; Bot Squasher reads the stream and sends BAN 23.227.38.178 to the Rule Banner.]
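The slides do not say which rules Bot Squasher applies. As a rough sketch of the idea only, the following assumes a hypothetical sliding-window rule: an IP is banned once it sends more than `max_requests` checkout requests within `window_seconds`. The class name, thresholds, and the second IP address are illustrative assumptions, and a plain list of calls stands in for the Kafka stream.

```python
from collections import defaultdict, deque

class BotSquasher:
    """Hypothetical rule: ban an IP that exceeds max_requests checkout hits per window."""

    def __init__(self, max_requests=5, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # ip -> timestamps inside the window
        self.banned = set()               # stand-in for the rule banner's ban list

    def observe(self, ip, path, timestamp):
        """Consume one logged request; return True if the IP is banned."""
        if ip in self.banned:
            return True
        if not path.startswith("/checkout"):
            return False  # this sketch only watches the checkout path
        window = self.recent[ip]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        if len(window) > self.max_requests:
            self.banned.add(ip)
            del self.recent[ip]
            return True
        return False

squasher = BotSquasher(max_requests=5, window_seconds=1.0)
# Six checkout POSTs within half a second from one address trip the rule.
bot_results = [squasher.observe("23.227.38.178", "/checkout", 0.1 * i) for i in range(6)]
# A single request from another (made-up) address is left alone.
human_result = squasher.observe("203.0.113.7", "/checkout", 0.3)
```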
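Edgecache can serve full page cache hits out of the load-balancers in microseconds
[Diagram: GET /collections/walruses reaches Edgecache in Nginx with OpenResty. On a HIT the page is served from Memcached. On a MISS the request goes to the Web Process, and the response is used to FILL the cache.]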
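The HIT, MISS, and FILL flow can be sketched as a read-through cache. This is a simplified illustration, not Shopify's implementation: a dict stands in for Memcached, a plain function stands in for the web process, and the names `EdgeCache` and `web_process` are assumptions.

```python
class EdgeCache:
    """Read-through page cache: serve hits directly, fill on a miss."""

    def __init__(self, backend):
        self.store = {}         # stand-in for Memcached
        self.backend = backend  # callable path -> body, stand-in for the web process

    def get(self, path):
        body = self.store.get(path)
        if body is not None:
            return "HIT", body      # served without touching the backend
        body = self.backend(path)   # MISS: forward to the web process
        self.store[path] = body     # FILL: keep the rendered page for next time
        return "MISS", body

calls = []

def web_process(path):
    calls.append(path)  # record how often the backend is actually hit
    return f"<html>rendered {path}</html>"

cache = EdgeCache(web_process)
first = cache.get("/collections/walruses")
second = cache.get("/collections/walruses")
```

Only the first request reaches the backend; the second is answered from the cache with the same body.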
Checkout Throttle limits the number of customers in the processing-heavy checkout path
[Diagram: GET /checkout reaches the Checkout Throttle in Nginx with OpenResty. The Throttle either lets the customer through to /checkout or places them in a Queue at /wait_area.]
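A minimal sketch of the throttle decision, assuming a fixed capacity and first-in, first-out admission from the wait area. The slides state neither the capacity nor the queueing policy, so both are assumptions here, as are the class and method names.

```python
from collections import deque

class CheckoutThrottle:
    """Admit at most `capacity` customers to checkout; queue the rest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = set()     # customers currently in checkout
        self.waiting = deque()  # customers in the wait area, oldest first

    def request_checkout(self, customer):
        if customer in self.active:
            return "/checkout"
        # Admit directly only if there is room and nobody is queued ahead.
        if len(self.active) < self.capacity and not self.waiting:
            self.active.add(customer)
            return "/checkout"
        if customer not in self.waiting:
            self.waiting.append(customer)
        return "/wait_area"

    def finish_checkout(self, customer):
        self.active.discard(customer)
        # Freed slots go to the longest-waiting customers (assumed FIFO policy).
        while self.waiting and len(self.active) < self.capacity:
            self.active.add(self.waiting.popleft())

throttle = CheckoutThrottle(capacity=2)
routes = [throttle.request_checkout(c) for c in ["ann", "bob", "cat"]]
throttle.finish_checkout("ann")
retry = throttle.request_checkout("cat")
```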
A pod is an isolated unit of one or more shops
[Diagram: Data in Region A. Shops (shop1, shop4, shop9, shop17, shop72, shop3, shop92, shop18, shop64, shop22, shop88, shop0, shop52, shop23) are grouped into Pod 14, Pod 2, and Pod 7.]
Each Pod in Region A
• Pod 14: MySQL, Redis, Memcache, Cron
• Pod 2: MySQL, Redis, Memcache, Cron
• Pod 7: MySQL, Redis, Memcache, Cron
Shared across pods: Workers, Load Balancing
Genghis is our load-testing tool to test scale
Pod Balancer balances shops between pods with minimal downtime to keep load and size even
[Diagram sequence: Pod Balancer moves shops between Pod 14, Pod 2, and Pod 7. Later frames add shop98, shop99, and shop100, and then a new pod, Pod 74.]
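The slides do not describe how Pod Balancer chooses which shops to move. One simple possibility, shown purely as an illustration, is a greedy loop that moves a shop from the most loaded pod to the least loaded pod until the gap is within a tolerance. The function name, the load numbers, and the tolerance are all hypothetical.

```python
def plan_moves(pods, tolerance):
    """pods: {pod_name: {shop_name: load}}. Returns a list of (shop, source, target)."""
    pods = {name: dict(shops) for name, shops in pods.items()}  # work on a copy
    moves = []
    while True:
        totals = {name: sum(shops.values()) for name, shops in pods.items()}
        heavy = max(totals, key=totals.get)
        light = min(totals, key=totals.get)
        gap = totals[heavy] - totals[light]
        if gap <= tolerance:
            return moves
        # Only a shop lighter than the gap makes this pair of pods more even.
        candidates = {s: load for s, load in pods[heavy].items() if 0 < load < gap}
        if not candidates:
            return moves
        # Pick the shop that brings the two pods closest together.
        shop = min(candidates, key=lambda s: abs(gap - 2 * candidates[s]))
        pods[light][shop] = pods[heavy].pop(shop)
        moves.append((shop, heavy, light))

# Hypothetical loads; the slides give no numbers.
example_pods = {
    "Pod 14": {"shop1": 50, "shop4": 30, "shop9": 20},
    "Pod 2": {"shop3": 10},
    "Pod 7": {"shop22": 30},
}
moves = plan_moves(example_pods, tolerance=20)
```

With these made-up loads the plan moves shop1 from Pod 14 to Pod 2 and then shop3 from Pod 2 to Pod 7, leaving totals of 50, 50, and 40.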
Moving a shop from Source Pod 9 to Target Pod 23 (each pod has its own MySQL and Redis):
• COPY SHOP_ID 238: SELECT * FROM products WHERE shop_id = 238; SELECT * FROM orders WHERE shop_id = 238
• NEW CHECKOUT (arrives on the source during the copy): INSERT INTO CHECKOUTS …
• REPLICATE SHOP_ID 238 from the Bin Log: CHECKOUT id: 383293
• LOCK SHOP_ID 238
• Routing: UPDATE SHOP_ID 238 pod_id=23
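The copy, replicate, lock, and reroute steps can be walked through with in-memory stand-ins. This is a toy sketch under stated assumptions: lists of dicts stand in for each pod's MySQL, a list stands in for the bin log, and the function name, row layout, and `locked_shops` set are illustrative rather than taken from the talk.

```python
def move_shop(shop_id, target_pod, databases, routing, locked_shops, incoming_writes=()):
    """Move one shop's rows to target_pod and flip its routing entry."""
    source_pod = routing[shop_id]
    source, target = databases[source_pod], databases[target_pod]

    # COPY: bulk-copy the shop's existing rows to the target pod.
    target.extend(dict(row) for row in source if row["shop_id"] == shop_id)

    # Writes arriving during the copy still land on the source; the bin log records them.
    binlog = []
    for row in incoming_writes:
        source.append(row)
        binlog.append(row)

    # REPLICATE: replay the logged writes for this shop on the target.
    target.extend(dict(row) for row in binlog if row["shop_id"] == shop_id)

    # LOCK, then UPDATE routing: writes are blocked only for the short cutover.
    locked_shops.add(shop_id)
    routing[shop_id] = target_pod
    locked_shops.discard(shop_id)
    return len(binlog)

databases = {
    9: [
        {"table": "products", "shop_id": 238, "id": 1},
        {"table": "orders", "shop_id": 238, "id": 7},
        {"table": "products", "shop_id": 5, "id": 2},  # another shop stays put
    ],
    23: [],
}
routing = {238: 9, 5: 9}
locked_shops = set()
new_checkout = {"table": "checkouts", "shop_id": 238, "id": 383293}
replayed = move_shop(238, 23, databases, routing, locked_shops, [new_checkout])
```

The sketch leaves the stale rows on the source pod in place; cleaning them up is outside what the slides show.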
Sorting Hat routes requests for a shop to the region the pod is active in
[Diagram: Sorting Hat sits in the Traffic layer. Region A: Pod 7 Active, Pod 2 Inactive, Pod 14 Active. Region B: Pod 14 Inactive, Pod 7 Inactive, Pod 2 Active.]
Example: GET /products with Host: sneakershop.com
Routing: ROUTE sneakershop.com -> shop238 pod2:B
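The lookup behind that route can be sketched as three small tables: host to shop, shop to pod, and pod to active region. The table names and the `route` function are assumptions for illustration; the values mirror the slide.

```python
# Illustrative routing tables; the real storage is not described in the slides.
HOST_TO_SHOP = {"sneakershop.com": 238}
SHOP_TO_POD = {238: 2}
ACTIVE_REGION = {2: "B", 7: "A", 14: "A"}  # region in which each pod is active

def route(host):
    """Return (shop_id, 'pod<N>:<region>') for a host, or None if the host is unknown."""
    shop_id = HOST_TO_SHOP.get(host)
    if shop_id is None:
        return None
    pod_id = SHOP_TO_POD[shop_id]
    return shop_id, f"pod{pod_id}:{ACTIVE_REGION[pod_id]}"

destination = route("sneakershop.com")
unknown = route("not-a-shop.example")
```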
Pod Mover moves pods between regions with minimal downtime
[Diagram, two frames: Region A and Region B each hold a copy of Pod 7, Pod 2, and Pod 14, one Active and one Inactive per pod, with Sorting Hat in the Traffic layer. Pod 2 is the pod being moved.]
Update Routing for pod to target region pod2:b -> pod2:a
Sorting Hat routes requests to target region
Disable cron in both regions
Fail over MySQL to target region
Enable cron in both regions
Transfer jobs to target region
What about errors while the database fails over?
Pauser will pause requests in the middle of failovers to avoid serving errors
[Diagram: POST /checkout (during failover) reaches the Pauser in Nginx with OpenResty, next to Queue and Throttle. The client receives HTTP 200 (seconds later).]
Update Routing for pod to target region pod2:b -> pod2:a
Sorting Hat routes requests to target region and pauses requests
Disable cron in both regions
Fail over MySQL to target region
Enable cron in both regions
Resume requests
Transfer jobs to target region
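The sequence above, with requests held during the failover, can be sketched with a `threading.Event` as the pause gate. This is an assumption-laden toy: the `Pauser` and `move_pod` names, the 503 fallback on timeout, and the log strings are illustrative, and the cron and job steps are only logged rather than performed.

```python
import threading

class Pauser:
    """Holds requests while paused instead of letting them fail."""

    def __init__(self):
        self._flowing = threading.Event()
        self._flowing.set()  # requests pass through by default

    def pause(self):
        self._flowing.clear()

    def resume(self):
        self._flowing.set()

    def handle(self, request, backend, timeout=5.0):
        # Block until resumed; give up with 503 if the pause outlasts the timeout.
        if not self._flowing.wait(timeout):
            return 503
        return backend(request)

def move_pod(pod, target_region, routing, pauser, fail_over_mysql, log):
    """Run the move steps in the order listed on the slide."""
    routing[pod] = target_region
    log.append("update routing")
    pauser.pause()
    log.append("pause requests")
    log.append("disable cron")  # placeholder: no real cron in this sketch
    fail_over_mysql()
    log.append("fail over MySQL")
    log.append("enable cron")
    pauser.resume()
    log.append("resume requests")
    log.append("transfer jobs")

pauser = Pauser()
routing = {"pod2": "b"}
log, responses, held = [], [], []
worker = threading.Thread(
    target=lambda: responses.append(pauser.handle("POST /checkout", lambda req: 200))
)

def fail_over_mysql():
    # A checkout arrives mid-failover: it should be held, not answered yet.
    worker.start()
    worker.join(timeout=0.05)
    held.append(worker.is_alive() and not responses)

move_pod("pod2", "a", routing, pauser, fail_over_mysql, log)
worker.join()
```

The checkout that arrives during the failover stays blocked until `resume()` and then completes with 200, which matches the behaviour the Pauser slide describes.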
Cloud Migration with the Pods Architecture
[Diagram: the shops in Region A (shop1, shop4, shop9, shop17, shop72, shop3, shop92, shop18, shop64, shop22, shop88, shop0, shop52, shop23) shown next to a Cloud Region C.]
Thanks! @Sirupsen