Page 1
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Nate Wiger, Principal Solutions Architect, AWS
Tom Kerr, Software Engineer, Riot Games
October 8, 2015
Amazon ElastiCache Deep Dive: Scaling Your Data in a Real-Time World
DAT407
Page 2
Amazon ElastiCache
• Managed in-memory service
• Memcached or Redis
• Cluster of nodes
• Read replicas
• Monitoring + alerts
Page 3
ELB App
External APIs
Modern Web / Mobile App
Page 4
Memcached vs Redis
Memcached
• Flat string cache
• Multithreaded
• No persistence
• Low maintenance
• Easy to scale horizontally
Redis
• Single-threaded
• Persistence
• Atomic operations
• Advanced data types - http://redis.io/topics/data-types
• Pub/sub messaging
• Read replicas / failover
Page 5
Storing JSON – Memcached vs Redis
# Memcached: Serialize string
str_json = Encode({"name": "Nate Wiger", "gender": "M"})
SET user:nateware str_json
GET user:nateware
json = Decode(str_json)
# Redis: Use a hash!
HMSET user:nateware name "Nate Wiger" gender M
HGET user:nateware name
>> Nate Wiger
HMGET user:nateware name gender
>> Nate Wiger
>> M
Page 7
ElastiCache with Memcached – Development
Region
Availability Zone A Availability Zone B
Auto Scaling group
ElastiCache cluster
Page 8
ElastiCache with Memcached – Development
Region
Availability Zone A Availability Zone B
Auto Scaling group
ElastiCache cluster
Nope
Page 11
Add Nodes to Memcached Cluster
aws elasticache modify-cache-cluster \
    --cache-cluster-id my-cache-cluster \
    --num-cache-nodes 4 \
    --apply-immediately
# response
"CacheClusterStatus": "modifying",
"PendingModifiedValues": {
    "NumCacheNodes": 4
},
Page 12
ElastiCache with Memcached – High Availability
Region
Availability Zone A Availability Zone B
Auto Scaling group
ElastiCache cluster
Page 13
ElastiCache with Memcached – Scale Out
Region
Availability Zone A Availability Zone B
Auto Scaling group
ElastiCache cluster
Page 15
Consistent Hashing
Client pre-calculates a hash ring for best key distribution
http://berb.github.io/diploma-thesis/original/062_internals.html
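As a rough illustration of the hash ring idea, here is a minimal Python sketch. It is not the code of any particular Memcached client: the class name `HashRing`, the use of MD5, and the 100 virtual points per node are all assumptions chosen for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node (assumed value)
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            if self._owner.get(point) == node:
                del self._owner[point]
                self._points.remove(point)

    def get_node(self, key):
        if not self._points:
            raise ValueError("ring is empty")
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

With this layout, adding a node only takes over the key ranges that fall just before its new points, so most keys keep their existing node and only a fraction of the cache goes cold.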
Page 16
It’s All Been Done Before
• Ruby
  • Dalli https://github.com/mperham/dalli
  • Plus ElastiCache https://github.com/ktheory/dalli-elasticache
• Python
  • HashRing / MemcacheRing https://pypi.python.org/pypi/hash_ring/
  • Django w/ Auto-Discovery https://github.com/gusdan/django-elasticache
• Node.js
  • node-memcached https://github.com/3rd-Eden/node-memcached
  • Auto-Discovery example http://stackoverflow.com/questions/17046661
• Java
  • SpyMemcached https://github.com/dustin/java-memcached-client
  • ElastiCache Client https://github.com/amazonwebservices/aws-elasticache-cluster-client-memcached-for-java
• PHP
  • ElastiCache Client https://github.com/awslabs/aws-elasticache-cluster-client-memcached-for-php
• .NET
  • ElastiCache Client https://github.com/awslabs/elasticache-cluster-config-net
Page 17
Auto-Discovery Endpoint
Page 18
# PHP
$server_endpoint = "mycache.z2vq55.cfg.usw2.cache.amazonaws.com";
$cache = new Memcached();
$cache->setOption(
    Memcached::OPT_CLIENT_MODE, Memcached::DYNAMIC_CLIENT_MODE);
# Set config endpoint as only server
$cache->addServer($server_endpoint, 11211);
DIY: http://bit.ly/elasticache-autodisc
Memcached Node Auto-Discovery
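For the DIY route, a client asks the configuration endpoint for the node list and parses the reply. The sketch below handles only the parsing step. The reply layout it expects (a `CONFIG cluster` header, a version line, then space-separated `hostname|ip|port` entries) is my recollection of the ElastiCache auto-discovery text protocol and should be checked against the AWS documentation. The function name and sample hostnames are hypothetical.

```python
def parse_cluster_config(payload):
    """Parse the text returned by the `config get cluster` command.

    Assumed layout: header line, config version, node list, END.
    Returns (version, [(hostname, ip, port), ...]).
    """
    lines = [ln.strip() for ln in payload.splitlines() if ln.strip()]
    if len(lines) < 3 or not lines[0].startswith("CONFIG cluster"):
        raise ValueError("unexpected config response")
    version = int(lines[1])
    nodes = []
    for entry in lines[2].split():
        host, ip, port = entry.split("|")
        nodes.append((host, ip, int(port)))
    return version, nodes
```

A client would typically poll periodically and rebuild its hash ring only when the version number changes.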
Page 19
App Caching Patterns
Page 20
Be Lazy
# Python
def get_user(user_id):
    record = cache.get(user_id)
    if record is None:
        # Run a DB query
        record = db.query("select * from users where id = ?", user_id)
        cache.set(user_id, record)
    return record
# App code
user = get_user(17)
Page 21
Write On Through
# Python
def save_user(user_id, values):
    record = db.query("update users ... where id = ?", user_id, values)
    cache.set(user_id, record)
    return record
# App code
user = save_user(17, {"name": "Nate Dogg"})
Page 22
Combo Move!
def save_user(user_id, values):
    record = db.query("update users ... where id = ?", user_id, values)
    cache.set(user_id, record, 300)  # TTL
    return record
def get_user(user_id):
    record = cache.get(user_id)
    if record is None:
        record = db.query("select * from users where id = ?", user_id)
        cache.set(user_id, record, 300)  # TTL
    return record
# App code
save_user(17, {"name": "Nate Diddy"})
user = get_user(17)
Page 23
Web Cache with Memcached
# Gemfile
gem 'dalli-elasticache'
# config/environments/production.rb
endpoint = "mycluster.abc123.cfg.use1.cache.amazonaws.com:11211"
elasticache = Dalli::ElastiCache.new(endpoint)
config.cache_store = :dalli_store, elasticache.servers,
  expires_in: 1.day, compress: true
# if you change ElastiCache cluster nodes
elasticache.refresh.client
Ruby on Rails Example
Page 24
Thundering Herd
Causes
• Cold cache – app startup
• Adding / removing nodes
• Cache key expiration (TTL)
• Out of cache memory
Large # of cache misses
Spike in database load
Mitigations
• Script to populate cache
• Gradually scale nodes
• Randomize TTL values
• Monitor cache utilization
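The "Randomize TTL values" mitigation can be sketched in a few lines of Python. The helper name `jittered_ttl` and the default spread of 10% are assumptions for illustration, not values from the talk.

```python
import random


def jittered_ttl(base_ttl, spread=0.1, rng=random):
    """Return base_ttl varied by up to +/- spread (a fraction of base_ttl).

    Spreading expiry times keeps keys written together from all
    expiring, and hitting the database, at the same moment.
    """
    delta = base_ttl * spread
    return int(round(base_ttl + rng.uniform(-delta, delta)))
```

In the caching functions shown earlier, this would replace the fixed TTL, for example `cache.set(user_id, record, jittered_ttl(300))`.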
Page 26
“It’s mine!”
“Not if I destroy it first!”
Need uniqueness + ordering
Easy with Redis Sorted Sets
ZADD "leaderboard" 1201 "Gollum"
ZADD "leaderboard" 963 "Sauron"
ZADD "leaderboard" 1092 "Bilbo"
ZADD "leaderboard" 1383 "Frodo"
ZREVRANGE "leaderboard" 0 -1
1) "Frodo"
2) "Gollum"
3) "Bilbo"
4) "Sauron"
ZREVRANK "leaderboard" "Sauron"
(integer) 3
Real-time Leaderboard!
Page 27
Ex: Throttling requests to an API
Leverages Redis Counters
ELB
Externally
Facing
API
Reference: http://redis.io/commands/INCR
FUNCTION LIMIT_API_CALL(APIaccesskey)
limit = HGET(APIaccesskey, "limit")
time = CURRENT_UNIX_TIME()
keyname = APIaccesskey + ":" + time
count = GET(keyname)
IF count != NULL && count > limit THEN
    ERROR "API request limit exceeded"
ELSE
    MULTI
        INCR(keyname)
        EXPIRE(keyname,10)
    EXEC
    PERFORM_API_CALL()
END
Rate Limiting
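The per-second counter logic in the pseudocode can be tried locally with a small Python sketch. A plain dictionary stands in for Redis here, so this shows only the counting logic, not a distributed limiter. The class name `FixedWindowLimiter` and the injectable `clock` are assumptions, and the check uses `>=` so that exactly `limit` calls pass in each one-second window.

```python
import time


class FixedWindowLimiter:
    """Fixed one-second window counter; a dict stands in for Redis."""

    def __init__(self, limit, clock=time.time):
        self.limit = limit
        self.clock = clock
        self.counters = {}  # (api_key, window) -> count

    def allow(self, api_key):
        window = int(self.clock())
        # Mimic EXPIRE(keyname, 10): forget counters older than 10 seconds.
        self.counters = {
            k: v for k, v in self.counters.items() if k[1] > window - 10
        }
        key = (api_key, window)
        count = self.counters.get(key, 0)
        if count >= self.limit:
            return False  # caller should reject the API request
        self.counters[key] = count + 1  # mimic INCR(keyname)
        return True
```

Against a real Redis the increment and expiry would run inside MULTI/EXEC as in the pseudocode, so that concurrent app servers share one counter.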
Page 28
• Redis counters – increment likes/dislikes
• Redis hashes – list of everyone’s ratings
• Process with algorithm like Slope One or Jaccard similarity
• Ruby example - https://github.com/davidcelis/recommendable
Recommendation Engines
INCR item:38927:likes
HSET item:38927:ratings "Susan" 1
INCR item:38927:dislikes
HSET item:38927:ratings "Tommy" -1
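As one possible way to process such ratings, the Jaccard similarity of two users' liked-item sets is the size of the intersection divided by the size of the union. The Python sketch below assumes the liked items have already been read out of Redis into plain sets. The function name and the choice to return 0.0 for two empty sets are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are "not similar"
    return len(a & b) / len(union)
```

Users with a high score against the current user are good sources of recommendations: suggest the items they liked that the current user has not rated yet.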
Page 29
Chat and Messaging
• PUBLISH and SUBSCRIBE Redis commands
• Game or Mobile chat
• Server intercommunication
SUBSCRIBE chat_channel:114
PUBLISH chat_channel:114 "Hello all"
["message", "chat_channel:114", "Hello all"]
UNSUBSCRIBE chat_channel:114
Page 30
ElastiCache with Redis – Development
Region
Availability Zone A Availability Zone B
Auto Scaling group
ElastiCache cluster
Page 31
Availability Zone A Availability Zone B
Use Primary Endpoint
Use Read Replicas
Auto-Failover
Chooses replica with lowest replication lag
DNS endpoint is same
Redis Multi-AZ
Page 32
ElastiCache with Redis Multi-AZ
Region
Availability Zone A Availability Zone B
Auto Scaling group
ElastiCache cluster
Page 39
Redis Multi-AZ – Reads and Writes
ELB App
External APIs
Replication Group
Reads Writes
Page 40
Redis – Read/Write Connections
# Ruby example
require 'redis'
require 'redis/distributed'
# Writes go to the primary endpoint
redis_write = Redis.new(
  host: 'mygame-dev.z2vq55.ng.0001.usw2.cache.amazonaws.com')
# Reads are spread across the read replicas
redis_read = Redis::Distributed.new([
  'redis://mygame-dev-002.z2vq55.ng.0001.usw2.cache.amazonaws.com',
  'redis://mygame-dev-003.z2vq55.ng.0001.usw2.cache.amazonaws.com'
])
redis_write.zadd("leaderboard", 1976, "nateware")
top_10 = redis_read.zrevrange("leaderboard", 0, 9)
Page 41
Recap – Endpoint Autodetection
• Cluster endpoints:
aws elasticache describe-cache-clusters \
    --cache-cluster-id mycluster \
    --show-cache-node-info
• Redis read replica endpoints:
aws elasticache describe-replication-groups \
    --replication-group-id myredisgroup
• Can listen for SNS events: http://bit.ly/elasticache-sns
http://bit.ly/elasticache-whitepaper
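Once the `describe-replication-groups` output is available as parsed JSON, separating the write endpoint from the read endpoints might look like the Python sketch below. The field names (`NodeGroups`, `PrimaryEndpoint`, `NodeGroupMembers`, `ReadEndpoint`, `CurrentRole`) are my recollection of the ElastiCache API response and should be verified against the API reference. The function name and sample hostnames in any usage are hypothetical.

```python
def split_endpoints(response):
    """Return (primary, replicas) as (address, port) tuples.

    `response` is the parsed JSON from describe-replication-groups.
    Field names are assumed from the ElastiCache API reference.
    """
    primary, replicas = None, []
    for group in response.get("ReplicationGroups", []):
        for node_group in group.get("NodeGroups", []):
            endpoint = node_group.get("PrimaryEndpoint")
            if endpoint:
                # The primary endpoint's DNS name stays the same across failover.
                primary = (endpoint["Address"], endpoint["Port"])
            for member in node_group.get("NodeGroupMembers", []):
                if member.get("CurrentRole") == "replica":
                    read = member["ReadEndpoint"]
                    replicas.append((read["Address"], read["Port"]))
    return primary, replicas
```

The primary tuple feeds the write connection and the replica list feeds the read connection pool.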
Page 42
Splitting Redis By Purpose
ELB App
External APIs
Reads Writes
Replication Group
Leaderboards
Replication Group
User Profiles
Reads
Page 43
Don’t Plan Ahead!!
1. Start with one Redis Multi-AZ cluster
2. Split as needed
3. Scale read load via replicas
4. Rinse, repeat
Page 45
Alarms
Monitoring with CloudWatch
• CPU
• Evictions
• Memory
• Swap Usage
• Network In/Out
Page 46
Key ElastiCache CloudWatch Metrics
• CPUUtilization
• Memcached – up to 90% ok
• Redis – single-threaded, so divide the threshold by the number of cores (ex: 90% / 4 = 22.5%)
• SwapUsage low
• CacheMisses / CacheHits Ratio low / stable
• Evictions near zero
• Exception: Russian doll caching
• CurrConnections stable
• Whitepaper: http://bit.ly/elasticache-whitepaper
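The Redis CPU rule above is simple arithmetic, shown here as a small Python helper. The function name and the default 90% target are taken from the slide's example rather than from any AWS tool.

```python
def redis_cpu_alarm_threshold(cores, target_percent=90.0):
    """CPUUtilization alarm threshold for a single-threaded Redis node.

    Redis can saturate at most one core, so a target of `target_percent`
    on that core shows up as target_percent / cores at the host level.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return target_percent / cores
```

For a four-core node type this gives the 22.5% figure from the slide, which would be the value to set on the CloudWatch alarm.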
Page 47
Scaling Up Redis
1. Snapshot existing cluster to Amazon S3
http://bit.ly/redis-snapshot
2. Spin up new Redis cluster from snapshot
http://bit.ly/redis-seeding
3. Profit!
4. Also good for debugging copy of production data
Page 49
DNS Caching – Redis Failover
• Failover requires updating a DNS CNAME
• Can take up to two minutes
• Watch out for app DNS caching – esp. Java!
http://bit.ly/jvm-dns
• No API for triggering Redis failover
  • Turn off Multi-AZ temporarily
  • Promote replica to primary
  • Turn on Multi-AZ
Page 50
1. Forks main Redis process
2. Writes data to disk from child process
3. Continues to accept traffic on main process
4. Any key update causes a copy-on-write
5. Potentially DOUBLES memory usage by Redis
Swapping During Redis Backup (BGSAVE)
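A back-of-the-envelope way to reason about the copy-on-write cost is sketched below in Python. The model is a deliberate simplification and an assumption of this example: every part of the dataset rewritten during the save is copied once, so a fully rewritten dataset doubles memory use. The function names are hypothetical.

```python
def bgsave_peak_memory(dataset_gb, rewritten_fraction):
    """Rough upper bound on memory (GB) while a forked BGSAVE runs.

    Assumed model: the fraction of the dataset rewritten during the
    save is duplicated by copy-on-write.
    """
    if not 0.0 <= rewritten_fraction <= 1.0:
        raise ValueError("rewritten_fraction must be between 0 and 1")
    return dataset_gb * (1.0 + rewritten_fraction)


def fits_without_swap(node_memory_gb, dataset_gb, rewritten_fraction):
    """True if the estimated peak stays within the node's memory."""
    return bgsave_peak_memory(dataset_gb, rewritten_fraction) <= node_memory_gb
```

Under this model a 10 GB dataset with a quarter of it rewritten during the backup needs about 12.5 GB, which is why write-heavy workloads need extra headroom.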
Page 51
Reduce memory allocated to Redis
• Set reserved-memory field in parameter groups
• Evicts more data from memory
Use larger cache node type
• More expensive
• But no data eviction
Write-heavy apps need extra Redis memory
Swapping During Redis Backup – Solutions
Page 52
Redis reserved-memory Parameter
Page 53
Redis Engine Enhancements
• Only Available in Amazon ElastiCache
• Forkless backups = Lower memory usage
• If enough memory, will still fork (faster)
• Improved replica sync under heavy write loads
• Smoother failovers (PSYNC)
• Two new CloudWatch metrics
• ReplicationBytes: Number of bytes sent from primary node
• SaveInProgress: 1/0 value that indicates if save is running
• Try it today! Redis 2.8.22 or later.
Page 54
Riot Games: ElastiCache in the Wild
Tom Kerr
Page 56
LEAGUE OF LEGENDS
Page 60
APOLLO: COMMENTS ANYWHERE
Page 62
APOLLO: ARCHITECTURE
Page 63
Replication with automatic failover
Replication across availability zones
More snapshots, more often
Page 69
LESS GOOD
Fun Stuff Deploy Stuff
GOOD
Fun Stuff Deploy Stuff
Page 74
LEADERBOARDS: ARCHITECTURE
Page 75
LEADERBOARDS: DATA STORE
Page 76
US-WEST2:NA:3848433 37
US-WEST2:NA:3848 37433
http://redis.io/topics/memory-optimization
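The linked memory-optimization page describes packing many small keys into hashes by splitting each key into a hash name and a field. Reading the slide's example that way, the split might look like the Python sketch below. The three-digit field width and the function name are assumptions, not details confirmed by the talk.

```python
def split_key(key, field_digits=3):
    """Split 'PREFIX:1234567' into a hash name and a short field.

    The trailing `field_digits` digits of the numeric id become the
    hash field; everything before them names the hash.
    """
    prefix, _, ident = key.rpartition(":")
    if not ident.isdigit() or len(ident) <= field_digits:
        raise ValueError("key must end in a numeric id longer than field_digits")
    head, field = ident[:-field_digits], ident[-field_digits:]
    bucket = f"{prefix}:{head}" if prefix else head
    return bucket, field
```

With this split, storing the value 37 would become `HSET US-WEST2:NA:3848 433 37` instead of `SET US-WEST2:NA:3848433 37`, and each hash holds at most 1000 fields, small enough for Redis to keep it in its compact encoding under typical settings.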
Page 78
BEST PRACTICES
Replicas with automatic failover
Manually snapshot more often
Monitor your replication metrics
Redis hash key trick
Page 79
Thank you!
Nate Wiger, Principal Solutions Architect, AWS
Tom Kerr, Software Engineer, Riot Games