Scaling Memcache at Facebook
Presenter: Rajesh Nishtala
Co-authors: Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, Venkateshwaran Venkataramani
3. Be able to access and update very popular shared content
4. Scale to process millions of user requests per second
Design Requirements
Support a very heavy read load
• Over 1 billion reads / second
• Insulate backend services from high read rates
Geographically Distributed
Support a constantly evolving product
• System must be flexible enough to support a variety of use cases
• Support rapid deployment of new features
Persistence handled outside the system
• Support mechanisms to refill after updates
Need more read capacity
• Two orders of magnitude more reads than writes
• Solution: Deploy a few memcache hosts to handle the read capacity
• How do we store data?
• Demand-filled look-aside cache
• Common case is data is available in the cache
[Diagram: Web Server, Memcache, and Databases, with the read path labeled in order]
1. Get (key)
2. Miss (key)
3. DB lookup
4. Set (key)
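The four numbered steps of the read path can be sketched as follows. This is a minimal illustration, not the production client: `LookAsideCache`, `read`, and `db_lookup` are hypothetical names, and an in-process dictionary stands in for a memcache host.

```python
from typing import Callable, Dict, Optional


class LookAsideCache:
    """Toy in-process stand-in for a memcache host (hypothetical, not a real client API)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        # Returns None on a miss.
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def read(key: str, cache: LookAsideCache, db_lookup: Callable[[str], str]) -> str:
    """Demand-filled look-aside read: the web server, not the cache, talks to the DB."""
    value = cache.get(key)       # 1. Get (key)
    if value is None:            # 2. Miss (key)
        value = db_lookup(key)   # 3. DB lookup
        cache.set(key, value)    # 4. Set (key)
    return value
```

In the common case the first `get` hits and the database is never contacted, which is what lets a few cache hosts absorb the read load.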
Handling updates
• Memcache needs to be invalidated after DB write
• Prefer deletes to sets
– Idempotent
– Demand filled
• Up to web application to specify which keys to invalidate after database update
[Diagram: Web Server, Memcache, and Database, with the write path labeled in order]
1. Database update
2. Delete
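The write path (database update, then delete) might look like the sketch below. It assumes plain dictionaries in place of the database and memcache, and `update` and its parameters are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable


def update(
    db: Dict[str, str],
    cache: Dict[str, str],
    row_key: str,
    new_value: str,
    keys_to_invalidate: Iterable[str],
) -> None:
    """Write to the database first, then invalidate the cached keys the caller names."""
    db[row_key] = new_value        # 1. Database update
    for key in keys_to_invalidate:
        # 2. Delete. Removing an absent key is a no-op, so the delete can be
        # repeated or retried without changing the outcome (idempotent).
        cache.pop(key, None)
```

Because the cache is demand filled, the next read of an invalidated key misses and reloads the fresh value from the database.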
Memcache
While evolving our system we prioritize two major design goals.
• Any change must impact a user-facing or operational issue. Optimizations that have limited scope are rarely considered.
• We treat the probability of reading transient stale data as a parameter to be tuned, similar to responsiveness. We are willing to expose slightly stale data in exchange for insulating a backend storage service from excessive load.
Roadmap
• Single front-end cluster
– Read heavy workload
– Wide fanout
– Handling failures
• Multiple front-end clusters
– Controlling data replication
– Data consistency
• Multiple Regions
– Data consistency
Single front-end cluster
• Reducing Latency: focusing on the memcache client
– Stale sets: A stale set occurs when a web server sets a value in memcache that does not reflect the latest value that should be cached. This can occur when concurrent updates to memcache get reordered.
– Thundering herd: happens when a specific key undergoes heavy read and write activity.
• Stale values:
– When a key is deleted, its value is transferred to a data structure that holds recently deleted items, where it lives for a short time before being flushed. A get request can return a lease token or data that is marked as stale.
• Memcache Pools: We designate one pool (named wildcard) as the default and provision separate pools for keys whose residence in wildcard is problematic.
• Replication Within Pools: We choose to replicate a category of keys within a pool when:
– the application routinely fetches many keys simultaneously
– the entire data set fits in one or two memcached servers
– the request rate is much higher than what a single server can manage
• Handling Failures
– There are two scales at which we must address failures:
  • a small number of hosts are inaccessible due to a network or server failure
  • a widespread outage that affects a significant percentage of the servers within the cluster
– Gutter
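Returning to the lease tokens and stale values mentioned in the bullets above, the following is a rough sketch of how a cache could use them against stale sets and thundering herds. Everything here is an assumption made for illustration: `LeaseCache`, `GetResult`, and `set_with_lease` are hypothetical names, and time-based expiry of leases and of recently deleted items is left out.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GetResult:
    value: Optional[str] = None   # cached value, or a stale one if marked
    lease: Optional[int] = None   # token allowing the caller to fill the key
    stale: bool = False           # True when value comes from recently deleted items


class LeaseCache:
    """Toy cache with leases (hypothetical sketch; no timeouts or rate limits)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._recently_deleted: Dict[str, str] = {}
        self._leases: Dict[str, int] = {}
        self._tokens = itertools.count(1)

    def get(self, key: str) -> GetResult:
        if key in self._data:
            return GetResult(value=self._data[key])
        stale_value = self._recently_deleted.get(key)
        lease = None
        if key not in self._leases:
            # Only one caller gets a token per miss, so only one goes to the DB.
            lease = next(self._tokens)
            self._leases[key] = lease
        return GetResult(value=stale_value, lease=lease, stale=stale_value is not None)

    def set_with_lease(self, key: str, value: str, lease: Optional[int]) -> bool:
        # Reject the set unless the token is still the outstanding one for this key.
        if lease is None or self._leases.get(key) != lease:
            return False
        del self._leases[key]
        self._data[key] = value
        self._recently_deleted.pop(key, None)
        return True

    def delete(self, key: str) -> None:
        if key in self._data:
            # Keep the old value around so readers can be served stale data.
            self._recently_deleted[key] = self._data.pop(key)
        # Invalidate any outstanding token so a reordered set is rejected.
        self._leases.pop(key, None)
```

A caller that receives neither a value nor a lease would be expected to wait briefly and retry, or to accept the stale value if the application tolerates it.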
Multiple front-end clusters; Region
• Region:
– We split our web and memcached servers into multiple front-end clusters. These clusters, along with a storage cluster that contains the databases, define a region.
– We trade replication of data for more independent failure domains, tractable network configuration, and a reduction of incast congestion.
Databases invalidate caches
• Cached data must be invalidated after database updates
• Solution: Tail the MySQL commit log and issue deletes based on transactions that have been committed
• Allows caches to be resynchronized in the event of a problem
[Diagram: a MySQL Storage Server with its Commit Log and McSqueal, and Front-End Clusters #1, #2, and #3, each containing Web Servers and MC servers]
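One way to picture the commit-log approach is the sketch below. The log format is invented for illustration: `LogEntry`, its `committed` flag, and its `keys` list are hypothetical, as is `tail_and_invalidate`; the real McSqueal daemon and MySQL log format are not shown in the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LogEntry:
    committed: bool    # hypothetical flag: was the transaction committed?
    keys: List[str]    # hypothetical list of cache keys the transaction affects


def tail_and_invalidate(
    log: List[LogEntry],
    clusters: List[Dict[str, str]],
    start: int = 0,
) -> int:
    """Issue deletes for committed entries from `start` onward.

    Returns the next log offset to resume from.
    """
    for entry in log[start:]:
        if not entry.committed:
            continue  # only committed transactions trigger invalidations
        for cache in clusters:
            for key in entry.keys:
                cache.pop(key, None)  # idempotent delete in every front-end cluster
    return len(log)
```

Because deletes are idempotent, replaying the log from an earlier offset is safe, which is one way to read the claim that caches can be resynchronized after a problem.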
Invalidation pipeline
Too many packets
• Aggregating deletes reduces packet rate by 18x
• Makes configuration management easier
• Each stage buffers deletes in case downstream component is down
[Diagram: three DB hosts each running McSqueal, a tier of Memcache Routers, and three front-end clusters each with their own Memcache Routers and MC servers]
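The aggregation and buffering described in the bullets above could be sketched as a small batching stage. `DeleteBatcher`, its `send` callback, and the batch size are all hypothetical; the batch size here is arbitrary and is not meant to reproduce the 18x figure.

```python
from typing import Callable, List


class DeleteBatcher:
    """One pipeline stage that groups deletes into fewer, larger packets (sketch)."""

    def __init__(self, send: Callable[[List[str]], None], batch_size: int = 4) -> None:
        self._send = send              # hypothetical callback that ships one packet downstream
        self._batch_size = batch_size
        self._pending: List[str] = []

    def delete(self, key: str) -> None:
        self._pending.append(key)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> bool:
        """Send everything buffered as one packet; keep it buffered if downstream is down."""
        if not self._pending:
            return True
        try:
            self._send(list(self._pending))
        except ConnectionError:
            return False               # deletes stay buffered for a later retry
        self._pending.clear()
        return True

    def pending(self) -> List[str]:
        return list(self._pending)
```

With a batch size of 4, eight deletes leave this stage as two packets instead of eight, and a failed send leaves the keys in the buffer rather than dropping them.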
• Regional Pools
– We can reduce the number of replicas by having multiple front-end clusters share the same set of memcached servers. We call this a regional pool.
• Cold Cluster Warmup
– A newly started cluster has an empty cache and therefore a poor hit rate. A system called Cold Cluster Warmup mitigates this by allowing clients in the “cold cluster” to retrieve data from the “warm cluster” rather than the persistent storage.
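Cold Cluster Warmup can be pictured as one extra lookup step on a miss. The sketch below uses dictionaries for both clusters' caches; `cold_get` and `db_lookup` are hypothetical names, and races with concurrent invalidations are deliberately not handled.

```python
from typing import Callable, Dict


def cold_get(
    key: str,
    cold: Dict[str, str],
    warm: Dict[str, str],
    db_lookup: Callable[[str], str],
) -> str:
    """Read in the cold cluster, preferring the warm cluster over persistent storage."""
    value = cold.get(key)
    if value is None:
        value = warm.get(key)          # ask the warm cluster's cache first
        if value is None:
            value = db_lookup(key)     # fall back to persistent storage
        cold[key] = value              # fill the cold cluster's cache
    return value
```

As the cold cluster fills, more reads are served locally and the warm cluster is consulted less often.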
Across Regions: Geographically distributed clusters
• One region holds the master databases and the other regions contain read-only replicas.
• Advantages
– putting web servers closer to end users can significantly reduce latency
– geographic diversity can mitigate the effects of events such as natural disasters or massive power failures
– new locations can provide cheaper power and other economic incentives
[Diagram: Geographically distributed clusters, with one Master region and two Replica regions]
Writes in non-master
Database update directly in master
• Race between DB replication and subsequent DB read
[Diagram: two Web Servers, Memcache, Replica DB, and Master DB, with the following labeled steps]
1. Write to master
2. Delete from mc
3. MySQL replication
3. Read from DB (get missed)
4. Set potentially stale value to memcache
Race! The read in step 3 can reach the replica DB before replication has delivered the write.
Remote markers
Set a special flag that indicates whether a race is likely
[Diagram: Web Server, Memcache, Replica DB, and Master DB, with the following labeled steps]
2. Write to master
3. Delete from memcache
4. MySQL replication
5. Delete remote marker
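The remote marker idea can be sketched end to end as below. Several things are assumptions rather than facts from the slides: that setting the marker is the first step of the write, that a reader consults the marker on a cache miss to choose between the master and the replica, and that the value read is then cached. All function names are hypothetical, and dictionaries and a set stand in for the databases, memcache, and the marker store.

```python
from typing import Dict, Set


def write_from_replica_region(
    key: str, value: str,
    master_db: Dict[str, str], local_cache: Dict[str, str], markers: Set[str],
) -> None:
    markers.add(key)                # assumed step 1: set the remote marker
    master_db[key] = value          # 2. Write to master
    local_cache.pop(key, None)      # 3. Delete from memcache


def on_replication(
    key: str,
    master_db: Dict[str, str], replica_db: Dict[str, str], markers: Set[str],
) -> None:
    replica_db[key] = master_db[key]   # 4. MySQL replication reaches the replica
    markers.discard(key)               # 5. Delete remote marker


def read_in_replica_region(
    key: str,
    local_cache: Dict[str, str], replica_db: Dict[str, str],
    master_db: Dict[str, str], markers: Set[str],
) -> str:
    if key in local_cache:
        return local_cache[key]
    # A marker means the replica may lag, so read from the master instead.
    source = master_db if key in markers else replica_db
    value = source[key]
    local_cache[key] = value           # assumption: the value read is cached
    return value
```

The cost is a slower cross-region read while the marker is present, in exchange for a lower chance of caching a stale value.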
• Software Upgrades
– Memcache data is kept in System V shared memory regions, which makes software upgrades on the machine easier.
Memcache Workload
• Fanout
• Response size
• Pool Statistics
• Invalidation Latency
Conclusion
• Separating cache and persistent storage systems allows us to independently scale them
• Features that improve monitoring, debugging and operational efficiency are as important as performance
• Managing stateful components is operationally more complex than stateless ones. As a result keeping logic in a stateless client helps iterate on features and minimize disruption.