Page 1

How We Scaled Freshdesk to Handle 150M Requests/Week

Kiran Darisi, Director, Technical Operations at Freshdesk

Page 2

Our customer base grew by 400% and the number of requests per week boomed from 10 to 65 million in a year (2013).

Page 3

Cool for a 3-year-old startup? Not from an engineering perspective.

Page 4

We used a bunch of methods to scale vertically in a really short amount of time.

Sure, we eventually had to shard our databases, but some of these techniques helped us stay afloat for quite a while.

Page 5

MOORE’S WAY

Increasing the RAM, CPU and I/O

We upgraded from an Amazon EC2 First Generation Medium instance to a High-Memory Quadruple Extra Large instance (taking our RAM from 3.75 GB to 64 GB).

But the amount of RAM and CPU cycles we added did not correlate with the workload we got out of the instance, so we stayed put at 64 GB.

Page 6

THE READ/WRITE SPLIT

Using MySQL replication and distributing the reads between master and slave

The read/write split increased the number of I/Os performed on our databases, but it didn’t do much for write performance.

We marked dedicated roles for each slave, because using a round-robin algorithm to select different slaves for different queries proved ineffective.
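
The deck doesn’t show how the split was wired up. Below is a minimal sketch of the same idea using the multi-database support built into modern Rails (Rails 6+), which didn’t exist at the time, so the model name and the database.yml entries (primary, primary_replica) are illustrative, not Freshdesk’s actual setup.

```ruby
# app/models/application_record.rb
# Assumes database.yml defines a writer ("primary") and a MySQL slave
# ("primary_replica", marked replica: true).
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Writes go to the master, reads can go to the slave.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Heavy read-only work (reports, list views) can be pinned to the slave
# explicitly -- the "dedicated role per slave" idea from the slide:
ActiveRecord::Base.connected_to(role: :reading) do
  Ticket.where(status: "open").count   # Ticket is an illustrative model
end
```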

Page 7

MYSQL PARTITIONING

Using the MySQL 5 built-in partitioning capability

We chose the partition key and the number of partitions, and the table was partitioned automatically.

Post-partitioning, our read performance increased dramatically but, again, write performance was a problem.

Page 8

Things to keep in mind while performing MySQL partitioning (a sketch follows this list)

1. Choose the partition key carefully, or alter the current schema to follow the MySQL partitioning rules.

2. The number of partitions you start with directly affects the I/O operations on the disk.

3. If you use a hash-based algorithm with hash-based keys, you cannot control which tenant goes where. This means you’ll be in trouble if two or more noisy customers fall within the same partition.

4. Make sure that every query contains the MySQL partition key. A query without the partition key ends up scanning all the partitions, and performance is sure to take a dive.
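
As a concrete illustration of the rules above, here is a sketch of a partitioned table. The table name, columns, and partition count are made up, and the DDL is wrapped in a Rails migration only for convenience.

```ruby
class PartitionTickets < ActiveRecord::Migration[5.2]
  def up
    execute <<~SQL
      CREATE TABLE tickets (
        id         BIGINT NOT NULL AUTO_INCREMENT,
        account_id BIGINT NOT NULL,            -- the partition key (rule 1)
        subject    VARCHAR(255),
        created_at DATETIME,
        -- MySQL requires the partition key to be part of every unique key,
        -- hence the composite primary key.
        PRIMARY KEY (id, account_id)
      ) ENGINE=InnoDB
      PARTITION BY KEY (account_id)   -- hash-style placement; no control over
                                      -- which tenant lands where (rule 3)
      PARTITIONS 64;                  -- fixed up front; affects disk I/O (rule 2)
    SQL
  end
end

# Rule 4: always include the partition key so MySQL prunes partitions instead
# of scanning all of them (Ticket and current_account are illustrative).
Ticket.where(account_id: current_account.id, status: "open")
```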

Page 9

CACHING

Caching objects that rarely change in their lifetime

We cached ActiveRecord objects as well as HTML partials (bits and pieces of HTML) using Memcached.

We chose Memcached because it scales well with multiple clusters. The Memcached client you use makes a lot of difference to response time, so we went with dalli.
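
A minimal sketch of the two kinds of caching mentioned, using dalli as the deck says; the server names, cache keys, and expiry times are placeholders.

```ruby
# Gemfile
gem "dalli"

# config/environments/production.rb -- point Rails.cache at Memcached via dalli.
config.cache_store = :mem_cache_store,
                     "cache-1.internal:11211", "cache-2.internal:11211",
                     { compress: true, expires_in: 1.hour }

# Caching an ActiveRecord object that rarely changes in its lifetime:
account = Rails.cache.fetch("account/#{account_id}", expires_in: 12.hours) do
  Account.find(account_id)
end

# Caching an HTML partial (fragment caching) in a view:
# <% cache ["ticket_sidebar", @ticket] do %>
#   <%= render "tickets/sidebar", ticket: @ticket %>
# <% end %>
```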

Page 10

DISTRIBUTED FUNCTIONS

Keeping response time low by using different storage engines for different purposes

We started using Amazon RedShift for analytics and data mining, and Redis to store state information and background jobs for Resque.

But because Redis can’t scale or fall back, we don’t use it for atomic operations.
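
For the Resque part, a minimal sketch of pointing background jobs at Redis; the hostname and the example job are illustrative, not Freshdesk’s actual code.

```ruby
# config/initializers/resque.rb -- Resque keeps its queues and job state in Redis.
require "redis"
require "resque"

Resque.redis = Redis.new(host: "redis.internal", port: 6379)

# A job that a web request enqueues instead of doing the work inline.
class SendTicketNotification
  @queue = :notifications

  def self.perform(ticket_id)
    # deliver the notification for ticket_id ...
  end
end

Resque.enqueue(SendTicketNotification, 42)
```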

Page 11

But scaling vertically can only get you so far.

We decided that scaling horizontally by sharding was the only cost-effective way to increase write scalability beyond the instance size.

Page 12

Two main concerns we had before we took the final call on sharding:

1. No distributed transactions – We wanted all tenant details to be in one shard.

2. Rebalancing the shards should be easy – We wanted control over which tenant sits in which shard and to be able to move them around when needed.

A little research showed us that directory-based sharding was the only way to go.

Page 13

REASONS FOR CHOOSING DIRECTORY-BASED SHARDING

It is simpler than hash key-based or range-based sharding.

Rebalancing shards is easier here than in other methods.

Page 14

A typical directory entry looks like this:

tenant_info          shard_details    shard_status
Stark Industries     shard1           Read & Write

• tenant_info - unique key referring to the DB entry

• shard_details - shard in which that tenant exists

• shard_status - the kind of activity the tenant is ready for (we have multiple shard statuses, like Not Ready, Read Only, Read & Write, etc.)

Page 15

How directory lookups work

The API wrapper is tuned to accept tenant information in multiple forms, like the tenant URL, tenant ID, etc.

The sharding API even acts as a unique ID generator, so that the tenant IDs it generates are unique across shards.
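
The deck doesn’t show the sharding API itself, so the following is only a hypothetical sketch of a directory lookup: a small central table with the three columns shown earlier, plus an assumed tenant_url column, used to resolve a tenant and switch the ActiveRecord connection to its shard.

```ruby
# Hypothetical directory model living in a small central database.
class ShardMapping < ActiveRecord::Base
  NOT_READY  = "Not Ready"
  READ_ONLY  = "Read Only"
  READ_WRITE = "Read & Write"

  # Accept tenant information in multiple forms (tenant ID or tenant URL).
  def self.lookup(tenant)
    tenant.is_a?(Integer) ? find_by(tenant_info: tenant) : find_by(tenant_url: tenant)
  end
end

mapping = ShardMapping.lookup("wayneenterprises.freshdesk.com")
raise "tenant not ready" if mapping.nil? || mapping.shard_status == ShardMapping::NOT_READY

# Switch the connection to the tenant's shard for this request;
# shard1, shard2, ... would be entries in database.yml.
ActiveRecord::Base.establish_connection(mapping.shard_details.to_sym)
```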

Page 16

Why we care about rebalancing

Sometimes a customer grows from processing 1,000 tickets per day to 10,000 tickets per day, which affects the performance of the whole shard.

We can’t solve this by splitting the customer’s data across multiple shards, because we didn’t want the mess of distributed transactions.

So, in these cases, we’d move the noisy customer to a shard of their own. That way, everybody wins.

Page 17

Steps to Rebalance a Shard

Page 18

1. Every shard will have its own slave to scale the reads. For example, say Wayne Enterprises and Stark Industries are both in shard1. The directory entries look like this:

Wayne Enterprises    shard1    Read & Write
Stark Industries     shard1    Read & Write

Page 19

2. If Wayne Enterprises grows at a breakneck pace, we would decide to move it to another shard (averting the danger of Bruce Wayne and Tony Stark being mad at us at the same time).

Page 20

3. So we would boot up a new slave of shard1 and call it shard2. Then, we’d attach a read replica to the new slave and wait for it to sync with the master.

Page 21

4. We would then stop the writes for Wayne Enterprises by changing its shard status in the directory:

Wayne Enterprises    shard1    Read Only
Stark Industries     shard1    Read & Write

Page 22

5. Then we would stop the replication of master data into shard2 and promote shard2 to master. Now the directory entry is updated accordingly:

Wayne Enterprises    shard2    Read & Write
Stark Industries     shard1    Read & Write

Page 23

6. This effectively moves Wayne Enterprises to its own shard. Batman is happy, and so is Iron Man.
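
Steps 4-6 compressed into one hypothetical helper, reusing the ShardMapping sketch from earlier; promote_replica_to_master stands in for whatever ops tooling actually breaks replication and promotes the new master, which the deck doesn’t describe.

```ruby
# Hypothetical rebalance routine following the steps above. Assumes shard2 has
# already been booted as a slave of shard1 and has caught up (steps 1-3).
def move_tenant(tenant_info, new_shard:)
  mapping = ShardMapping.find_by!(tenant_info: tenant_info)

  # Step 4: stop writes for the tenant while it still points at the old shard.
  mapping.update!(shard_status: ShardMapping::READ_ONLY)

  # Step 5: break replication and promote the new shard to a master of its own.
  promote_replica_to_master(new_shard)   # assumed ops hook (e.g. an RDS promote)

  # Step 6: point the directory at the new shard and re-enable writes.
  mapping.update!(shard_details: new_shard, shard_status: ShardMapping::READ_WRITE)
end

move_tenant(wayne_enterprises_id, new_shard: "shard2")  # illustrative tenant ID
```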

Page 24

Word of caution

1. Don’t shard unless it’s absolutely necessary. You will have to rewrite code for your whole app, and maintain it.

2. You could use functional partitioning (moving an oversized table to another DB altogether) to avoid sharding entirely, if writes are not a problem.

3. Choosing the right sharding algorithm is a bit tricky, as each one has its own benefits and drawbacks. Make a thorough study of all your requirements before picking one.

4. You will have to take care of unique ID generation across shards (a sketch of one approach follows this list).
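
On point 4, one common approach (not necessarily what Freshdesk did) is to have the central directory hand out IDs, so rows created on different shards can never collide. A hypothetical sketch using a MySQL sequence table:

```ruby
# Hypothetical central ID generator living next to the directory database.
# Backing table: global_sequences(name VARCHAR PRIMARY KEY, value BIGINT).
class GlobalSequence < ActiveRecord::Base
  # Atomically bump and read the counter using MySQL's LAST_INSERT_ID(expr).
  def self.next_id(name = "tenants")
    connection.execute(
      "UPDATE global_sequences SET value = LAST_INSERT_ID(value + 1) " \
      "WHERE name = #{connection.quote(name)}"
    )
    connection.select_value("SELECT LAST_INSERT_ID()").to_i
  end
end

new_tenant_id = GlobalSequence.next_id("tenants")  # unique across all shards
```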

Page 25

What’s next for Freshdesk

We now get 250,000 tickets across Freshdesk every day, and 100 million queries during the same time (with a peak of 3-4K QPS). We have a separate shard for all new signups, and each shard can roughly carry 20,000 tenants.

In the future, we’d like to explore a multi-pod architecture and also look at a proxy architecture using MySQL Fabric, Scalebase, etc.

Page 26

“Behind every slideshare is a great blogpost”

Read more about scaling Freshdesk here: http://blog.freshdesk.com/how-freshdesk-scaled-using-sharding/