How to design a cache system?
Caching is a widely adopted technique in almost every application today, and it applies to every layer of the technology stack. For instance, at the network layer caching is used for DNS lookups, and web servers cache responses to frequent requests.
In short, a cache system stores commonly used resources (often in memory) so that the next time someone requests the same resource, the system can return it immediately. It improves system efficiency by trading extra storage space for speed.
LRU
One of the most common cache systems is the LRU (least recently used) cache. In fact, another common interview question is to discuss the data structures and design of an LRU cache. Let’s start with this approach.
The way an LRU cache works is quite simple. When a client requests resource A, the following happens:
If A exists in the cache, we return it immediately.
If not and the cache has spare slots, we fetch resource A, return it to the client, and insert A into the cache.
If the cache is full, we evict the least recently used resource and replace it with resource A.
The strategy here is to maximize the chance that the requested resource exists in the cache. So how can we implement a simple LRU?
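A common design pairs a hash map with a structure that tracks recency, such as a doubly linked list. Here is a minimal sketch in Python using OrderedDict, which provides both in one structure; the class and method names are illustrative rather than part of the original discussion.

```python
from collections import OrderedDict

class LRUCache:
    """A minimal LRU cache: a hash map plus recency ordering.

    OrderedDict keeps keys in insertion order; moving a key to the end
    on every access makes the front the least recently used entry.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None                      # cache miss
        self.store.move_to_end(key)          # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used entry
```

With a capacity of 2, putting A, B and then C would evict A, since A is the least recently used entry at that point.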
Eviction policy
When the cache is full, we need to remove existing items to make room for new resources. Deleting the least recently used item is just one of the most common approaches. So are there other ways to do it?
As mentioned above, the strategy is to maximize the chance that the requested resource exists in the cache. I’ll briefly mention several approaches here:
Random Replacement (RR) – As the term suggests, we can just randomly delete an entry.
Least frequently used (LFU) – We keep a count of how frequently each item is requested and delete the one that is used least often (a toy sketch follows after this list).
W-TinyLFU – I’d also like to mention this modern eviction policy. In a nutshell, the problem with LFU is that an item may have been requested frequently only in the past, yet LFU will still keep it around for a long while. W-TinyLFU solves this by counting frequency within a time window. It also has various storage optimizations.
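For illustration, here is a toy LFU sketch that tracks a request count per key and evicts the key with the smallest count; a production LFU would use O(1) frequency buckets rather than the linear scan below.

```python
from collections import Counter

class LFUCache:
    """A toy LFU cache: count how often each key is requested and
    evict the key with the smallest count when the cache is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}
        self.freq = Counter()

    def get(self, key):
        if key not in self.store:
            return None            # cache miss
        self.freq[key] += 1
        return self.store[key]

    def put(self, key, value):
        if key not in self.store and len(self.store) >= self.capacity:
            # linear scan for the least frequently used key (for clarity, not speed)
            victim = min(self.store, key=lambda k: self.freq[k])
            del self.store[victim]
            del self.freq[victim]
        self.store[key] = value
        self.freq[key] += 1
```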
Concurrency
To discuss concurrency, I’d like to talk about why concurrency issues arise with caches and how we can address them.
It falls into the classic reader-writer problem. When multiple clients are trying to update the cache at the same time, there can be conflicts. For instance, two clients may compete for the same cache slot and the one who updates the cache last wins.
The common solution, of course, is to use a lock. The downside is obvious: it hurts performance significantly. How can we optimize this?
One approach is to split the cache into multiple shards and have a lock for each of them, so that clients won’t wait for each other if they are updating entries in different shards. However, given that hot entries are more likely to be visited, certain shards will be locked more often than others.
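A minimal sketch of this lock-striping idea, with an illustrative shard count of 16 and a plain dict per shard:

```python
import threading

class ShardedCache:
    """Split the cache into N shards, each guarded by its own lock,
    so clients touching different shards don't block each other."""

    def __init__(self, num_shards=16):
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def _shard(self, key):
        return hash(key) % self.num_shards

    def get(self, key):
        i = self._shard(key)
        with self.locks[i]:
            return self.shards[i].get(key)

    def put(self, key, value):
        i = self._shard(key)
        with self.locks[i]:
            self.shards[i][key] = value
```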
An alternative is to use commit logs. Instead of updating the cache immediately, we store all mutations in a log, and a background process applies the logged mutations asynchronously. This strategy is commonly adopted in database design.
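Here is a rough sketch of that idea, assuming a single background worker thread that drains an in-process queue; the queue-based log and the worker model are illustrative simplifications.

```python
import queue
import threading

class LoggedCache:
    """Writes are appended to a log (a queue here) and applied to the
    cache asynchronously by a background worker, so writers never block
    on the cache itself."""

    def __init__(self):
        self.store = {}
        self.log = queue.Queue()
        threading.Thread(target=self._apply_log, daemon=True).start()

    def put(self, key, value):
        self.log.put((key, value))     # record the mutation and return immediately

    def get(self, key):
        return self.store.get(key)     # may briefly return stale data

    def _apply_log(self):
        while True:
            key, value = self.log.get()
            self.store[key] = value    # apply mutations in order
```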
Distributed cache
When the system gets to a certain scale, we need to distribute the cache across multiple machines.
The general strategy is to keep a hash table that maps each resource to the corresponding machine. When resource A is requested, this hash table tells us that machine M is responsible for caching A, and we direct the request to M. At machine M, things work like the local cache discussed above: M may need to fetch and update the cache for A if it isn’t in memory, and it then returns the result to the original server.
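As a rough illustration of that routing layer, here is a sketch that hashes each key to one of N machines; the machine names and the hash-mod-N scheme are placeholders, and real clients (Memcached’s, for example) often use consistent hashing instead so that adding or removing a machine doesn’t remap most keys.

```python
import hashlib

class CacheRouter:
    """Map each resource key to the machine responsible for caching it."""

    def __init__(self, machines):
        self.machines = machines    # e.g. ["cache-1", "cache-2", "cache-3"]

    def machine_for(self, key):
        # Stable hash of the key, then pick a machine by modulo.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.machines[int(digest, 16) % len(self.machines)]
```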
If you are interested in this topic, you can read more about Memcached.
Q: What is the amount of data that we need to cache?
A: Let's assume we are looking to cache on the scale of Google or Twitter. The total size of the cache would be on the order of tens of terabytes; for the estimates below, let's assume around 30 TB.
Q: What should be the eviction strategy?
A: It is possible that we will get new entries when we don't have space to accommodate them. In such cases, we would need to remove one or more existing entries to make space for the new entry.
Q: What should be the access pattern for the given cache?
A: There are broadly three kinds of caching systems (a rough sketch of all three follows after this list):
Write-through cache: Writes go through the cache, and a write is confirmed as a success only if the writes to the DB and the cache BOTH succeed. This is really useful for applications which write and quickly re-read the same information. However, write latency will be higher in this case as there are writes to two separate systems.
Write-around cache: Writes go directly to the DB, and the cache reads the information from the DB in case of a miss. While this ensures a lower write load to the cache and faster writes, it can lead to higher read latency for applications which write and quickly re-read the same information.
Write-back cache: Writes go directly to the caching layer, and the write is confirmed as soon as the write to the cache completes. The cache then asynchronously syncs the write to the DB. This gives very low write latency and high write throughput. But, as with any non-persistent / in-memory write, we risk losing the data if the caching layer dies. We can improve our odds by having more than one replica acknowledge the write (so that we don't lose data if just one replica dies).
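The following is a minimal sketch of the three write policies above, assuming a hypothetical `db` object with `read` and `write` methods; it is meant to show where the write is acknowledged in each case, not to be a production implementation.

```python
class WriteThroughCache:
    """Write goes to both the DB and the cache; succeed only if both do."""
    def __init__(self, db):
        self.db = db          # hypothetical DB client with read()/write()
        self.cache = {}

    def write(self, key, value):
        self.db.write(key, value)     # if this raises, the write fails
        self.cache[key] = value


class WriteAroundCache:
    """Write goes only to the DB; the cache is populated on a read miss."""
    def __init__(self, db):
        self.db = db
        self.cache = {}

    def write(self, key, value):
        self.db.write(key, value)

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.db.read(key)   # fill on miss
        return self.cache[key]


class WriteBackCache:
    """Write is acknowledged once it hits the cache; the DB is synced later."""
    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.dirty = set()

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)           # data is at risk until flushed

    def flush(self):
        for key in list(self.dirty):
            self.db.write(key, self.cache[key])
            self.dirty.discard(key)
```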
Q: What is the kind of QPS we expect for the system?
A: This estimation is important to understand the number of machines we will need to answer the queries. For example, if our estimates say that a single machine will have to handle 1M QPS, we run a high risk of high latency or of the machine dying because queries are not answered fast enough and end up in a backlog queue.
Again, let's assume the scale of Twitter / Google. We can expect around 10M QPS if not more.
Q: What is the number of machines required to cache?
A: A cache has to be inherently low latency, which means all cache data has to reside in main memory.
A production-level caching machine might have 72G or 144G of RAM. Taking 72G of main memory per machine, the minimum number of machines required = 30 TB / 72G, which is close to 420 machines.
Note that this is the absolute minimum. It's possible we will need more machines because the QPS per machine would be higher than we want it to be. For instance, 10M QPS spread across roughly 420 machines is about 24K QPS per machine; if we want to keep each machine closer to, say, 10K QPS, we would need on the order of 1,000 machines.