
Wrath of Cache Miss : Protect against Cache Stampede (Part – 1)


All of us are pretty familiar with a STAMPEDE in the real world. It is FATAL and leads to losses, including loss of lives.

It is a situation in which a group of living beings suddenly rushes in a single direction, crushing whatever comes in the way, including the very beings that created the situation.

How does it sound to you?

Scary? Dangerous? Isn’t it?

Imagine the same situation occurring within your software systems and bringing your system(s) down due to the mad rush.

Much like detecting Single Points of Failure within a software architecture, it is important to understand and identify the issues that can arise from a miss in the cache. Such a miss can end up choking your storage system.

What is Cache Stampede?

It is the software-system analogue of a real-world stampede, but before defining it, let's first understand some key concepts.

What is Caching? What is a Cache Miss? What is the purpose of having Caching within the system?

Imagine the drinking water stored in the kitchen. Every time you want to drink water, you would have to go to the kitchen.
For a lazy person like me, this is crazy.
I would prefer to store water in a bottle that I can keep on my desk and drink from whenever needed. And when the bottle gets empty, I would refill it from the kitchen store.

So, here, in some way, the bottle is acting as a Cache and the kitchen store as the Database 🙂

And the process of filling the water bottle from the kitchen store is like filling the Cache with data for later use.

So, what is Caching?

A strategy and solution for quickly accessing frequently used data, typically data that does not change very often.

In many software applications, you will have noticed the use of data-structure classes like HASH, MAP or DICTIONARY.

These are all constructs to keep a local copy of data within process memory.
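To make that concrete, here is a minimal sketch in Python; the `db.fetch` call is a hypothetical stand-in for whatever actually reaches your storage system:

```python
# A plain dict acting as an in-process cache.
product_cache = {}

def get_product(product_id, db):
    if product_id not in product_cache:                   # cache miss
        product_cache[product_id] = db.fetch(product_id)  # load from storage
    return product_cache[product_id]                      # served from memory
```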

And why do we need a Cache?

To get quick access to that data. This not only reduces processing time, it also prevents unnecessary load on your storage system(s), such as a Database, and keeps the database from becoming a bottleneck. Caching can thus help improve the overall throughput of your application or system.

What’s the catch?

Remember, I talked about the process of filling the water bottle from the kitchen store?

That's how the water bottle gets water. There is no magic that keeps filling and refilling it.
You have to determine that there is no water in it and that it needs to be filled.

Likewise, a cache will not have data on its own. There has to be some way through which the required data gets into it (there are many population strategies, which I will not cover here). Typically, your application first tries to read the data from the cache, finds it missing, fetches the data from the Database and puts it into the cache.
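As a sketch, that read path (often called the cache-aside pattern) could look like the following; the TTL value and the `load_from_db` callable are illustrative assumptions, not any specific library's API:

```python
import time

cache = {}  # key -> (value, expires_at)

def get(key, load_from_db, ttl_seconds=3600):  # TTL is an assumed value
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                              # cache hit: serve from memory
    value = load_from_db(key)                        # cache miss: go to the database
    cache[key] = (value, time.time() + ttl_seconds)  # populate for later reads
    return value
```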

Now, imagine the other family members, friends or colleagues in your company who would also like to keep a water bottle near them for easy and quick access to quench their thirst.

They will all go to the kitchen store and fill their water bottles.
What’s the big deal there?

Right, if all of you go at different times, it will not be problematic at all. No QUEUEING in the kitchen store.

But what if all of you go to the kitchen store at roughly the same time, or near enough the same time?

You will end up in a queue, and how long you wait depends on the size of the bottle the person in front of you is carrying, on the outflow speed of the kitchen store, and so on.

Likewise, imagine that in your application many users' requests arrive at the same time and all of them find the data missing from the cache. All those requests will rush to your storage system, such as the Database, to get the data at the same time. This creates a situation in which your database gets bombarded by a large number of requests, which can result in latency/slowness or even unavailability of your Database and your application.

This is a Cache Stampede, wherein application requests, resembling bulls 🙂 in the real world, rush towards the database.
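A toy simulation makes the effect visible; the 100 threads and the sleep duration are arbitrary choices for illustration:

```python
import threading
import time

cache = {}
db_calls = []  # list.append is atomic in CPython, so it can count calls safely

def load_from_db(key):
    db_calls.append(key)       # record that this request reached the database
    time.sleep(0.1)            # simulate a slow storage call
    return f"value-for-{key}"

def handle_request(key):
    if key not in cache:                # every thread sees a miss at once...
        cache[key] = load_from_db(key)  # ...and every thread hits the database
    return cache[key]

threads = [threading.Thread(target=handle_request, args=("product-42",))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(db_calls))  # typically close to 100, not 1: that is the stampede
```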

And this requires careful consideration and protection. Otherwise, the purpose of using a cache gets defeated.

Situations : When can this happen?

A naive load-on-miss approach may work fine for a system where there are not many simultaneous, concurrent requests or transactions looking for the same data.

Or where the data to be fetched from storage is pretty small in size, with a low memory footprint and low processing needs.

But, what if those are not applicable to your system?

What if the static data your application needs is quite big, running into a few hundred MBs?

What if your application supports a much higher rate of requests per second than the external system from which it sources the data?

Imagine you have built an e-commerce site like Amazon, Flipkart or Expedia, where your system needs product information and has millions of active users at any time.
Imagine you are preparing for a big event, say Black Friday, Diwali or some product launch, and millions of concurrent active users are just waiting for the sale to open so they can book.

All of them start shopping for the product, requests come to the system, and the system sees that the product information is not in the cache for some reason, such as the TTL (Time To Live) having expired or an eviction having happened, and that it has to be loaded from the data store.

All requests now have to be served from the database. This can result in:
a) The data store getting overwhelmed by that many requests, which may lead to failures or throttling.
RISK : Database performance degradation or unavailability.

b) With some resiliency built in, your system may have the capability to retry with a wait period baked in.
RISK : Latency. Requests take longer to complete, and the user experience degrades.

c) Let's assume that the database somehow scales and all those million requests succeed in loading the data from the database into memory. However, the data size is big, say 1 MB.

With millions of concurrent active users, let's assume we have 10^6 requests being served across all the servers.

This results in a memory requirement of at least:
1 MB * 10^6 = 10^3 * 10^3 MB = 10^3 GB (roughly 1 TB)

This means that even when your cluster is provisioned with 100 servers and load balancing is in place, each server would end up consuming at least 10 GB (10^3 GB / 100) at some point in time just to populate the cache.

And this happened for just 1 product that is just 1 MB big.
Imagine what will happen when the number of products increases, or when the size per product increases.

Where did we build Cache Stampede Protection?

One of the products my team built is Azure Migrate. It provides customers with SKU recommendations and also shows what the total cost would be as per those recommendations.

New SKUs may get added on any day, and existing SKUs may get deprecated on any day.
Likewise, prices for those SKUs may change on any day.

SKU and price information is maintained by other teams (following separation of concerns :)), and our system integrates with their systems.

There were many thousands of SKUs available and many millions of price data points. Imagine the different offers running and the different regions across the world where things differ. All of this contributes to a large data size of hundreds of MBs (~250 MB).

Our product, and thus our system, has SLOs/SLAs like any other system, and it supports assessing many thousands of resources such as VMs, databases and storage disks at any time. Imagine an enterprise running 30-40K machines with 60-90K disks attached to them, plus thousands of other resources, all of which need to be assessed so that recommendations can be generated.

Now imagine tens, hundreds or thousands of such enterprises.

As you would have guessed, the answer is caching:
cache both the SKU data and the price data, and refresh once a day to keep the data current.

Now imagine having a TTL in place and the whole of the SKU and price data expiring every day at some fixed time, with 250 MB of data needing to be reloaded into the cache.

Think 250 MB per request, times (30K + 90K) resources in a request, times 10^3 such requests, with all these resources being assessed concurrently in batches, and all without affecting the system's SLOs/SLAs.

This can easily cause LOW or ZERO memory availability in the whole cluster and bring the system down.

You may ask: OK, I got the problem. How can I prevent it from occurring?
What can I do to make the system resilient, so that it is still able to perform and do what it is supposed to do?

Solution : Remember the common and famous saying: Prevention is better than cure.

Yes, protect against the Cache Stampede. Prevent the stampede from happening.

There are different strategies to protect against a Cache Stampede, and each strategy has its pros and cons.

Strategy 1: Prevent keys from expiring at the same time by adding jitter to the cache keys' TTLs. A very simple strategy: non-blocking, and no stale data. However, hot keys (a single key being read very frequently; think of static data) would still be problematic.
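A minimal sketch of Strategy 1, assuming a one-day base TTL with up to one hour of random jitter (both numbers are made up for illustration):

```python
import random
import time

BASE_TTL = 24 * 3600  # refresh roughly once a day
MAX_JITTER = 3600     # spread expirations over an extra hour (assumed value)

def jittered_expiry():
    # Each key gets a slightly different lifetime, so entries
    # do not all expire (and reload) at the same instant.
    return time.time() + BASE_TTL + random.uniform(0, MAX_JITTER)

cache = {}
cache["sku-data"] = ({"sku": "D2s_v3"}, jittered_expiry())  # value, expiry
```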

Strategy 2: Remember locks? Define the procedure that accesses data from storage as an area / block of mutual exclusion, and prevent simultaneous access.

Sounds simple. However, locks inherently limit throughput, introduce latency, and are best avoided wherever possible.
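Here is one way Strategy 2 might look, using a double-checked lock so that cache hits stay lock-free (again, the TTL and loader are illustrative):

```python
import threading
import time

cache = {}                     # key -> (value, expires_at)
cache_lock = threading.Lock()

def get(key, load_from_db, ttl_seconds=3600):
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                     # fast path: hits take no lock
    with cache_lock:                        # only one thread may load at a time
        entry = cache.get(key)              # re-check: another thread may have
        if entry is not None and entry[1] > time.time():  # loaded it meanwhile
            return entry[0]
        value = load_from_db(key)           # exactly one database call
        cache[key] = (value, time.time() + ttl_seconds)
        return value
```

Note that a single global lock serializes loads across all keys; a per-key lock would reduce that contention at the cost of a little bookkeeping.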

Strategy 3: Refresh the data asynchronously in another thread. Basically, the application keeps track of what data may expire and when, and refreshes it asynchronously in other thread(s). This may result in surfacing stale data, and it requires proper tracking and parallel processing, and thus multi-threading.
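A sketch of Strategy 3, with a daemon thread keeping one key warm; `fetch_prices` and the one-day interval are hypothetical:

```python
import threading
import time

cache = {}

def refresh_loop(key, load_from_db, interval_seconds):
    # A background thread keeps the entry warm, so request
    # threads never have to go to the database themselves.
    while True:
        cache[key] = load_from_db(key)
        time.sleep(interval_seconds)

def get(key):
    # May return slightly stale data (or None before the first load).
    return cache.get(key)

def fetch_prices(key):           # hypothetical loader for the example
    return {"D2s_v3": 0.096}

threading.Thread(target=refresh_loop,
                 args=("sku-prices", fetch_prices, 86400),
                 daemon=True).start()
```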

Strategy 4: A hybrid of Strategies 2 and 3. The data is still refreshed asynchronously, but the refresh is triggered when the data is accessed, i.e. during actual request processing. Multiple requests can detect the need to refresh the data, so there must be a way to ensure the data is refreshed by exactly one request. The other requests should wait until the data gets refreshed; one implementation choice is to keep returning the last fetched data until the refresh completes.
This can be complex, requires a thorough understanding and detection of deadlocks, and can use atomic lock strategies.
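One possible sketch of Strategy 4: the first request to notice an expired entry becomes the refresher, while the other requests keep serving the last fetched value (every name and timing here is illustrative, and a production version would also need error handling and a wait timeout):

```python
import threading
import time

cache = {}                    # key -> (value, expires_at)
refreshing = set()            # keys with a refresh already in flight
flag_lock = threading.Lock()  # guards only the tiny set, never the slow load

def get(key, load_from_db, ttl_seconds=3600):
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                       # fresh hit
    with flag_lock:                           # elect exactly one refresher
        i_am_refresher = key not in refreshing
        if i_am_refresher:
            refreshing.add(key)
    if i_am_refresher:
        try:
            value = load_from_db(key)         # a single database call
            cache[key] = (value, time.time() + ttl_seconds)
        finally:
            with flag_lock:
                refreshing.discard(key)
        return cache[key][0]
    if entry is not None:
        return entry[0]                       # serve the stale copy meanwhile
    while key not in cache:                   # cold start: briefly wait for
        time.sleep(0.01)                      # the refresher to finish
    return cache[key][0]
```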

Based on your application's needs, such as whether it is fine to surface stale data for the time being, or whether you have hot-key patterns, you can choose a strategy.

In the next blog, I can cover the implementation of the above strategies. Please leave a comment mentioning the strategy whose implementation you would like me to cover.


