Category Archives: Common Technical Problems

Wrath of Cache Miss: Protect against Cache Stampede (Part – 1)


All of us are familiar with a STAMPEDE in the real world. It is FATAL and leads to losses, including LOSS of lives.

It is a situation in which a group of living beings suddenly rushes in a single direction, crushing whatever comes in the way, including the very beings that created the situation.

How does it sound to you?

Scary? Dangerous? Isn’t it?

Imagine the same situation occurring within your software systems and bringing your system(s) down due to the mad rush.

Much like detecting single points of failure within a software architecture, it is important to identify the issues that a miss in the cache can cause. Such a miss can end up choking your storage system.

What is Cache Stampede?

It is the software analogue of a real-world stampede. Before defining it, let's first understand some key concepts.

What is Caching? What is a Cache Miss? What is the purpose of having Caching within the system?

Imagine the drinking water stored in the kitchen. Every time you want to drink water, you would have to go to the kitchen.
For a lazy person like me, this is crazy.
I would prefer to store water in a bottle which I can keep on my desk and drink from whenever needed. And when the bottle gets empty, I would refill it from the kitchen store.

So here, in some way, the bottle is acting as a cache and the kitchen store as the database :)

And the process of carrying water from the kitchen store to the water bottle is like filling the cache with data for later use.

So, what is Caching?

A strategy for quickly accessing frequently used data, typically data that does not change very often.

In many software applications, you would have noticed the use of data-structure classes like HASH, MAP or DICTIONARY.

These are all constructs for keeping a local copy of data within process memory.

And why do we need a cache?

To get quick access to that data. This not only reduces processing time, it also keeps unnecessary load off your storage system(s), such as the database, and prevents the database from becoming a bottleneck. Caching can thus improve the overall throughput of your application or system.

What’s the catch?

Remember how I talked about the process of filling the water bottle from the kitchen store?

That is the only way the bottle gets water. There is no magic that keeps filling or refilling it.
You have to notice that it is empty and fill it yourself.

Likewise, a cache will not have data on its own; the required data has to get into it somehow. [There are many strategies, which I will not cover here.] Ultimately, your application first tries to read the data from the cache, finds it missing, fetches it from the database, and puts it into the cache.
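
To make that read path concrete, here is a minimal sketch in C# (the class and method names are mine, purely illustrative). Note that this naive version is exactly what sets up the stampede we are about to discuss: every concurrent miss triggers its own database call.

using System.Collections.Concurrent;

class NaiveCacheAside
{
    // Process-local cache: the "water bottle".
    private readonly ConcurrentDictionary<string, string> _cache =
        new ConcurrentDictionary<string, string>();

    public string Get(string key)
    {
        if (_cache.TryGetValue(key, out var value))
            return value;                 // cache hit: fast path

        value = LoadFromDb(key);          // cache miss: go to the "kitchen store"
        _cache[key] = value;              // fill the cache for later readers
        return value;
    }

    // Hypothetical stand-in for the real database call.
    private static string LoadFromDb(string key) => "value-for-" + key;
}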

Now imagine other family members, friends or colleagues in your company who also want to keep a water bottle near them for easy, quick access to quench their thirst.

They will all go to Kitchen store and fill their water bottles.
What’s the big deal there?

Right, if all of you go at different times, it will not be a problem at all. No QUEUEING at the kitchen store.

But, what if all of you go to Kitchen store at roughly the same time or near the same time?

You will end up in a queue, and how long you wait depends on the size of the bottle the person in front of you is carrying, the outflow speed of the kitchen store, and so on.

Likewise, imagine that in your application many user requests arrive at the same time, and all of them find the data missing from the cache. All those requests rush to your storage system, such as the database, to get the data at the same time. The database gets bombarded by a large number of requests, which can result in latency/slowness or even unavailability of both the database and your application.

This is a Cache Stampede, wherein application requests, resembling bulls :) in the real world, rush towards the database.

And this requires careful consideration and protection. Otherwise, the purpose of using a cache gets defeated.

Situations: When can this happen?

The naive approach may work for a system where there are not many simultaneous, concurrent requests or transactions looking for the same data.

Or where the data fetched from storage is pretty small, with a low memory footprint and light processing needs.

But, what if those are not applicable to your system?

What if the static data your application needs is quite big, running into a few hundred MBs?

What if your application supports far higher requests per second than the external system from which it sources the data?

Imagine you have built an e-commerce site like Amazon, Flipkart or Expedia, where your system needs product information and has millions of active users at any time.
Imagine you are preparing for a big event, say Black Friday, Diwali or a product launch, and millions of concurrent active users are just waiting for the sale to open so they can book.

All of them start shopping for the product. Requests come to the system, and the system finds that the product information is not in the cache, for reasons such as TTL (Time To Live) expiry or eviction, and it has to be loaded from the data store.

All requests now have to be served from the database. This can result in:
a) The data store getting overwhelmed by that many requests, leading to failures or throttling.
RISK: Database performance degradation or unavailability

b) With some resiliency built in, your system may have the capability to retry, with a wait period baked in.
RISK: Latency. Requests take longer to complete, and the user experience degrades

c) Let's assume that the database is somehow scaled and all those million requests succeed in loading the data from the database into memory. However, the data is big, say 1 MB.

With millions of concurrent active users, let's assume 10^6 requests are being served across all servers.

This results in a memory requirement of at least:

1 MB × 10^6 requests = 10^6 MB = 10^3 GB (roughly 1 TB)

This means that even when your cluster is provisioned with 100 servers with load balancing in place, each server would end up consuming at least 10 GB (10^3 GB / 100) at some point in time, just to populate the cache.

And this happened for just 1 product that is just 1 MB big.
Imagine what will happen when the number of products increases, or the size per product increases.

Where did we build Cache Stampede protection?

One of the products my team built is Azure Migrate. It provides customers with SKU recommendations and shows what the total cost would be as per those recommendations.

New SKUs may get added on any day, and existing SKUs may get deprecated on any day.
Likewise, prices for those SKUs may change on any day.

SKU and price information is maintained by other teams (separation of concerns :)), and our system integrates with their systems.

There were many thousands of SKUs and many millions of price data points. Add in the different offers running and the different regions across the world where things differ, and all of this contributes to a large data size, hundreds of MBs (~250 MB).

Our product, and thus our system, has SLOs/SLAs like any other, and it supports assessing many thousands of resources such as VMs, databases and storage disks at any time. Imagine an enterprise running 30-40K machines with 60-90K disks attached to them, plus thousands of other resources, all of which need to be assessed so that recommendations can be generated.

Now imagine tens, hundreds or thousands of such enterprises.

As you would have guessed, the answer is caching:
cache both the SKU data and the price data, and refresh once a day to keep the data current.

Now imagine a TTL in place, with the whole of the SKU and price data expiring at some point every day, so that 250 MB of data needs to be reloaded into the cache.

250 MB per request, (30K + 90K) resources in the request, 10^3 concurrent requests, with all these resources being assessed concurrently in batches, without affecting the system's SLOs/SLAs.

This can easily cause LOW or ZERO memory availability across the whole cluster and bring the system down.

You may ask: I get the problem, but how can I prevent it from occurring?
What can I do to make the system resilient, so it can still perform and do what it is supposed to do?

Solution: Remember the famous saying: prevention is better than cure.

Yes, protect against the Cache Stampede. Prevent the stampede from happening.

There are different strategies to protect against a Cache Stampede, each with its own pros and cons.

Strategy 1: Prevent keys from expiring at the same time by adding jitter to the cache keys' TTLs. A very simple strategy: non-blocking, and no stale data. However, hot keys (a single key used many times, think of static data) would still be problematic. See the sketch below.
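
A minimal sketch of the jitter idea, assuming a TTL-based cache; the helper name and the ±10% range are my own illustrative choices:

using System;

static class JitteredTtl
{
    // Note: Random is not thread-safe; use a per-thread instance in production.
    private static readonly Random Rng = new Random();

    // Returns the base TTL skewed by up to +/-10%, so keys cached at the
    // same moment expire at slightly different moments.
    public static TimeSpan WithJitter(TimeSpan baseTtl)
    {
        double factor = 1 + (Rng.NextDouble() - 0.5) * 0.2; // 0.9 .. 1.1
        return TimeSpan.FromTicks((long)(baseTtl.Ticks * factor));
    }
}

// Usage with whatever TTL-based cache API you have (illustrative):
// cache.Set(key, value, JitteredTtl.WithJitter(TimeSpan.FromHours(24)));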

Strategy 2: Remember locks? Define the procedure that accesses data from storage as an area/block of mutual exclusion and prevent simultaneous access.

Sounds simple. However, locks inherently limit throughput and introduce latency, and are best avoided wherever possible. A sketch follows.
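
Here is a minimal sketch of the locking approach, using SemaphoreSlim so it also works with async code; all names are illustrative, not from any particular library:

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class LockingCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, TValue> _cache =
        new ConcurrentDictionary<TKey, TValue>();
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    public async Task<TValue> GetAsync(TKey key, Func<TKey, Task<TValue>> loadFromDb)
    {
        if (_cache.TryGetValue(key, out var value))
            return value;                            // cache hit: fast path

        await _gate.WaitAsync();                     // only one loader at a time
        try
        {
            if (_cache.TryGetValue(key, out value))  // double-check after waiting
                return value;
            value = await loadFromDb(key);           // exactly one DB round trip
            _cache[key] = value;
            return value;
        }
        finally
        {
            _gate.Release();
        }
    }
}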

Strategy 3: Refresh the data asynchronously on another thread. Basically, the application keeps track of what data is cached and when it may expire, and refreshes it asynchronously on other thread(s). This may surface stale data, and it requires proper tracking and parallel processing, and thus multi-threading. See the sketch below.
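
A minimal sketch of the background-refresh approach, assuming a single payload refreshed on a timer; LoadFromDb and the interval are placeholders:

using System;
using System.Threading;

class BackgroundRefreshCache : IDisposable
{
    private volatile string _data;  // last loaded payload; readers never block
    private readonly Timer _timer;

    public BackgroundRefreshCache(TimeSpan refreshInterval)
    {
        _data = LoadFromDb();       // initial synchronous fill
        // Refresh on a timer thread, well before the data would go stale.
        _timer = new Timer(_ => _data = LoadFromDb(),
                           null, refreshInterval, refreshInterval);
    }

    public string Get() => _data;   // may return slightly stale data

    private static string LoadFromDb() => "payload"; // hypothetical DB call

    public void Dispose() => _timer.Dispose();
}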

Strategy 4: Another option is a hybrid of Strategies 2 and 3. The data is indeed refreshed asynchronously, but the refresh is triggered when the data is accessed, and thus during actual request processing. Multiple requests can detect the need to refresh at the same time, so there must be a way to ensure the data is refreshed through exactly one request. The other requests either wait for the refresh or, in a common implementation, keep returning the last fetched data until it completes.
This can be complex, requires thorough understanding and detection of deadlocks, and can use atomic lock strategies. A sketch follows.
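
A minimal sketch of the hybrid approach: the staleness check happens on the request path, Interlocked.CompareExchange ensures only one request triggers the reload, and everyone keeps getting the last fetched value meanwhile. Names and the TTL are illustrative:

using System;
using System.Threading;
using System.Threading.Tasks;

class RefreshOnAccessCache
{
    private string _data = LoadFromDb();  // last successfully loaded value
    private DateTime _expiresAtUtc = DateTime.UtcNow.AddHours(24);
    private int _refreshing;              // 0 = idle, 1 = a refresh is in flight

    public string Get()
    {
        if (DateTime.UtcNow >= _expiresAtUtc &&
            Interlocked.CompareExchange(ref _refreshing, 1, 0) == 0)
        {
            // Exactly one request wins the flag; the rest skip this branch.
            Task.Run(() =>
            {
                try
                {
                    _data = LoadFromDb();
                    _expiresAtUtc = DateTime.UtcNow.AddHours(24);
                }
                finally
                {
                    Volatile.Write(ref _refreshing, 0); // allow future refreshes
                }
            });
        }
        return _data; // serve the last fetched data until the refresh lands
    }

    private static string LoadFromDb() => "payload"; // hypothetical DB call
}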

Based on your application's needs, such as whether it is fine to surface stale data for the time being, whether you have a hot-keys pattern, etc., you can choose a strategy.

In the next blog, I can cover the implementation of the above strategies in depth. Please leave a comment with the strategy whose implementation you would like me to cover.


Message Queuing Frameworks – Comparison Grid


We often come across requirements that call for integrating a messaging framework into a software system.
There are many messaging frameworks available in the market: some are open source, some are commercially licensed, some provide great vendor support, and some have good community support.

In order to make an apt choice, we explore different messaging frameworks against our requirements.

This post compares a few popular messaging frameworks and aims to equip you with enough information to choose the best framework for your requirements.

COMPARISON GRID

| Feature | RabbitMQ | Apache Kafka | AWS SQS |
| --- | --- | --- | --- |
| HA | ☑ Requires some extra work and may require 3rd-party plugins like Shovel and Federation | ☑ Out of the box (OOB) | ☑ OOB |
| Scalable | | | |
| Guaranteed Delivery | ☑ Supports consumer acknowledgments | ☑ Supports consumer acknowledgments | ☑ Supports consumer acknowledgments |
| Durable | ☑ Through disk nodes and queues, with extra configuration | ☑ OOB | ☑ Message retention up to 14 days max, default 4 days |
| Exactly-Once Delivery | ☑ Annotates a message as redelivered when it was delivered earlier but the consumer ack failed; requires idempotent consumer behavior | ☑ Depends on consumer behavior: the consumer is responsible for tracking offsets (messages read so far) and storing them. Kafka now supports storing offsets within Kafka itself, OOB through high-level consumers; still requires idempotent consumer behavior | ☑ MessageDeduplicationId and MessageGroupId attributes are used; requires idempotent consumer behavior. FIFO queues support exactly-once, while standard queues support at-least-once |
| Ease of Deployment | ☑ For a distributed topology, requires more effort and 3rd-party plugins | ☑ Requires ZooKeeper | ☑ Managed by AWS |
| Authentication Support | ☑ OOB | ☑ OOB | ☑ OOB |
| Authorization (ACL) Support | ☑ OOB | ☑ OOB | ☑ OOB |
| TLS Support | ☑ OOB | ☑ OOB | ☑ OOB |
| Non-Blocking Producers | ☑ Supports both synchronous and async | | |
| Performant | ★★ Medium to high | ★★★★ Very high | ★★★★ Very high (FIFO: 300 TPS; standard queues: unlimited) |
| Open Source | ☑ | ☑ | |
| Load Balancing Across Consumers | | ☑ Through consumer groups | ☑ Multiple consumers can read from the same queue in an atomic way |
| Delay Queues | Not OOB | Not OOB | ☑ OOB |
| Visibility Timeout Queues | Not OOB | Not OOB | ☑ OOB |
| Message Dedup | | | ☑ |
| Message Size Limits | | | Up to 256 KB; the AWS SDK supports storing larger messages in S3, though |
| No. of Messages in a Queue | ☑ No limits | ☑ No limits | ☑ No limits, but standard queues allow 120,000 in-flight messages and FIFO queues 20,000. Messages are in-flight after they have been received from the queue by a consuming component but have not yet been deleted from the queue |
| Message Content Limits | ☑ No limits | ☑ No limits | A message can include only XML, JSON and unformatted text. Allowed Unicode characters: #x9, #xA, #xD, #x20 to #xD7FF, #xE000 to #xFFFD, #x10000 to #x10FFFF; any characters not in this list are rejected |
| Disaster Recovery | Not OOB | Not OOB but simple: replicas can be deployed across regions | Not OOB and not simple: requires different strategies to achieve |


Playaround with an address of object reference variable


Well, you will rarely have to find the address of a variable in C#.NET. Though C#.NET allows the use of pointers, hardly anyone uses them. The .NET base library and the compiler have done a beautiful job of abstracting away the complexity of pointers, exposing the ref and out keywords instead.

But what if you still want to play around?
[I must warn you: don't dare to play with pointers casually :) Improper handling of pointers can bring your application down.]

Thankfully, there are constructs that help us achieve what we want.

When I talk about the stack below, I mean the stack as a memory concept. It has nothing to do with the Stack class provided in C#.NET.

// Note: requires compiling with /unsafe. Addresses are cast to ulong
// so they are not truncated in 64-bit processes.
static void GetAddress()
{
    unsafe
    {
        int i = 5;
        object refer = new object();

        // &i gets the address of variable 'i'; the * operator reads the value at it.
        Console.WriteLine("Address of i: {0}, value: {1}", (ulong)&i, *(&i));

        // 'refer' was declared right after 'i', so its slot is assumed to sit
        // one int below i's address. The runtime does NOT guarantee this
        // layout; this is exactly the risky play described above.
        Console.WriteLine("Address of refer: {0}, value: {1}", (ulong)(&i - 1), *(&i - 1));
    }
}

Explanation
Here, I have defined two variables, named i and refer. Both variables sit on the STACK; however, the values stored in them are treated differently.
i stores its value directly [5 in our case, as per the statement], and
refer stores the address of an actual object allocated on the managed heap.
Picture the stack and heap state after these two initializing statements: i and refer on the stack, and the object that refer points to on the heap.

So, since I have access to the address of i, and knowing that refer sits on the stack just after i, decrementing i's address by 1 [pointer arithmetic] leaves me pointing at refer, which is the reference to the actual object (object()) sitting on the heap.

You now have a stack address and thus the value stored at that location. Using pointer arithmetic, do what you want, WITH CARE.
MS should tag POINTERS with “HANDLE WITH UTMOST CARE” :)

How to reference or use the same fully qualified class name from different assemblies


Two or more different assemblies can contain a type with the same name in the same namespace.
What will happen if you reference those assemblies in your project?
Well, referencing alone won't result in any error. :)
But.
How about creating an object of that class?
This will fail. A compilation error is generated because of the ambiguity: which type from which assembly should be used?
The compiler can't make this decision on its own and, I would say, it should not.
It is up to the programmer to make the intention clear to the compiler.
But how?

Solution

    Let's say we have two assemblies:

  • Assembly1
  • Assembly2

Both assemblies define a type, say a class, with the fully qualified name SomeNameSpace.SomeClassName.

We have a project where we reference the above two assemblies.
In order to instantiate an object of the SomeNameSpace.SomeClassName defined in Assembly1, do the following:

  1. Change the Aliases property of one of the references.
  2. Whenever a reference is added, its Aliases property is set to global. Change this to your desired name, say Assembly1.
    Though this can be done for both references, doing it for one automatically resolves the ambiguity for the other.

  3. Add a code statement at the top of your file: extern alias Assembly1;
  4. Lastly, change the object instantiation statement to
    Assembly1::SomeNameSpace.SomeClassName obj = new Assembly1::SomeNameSpace.SomeClassName();
  5. “::” is the namespace alias qualifier in .NET.

That’s it.
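
Putting the steps together, a minimal sketch; Assembly1 is whatever name you typed into the Aliases property, and the type matches the SomeNameSpace.SomeClassName example above:

extern alias Assembly1;   // must appear before any using directives

using System;

class Program
{
    static void Main()
    {
        // "::" roots the type lookup in the alias, so the compiler knows
        // exactly which assembly's SomeNameSpace.SomeClassName we mean.
        Assembly1::SomeNameSpace.SomeClassName obj =
            new Assembly1::SomeNameSpace.SomeClassName();
        Console.WriteLine(obj.GetType().AssemblyQualifiedName);
    }
}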