How to handle cache stampede when using Hazelcast?

My Spring service uses Hazelcast (embedded cluster) as a local cache.
When a cache entry expires due to its TTL and a large number of user requests arrive at once, all of those requests go to the DB to update the cache (cache stampede).
For this reason, I thought about applying the PER algorithm, but from my understanding, the algorithm seems to be meaningful only when the cache expiration time can differ for each node.
However, it seems that it cannot be applied to a Hazelcast cluster cache, because the cache is synchronized across the whole cluster. Is that correct?
So, what strategy should I use with Hazelcast to handle cache stampede?

Hazelcast provides the following method:
put(K, V, long, java.util.concurrent.TimeUnit)
which allows you to specify a TTL at the per-entry level. You can, for example, randomize the TTL within some interval so that entries do not all expire at the same moment, which helps prevent the cache stampede.
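A minimal sketch of the randomized TTL idea, using only the JDK. The class name JitteredTtl, its method, and the 10% spread are assumptions made for illustration and are not part of Hazelcast.

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not a Hazelcast class): spreads entry TTLs around a base value. */
final class JitteredTtl {
    private JitteredTtl() {}

    /** Returns a TTL drawn uniformly from [base - base*spread, base + base*spread]. */
    static long ttlSeconds(long baseSeconds, double spread, Random random) {
        if (baseSeconds <= 0 || spread < 0 || spread >= 1) {
            throw new IllegalArgumentException("need baseSeconds > 0 and 0 <= spread < 1");
        }
        long delta = Math.round(baseSeconds * spread);
        long min = baseSeconds - delta;
        long max = baseSeconds + delta;
        // nextDouble() is in [0, 1), so the result always stays within [min, max].
        return min + (long) (random.nextDouble() * (max - min + 1));
    }

    /** Convenience overload that uses the calling thread's random generator. */
    static long ttlSeconds(long baseSeconds, double spread) {
        return ttlSeconds(baseSeconds, spread, ThreadLocalRandom.current());
    }
}
```

With a Hazelcast map named map (an assumed variable), the call would look like map.put(key, value, JitteredTtl.ttlSeconds(600, 0.1), TimeUnit.SECONDS). Note that jitter mainly stops many different keys from expiring together; a single very hot key can still be hit by many requests when it expires.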

Related

How to calculate TTL for various types of cache?

Is there a standard way to calculate TTL for different types of cache? This is more of a generic question, so let's assume we're designing a system from scratch and we have the following requirements/specs:
Static resources served by CDNs are rarely updated, e.g. privacy policy, about, images and maps.
The application cache is used to serve (a) sessions and (b) recently used reads, regardless of type.
Client-side cache (previously requested files), as well as, say, images or posts a client can see (something similar to Instagram/Twitter in this case).
Calculate TTL for the following types based on the little to no information provided above:
Client cache
CDN
Webserver cache (used for media)
Application cache (sessions and recent reads of some data)
TTLs are mostly defined using historical data, use cases, and experience. There are no predefined rules or theories that tell you what the cache expiry should be. A cache TTL should reflect your tolerance for staleness: if you set the TTL too high, you might serve stale data, so ask what the impact of stale data is in your application. In some cases stale data is not acceptable at all, but in other cases it's OK to use stale data for SOME TIME.
Still, you'll observe that each caching system has some predefined TTL; for example, AWS's CDN has a 24-hour expiry and Google's CDN has 1 hour. ETag is another mechanism that's used in CDNs.
A CDN can cache data for a week, but some data can change hourly as well, so in that case the expiry is set to a lower value. Similar considerations apply to other use cases.
A session should be cached for a week or so, but some applications cache sessions for a longer period. Of course, there are pros and cons to using a low or high TTL.
An application data cache has similar characteristics to CDN data: the data can change at any time, and the change must be reflected in the cache. Again, the TTL depends on the use case. In my experience you can cache some data for one day or one week, but some data cannot be cached for more than 15 minutes, since it might get updated within 15 minutes.
Depending on the nature of the data you can always find some optimal TTL, but finding it takes time, as you have to monitor the cache hit/miss ratio and the stale data ratio.
References
https://martinfowler.com/bliki/TwoHardThings.html
https://www.stevesouders.com/blog/2012/03/22/cache-them-if-you-can/

Hazelcast cache key-values availability on all nodes/members

I am using Hazelcast to cache the data fetched from an API.
The structure of the API is something like this:
Controller->Service->DAOLayer->DB
I am putting @Cacheable at the service layer, where the getData(int elementID) method is present.
In my architecture there are two PaaS nodes (np3a, np4a). The API will be deployed on both of them, and users will access them via a load balancer IP, which will redirect them to either of the nodes.
So it is possible that for one hit from user X the request goes to np3a, and for another hit from the same user the request goes to np4a.
I want the response cached on np3a during the very first hit to also be available for the next hit on np4a.
I have read about
ReplicatedMaps : Memory inefficient
NearCache : when read>write
I am not sure which approach to take, or whether you would suggest something entirely different.
If you have 2 nodes, Hazelcast will partition data so that half of it will be on node 1, and the other half on node 2. It means there's a 50% chance a user will ask the node containing the data.
If you want to avoid in all cases an additional network request to fetch data that is not present on a node, the only way is to copy data each time to every node. That's the goal of ReplicatedMap. That's the trade-off: performance vs. memory consumption.
NearCache adds an additional cache on the "client-side", if you're using client-server architecture (as opposed to embedded).

TTL-Expiration on a cluster node does not update my clients NearCache

I have a cache cluster with multiple nodes containing a cache map config which is only valid for 10 minutes (TTL = 600s). Additionally I have some client nodes with near caches configured for that cache.
While debugging I see the following behaviour:
If I explicitly evict an entry in that cache on the cluster node, the corresponding near cache entry is evicted as well. (internally a DeleteOperation is performed).
If the entry is timed-out, the entry in the cluster node is removed but the entry in the near cache is still valid. So the client receives an outdated entry.
When I explicitly set a TTL for the near cache as well the cache is evicted correctly.
My expectation is that a TTL-Expiration is also propagated through the cluster and to all the near caches. Am I doing something wrong or is this behaviour by design?
Meanwhile, we accept that behaviour as a feature and see near caches as a separate cache layer.
From this perspective it makes sense to design it like this. The cluster has some rules for TTL or IdleTime, but the client can have different requirements for the freshness of items.

Dealing with concurrency issues when caching for high-traffic sites

I was asked this question in an interview:
For a high traffic website, there is a method (say getItems()) that gets called frequently. To prevent going to the DB each time, the result is cached. However, thousands of users may be trying to access the cache at the same time, and so locking the resource would not be a good idea, because if the cache has expired, the call is made to the DB, and all the users would have to wait for the DB to respond. What would be a good strategy to deal with this situation so that users don't have to wait?
I figure this is a pretty common scenario for most high-traffic sites these days, but I don't have the experience dealing with these problems--I have experience working with millions of records, but not millions of users.
How can I go about learning the basics used by high-traffic sites so that I can be more confident in future interviews? Normally I would start a side project to learn some new technology, but it's not possible to build out a high-traffic site on the side :)
The problem you were asked about in the interview is the so-called cache miss storm: a scenario in which many users trigger regeneration of the cache at once, and in doing so all hit the DB.
To prevent this, first set both a soft and a hard expiration time. Let's say the hard expiration is 1 day and the soft one is 1 hour. The hard one is what is actually set in the cache server; the soft one is stored in the cache value itself (or in another key in the cache server). The application reads from the cache, sees that the soft time has expired, sets the soft time 1 hour ahead, and hits the database. This way the next request sees the already updated time and won't trigger another cache update. It may read stale data, but the data itself is already being regenerated.
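The soft-expiration step can be sketched roughly as follows. This is an in-process illustration under stated assumptions, not a definitive implementation: a ConcurrentHashMap stands in for the cache server, the hard expiration is not modelled (it would be the TTL set on the cache server itself), and the class name SoftExpiryCache, the loader, and the clock parameter are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch: the soft deadline is stored next to the cached value. */
final class SoftExpiryCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long softExpiresAtMillis;

        Entry(T value, long softExpiresAtMillis) {
            this.value = value;
            this.softExpiresAtMillis = softExpiresAtMillis;
        }
    }

    private final ConcurrentMap<K, Entry<V>> store = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // stands in for the database call
    private final long softTtlMillis;
    private final LongSupplier clock;      // injectable so the behaviour can be tested

    SoftExpiryCache(Function<K, V> loader, long softTtlMillis, LongSupplier clock) {
        this.loader = loader;
        this.softTtlMillis = softTtlMillis;
        this.clock = clock;
    }

    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> current = store.get(key);
        if (current == null) {
            // Cold miss: computeIfAbsent runs the loader at most once per absent key.
            return store.computeIfAbsent(
                    key, k -> new Entry<>(loader.apply(k), now + softTtlMillis)).value;
        }
        if (now < current.softExpiresAtMillis) {
            return current.value; // still fresh
        }
        // Soft-expired: push the soft deadline forward. The atomic replace succeeds
        // for exactly one caller, and only that caller goes to the database.
        Entry<V> extended = new Entry<>(current.value, now + softTtlMillis);
        if (store.replace(key, current, extended)) {
            V fresh = loader.apply(key);
            store.put(key, new Entry<>(fresh, clock.getAsLong() + softTtlMillis));
            return fresh;
        }
        // Another caller is already refreshing: serve the stale value meanwhile.
        return current.value;
    }
}
```

In this sketch the caller that wins the replace waits for the database, while everyone else keeps reading the old value until the fresh one is stored.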
The next point: you should have a procedure for cache warm-up, e.g. instead of a user triggering the cache update, a process in your application pre-populates the new data.
The worst-case scenario is e.g. restarting the cache server, when you don't have any data. In this case you should fill the cache as fast as possible, and that's where a warm-up procedure may play a vital role. Even if you don't have a value in the cache, it would be a good strategy to "lock" the cache (mark it as being updated), allow only one query to the database, and handle the other requests in the application by requesting the resource again after a given timeout.
You would probably be better off using a distributed cache repository, such as memcached, or others depending on your access pattern.
You could use the Cache implementation of Google's Guava library if you want to store the values inside the application.
From the coding point of view, you would need something like
    public V get(K key) {
        // Fast path: no locking when the value is already cached.
        V value = map.get(key);
        if (value == null) {
            synchronized (mutex) {
                // Re-check inside the lock: another thread may have loaded it meanwhile.
                value = map.get(key);
                if (value == null) {
                    value = db.fetch(key);
                    map.put(key, value);
                }
            }
        }
        return value;
    }
where the map is a ConcurrentMap and the mutex is just
    private static final Object mutex = new Object();
In this way, you will have just one request to the db per missing key.
Hope it helps! (And don't store nulls; you could create a tombstone value instead!)
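One limitation of a single shared mutex is that loads for different keys also wait on each other. A possible variation, sketched here under the assumption that the cache lives in a ConcurrentHashMap inside the application, is to let computeIfAbsent do the per-key locking. The class name PerKeyLoadingCache and the loader function are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical sketch: at most one load per missing key, without a global lock. */
final class PerKeyLoadingCache<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stands in for db.fetch(key)

    PerKeyLoadingCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // ConcurrentHashMap.computeIfAbsent is atomic: while a key is absent the
        // loader runs at most once, and concurrent callers for that key wait for
        // its result. If the loader returns null, nothing is stored, which is why
        // a tombstone value is needed to cache "not found".
        return map.computeIfAbsent(key, loader);
    }
}
```

The loader should be reasonably quick and must not touch the same map, since other updates that hash to the same bin may be blocked while it runs.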
Cache miss storm, or cache stampede effect, is the burst of requests to the backend when the cache invalidates.
All the highly concurrent websites I've dealt with used some kind of caching front end. Be it Varnish or Nginx, they all have microcaching and stampede-effect suppression.
Just google for Nginx micro-caching or Varnish stampede effect, and you'll find plenty of real-world examples and solutions for this sort of problem.
It all boils down to whether or not you allow requests to pass through the cache to reach the backend when an entry is in the Updating or Expired state.
Usually it's possible to actively refresh the cache, holding all requests to the updating entry, and then serve them from the cache.
But there is ALWAYS the question "What kind of data should you be caching or not?", because, you see, if it is just a plain text article that gets an edit/update, delaying the cache update is not as problematic as when your data must be shown exactly on thousands of displays (real-time gaming, financial services, and so on).
So, the correct answer is: microcache, suppression of the stampede effect/cache miss storm, and of course, knowing which data to cache, when, how and why.
It is worth considering a particular data type for caching only if the data consumers are ready to receive stale data (within reasonable bounds).
In that case you can define an invalidation/eviction/update policy to keep your data up to date (in the business sense).
On update you just replace the data item in the cache, and all new requests will be answered with the new data.
Example: a stock info system. If you do not need real-time price info, it is reasonable to keep the stock in the cache and update it every X milliseconds/seconds with an expensive remote call.
Do you really need to expire the cache? Can you have an incremental update mechanism that refreshes the data periodically, so that you never have to expire it?
Secondly, if you want to prevent too many users from hitting the db in one go, you can have a locking mechanism in your stored proc (if your db supports it) that prevents too many people from hitting the db at the same time. Also, you can have a caching mechanism in your db so that if someone asks for the exact same data again, you can always return a cached value.
Some applications also use a third service layer between the application and the database to protect the database from this scenario. The service layer ensures that the cache miss storm does not reach the db.
The answer is to never expire the cache and have a background process update the cache periodically. This avoids the wait and the cache miss storms, but then why use a cache in this scenario?
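The never-expire, refresh-in-the-background approach could look roughly like the sketch below, assuming a single cached value held inside the application. The class name RefreshingValue and the loader are hypothetical, and the refresh period is whatever staleness the business can tolerate.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical sketch: readers never miss; a background task replaces the value. */
final class RefreshingValue<T> implements AutoCloseable {
    private final AtomicReference<T> current;
    private final Supplier<T> loader; // stands in for the expensive DB or remote call
    private final ScheduledExecutorService scheduler;

    RefreshingValue(Supplier<T> loader, long period, TimeUnit unit) {
        this.loader = loader;
        // Warm-up: load once up front so the first reader already finds a value.
        this.current = new AtomicReference<>(loader.get());
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "cache-refresh");
            t.setDaemon(true); // do not keep the JVM alive just for refreshing
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::refresh, period, period, unit);
    }

    /** Reloads the value; on failure the previous value keeps being served. */
    void refresh() {
        try {
            current.set(loader.get());
        } catch (RuntimeException e) {
            // Swallowed on purpose: an exception escaping a scheduled task
            // would cancel all further refreshes.
        }
    }

    /** Never blocks on the loader and never triggers a load. */
    T get() {
        return current.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Because readers only ever read the reference, the load on the database is fixed by the refresh period rather than by the number of users.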
If your app will crash in a "cache miss" scenario, then you need to rethink your app and what is cache versus what is needed in-memory data. For me, I would use an in-memory database that gets updated when data changes or periodically, not a cache at all, and avoid the aforementioned scenario.

When to Use Azure Caching Local Cache

I want to start using the Azure Distributed Caching and came across the concept of LocalCache. But the fact that it can go out of sync with the Distributed Cache, makes me wonder, why I would want to use it and how I could use it safely.
When enabled, items retrieved from the cache cluster are locally stored in memory on the client machine. This improves performance of subsequent get requests, but it can result in inconsistency of data between the locally cached version and the actual item in the cache cluster.
Calling DataCache.GetIfNewer is one option to ensure that I get the latest version, but that requires that I still do a call to the Distributed Cache, passing in the object that I want to check, in order to see if the two versions differ.
I could use Notifications to invalidate the LocalCache object, but that is done on a polling basis, which opens up the opportunity for an update to occur within the poll period leaving me with stale data.
So, why would I ever use LocalCache, and if there is a reason to do so, how do I use it safely?
"There are only two hard things in Computer Science: cache invalidation and naming things" - Phil Karlton
You would use LocalCache when a) performance is critical b) you don't care that the retrieved object might be stale.
There are many cases where the object is never going to be out of date (e.g. list of public/bank holidays), or when you are not too worried about being 100% up-to-date (e.g. if item has > 1000 units in stock, use local cache, otherwise re-fetch from database).
Don't try and invalidate the local cache. If you need more up-to-date objects, get them from the cluster. If you cannot tolerate out-of-sync data, get it from the database. Caching is always a performance-inconsistency compromise — LocalCache more than the server cache, but the server cache is still a compromise.
