I have a cache cluster with multiple nodes and a cache map configured to be valid for only 10 minutes (TTL = 600s). Additionally, I have some client nodes with near caches configured for that cache.
While debugging I see the following behaviour:
If I explicitly evict an entry from that cache on a cluster node, the corresponding near cache entry is evicted as well (internally, a DeleteOperation is performed).
If the entry times out, it is removed on the cluster node, but the entry in the near cache remains valid, so the client receives an outdated entry.
When I explicitly set a TTL for the near cache as well, the entry is evicted correctly.
My expectation was that a TTL expiration would also be propagated through the cluster and to all near caches. Am I doing something wrong, or is this behaviour by design?
In the meantime we have come to accept this behaviour as a feature and see near caches as a separate cache layer.
From this perspective the design makes sense: the cluster has its own rules for TTL or idle time, while each client can have different freshness requirements for its items.
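Following that separate-layer view, the near cache can simply be given its own, shorter TTL on the client. A minimal sketch, assuming the Hazelcast 4.x client API and a hypothetical map name:

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.client.config.ClientConfig;
    import com.hazelcast.config.NearCacheConfig;
    import com.hazelcast.core.HazelcastInstance;

    public class NearCacheTtlExample {
        public static void main(String[] args) {
            // give the near cache its own TTL so locally cached entries
            // expire even though cluster-side expiry is not propagated
            NearCacheConfig nearCacheConfig = new NearCacheConfig("myCache")
                    .setTimeToLiveSeconds(60); // shorter than the cluster's 600s

            ClientConfig clientConfig = new ClientConfig();
            clientConfig.addNearCacheConfig(nearCacheConfig);

            HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
            client.getMap("myCache"); // reads now go through the near cache
        }
    }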
I'm debugging an issue in an application and I'm running into a scenario where I'm out of ideas, but I suspect a race condition might be in play.
Essentially, I have two API routes - let's call them A and B. Route A generates some data and Route B is used to poll for that data.
Route A first creates an entry in the Redis cache under a given key, then starts a background process to generate some data. The route immediately returns a polling ID to the caller while the background thread continues to run. When the data is fully generated, we write it to the cache using the same cache key: essentially, an overwrite.
Route B is a polling route. We simply query the cache using that same cache key; we expect one of three scenarios (a sketch of the pattern follows the list):
1. The object is in the cache but contains no data: the data is still being generated by the background thread and isn't ready yet.
2. The object is in the cache and contains data: the process has finished and we can return the result.
3. The object is not in the cache: we assume the caller is polling for an ID that never existed in the first place.
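Concretely, the pattern described above might look like the following sketch (using the Jedis client; the key prefix, route names, and generateData are hypothetical):

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPool;
    import java.util.UUID;

    public class PollingRoutes {
        private final JedisPool pool = new JedisPool("localhost", 6379);

        // Route A: write an empty placeholder, then overwrite it once the
        // background generation finishes.
        public String routeA() {
            String pollId = UUID.randomUUID().toString();
            try (Jedis jedis = pool.getResource()) {
                jedis.set("poll:" + pollId, "");       // placeholder, no data yet
            }
            new Thread(() -> {
                String data = generateData();          // stand-in for the slow work
                try (Jedis jedis = pool.getResource()) {
                    jedis.set("poll:" + pollId, data); // overwrite with the result
                }
            }).start();
            return pollId;                             // caller polls with this ID
        }

        // Route B: the three scenarios from the list above.
        public String routeB(String pollId) {
            try (Jedis jedis = pool.getResource()) {
                String value = jedis.get("poll:" + pollId);
                if (value == null) {                   // scenario 3: unknown ID
                    throw new IllegalArgumentException("unknown poll ID");
                }
                return value.isEmpty() ? null : value; // scenario 1 or 2
            }
        }

        private String generateData() { return "result"; }
    }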
For the most part, this works as intended. However, every now and then we see scenario 3 being hit: an error is thrown because the object wasn't in the cache. Because we add the placeholder object to the cache before the creation route ever returns, we should be able to safely assume this scenario is impossible. But that's clearly not the case.
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying? That is, is it possible that even though the call to add the cache entry has completed, the data would briefly not be returned by queries? It seems to be the only thing that can explain the behavior we are seeing.
If that is a possibility, how can I avoid this scenario? Is there some way to force Redis to wait until the data is available for query before returning?
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying?
Yes, and it may depend on your Redis topology and on your network configuration. Only a standalone Redis server provides strong consistency, albeit with some considerations - see below.
Redis replication
When using replication in Redis, writes that happen on a master take some time to propagate to its replica(s), and the whole process is asynchronous. Your client may be issuing read-only commands to replicas, a common approach to distribute load among the available nodes of your topology. If that is the case, you can lower the chance of an inconsistent read by:
directing your read queries to the master node; and/or,
issuing a WAIT command right after the write operation and ensuring all replicas acknowledge it: this makes replication appear synchronous from the client's standpoint, but it should be used only if absolutely needed because of its performance cost (see the sketch below).
There would still be a (tiny) possibility of an inconsistent read if a failover promotes a replica which did not receive the write operation.
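As an illustration of the WAIT approach, a minimal sketch using the Jedis client (connection details, key, and replica count are assumptions):

    import redis.clients.jedis.Jedis;

    public class WaitAfterWrite {
        public static void main(String[] args) {
            // assumes a master on localhost:6379 with at least one replica
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.set("poll:1234", "{}");      // the write we want replicated
                // block until 1 replica acknowledges, or give up after 100 ms
                long acked = jedis.waitReplicas(1, 100);
                if (acked < 1) {
                    // no replica has confirmed; a read from one may be stale
                    System.err.println("write not yet acknowledged by any replica");
                }
            }
        }
    }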
Standalone Redis server
With a standalone Redis server there is no need to synchronize data with replicas and, on top of that, your read-only commands are always handled by the same server that processed the write commands. This is the only strongly consistent option, provided you are also persisting your data accordingly: otherwise a server restart between your write and read operations could lose the entry.
Persistence
Redis supports several different persistence options; in your scenario, you may want to configure your server so that it:
logs every write operation to disk (AOF), and
fsyncs on every write (appendfsync always).
Of course, every configuration setting is a trade-off between performance and durability.
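As a sketch, these settings can also be applied at runtime through CONFIG SET, here via Jedis (the values are illustrative; appendfsync everysec is the usual compromise):

    import redis.clients.jedis.Jedis;

    public class PersistenceConfig {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.configSet("appendonly", "yes");     // enable the AOF log
                jedis.configSet("appendfsync", "always"); // fsync after every write
            }
        }
    }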
My Spring service is using Hazelcast (embedded cluster) as a local cache.
When a cache entry expires due to TTL and a large number of user requests arrive at once, they all fall through to the DB to repopulate the cache (a cache stampede).
For this reason, I thought about applying the PER (probabilistic early recomputation) algorithm, but from my understanding the algorithm is only meaningful when the cache expiration time can differ for each node.
However, it seems it cannot be applied to a Hazelcast cluster cache, because the cache is synchronized on a per-cluster basis. Is that correct?
So, what strategy should be used with Hazelcast to handle a cache stampede?
Hazelcast provides the following method:
put(K, V, long, java.util.concurrent.TimeUnit)
which allows you to specify a TTL at the per-entry level. You can, for example, randomize the TTL within some interval, which prevents all entries from expiring at the same moment and thus avoids the stampede (see the sketch below).
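A minimal sketch of that jitter approach (Hazelcast 4.x API; the map name, base TTL, and jitter window are illustrative):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    public class JitteredTtl {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<Integer, String> cache = hz.getMap("data");

            // base TTL of 600s plus up to 60s of random jitter, so entries
            // cached at the same time do not all expire at the same time
            long ttlSeconds = 600 + ThreadLocalRandom.current().nextLong(60);
            cache.put(42, "loaded-from-db", ttlSeconds, TimeUnit.SECONDS);
        }
    }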
I am using Hazelcast to cache data fetched from an API.
The structure of the API is something like this:
Controller->Service->DAOLayer->DB
I am keeping @Cacheable at the service layer, where the getData(int elementID) method is present.
In my architecture there are two PaaS nodes (np3a, np4a). The API will be deployed on both of them, and users will access it via a load-balancer IP, which redirects them to either of the nodes.
So it is possible that one hit from user X goes to np3a and another hit from the same user goes to np4a.
I want that when the very first hit caches the response on np3a, the same cached response is also available for the next hit to np4a.
I have read about:
ReplicatedMap: memory inefficient
NearCache: for when reads outnumber writes
I am not sure which approach to take, or whether you would suggest something entirely different.
If you have 2 nodes, Hazelcast will partition the data so that half of it is on node 1 and the other half on node 2. That means there's a 50% chance a request lands on the node containing the data it asks for.
If you want to avoid, in all cases, an additional network request to fetch data that is not present on a node, the only way is to copy the data to every node on each write. That's the goal of ReplicatedMap, and that's the trade-off: performance vs. memory consumption.
NearCache adds an additional cache on the "client side", if you're using a client-server architecture (as opposed to embedded).
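For the replicated option, a minimal embedded-member sketch (Hazelcast 4.x API; the map name and entry are illustrative):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.replicatedmap.ReplicatedMap;

    public class ReplicatedExample {
        public static void main(String[] args) {
            // each member (e.g. np3a and np4a) holds a full copy of this map,
            // so every read is local, at the cost of duplicated memory
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            ReplicatedMap<Integer, String> cache = hz.getReplicatedMap("api-responses");
            cache.put(42, "response-for-element-42");
            System.out.println(cache.get(42)); // served from the local replica
        }
    }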
The release notes of Infinispan 8 describe a new feature: Staggered remote gets.
These are described in the user guide:
11.4. Distribution Mode
The remote GET requests are staggered: we request the value from the primary owner, but if it doesn’t respond in a reasonable amount of time, we request the value from the backup owners as well.
This feature is documented for the Distribution Mode only.
Is this feature used for Replicated Mode as well?
Generally speaking: Is it safe to assume that replicated caches are a special case of distributed caches?
Generally speaking, yes, it's true that Replicated Mode is a special case of Distributed Mode. The code is pretty much the same, with the exception that Replicated Mode keeps a number of replicas equal to the size of the cluster: each node is a full replica.
A Get operation will not issue a remote Get when the current node is itself an owning replica of the entry. So while it's true that a remote Get would also be "staggered" if the method were invoked, in practice when you have Replication you will never actually perform a remote Get.
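To illustrate the "special case" point: both modes are set up through the same configuration builder and differ mainly in the cache mode and the number of owners. A sketch, assuming Infinispan's programmatic configuration API (cache names are arbitrary):

    import org.infinispan.Cache;
    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.configuration.global.GlobalConfigurationBuilder;
    import org.infinispan.manager.DefaultCacheManager;

    public class ModesExample {
        public static void main(String[] args) {
            DefaultCacheManager manager = new DefaultCacheManager(
                    GlobalConfigurationBuilder.defaultClusteredBuilder().build());
            // distributed: a fixed number of owners per entry
            manager.defineConfiguration("dist", new ConfigurationBuilder()
                    .clustering().cacheMode(CacheMode.DIST_SYNC).hash().numOwners(2)
                    .build());
            // replicated: in effect, owners == cluster size
            manager.defineConfiguration("repl", new ConfigurationBuilder()
                    .clustering().cacheMode(CacheMode.REPL_SYNC)
                    .build());
            Cache<String, String> repl = manager.getCache("repl");
            repl.put("k", "v"); // every node holds a full replica; gets stay local
        }
    }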
I know Redis can be used as an LRU cache, but is there a soft-limit flag where we can state that, after a specific threshold is reached, Redis will start cleaning out LRU items?
Actually I'm getting OOM errors on Redis: I've set Redis up as an LRU cache, but it hits the OOM limit and the application stops.
I know of the "maxmemory" flag, but is there a soft limit where, with some 10% of space left, we can start evicting items so that the application doesn't stop?
Did you set a specific eviction policy?
See: Eviction policies http://redis.io/topics/lru-cache
I would then check to make sure that you are not inadvertently calling PERSIST on your Redis keys. Persisted keys (those without a TTL), I believe, cannot be evicted under the volatile-* policies.
You can use TTL (http://redis.io/commands/ttl) to find out the time limit on your keys, and KEYS (http://redis.io/commands/keys) to get a list of keys (this is dangerous on a production server, as the list could be very long and the command blocks).
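As a sketch of the relevant settings applied at runtime via Jedis (the maxmemory value and key name are illustrative; allkeys-lru evicts any key, while volatile-lru only considers keys that have a TTL):

    import redis.clients.jedis.Jedis;

    public class EvictionSetup {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.configSet("maxmemory", "100mb");
                jedis.configSet("maxmemory-policy", "allkeys-lru");
                // inspect a key's TTL: -1 means no expiry, -2 means no such key
                System.out.println(jedis.ttl("some-key"));
            }
        }
    }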