How to use a memory cache in a concurrency critical context - performance

Consider the following two methods, written in pseudo code, that fetches a complex data structure, and updates it, respectively:
getData(id) {
if(isInCache(id)) return getFromCache(id) // already in cache?
data = fetchComplexDataStructureFromDatabase(id) // time consuming!
setCache(id, data) // update cache
return data
}
updateData(id, data) {
storeDataStructureInDatabase(id, data)
clearCache(id)
}
In the above implementation, there is a problem with concurrency, and we might end up with outdated data in the cache: consider two parallel executions running getData() and updateData(), respectively. If the first execution fetches data from the cache exactly in between the other execution's call to storeDataStructureInDatabase() and clearCache(), then we will get an outdated version of the data. How would you get around this concurrency problem?
I considered the following solution, where the cache is invalidated just before data is committed:
storeDataStructureInDatabase(id, data) {
executeSql("UPDATE table1 SET...")
executeSql("UPDATE table2 SET...")
executeSql("UPDATE table3 SET...")
clearCache(id)
executeSql("COMMIT")
}
But then again: If one execution reads the cache in between the other execution's call to clearCache() and COMMIT, then an outdated data will be fetched to the cache. Problem not solved.

In the cache way of thinking you cannot prevent retrieving outdated data.
For example, when someone start sending an HTTP request (if your application is a web application) that will later render the cache invalid, should we consider the cache invalid when the POST request start? when the request is handled by your server? when you start the controller code?. Well no. In fact the cache is invalid only when the database transaction ends. Not even when the transaction start, only at the end, on the COMMIT phase of the transaction. And any working process working with previous data has very few chances of being aware that the data as changed, in a web application what about html pages showing outdated data in a browser, do you want to flush theses pages?
But let's just think your parallel process are not just there for the web, but for real concurrency critical parallel jobs.
One problem is that your cache is not handled by the database server, so it's not in the transaction COMMIT/ROLLBACK. You cannot decide to clear the cache first but rebuild it if you rollback. So you can only clear and rebuild the cache after the transaction is commited.
And that lead the possibility to get an outdated version of the cache if your get comes between the database commit and the cache clear instruction. So :
is it really important that you have an outdated version of the cache? Let's say your parallel process made something just a few milliseconds before you would have retrieve this new version (so it's the old one) and work with it for maybe 40ms, and then build final report on that without noticing the cache have been flush 15ms before the end of the work. If your process response cannot contain any outdated data, then you'll have to check data validity before outputing it (so you should recheck that all data used in the work process are still valid at teh end).
So if you don't want to recheck data validity that mean your process should have put some lock (semaphore?) when starting and should release the lock only at the end of the work, your are serializing your work. Databases can speed up serialization by working on pseudo-serialization levels for transactions and breaking your transaction if any changes make this pseudo-serialization hasardous. But here you're not only working with a database so you should do the serialization on your own side.
Process serialization is slow, but you may try to do the same as the database, that is runing jobs in parallel and invalidating any job running when data is altered (so having something that detect your cache clear and kill and rerun all existing parallel jobs, implying you have something mastering all the parallel jobs)
or simply accept you can have small past-invalid-outdated data. If we talk of web application the time your response walks on TCP/IP to the client browser it may be already invalid.
Chances are that you will accept to work with outdated cache data. The only really important point is that if you cannot trust your cache data for a really critical thing then you should'nt use a cache for that. If your are manipulating Accounting data for example. The only way to get a serialization of parallel tasks is to do:
in the Writing process: all the important reads (the one that will get some writes) and all the write things in a transaction with a high isolation level (level 4) and with all necessary row locks. That's something hard to do working only with a database, it's quite impossible if you add an external cache for read operations.
in parallel read process: do what you want (read from external cache), if the read data won't be used for write operations. If one of the read data will later be use for a write operation this data validity will have to be checked in the write transaction (so in the Writing process). Why not adding a timestamp watermark on the data, so that when it will come back for a write operation you'll be able to know if it is still valid.

Related

Is this Redis Race Condition Scenario Possible?

I'm debugging an issue in an application and I'm running into a scneario where I'm out of ideas, but I suspect a race condition might be in play.
Essentially, I have two API routes - let's call them A and B. Route A generates some data and Route B is used to poll for that data.
Route A first creates an entry in the redis cache under a given key, then starts a background process to generate some data. The route immediately returns a polling ID to the caller, while the background data thread continues to run. When the background data is fully generated, we write it to the cache using the same cache key. Essentially, an overwrite.
Route B is a polling route. We simply query the cache using that same cache key - we expect one of 3 scenarios in this case:
The object is in the cache but contains no data - this indicates that the data is still being generated by the background thread and isn't ready yet.
The object is in the cache and contains data - this means that the process has finished and we can return the result.
The object is not in the cache - we assume that this means you are trying to poll for an ID that never existed in the first place.
For the most part, this works as intended. However, every now and then we see scenario 3 being hit, where an error is being thrown because the object wasn't in the cache. Because we add the placeholder object to the cache before the creation route ever returns, we should be able to safely assume this scenario is impossible. But that's clearly not the case.
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying? That is, is it possible that even though the call to add the cache entry has completed but the data would briefly not be returned by queries? It seems the be the only thing that can explain the behavior we are seeing.
If that is a possibility, how can I avoid this scenario? Is there some way to force Redis to wait until the data is available for query before returning?
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying?
Yes and it may depend on your Redis topology and on your network configuration. Only standalone Redis servers provides strong consistency, albeit with some considerations - see below.
Redis replication
While using replication in Redis, the writes which happen in a master need some time to propagate to its replica(s) and the whole process is asynchronous. Your client may happen to issue read-only commands to replicas, a common approach used to distribute the load among the available nodes of your topology. If that is the case, you may want to lower the chance of an inconsistent read by:
directing your read queries to the master node; and/or,
issuing a WAIT command right after the write operation, and ensure all the replicas acknowledged it: while the replication process would happen to be synchronous from the client standpoint, this option should be used only if absolutely needed because of its bad performance.
There would still be the (tiny) possibility of an inconsistent read if, during a failover, the replication process promotes a replica which did not receive the write operation.
Standalone Redis server
With a standalone Redis server, there is no need to synchronize data with replicas and, on top of that, your read-only commands would be always handled by the same server which processed the write commands. This is the only strongly consistent option, provided you are also persisting your data accordingly: in fact, you may end up having a server restart between your write and read operations.
Persistence
Redis supports several different persistence options; in your scenario, you may want to configure your server so that it
logs to disk every write operation (AOF) and
fsync every query.
Of course, every configuration setting is a trade off between performance and durability.

Cache invalidation algorithm

I'm thinking about caching dynamic content in web server. My goal is to bridge the whole processing by returning a cached HTTP response without bothering the DB (or Hibernate). This question is not about choosing between existing caching solutions; my current concern is the invalidation.
I'm sure, a time-based invalidation makes no sense at all: Whenever a user changes anything, they expect to see to the effect immediately rather than in a few seconds or even minutes. And caching for a fraction of a second is useless as there are no repeated requests for the same data in such a short period (since most of the data is user-specific).
For every data change, I get an event and can use it to invalidate everything depending on the changed data. As request happen concurrently, there are two time-related problems:
Invalidation may come too late and stale data may be served even to the client who changed them.
After the invalidation has finished, a long running request may finish and its stale data may get put into the cache.
The two problems are sort of opposite to each other. I guess, the former is easily solved by partially serializing requests from the same client using a ReadWriteLock per client. So let's forget it.
The latter is more serious as it basically means a lost invalidation and serving the stale data forever (or too long).
I can imagine a solution like repeating the invalidation after every request having started before the change happened, but this sounds rather complicated and time-consuming. I wonder if any existing caches do support this, but I'm mainly interested in how this gets done in general.
Clarification
The problem is a simple race condition:
Request A executes a query and fetches the result
Request B does some changes
The invalidation due to B happens
Request A (which was delayed for whatever reason) finishes
The obsolete response by request A gets written into the cache
To solve the race condition, add a timestamp (or a counter) and check this timestamp when setting a new cache entry.
This ensures that obsolete response will not be cached.
Here's a pseudocode:
//set new cache entry if resourceId is not cached
//or if existing entry is stale
function setCache(resourceId, requestTimestamp, responseData) {
if (cache[resourceId]) {
if (cache[resourceId].timestamp > requestTimestamp) {
//existing entry is newer
return;
} else
if (cache[resourceId].timestamp = requestTimestamp) {
//ensure invalidation
responseData = null;
}
}
cache[resourceId] = {
timestamp: requestTimestamp,
response: responseData
};
}
Let's say we got 2 requests for the same resource "foo":
Request A (received at 00:00:00.000) executes a query and fetches the result
Request B (received at 00:00:00.001) does some changes
The invalidation due to B happens by calling setCache("foo", "00:00:00.001", null)
Request A finishes
Request A calls setCache("foo", "00:00:00.000", ...) to write the obsolete response to cache but fails because the existing entry is newer
This is just the basic mechanism, so there is room for improvements.
I think you don't realize (or don't want to explicitly call out) that you are asking about a choice between cache synchronization strategies. There are several well known strategies: "cache aside", "read through", "write through", and "write behind". e.g. read here: A beginner’s guide to Cache synchronization strategies. They offer various levels of cache consistency (invalidation as you call it).
Your choice should depend on your needs and requirements.
It sounds like so far you've chosen "write behind" strategy (queue or defer cache invalidation). But from your concerns it sounds like you've chosen it incorrectly, because you are worried about inconsistent cache reads.
So, you should consider using "cache aside" or "read/write through" strategies, because those offer better cache consistency. They all are different flavors of the same thing - always keep cache consistent. If you don't care about cache consistency, then ok, stay with "write behind", but then this question becomes irrelevant.
Architecture wide, I would never go with raising events to invalidate the cache, because it seems like you've made it part of your business logic, while it's just an infrastructure concern. Invalidate (or queue invalidation of) cache as part of read/write operations, and not somewhere else. That allows cache to become just one aspect of your infrastructure, and not part of everything else.

Dealing with concurrency issues when caching for high-traffic sites

I was asked this question in an interview:
For a high traffic website, there is a method (say getItems()) that gets called frequently. To prevent going to the DB each time, the result is cached. However, thousands of users may be trying to access the cache at the same time, and so locking the resource would not be a good idea, because if the cache has expired, the call is made to the DB, and all the users would have to wait for the DB to respond. What would be a good strategy to deal with this situation so that users don't have to wait?
I figure this is a pretty common scenario for most high-traffic sites these days, but I don't have the experience dealing with these problems--I have experience working with millions of records, but not millions of users.
How can I go about learning the basics used by high-traffic sites so that I can be more confident in future interviews? Normally I would start a side project to learn some new technology, but it's not possible to build out a high-traffic site on the side :)
The problem you were asked on the interview is the so-called Cache miss-storm - a scenario in which a lot of users trigger regeneration of the cache, hitting in this way the DB.
To prevent this, first you have to set soft and hard expiration date. Lets say the hard expiration date is 1 day, and the soft 1 hour. The hard is one actually set in the cache server, the soft is in the cache value itself (or in another key in the cache server). The application reads from cache, sees that the soft time has expired, set the soft time 1 hour ahead and hits the database. In this way the next request will see the already updated time and won't trigger the cache update - it will possibly read stale data, but the data itself will be in the process of regeneration.
Next point is: you should have procedure for cache warm-up, e.g. instead of user triggering cache update, a process in your application to pre-populate the new data.
The worst case scenario is e.g. restarting the cache server, when you don't have any data. In this case you should fill cache as fast as possible and there's where a warm-up procedure may play vital role. Even if you don't have a value in the cache, it would be a good strategy to "lock" the cache (mark it as being updated), allow only one query to the database, and handle in the application by requesting the resource again after a given timeout
You could probably be better of using some distributed cache repository, as memcached, or others depending your access pattern.
You could use the Cache implementation of Google's Guava library if you want to store the values inside the application.
From the coding point of view, you would need something like
public V get(K key){
V value = map.get(key);
if (value == null) {
synchronized(mutex){
value = map.get(key);
if (value == null) {
value = db.fetch(key);
map.put(key, value);
}
}
}
return value;
}
where the map is a ConcurrentMap and the mutex is just
private static Object mutex = new Object();
In this way, you will have just one request to the db per missing key.
Hope it helps! (and don't store null's, you could create a tombstone value instead!)
Cache miss-storm or Cache Stampede Effect, is the burst of requests to the backend when cache invalidates.
All high concurrent websites I've dealt with used some kind of caching front-end. Bein Varnish or Nginx, they all have microcaching and stampede effect suppression.
Just google for Nginx micro-caching, or Varnish stampede effect, you'll find plenty of real world examples and solutions for this sort of problem.
All boils down to whether or not you'll allow requests pass through cache to reach backend when it's in Updating or Expired state.
Usually it's possible to actively refresh cache, holding all requests to the updating entry, and then serve them from cache.
But, there is ALWAYS the question "What kind of data are you supposed to be caching or not", because, you see, if it is just plain text article, which get an edit/update, delaying cache update is not as problematic than if your data should be exactly shown on thousands of displays (real-time gaming, financial services, and so on).
So, the correct answer is, microcache, suppression of stampede effect/cache miss storm, and of course, knowing which data to cache when, how and why.
It is worse to consider particular data type for caching only if data consumers are ready for getting stale date (in reasonable bounds).
In such case you could define invalidation/eviction/update policy to keep you data up-to-date (in business meaning).
On update you just replace data item in cache and all new requests will be responsed with new data
Example: Stocks info system. If you do not need real-time price info it is reasonable to keep in cache stock and update it every X mils/secs with expensive remote call.
Do you really need to expire the cache. Can you have an incremental update mechanism using which you can always increment the data periodically so that you do not have to expire your data but keep on refreshing it periodically.
Secondly, if you want to prevent too many users from hiting the db in one go, you can have a locking mechanism in your stored proc (if your db supports it) that prevents too many people hitting the db at the same time. Also, you can have a caching mechanism in your db so that if someone is asking for the exact same data from the db again, you can always return a cached value
Some applications also use a third service layer between the application and the database to protect the database from this scenario. The service layer ensures that you do not have the cache miss storm in the db
The answer is to never expire the Cache and have a background process update cache periodically. This avoids the wait and the cache-miss storms, but then why use cache in this scenario?
If your app will crash with a "Cache miss" scenario, then you need to rethink your app and what is cache verses needed In-Memory data. For me, I would use an In Memory database that gets updated when data is changed or periodically, not a Cache at all and avoid the aforementioned scenario.

When to Use Azure Caching Local Cache

I want to start using the Azure Distributed Caching and came across the concept of LocalCache. But the fact that it can go out of sync with the Distributed Cache, makes me wonder, why I would want to use it and how I could use it safely.
When enabled, items retrieved from the cache cluster are locally stored in memory on the client machine. This improves performance of subsequent get requests, but it can result in inconsistency of data between the locally cached version and the actual item in the cache cluster.
Calling DataCache.GetIfNewer is one option to ensure that I get the latest version, but that requires that I still do a call to the Distributed Cache, passing in the object that I want to check, in order to see if the two versions differ.
I could use Notifications to invalidate the LocalCache object, but that is done on a polling basis, which opens up the opportunity for an update to occur within the poll period leaving me with stale data.
So,why would I ever use LocalCache, and if there is a reason to do so, how do I use it safely?
"There are only two hard things in Computer Science: cache invalidation and naming things" - Phil Karlton
You would use LocalCache when a) performance is critical b) you don't care that the retrieved object might be stale.
There are many cases where the object is never going to be out of date (e.g. list of public/bank holidays), or when you are not too worried about being 100% up-to-date (e.g. if item has > 1000 units in stock, use local cache, otherwise re-fetch from database).
Don't try and invalidate the local cache. If you need more up-to-date objects, get them from the cluster. If you cannot tolerate out-of-sync data, get it from the database. Caching is always a performance-inconsistency compromise — LocalCache more than the server cache, but the server cache is still a compromise.

Plone 4.2 how to make PAS cache external usera data

I'm implementing a PAS plugin that handles authentications against mailservers. Actually only DBMail is implemented.
I realized, that the enumerateUsers function from the PAS plugin is called numerous times per request and requires my plugin to open/close an SQL connections for every (subsequent) request. Of course, this is very expensive.
The connections itself are handled in a plone tool, which is able to handle multiple different mailservers and delegeates the enumerateUsers call to wrapper objects that represent registered servers.
My question is now, what sort of cache (OOBTree, Session?) I should use to provide a temporary local storage for repeating enumerations and avoid subsequent SQL connections?
Another idea was, to hook into the user creation process that takes place on the first login, an external user issues and completely "localize" the users.
Third idea was, to store the needed data in the specific member, if possible.
What would be best practice here?
I'd cache the query results, indeed. You need to make a decision on how long to cache the results, and if stored long term, how to invalidate that cache or check for changes.
There are no best practices for these decisions, as they depend entirely on the type of data stored and the APIs of the backends. If they support some kind of freshness query, for example, then you store everything forever and poll the backend to see if the cache needs updating.
You can start with a simple request cache; query once per request, store it on the request object. Your cache will automatically be invalidated at the end of the request as the request object is cleaned up, the next request will be a clean slate.
If your backend users rarely change, you can cache information for longer, in a local cache. I'd use a volatile attribute on the plugin. Any attribute starting with _v_ is ignored by the persistence machinery. Thus, anything stored in a _v_ volatile attribute is both thread-local and only exists for the lifetime of the process, a restart of the server clears these automatically.
At the very least you should use an _v_ volatile attribute to store your backend SQL connections. That way they can stay open between requests, and can be re-used. Something like the following method would do nicely:
def _connection(self):
# Return a backend connection
if getattr(self, '_v_connection', None) is None:
# Create connection here
self._v_connection = yourdatabaseconnection
return self._v_connection
You could also use a persistent attribute on your plugin to store your cache. This cache would be committed to the ZODB and persist across restarts. You then really need to work out how to invalidate the contents; store timestamps and evict data when to old, etc.
Your cache datastructure depends entirely on your application needs. If you don't persist information, a dictionary (username -> information) could be more than enough. Persisted caches could benefit from using a OOBTree instead of a dictionary as they reduce chances of conflicts between different threads and are more efficient when it comes to large sets of data.
Whatever you do, you do not need to use a Session. Sessions are prone to conflicts, do not scale well, and are in any case not the place to store a cache of this kind.

Resources