I want to start using Azure Distributed Caching and came across the concept of LocalCache. But the fact that it can go out of sync with the distributed cache makes me wonder why I would want to use it and how I could use it safely.
When enabled, items retrieved from the cache cluster are locally stored in memory on the client machine. This improves performance of subsequent get requests, but it can result in inconsistency of data between the locally cached version and the actual item in the cache cluster.
Calling DataCache.GetIfNewer is one option to ensure that I get the latest version, but that requires that I still do a call to the Distributed Cache, passing in the object that I want to check, in order to see if the two versions differ.
I could use Notifications to invalidate the LocalCache object, but that is done on a polling basis, which opens up the opportunity for an update to occur within the poll period leaving me with stale data.
So, why would I ever use LocalCache, and if there is a reason to do so, how do I use it safely?
"There are only two hard things in Computer Science: cache invalidation and naming things" - Phil Karlton
You would use LocalCache when a) performance is critical and b) you don't care that the retrieved object might be stale.
There are many cases where the object is never going to be out of date (e.g. list of public/bank holidays), or when you are not too worried about being 100% up-to-date (e.g. if item has > 1000 units in stock, use local cache, otherwise re-fetch from database).
Don't try and invalidate the local cache. If you need more up-to-date objects, get them from the cluster. If you cannot tolerate out-of-sync data, get it from the database. Caching is always a performance-inconsistency compromise — LocalCache more than the server cache, but the server cache is still a compromise.
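To make the "use the local copy only when it is good enough" rule concrete, here is a minimal sketch. It is not the Azure LocalCache API: LocalCacheReader, fetch_from_cluster and the good_enough predicate are hypothetical names made up for illustration.

```python
from typing import Any, Callable, Dict


class LocalCacheReader:
    """Keep an in-process copy and let each caller decide if it is good enough."""

    def __init__(self, fetch_from_cluster: Callable[[str], Any]) -> None:
        self._fetch = fetch_from_cluster      # hypothetical call to the cache cluster
        self._local: Dict[str, Any] = {}      # in-process copies, possibly stale

    def get(self, key: str,
            good_enough: Callable[[Any], bool] = lambda value: True) -> Any:
        # Use the local copy only if we have one and the caller accepts it.
        if key in self._local and good_enough(self._local[key]):
            return self._local[key]
        # Otherwise go to the cluster and remember the fresh value.
        value = self._fetch(key)
        self._local[key] = value
        return value
```

A stock check would then read something like reader.get("stock:42", good_enough=lambda units: units > 1000), while the holiday list can use the default predicate and never leave the process after the first fetch.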
A system is being implemented using microservices. In order to decrease interactions between microservices implemented "at the same level" in an architecture, some microservices will locally cache copies of tables managed by other services. The assumption is that the locally cached table (a) is frequently accessed in a "read mode" by the microservice, and (b) has relatively static content (i.e., more of a "lookup table" than transactional content).
The local caches will maintain synch using inter-service messaging. As the content should be fairly static, this should not be a significant issue/workload. However, on startup of a microservice, there is a possibility that the local cache has gone stale.
I'd like to implement some sort of rolling revision number on the source table, so that microservices with local caches can check this revision number to potentially avoid a re-synch event.
Is there a "best practice" to this approach? Or, a "better alternative", given that each microservice is backed by its own database (i.e., no shared database)?
In my opinion you shouldn't be loading the data at startup. It might be a bit complicated to maintain a version.
Cache-Aside Pattern
Generally, in a microservices architecture you would consider the cache-aside pattern. You don't build the cache up front but on demand. When you get a request you check the cache; if the value is not there, you load it from the database, put it in the cache, and return the response. From then on it is returned from the cache. The benefit is that you don't need to load everything up front. Say you have 200 records while services frequently use only 50 of them: loading everything means maintaining extra cache entries that may never be needed.
Let the requests build the cache; each entry costs a one-time DB hit. You can set an expiry on cache entries and let incoming requests rebuild them.
If your data is totally static (it never changes) then this pattern may not be worth discussing, but if you have a lookup table that can change even once a week or once a month, you should use this pattern with a longer cache expiration time. Maintaining a version number could be costly, but it is really up to you how you implement it.
https://learn.microsoft.com/en-us/azure/architecture/patterns/cache-aside
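A minimal sketch of cache-aside with expiry, assuming a single process and an in-memory dictionary standing in for the cache. CacheAside and load_from_db are hypothetical names, and the clock is injectable only so that expiry can be exercised without sleeping.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Cache-aside: look in the cache first, load from the DB only on a miss."""

    def __init__(self, load_from_db: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db          # hypothetical DB lookup supplied by the caller
        self._ttl = ttl_seconds
        self._clock = clock                # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]                # hit: no DB access
        value = self._load(key)            # miss or expired: one DB hit
        self._entries[key] = (now + self._ttl, value)
        return value
```

Only the keys that are actually requested ever occupy the cache, and a lookup table that changes once a week can simply be given a long ttl_seconds.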
We ran into this same issue and have temporarily solved it by using a LastUpdated timestamp comparison (same concept as your VersionNumber). Every night (when our application tends to be slow) each service publishes a ServiceXLastUpdated message that includes the most recent timestamp when the data it owns was added/edited. Any other service that subscribes to this data processes the message, and if there's a mismatch it requests all rows "touched" since its last local update so that it can get back in sync.
For us, for now, this is okay as new services don't tend to come online and be in use the same day. But our plan going forward is that any time a service starts up, it can publish a message for each subscribed service indicating its most recent cache update timestamp. If a "source" service sees the timestamp is not current, it can send updates to re-sync the data. This has the advantage of only sending the needed updates to the specific service(s) that need them, even though (at least for us) all subscribed services have access to the messages.
We started with persistent queues, so if all instances of a microservice were down, the messages would just build up in its queue. There are 2 issues with this that led us to build something better:
1) It obviously doesn't solve the "first startup" scenario as there is no queue for messages to build up in
2) If ANYTHING goes wrong either in storing queued messages or processing them, you end up out of sync. If that happens, you still need a proactive mechanism like we have now to bring things back in sync. So, it seemed worth going this route
I wouldn't say our method is a "best practice" and if there is one I'm not aware of it. But, the way we're doing it (including planned future work) has so far proven simple to build, easy to understand and monitor, and robust in that it's extremely rare we get an event caused by out-of-sync local data.
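As a rough illustration of the LastUpdated comparison, here is a sketch with the messaging replaced by direct method calls. Row, SourceService and SubscriberCache are hypothetical names, timestamps are plain integers, and deleted rows are not handled (they would need a tombstone row or a separate message).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Row:
    key: str
    value: str
    updated_at: int  # timestamp assigned by the owning service


class SourceService:
    """Owns the table and can report what changed after a given timestamp."""

    def __init__(self) -> None:
        self._rows: Dict[str, Row] = {}

    def upsert(self, key: str, value: str, updated_at: int) -> None:
        self._rows[key] = Row(key, value, updated_at)

    def last_updated(self) -> int:
        # Content of the nightly "ServiceXLastUpdated" message.
        return max((row.updated_at for row in self._rows.values()), default=0)

    def rows_touched_since(self, timestamp: int) -> List[Row]:
        return [row for row in self._rows.values() if row.updated_at > timestamp]


class SubscriberCache:
    """Local copy held by another service."""

    def __init__(self) -> None:
        self.rows: Dict[str, Row] = {}
        self.last_sync = 0

    def on_last_updated(self, advertised: int, source: SourceService) -> int:
        """Pull only the rows changed since our last sync; return how many."""
        if advertised <= self.last_sync:
            return 0                      # already in sync, nothing to request
        changed = source.rows_touched_since(self.last_sync)
        for row in changed:
            self.rows[row.key] = row
        self.last_sync = advertised
        return len(changed)
```

The same on_last_updated call works for the "first startup" case: a brand-new subscriber has last_sync at 0 and therefore receives every row.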
I am new to the topic. Having read a handful of articles on it, and asked a couple of people, I still do not understand what you people do in regard to one problem.
There are UI clients making requests to several backend instances (for now it's irrelevant whether sessions are sticky or not), and those instances are connected to some highly available DB cluster (be it Cassandra or something else, or even Elasticsearch). Say a backend instance is not specifically tied to one of the cluster's machines, and instead each of its requests to the DB may be served by a different machine.
One client creates some record; it is synchronously or asynchronously stored on one of the cluster's machines and then eventually gets replicated to the rest of the DB machines. Then another client requests the list of records, the request ends up served by a distant machine that has not yet received the replicated changes, and so the client does not see the record. Well, that's bad but not yet ugly.
Consider however that the second client hits the machine which has the record, displays it in a list, then refreshes the list and this time hits the distant machine and again does not see the record. That's very weird behavior to observe, isn't it? It might even get worse: the client successfully requests the record, starts some editing on it, then tries to store the updates to DB and this time hits the distant machine which says "I know nothing about this record you are trying to update". That's an error which the user will see while doing something completely legitimate.
So what's the common practice to guard against this?
So far, I only see three solutions.
1) Not actually a solution but rather a policy: ignore the problem and instead speed up the cluster enough to guarantee that 99.999% of changes will be replicated across the whole cluster in, say, 0.5 second (it's hard to imagine a user making several consecutive requests to one record in that time; he can of course issue several read requests, but in that case he'll probably not notice inconsistency between results). And even if sometimes something goes wrong and the user faces the problem, well, we just embrace that. If the user gets unhappy and writes a complaint to us (which will happen maybe once a week or once an hour), we just apologize and go on.
2) Introduce an affinity between user's session and a specific DB machine. This helps, but needs explicit support from the DB, and also hurts load-balancing, and invites complications when the DB machine goes down and the session needs to be re-bound to another machine (however with proper support from DB I think that's possible; say Elasticsearch can accept routing key, and I believe if the target shard goes down it will just switch the affinity link to another shard - though I am not entirely sure; but even if re-binding happens, the other machine may contain older data :) ).
3) Rely on monotonic consistency, i.e. some method to be sure that the next request from a client will get results no older than the previous one. But, as I understand it, this approach also requires explicit support from the DB, like being able to pass some "global version timestamp" to the cluster's balancer, which it will compare with its latest data on all machines' timestamps to determine which machines can serve the request.
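To make option 3 concrete, this is roughly what I have in mind, as a toy sketch. Replica, Balancer and Session are hypothetical names and do not correspond to any real database API; each replica simply exposes the version of the last change it has applied.

```python
from typing import Any, Dict, List, Optional, Tuple


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.version = 0                  # version of the last change applied here
        self.data: Dict[str, Any] = {}

    def apply(self, version: int, key: str, value: Any) -> None:
        self.data[key] = value
        self.version = version


class Balancer:
    def __init__(self, replicas: List[Replica]) -> None:
        self._replicas = replicas

    def read(self, key: str, min_version: int) -> Tuple[Optional[Any], int]:
        """Serve the read from the first replica that is at least min_version."""
        for replica in self._replicas:
            if replica.version >= min_version:
                return replica.data.get(key), replica.version
        raise LookupError("no replica has caught up to version %d" % min_version)


class Session:
    """Remembers the newest version this client has seen so far."""

    def __init__(self, balancer: Balancer) -> None:
        self._balancer = balancer
        self.last_seen = 0

    def read(self, key: str) -> Optional[Any]:
        value, version = self._balancer.read(key, self.last_seen)
        self.last_seen = max(self.last_seen, version)
        return value
```

Once a session has read from an up-to-date replica, a lagging replica is skipped for that session, so the record cannot "disappear" on refresh.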
Are there other good options? Or are those three considered good enough to use?
P.S. My specific problem right now is with Elasticsearch; AFAIK there is no support for monotonic reads there, though looks like option #2 may be available.
Apache Ignite has a primary partition for a key and backup partitions. Unless you have the readFromBackup option set, you will always be reading from the primary partition, whose contents are expected to be reliable.
If a node goes away, a transaction (or operation) should be either propagated by remaining nodes or rolled back.
Note that Apache Ignite doesn't do Eventual Consistency but instead Strong Consistency. It means that you can observe delays during node loss, but will not observe inconsistent data.
In Cassandra, if you use at least quorum consistency for both reads and writes, you will get monotonic reads. This was not the case before 1.0, but that's a long time ago. There are some gotchas if using server timestamps, but that's not the default, so it likely won't be an issue if you use C* 2.1+.
What can get funny, since C* uses timestamps, is things that occur at the "same time". Since Cassandra is last-write-wins, the timestamps and clock drift do matter. Concurrent updates to records will always have race conditions, so if you require strong read-before-write guarantees you can use lightweight transactions (essentially CAS operations using Paxos) to ensure no one else updates between your read and your update. These are slow, though, so I would avoid them unless critical.
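The quorum claim comes down to simple arithmetic: with N replicas, a read that waits for R replicas and a write that waits for W replicas must overlap on at least one replica whenever R + W > N. A small sketch (the function names are illustrative, not part of any Cassandra driver):

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return write_acks + read_acks > replication_factor
```

With a replication factor of 3, quorum is 2, and 2 + 2 > 3, so quorum reads always touch a replica that took the latest quorum write. Reading and writing at consistency ONE gives 1 + 1 = 2, which is not greater than 3, so a read may miss the write.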
In a true distributed system, it does not matter where your record is stored in the remote cluster as long as your clients are connected to that cluster. In Hazelcast, a record is always stored in a partition, and each partition is owned by one of the servers in the cluster. There could be X number of partitions in the cluster (by default 271) and all those partitions are distributed evenly across the cluster. So a 3-member cluster will have a partition distribution like 91-90-90.
Now when a client sends a record to store in the Hazelcast cluster, it already knows which partition the record belongs to by using a consistent hashing algorithm. And with that, it also knows which server is the owner of that partition. Hence, the client sends its operation directly to that server. This approach applies to all client operations, put or get. So in your case, you may have several UI clients connected to the cluster, but your record for a particular user is stored on one server in the cluster and all your UI clients will be approaching that server for their operations related to that record.
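The routing idea can be sketched in a few lines. This is only an illustration of key to partition to owner: the SHA-256 hash and the round-robin partition table are assumptions of the sketch, not what Hazelcast does internally (it hashes the serialized key with its own function and maintains its own partition table).

```python
import hashlib
from typing import Dict, List

PARTITION_COUNT = 271  # the default mentioned above


def partition_id(key: str) -> int:
    """Map a key to a partition; the same key always lands in the same partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % PARTITION_COUNT


def assign_partitions(members: List[str]) -> Dict[int, str]:
    """Spread partitions over members round-robin (a simplification)."""
    return {pid: members[pid % len(members)] for pid in range(PARTITION_COUNT)}


def owner_of(key: str, table: Dict[int, str]) -> str:
    """The single member every client contacts for this key."""
    return table[partition_id(key)]
```

Because every client computes the same partition for a given key, all reads and writes for that record go to the same owner, which is why clients do not see different copies of it.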
As for consistency, Hazelcast by default is a strongly consistent distributed cache, which implies that all your updates to a particular record happen synchronously, in the same thread, and the application waits until it has received acknowledgement from the owner server (and the backup server if backups are enabled) in the cluster.
When you connect a DB layer (this could be one or many different types of DBs running in parallel) to the cluster, the Hazelcast cluster returns data even if it's not currently present in the cluster by reading it from the DB. So you never get a null value. On updating, you configure the cluster to send the updates downstream synchronously or asynchronously.
Ah-ha, after some even more thorough study of ES discussions I found this: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html
Note how they specifically highlight the "custom value" case, recommending to use it exactly to solve my problem.
So, given that's their official recommendation, we can summarise it like this.
To fight volatile reads, we are supposed to use "preference", with "custom" or some other approach. To also get "read your writes" consistency, we can have all clients use "preference=_primary", because primary shard is first to get all writes. This however will probably have worse performance than "custom" mode due to no distribution. And that's quite similar to what other people here said about Ignite and Hazelcast.
Right?
Of course that's a solution specifically for ES. Reverting to my initial question which is a bit more generic, turns out that options #2 and #3 are really considered good enough for many distributed systems, with #3 being possible to achieve with #2 (even without immediate support for #3 by DB).
I was asked this question in an interview:
For a high traffic website, there is a method (say getItems()) that gets called frequently. To prevent going to the DB each time, the result is cached. However, thousands of users may be trying to access the cache at the same time, and so locking the resource would not be a good idea, because if the cache has expired, the call is made to the DB, and all the users would have to wait for the DB to respond. What would be a good strategy to deal with this situation so that users don't have to wait?
I figure this is a pretty common scenario for most high-traffic sites these days, but I don't have the experience dealing with these problems--I have experience working with millions of records, but not millions of users.
How can I go about learning the basics used by high-traffic sites so that I can be more confident in future interviews? Normally I would start a side project to learn some new technology, but it's not possible to build out a high-traffic site on the side :)
The problem you were asked about in the interview is the so-called cache miss storm: a scenario in which a lot of users trigger regeneration of the cache at once, all hitting the DB.
To prevent this, first you have to set a soft and a hard expiration date. Let's say the hard expiration is 1 day and the soft one 1 hour. The hard one is actually set in the cache server; the soft one is stored in the cache value itself (or in another key in the cache server). The application reads from the cache, sees that the soft time has expired, sets the soft time 1 hour ahead and hits the database. In this way the next request will see the already updated time and won't trigger the cache update. It will possibly read stale data, but the data itself will be in the process of regeneration.
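A minimal single-process sketch of the soft expiration idea. SoftExpiryCache and load_from_db are hypothetical names, a dictionary stands in for the cache server, and the hard expiration is left out because the cache server would enforce it. With several processes, the "bump the soft deadline" step would have to be an atomic operation on the cache server (for example a compare-and-set).

```python
import time
from typing import Any, Callable, Dict, List


class SoftExpiryCache:
    def __init__(self, load_from_db: Callable[[str], Any], soft_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db
        self._soft_ttl = soft_ttl
        self._clock = clock
        self._store: Dict[str, List[Any]] = {}   # key -> [soft_deadline, value]

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is None:
            # Cold miss: nothing stale to serve, so load directly.
            value = self._load(key)
            self._store[key] = [now + self._soft_ttl, value]
            return value
        if now >= entry[0]:
            # Push the soft deadline forward BEFORE reloading, so that other
            # readers arriving meanwhile keep getting the stale value.
            entry[0] = now + self._soft_ttl
            entry[1] = self._load(key)
        return entry[1]
```

Only the request that first notices the expired soft deadline pays for the database call; everyone else keeps reading the old value until the new one is stored.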
The next point is: you should have a procedure for cache warm-up, e.g. instead of a user triggering the cache update, a process in your application pre-populates the new data.
The worst case scenario is e.g. restarting the cache server, when you don't have any data. In this case you should fill the cache as fast as possible, and that's where a warm-up procedure may play a vital role. Even if you don't have a value in the cache, it would be a good strategy to "lock" the cache (mark it as being updated), allow only one query to the database, and handle this in the application by requesting the resource again after a given timeout.
You would probably be better off using some distributed cache repository, such as memcached, or others depending on your access pattern.
You could use the Cache implementation of Google's Guava library if you want to store the values inside the application.
From the coding point of view, you would need something like
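    public V get(K key) {
        V value = map.get(key);              // fast path: no lock on a cache hit
        if (value == null) {
            synchronized (mutex) {           // only one thread at a time may load
                value = map.get(key);        // re-check: another thread may have loaded it
                if (value == null) {
                    value = db.fetch(key);   // single DB hit for the missing key
                    map.put(key, value);
                }
            }
        }
        return value;
    }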
where the map is a ConcurrentMap and the mutex is just
    private static final Object mutex = new Object();
In this way, you will have just one request to the db per missing key.
Hope it helps! (And don't store nulls; you could create a tombstone value instead!)
Cache miss-storm or Cache Stampede Effect, is the burst of requests to the backend when cache invalidates.
All highly concurrent websites I've dealt with used some kind of caching front-end. Be it Varnish or Nginx, they all have microcaching and stampede effect suppression.
Just google for Nginx micro-caching, or Varnish stampede effect, you'll find plenty of real world examples and solutions for this sort of problem.
It all boils down to whether or not you allow requests to pass through the cache and reach the backend when the entry is in the Updating or Expired state.
Usually it's possible to actively refresh cache, holding all requests to the updating entry, and then serve them from cache.
But there is ALWAYS the question of what kind of data you are supposed to be caching, because if it is just a plain-text article that gets an edit/update, delaying the cache update is not as problematic as when your data must be shown exactly on thousands of displays (real-time gaming, financial services, and so on).
So, the correct answer is, microcache, suppression of stampede effect/cache miss storm, and of course, knowing which data to cache when, how and why.
It is worth considering a particular data type for caching only if the data consumers are ready to get stale data (within reasonable bounds).
In such a case you could define an invalidation/eviction/update policy to keep your data up to date (in the business sense).
On update you just replace the data item in the cache, and all new requests will be answered with the new data.
Example: a stock info system. If you do not need real-time price info, it is reasonable to keep the stock in the cache and update it every X millis/secs with an expensive remote call.
Do you really need to expire the cache? Can you have an incremental update mechanism with which you periodically refresh the data, so that you do not have to expire it at all?
Secondly, if you want to prevent too many users from hitting the db in one go, you can have a locking mechanism in your stored proc (if your db supports it) that prevents too many people from hitting the db at the same time. Also, you can have a caching mechanism in your db so that if someone is asking for the exact same data from the db again, you can always return a cached value.
Some applications also use a third service layer between the application and the database to protect the database from this scenario. The service layer ensures that you do not have the cache miss storm in the db
The answer is to never expire the Cache and have a background process update cache periodically. This avoids the wait and the cache-miss storms, but then why use cache in this scenario?
If your app will crash in a "cache miss" scenario, then you need to rethink your app and what is cache versus needed in-memory data. For me, I would use an in-memory database that gets updated when data is changed or periodically, not a cache at all, and avoid the aforementioned scenario.
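A sketch of the "never expire, refresh in the background" approach, assuming the whole data set fits in memory. RefreshedSnapshot and load_all are hypothetical names; error handling is omitted, so a failing load_all would stop the background thread.

```python
import threading
from typing import Any, Callable, Dict


class RefreshedSnapshot:
    """Readers never wait: they always see the last complete snapshot."""

    def __init__(self, load_all: Callable[[], Dict[str, Any]]) -> None:
        self._load_all = load_all
        self._snapshot = load_all()        # warm up before serving any request
        self._stop = threading.Event()

    def get(self, key: str, default: Any = None) -> Any:
        return self._snapshot.get(key, default)   # never touches the DB

    def refresh_once(self) -> None:
        # Build the new snapshot fully, then swap the reference in one assignment.
        self._snapshot = self._load_all()

    def start(self, interval_seconds: float) -> threading.Thread:
        def loop() -> None:
            # Event.wait returns False on timeout, True once stop() was called.
            while not self._stop.wait(interval_seconds):
                self.refresh_once()
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()
```

Requests only ever call get(), so there is no moment at which many of them discover an expired entry and rush to the database together.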
I'm implementing a PAS plugin that handles authentication against mailservers. Currently only DBMail is implemented.
I realized that the enumerateUsers function from the PAS plugin is called numerous times per request and requires my plugin to open/close an SQL connection for every (subsequent) request. Of course, this is very expensive.
The connections themselves are handled in a Plone tool, which is able to handle multiple different mailservers and delegates the enumerateUsers call to wrapper objects that represent registered servers.
My question is now, what sort of cache (OOBTree, Session?) I should use to provide a temporary local storage for repeating enumerations and avoid subsequent SQL connections?
Another idea was to hook into the user creation process that takes place on the first login of an external user and completely "localize" the users.
A third idea was to store the needed data on the specific member, if possible.
What would be best practice here?
I'd cache the query results, indeed. You need to make a decision on how long to cache the results, and if stored long term, how to invalidate that cache or check for changes.
There are no best practices for these decisions, as they depend entirely on the type of data stored and the APIs of the backends. If they support some kind of freshness query, for example, then you store everything forever and poll the backend to see if the cache needs updating.
You can start with a simple request cache; query once per request, store it on the request object. Your cache will automatically be invalidated at the end of the request as the request object is cleaned up, the next request will be a clean slate.
If your backend users rarely change, you can cache information for longer, in a local cache. I'd use a volatile attribute on the plugin. Any attribute starting with _v_ is ignored by the persistence machinery. Thus, anything stored in a _v_ volatile attribute is both thread-local and only exists for the lifetime of the process; a restart of the server clears these automatically.
At the very least you should use a _v_ volatile attribute to store your backend SQL connections. That way they can stay open between requests, and can be re-used. Something like the following method would do nicely:
    def _connection(self):
        # Return a backend connection
        if getattr(self, '_v_connection', None) is None:
            # Create connection here
            self._v_connection = yourdatabaseconnection
        return self._v_connection
You could also use a persistent attribute on your plugin to store your cache. This cache would be committed to the ZODB and persist across restarts. You then really need to work out how to invalidate the contents; store timestamps and evict data when too old, etc.
Your cache data structure depends entirely on your application needs. If you don't persist information, a dictionary (username -> information) could be more than enough. Persisted caches could benefit from using an OOBTree instead of a dictionary, as it reduces the chance of conflicts between different threads and is more efficient when it comes to large sets of data.
Whatever you do, you do not need to use a Session. Sessions are prone to conflicts, do not scale well, and are in any case not the place to store a cache of this kind.
Consider the following two methods, written in pseudo code, that fetch a complex data structure and update it, respectively:
    getData(id) {
        if(isInCache(id)) return getFromCache(id) // already in cache?
        data = fetchComplexDataStructureFromDatabase(id) // time consuming!
        setCache(id, data) // update cache
        return data
    }
    updateData(id, data) {
        storeDataStructureInDatabase(id, data)
        clearCache(id)
    }
In the above implementation, there is a problem with concurrency, and we might end up with outdated data in the cache: consider two parallel executions running getData() and updateData(), respectively. If the first execution fetches data from the cache exactly in between the other execution's call to storeDataStructureInDatabase() and clearCache(), then we will get an outdated version of the data. How would you get around this concurrency problem?
I considered the following solution, where the cache is invalidated just before data is committed:
    storeDataStructureInDatabase(id, data) {
        executeSql("UPDATE table1 SET...")
        executeSql("UPDATE table2 SET...")
        executeSql("UPDATE table3 SET...")
        clearCache(id)
        executeSql("COMMIT")
    }
But then again: if one execution reads the cache in between the other execution's call to clearCache() and COMMIT, then outdated data will be fetched into the cache. Problem not solved.
In the cache way of thinking you cannot prevent retrieving outdated data.
For example, when someone starts sending an HTTP request (if your application is a web application) that will later render the cache invalid, should we consider the cache invalid when the POST request starts? When the request is handled by your server? When you start the controller code? Well, no. In fact the cache is invalid only when the database transaction ends: not when the transaction starts, only at the end, in the COMMIT phase. And any worker process working with the previous data has very little chance of being aware that the data has changed. In a web application, what about HTML pages showing outdated data in a browser: do you want to flush these pages too?
But let's assume your parallel processes are not just there for the web, but are real concurrency-critical parallel jobs.
One problem is that your cache is not handled by the database server, so it does not take part in the transaction's COMMIT/ROLLBACK. You cannot decide to clear the cache first and rebuild it if you roll back. So you can only clear and rebuild the cache after the transaction is committed.
And that leads to the possibility of getting an outdated version from the cache if your get comes between the database commit and the cache clear instruction. So:
Is it really important that you have an outdated version of the cache? Let's say your parallel process retrieved the data just a few milliseconds before the new version was available (so it has the old one), works with it for maybe 40 ms, and then builds a final report on it without noticing that the cache was flushed 15 ms before the end of the work. If your process's response cannot contain any outdated data, then you'll have to check data validity before outputting it (so you should recheck at the end that all data used in the work process is still valid).
So if you don't want to recheck data validity, your process should take some lock (a semaphore?) when starting and release it only at the end of the work; you are serializing your work. Databases can speed up serialization by running transactions at pseudo-serialization levels and aborting your transaction if any change makes this pseudo-serialization hazardous. But here you're not only working with a database, so you have to do the serialization on your own side.
Process serialization is slow, but you may try to do the same as the database: run jobs in parallel and invalidate any running job when data is altered (so you need something that detects your cache clear and kills and reruns all existing parallel jobs, which implies you have something supervising all the parallel jobs).
Or simply accept that you can have slightly outdated data. If we are talking about a web application, by the time your response has travelled over TCP/IP to the client browser it may already be invalid.
Chances are that you will accept working with outdated cache data. The only really important point is that if you cannot trust your cache data for a really critical thing, then you shouldn't use a cache for that, for example if you are manipulating accounting data. The only way to get a serialization of parallel tasks is to do the following:
In the writing process: do all the important reads (the ones that will lead to writes) and all the writes in a transaction with a high isolation level (level 4) and with all necessary row locks. That's hard to do when working only with a database; it's nearly impossible if you add an external cache for read operations.
In the parallel read process: do what you want (read from the external cache), as long as the read data won't be used for write operations. If some of the read data will later be used for a write operation, its validity will have to be checked in the write transaction (so in the writing process). Why not add a timestamp watermark to the data, so that when it comes back for a write operation you are able to know whether it is still valid?
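The watermark idea can be sketched with a version column that is checked inside the write itself. This example uses an in-memory SQLite database purely so that it is runnable; the account table, its columns and the function names are made up for illustration, and an integer version stands in for the timestamp.

```python
import sqlite3
from typing import Tuple


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "id INTEGER PRIMARY KEY, balance INTEGER NOT NULL, version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO account VALUES (1, 100, 1)")
    conn.commit()
    return conn


def read_account(conn: sqlite3.Connection, account_id: int) -> Tuple[int, int]:
    """Return (balance, version); this pair is what a cache would hold."""
    return conn.execute(
        "SELECT balance, version FROM account WHERE id = ?", (account_id,)
    ).fetchone()


def write_if_unchanged(conn: sqlite3.Connection, account_id: int,
                       new_balance: int, seen_version: int) -> bool:
    """Apply the write only if nobody changed the row since seen_version."""
    cursor = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, seen_version),
    )
    conn.commit()
    return cursor.rowcount == 1   # 0 rows updated means the cached copy was stale
```

A reader may take (balance, version) from the cache as freely as it likes; staleness only matters at write time, where a False result tells the caller to re-read from the database and retry.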