what is the best strategy to sync data between DB and redis cache - caching

We are using Oracle db, we would like to use Redis Cache mechanism, We add some subset of DB data to cache, does it sync with DB automatically when there is a change in the data in DB or we will have to implement the sync strategy, if yes, what is the best way to do it.

does it sync with DB automatically when there is a change in the data in DB
No, it doesn't.
we will have to implement the sync strategy, if yes, what is the best way to do it.
This will depend on your particular case. Usually caches are sync'd in two common ways:
Data cached with expiration. Once cached data has expired, a background process adds fresh data to cache, and so on. Usually there's data that will be refreshed in different intervals: 10 minutes, 1 hour, every day...
Data cached on demand. When an user requests some data, that request goes through the non-cached road, and that request stores the result in cache, and a limited number of subsequent requests will read cached data directly if cache is available. This approach can fall into #1 one too in terms of cache invalidation interval.
Now I believe that you've enough details to think about what could be your best strategy in your particular case!

Additionally to what mathias wrote, you can look ath the problem from dynamic/static perspective:
Real/Time approach: each time a process changes the DB data, you dispatch an event or a message to a queue where a worker handles corresponding indexing of the cache. Some might event implement it as a DB Trigger (I don't like)
Static/delayed approach: Once a day/hour/minute.. depending on your needs there is a process that does a batch/whole indexing of the DB data to the cache.

Related

Springboot Ehcache load data

Service using SpringBoot, Maven, MongoDB, Ehcache.
Service requires a fast and frequently cache server, so eventually, I chose Ehcache.
All the cache will be called almost at the same frequency so there are no hot cold data in this case.
The original data in MongoDB will be updated every day by a timer service, so what I do is to load all the updated data to Ehcache every day.
Each item in this data has a connection with each other, like you use one to find the relevant Ids of the other. So if one cache is updated, but the other one hasn't, then you can't find these relevant Ids. I want to avoid this situation.
So my question is, is there any way to achieve a function like this, like using two Ehcache servers or something? i.e. When one is in use, the other one can load the data from MongoDB. When the update is done, switch it to the updated one. So every day when the MongoDB data updated, and I have to update the Ehcache data, it won't influence my current cache. That's just a thought I have. Another thought is something like a SQL transaction. Is there any other way to achieve this.
Please advise.
Good question. I see two ways.
One is to use an application lock. When you are ready to reload the cache, you block access to it and do it. There is no way to clear all caches are the same time. The problem is that everything will be blocked during the update.
The other way is to use an other cache. So you load the new cache with the new data and then swap the new cache and the expired one. The problem with this solution is that at a given moment you will take twice the memory since both caches are in memory.

How would Redis get to know if it has to return cached data or fresh data from DB

Say, I'm Fechting thousands or record using some long runing task from DB and caching it using Redis. Next day somebody have changed few records in DB.
Next time how redis would know that it has to return cached data or again have to revisit that all thousands of records in DB?
How this synchronisation achived?
Redis has no idea whether the data in DB has been updated.
Normally, we use Redis to cache data as follows:
Client checks if the data, e.g. key-value pair, exists in Redis.
If the key exists, client gets the corresponding value from Redis.
Otherwise, it gets data from DB, and sets it to Redis. Also client sets an expiration, say 5 minutes, for the key-value pair in Redis.
Then any subsequent requests for the same key will be served by Redis. Although the data in Redis might be out-of-date.
However, after 5 minutes, this key will be removed from Redis automatically.
Go to step 1.
So in order to keep your data in Redis update-to-date, you can set a short expiration time. However, your DB has to serve lots of requests.
If you want to largely decrease requests to DB, you can set a large expiration time. So that, most of time, Redis can serve the requests with possible staled data.
You should consider carefully about the trade-off between performance and staled data.
Since the source of truth resides on your Database and you push data from this DB to Redis, you always have to update from DB to Redis, at least you create another process to sync data.
My suggestion is just to run a first full update from DB to Redis and then use a synch process which every time you notice update/creation/deletion operation in your database you pull it to Redis.
I don't know which Redis structure are you using to store database records in Redis but I guess it could be a Hash, probably indexed by your table index so the sync operation will be immediate: if a record is created in your database you set a HSET, if deletion HDEL and so on.
You even could omit the first full sync from DB to Redis, and just clean Redis and start the sync process.
If you cannot do the above for some reason you can create a syncher daemon which constantly read data from the database and compare them with the data store in Redis if they are different in somehow you update or if they don't exist in some of both sides you can delete or create the entry in Redis.
My solution is:
When you are updating, deleting or adding new data in database, you should delete all data in redis. In your get route, you should check if data exists. If not, you should store all data to redis from db.
you may use #CacheEvict on any update/delete applied on DB. that would clear up responding values from cache, so next query would get from DB

Will Caching be useful when we need multiple items in one go

We are working on a ecom site, where admin can store some configuration on the combination of Product-Category-manufacturer or on Product-Category.
We have some reports, which can return 10000 Product's transactions (with 100-1000 unique combination of product-category-manufacturer ).
In this report, we also need to use configuration as well.
One option could be to fetch configurations from the same stored procedure for all unique Product-Category-manufacturer.
Another option could be to cache all these combination in some outproc cache (like redis). And once transaction data is fetched from stored procedure, system will pull the data from cache for all 1000 Product-Category-Feature combinations. But in this case, we will have to request cache 1000 times and if some of keys are not found in cache, we will have to hit database.
In fact there can be some combination where data does not exist in database. If we request for these combination, system will not find it in cache, and it will have to hit database every-time. To resolve this, we will have to form a set of all the Product-Category-Feature combination where there is data available in cache.
Could anybody suggest that if cache will be useful in this case?
We use caching mainly in 2 occasions,
To Reduce latency: Cache is closer to the client it takes less time for the resource to reach the client.
To Reduce network traffic: Most of the time we see that some resources are reusable but always fetch from original source which
is costly and make more unnecessary traffic. Adding a cache layer
solves this.
So to answer your question, "Will Caching be useful when we need multiple items in one go?" You have to think on the above 2 points. How much you are reusing (cache hit percentage). And cost difference between cache call and call to original source.
If your issue is getting 1000 items at once, Redis don't have issue providing that. It will be so much faster than the transnational DB. And you can have set of all the Product-Category-Feature combinations, its better as we will no have cache misses. However think about the size of the Redis DB, before you proceed.

How is memcached updated?

I have never used memcached before and I am confused on the following basic question.
Memcached is a cache right? And I assume we cache data from a DB for faster access. So when the DB is updated who is responsible to update the cache? Our code is does memcached "understand" when the DB has been updated?
Memcached is a cache right? And I assume we cache data from a DB for
faster access
Yes it is a cache, but you have to understand that a cache speed up the access when you are often accessing same data. If you access thousand times data/objects which are always different each other a cache doesn't help.
To answer your question:
So when the DB is updated who is responsible to update the cache?
Always you but you don't have to worry about if you are doing the right thing.
Our code is does memcached "understand" when the DB has been updated?
memcached doesn't know about your database. (actually the client doesn't know even about servers..) So when you use an object of your database you should check if is present in cache, if not you put in cache otherwise you are fine.. that is all. When the moment comes memcache will free the memory used by old data, or you can tell memcached to free data after a time you choose(read the API for details).
You are responsible to update the cache (or some plugin).
What happens is that the query is compressed to some key features and these are hashed. This is tested against the cache. If the value is in the cache, the data is returned directly from cache. Otherwise the query is performed, stored in cache and returned to the user.
In pseudo code:
key = query_key(your_sql_query)
if key in cache:
return cache.get(key)
else:
results = execute(your_sql_query)
cache.set(key, results, time_to_live)
return results.
The cache is cleared once in a while, you can give a time to live to a key, then your cached results are refreshed.
This is the most simple model, but can cause some inconsistencies.
One strategy is that if your code is also the only app that updates data, then your code can also refresh memcached as a second step after it has updated the database. Or at least evict the stale data from memcached, so the next time an app wants to read it, it will be forced to re-query the current data from the database and restore that latest data to memcached.
Another strategy is to store data in memcached with an expiration time, so memcached automatically purges that data element after a certain time. You pick the expiration time, based on your knowledge of how frequently the data might be updated, and how tolerant your app is of reading stale data.
So ultimately, you are the one responsible for putting data into memcached. Only you know what data is worth storing in the cache, what format you want to store it in, how frequently you expect to query it, and when to refresh it. You make this judgment on a case-by-case basis, because you know better than any automatic system the likely behavior of your data and your app.

How to avoid database query storms using cache-aside pattern

We are using a PostgreSQL database and AppFabric Server, running a moderately busy ASP.NET MVC e-commerce site.
Following the cache-aside pattern we request data from our cache, and if it is not available, we query the database.
This approach results in 'query storms' where the database recieves multiple queries for the same data in a short space of time, while a given object in the cache is being refreshed. This issue is exacerbated by longer running queries, and obviously multiple requests for the same data can cause the query to run longer, forming an unpleasant feedback loop.
One solution to this problem is to use read-locking on the cache. However this can itself cause performance issues in a web farm situation (or even on a single busy web server) as web servers are blocked on reads for no reason, in case there is a database query taking place.
Another solution is to drop the cache-aside pattern and seed the cache independently. This is the approach we have taken to mitigate the immediate issues we are seeing with this problem, however it is not possible with all data.
Am I missing something here? And what other approaches have people taken to avoid this behaviour?
Depending on the number of servers you have and your current cache architecture it may be worthwhile to evaluate adding a server-level (or in-process) cache as well. In effect you use this as a fallback cache, and it's especially helpful where hitting the primary storage (database) is either very resource intensive or slow.
When I've used this I've used the cache-aside pattern for the primary cache and a read-through design for the secondary--in which the secondary is locking and ensures the database isn't over-saturated by the same request. With this architecture a primary cache-miss results in at most one query per entity per server (or process) to the database.
So the basic workflow is:
1) Try to retrieve from primary / shared cache pool
* If successful, return
* If unsuccessul, continue
2) Check in-process cache for value
* If successful, return (optionally seeding primary cache)
* If unsuccessul, continue
3) Get lock by cache key (and double-check in-process cache, in case it's been added by another thread)
4) Retrieve object from primary persistence (db)
5) Seed in-process cache and return
I've done this using injectable wrappers, my cache layers all implement the relevant IRepository interface, and StructureMap injects the correct stack of caches. This keeps the actual cache behaviors flexible, focused, and easy to maintain despite being fairly complex.
We've used AppFabric successfully with the seeding strategy you mention above. We actually do use both solutions:
Seed known data where possible (we have a limited set, so this is actually easy for us to figure out)
Within each cache access method, make sure to do look-aside as necessary, and populate cache on retrieval from data store.
The look-aside is necessary, as items may be evicted due to memory pressure, or simply because they were missed in the seeding operation. We have a "warming" service that pulses on an interval (an hour) and keeps the cache populated with the necessary data. We keep analysis on cache misses, and use that to tweak our warming strategy if we see frequent misses during the warming interval.

Resources