Redis CRDB Eviction Policy - caching

I have read in the Redis documentation that the cache eviction policy for a CRDB should be set to "No Eviction":
"Note: Geo-Distributed CRDBs always operate in noeviction mode."
https://docs.redislabs.com/latest/rs/administering/database-operations/eviction-policy/
The reasoning given is that garbage collection (eviction) might cause inconsistencies, since the two data centers sync bidirectionally.
I don't quite get this point. Can someone explain by giving a real-world problem that might occur if, say, we used an LRU eviction policy?

After doing some research, I learned that eviction is often hard to handle correctly with active-active replication. For example, if one of the masters runs out of memory and starts evicting keys to make room for new data, those deletions replicate to the other master as well, removing the keys there even though it has no memory pressure at all. Until there is really a good way to handle this, eviction is not supported.
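To make that concrete, here is a rough Python sketch (my own toy model, not Redis Enterprise internals) of two active-active replicas with bidirectional sync, where an LRU eviction on the memory-constrained site propagates as a delete and wipes a perfectly healthy key on the other site:

```python
from collections import OrderedDict

class Replica:
    def __init__(self, name, max_keys):
        self.name = name
        self.max_keys = max_keys          # crude stand-in for maxmemory
        self.data = OrderedDict()         # insertion/update order ~= LRU order
        self.peer = None

    def put(self, key, value, replicated=False):
        self.data[key] = value
        self.data.move_to_end(key)
        if not replicated and self.peer:
            self.peer.put(key, value, replicated=True)
        if not replicated:
            self._evict_if_needed()

    def _evict_if_needed(self):
        while len(self.data) > self.max_keys:
            victim, _ = self.data.popitem(last=False)   # evict least recently used
            print(f"{self.name}: evicted {victim}")
            # Under bidirectional sync the eviction propagates as a delete:
            if self.peer:
                self.peer.delete(victim, replicated=True)

    def delete(self, key, replicated=False):
        self.data.pop(key, None)
        if not replicated and self.peer:
            self.peer.delete(key, replicated=True)

dc_a = Replica("DC-A", max_keys=2)        # small, memory-pressured site
dc_b = Replica("DC-B", max_keys=1000)     # plenty of headroom
dc_a.peer, dc_b.peer = dc_b, dc_a

dc_a.put("user:1", "alice")
dc_a.put("user:2", "bob")
dc_a.put("user:3", "carol")               # forces DC-A to evict user:1
print("user:1" in dc_b.data)              # False: DC-B lost a perfectly good key
```

DC-B had room for thousands more keys, yet it still lost `user:1` because eviction on DC-A is indistinguishable from an intentional delete once it is replicated.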

Related

How does an LRU cache fit into the CAP theorem?

I was pondering this question today. An LRU cache in the context of a database in a web app helps ensure Availability with fast data lookups that do not rely on continually accessing the database.
However, how does an LRU cache stay fresh in practice? As I understand it, one cannot guarantee Consistency along with Availability. How is a frequently used item, which therefore does not expire from the LRU cache, handled when it is modified? Is this an example where, in a system that needs C over A, an LRU cache is not a good choice?
First of all, a cache too small to hold all the data (where an eviction might happen and the LRU part is relevant) is not a good example for the CAP theorem, because even without looking at consistency, it can't even deliver partition tolerance and availability at the same time. If the data the client asks for is not in the cache, and a network partition prevents the cache from getting the data from the primary database in time, then it simply can't give the client any answer on time.
If we only talk about data actually in the cache, we might somewhat awkwardly apply the CAP-theorem only to that data. Then it depends on how exactly that cache is used.
A lot of caching happens on the same machine that also holds the authoritative data. For example, your database management system (say PostgreSQL or whatever) probably caches lots of data in RAM and answers queries from there rather than from the persistent data on disk. Even then cache invalidation is a hairy problem. Even without a network, you either are OK with sometimes using outdated information (sacrificing consistency), or the caching system needs to know about data changes and act on them, which can get very complicated. Still, the CAP theorem simply doesn't apply, because there is no distribution. Or, if you want to look at it very pedantically (not the usual way of putting it), the bus the various parts of one computer use to communicate is not partition tolerant (the third leg of the CAP theorem). Put more simply: if the parts of your computer can't talk to one another, the computer will crash.
So CAP-wise the interesting case is having the primary database and the cache on separate machines connected by an unreliable network. In that case there are two basic possibilities: (1) the caching server might answer requests without asking the primary database whether its data is still valid, or (2) it might check with the primary database on every request. (1) means consistency is sacrificed. If it's (2), there is a problem the cache's design must deal with: what should the cache tell the client if it doesn't get the primary database's answer in time (because of a partition, i.e. some networking problem)? In that case there are basically only two possibilities: it might still respond with the cached data, taking the risk that it has become invalid, which sacrifices consistency; or it may tell the client it can't answer right now, which sacrifices availability.
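For illustration only, here is a small Python sketch of those two options; `primary_get` and the in-memory `cache` dict are made-up stand-ins for the primary database and the caching server, not any particular product's API:

```python
import time

cache = {}  # key -> (value, fetched_at); stand-in for the caching server

def primary_get(key, timeout=0.5):
    """Stand-in for asking the primary database; raises on a network partition."""
    raise TimeoutError("network partition")  # simulate the interesting failure case

def get_option_1(key):
    # Option (1): answer straight from the cache without validating with the
    # primary -> fast and available, but consistency is sacrificed.
    if key in cache:
        return cache[key][0]
    return primary_get(key)

def get_option_2(key, serve_stale_on_partition=False):
    # Option (2): check with the primary database on every request.
    try:
        fresh = primary_get(key)
        cache[key] = (fresh, time.time())
        return fresh
    except TimeoutError:
        if serve_stale_on_partition and key in cache:
            return cache[key][0]   # sacrifice consistency
        raise                      # sacrifice availability: no answer right now
```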
So in summary
If everything happens on one machine the CAP theorem doesn't apply
If the data and the cache are connected by an unreliable network, that is not a good example of the CAP theorem either, because even before considering C you can't get both A and P.
Still, the CAP theorem means you'll have to sacrifice C, or give up even more of A and P than a cache fails to deliver in the first place.
What exactly you end up sacrificing depends on how exactly the cache is used.

Can cache admission strategy be useful to prune distributed cache writes

Assume a distributed CRUD service that uses a distributed cache that is not read-through (just some key-value store, agnostic of the DB). So there are n server nodes connected to m cache nodes (with round-robin routing). The cache is supposed to cache data stored in a DB layer.
So the default retrieval sequence seems to be:
check if data is in cache, if so return data
else fetch from DB
send data to cache (cache does eviction)
return data
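In code, that default cache-aside sequence might look like the following sketch; `cache_client`, `db_fetch`, and the TTL value are hypothetical stand-ins for the distributed key-value cache and the DB layer:

```python
def get(key, cache_client, db_fetch, ttl=300):
    value = cache_client.get(key)        # 1. check if data is in cache
    if value is not None:
        return value                     #    hit: return data
    value = db_fetch(key)                # 2. miss: fetch from the DB
    cache_client.set(key, value, ttl)    # 3. send data to cache (cache does eviction)
    return value                         # 4. return data
```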
The question is whether the individual service nodes can be smarter about what data to send to the cache, to reduce cache capacity costs (i.e. achieve a similar hit ratio with less cache storage space).
Given recent benchmarks on eviction/admission strategies (in particular LFU), some newer caches might not even store data that is deemed too infrequently used; maybe the application nodes can make a similar best-effort guess.
So my idea is that the individual service nodes could evaluate whether data fetched from the DB should be sent to the distributed cache or not, based on an algorithm like LFU, thus reducing the network traffic between service and cache. I am thinking of local checks (which suffer from poor effectiveness on cold starts), but checks against a shared list of cached keys could also be considered.
So the sequence would be
check if data is in cache, if so return data
else fetch from DB
check if data key is frequently used
if yes, send data to cache (cache does eviction). Else not.
return data
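A sketch of that admission-gated variant, with a purely local frequency counter standing in for the "is this key frequently used" check; the threshold and the plain Counter are illustrative choices, not a tuned policy (a real implementation might use a TinyLFU-style sketch or a shared key list instead):

```python
from collections import Counter

local_frequency = Counter()
ADMISSION_THRESHOLD = 2        # admit on the 2nd+ miss for a key (arbitrary)

def get_with_admission(key, cache_client, db_fetch, ttl=300):
    value = cache_client.get(key)                  # 1. check if data is in cache
    if value is not None:
        return value
    value = db_fetch(key)                          # 2. miss: fetch from the DB
    local_frequency[key] += 1                      # 3. is this key frequently used?
    if local_frequency[key] >= ADMISSION_THRESHOLD:
        cache_client.set(key, value, ttl)          # 4. only then send data to cache
    return value                                   # 5. return data
```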
Is this possible, reasonable, has it already been done?
It is common for databases, search engines, and analytical products to guard their LRU caches with filters to avoid pollution caused by scans. For example, see Postgres' Buffer Ring Replacement Strategy and Elasticsearch's filter cache. These are admission policies detached from the cache itself, which could be replaced if the caching algorithm were more intelligent. It sounds like your idea is similar, except as a distributed version.
Most remote / distributed caches use classic eviction policies (LRU, LFU). That is okay because they are often excessively large, e.g. Twitter requires a 99.9% hit rate for their SLA targets. This means they likely won't drop recent items, because the penalty of a miss is too high and the cache is oversized so that any eviction victim is ancient.
However, that breaks down when batch jobs run and pollute the remote caching tier. In those cases, it's not uncommon to see cache population disabled to avoid impacting user requests. This is then a distributed variant of Postgres' problem described above.
The largest drawback of your idea is checking an item's popularity. This check might be local only, which has a frequent cold-start problem, or a remote call, which adds a network hop. That remote call would be cheaper than the traffic of shipping the item, but you are unlikely to be bandwidth limited anyway. Likely your goal would be to reduce capacity costs through a higher hit rate, but if your SLA requires a nearly perfect hit rate then you'll over-provision regardless. It all depends on whether the gains from reducing cache-aside population operations are worth the implementation effort. I suspect that for most they haven't been.

System Design: Global Caching and consistency

Let's take Twitter as an example. There is a huge cache which gets updated frequently. For example: person Foo tweets and has followers all across the globe. Ideally all the caches across all PoPs need to get updated, i.e. they should remain in sync.
How does replication across data centers (PoPs) work for real-time caches?
What tools/technologies are preferred?
What are the potential issues in this system design?
I am not sure there is a right/wrong answer to this, but here's my two pennies' worth of it.
I would tackle the problem from a slightly different angle: when a user posts something, that something goes into distributed storage (not necessarily a cache) that is already redundant across multiple geographies. I would also presume that, in the interest of performance, these nodes are eventually consistent.
Now the caching. I would not design a system that takes care of synchronising all the caches each time someone does something. I would rather implement caching at the service level. Imagine a small service residing in a geographically distributed cluster. Each time a user tries to fetch data, the service checks its local cache; if it is a miss, it reads the tweets from storage and puts a portion of them in a cache (subject to eviction policies). All subsequent accesses, if any, would then be served from that local cache.
In terms of design precautions:
Carefully consider the DC / AZ topology in order to ensure sufficient bandwidth and low latency
Cache at the local level in order to avoid useless network trips
Cache updates don't happen from the centre to the periphery; cache is created when a cache miss happens
Stating the obvious here: implement the right eviction policies in order to keep only the right objects in cache
The only message that should go from the centre to the periphery is a cache flush broadcast (tell all the nodes to get rid of their cache); a sketch of this follows below
I am certainly missing many other things here, but hopefully this is good food for thought.
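Picking up the cache-flush broadcast point from the list above, one possible (purely illustrative) implementation is a pub/sub channel, sketched here with redis-py; the channel name and the `local_cache` structure are made up, and each PoP is assumed to run the listener next to its local cache:

```python
import redis

CHANNEL = "cache-flush"   # hypothetical channel name

def broadcast_flush(r: redis.Redis):
    # Called from the centre when cached data must be discarded everywhere.
    r.publish(CHANNEL, "flush-all")

def run_flush_listener(r: redis.Redis, local_cache: dict):
    # Runs at each PoP; clears only the local cache when a flush arrives.
    pubsub = r.pubsub()
    pubsub.subscribe(CHANNEL)
    for message in pubsub.listen():
        if message["type"] == "message":
            local_cache.clear()
```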

Exploit caching to the fullest

I am caching the result of a method (together with its signature, obviously) so that it doesn't run a complex query against my data store every time. My caching is working perfectly.
My question is:
How should I find the optimal value of the timeout for a cache entry?
What should be the optimal number of entries in the cache?
Are there any other variables I can change to improve the performance of my application?
Treat the various factors affecting caching performance as variables: can you give me a formula to help understand how I can optimize my cache?
There are two hard problems in computer science: cache invalidation and naming things. First off, I'd be sure that you need a cache at all. It depends on what sort of datastore you're using (Redis, apparently). If it were a traditional RDBMS, then you'd be better off making sure your indexing strategy was tight first. The trouble with introducing caching is that at some point, sooner rather than later, and many times thereafter, you're going to get an inconsistent cache. Cache invalidation isn't atomic with updates to your datastore, so something is going to fire an invalidate message that fails to reach its destination, and your cache will be out of date. So be dead sure you need that caching before you introduce it.
In terms of cache timeout: the sooner the better. An hour is great, a day less so. If something gets out of sync, it will fix itself rather than causing ongoing issues. Also, if you're setting cache timeouts of a week or more, then your cache is going to start operating like a datastore all of its own; if it goes down and you have to rebuild it, you're going to take a large performance hit. So in this case less is more.
Finally, make sure you actually do set a cache timeout for everything that goes into your cache. It's all too easy with memcache to have no expiry by default, and in that case your cache really is going to start acting like a datastore. Don't let that happen; I've been there, and waiting a week for your site to recover is not fun.
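To make the last point concrete, here is a tiny guard, sketched with redis-py, that refuses to write a cache entry without an expiry; the default TTL is an arbitrary illustrative value:

```python
import redis

DEFAULT_TTL_SECONDS = 3600     # an hour, per the advice above: sooner is better

def cache_set(r: redis.Redis, key: str, value: str, ttl: int = DEFAULT_TTL_SECONDS):
    if ttl <= 0:
        raise ValueError("refusing to cache without an expiry")
    r.set(key, value, ex=ttl)  # EX guarantees the entry eventually falls out
```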

How slow is Redis when full and evicting keys? (LRU algorithm)

I am using Redis in a Java application, where I am reading log files and storing/retrieving some info in Redis for each log entry. Keys are IP addresses from my log file, which means new keys keep arriving, even though the same addresses appear regularly.
At some point, Redis reaches its maxmemory limit (3 GB in my case) and starts evicting keys. I use the "allkeys-lru" setting as I want to keep the most recent keys.
The whole application then slows down a lot, taking 5 times longer than at the beginning.
So I have three questions:
is it normal to have such a dramatic slowdown (5 times longer)? Did anybody experience such a slowdown? If not, I may have another issue in my code (improbable, as the slowdown appears exactly when Redis reaches its limit)
can I improve my config? I tried changing the maxmemory-samples setting without much success
should I consider an alternative for my particular problem? Is there an in-memory DB that could handle evicting keys with better performance? I may consider a pure Java object (HashMap...), even if it doesn't look like good design.
edit 1:
we use 2 DBs in Redis
edit 2:
We use Redis 2.2.12 (Ubuntu 12.04 LTS). Further investigation explained the issue: we are using db0 and db1 in Redis. db1 is used much less than db0, and the keys are totally different. When Redis reaches maxmemory (and the LRU algorithm starts evicting keys), it removes almost all of the db1 keys, which drastically slows down all calls. This behavior is strange, probably unusual, and maybe linked to our application. We fixed the issue by moving to another (better) memory mechanism for the keys that were loaded in db1.
Thanks!
I'm not convinced Redis is the best option for your use case.
Redis "LRU" is only a best effort algorithm (i.e. quite far from an exact LRU). Redis tracks memory allocations and knows when it has to free some memory. This is checked before the execution of each command. The mechanism to evict a key in "allkeys-lru" mode consists in choosing maxmemory-samples random keys, comparing their idle time, and select the most idle key. Redis repeats these operations until the used memory is below maxmemory.
The higher maxmemory-samples, the more CPU consumption, but the more accurate result.
Provided you do not explicitly use the EXPIRE command, there is no other overhead to be associated with key eviction.
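As a toy illustration of that sampling mechanism (my own Python model, not Redis internals), eviction picks the most idle key out of a small random sample and repeats until the store is back under its limit; the numbers are arbitrary:

```python
import random
import time

MAXMEMORY_SAMPLES = 5           # analogous to the maxmemory-samples setting
MAX_KEYS = 1000                 # stand-in for the maxmemory limit

store = {}                      # key -> value
last_access = {}                # key -> timestamp of last write/read

def touch(key, value):
    store[key] = value
    last_access[key] = time.monotonic()
    while len(store) > MAX_KEYS:
        sample = random.sample(list(store), min(MAXMEMORY_SAMPLES, len(store)))
        victim = min(sample, key=lambda k: last_access[k])   # most idle of the sample
        store.pop(victim)
        last_access.pop(victim)
```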
Running a quick test with Redis benchmark on my machine results in a throughput of:
145 Kops/s when no eviction occurs
125 Kops/s when 50% eviction occurs (i.e. 1 key out of 2 is evicted).
I cannot reproduce the 5 times factor you experienced.
The obvious recommendation to reduce the overhead of eviction is to decrease maxmemory-samples, but it also means a dramatic decrease in accuracy.
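For reference, the relevant settings can be inspected and tuned at runtime, sketched here with redis-py (assuming a standalone Redis that allows CONFIG SET; host and port are placeholders):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

print(r.config_get("maxmemory"))             # current memory cap
print(r.config_get("maxmemory-policy"))      # e.g. allkeys-lru
print(r.info("memory")["used_memory_human"]) # how close you are to the cap

# Trade accuracy for CPU: lower = cheaper eviction, higher = closer to true LRU.
r.config_set("maxmemory-samples", 5)
```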
My suggestion would be to give memcached a try. The LRU mechanism is different. It is still not exact (it applies only on a per-slab basis), but it will likely give better results than Redis for this use case.
Which version of Redis are you using? The 2.8 version (quite recent) improved the expiration algorithm and if you are using 2.6 you might give it a try.
http://download.redis.io/redis-stable/00-RELEASENOTES
