How does an LRU cache fit into the CAP theorem?

I was pondering this question today. An LRU cache in the context of a database in a web app helps ensure Availability with fast data lookups that do not rely on continually accessing the database.
However, how does an LRU cache stay fresh in practice? As I understand it, one cannot guarantee Consistency along with Availability. How is a modification handled for a frequently used item, which therefore never expires from the LRU cache? Is this an example where, in a system that needs C over A, an LRU cache is not a good choice?

First of all, a cache too small to hold all the data (where an eviction might happen and the LRU part is relevant) is not a good example for the CAP theorem, because even without looking at consistency, it can't even deliver partition tolerance and availability at the same time. If the data the client asks for is not in the cache, and a network partition prevents the cache from getting the data from the primary database in time, then it simply can't give the client any answer on time.
If we only talk about data actually in the cache, we might somewhat awkwardly apply the CAP-theorem only to that data. Then it depends on how exactly that cache is used.
A lot of caching happens on the same machine that also has the authoritative data. For example, your database management system (say PostgreSQL or whatever) probably caches lots of data in RAM and answers queries from there rather than from the persistent data on disk. Even then, cache invalidation is a hairy problem. Even without a network, you are either OK with sometimes using outdated information (sacrificing consistency), or the caching system needs to know about data changes and act on them, which can get very complicated. Still, the CAP theorem simply doesn't apply, because there is no distribution. Or, if you want to look at it very pedantically (not the usual way of putting it), the bus that the various parts of one computer use to communicate is not partition tolerant (the third leg of the CAP theorem). Put more simply: if the parts of your computer can't talk to one another, the computer will crash.
So CAP-wise the interesting case is having the primary database and the cache on separate machines connected by an unreliable network. In that case there are two basic possibilities: (1) the caching server might answer requests without asking the primary database whether its data is still valid, or (2) it might check with the primary database on every request. (1) means consistency is sacrificed. If it's (2), there is a problem the cache's design must deal with: what should the cache tell the client if it doesn't get the primary database's answer on time (because of a partition, that is, some networking problem)? In that case there are basically only two possibilities. It might still respond with the cached data, taking the risk that it might have become invalid. This is sacrificing consistency. Or it may tell the client it can't answer right now. That is sacrificing availability.
So, in summary:
If everything happens on one machine, the CAP theorem doesn't apply.
If the data and the cache are connected by an unreliable network, that is not a good example of the CAP theorem, because you don't even get A&P, even without C.
Still, the CAP theorem means you'll have to sacrifice C, or even more of A&P than the part a cache won't deliver in the first place.
What exactly you end up sacrificing depends on how exactly the cache is used.
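The choice in case (2) can be made concrete with a small sketch. Everything here is hypothetical and only for illustration: `ValidatingCache`, `PrimaryUnavailable`, and the `primary_version` callable stand in for whatever validation mechanism a real cache would use. The cache checks each entry against the primary, and a single flag decides what happens when the primary cannot be reached.

```python
class PrimaryUnavailable(Exception):
    """Raised by the (hypothetical) primary lookup when it cannot answer in time."""


class ValidatingCache:
    def __init__(self, primary_version, prefer_availability):
        # primary_version: callable(key) -> current version of the record;
        # it may raise PrimaryUnavailable during a partition (assumption).
        self._primary_version = primary_version
        self._prefer_availability = prefer_availability
        self._entries = {}  # key -> (version, value)

    def put(self, key, version, value):
        self._entries[key] = (version, value)

    def get(self, key):
        version, value = self._entries[key]
        try:
            current = self._primary_version(key)
        except PrimaryUnavailable:
            if self._prefer_availability:
                return value  # possibly stale answer: consistency is sacrificed
            raise  # no answer at all: availability is sacrificed
        if current != version:
            # The entry is outdated; drop it so the caller refetches.
            del self._entries[key]
            raise KeyError(key)
        return value
```

With `prefer_availability=True` the cache keeps answering during a partition but may return outdated values; with `False` it refuses to answer until the primary is reachable again.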

Related

Can cache admission strategy be useful to prune distributed cache writes

Assume some distributed CRUD Service that uses a distributed cache that is not read-through (just some Key-Value store agnostic of DB). So there are n server nodes connected to m cache nodes (round-robin as routing). The cache is supposed to cache data stored in a DB layer.
So the default retrieval sequence seems to be:
check if data is in cache, if so return data
else fetch from DB
send data to cache (cache does eviction)
return data
The question is whether the individual service nodes can be smarter about what data to send to the cache, to reduce cache capacity costs (achieve similar hit ratio with less required cache storage space).
Given recent benchmarks on optimal eviction/admission strategies (in particular LFU), some newer caches might not even store data that is deemed too infrequently used, so maybe application nodes can make a best-effort guess as well.
So my idea is that the individual service nodes could decide, based on an algorithm like LFU, whether data fetched from the DB should be sent to the distributed cache, thus reducing the network traffic between service and cache. I am thinking about local checks (which are less effective on cold starts), but checks against a shared list of cached keys may also be considered.
So the sequence would be
check if data is in cache, if so return data
else fetch from DB
check if data key is frequently used
if yes, send data to cache (cache does eviction). Else not.
return data
Is this possible, reasonable, has it already been done?
It is common in databases, search, and analytical products to guard their LRU caches with filters to avoid pollution caused by scans. For example, see Postgres' Buffer Ring Replacement Strategy and Elasticsearch's filter cache. These are admission policies detached from the cache itself, which could be replaced if their caching algorithm were more intelligent. It sounds like your idea is similar, except as a distributed version.
Most remote / distributed caches use classic eviction policies (LRU, LFU). That is okay because they are often excessively large, e.g. Twitter requires a 99.9% hit rate for their SLA targets. This means they likely won't drop recent items, because the penalty is too high, and they oversize the cache so that the eviction victim is ancient.
However, that breaks down when batch jobs run and pollute the remote caching tier. In those cases, it's not uncommon to see the cache population disabled to avoid impacting user requests. This is then a distributed variant of Postgres' problem described above.
The largest drawback with your idea is checking the item's popularity. This might be local only, which has a frequent cold start problem, or a remote call, which adds a network hop. That remote call would be cheaper than the traffic of shipping the item, but you are unlikely to be bandwidth limited. Likely your goal would be to reduce capacity costs through a higher hit rate, but if your SLA requires a nearly perfect hit rate then you'll over-provision anyway. It all depends on whether the gains from reducing cache-aside population operations are worth the implementation effort. I suspect that for most it hasn't been.
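As a rough illustration of the local-only variant, here is a minimal sketch of a frequency-based admission check on a service node. The names (`LocalAdmissionFilter`, `fetch`), the threshold, and the aging scheme are assumptions made for the example, and plain dictionaries stand in for the remote cache and the DB. It also shows the cold start problem: a key is never admitted on first sight.

```python
from collections import Counter


class LocalAdmissionFilter:
    """Counts key accesses locally and admits only keys seen often enough."""

    def __init__(self, threshold=2, sample_size=1000):
        self._counts = Counter()
        self._threshold = threshold
        self._sample_size = sample_size
        self._seen = 0

    def record(self, key):
        self._counts[key] += 1
        self._seen += 1
        if self._seen >= self._sample_size:
            # Age the history: halve every count so old popularity fades.
            for k in list(self._counts):
                self._counts[k] //= 2
                if self._counts[k] == 0:
                    del self._counts[k]
            self._seen = 0

    def should_admit(self, key):
        return self._counts[key] >= self._threshold


def fetch(key, cache, db, admission):
    """Cache-aside read that only populates the cache for popular keys."""
    admission.record(key)
    value = cache.get(key)
    if value is not None:
        return value
    value = db[key]
    if admission.should_admit(key):
        cache[key] = value  # only now is the item shipped to the cache
    return value
```

The first request for a key goes to the DB and skips the cache write; the write happens only once the key has been requested `threshold` times within the current sample window.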

System Design: Global Caching and consistency

Let's take the example of Twitter. There is a huge cache which gets updated frequently. For example, person Foo tweets and has followers all across the globe. Ideally, all the caches across all PoPs need to get updated, i.e. they should remain in sync.
How does replication across datacenters (PoPs) work for real-time caches?
What tools/technologies are preferred?
What are the potential issues in this system design?
I am not sure there is a right/wrong answer to this, but here's my two pennies' worth of it.
I would tackle the problem from a slightly different angle: when a user posts something, that something goes in a distributed storage (not necessarily a cache) that is already redundant across multiple geographies. I would also presume that, in the interest of performance, these nodes are eventually consistent.
Now the caching. I would not design a system that takes care of synchronising all the caches each time someone does something. I would rather implement caching at the service level. Imagine a small service residing in a geographically distributed cluster. Each time a user tries to fetch data, the service checks its local cache - if it is a miss, it reads the tweets from the storage and puts a portion of them in a cache (subject to eviction policies). All subsequent accesses, if any, would be cached at a local level.
In terms of design precautions:
Carefully consider the DC / AZ topology in order to ensure sufficient bandwidth and low latency
Cache at the local level in order to avoid useless network trips
Cache updates don't happen from the centre to the periphery; cache is created when a cache miss happens
I am stating the obvious here: implement the right eviction policies in order to keep only the right objects in cache
The only message that should go from the centre to the periphery is a cache flush broadcast (tell all the nodes to get rid of their cache)
I am certainly missing many other things here, but hopefully this is good food for thought.
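The service-level caching described above could look roughly like the following sketch. `LocalCache` and `load_from_storage` are hypothetical names, LRU is just one possible eviction policy, and how the flush broadcast actually reaches each node (message bus, pub/sub, and so on) is left out.

```python
from collections import OrderedDict


class LocalCache:
    """Per-node cache that is filled on a miss and emptied by a flush broadcast."""

    def __init__(self, load_from_storage, capacity=1000):
        # load_from_storage: callable(key) -> value read from the distributed storage
        self._load = load_from_storage
        self._capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        value = self._load(key)  # cache miss: read from storage
        self._data[key] = value
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value

    def flush(self):
        # Called when the "get rid of your cache" broadcast arrives.
        self._data.clear()
```

Nothing is ever pushed into this cache from the centre; the only inbound message it needs to handle is the flush.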

Which caching mechanism to use in my spring application in below scenarios

We are using a Spring Boot application with a MariaDB database. We get data from different services and store it in our database. When calling another service, we need to fetch data from the DB (based on a mapping) and then call the service.
So to avoid database hits, we want to cache all mapping data and use it to retrieve data and call the service API.
So our ask is: add data to the cache when it gets created in the database (this could add up to millions of records), and remove it from the cache when the value of a status column is "xyz" (for example) or based on an eviction policy.
Should we use an in-memory cache such as Hazelcast/Ehcache, or Redis/Couchbase?
Please suggest.
Thanks
I mostly agree with Rick's point of not building it until you need it. However, it is important these days to think early about where this caching layer would fit later and how to integrate it (for example, using interfaces). Adding it to an unprepared system is always possible, but much more expensive (in terms of hours) and more complicated.
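To illustrate the "using interfaces" idea, here is a small sketch (in Python for brevity, although the question is about a Spring application). All names are hypothetical: `MappingCache` is the interface the application codes against, `InMemoryMappingCache` is a trivial first implementation, and `MappingService.on_saved` applies the rule from the question (cache on create, evict when the status is "xyz").

```python
from typing import Optional, Protocol


class MappingCache(Protocol):
    """What the application needs from a cache, independent of the product."""

    def get(self, key: str) -> Optional[dict]: ...
    def put(self, key: str, value: dict) -> None: ...
    def evict(self, key: str) -> None: ...


class InMemoryMappingCache:
    """Simplest implementation; a distributed store could replace it later."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str) -> Optional[dict]:
        return self._data.get(key)

    def put(self, key: str, value: dict) -> None:
        self._data[key] = value

    def evict(self, key: str) -> None:
        self._data.pop(key, None)


class MappingService:
    def __init__(self, cache: MappingCache) -> None:
        self._cache = cache

    def on_saved(self, key: str, record: dict) -> None:
        # Hook assumed to run after a record is written to the database.
        if record.get("status") == "xyz":
            self._cache.evict(key)
        else:
            self._cache.put(key, record)
```

Because the service only depends on `MappingCache`, the in-memory class can later be swapped for an adapter around whichever product the prototype favours, without touching the service code.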
Ok to the actual question; disclaimer: Hazelcast employee
In general, for caching, Hazelcast, Ehcache, Redis and others are all good candidates. The first question you want to ask yourself, though, is: "Can I hold all necessary records in the memory of a single machine?" Especially with Ehcache you get replication (all machines hold all information), which means every single node needs to keep everything in memory. Depending on the size of what you want to cache, that may not be optimal. In this case Hazelcast might be the better option, as we partition data in a cluster and optimize access to a single network hop, with minimal overhead on top of network latency.
Second question would be around serialization. Do you want to store information in a highly optimized serialization (which needs code to transform to human readable) or do you want to store as JSON?
The third question is about the number of clients and threads that will access the data store. Obviously a local cache like Ehcache is always the fastest option, at the cost of lots and lots of memory. Apart from that, the most important factor is the threading model the in-memory store uses. It is either multithreaded and scales nicely, or it is a single-threaded design that becomes a bottleneck when you exhaust that thread. This can be overcome by running more processes, but that is a workaround to utilize today's systems to the fullest.
In more general terms, each of the systems you mentioned would do the job. The best tool, however, should be selected through a POC / prototype with your real-world use case. The important bit is "real world": a single thread performs amazingly under low pressure (obviously way faster), but when exhausted it becomes a major bottleneck (again, obviously delaying responses).
I hope this helps a bit since, at least to me, every answer like "yes we are the best option" would be an immediate no-go for the person who said it.
Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html

How is Hazelcast distributed cache faster than a call to the DB?

Let's say I have 2 servers that are using Hazelcast's distributed cache, and on server #1 I store 2 items in a map in that distributed cache. One of those items will be saved in the local backup, and the other will be stored in the backup of the other server's Hazelcast instance (please correct me if that is incorrect).
My question is: if I try to retrieve the second item from the cache (stored in the backup on server #2), a TCP call will be made to retrieve that data. How is this faster than just calling the DB?
First of all let me correct how data is stored on Hazelcast.
Hazelcast uses a distribution algorithm based on consistent hashing, meaning the hashing algorithm always returns the same output for the same input. The resulting distribution is not 100% even, but for a high number of elements it is pretty good and cost-effective. That said, with only two elements there is no guarantee that you end up with one element on each node.
By default Hazelcast also keeps one backup, which means each node will hold both elements (in a 2-node setup), either as owned data or as a backup for the failure case. You can make backups readable (read-from-backup=true); however, that introduces a slight chance of reading stale data (in the time between the owner being updated and the backup not yet being updated).
In addition, data in Hazelcast is, again by default, stored in serialized form, meaning a binary, streamable representation.
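The placement described above can be pictured with a simplified sketch. This is not Hazelcast's actual code: the hash function (`zlib.crc32`), the partition count, and the round-robin owner/backup assignment are assumptions chosen for illustration only. The key is hashed to a partition, and each partition has one owner and one backup.

```python
import zlib

PARTITION_COUNT = 271  # illustrative value, not read from any real configuration


def partition_for(key: bytes) -> int:
    """Same key always maps to the same partition (deterministic hash)."""
    return zlib.crc32(key) % PARTITION_COUNT


def assign_partitions(members):
    """Map each partition to an (owner, backup) pair, round-robin over members."""
    table = {}
    for partition in range(PARTITION_COUNT):
        owner = members[partition % len(members)]
        backup = members[(partition + 1) % len(members)]
        table[partition] = (owner, backup)
    return table


def locate(key: bytes, table):
    """Return the (owner, backup) members responsible for a key."""
    return table[partition_for(key)]
```

With two members, every partition ends up with one node as owner and the other as backup, so each node holds a copy of every entry, which matches the 2-node case described above.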
Ok so how can all this be faster than a TCP connection to your database?
The answer is twofold:
Hazelcast is a key-value store. Therefore it is optimized for requesting data by key and answering with the value as quickly as possible.
Data is already serialized, therefore the byte stream is just "smashed" into the socket without any real further work to be done.
Your database, on the other hand, has to actually query the data from a table. Its internal data structures are optimized for complex queries, not for access by key. But, and this is important, current database implementations also optimize internally (in RAM) for fast access. So the effect will only show up for databases that are under high load. Caches (local or distributed) are designed to speed up slow operations, which means: if your database is blazingly fast, you won't see a benefit.
Anyway, when designing a system you expect to grow exponentially, you should consider caching right from the start. A comprehensive introduction to caching and the ideas behind it is available in a caching whitepaper and article I wrote some time ago: https://dzone.com/articles/caching-why-you-should-care
I hope this answers your question :-)

Most efficient way to cache in a fastcgi app

For fun I am writing a FastCGI app. Right now all I do is generate a GUID and display it at the top of the page, then make a DB query based on the URL, which pulls data from one of my existing sites.
I would like to cache everything on the page except for the GUID. What is a good way of doing that? I have heard of Redis but never used it. It appears to be a server, which means it runs in a separate process. Perhaps an in-process solution would be faster? (Unless it's not?)
What is a good solution for page caching? (I'm using C++.)
Your implementation sounds like you need a simple key-value caching mechanism, and you could possibly use a container like std::unordered_map from C++11, or its boost cousin, boost::unordered_map. unordered_map provides a hash table implementation. If you needed even higher performance at some point, you could also look at Boost.Intrusive which provides high performance, standard library-compatible containers.
If you roll your cache with the suggestions mentioned, a second concern will be expiring cache entries, because of the possibility your cached data will grow stale. I don't know what your data is like, but you can choose to implement a caching strategy like any of these:
after a certain time/number of uses, expire a cached entry
after a certain time/number of uses, expire the entire cache (extreme)
least-recently used - there's a stack overflow question concerning this: LRU cache design
Multithreaded/concurrent access may also be a concern, though as suggested in the link above, a possibility would be to lock the cache on access rather than worry about granular locking.
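The combination of time-based expiry, least-recently-used eviction, and one coarse lock can be sketched as follows. This is written in Python purely for illustration (the question itself uses C++, where `std::unordered_map` plus a list would play the role of `OrderedDict`); `PageCache` and its parameters are made-up names, not part of any library.

```python
import threading
import time
from collections import OrderedDict


class PageCache:
    """In-process LRU cache with per-entry expiry, guarded by a single lock."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self._capacity = capacity
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # key -> (expires_at, value)
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            item = self._entries.get(key)
            if item is None:
                return None
            expires_at, value = item
            if self._clock() >= expires_at:
                del self._entries[key]  # entry is stale: drop it
                return None
            self._entries.move_to_end(key)  # mark as most recently used
            return value

    def put(self, key, value):
        with self._lock:
            self._entries[key] = (self._clock() + self._ttl, value)
            self._entries.move_to_end(key)
            while len(self._entries) > self._capacity:
                self._entries.popitem(last=False)  # evict least recently used
```

For the page in the question, the key would be the URL and the value the rendered page body without the GUID, which is generated fresh on every request.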
Now if you're talking about scaling, and moving up to multiple processes, and distributing server processes across multiple physical machines, the simple in-process caching might not be the way to go anymore (everyone could have different copies of data at any given time, inconsistency of performance if some server has cached data but others don't).
That's where Redis/Memcached/Membase/etc. shine: they are built for scaling and for offloading work from a database. They could be beaten in performance by a database plus an in-memory cache (there is network latency, after all, and a host of other factors), but when it comes to scaling they are very useful, take load off the database, and can serve requests quickly. They also come with features such as cache expiration (implementations differ between them).
Best of all? They're easy to use and drop in. You don't have to choose Redis/Memcached from the outset, as caching itself is just an optimization, and you can quickly switch the caching code from, say, an in-memory cache of your own to Redis or something else.
There are still some differences between the caching servers though - membase and memcache distribute their data, while redis has master-slave replication.
For the record: I work in a company where we use memcached servers - we have several of them in the data center with the rest of our servers each having something like 16 GB of RAM allocated completely to cache.
edit:
And for speed comparisons, I'll adapt something from a Herb Sutter presentation I watched long ago:
process in-memory -> really fast
getting data from a local process in-memory data -> still really fast
data from local disk -> depends on your I/O device, SSD can be fast, but mechanical drives are glacial
getting data from remote process (in-memory data) -> fast-ish, and your cache servers better be close
getting data from remote process (disk) -> iceberg

Resources