I have enabled Hazelcast Near Cache for one of my applications. In case of a cache miss from the Near Cache, Hazelcast makes a network call to look for the data.
The behaviour I am looking for is that, in case of a cache miss from the Near Cache, no network call should be made.
Any idea how to achieve that with Hazelcast?
I don't think this is possible in Hazelcast; at least I don't know of any way. What is the reason you want to prevent it from calling down to the cluster? Near Cache is a speed optimization (at the tradeoff of consistency), not a full-blown local cache. Maybe looking at the Continuous Query Cache would solve your issue, but I don't have enough information about your use case to understand your need :-)
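For reference, this is roughly how a client-side Near Cache is enabled (a hedged sketch with Hazelcast 3.x-style imports; the map name "myMap" and the TTL are placeholders). Note the behaviour the question describes: a Near Cache miss always falls through to the cluster.

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class NearCacheExample {
    public static void main(String[] args) {
        // "myMap" and the TTL are placeholders; adjust to your own map.
        NearCacheConfig nearCacheConfig = new NearCacheConfig("myMap");
        nearCacheConfig.setInvalidateOnChange(true); // drop local copies when the cluster changes them
        nearCacheConfig.setTimeToLiveSeconds(300);   // local entries expire after 5 minutes

        ClientConfig clientConfig = new ClientConfig();
        clientConfig.addNearCacheConfig(nearCacheConfig);

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
        IMap<String, String> map = client.getMap("myMap");

        // The first get() for a key goes to the cluster; repeated get()s are
        // served from the Near Cache. A Near Cache miss, however, always falls
        // through to a remote call; there is no "local only" lookup mode.
        System.out.println(map.get("some-key"));
    }
}
```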
I was pondering this question today. An LRU cache in the context of a database in a web app helps ensure Availability with fast data lookups that do not rely on continually accessing the database.
However, how does an LRU cache in practice stay fresh? As I understand it, one cannot guarantee Consistency along with Availability. How does a frequently used item, which therefore does not expire from the LRU cache, handle modification? Is this an example where, in a system that needs C over A, an LRU cache is not a good choice?
First of all, a cache too small to hold all the data (where an eviction might happen and the LRU part is relevant) is not a good example for the CAP theorem, because even without looking at consistency, it can't even deliver partition tolerance and availability at the same time. If the data the client asks for is not in the cache, and a network partition prevents the cache from getting the data from the primary database in time, then it simply can't give the client any answer on time.
If we only talk about data actually in the cache, we might somewhat awkwardly apply the CAP-theorem only to that data. Then it depends on how exactly that cache is used.
A lot of caching happens on the same machine that also holds the authoritative data. For example, your database management system (say PostgreSQL or whatever) probably caches lots of data in RAM and answers queries from there rather than from the persistent data on disk. Even then, cache invalidation is a hairy problem. Basically, even without a network, you are either OK with sometimes using outdated information (basically sacrificing consistency) or the caching system needs to know about data changes and act on them, and that can get very complicated. Still, the CAP theorem simply doesn't apply, because there is no distribution. Or, if you want to look at it very pedantically (not the usual way of putting it), the bus the various parts of one computer use to communicate is not partition tolerant (the third leg of the CAP theorem). Put more simply: if the parts of your computer can't talk to one another, the computer will crash.
So CAP-wise the interesting case is having the primary database and the cache on separate machines connected by an unreliable network. In that case there are two basic possibilities: (1) the caching server might answer requests without asking the primary database if its data is still valid, or (2) it might check with the primary database on every request. (1) means consistency is sacrificed. If it's (2), there is a problem the cache's design must deal with: what should the cache tell the client if it doesn't get the primary database's answer in time (because of a partition, that is, some networking problem)? In that case there are basically only two possibilities: it might still respond with the cached data, taking the risk that it might have become invalid; this is sacrificing consistency. Or it may tell the client it can't answer right now; that is sacrificing availability.
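To make option (2)'s dilemma concrete, here is a toy sketch (all class and method names are hypothetical, not from any real cache) of the two ways a caching server can react when the validity check against the primary database times out during a partition:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;

/** Toy cache server illustrating the C-vs-A choice under a network partition. */
class CachingServer {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final boolean preferAvailability; // true: serve possibly stale data, false: fail fast

    CachingServer(boolean preferAvailability) {
        this.preferAvailability = preferAvailability;
    }

    Optional<String> read(String key) throws TimeoutException {
        String cached = cache.get(key);
        try {
            if (primaryConfirmsStillValid(key)) {
                return Optional.ofNullable(cached);
            }
            return Optional.empty(); // primary says it changed; treat as a miss
        } catch (TimeoutException partition) {
            if (preferAvailability) {
                return Optional.ofNullable(cached); // AP: answer, possibly stale
            }
            throw partition;                        // CP: no answer rather than a wrong one
        }
    }

    // Stand-in for the round trip to the primary database; here it simulates a partition.
    private boolean primaryConfirmsStillValid(String key) throws TimeoutException {
        throw new TimeoutException("network partition");
    }
}
```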
So in summary:
If everything happens on one machine the CAP theorem doesn't apply
If the data and the cache are connected by an unreliable network, that is not a good example of the CAP theorem, because (for data not in the cache) you don't even get A&P, even leaving C aside.
Still, the CAP theorem means you'll have to sacrifice C or even more of A&P than the part a cache won't deliver in the first place.
What exactly you end up sacrificing depends on how exactly the cache is used.
Let's take Twitter as an example. There is a huge cache which gets updated frequently. For example, if person Foo tweets and has followers all across the globe, ideally all the caches across all PoPs need to get updated, i.e. they should remain in sync.
How does replication across data centers (PoPs) work for real-time caches?
What tools/technologies are preferred?
What are the potential issues in this system design?
I am not sure there is a right/wrong answer to this, but here's my two pennies' worth of it.
I would tackle the problem from a slightly different angle: when a user posts something, that something goes in a distributed storage (not necessarily a cache) that is already redundant across multiple geographies. I would also presume that, in the interest of performance, these nodes are eventually consistent.
Now the caching. I would not design a system that takes care of synchronising all the caches each time someone does something. I would rather implement caching at the service level. Imagine a small service residing in a geographically distributed cluster. Each time a user tries to fetch data, the service checks its local cache - if it is a miss, it reads the tweets from the storage and puts a portion of them in a cache (subject to eviction policies). All subsequent accesses, if any, would be cached at a local level.
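A minimal sketch of that read-through, per-PoP pattern (the `TweetStore` interface and `TimelineService` are made-up names, not a real Twitter API); it also includes the flush broadcast handler mentioned in the precautions below:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Read-through, per-PoP timeline cache: populate on miss, never push from the centre. */
class TimelineService {
    /** Stand-in for the geo-replicated, eventually consistent storage. */
    interface TweetStore {
        List<String> loadTimeline(String userId);
    }

    private final TweetStore store;
    private final Map<String, List<String>> localCache = new ConcurrentHashMap<>();

    TimelineService(TweetStore store) {
        this.store = store;
    }

    List<String> timeline(String userId) {
        // Hit: served entirely from this PoP, no cross-datacenter traffic.
        // Miss: read from storage and keep a local copy for subsequent requests.
        return localCache.computeIfAbsent(userId, store::loadTimeline);
    }

    /** The only centre-to-periphery message: drop everything and repopulate on demand. */
    void flush() {
        localCache.clear();
    }
}
```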
In terms of design precautions:
Carefully consider the DC / AZ topology in order to ensure sufficient bandwidth and low latency
Cache at the local level in order to avoid useless network trips
Cache updates don't happen from the centre to the periphery; cache is created when a cache miss happens
I am stating the obvious here: implement the right eviction policies in order to keep only the right objects in cache
The only message that should go from the centre to the periphery is a cache flush broadcast (tell all the nodes to get rid of their cache)
I am certainly missing many other things here, but hopefully this is good food for thought.
We are using a Spring Boot application with a MariaDB database. We get data from different services and store it in our database. While calling another service we need to fetch data from the DB (based on a mapping) and call that service.
So, to avoid a database hit, we want to keep all the mapping data in a cache and use it to retrieve data and call the service API.
So our ask is: add data to the cache when it gets created in the database (this could add up to millions of records) and remove it from the cache when the value of one of the columns is "xyz" (for example), or based on an eviction policy.
Should we use an in-memory cache such as Hazelcast/Ehcache, or Redis/Couchbase?
Please suggest.
Thanks
I mostly agree with Rick in terms of "don't build it until you need it"; however, it is important these days to think early about where this caching layer would fit later and how to integrate it (for example using interfaces). Adding it to an unprepared system is always possible, but much more expensive (in terms of hours) and complicated.
OK, on to the actual question; disclaimer: Hazelcast employee.
In general, for caching, Hazelcast, Ehcache, Redis and others are all good candidates. The first question you want to ask yourself, though, is: "Can I hold all necessary records in the memory of a single machine?" Especially with Ehcache you get replication (all machines hold all information), which means every single node needs to keep everything in memory. Depending on the size you want to cache, that may not be optimal. In this case Hazelcast might be the better option, as we partition data across the cluster and optimize access down to a single network hop, with minimal overhead beyond the network latency.
Second question would be around serialization. Do you want to store information in a highly optimized serialization format (which needs code to turn it back into something human-readable), or do you want to store it as JSON?
Third question is about the number of clients and threads that will access the data store. Obviously a local cache like Ehcache is always the fastest option, at the tradeoff of lots and lots of memory. Apart from that, the most important factor is the threading model the in-memory store uses: either it is multithreaded and scales nicely, or it is a single-thread design that becomes a bottleneck once you exhaust that thread. You can work around that with more processes, but it is a workaround rather than a way to utilize today's systems to the fullest.
In more general terms, each of the systems you mention would do the job. The best tool, however, should be selected by a POC / prototype against your real-world use case. The important bit is "real world": a single thread behaves amazingly under low pressure (obviously way faster), but when exhausted it becomes a major bottleneck (again, obviously delaying responses).
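As a starting point for such a POC, here is a rough sketch of the create/evict flow the question describes, using a Hazelcast IMap (3.x-style imports; the map name, key type and the "xyz" status check are placeholders taken from the question, and eviction/max-size limits would normally live in the MapConfig rather than in code):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import java.util.concurrent.TimeUnit;

public class MappingCache {
    private final IMap<Long, String> mappings;

    public MappingCache(HazelcastInstance hz) {
        // "mappings" is a placeholder map name.
        this.mappings = hz.getMap("mappings");
    }

    /** Called when a row is created in MariaDB. */
    public void onCreated(long id, String mappingJson) {
        // Optional per-entry TTL as a safety net on top of the eviction policy.
        mappings.put(id, mappingJson, 1, TimeUnit.DAYS);
    }

    /** Called when the status column changes; drop the entry once status is "xyz". */
    public void onStatusChanged(long id, String status) {
        if ("xyz".equals(status)) {
            mappings.remove(id);
        }
    }

    public static void main(String[] args) {
        MappingCache cache = new MappingCache(Hazelcast.newHazelcastInstance());
        cache.onCreated(42L, "{\"service\":\"billing\"}");
        cache.onStatusChanged(42L, "xyz");
    }
}
```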
I hope this helps a bit since, at least to me, every answer like "yes we are the best option" would be an immediate no-go for the person who said it.
Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html
I am caching the result of a method (keyed by its signature) so that it doesn't make a complex query on my data store every time. My caching is working perfectly.
My question is:
How should I find the optimal timeout value for an entry in the cache?
What should be the optimal number of entries in the cache?
Are there any other variables that I can change to improve the performance of my application?
Treating the various factors affecting caching performance as variables, is there a formula to help me understand how to optimize my cache?
There are two hard problems in computer science: cache invalidation and naming things.

First off, I'd be sure that you need a cache at all. It depends on what sort of datastore you're using (Redis, apparently). If it were a traditional RDBMS then you'd be better off making sure that your indexing strategy was tight first. The trouble with introducing caching is that at some point, sooner rather than later, and many times thereafter, you're going to get an inconsistent cache. Cache invalidation isn't atomic with updates to your datastore, so something is going to fire an invalidate message that fails to reach its destination, and your cache will be out of date. So be dead sure you need that caching before you introduce it.

In terms of cache timeout: the sooner the better. An hour is great, a day less so. If something gets out of sync then it'll fix itself rather than causing ongoing issues. Also, if you're setting cache timeouts of a week or more then your cache is going to start operating like a datastore all of its own; if it goes down and you have to rebuild it, you're going to take a large performance hit. So in this case less is more.

Finally, make sure you actually set a cache timeout for everything that goes into your cache. It's all too easy with memcache to have no expiry date by default, and in that case your cache really is going to start acting like a datastore. Don't let that happen; I've been there, and waiting a week for your site to recover is not fun.
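As a concrete illustration of bounding both knobs the question asks about (timeout and entry count), here is a sketch using the Caffeine library; the one-hour TTL and the 10,000-entry cap are placeholders to be tuned against measured hit rates, not recommendations:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.TimeUnit;

public class BoundedMethodCache {
    // Every entry has an expiry and the total size is capped, so the cache
    // can never quietly turn into a datastore of its own.
    private final Cache<String, Object> cache = Caffeine.newBuilder()
            .expireAfterWrite(1, TimeUnit.HOURS)  // timeout: sooner rather than later
            .maximumSize(10_000)                  // entry count: bounded, with size-based eviction
            .recordStats()                        // hit/miss stats to guide tuning
            .build();

    public Object lookup(String methodSignature) {
        // Load through the cache; the loader runs only on a miss.
        return cache.get(methodSignature, key -> expensiveQuery(key));
    }

    private Object expensiveQuery(String key) {
        // Stand-in for the complex datastore query being cached.
        return "result-for-" + key;
    }

    public String stats() {
        return cache.stats().toString(); // use the hit rate to adjust TTL / size
    }
}
```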
For fun I am writing a FastCGI app. Right now all I do is generate a GUID and display it at the top of the page, then make a DB query based on the URL which pulls data from one of my existing sites.
I would like to attempt to cache everything on the page except for the GUID. What is a good way of doing that? I have heard of but never used Redis, but it appears it's a server, which means it runs in a separate process. Perhaps an in-process solution would be faster? (Unless it's not?)
What is a good solution for page caching? (I'm using C++)
Your implementation sounds like you need a simple key-value caching mechanism, and you could possibly use a container like std::unordered_map from C++11, or its boost cousin, boost::unordered_map. unordered_map provides a hash table implementation. If you needed even higher performance at some point, you could also look at Boost.Intrusive which provides high performance, standard library-compatible containers.
If you roll your cache with the suggestions mentioned, a second concern will be expiring cache entries, because of the possibility your cached data will grow stale. I don't know what your data is like, but you can choose to implement a caching strategy like any of these:
after a certain time/number of uses, expire a cached entry
after a certain time/number of uses, expire the entire cache (extreme)
least-recently used - there's a stack overflow question concerning this: LRU cache design
Multithreaded/concurrent access may also be a concern, though as suggested in the link above, a possibility would be to lock the cache on access rather than worry about granular locking.
Now, if you're talking about scaling, moving up to multiple processes, and distributing server processes across multiple physical machines, simple in-process caching might not be the way to go anymore (every process could have a different copy of the data at any given time, and performance becomes inconsistent if some servers have cached data while others don't).
That's where Redis/Memcached/Membase/etc. shine: they are built for scaling and for offloading work from a database. They can be beaten on raw performance by a database plus a local in-memory cache (there is network latency, after all, and a host of other factors), but when it comes to scaling they are very useful: they save load on the database and can serve requests quickly. They also come with features like cache expiration (implementations differ between them).
Best of all? They're easy to use and drop in. You don't have to choose Redis/memcache from the outset; since caching itself is just an optimization, you can quickly switch the caching code from using, say, an in-memory cache of your own to using Redis or something else.
There are still some differences between the caching servers though - membase and memcache distribute their data, while redis has master-slave replication.
For the record: I work in a company where we use memcached servers - we have several of them in the data center with the rest of our servers each having something like 16 GB of RAM allocated completely to cache.
edit:
And for speed comparisons, I'll adapt something from a Herb Sutter presentation I watched long ago:
in-process, in-memory data -> really fast
getting data from another local process's in-memory data -> still really fast
data from local disk -> depends on your I/O device, SSD can be fast, but mechanical drives are glacial
getting data from remote process (in-memory data) -> fast-ish, and your cache servers better be close
getting data from remote process (disk) -> iceberg