How is Hazelcast distributed cache faster than a call to the DB? - caching

Lets say there I have 2 servers that are using Hazelcasts distributed cache. If on server #1, I store 2 items in a map in that distributed cache. One of those items will be saved in the local back up, and the other will be stored in the backup of the other servers Hazelcast instance(Please correct me if that is incorrect).
My question is, if I try to retrieve the second item from the cache(stored in the backup on server #2), a TCP call will be made to retrieve that data. How is this faster than just calling the DB?

First of all let me correct how data is stored on Hazelcast.
Hazelcast uses a distribution algorithm based on consistent hashing, meaning the hashing algorithm returns the same output for the same input all the time. This distribution is not 100% equal distribution but for high number of elements pretty good and cost effective. That said it doesn't mean you'll have one element on each node in the worst case.
By default Hazelcast also keeps on backup, that means each node will have both elements (in a 2 node setup), either owned data or as a backup for failure case. You can make backups readable (read-from-backup=true), however that introduces a slight chance to read staled data (time between owner is updated but backup is not yet).
In addition data in Hazelcast, again by default, is stored in serialized form, means binary streamable representation.
Ok so how can all this be faster than a TCP connection to your database?
The answer is twofold:
Hazelcast is a key-value store. Therefore it is optimized for requesting data by key and answering with the value as quickly as possible.
Data is already serialized, therefore the byte stream is just "smashed" into the socket without any real further work to be done.
Your database on the other hand has to really query data from a table. The internal data structures to hold the information is optimized for complex queries but not to access on a key base. But, and this is important, current database implementation optimize internally (in RAM) for fast access too. So the effect will only happen for databases that serve under high load. Caches (local or distributed) are designed to speed up slow operations, resulting in: if your database is blazingly fast you won't see a benefit.
Anyways designing a system you expect to grow exponentially you should consider caching right from the start. A comprehensive introduction into caching and the behind ideas is available in a caching whitepaper and article I wrote some time ago: https://dzone.com/articles/caching-why-you-should-care
I hope this answers your question :-)

Related

How does an LRU cache fit into the CAP theorem?

I was pondering this question today. An LRU cache in the context of a database in a web app helps ensure Availability with fast data lookups that do not rely on continually accessing the database.
However, how does an LRU cache in practice stay fresh? As I understand it, one cannot garuntee Consistency along with Availibility. How is a frequently used item, which therefore does not expire from the LRU cache, handle modification? Is this an example where in a system that needs C over A, an LRU cache is not a good choice?
First of all, a cache too small to hold all the data (where an eviction might happen and the LRU part is relevant) is not a good example for the CAP theorem, because even without looking at consistency, it can't even deliver partition tolerance and availability at the same time. If the data the client asks for is not in the cache, and a network partition prevents the cache from getting the data from the primary database in time, then it simply can't give the client any answer on time.
If we only talk about data actually in the cache, we might somewhat awkwardly apply the CAP-theorem only to that data. Then it depends on how exactly that cache is used.
A lot of caching happens on the same machine that also has the authoritative data. For example, your database management system (say PostgreSql or whatever) probably caches lots of data in RAM and answers queries from there rather than from the persistent data on disk. Even then cache invalidation is a hairy problem. Basically even without a network you either are OK with sometimes using outdated information (basically sacrificing consistency) or the caching system needs to know about data changes and act on that and that can get very complicated. Still, the CAP theorem simply doesn't apply, because there is no distribution. Or if you want to look at it very pedantically (not the usual way of putting it) the bus the various parts of one computer use to communicate is not partition tolerant (the third leg of the CAP theorem). Put more simply: If the parts of your computer can't talk to one another the computer will crash.
So CAP-wise the interesting case is having the primary database and the cache on separate machines connected by an unreliable network. In that case there are two basic possibilities: (1) The caching server might answer requests without asking the primary database if its data is still valid, or (2) it might check with the primary database on every request. (1) means consistency is sacrificed. If its (2), there is a problem the cache's design must deal with: What should the cache tell the client if it doesn't get the primary database's answer on time (because of a partition, that is some networking problem)? In that case there are basically only two possibilities: It might still respond with the cached data, taking the risk that it might have become invalid. This is sacrificing consistency. Or it may tell the client it can't answer right now. That is sacrificing availability.
So in summary
If everything happens on one machine the CAP theorem doesn't apply
If the data and the cache are connected by an unreliable network, that is not a good example of the CAP theorem, because you don't even get A&P even without C.
Still, the CAP theorem means you'll have to sacrifice C or even more of A&P than the part a cache won't deliver in the first place.
What exactly you end up sacrificing depends on how exactly the cache is used.

Which caching mechanism to use in my spring application in below scenarios

We are using Spring boot application with Maria DB database. We are getting data from difference services and storing in our database. And while calling other service we need to fetch data from db (based on mapping) and call the service.
So to avoid database hit, we want to cache all mapping data in cache and use it to retrieve data and call service API.
So our ask is - Add data in Cache when it gets created in database (could add up-to millions records) and remove from cache when status of one of column value is "xyz" (for example) or based on eviction policy.
Should we use in-memory cache using Hazelcast/ehCache or Redis/Couch base?
Please suggest.
Thanks
I mostly agree with Rick in terms of don't build it until you need it, however it is important these days to think early of where this caching layer would fit later and how to integrate it (for example using interfaces). Adding it into a non-prepared system is always possible but much more expensive (in terms of hours) and complicated.
Ok to the actual question; disclaimer: Hazelcast employee
In general for caching Hazelcast, ehcache, Redis and others are all good candidates. The first question you want to ask yourself though is, "can I hold all necessary records in the memory of a single machine. Especially in terms for ehcache you get replication (all machines hold all information) which means every single node needs to keep them in memory. Depending on the size you want to cache, maybe not optimal. In this case Hazelcast might be the better option as we partition data in a cluster and optimize the access to a single network hop which minimal overhead over network latency.
Second question would be around serialization. Do you want to store information in a highly optimized serialization (which needs code to transform to human readable) or do you want to store as JSON?
Third question is about the number of clients and threads that'll access the data storage. Obviously a local cache like ehcache is always the fastest option, for the tradeoff of lots and lots of memory. Apart from that the most important fact is the treading model the in-memory store uses. It's either multithreaded and nicely scaling or a single-thread concept which becomes a bottleneck when you exhaust this thread. It is to overcome with more processes but it's a workaround to utilize todays systems to the fullest.
In more general terms, each of your mentioned systems would do the job. The best tool however should be selected by a POC / prototype and your real world use case. The important bit is real world, as a single thread behaves amazing under low pressure (obviously way faster) but when exhausted will become a major bottleneck (again obviously delaying responses).
I hope this helps a bit since, at least to me, every answer like "yes we are the best option" would be an immediate no-go for the person who said it.
Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html

Cassandra client code with high read throughput with row_cache optimization

Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over? I believe row_cache_size_in_mb is supposed to cache frequently used records in memory, but setting it to say 10MB seems to make no difference.
I tried cassandra-stress of course, but the highest read throughput it achieves with 1KB records (-col size=UNIFORM\(1000..1000\)) is ~15K/s.
With low numbers like above, I can easily write an in-memory hashmap based cache that will give me at least a million reads per second for a small working set size. How do I make cassandra do this automatically for me? Or is it not supposed to achieve performance close to an in-memory map even for a tiny working set size?
Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over?
There are some solution for this scenario
One idea is to use row cache but be careful, any update/delete to a single column will invalidate the whole partition from the cache so you loose all the benefit. Row cache best usage is for small dataset and are frequently read but almost never modified.
Are you sure that your cassandra-stress scenario never update or write to the same partition over and over again ?
Here are my findings: when I enable row_cache, counter_cache, and key_cache all to sizable values, I am able to verify using "top" that cassandra does no disk I/O at all; all three seem necessary to ensure no disk activity. Yet, despite zero disk I/O, the throughput is <20K/s even for reading a single record over and over. This likely confirms (as also alluded to in my comment) that cassandra incurs the cost of serialization and deserialization even if its operations are completely in-memory, i.e., it is not designed to compete with native hashmap performance. So, if you want get native hashmap speeds for a small-working-set workload but expand to disk if the map grows big, you would need to write your own cache on top of cassandra (or any of the other key-value stores like mongo, redis, etc. for that matter).
For those interested, I also verified that redis is the fastest among cassandra, mongo, and redis for a simple get/put small-working-set workload, but even redis gets at best ~35K/s read throughput (largely independent, by design, of the request size), which hardly comes anywhere close to native hashmap performance that simply returns pointers and can do so comfortably at over 2 million/s.

Growing hash-of-queues beyond main memory limits

I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears very RAM-centric to make me feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 234 – 237 bytes of storage.
Ultimately, it all boils down on how you define efficiency needed on part of the controller -- e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
From your options, here's how I'd evaluate each option:
a) to convert this to a primarily-on-disk data structure. It could be
cached in RAM to some extent for efficiency, but jobs take tens of
seconds to complete, so it's okay if it's not that efficient,
Given the current memory hog requirement, some form of persistent storage seems a reaonsable choice. Caching comes into play if there is a repeatable access pattern, say the same queue is accessed over and over again -- otherwise, caching is likely not to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a
very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
c) using a NoSQL database with persistency support, e.g. Redis (data
structure maps over trivially, but this still appears very RAM-centric
to make me feel confident that the memory-hog problem will actually go
away)
An in-memory database like Redis doesn't solve the memory hog problem, unless you set up a cluster of machines that each holds a part of the overall data. This makes sense only if keeping all data in-memory is needed due to low response times requirements. Yet, given the nature of your jobs, taking tens of seconds to complete, response times, respective to workers, hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially using either client-side consistent-hashing or at the cluster level, thus also supporting scalability scenarios.
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
*Requirements
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requestes/per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more more data per job?)
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.
I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.
https://www.rabbitmq.com/

Duplicates from a stream

We have an external service that continuously sends us data. For the sake of simplicity lets say this data has three strings in tab delimited fashion.
datapointA datapointB datapointC
This data is received by one of our servers and then is forwarded to a processing engine where something meaningful is done with this dataset.
One of the requirements of the processing engine is that duplicate results will not be processed by the processing engine. So for instance on day1, the processing engine received
A B C, and on day 243, the same A B C was received by the server. In this particular situation, the processing engine will spit out a warning,"record already processed" and not process that particular record.
There may be a few ways to solve this issue:
Store the incoming data in an in-memory HashSet, and set exculsion
will indicate the processing status of the particular record.
Problems will arise when we have this service running with zero
downtime and depending on the surge of data, this collection can
exceed the bounds of memory. Also, in case of system outages, this
data needs to be persisted someplace.
Store the incoming data in the database and the next set of data will
only be processed if the data is not present in the database. This
helps with the durability of the history in case of some catastrophe
but there's the overhead of maintaing proper-indexes and aggressive
sharding in the case of performance related issues.
....or some other technique
Can somebody point out some case-studies or established patterns or practices to solve this particular issue?
Thanks
you need some kind of backing store, for persistence, whatever the solution. so no matter how much work that has to be implemented. but it doesn't have to be an sql database for something so simple - alternative to memcached that can persist to disk
in addition to that, you could consider bloom filters for reducing the in-memory footprint. these can give false positives, so then you would need to fall back to a second (slower but reliable) layer (which could be the disk store).
and finally, the need for idempotent behaviour is really common in messaging/enterprise systems, so a search like this turns up more papers/ideas (not sure if you're aware that "idempotent" is a useful search term).
You could create a hash of the data and store that in a backing store which would be smaller than the actual data (provided your data isn't smaller than a hash)

Resources