Implementing request-level caching for my microservice(s)

I am trying to reduce the time the application spends computing the same thing over and over again... This sounds like a caching use-case, but it may require an architectural change instead.
The situation is this: there are many callers who, independently, submit near-identical requests to my micro-service. This happens for some time (on the same order of magnitude as the time needed to service one of these requests), then they all move to a new set of near-identical requests.
I would like to try to compute each unique request only once, as much as this is feasible.
At a given time, I will get several requests to compute each of
{A, T0}, {B, T0}, {C, T0}, {A, B, T0}, {B, C, T0}, etc.
Then, my callers switch to {A, T1}, {B, T1}, etc.
While I am computing the result for the {A, T0} request on one node, the cluster will receive several other requests for the same {A, T0} computation. Even after I finish computing the result, but before the callers move to T1, I will still receive {A, T0} requests.
Also, an {A, B, T0} request can be broken down into an {A, T0} and a {B, T0} request plus a simple join.
After an individual request is computed, it should be fairly easy to cache that result and serve it to subsequent requests. It's just that most of the duplicate requests come in while the first request is being computed...
Is there any form of request-level caching that can alleviate this situation?
It does sound a bit like trying to make POSTs idempotent, which might not be doable.
The set of possible "letters" (the A, B, and C above) is known but large. The subset of "letters" that actually forms a request can change slightly (e.g., there could be an {A, C, D, T2} request at some point).
Is there a better architectural approach to this issue?
Just throwing more hardware at it would work but seems wasteful.
EDIT:
One approach that I'm considering is this:
"like" requests get routed to the same node. E.g. all {A, T0} requests go to node 12
locally, on each node I have a (LRU) cache of Request to Future<Response>
any request either listens to an existing Future or registers and executes a new one (see the sketch below)
should the node go down, the "like" requests would all be assigned to another node and the request would be processed again
Where this becomes tricky is dealing with the {A, B, T0} kind of requests. These get split into smaller requests, each of which could be processed by different nodes.
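To make the per-node cache concrete, here is a minimal Java sketch of the "Request to Future<Response>" idea: the first request for a key starts the computation, later identical requests just wait on the same Future. The Request and Response types and the compute function are placeholders, and a real implementation would bound the map (LRU) and evict entries once the callers move on to the next T.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Per-node request coalescer: the first caller for a key starts the computation,
// later callers for the same key wait on the same Future instead of recomputing.
// A production version would use a bounded LRU cache instead of a plain map, and
// would evict failed futures so errors are not cached.
class RequestCoalescer<Request, Response> {

    private final Map<Request, CompletableFuture<Response>> inFlight = new ConcurrentHashMap<>();

    CompletableFuture<Response> submit(Request request, Function<Request, Response> compute) {
        // computeIfAbsent is atomic, so only one computation is started per key
        return inFlight.computeIfAbsent(request,
                r -> CompletableFuture.supplyAsync(() -> compute.apply(r)));
    }
}
```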

One thing to mention is that your question is quite broad.
Still, I think the whole question boils down to these two:
(1) Caching a response (computed for a particular request) and serving that response, without re-computing it, to identical requests that come afterwards.
(2) Caching a computed value (produced while serving a particular request) and reusing that value, without re-computing it, to serve subsequent requests that need it as part of their computation.
And you need to do this in a multi-node system.
Yes, there are answers to both of your questions.
(1) HTTP Caching
You are doubtful about tackling the multi-node environment, but HTTP caching actually lives in intermediate servers (load balancers, CDNs, etc.) and in clients (browsers or mobile apps), not in the individual nodes.
Simply put, you configure your caching requirements so that responses are served from the cache (at the intermediate nodes) before the requests ever reach an end server node. You might need to add some code to your application as well.
Apart from simple response caching, there are lots of other out-of-the-box features. For caching purposes, you need a cache-capable server application (such as nginx) for your intermediate servers (in your scenario, presumably the load balancer). Most of them cache GET requests by default; supporting POST takes some workarounds, depending on the product you select.
You also need to configure the HTTP cache headers. It's impossible to cover every bit and piece of HTTP caching in a single answer; Google has published a really good guide on the subject, and there are other web resources too.
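The headers are the part that lives in your application code. A minimal sketch with the Servlet API (javax.servlet here; the max-age value and the way the ETag is derived are just illustrative assumptions, not a recommendation for your exact setup):

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch: mark responses as cacheable so an intermediate cache (nginx, CDN,
// browser) can serve repeats without hitting this node. Values are examples.
public class ComputeServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String requestKey = req.getParameter("key");            // e.g. "A,T0"
        if (requestKey == null) {
            requestKey = "";                                     // keep the sketch null-safe
        }
        String result = computeOrLookup(requestKey);             // placeholder

        resp.setHeader("Cache-Control", "public, max-age=60");   // cacheable for 60 seconds
        resp.setHeader("ETag", "\"" + requestKey.hashCode() + "\""); // cheap validator
        resp.setContentType("application/json");
        resp.getWriter().write(result);
    }

    private String computeOrLookup(String key) {
        return "{}"; // stand-in for the real computation
    }
}
```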
(2) A Cache DB
You can compute a particular value and store it in a cache DB that is centrally accessible to all the nodes, then add cache-lookup logic to your code before doing a particular computation.
There are lots of in-memory cache databases that serve this requirement.
E.g. Redis, Hazelcast
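A minimal cache-aside sketch, assuming the Jedis client for Redis; the host, the TTL and computeExpensiveValue() are placeholders (the setex signature varies slightly between Jedis versions):

```java
import redis.clients.jedis.Jedis;

// Cache-aside: look up the central cache first, compute and store on a miss.
// Every node runs the same logic, so a value computed by one node is reused
// by the others. Host, TTL and the compute step are placeholders.
public class CacheAside {

    public String getValue(String key) {
        try (Jedis jedis = new Jedis("cache-host", 6379)) {
            String cached = jedis.get(key);
            if (cached != null) {
                return cached;                     // cache hit
            }
            String value = computeExpensiveValue(key);
            jedis.setex(key, 300, value);          // keep it for 5 minutes
            return value;
        }
    }

    private String computeExpensiveValue(String key) {
        return "result-for-" + key;                // stand-in for the real work
    }
}
```

Note that this only helps requests that arrive after the value has been stored; the duplicates that arrive while the first computation is still running need the per-node future coalescing (or a distributed lock) from the question's edit.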
I hope this wraps up what you were looking for.

Related

How to manage repeated requests on a cached server while the result arrives

In the context of a highly requested web service written in Go, I am considering caching some computations, and I am thinking of using Redis for that.
My application is susceptible to receiving an avalanche of requests containing the same payload, each of which triggers a costly computation, so a cache would pay off and allow the result to be computed only once.
Consider the following figure extracted from here
I use this figure because I think it helps me illustrate the problem. The figure considers the two general cases: the book is in the cache, or it is not. However, it does not consider the transitory case when a book is being retrieved from the database and other "get-same-book" requests arrive. In this case, I would like to queue the repeated requests temporarily until the book is retrieved. Then, once the book has arrived, the queued requests are answered with the result, which would remain in the cache for fast retrieval by future requests.
So my question asks for approaches to implementing this requirement. I'm considering using a kind of table on the server (repository) that records the status of a database query (computing, ready), but this seems a little complicated because I would need to handle some race conditions.
So I would like to know whether anyone knows this pattern, or whether Redis itself implements it in some way (I have not found it in my research, but I suspect that using a Redis lock would make it possible).
You can design it as you have described, but there are some things that are important.
Use a unique key
Use a unique key for each book, and if the book ever changes, that key should also change. This design makes your step (6), saving the book in Redis, an idempotent operation (you can do it many times with the same result), so you avoid any race condition with "get-same-book" requests.
Idempotent requests OR asynchronous messages
I would like to queue the repeated requests temporarily until the book is retrieved. Next, once the book has already arrived, the queued requests are replied with the result
I would not recommend queueing requests as you describe. If the request is a cache miss, let it retrieve the data from the database, but design that path to be idempotent. Alternatively, you could handle all requests asynchronously and use a message queue, e.g. NATS or RabbitMQ, but the complexity grows with that solution.
Serializing requests
My problem is that during that second of computation, before the result is available, too many repeated requests can arrive, and because of the cost I need to avoid repeating their computations. I need to find a way of holding them until the result of the first request arrives.
It sounds like you want your computations serialized instead of run concurrently, because you want to avoid doing the same computation twice. To solve this, let the requests initialize the computation, e.g. by putting the input on a queue, then do the computations in serial order per key (computations for different keys can still run concurrently), and finally notify the client, or better, have the client subscribe for updates.
Redis does have support for Pub/Sub, but whether it fits depends on the requirements you have for your clients. I would recommend a solution without locks, for scalability.
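The question is about Go, but sticking with the Java/Jedis style used earlier on this page, here is a sketch of the idempotent, versioned-key write plus a "ready" notification; the key format, channel name and TTL are assumptions, not anything prescribed by Redis itself:

```java
import redis.clients.jedis.Jedis;

// Sketch of the idempotent write plus a "ready" notification. The key embeds
// the book's version, so writing it twice (by two racing requests) always
// stores the same value; waiters subscribed to the channel are told when the
// value becomes available. Key format, channel name and TTL are assumptions.
public class BookCacheWriter {

    public void storeAndAnnounce(String bookId, long version, String bookJson) {
        String key = "book:" + bookId + ":" + version;        // unique per book version
        try (Jedis jedis = new Jedis("cache-host", 6379)) {
            jedis.setex(key, 600, bookJson);                   // idempotent: same key, same value
            jedis.publish("book-ready", key);                  // wake up waiting requests
        }
    }
}
```

Requests that find neither the key nor a running computation can start one; requests that arrive in between can subscribe to the channel (or simply poll the key) instead of recomputing.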

How does an LRU cache fit into the CAP theorem?

I was pondering this question today. An LRU cache in the context of a database in a web app helps ensure Availability with fast data lookups that do not rely on continually accessing the database.
However, how does an LRU cache in practice stay fresh? As I understand it, one cannot guarantee Consistency along with Availability. How does a frequently used item, which therefore does not expire from the LRU cache, handle modification? Is this an example where, in a system that needs C over A, an LRU cache is not a good choice?
First of all, a cache too small to hold all the data (where an eviction might happen and the LRU part is relevant) is not a good example for the CAP theorem, because even without looking at consistency, it can't even deliver partition tolerance and availability at the same time. If the data the client asks for is not in the cache, and a network partition prevents the cache from getting the data from the primary database in time, then it simply can't give the client any answer on time.
If we only talk about data actually in the cache, we might somewhat awkwardly apply the CAP-theorem only to that data. Then it depends on how exactly that cache is used.
A lot of caching happens on the same machine that also has the authoritative data. For example, your database management system (say PostgreSQL, or whatever you use) probably caches lots of data in RAM and answers queries from there rather than from the persistent data on disk. Even then, cache invalidation is a hairy problem: even without a network, either you are OK with sometimes using outdated information (essentially sacrificing consistency), or the caching system needs to know about data changes and act on them, which can get very complicated. Still, the CAP theorem simply doesn't apply here, because there is no distribution. Or, if you want to look at it very pedantically (not the usual way of putting it), the bus the various parts of one computer use to communicate is not partition tolerant (the third leg of the CAP theorem). Put more simply: if the parts of your computer can't talk to one another, the computer will crash.
So CAP-wise the interesting case is having the primary database and the cache on separate machines connected by an unreliable network. In that case there are two basic possibilities: (1) the caching server answers requests without asking the primary database whether its data is still valid, or (2) it checks with the primary database on every request. (1) means consistency is sacrificed. If it's (2), there is a problem the cache's design must deal with: what should the cache tell the client if it doesn't get the primary database's answer in time (because of a partition, that is, some networking problem)? There are then basically only two options: it can still respond with the cached data, taking the risk that the data has become invalid, which sacrifices consistency; or it can tell the client it can't answer right now, which sacrifices availability.
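To make the trade-off in case (2) concrete, here is a hypothetical Java sketch of a validating cache that has to decide what to do when the primary doesn't answer in time; validateWithPrimary() and the timeout are placeholders:

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Case (2): the cache checks with the primary on every read. When the check
// times out (a partition), it must pick a side of the trade-off: return the
// possibly stale entry (availability) or refuse to answer (consistency).
class ValidatingCache {

    private final ConcurrentHashMap<String, String> entries = new ConcurrentHashMap<>();
    private final ExecutorService executor = Executors.newCachedThreadPool();
    private final boolean preferAvailability;

    ValidatingCache(boolean preferAvailability) {
        this.preferAvailability = preferAvailability;
    }

    Optional<String> read(String key, Duration timeout) {
        String cached = entries.get(key);
        Future<Boolean> check = executor.submit(() -> validateWithPrimary(key));
        try {
            boolean stillValid = check.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return stillValid ? Optional.ofNullable(cached) : Optional.empty();
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            // Partition or slow primary: choose availability (serve stale) or consistency (refuse).
            return preferAvailability ? Optional.ofNullable(cached) : Optional.empty();
        }
    }

    private boolean validateWithPrimary(String key) {
        return true; // placeholder for a round-trip to the primary database
    }
}
```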
So in summary
If everything happens on one machine the CAP theorem doesn't apply
If the data and the cache are connected by an unreliable network, that is not a good example of the CAP theorem either, because (with a cache that can't hold everything) you don't get A and P together even before considering C.
Still, the CAP theorem means you'll have to sacrifice C or even more of A&P than the part a cache won't deliver in the first place.
What exactly you end up sacrificing depends on how exactly the cache is used.

Is combining rest api calls to reduce # requests worth doing?

My server used to handle 700+ user burst and now it is failing at around 200 users.
(Users are connecting to the server almost at the same time after clicking a push message)
I think the difference is due to the change in how the requests are made.
Back then, the web server collected all the information into a single HTML response.
Now, each section of a page makes its own REST API request, resulting in probably 10+ requests.
I'm considering making an API endpoint that aggregates those requests for the pages users open when they click on a push notification.
Another solution I'm thinking of is caching those frequently used REST API responses.
Is it a good idea to combine API calls to reduce the number of requests?
It is always a good idea to reduce API calls. The optimal solution is to get all the necessary data in one go without any unused information.
This results in less traffic, fewer requests (and less load) on the server, lower RAM and CPU usage, as well as fewer concurrent DB operations.
Caching is also a great choice. You can consider both caching the entire request and separate parts of the response.
A combined API response means that there will be just one response, which reduces the request overhead (the phase where the app fires off every call), but it can increase the server-side processing time if the endpoint does everything in one thread; you can instead fan the section work out in parallel, as in the sketch below. The net effect is less traffic, but possibly a slightly slower first response.
From the user's perspective this means that if you combine everything, the page may take a little longer to show anything, but when it does, it loads up entirely.
It's a matter of finding the balance.
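As a sketch of the aggregation idea (the fetchX() methods are placeholders for whatever the individual endpoints do today), the combined endpoint can fan the sections out in parallel so the combined latency is roughly that of the slowest section, not the sum:

```java
import java.util.concurrent.CompletableFuture;

// Aggregation endpoint sketch: fetch each page section concurrently and return
// one combined payload, so one HTTP round-trip replaces 10+ without the
// sections having to run one after another.
public class PageAggregator {

    public String buildPageResponse(String userId) {
        CompletableFuture<String> header  = CompletableFuture.supplyAsync(() -> fetchHeader(userId));
        CompletableFuture<String> feed    = CompletableFuture.supplyAsync(() -> fetchFeed(userId));
        CompletableFuture<String> sidebar = CompletableFuture.supplyAsync(() -> fetchSidebar(userId));

        // join() waits for all three; total latency ~= the slowest section.
        return "{\"header\":" + header.join()
             + ",\"feed\":" + feed.join()
             + ",\"sidebar\":" + sidebar.join() + "}";
    }

    private String fetchHeader(String userId)  { return "{}"; }
    private String fetchFeed(String userId)    { return "{}"; }
    private String fetchSidebar(String userId) { return "{}"; }
}
```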
As for whether it's worth doing: it depends on your setup. You should measure the start-up time of the application and the execution time and do the math.
Another thing to consider is the amount of time this would require. There is also the option of increasing server capacity, such as creating a clustered cache and using a load balancer to split the load. You should compare the time needed for both tasks and work from there.

Distributed cache with huge objects ~1-2GB

I need to stream a huge dataset, around 1-2 GB, but only on demand as users explore the data. For example, if they don't explore parts of the data, I don't want to send those parts out.
So now, I have a solution that effectively returns JSON only for things they need. The need for a cache arises because these 1-2GB objects are actually constructed in memory from a file or files on disk, so the latency is ~30 seconds if you have to read the file to return this data.
How do I manage such a cache? Basically, I think the solution is something like ZooKeeper, where I store the name of the physical machine that holds the cache and then forward my REST request to it.
Would you also consider this to be the right model? I wonder what kind of checks I would have to do so that, if the node holding the cache goes down, I can still fulfil the request without an error, just with higher latency.
Has anybody developed such a system? All the solutions out there seem to be for small rows or objects.
https://github.com/golang/groupcache is used for bigger things, but although it's used by http://dl.google.com, I'm not sure how it would do with multi-gigabyte objects.
On the other hand, HTTP can do partial transfers and will be very efficient at that. Take a look at Varnish or nginx.
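For illustration, here is a client-side sketch of such a partial transfer using java.net.http; the URL and byte range are just examples:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: fetch only the slice of a multi-gigabyte object the user is
// currently exploring, via an HTTP Range request. A range-capable cache
// (Varnish, nginx) can serve these slices without holding the whole object
// as one cache entry. URL and byte offsets are examples.
public class RangeFetch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/datasets/big-object"))
                .header("Range", "bytes=0-1048575")            // first 1 MiB only
                .build();

        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        System.out.println("Status: " + response.statusCode()  // 206 Partial Content expected
                + ", bytes received: " + response.body().length);
    }
}
```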

What is a multi-tier cache?

I've recently come across the phrase "multi-tier cache" relating to multi-tiered architectures, but without a meaningful explanation of what such a cache would be (or how it would be used).
Relevant online searches for that phrase don't really turn up anything either. My interpretation would be a cache servicing all tiers of some n-tier web app. Perhaps a distributed cache with one cache node on each tier.
Has SO ever come across this term before? Am I right? Way off?
I know this is old, but thought I'd toss in my two cents here since I've written several multi-tier caches, or at least several iterations of one.
Consider this: every application has different layers, and at each layer a different form of information can be cached. Each cache item generally expires for one of two reasons: either a period of time has elapsed, or a dependency has been updated.
For this explanation, let's imagine that we have three layers:
Templates (object definitions)
Objects (complete object cache)
Blocks (partial objects / block cache)
Each layer depends on its parent, and we would define those relationships using some form of dependency assignment. So Blocks depend on Objects, which depend on Templates. If an Object changes, any dependent Blocks are expunged and refreshed; if a Template changes, any dependent Objects are expunged, in turn expunging their Blocks, and all are refreshed.
There are several benefits. Long expiry times are a big one, because the dependencies ensure that downstream resources are updated whenever their parents are updated, so you won't serve stale cached resources. Block caches alone are a big help because, short of whole-page caching (which requires AJAX or Edge Side Includes to avoid caching dynamic content), blocks are the closest elements to an end user's browser/interface and can save boatloads of pre-processing cycles.
The complication in a multi-tier cache like this, though, is that it generally can't rely on purely DB-based foreign-key expunging, unless each tier has a 1:1 relationship to its parent (i.e. a Block relies on a single Object, which relies on a single Template). You'll have to handle the expunging of dependent resources programmatically, either via stored procedures in the DB or in your application layer if you want to work with the expunging rules dynamically.
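As a minimal sketch of that programmatic expunging (the key scheme and in-memory storage are simplifications; each tier could just as well be a separate distributed cache):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Dependency-aware invalidation across tiers: evicting a Template evicts the
// Objects built from it, which in turn evicts their Blocks.
class MultiTierCache {

    private final Map<String, Object> entries = new ConcurrentHashMap<>();
    // parent key -> keys of entries that were built from it
    private final Map<String, Set<String>> dependents = new ConcurrentHashMap<>();

    void put(String key, Object value, String parentKey) {
        entries.put(key, value);
        if (parentKey != null) {
            dependents.computeIfAbsent(parentKey, k -> ConcurrentHashMap.newKeySet()).add(key);
        }
    }

    Object get(String key) {
        return entries.get(key);
    }

    // Expunge an entry and, recursively, everything that depends on it.
    void invalidate(String key) {
        entries.remove(key);
        Set<String> children = dependents.remove(key);
        if (children != null) {
            children.forEach(this::invalidate);
        }
    }
}
```

For example, after put("template:product", t, null), put("object:42", o, "template:product") and put("block:42-summary", b, "object:42"), calling invalidate("template:product") clears all three entries.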
Hope that helps someone :)
Edit: I should add, any one of these tiers can be clustered, sharded, or otherwise in a scaled environment, so this model works in both small and large environments.
After playing around with EhCache for a few weeks, it is still not perfectly clear what they mean by the term "multi-tier" cache. I will follow up with what I interpret to be the implied meaning; if at any point down the road someone comes along who knows otherwise, please feel free to answer and I'll remove this one.
A multi-tier cache appears to be a replicated and/or distributed cache that lives on 1+ tiers in an n-tier architecture. It allows components on multiple tiers to gain access to the same cache(s). In EhCache, using a replicated or distributed cache architecture in conjunction with simply referring to the same cache servers from multiple tiers achieves this.
