When to use Redis and when to use DataLoader in a GraphQL server setup - caching

I've been working on a GraphQL server for a while now and although I understand most of the aspects, I cannot seem to get a grasp on caching.
When it comes to caching, I see both DataLoader mentioned as well as Redis but it's not clear to me when I should use what and how I should use them.
I take it that DataLoader is used more on a field level to counter the n+1 problem? And I guess Redis is on a higher level then?
If anyone could shed some light on this, I would be most grateful.
Thank you.

DataLoader is primarily a means of batching requests to some data source. However, it also optionally caches results on a per-request basis. This means that, while executing a single GraphQL query, you only ever fetch a particular entity once. For example, we can call load(1) and load(2) concurrently and these will be batched into a single request that fetches the two entities matching those IDs. If another field calls load(1) later on while executing the same request, that call will simply return the entity with ID 1 we fetched previously, without making another request to our data source.
DataLoader's cache is specific to an individual request. Even if two requests are processed at the same time, they will not share a cache. DataLoader's cache does not have an expiration -- and it has no need to since the cache will be deleted once the request completes.
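To make the batching and per-request caching concrete, here is a minimal Python sketch of the idea. It is not the real DataLoader library: TinyLoader, fetch_users and handle_request are hypothetical names made up for illustration, and error handling is left out.

```python
import asyncio


class TinyLoader:
    """Simplified, hypothetical stand-in for DataLoader (not the real library)."""

    def __init__(self, batch_fn):
        self._batch_fn = batch_fn  # async: list of keys -> list of values, same order
        self._cache = {}           # key -> Future; lives only as long as this loader
        self._pending = []         # keys collected for the next batch
        self._task = None          # reference kept so the batch task is not dropped

    def load(self, key):
        # Repeated keys reuse the same future, so each entity is fetched once.
        if key in self._cache:
            return self._cache[key]
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._cache[key] = future
        if not self._pending:
            # The task starts only after the current code yields, so every
            # load() call made before that point lands in the same batch.
            self._task = loop.create_task(self._dispatch())
        self._pending.append(key)
        return future

    async def _dispatch(self):
        keys, self._pending = self._pending, []
        values = await self._batch_fn(keys)
        for key, value in zip(keys, values):
            self._cache[key].set_result(value)


batches = []  # records every call made to the (fake) data source


async def fetch_users(ids):
    batches.append(list(ids))
    return [{"id": i} for i in ids]


async def handle_request():
    loader = TinyLoader(fetch_users)  # one loader per incoming request
    first, second = await asyncio.gather(loader.load(1), loader.load(2))
    again = await loader.load(1)      # served from the loader's own cache
    return first, second, again
```

Running asyncio.run(handle_request()) calls fetch_users once with both IDs. Because a new loader is created for each request, nothing is shared between requests and no expiration is needed.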
Redis is a key-value store that's used for caching, queues, PubSub and more. We can use it to provide response caching, which would let us effectively bypass the resolver for one or more fields and use the cached value instead (until it expires or is invalidated). We can use it as a cache layer between GraphQL and the database, API or other data source -- for example, this is what RESTDataSource does. We can use it as part of a PubSub implementation when implementing subscriptions.
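As a rough sketch of the "cache layer between GraphQL and the data source" use, the function below follows the common cache-aside pattern. It assumes a client exposing get(name) and setex(name, time, value), which is the shape redis-py offers. InMemoryCache is a hypothetical stand-in so the example runs without a Redis server, and cached_fetch is an illustrative name, not a library function.

```python
import json


class InMemoryCache:
    """Stand-in with the same get/setex shape as a redis-py client."""

    def __init__(self):
        self._data = {}

    def get(self, name):
        return self._data.get(name)

    def setex(self, name, time, value):
        # A real Redis server would expire the key after `time` seconds;
        # this stand-in ignores the TTL.
        self._data[name] = value


def cached_fetch(cache, key, ttl_seconds, fetch):
    """Return the cached value for key, calling fetch() only on a miss."""
    raw = cache.get(key)
    if raw is not None:
        return json.loads(raw)
    value = fetch()
    cache.setex(key, ttl_seconds, json.dumps(value))
    return value
```

Unlike DataLoader's cache, entries stored this way are shared by all requests (and all server instances pointing at the same Redis), which is why they need a TTL or explicit invalidation.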
DataLoader is a small library used to tackle a particular problem, namely generating too many requests to a data source. The alternative to using DataLoader is to fetch everything you need (based on the requested fields) at the root level and then letting the default resolver logic handle the rest. Redis is a key-value store that has a number of uses. Whether you need one or the other, or both, depends on your particular business case.

Related

Does storing another service's data violate the Single Responsibility Principle of Microservice

Say I have a service that manages warehouses (which is not updated very frequently). I have a sales service that requires the list of stores (to search through and use as necessary). If I get the list of stores from the store service and save it (let's say in Redis) inside my sales service, but ensure that Redis is updated if the list of stores changes, would that violate the single responsibility principle of microservice architecture?
No, it does not. It is actually quite a common approach in microservice architecture for a service to store a copy of related data from other services and use some mechanism to keep it in sync (usually asynchronous communication via a message broker).
Storing a copy of the data does not transfer ownership of that data away from the service that manages it.
It is common, and there is a microservice pattern that covers it (CQRS).
If you need some information from other services / microservices to join with your data, then you need to store that information.
Whenever you decide whether to always issue requests against the downstream system or to use a local copy, you are basically making a trade-off between performance and data freshness.
If you always issue RPC calls, then you prefer data freshness over performance.
How often you need to issue RPC calls has a direct impact on performance.
If you use caching to gain performance, then there is a chance of using stale data (depending on your business, this might be okay or unacceptable).
Cache invalidation is a pretty tough problem domain, so it can cause headaches.
Caching one microservice's data does not violate data ownership, because caching only reads the data; it does not delete or update the existing records. It is similar to a single-leader (master), multiple-followers setup or a read-write lock. As long as there is only one place where data can be created, modified or deleted, data ownership is implemented correctly.
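A minimal sketch of such a read-only local copy, assuming the owning service publishes change events through a message broker. The class name, the event types and the event fields are all hypothetical; they only illustrate the shape of the idea.

```python
class StoreReplica:
    """Read-only local copy of store data kept inside the sales service."""

    def __init__(self):
        self._stores = {}  # store_id -> store data as published by the owner

    def apply(self, event):
        # Events are assumed to come from the owning service; the replica
        # never creates or edits store data on its own.
        store_id = event["store_id"]
        if event["type"] == "store_deleted":
            self._stores.pop(store_id, None)
        else:  # "store_created" or "store_updated"
            self._stores[store_id] = event["data"]

    def search(self, city):
        # Local reads need no RPC call, at the cost of possibly stale data.
        return [s for s in self._stores.values() if s["city"] == city]
```

The sales service only ever calls search(); all writes still go through the owning service, so ownership stays in one place.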

Apollo Server (GraphQL): Using batching together with caching in REST APIs is not recommended, why?

The Apollo Server documentation states that batching and caching should not be used together with REST API data sources:
Most REST APIs don't support batching. When they do, using a batched
endpoint can jeopardize caching. When you fetch data in a batch
request, the response you receive is for the exact combination of
resources you're requesting. Unless you request that same combination
again, future requests for the same resource won't be served from
cache.
We recommend that you restrict batching to requests that can't be
cached. In these cases, you can take advantage of DataLoader as a
private implementation detail inside your RESTDataSource [...]
Source: https://www.apollographql.com/docs/apollo-server/data/data-sources/#using-with-dataloader
I'm not sure why they say: "Unless you request that same combination again, future requests for the same resource won't be served from cache."
Why shouldn't future requests be loaded from cache again? I mean, here we have two caching layers. The first is the DataLoader, which batches requests and memoizes, with a per-request cache, which objects are requested, returning the same object from its cache if it is requested multiple times within the request.
The second is a second-level cache that caches individual objects across multiple requests (or at least it could be implemented so that it caches the individual objects rather than the whole result set).
Wouldn't that ensure that future requests are served from the second-level cache if the overall request changes but includes some of the objects that were requested previously?
Many REST APIs implement some sort of request caching for GET requests based on URLs. When you request an entity from a REST endpoint a second time, the result can be returned faster.
For example, let's imagine a fictional API, "Weekend City Trip".
Your GraphQL API fetches the three largest cities around you and then checks the weather in these cities on the weekend. In this fictional example you receive two requests. The first request is from someone in Germany. You find the three largest cities around them: Cologne, Hamburg and Amsterdam. You can now call the weather API either in a batch or one by one.
/api/weather/Cologne
/api/weather/Hamburg
/api/weather/Amsterdam
or
/api/weather/Cologne,Hamburg,Amsterdam
The next person is in Belgium and we find Cologne, Amsterdam and Brussels.
/api/weather/Cologne
/api/weather/Amsterdam
/api/weather/Brussels
or
/api/weather/Cologne,Amsterdam,Brussels
Now, as you can see, without batching we have requested some URLs twice. The API provider can use a CDN to return these results quickly without straining their application infrastructure. And since you are probably not the only one using the API, all these URLs might already be cached in the first place, meaning you will receive responses much faster. The number of possible batch URLs, on the other hand, grows massively with the number of cities offered and the number of cities per batch. If the API provides only 1000 cities, there are 166167000 possible combinations that could be requested when batching three cities. Therefore, the chance that someone else has already requested the same combination of three cities is rather low.
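The numbers above can be checked with a few lines of Python. The URL shapes are the fictional ones from this example; single_urls and batch_url are just illustrative helper names.

```python
from math import comb

first = ["Cologne", "Hamburg", "Amsterdam"]    # request from Germany
second = ["Cologne", "Amsterdam", "Brussels"]  # request from Belgium


def single_urls(cities):
    """One URL per city, each cacheable on its own."""
    return {f"/api/weather/{city}" for city in cities}


def batch_url(cities):
    """One URL for the exact combination of cities."""
    return "/api/weather/" + ",".join(cities)


# URLs the second request can reuse from a URL-keyed cache.
shared_single = single_urls(first) & single_urls(second)
shared_batch = batch_url(first) == batch_url(second)

# Number of distinct three-city combinations out of 1000 cities.
combinations = comb(1000, 3)
print(sorted(shared_single), shared_batch, combinations)
```

With one URL per city, the second request reuses the entries for Cologne and Amsterdam. With batching, the two URLs differ, so nothing is reused. The count ignores ordering; if the API treats differently ordered lists as different URLs, the number of distinct URLs is even larger.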
Conclusion
The caching happens purely on the API provider's side, but it can greatly benefit your response times as a consumer. Often, GraphQL is used as an API gateway to your own REST services. If you don't cache responses from those services, batching can be worthwhile.

Am I misusing GraphQL if I must decompose REST data, then re-aggregate it?

We are considering using GraphQL on top of a REST service (using the
FHIR standard for medical records).
I understand that the pattern with GraphQL is to aggregate the results
of multiple, independent resolvers into the final result. But a
FHIR-compliant REST server offers batch endpoints that already aggregate
data. Sometimes we’ll need à la carte data—a patient’s age or address
only, for example. But quite often, we’ll need most or all of the data
available about a particular patient.
So although we can get that kind of plenary data from a single REST call
that knits together multiple associations, it seems we will need to
fetch it piecewise to do things the GraphQL way.
An optimization could be to eager load and memoize all the associated
data anytime any resolver asks for any data. In some cases this would be
appropriate while in other cases it would be serious overkill. But
discerning when it would be overkill seems impossible given that
resolvers should be independent. Also, it seems bloody-minded to undo
and then redo something that the REST service is already perfectly
capable of doing efficiently.
So—
Is GraphQL the wrong tool when it sits on top of a REST API that can
efficiently aggregate data?
If GraphQL is the right tool in this situation, is eager-loading and
memoization of associated data appropriate?
If eager-loading and memoization is not the right solution, is there
an alternative way to take advantage of the REST service’s ability
to aggregate data?
My question is different from this question and this question because neither touches on how to take advantage of another service's ability to aggregate data.
An alternative approach would be to parse the request inside the resolver for a particular query. The fourth parameter passed to a resolver is an object containing extensive information about the request, including the selection set. You could then await the batched request to your API endpoint based on the requested fields, and finally return the result of the REST call, and let your lower level resolvers handle parsing it into the shape the data was requested in.
Parsing the info object can be a PITA, although there are libraries out there for that, at least in the Node ecosystem.
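As a rough, library-independent sketch of that decision step: assume the selection set has already been reduced to a set of top-level field names, and the root resolver then picks one REST call. The field names and URL shapes below are hypothetical placeholders, not actual FHIR endpoints.

```python
# Hypothetical associations that are expensive to fetch one by one.
HEAVY_FIELDS = {"encounters", "medications", "observations"}


def plan_fetch(patient_id, requested_fields):
    """Choose a single REST call based on the fields the query asked for."""
    requested = set(requested_fields)
    if requested & HEAVY_FIELDS:
        # Some associated data is needed: use the aggregating endpoint once
        # and let the lower-level resolvers pick their part of the result.
        return f"/patients/{patient_id}/everything"
    # Only a few scalar fields are needed: ask for just those.
    fields = ",".join(sorted(requested))
    return f"/patients/{patient_id}?fields={fields}"
```

The resolvers below the root stay simple: they only read from whatever object the root resolver returned.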

How to add storage-level caching between DynamoDB and Titan?

I am using the Titan/DynamoDB library to use AWS DynamoDB as a backend for my Titan DB graphs. My app is very read-heavy and I noticed Titan is mostly executing query requests against DynamoDB. I am using transaction- and instance-local caches and indexes to reduce my DynamoDB read units and the overall latency. I would like to introduce a cache layer that is consistent for all my EC2 instances: A read/write-through cache between DynamoDB and my application to store query results, vertices, and edges.
I see two solutions to this:
Implicit caching done directly by the Titan/DynamoDB library. Classes like the ParallelScanner could be changed to read from AWS ElastiCache first. The change would have to be applied to read & write operations to ensure consistency.
Explicit caching done by the application before even invoking the Titan/Gremlin API.
The first option seems to be the more fine-grained, cross-cutting, and generic.
Does something like this already exist? Maybe for other storage backends?
Is there a reason why this does not exist already? Graph DB applications seem to be very read-intensive, so cross-instance caching seems like a pretty significant feature for speeding up queries.
First, ParallelScanner is not the only thing you would need to change. Most importantly, all the changes you need to make are in DynamoDBDelegate (that is the only class that makes low level DynamoDB API calls).
Regarding implicit caching, you could add a caching layer on top of DynamoDB. For example, you could implement a cache using API Gateway on top of DynamoDB, or you could use ElastiCache. Either way, you need to figure out a way to invalidate Query/Scan pages. Inserting or deleting items will cause page boundaries to change, so this requires some thought.
Explicit caching may be easier to do than implicit caching. The level of abstraction is higher, so based on your incoming writes it may be easier for you to decide at the application level whether a traversal that is cached needs to be invalidated. If you treat your graph application as another service, you could cache the results at the service level.
Something in between may also be possible (but requires some work). You could continue to use your vertex/database caches as provided by Titan, and use a low value for TTL that is consistent with how frequently you write columns. Or, you could take your caching approach a step further and do the following.
Enable DynamoDB Stream on edgestore.
Use a Lambda function to stream the edgestore updates to a Kinesis Stream.
Consume the Kinesis Stream with edgestore updates in the same JVM as the Gremlin Server on each of your Gremlin Server instances. You would need to instrument the database level cache in Titan to consume the Kinesis stream and invalidate the cached columns as appropriate, in each Titan instance.
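The two ideas above (a low TTL, plus invalidation driven by an update stream) can be combined in one small cache. The sketch below is generic Python rather than Titan's actual cache, which runs on the JVM; TtlCache and its method names are hypothetical, and the stream consumer is assumed to call invalidate() for every updated key.

```python
import time


class TtlCache:
    """Cache whose entries expire after a TTL and can be invalidated early."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, value)

    def get(self, key, load):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]
        # Miss or expired entry: read through to the backing store.
        value = load(key)
        self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key):
        # Called by the update-stream consumer when the key changed upstream.
        self._entries.pop(key, None)
```

The TTL bounds how stale an entry can get if an update event is lost or delayed, while invalidate() removes entries as soon as a change is seen.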

Plone 4.2: how to make PAS cache external user data

I'm implementing a PAS plugin that handles authentication against mail servers. Currently only DBMail is implemented.
I realized that the enumerateUsers function of the PAS plugin is called numerous times per request, which requires my plugin to open and close an SQL connection for every (subsequent) request. Of course, this is very expensive.
The connections themselves are handled in a Plone tool, which can handle multiple different mail servers and delegates the enumerateUsers call to wrapper objects that represent the registered servers.
My question now is what sort of cache (OOBTree, Session?) I should use to provide temporary local storage for repeated enumerations and avoid subsequent SQL connections.
Another idea was to hook into the user creation process that takes place on an external user's first login and completely "localize" the users.
A third idea was to store the needed data on the specific member, if possible.
What would be best practice here?
I'd cache the query results, indeed. You need to make a decision on how long to cache the results, and if stored long term, how to invalidate that cache or check for changes.
There are no best practices for these decisions, as they depend entirely on the type of data stored and the APIs of the backends. If they support some kind of freshness query, for example, then you store everything forever and poll the backend to see if the cache needs updating.
You can start with a simple request cache; query once per request, store it on the request object. Your cache will automatically be invalidated at the end of the request as the request object is cleaned up, the next request will be a clean slate.
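A minimal sketch of such a request cache, written against a plain Python object rather than the actual Zope request API. The attribute name, the function name and the run_query callable are all hypothetical.

```python
_CACHE_ATTR = "_user_enumeration_cache"  # hypothetical attribute name


def enumerate_users_cached(request, query, run_query):
    """Run each distinct query at most once per request object."""
    cache = getattr(request, _CACHE_ATTR, None)
    if cache is None:
        # First call in this request: attach an empty cache to the request.
        cache = {}
        setattr(request, _CACHE_ATTR, cache)
    if query not in cache:
        cache[query] = run_query(query)  # the expensive backend lookup
    return cache[query]
```

Because the cache lives on the request object, it disappears together with the request and never needs explicit invalidation.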
If your backend users rarely change, you can cache information for longer, in a local cache. I'd use a volatile attribute on the plugin. Any attribute starting with _v_ is ignored by the persistence machinery. Thus, anything stored in a _v_ volatile attribute is thread-local and only exists for the lifetime of the process; a restart of the server clears these automatically.
At the very least you should use a _v_ volatile attribute to store your backend SQL connections. That way they can stay open between requests and can be re-used. Something like the following method would do nicely:
    def _connection(self):
        # Return a backend connection
        if getattr(self, '_v_connection', None) is None:
            # Create connection here
            self._v_connection = yourdatabaseconnection
        return self._v_connection
You could also use a persistent attribute on your plugin to store your cache. This cache would be committed to the ZODB and persist across restarts. You then really need to work out how to invalidate the contents: store timestamps and evict data when it is too old, etc.
Your cache data structure depends entirely on your application's needs. If you don't persist information, a dictionary (username -> information) could be more than enough. Persisted caches could benefit from using an OOBTree instead of a dictionary, as OOBTrees reduce the chance of conflicts between different threads and are more efficient when it comes to large sets of data.
Whatever you do, you do not need to use a Session. Sessions are prone to conflicts, do not scale well, and are in any case not the place to store a cache of this kind.
