Am I misusing GraphQL if I must decompose REST data, then re-aggregate it?

We are considering using GraphQL on top of a REST service (using the
FHIR standard for medical records).
I understand that the pattern with GraphQL is to aggregate the results
of multiple, independent resolvers into the final result. But a
FHIR-compliant REST server offers batch endpoints that already aggregate
data. Sometimes we’ll need à la carte data—a patient’s age or address
only, for example. But quite often, we’ll need most or all of the data
available about a particular patient.
So although we can get that kind of plenary data from a single REST call
that knits together multiple associations, it seems we will need to
fetch it piecewise to do things the GraphQL way.
An optimization could be to eager load and memoize all the associated
data anytime any resolver asks for any data. In some cases this would be
appropriate while in other cases it would be serious overkill. But
discerning when it would be overkill seems impossible given that
resolvers should be independent. Also, it seems bloody-minded to undo
and then redo something that the REST service is already perfectly
capable of doing efficiently.
So—
1. Is GraphQL the wrong tool when it sits on top of a REST API that can efficiently aggregate data?
2. If GraphQL is the right tool in this situation, is eager loading and memoization of associated data appropriate?
3. If eager loading and memoization is not the right solution, is there an alternative way to take advantage of the REST service's ability to aggregate data?
My question is different from this question and this question because neither touches on how to take advantage of another service's ability to aggregate data.

An alternative approach would be to parse the request inside the resolver for a particular query. The fourth parameter passed to a resolver (info) is an object containing extensive information about the request, including the selection set. You could then await a batched request to your API endpoint based on the requested fields, return the result of the REST call, and let your lower-level resolvers handle parsing it into the shape the data was requested in.
Parsing the info object can be a PITA, although there are libraries out there for that, at least in the Node ecosystem.
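For illustration, here is a minimal sketch of that approach in Node, assuming a hypothetical FHIR endpoint; it ignores fragment spreads and nested selections for brevity:

// Hypothetical Patient resolver: inspect the selection set, then let the
// FHIR server aggregate everything in a single call instead of resolving
// the associations piecewise.
const fetch = require('node-fetch');

const resolvers = {
  Query: {
    patient: async (parent, args, context, info) => {
      // info.fieldNodes[0].selectionSet lists the top-level fields the
      // client asked for (fragment spreads ignored for brevity).
      const requested = info.fieldNodes[0].selectionSet.selections.map(
        (selection) => selection.name.value
      );

      // One round trip to the REST service for just what we need,
      // e.g. via FHIR's _elements parameter (URL is hypothetical).
      const url =
        `https://fhir.example.com/Patient/${args.id}` +
        `?_elements=${requested.join(',')}`;
      const response = await fetch(url);

      // Return the aggregate; the field resolvers below simply read
      // their slice out of this object.
      return response.json();
    },
  },
};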

Related

Child service data access to other services with Apollo Federation configuration

We've been using Apollo Federation for about 1.5 years as our main API. Behind the federation gateway are 6 child GraphQL services which are all combined at the gateway. This configuration works really well when you have a result set that spans the different services, e.g. a list of tickets, each referencing the user who purchased it, the event it is associated with, etc.
One place we have experienced this breaking down is when a resolver needs a pre-set of data that is already defined in another child service (or across other child services). There is no way (that we have discovered) for a child service to query the federation and get a federated set of data for a resolver to work with.
For example, say we have a GraphQL query defined which queries all tickets for an event, and through federation returns the purchaser's data, the event's data and the product's data. If I need this data set from a resolver, I would need to make all those queries again myself, duplicating dataSource logic and having to match up the data in code.
One crazy thought which came up is to set up an apollo-datasource-rest dataSource that makes queries against our gateway endpoint, as a dataSource for our resolvers. That way we can request the data we need and let Apollo Federation stitch all the data together, as it is designed to do. So instead of the resolver querying the database for all the different pieces of data and then matching them up, we would request the data from our GraphQL gateway, where this query is already defined.
What we are trying to avoid by doing this is having a repeated set of queries in child services to get the details which are already available in (or across) other services.
The question
Is this a really bad idea?
Is it a plausible idea?
Has anyone tried something like this before?
Yes, we would have to ensure there are no circular dependencies among the resolvers. In our case, I see the "dataSource accessing the gateway" approach being used to gather initial data in mutations.
Here is an example of a federated query. In this query, event, allocatedTo, purchasedBy, and product are all types in other services: event is an Event type, allocatedTo and purchasedBy are a Profile type, and product is a Product type. This query provides me with all the data I would use to, say, send an email notification to the people in the result set. But to get this data from a resolver in a mutation (to queue up those emails), I would need to make many queries and align all the data in code myself, instead of using the gateway/federation, which already does this with the established query. The thought behind using apollo-datasource-rest to query our own gateway is to get this data in this form, not through separate queries and code to align IDs, etc.
query getRegisteredUsers($eventId: ID!) {
  communications {
    event(eventId: $eventId) {
      registered {
        event {
          name
        }
        isAllocated
        hasCheckedIn
        lastUpdatedAt
        allocatedTo {
          firstName
          lastName
          email
        }
        purchasedBy {
          id
          firstName
          lastName
        }
        product {
          __typename
          ... on Ticket {
            id
            name
          }
        }
      }
    }
  }
}
FYI, I didn't quite understand the question until I looked at your edits, which had some examples.
Is this a really bad idea?
In my experience, yes, in practice. Not as an idea, though: you're in good company with other very smart people who have done this.
Is it a plausible idea?
Absolutely it's plausible, but I don't recommend it.
Has anyone tried something like this before?
Yes, but I hope you don't.
Your Question
Having resolvers make requests back to the Gateway:
I do not recommend this. I've seen this happen, and I've personally worked to help companies out of the mess this takes you into. Circular dependencies are going to happen. Latency is just going to skyrocket as you have more and more hops, TLS handshakes, etc. Do orchestration instead. It feels weird to introduce non-GraphQL, but IMO in the end it's way simpler, faster, and more maintainable than where "just talk to the gateway" takes you.
What then?
When you're dealing with mutations which require data from across multiple data sources to process a single thing (like sending a transactional email to a person), you have some choices. Something that helped me figure this out was the question "how would I have done this before GraphQL?"
Orchestration: you have a single "orchestration service" which takes the mutation and makes calls (preferably non-GraphQL, so REST, gRPC, Lambda?) to the owner services to collect the data. The orchestration layer does NOT own data, but it can speak with the other services. It's like Federation, but for sending the data into the request instead of into the response. (A rough sketch of this option follows the three choices.)
Choreography: you trigger roughly the same thing, but via an event stream. (This doesn't work as well with the request/response model of GraphQL.)
CQRS (projections): Copies of database data, used for things like reporting. CQRS is basically "the way you read data doesn't have to be the same as the way you write it", and it allows for things like event-sourced data. If all of your data sources actually share the same database, you don't even need "projections" as much as you would just want a read replica. If you're not at enough scale to do replicas, just skip it and promise never to write data that your current domain doesn't own.
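To make the orchestration option concrete, here is a rough sketch for the "send emails to registered users" example. All service URLs, payload shapes and the emailQueue helper are assumptions, not a prescribed implementation:

// Hypothetical orchestration resolver for a "notifyRegisteredUsers"
// mutation: it pulls from the owner services directly (plain REST here)
// rather than calling back into the federation gateway.
const fetch = require('node-fetch');

async function notifyRegisteredUsers(parent, { eventId }, context) {
  // Fetch from each owning service in parallel; no hop through the gateway.
  const [event, registrations] = await Promise.all([
    fetch(`http://events-service/events/${eventId}`).then((r) => r.json()),
    fetch(`http://tickets-service/events/${eventId}/registrations`).then((r) =>
      r.json()
    ),
  ]);

  // Collect the profile IDs we need, then batch-fetch them once.
  const profileIds = [...new Set(registrations.map((reg) => reg.allocatedToId))];
  const profiles = await fetch(
    `http://profiles-service/profiles?ids=${profileIds.join(',')}`
  ).then((r) => r.json());
  const profileById = new Map(profiles.map((p) => [p.id, p]));

  // Push the assembled data "down" into the email job queue.
  await Promise.all(
    registrations.map((reg) =>
      context.emailQueue.enqueue({
        to: profileById.get(reg.allocatedToId).email,
        eventName: event.name,
      })
    )
  );

  return { queued: registrations.length };
}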
What I Do
Where I work, I have gotten us to:
Queries
queries always start with "one database call".
if the "one database call" goes to one domain of data (most often true), that query goes into one service, and Federation fills in the leaves of your tree. If you really follow CQRS, this could go the same way as #3, but we don't.
if your "one database call" needs data from across domains (e.g. get all orders with Product X in it, but sorted by the customer's first name), you need a database projection. Preferably this can be handled by a "reporting service": it doesn't OWN any data, but it READS all data.
Mutations
if your top-level mutation acts only within one domain, the mutation goes in a service, it can use database transactions, and Federation fills in the leaves
if your mutation is required to write across multiple domains and requires immediate consistency (placing an order with inventory, payments, etc.), we chose orchestration to write across multiple services (and roll back when necessary, since we don't have database transactions to do it for us).
if your mutation requires data from many places to send further into the request (like sending an email), we chose orchestration to pull from the multiple services and to push that data down. This feels very much like Federation, but in reverse.

When to use Redis and when to use DataLoader in a GraphQL server setup

I've been working on a GraphQL server for a while now and although I understand most of the aspects, I cannot seem to get a grasp on caching.
When it comes to caching, I see both DataLoader and Redis mentioned, but it's not clear to me when I should use which, or how.
I take it that DataLoader is used more at the field level to counter the N+1 problem? And I guess Redis sits at a higher level then?
If anyone could shed some light on this, I would be most grateful.
Thank you.
DataLoader is primarily a means of batching requests to some data source. However, it does optionally utilize caching on a per request basis. This means, while executing the same GraphQL query, you only ever fetch a particular entity once. For example, we can call load(1) and load(2) concurrently and these will be batched into a single request to get two entities matching those ids. If another field calls load(1) later on while executing the same request, then that call will simply return the entity with ID 1 we fetched previously without making another request to our data source.
DataLoader's cache is specific to an individual request. Even if two requests are processed at the same time, they will not share a cache. DataLoader's cache does not have an expiration -- and it has no need to since the cache will be deleted once the request completes.
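A minimal sketch of that batching-plus-per-request-caching behavior, assuming a hypothetical db.users data source:

const DataLoader = require('dataloader');

// Batch function: receives every key requested in the same tick and must
// return values in the same order as the keys were given.
async function batchGetUsers(ids) {
  const rows = await db.users.findByIds(ids); // hypothetical data source
  const byId = new Map(rows.map((row) => [row.id, row]));
  return ids.map((id) => byId.get(id));
}

// Create one loader per request (e.g. in the GraphQL context factory) so
// the cache never leaks across requests.
function createContext() {
  return { userLoader: new DataLoader(batchGetUsers) };
}

// In resolvers: userLoader.load(1) and userLoader.load(2) called
// concurrently are coalesced into ONE batchGetUsers([1, 2]) call; a later
// load(1) in the same request is a cache hit and fetches nothing.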
Redis is a key-value store that's used for caching, queues, PubSub and more. We can use it to provide response caching, which would let us effectively bypass the resolver for one or more fields and use the cached value instead (until it expires or is invalidated). We can use it as a cache layer between GraphQL and the database, API or other data source -- for example, this is what RESTDataSource does. We can use it as part of a PubSub implementation when implementing subscriptions.
DataLoader is a small library used to tackle a particular problem, namely generating too many requests to a data source. The alternative to using DataLoader is to fetch everything you need (based on the requested fields) at the root level and then letting the default resolver logic handle the rest. Redis is a key-value store that has a number of uses. Whether you need one or the other, or both, depends on your particular business case.
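For contrast, here is a cache-aside sketch using Redis at the resolver level; the key scheme, TTL and db.products data source are arbitrary assumptions:

const Redis = require('ioredis');
const redis = new Redis(); // assumes a reachable Redis instance

// Hypothetical resolver whose result is cached across requests.
async function resolveProduct(parent, { id }) {
  const cacheKey = `product:${id}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached); // bypass the data source entirely

  const product = await db.products.findById(id); // hypothetical data source
  await redis.set(cacheKey, JSON.stringify(product), 'EX', 60); // 60s TTL
  return product;
}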

Ideal way to use graphql

I am a new user of GraphQL. I am planning to use GraphQL as a middleware layer where different applications hit the API and get the data they require. The main problem is training the different groups on how to post data and query the data they require. Is it a good idea to build a middleware which accepts JSON over a REST API and converts it to a GraphQL request? I am thinking of 2 options:
1. Build a REST middle layer which accepts JSON and converts it to a GraphQL request.
2. Ask users to get comfortable with GraphQL.
Mixing REST and GraphQL is never a good idea for a new project, because you will waste your resources doing the same thing in two different ways and you will have to maintain a larger codebase. Providing REST and GraphQL at the same time may seem like a convenience for your customers, but in the long run it is not. A smaller, well-structured and well-documented API is always preferable.
If you are going to mix and match different resources or call outside services, GraphQL offers the better solution. GraphQL provides strong typing, single round trips, query batching, introspection, better dev tools and a versionless API.

Does GraphQL obviate Data Transfer Objects?

To my understanding, Data Transfer Objects (DTOs) are typically smallish, flattish, behavior-less, serializable objects whose main advantage is ease of transport across networks.
GraphQL has the following facets:
encourages serving rich object graphs, which (in my head anyway) contradicts the "flattish" portion of DTOs,
lets clients choose exactly the data they want, which addresses the "smallish" portion,
returns JSON-esque objects, which addresses the "behavior-less" and "serializable" portions
Do GraphQL and the DTO pattern mutually exclude one another?
Here's what led to this question: We envision a microservices architecture with a gateway. I'm designing one API to fit into that architecture that will serve (among other things) geometries. In many (likely most) cases the geometries will not be useful to client applications, but they'll be critical in others, so they must be served. However they're serialized, geometries can be big, so giving clients the option to decline them can save lots of bandwidth.
RESTful APIs that I've seen handling geometries do that by providing a "returnGeometry" parameter in the query string. I never felt entirely comfortable with that approach, and I initially envisioned serving a reasonably deep set of related/nested return objects, many of which clients will elect to decline. All of that led me to consider a GraphQL interface.
As the design has progressed, I've started considering flattening the output (either entirely or partially), which led me to consider the DTO pattern. So now I'm wondering if it would be best to flatten everything into DTOs and skip GraphQL (in favor of REST, I suppose?). I've considered a middle ground with DTOs served using GraphQL to let clients pick and choose the attributes they want on them, but I'm wondering if that's mixing patterns and technologies inappropriately.
I think it's worthwhile differentiating between 2 typical use cases for GraphQL, and a hidden 3rd use case which combines the first two.
In all 3 however, the very nature of a GraphType is to selectively decide which fields you want to expose from your domain entity. Sounds familiar? It should, that's what a DTO is. GraphQL or not, you do not want to expose the 'password' field on your Users table for example, hence you need to hide it from your clients one way or another.
This is enabled by the fact that GraphQL doesn't make any assumptions about your persistence layer and gives you the tools to treat your input types / queries as you see fit.
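As a trivial illustration (the entity and its fields are hypothetical; the point is only that the exposed type is a deliberate subset):

// Hypothetical domain entity as persisted:
//   { id, email, passwordHash, createdAt }
//
// The exposed GraphQL type deliberately omits passwordHash -- choosing
// which fields to expose is exactly the DTO decision:
const typeDefs = `
  type User {
    id: ID!
    email: String!
    createdAt: String
  }
`;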
1. GraphQL endpoint exposed directly to clients (e.g. web, mobile):
In this use case you'd use any GraphQL client to talk to your graphql endpoint directly. The DTOs here are the actual GraphType objects, and are structured depending on the Fields you added to your exposed GraphTypes.
Internally, you would use field resolvers to transform your DTO to your domain entity and then use your repository to persist it.
DTO transformation occurs inside the GraphType's Field resolver.
GraphQL --> DTO --> Domain Entity --> Data Store
2. REST endpoint exposed to clients, which internally consumes a GraphQL endpoint:
In this use case, your web and mobile clients are working with traditional DTOs via REST. The controllers however are connecting to an internally-exposed GraphQL endpoint - as opposed to use case #1 - whose GraphTypes are an exact mapping of your domain entities, password field included!
DTO transformation occurs in the controller before calling the endpoint.
DTO --> Domain Entity --> GraphQL --> Data Store
3. Combining 1 and 2
This is a use case for when you're shifting your architecture from one to the other and you don't want to break things for client consumers, so you leave both options open and eventually decommission one of them.

Micro Services and noSQL - Best practice to enrich data in micro service architecture

I want to plan a solution that manages enriched data in my architecture.
To be more clear, I have dozens of microservices, let's say Country, Building, Floor and Worker, all running over separate NoSQL data stores.
When I get data from the Worker service, I also want to present the name of the floor the worker works on, plus the building name and country name.
Solution 1.
The client queries all the microservices.
Problem - multiple requests, and the client has to be aware of the structure.
I know multiple requests shouldn't bother me, but I believe returning one JSON document describing the entity in a single call is better.
Solution 2.
Create an orchestration service that retrieves the data from the multiple services.
Problem - if the data (entity names, for example) is not stored in the same document in the DB, it is very hard to sort and filter by those fields.
Solution 3.
Before saving an entity, e.g. a worker, call all the other services and fill in the related data (building name, country name).
Problem - when the building name changes, the change is not reflected in the worker service.
Solution 4.
(This is the best one I can come up with.)
Create a process that subscribes to a broker and receives all entity changes. For each change it updates all the relevant entities, so when, say, a building name changes, it updates all the documents that hold that building name.
Problems:
Each service has to know what can be updated.
When a trailing update happens it shouldn't update the broker again (recursive updates), which adds complexity to the microservices.
Solution 5.
Keep everything normalized; filter and sort in Elasticsearch.
Problem - keeping normalized data in ES is too expensive performance-wise.
One thing I saw Netflix do (which I like) is create intermediary services for stuff like this. So maybe a new intermediary service that can call the other services to gather all the data and then create the unified output with the Country, Building, Floor and Worker data.
You can even go one step further and try to come up with a scheme for providing as input which resources you want to include in the output.
So I guess this closely matches your Solution 2. I notice you mention concerns with sorting/filtering in the DBs for Solution 2. If you are using NoSQL then it has to be for a reason, and more often than not the reason is performance. If this were done wrong then, yes, you would have problems, but if all the appropriate searchable fields are properly keyed and indexed (as @Roman Susi mentioned in his bullet points 1 and 2), I don't see this as being a problem. This service will only be as fast as the culmination of your other services and data stores, so they have to be fast.
Now you keep your individual microservices as they are, keep the client calling one service, and encapsulate the complexity of merging the data into this new service.
This is the video I saw this in (https://www.youtube.com/watch?v=StCrm572aEs). It's a long video but very informative.
It is hard to advise at the level of Solution N, but certain problems can be avoided by following these suggestions:
Use globally unique identifiers for entities, for example by using some kind of URI as key values.
The global ids also simplify updates, because you can track what has actually changed - the name or the entity (an entity has a one-to-one relation with its global URI).
The CAP theorem says you can choose only two of consistency, availability and partition tolerance. Do you want a CA architecture? Or CP? Or maybe AP? This will strongly affect the way you distribute data.
For "sort and filter" there is the MapReduce approach, which can distribute the load of figuring those things out.
Think carefully about the balance of normalization/denormalization. If your services operate on URIs, you can have a service which turns URIs into labels (names, descriptions, etc.), but you do not need to keep the redundant information everywhere and update it. Do not do premature optimization; try to keep data normalized as long as possible. This way, a worker may not even need the building name, just its global id, and the microservice looks up the metadata from another microservice.
In other words, minimize the number of keys shared between services, as part of separation of concerns.
Focus on the underlying model, not the JSON going back and forth. Getting the data model right in your system(s) gains you more than saving JSON calls.
As for NoSQL, take a look at the Riak database: it has adjustable CAP properties, IIRC. Even if you do not use it as such, reading its documentation may help you come up with a suitable architecture for your distributed microservices system. (Of course, this applies if you have an essentially parallel system.)
First of all, thanks for your question. It is similar to the main problem of document DBs: how do you sort a collection by a field from another collection? I have my own answer for that, so I'll try to comment on all your solutions:
Solution 1: It is good if the client wants to work with Countries/Buildings/Floors independently. But it does not solve the problem you mentioned in Solution 2 - sorting 10k workers by building is going to be slow.
Solution 2: Similar to Solution 1, if all the client wants is a list of enriched workers without knowing how to combine it from multiple pieces.
Solution 3: As you said, unacceptable because of inconsistent data.
Solution 4: It is going to work, most of the time. But:
Huge data duplication. If you have 20 entities, you are going to have 20x the data.
Large complexity. 20 entities -> 20 different procedures to update related data.
High coupling. All your services must know about each other; a data model change will propagate to every service because of the update procedures.
Questionable eventual consistency. It can be done so that data becomes consistent after failures, but it is not going to be easy.
Solution 5: Kind of an answer :-)
But you do not want everything in ES. Keep separate services that serve separate entities and build other services on top of them.
If the client wants enriched data - build a service that returns enriched data, as in Solution 2.
If the client wants to display a list of enriched data with filtering and sorting - build a service that provides enriched data with filtering and sorting capability! Likely the implementation of such a service will contain an ES instance with cached and indexed data from the lower-level services (see the sketch below). The point here is that ES does not have to contain everything or be shared between every service - it is up to you to decide the right balance between performance and infrastructure resources.
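As a sketch of what the read path of such a service could look like, using the official Node Elasticsearch client (index name, fields and mapping are all assumptions):

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' }); // hypothetical

// The 'workers' index holds denormalized documents fed from the
// lower-level services; that denormalization is what makes
// cross-entity sort/filter cheap here.
async function listWorkers({ buildingName, page = 0, size = 20 }) {
  const result = await client.search({
    index: 'workers',
    from: page * size,
    size,
    query: buildingName
      ? { match: { 'building.name': buildingName } }
      : { match_all: {} },
    sort: [{ 'building.name.keyword': 'asc' }], // assumes a keyword sub-field
  });
  // v8 client: the response body is returned directly.
  return result.hits.hits.map((hit) => hit._source);
}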
This is a case where Linked Data can help you.
Basically, the floor attribute of the worker would be a URI (a link) to the floor itself, and any other linked data should be expressed as URIs as well.
Modeled with some JSON-LD it would look like this:
worker = {
  '@id': '/workers/87373',
  name: 'John',
  floor: {
    '@id': '/floors/123'
  }
}

floor = {
  '@id': '/floors/123',
  level: 12,
  building: { '@id': '/buildings/87' }
}

building = {
  '@id': '/buildings/87',
  name: "John's home",
  city: { '@id': '/cities/908' }
}
This way all the client has to do is append the base URL (like api.example.com) to the @id and make a simple GET call.
To remove the extra-calls burden from the client (in case it's a slow mobile device), we use the gateway pattern with microservices. The gateway can expand those links with very little effort and augment the returned object, and it can make multiple calls in parallel.
So the gateway will make a GET /floors/123 call and replace the floor object on the worker with the reply.
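A minimal sketch of that expansion step at the gateway (the base URL and helper names are hypothetical):

const fetch = require('node-fetch');

const BASE_URL = 'https://api.example.com'; // hypothetical

// Fetch a resource by its linked-data id, e.g. '/workers/87373'.
async function getResource(id) {
  const response = await fetch(`${BASE_URL}${id}`);
  return response.json();
}

// Expand the given link fields on a resource, fetching them in parallel.
async function expandLinks(resource, fields) {
  const expanded = await Promise.all(
    fields.map((field) => getResource(resource[field]['@id']))
  );
  fields.forEach((field, i) => {
    resource[field] = expanded[i];
  });
  return resource;
}

// Gateway usage: GET the worker, then inline its floor.
// const worker = await expandLinks(await getResource('/workers/87373'), ['floor']);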
