How do you prevent nested attack on GraphQL/Apollo server? - graphql

How do you prevent a nested attack against an Apollo server with a query such as:
{
  authors {
    firstName
    posts {
      title
      author {
        firstName
        posts {
          title
          author {
            firstName
            posts {
              title
              [n author]
              [n post]
            }
          }
        }
      }
    }
  }
}
In other words, how can you limit the number of recursions being submitted in a query? This could be a potential server vulnerability.

As of the time of writing, there isn't a built-in feature in GraphQL-JS or Apollo Server to handle this concern, but it's something that should definitely have a simple solution as GraphQL becomes more popular. This concern can be addressed with several approaches at several levels of the stack, and should also always be combined with rate limiting, so that people can't send too many queries to your server (this is a potential issue with REST as well).
I'll just list all of the different methods I can think of, and I'll try to keep this answer up to date as these solutions are implemented in various GraphQL servers. Some of them are quite simple, and some are more complex.
1. Query validation: In every GraphQL server, the first step to running a query is validation - this is where the server tries to determine if there are any serious errors in the query, so that we can avoid using actual server resources if we can find that there is some syntax error or invalid argument up front. GraphQL-JS comes with a selection of default rules that follow a format pretty similar to ESLint. Just like there is a rule to detect infinite cycles in fragments, one could write a validation rule to detect queries with too much nesting and reject them at the validation stage.
2. Query timeout: If it's not possible to detect that a query will be too resource-intensive statically (perhaps even shallow queries can be very expensive!), then we can simply add a timeout to the query execution. This has a few benefits: (1) it's a hard limit that's not too hard to reason about, and (2) this will also help with situations where one of the backends takes unreasonably long to respond. In many cases, a user of your app would prefer a missing field over waiting 10+ seconds to get a response.
3. Query whitelisting: This is probably the most involved method, but you could compile a list of allowed queries ahead of time, and check any incoming queries against that list. If your queries are totally static (you don't do any dynamic query generation on the client with something like Relay) this is the most reliable approach. You could use an automated tool to pull query strings out of your apps when they are deployed, so that in development you write whatever queries you want but in production only the ones you want are let through. Another benefit of this approach is that you can skip query validation entirely, since you know that all possible queries are valid already. For more benefits of static queries and whitelisting, read this post: https://dev-blog.apollodata.com/5-benefits-of-static-graphql-queries-b7fa90b0b69a
4. Query cost limiting: (Added in an edit) Similar to query timeouts, you can assign a cost to different operations during query execution, for example a database query, and limit the total cost the client is able to use per query. This can be combined with limiting the maximum parallelism of a single query, so that you can prevent the client from sending something that initiates thousands of parallel requests to your backend.
(1) and (2) in particular are probably something every GraphQL server should have by default, especially since many new developers might not be aware of these concerns. (3) will only work for certain kinds of apps, but might be a good choice when there are very strict performance or security requirements.
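To make the whitelisting idea in (3) a bit more concrete, here is a minimal sketch. It is only an illustration under stated assumptions: `normalize`, `buildAllowlist` and `isAllowed` are hypothetical names, and collapsing whitespace is a crude stand-in for whatever normalization (hashing, comparing parsed documents) a real setup would use.

```typescript
// Hypothetical sketch of query whitelisting; all names are illustrative.

// Collapse whitespace so formatting differences don't cause false rejections.
// (Assumption: a real setup would more likely compare hashes or parsed documents.)
function normalize(query: string): string {
  return query.replace(/\s+/g, " ").trim();
}

// Built once at deploy time from the queries extracted from the client apps.
function buildAllowlist(queries: string[]): Set<string> {
  return new Set(queries.map(normalize));
}

// Checked for every incoming request before anything is executed.
function isAllowed(allowlist: Set<string>, incoming: string): boolean {
  return allowlist.has(normalize(incoming));
}

const allowlist = buildAllowlist(["{ authors { firstName } }"]);
```

With this list, a reformatted copy of the known query passes, while the deeply nested query from the question would simply not be on the list.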

To supplement point (4) in stubailo's answer, here are some Node.js implementations that impose cost and depth bounds on incoming GraphQL documents.
graphql-depth-limit
graphql-validation-complexity
graphql-query-complexity
These are custom rules that supplement the validation phase.
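To show the idea behind a depth rule without depending on any of these libraries, here is a rough sketch. It does not use their APIs: `Field`, `depthOf` and `assertMaxDepth` are hypothetical, and the `Field` tree is a simplified stand-in for the AST a real validation rule would walk.

```typescript
// Simplified stand-in for a parsed selection set (assumption: a real rule
// would walk the GraphQL AST produced by the server's parser instead).
interface Field {
  name: string;
  selections?: Field[];
}

// Depth of a field = 1 for a leaf, otherwise 1 + the deepest child.
function depthOf(field: Field): number {
  const children = field.selections ?? [];
  return 1 + children.reduce((max, child) => Math.max(max, depthOf(child)), 0);
}

// Reject the query before execution if any top-level field nests too deeply.
function assertMaxDepth(roots: Field[], maxDepth: number): void {
  for (const root of roots) {
    const depth = depthOf(root);
    if (depth > maxDepth) {
      throw new Error(`Query depth ${depth} exceeds limit of ${maxDepth}`);
    }
  }
}

// authors { firstName posts { title author { firstName } } }
const nested: Field = {
  name: "authors",
  selections: [
    { name: "firstName" },
    {
      name: "posts",
      selections: [
        { name: "title" },
        { name: "author", selections: [{ name: "firstName" }] },
      ],
    },
  ],
};
```

Here `depthOf(nested)` is 4, so a limit of 4 lets the query through and a limit of 3 rejects it.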

A variation on query whitelisting is query signing.
During the build process, each query is cryptographically signed using a secret which is shared with the server but not bundled with the client. Then at runtime the server can validate that a query is genuine.
The advantage over whitelisting is that writing queries in the client doesn't require any changes to the server. This is especially valuable if multiple clients access the same server (e.g. web, desktop and mobile apps).
Example
In development, you write your queries as usual against your dev server which allows unsigned queries.
Then in your client build step in CI, each query is tagged with its cryptographic signature. This signature is sent by the client as a header to the server when making the request, along with the full GraphQL query string.
Your staging and production servers are configured to require signed queries. They calculate the signature of the query received in the same way as the CI server did during the build. If the signatures don't match, they don't process the query.
Limitations:
not suitable for public facing APIs since the secret must be shared with developers
clients cannot dynamically build a GraphQL query at runtime using string interpolation, but I've never had a need for this and it is discouraged
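A minimal sketch of the signing and verification steps, using Node's built-in crypto module. The answer doesn't say which algorithm is used, so HMAC-SHA256 with a hex-encoded signature is an assumption here, and `signQuery` and `isGenuine` are hypothetical names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Runs in the CI build step: the secret is known to CI and the server only.
// (Assumption: HMAC-SHA256, hex-encoded; the original answer names no algorithm.)
function signQuery(query: string, secret: string): string {
  return createHmac("sha256", secret).update(query).digest("hex");
}

// Runs on the server: recompute the signature and compare in constant time.
function isGenuine(query: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(signQuery(query, secret), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on a length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Any change to the query string after the build, even a single extra space, produces a different signature and the request is rejected.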

For the Query cost limiting you could use graphql-cost-analysis
This is a validation rule which parses the query before executing it. In your GraphQL server you just have to assign a cost configuration for each field of your Schema Type Map you want.
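The general idea of a cost check can be sketched as below. This is not the graphql-cost-analysis API: the `Field` tree, the `CostMap` shape and the function names are all hypothetical, and the cost model (a plain sum with no multipliers) is deliberately simplistic.

```typescript
// Simplified stand-in for a parsed query (assumption, not the library's AST).
interface Field {
  name: string;
  selections?: Field[];
}

// Hypothetical cost configuration: fields not listed default to cost 0.
type CostMap = Record<string, number>;

// Total cost = the field's own cost plus the cost of everything beneath it.
function costOf(field: Field, costs: CostMap): number {
  const own = costs[field.name] ?? 0;
  const nested = (field.selections ?? []).reduce(
    (sum, child) => sum + costOf(child, costs),
    0
  );
  return own + nested;
}

// Run before execution: refuse queries whose total cost is over budget.
function exceedsBudget(roots: Field[], costs: CostMap, budget: number): boolean {
  const total = roots.reduce((sum, root) => sum + costOf(root, costs), 0);
  return total > budget;
}
```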

Don't miss graphql-rate-limit, a GraphQL directive that adds basic but granular rate limiting to your queries or mutations.
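To illustrate what per-field rate limiting means, here is a rough fixed-window sketch. It is not how graphql-rate-limit is implemented or configured: `FieldRateLimiter` is a hypothetical class, and keying on field name plus caller identity is an assumption.

```typescript
// Fixed-window counter keyed by "field:identity" (hypothetical design, not
// the graphql-rate-limit implementation).
class FieldRateLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private max: number;
  private windowMs: number;

  constructor(max: number, windowMs: number) {
    this.max = max;
    this.windowMs = windowMs;
  }

  // Returns true if the call is allowed; `now` is injectable for testing.
  allow(field: string, identity: string, now: number = Date.now()): boolean {
    const key = `${field}:${identity}`;
    const current = this.windows.get(key);
    // Start a fresh window if none exists or the old one has expired.
    if (!current || now - current.start >= this.windowMs) {
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.max) return false;
    current.count += 1;
    return true;
  }
}
```

A resolver (or a directive wrapping it) would call `allow` with the field name and the caller's identity, and return an error when it gets `false`.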

Related

Child service data access to other services with Apollo Federation configuration

We've been using Apollo Federation for about 1.5 years as our main API. Behind the federation gateway are 6 child GraphQL services, which are all combined at the gateway. This configuration works really well when you have a result set of data which spans the different services, e.g. a list of tickets which references the user who purchased each one, the event it is associated with, etc.
One place we have experienced this breaking down is when a pre-set of data is needed which is already defined in another child service (or across other child services) (resolver/path). There is no way (that has been discovered by us) to query the federation from a child service to get a federated set of data for use by a resolver to work with that data.
For example, say we have a graphql query defined which queries all tickets for an event, and through federation returns the purchaser's data, the event's data and the products data. If I need this data set from a resolver, I would need to make all those queries again myself duplicating dataSource logic and having to match up the data in code.
One crazy thought which came up is to set up an apollo-datasource-rest dataSource to make queries against our gateway endpoint as a dataSource for our resolvers. This way we can request the data we need and let Apollo Federation stitch all the data together as it is designed to do. So instead of the resolver querying the database for all the different pieces of data and then matching them up, we would request the data from our GraphQL gateway where this query is already defined.
What we are trying to avoid by doing this is having a repeated set of queries in child services to get the details which are already available in (or across) other services.
The question
Is this a really bad idea?
Is it a plausible idea?
Has anyone tried something like this before?
Yes we would have to ensure that there aren't circular dependencies on the resolvers. In our case I see the "dataSource accessing the gateway" utilized in gathering initial data in mutations.
Example of a federated query. In this query, event, allocatedTo, purchasedBy, and product are all types in other services. event is an Event type, allocatedTo and purchasedBy are a Profile type, and product is a Product type. This query provides me with all the data I would use to, say, send an email notification to the people in the result set. But getting this data from a resolver in a mutation to queue up those emails means I need to make many queries and align all the data through code myself, instead of using the gateway/federation, which already does this with the established query. The thought around using apollo-datasource-rest to query our own gateway is to get at this data in this form, not through separate queries and code to align IDs etc.
query getRegisteredUsers($eventId: ID!) {
  communications {
    event(eventId: $eventId) {
      registered {
        event {
          name
        }
        isAllocated,
        hasCheckedIn,
        lastUpdatedAt,
        allocatedTo {
          firstName
          lastName
          email
        }
        purchasedBy {
          id
          firstName
          lastName
        }
        product {
          __typename
          ...on Ticket {
            id
            name
          }
        }
      }
    }
  }
}
FYI, I didn't quite understand the question until I looked at your edits, which had some examples.
Is this a really bad idea?
In my experience, yes. Not as an idea, as you're in good company with other very smart people who have done this.
Is it a plausible idea?
Absolutely it's plausible, but I don't recommend it.
Has anyone tried something like this before?
Yes, but I hope you don't.
Your Question
Having resolvers make requests back to the Gateway:
I do not recommend this. I've seen this happen, and I've personally worked to help companies out of the mess this takes you into. Circular dependencies are going to happen. Latency is just going to skyrocket as you have more and more hops, TLS handshakes, etc. Do orchestration instead. It feels weird to introduce non-GraphQL, but IMO in the end it's way simpler, faster, and more maintainable than where "just talk to the gateway" takes you.
What then?
When you're dealing with some mutations which require data from across multiple data sources to be able to process a single thing (like sending a transaction email to a person), you have some choices. Something that helped me figure this out was the question "how would I have done this before GraphQL?"
Orchestration: you have a single "orchestration service", which takes the mutation and makes calls (preferably non-GraphQL, so REST, gRPC, Lambda?) to the owner services to collect the data. The orchestration layer does NOT own data, but it can speak with the other services. It's like Federation, but for sending the data into the request, instead of into the response.
Choreography: you trigger roughly the same thing, but via an event stream. (doesn't work as well with the request / response model of GraphQL)
CQRS (projections): Copies of database data, used for things like reporting. CQRS is basically "the way you read data doesn't have to be the same as the way you write it", and it allows for things like event-sourced data. If all of your data sources actually share the same database, you don't even need "projections" as much as you would just want a read replica. If you're not at enough scale to do replicas, just skip it and promise never to write data that your current domain doesn't own.
What I Do
Where I work, I have gotten us to:
Queries
1. queries always start with "one database call".
2. if the "one database call" goes to one domain of data (most often true), that query goes into one service, and Federation fills in the leaves of your tree. If you really follow CQRS, this could go the same way as #3, but we don't.
3. if your "one database call" needs data from across domains (e.g. get all orders with Product X in it, but sorted by the customer's first name), you need a database projection. Preferably this can be handled by a "reporting service": it doesn't OWN any data, but it READS all data.
Mutations
if your top-level mutation acts only within one domain, the mutation goes in a service, it can use database transactions, and Federation fills in the leaves
if your mutation is required to write across multiple domains and requires immediate consistency (placing an order with inventory, payments, etc), we chose orchestration to write across multiple services (and roll-back when necessary, since we don't have database transactions to do it for us).
if your mutation requires data from many places to send further into the request (like sending an email), we chose orchestration to pull from the multiple services and to push that data down. This feels very much like Federation, but in reverse.
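As a rough sketch of the last case (orchestration that pulls data from several owner services and pushes it further into the request), consider the email example from the question. Everything here is hypothetical: the `Services` interface, its method names and the data shapes are invented for illustration, and in practice each method would be a REST or gRPC call to the owning service.

```typescript
// All service clients below are hypothetical; in practice they would be
// REST or gRPC calls to the services that own each piece of data.
interface Ticket { id: string; eventId: string; purchasedById: string }
interface Profile { id: string; firstName: string; email: string }
interface EventInfo { id: string; name: string }

interface Services {
  ticketsForEvent(eventId: string): Promise<Ticket[]>;
  profile(id: string): Promise<Profile>;
  event(id: string): Promise<EventInfo>;
  sendEmail(to: string, body: string): Promise<void>;
}

// The orchestrator owns no data: it only collects from the owners and
// pushes the assembled result further into the request.
async function notifyRegisteredUsers(eventId: string, services: Services): Promise<number> {
  const [event, tickets] = await Promise.all([
    services.event(eventId),
    services.ticketsForEvent(eventId),
  ]);
  // De-duplicate purchasers so each person is looked up and emailed once.
  const purchaserIds = Array.from(new Set(tickets.map((t) => t.purchasedById)));
  const purchasers = await Promise.all(purchaserIds.map((id) => services.profile(id)));
  await Promise.all(
    purchasers.map((p) =>
      services.sendEmail(p.email, `Hi ${p.firstName}, see you at ${event.name}!`)
    )
  );
  return purchasers.length;
}
```

The mutation resolver would call `notifyRegisteredUsers` directly, without going back through the gateway.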

GraphQL Asynchronous query results

I'm trying to implement a batch query interface with GraphQL. I can get a request to work synchronously without issue, but I'm not sure how to approach making the result asynchronous. Basically, I want to be able to kick off the query and return a pointer of sorts to where the results will eventually be when the query is done. I'd like to do this because the queries can sometimes take quite a while.
In REST, this is trivial. You return a 202 and return a Location header pointing to where the client can go to fetch the result. GraphQL as a specification does not seem to have this notion; it appears to always want requests to be handled synchronously.
Is there any convention for doing things like this in GraphQL? I very much like the query specification but I'd prefer to not leave the client HTTP connection open for up to a few minutes while a large query is executed on the backend. If anything happens to kill that connection the entire query would need to be retried, even if the results themselves are durable.
What you're trying to do is not solved easily in a spec-compliant way. Apollo introduced the idea of a @defer directive that does pretty much what you're looking for, but it's still an experimental feature. I believe Relay Modern is trying to do something similar.
The idea is effectively the same -- the client uses a directive to mark a field or fragment as deferrable. The server resolves the request but leaves the deferred field null. It then sends one or more patches to the client with the deferred data. The client is able to apply the initial request and the patches separately to its cache, triggering the appropriate UI changes each time as usual.
I was working on a similar issue recently. My use case was to submit a job to create a report and provide the result back to the user. Creating a report takes a couple of minutes, which makes it an asynchronous operation. I created a mutation which submitted the job to the backend processing system and returned a job ID. Then I periodically poll the jobs field using a query to find out about the state of the job and eventually the results. As the result is a file, I return a link to a different endpoint where it can be downloaded (similar to the approach GitHub uses).
Polling for actual results is working as expected but I guess this might be better solved by subscriptions.
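The submit-then-poll flow described above can be sketched as follows. This is only an illustration: `JobStore` is a hypothetical in-memory stand-in for the backend processing system, and the state names and `downloadUrl` field are assumptions.

```typescript
type JobState = "PENDING" | "DONE";

interface Job {
  id: string;
  state: JobState;
  downloadUrl?: string;
}

// In-memory stand-in for the backend processing system (assumption).
class JobStore {
  private jobs = new Map<string, Job>();
  private nextId = 1;

  // Backs the mutation: enqueue the work and return immediately with an ID.
  submit(): string {
    const id = String(this.nextId++);
    this.jobs.set(id, { id, state: "PENDING" });
    return id;
  }

  // Called by the worker when the report file is ready.
  complete(id: string, downloadUrl: string): void {
    const job = this.jobs.get(id);
    if (!job) throw new Error(`Unknown job ${id}`);
    job.state = "DONE";
    job.downloadUrl = downloadUrl;
  }

  // Backs the polling query: clients call this until the state is DONE.
  status(id: string): Job | undefined {
    return this.jobs.get(id);
  }
}
```

The mutation resolver would return `submit()`, and the query resolver would return `status(id)`, so no HTTP connection has to stay open while the report is generated.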

Is there a way to combine a query and a command in CQRS?

I have a project built using CQRS, but I can't figure out how to implement one use case.
The user needs to be able to make a Query which will return a set of data for them to view. However, I also need to save the data they got at the same time.
Is there a way to do this within a Query without violating CQRS' principles? Or would the Query and Command need to be two separate API calls one after another?
In CQRS it is your client who can do both commands and queries. This client is not necessarily a piece of UI.
It can be an API endpoint handler, which would
receive a query
forward it to the query endpoint
wait for the answer
send an answer to the caller
send a command to store the answer
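The steps above can be sketched as a small handler. The `QuerySide` and `CommandSide` interfaces are hypothetical, and the sketch is synchronous for brevity, so the command is sent just before the answer is returned rather than after.

```typescript
// Hypothetical shapes: the answer only describes the steps, not an API.
interface QuerySide<Q, A> {
  ask(query: Q): A;
}
interface CommandSide<A> {
  store(answer: A): void;
}

// The handler acts as the "client" of both sides, so the query side itself
// stays free of side effects.
function handleRequest<Q, A>(
  query: Q,
  queries: QuerySide<Q, A>,
  commands: CommandSide<A>
): A {
  const answer = queries.ask(query); // forward the query and wait for the answer
  commands.store(answer); // send a command to store the answer
  return answer; // send the answer to the caller
}
```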
Is there a way to do this within a Query without violating CQRS' principles?
It depends.
If "save the data" means "make some change to the domain model"... well, that would be pretty weird.
Asking a question should not change the answer. -- Bertrand Meyer
On the other hand, logging/telemetry are pretty normal ways to track the activity of an application, so that should be fine.
There are some realities of a distributed system on an unreliable network that you need to be aware of: what should the behavior be if the telemetry system is not available? What are the consequences of recording queries that don't actually reach the client (because the network is unreliable)?
As @VoiceOfUnreason stated, it may be somewhat strange to effect domain changes when querying data.
However, it may be that you could swop that around.
For instance, perhaps one could query a forecast of sorts. We would want to store that forecast. It then seems as though the query results in us having to save the result. This appears to break CQS at some level since each query would result in a change of state.
If we swop that around and first request a forecast via the domain handling and then that produces a result, or even a pointer to the result, then the query would be something you could perform on the data multiple times without "breaking" CQS.

Is graphql schema circular reference an anti-pattern?

graphql schema like this:
type User {
  id: ID!
  location: Location
}
type Location {
  id: ID!
  user: User
}
Now, the client sends a graphql query. Theoretically, the User and Location can circular reference each other infinitely.
I think it's an anti-pattern. As far as I know, there is no middleware or other way to limit the nesting depth of a query in either the GraphQL or Apollo community.
This infinite nesting depth query will cost a lot of resources for my system, like bandwidth, hardware, performance. Not only server-side, but also client-side.
So, if graphql schema allow circular reference, there should be some middlewares or ways to limit the nesting depth of query. Or, add some constraints for the query.
Maybe not allowing circular references is a better idea?
I prefer sending another query to doing multiple operations in one query. It's much simpler.
Update
I found this library: https://github.com/slicknode/graphql-query-complexity. If GraphQL doesn't limit circular references, this library can protect your application against resource exhaustion and DoS attacks.
It depends.
It's useful to remember that the same solution can be a good pattern in some contexts and an antipattern in others. The value of a solution depends on the context that you use it. — Martin Fowler
It's a valid point that circular references can introduce additional challenges. As you point out, they are a potential security risk in that they enable a malicious user to craft potentially very expensive queries. In my experience, they also make it easier for client teams to inadvertently overfetch data.
On the other hand, circular references allow an added level of flexibility. Running with your example, if we assume the following schema:
type Query {
  user(id: ID): User
  location(id: ID): Location
}
type User {
  id: ID!
  location: Location
}
type Location {
  id: ID!
  user: User
}
it's clear we could potentially make two different queries to fetch effectively the same data:
{
  # query 1
  user(id: ID) {
    id
    location {
      id
    }
  }
  # query 2
  location(id: ID) {
    id
    user {
      id
    }
  }
}
If the primary consumers of your API are one or more client teams working on the same project, this might not matter much. Your front end needs the data it fetches to be of a particular shape and you can design your schema around those needs. If the client always fetches the user, can get the location that way and doesn't need location information outside that context, it might make sense to only have a user query and omit the user field from the Location type. Even if you need a location query, it might still not make sense to expose a user field on it, depending on your client's needs.
On the flip side, imagine your API is consumed by a larger number of clients. Maybe you support multiple platforms, or multiple apps that do different things but share the same API for accessing your data layer. Or maybe you're exposing a public API designed to let third-party apps integrate with your service or product. In these scenarios, your idea of what a client needs is much blurrier. Suddenly, it's more important to expose a wide variety of ways to query the underlying data to satisfy the needs of both current clients and future ones. The same could be said for an API for a single client whose needs are likely to evolve over time.
It's always possible to "flatten" your schema as you suggest and provide additional queries as opposed to implementing relational fields. However, whether doing so is "simpler" for the client depends on the client. The best approach may be to enable each client to choose the data structure that fits their needs.
As with most architectural decisions, there's a trade-off and the right solution for you may not be the same as for another team.
If you do have circular references, all hope is not lost. Some implementations have built-in controls for limiting query depth. GraphQL.js does not, but there are libraries out there like graphql-depth-limit that do just that. It's worth pointing out that breadth can be just as large a problem as depth: regardless of whether you have circular references, you should look into implementing pagination with a max limit when resolving Lists as well, to prevent clients from potentially requesting thousands of records at a time.
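A minimal sketch of the "pagination with a max limit" idea: clamp the client-supplied page size before it reaches the data layer. The `first` argument name follows common GraphQL pagination conventions, and the default and maximum values are arbitrary assumptions.

```typescript
// Hypothetical defaults; pick values that suit your data.
const DEFAULT_PAGE_SIZE = 20;
const MAX_PAGE_SIZE = 100;

// Clamp the client-supplied `first` argument before it reaches the data layer.
function pageSize(first?: number | null): number {
  if (first == null) return DEFAULT_PAGE_SIZE;
  if (!Number.isInteger(first) || first < 1) {
    throw new Error("`first` must be a positive integer");
  }
  return Math.min(first, MAX_PAGE_SIZE);
}
```

A list resolver would pass `pageSize(args.first)` to its data source, so a request for thousands of records is silently capped at the maximum.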
As @DavidMaze points out, in addition to limiting the depth of client queries, you can also use dataloader to mitigate the cost of repeatedly fetching the same record from your data layer. While dataloader is typically used to batch requests to get around the "n+1 problem" that arises from lazily loading associations, it can also help here. In addition to batching, dataloader also caches the loaded records. That means subsequent loads for the same record (inside the same request) don't hit the db but are fetched from memory instead.
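To show why per-request caching helps with cycles, here is a rough sketch of a memoizing loader. It mimics only the caching half of dataloader (no batching) and is not its actual API: `RequestCache` and `fetchOne` are hypothetical names.

```typescript
// Minimal per-request memoizing loader. This mimics only the caching half of
// dataloader (no batching) and is not its actual API.
class RequestCache<K, V> {
  private cache = new Map<K, Promise<V>>();
  private fetchOne: (key: K) => Promise<V>;

  constructor(fetchOne: (key: K) => Promise<V>) {
    this.fetchOne = fetchOne;
  }

  // The first load for a key hits the data layer; later loads reuse the
  // same promise, so a User -> Location -> User cycle fetches each row once.
  load(key: K): Promise<V> {
    let hit = this.cache.get(key);
    if (!hit) {
      hit = this.fetchOne(key);
      this.cache.set(key, hit);
    }
    return hit;
  }
}
```

One instance would be created per incoming request (for example in the context), so cached records never leak between users.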
TLDR; Circular references are an anti-pattern for non-rate-limited GraphQL APIs. APIs with rate limiting can safely use them.
Long Answer: Yes, true circular references are an anti-pattern on smaller/simpler APIs ... but when you get to the point of rate-limiting your API you can use that limiting to "kill two birds with one stone".
A perfect example of this was given in one of the other answers: GitHub's GraphQL API lets you request a repository, with its owner, with their repositories, with their owners ... infinitely ... or so you might think from the schema.
If you look at the API though (https://developer.github.com/v4/object/user/) you'll see their structure isn't directly circular: there are types in-between. For instance, User doesn't reference Repository, it references RepositoryConnection. Now, RepositoryConnection does have a RepositoryEdge, which does have a nodes property of type [Repository] ...
... but when you look at the implementation of the API: https://developer.github.com/v4/guides/resource-limitations/ you'll see that the resolvers behind the types are rate-limited (ie. no more than X nodes per query). This guards both against consumers who request too much (breadth-based issues) and consumers who request infinitely (depth-based issues).
Whenever a user requests a resource, GitHub can allow circular references because it puts the burden of not letting them be circular on the consumer. If the consumer fails, the query fails because of the rate-limiting.
This lets responsible users ask for the user, of the repository, owned by the same user ... if they really need that ... as long as they don't keep asking for the repositories owned by the owner of that repository, owned by ...
Thus, GraphQL APIs have two options:
avoid circular references (I think this is the default "best practice")
allow circular references, but limit the total nodes that can be queried per call, so that infinite circles aren't possible
If you don't want to rate-limit, GitHub's approach of using different types can still give you a clue to a solution.
Let's say you have users and repositories: you need two types for both, a User and UserLink (or UserEdge, UserConnection, UserSummary ... take your pick), and a Repository and RepositoryLink.
Whenever someone requests a user via a root query, you return the User type. But that User type would not have:
repositories: [Repository]
it would have:
repositories: [RepositoryLink]
RepositoryLink would have the same "flat" fields as Repository has, but none of its potentially circular object fields. Instead of owner: User, it would have owner: ID.
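On the server side, the same split could look roughly like this. The field names (`login`, `name`) and the `toRepositoryLink` helper are hypothetical; only the User/Repository/RepositoryLink distinction comes from the answer.

```typescript
interface User {
  id: string;
  login: string;
  repositories: RepositoryLink[];
}

interface Repository {
  id: string;
  name: string;
  owner: User;
}

// Same flat fields as Repository, but the owner is only an ID, so a query
// cannot keep walking owner -> repositories -> owner ...
interface RepositoryLink {
  id: string;
  name: string;
  owner: string;
}

// Used when resolving User.repositories: strip the object reference.
function toRepositoryLink(repo: Repository): RepositoryLink {
  return { id: repo.id, name: repo.name, owner: repo.owner.id };
}
```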
The pattern you show is fairly natural for a "graph" and I don't think it's especially discouraged in GraphQL. The GitHub GraphQL API is the thing I often look at when I wonder "how do people build larger GraphQL APIs", and there are routinely object cycles there: a Repository has a RepositoryOwner, which can be a User, which has a list of repositories.
At least graphql-ruby has a control to limit nesting depth. Apollo doesn't obviously have this control, but you might be able to build a custom data source or use the DataLoader library to avoid repeatedly fetching objects you already have.
The above answers provide good theoretical discussion on the question. I would like to add more practical considerations that occur in software development.
As @daniel-rearden points out, a consequence of circular references is that it allows for multiple query documents to retrieve the same data. In my experience, this is a bad practice because it makes client-side caching of GraphQL requests less predictable and more difficult, since a developer would have to explicitly specify that the documents are returning the same data in a different structure.
Furthermore, in unit testing, it is difficult to generate mock data for objects whose fields/properties contain circular references to the parent. (at least in JS/TS; if there are languages that support this easily out-of-the-box, I'd love to hear it in a comment)
Maintenance of a clear data hierarchy seems to be the clear choice for understandable and maintainable schemas. If a reference to a field's parent is frequently needed, it is perhaps best to build a separate query.
Aside: Truthfully, if it were not for the practical consequences of circular references, I would love to use them. It would be beautiful and amazing to represent data structures as a "mathematically perfect" directed graph.

Am I misusing GraphQL if I must decompose REST data, then re-aggregate it?

We are considering using GraphQL on top of a REST service (using the FHIR standard for medical records).
I understand that the pattern with GraphQL is to aggregate the results of multiple, independent resolvers into the final result. But a FHIR-compliant REST server offers batch endpoints that already aggregate data. Sometimes we’ll need à la carte data (a patient’s age or address only, for example). But quite often, we’ll need most or all of the data available about a particular patient.
So although we can get that kind of plenary data from a single REST call that knits together multiple associations, it seems we will need to fetch it piecewise to do things the GraphQL way.
An optimization could be to eager load and memoize all the associated data anytime any resolver asks for any data. In some cases this would be appropriate while in other cases it would be serious overkill. But discerning when it would be overkill seems impossible given that resolvers should be independent. Also, it seems bloody-minded to undo and then redo something that the REST service is already perfectly capable of doing efficiently.
So:
Is GraphQL the wrong tool when it sits on top of a REST API that can efficiently aggregate data?
If GraphQL is the right tool in this situation, is eager-loading and memoization of associated data appropriate?
If eager-loading and memoization is not the right solution, is there an alternative way to take advantage of the REST service’s ability to aggregate data?
My question is different from this question and this question because neither touches on how to take advantage of another service’s ability to aggregate data.
An alternative approach would be to parse the request inside the resolver for a particular query. The fourth parameter passed to a resolver is an object containing extensive information about the request, including the selection set. You could then await the batched request to your API endpoint based on the requested fields, and finally return the result of the REST call, and let your lower level resolvers handle parsing it into the shape the data was requested in.
Parsing the info object can be a PITA, although there are libraries out there for that, at least in the Node ecosystem.
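As a rough idea of what such parsing involves, here is a sketch that collects the requested field paths from a selection set. The `FieldNode` interface is a minimal local mirror of the part of the AST used here, not the full type from any library; fragments and aliases are ignored, and `requestedPaths` is a hypothetical helper.

```typescript
// Minimal local mirror of the parts of the selection-set AST used here
// (assumption: the real `info` object carries much more, including
// fragments and aliases, which this sketch ignores).
interface FieldNode {
  name: { value: string };
  selectionSet?: { selections: FieldNode[] };
}

// Collect dotted paths of every requested field, e.g. "address.city".
function requestedPaths(node: FieldNode, prefix: string = ""): string[] {
  const paths: string[] = [];
  const selections = node.selectionSet ? node.selectionSet.selections : [];
  for (const child of selections) {
    const path = prefix ? `${prefix}.${child.name.value}` : child.name.value;
    paths.push(path);
    // Recurse so nested selections are included as well.
    for (const nestedPath of requestedPaths(child, path)) {
      paths.push(nestedPath);
    }
  }
  return paths;
}
```

A top-level resolver could use the resulting paths to decide between a single batch call and a narrower REST request, and then let the lower-level resolvers pick their fields out of the returned object.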

Resources