I have been working on splitting up a monolithic .NET system into smaller bounded contexts, and thus multiple class libraries. Address records are central to the system, so it's important for otherwise unrelated entities to reach back to a master Address record. Each microservice has its own database with tables related to its bounded context. I've run into a challenge with regard to Address records, or the Address entity. These bounded contexts each need an Address table and the related lookup tables, such as State, Country, etc. Should I duplicate the Address-related tables in each database, and also duplicate the Address-related domain classes? If I duplicate them, I will have a ton of duplicate address records across the databases, and this has me stuck. If I put the Address tables in something like a "Location Service", it will be difficult to display address fields in search results. Maybe this isn't the proper scenario for microservices?
EXAMPLE:
Microservice-1 (product A)
Incident -> one to one -> Address
Microservice-2 (product B)
Fire Hydrant -> one to one -> Address
Microservice-3 (product C)
Inspection -> many to one -> Address
For better illustration, I assume a REST API for each microservice.
As I see it, you have an IncidentService, a FireHydrantService and an InspectionService that all rely on the LocationService.
By following the Microservice approach, you have to be aware of two goals you want to achieve:
Loose coupling
Cohesive behavior
This implies that your services are not allowed to share a database (it could be the same database physically, but not logically), so that a service cannot modify another service's data or access it directly, bypassing the responsible service. This has to be done through the responsible service's API exclusively; otherwise Cohesive behavior would be put at risk.
A RESTful API is a good fit here, because we also want loose coupling.
A Resource could look like this:
"incident": {
"time": "...",
"location" : {
"name": "an arbitrary address"
"href": "apis.yourdomain.com/locationservice/address/1"
"type": "application/json"
},
"nearestHydrant": {
"name": "hydrant 42"
"href": "apis.yourdomain.com/firehydrantservice/hydrant/42"
"type": "application/json"
}
}
As you can see, those entities (or rather, their representations) appear only as links referencing (an entity in) the responsible service. This coupling is loose, because a link is less likely to change over the lifetime of your application than the representation of an entity. By naming the field you also get the semantics of the relation between FireHydrant and Incident.
But things can get complicated very easily. Assume a rule like "a domain object can only be emitted by the responsible service" in order to maintain Cohesive behavior. This means that only the FireHydrantService is allowed to serve FireHydrant entities.
Now imagine a request like "incident at address 1". Which service should be queried for the nearest FireHydrant? Following the above rule, it should be the FireHydrantService. But how would the FireHydrantService know? It has no notion of what "nearest" means.
In such a case there is a thing called a non-authoritative cache (which could also be persisted in a DB). This just means that the LocationService is allowed to store FireHydrants in a cache or in a database, but is not allowed to share them with others. This way the above rule is still followed and Cohesive behavior is preserved. The non-authoritative cache does not need to reflect the full FireHydrant domain object, only whatever is necessary to fulfill the use cases.
So the IncidentService asks the FireHydrantService for the nearest FireHydrant(s). That request gets delegated to the LocationService, which queries its non-authoritative cache for the nearest location of type FireHydrant and responds with a reference (link) to the FireHydrantService itself. The FireHydrantService then checks whether that FireHydrant has had its Inspection and is therefore valid according to the business rules (if not, it checks the next one) and returns a reference (link) to the IncidentService.
My rule of thumb is that every service can be queried on every domain object, but can only serve the domain objects it is responsible for. The "queried on" domain objects can live in the aforementioned non-authoritative cache. This of course leads to duplication, and the interesting part is keeping the copies in sync. A thing that instantly comes to mind is the use of Event Sourcing: a service emits an event like "Hydrant 42 out of order" and the LocationService removes Hydrant 42 from its cache.
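To make the sync idea concrete, here is a minimal TypeScript sketch of what such a subscriber inside the LocationService could look like. The event names, payload shapes, and the in-memory Map (standing in for a real cache or database) are all assumptions for illustration; a real setup would consume these events from a broker:

```typescript
// Hypothetical events emitted by the FireHydrantService.
type HydrantEvent =
  | { type: "HydrantRegistered"; hydrantId: string; lat: number; lon: number }
  | { type: "HydrantOutOfOrder"; hydrantId: string };

// Non-authoritative cache: location data only, never served to other
// services as authoritative FireHydrant domain objects.
const hydrantLocations = new Map<string, { lat: number; lon: number }>();

function onHydrantEvent(event: HydrantEvent): void {
  switch (event.type) {
    case "HydrantRegistered":
      hydrantLocations.set(event.hydrantId, { lat: event.lat, lon: event.lon });
      break;
    case "HydrantOutOfOrder":
      // "Hydrant 42 out of order" -> drop it from the cache.
      hydrantLocations.delete(event.hydrantId);
      break;
  }
}
```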
As I'm from a Java background, I don't know the .NET libraries and bounded contexts mentioned above, but I will try to answer this question in a generic way.
To solve this issue you can store the address-related tables in a separate database that can be accessed by all three microservices.
Since the lookup tables will hold the same data for each microservice, you can share the same lookup tables across all microservices. If the address tables would need different columns per microservice, you can store each microservice's addresses in a separate table; otherwise you can store them all in the same table, with an address_type column distinguishing the microservice.
Related
I want to write some Rest(ful) application with Spring Boot and Spring Data JPA.
Let's assume that for business reasons I have a database with the following tables:
customer(id number, first_name text, last_name text, type text);
customer_type(type text, description text);
where:
id is generated by the database at insertion time
the type column in the customer table is a foreign key to the type column in the customer_type table; it is immutable from the microservice's point of view, just a lookup table.
Assuming I want to create APIs for CRUD operations on a customer, but want to minimize API calls when just reading, I suppose I need the following operations:
GET /customer/{id}
POST /customer
PUT /customer/{id}
DELETE /customer/{id}
How should the body be structured?
For the GET operation the response should be
{
  "id": 123,
  "firstName": "John",
  "lastName": "Doe",
  "customerType": {
    "type": "P",
    "description": "Premium Customer"
  }
}
But for POST I imagine I need to avoid sending the id, and to send just the customer type, since the description is immutable and the client needs it only for displaying the information on screen. This, however, leads to a request body different from the one returned by the GET operation.
For the PUT operation it is the same, but should the id field be sent as well? And how do I handle the case where the id in the API path differs from the id in the request body, if one is sent?
DELETE should not be a problem, since it just deletes the row in the customer table.
Thank you
How should the body be structured?
Let's take a step back first and quickly discuss what you basically attempt when following a REST architecture, and why and how REST installs those mechanisms.
REST is an architectural style that helps decouple clients from servers by introducing indirection mechanisms. These may seem odd at first, but in the end they give you the required level of decoupling, allowing servers to introduce changes which clients will naturally adapt to. Such indirection mechanisms include attaching URIs to link-relation names, using form-based representation formats to tell a client how to create requests, content-type negotiation to return representations supported and understood by others, and so forth. If you don't need such properties, i.e. client and servers always go hand in hand in regard to changes and communicate via predefined messages, REST is probably not the best style to follow. If, though, you have a server that is contacted by various clients not under your control, or a client that has to contact various servers also not under your direct control, this is where REST truly starts to shine, provided all parties adhere to these concepts.
One of REST's premises is that a server will teach clients everything they need to know in order to construct requests. If you look at the Web, where HTML is used basically everywhere, you will see that HTML defines forms, which allow a server to explain to a client which properties of a resource it expects as input. On top of that, the form also tells your client which HTTP operation to use, which target URI to send the request to, and which media type to represent the state in. In HTML this is usually given implicitly as application/x-www-form-urlencoded, which chains properties together like this:
firstName=Roman&lastName=Vottner&role=Dev
or the like. This is in essence what HATEOAS, or hypertext as the engine of application state, is all about. You use the built-in controls of the exchanged media type to allow your client to progress with its task, instead of having to consult external documentation to look up the "API" of some service. E.g. a form could state that an input only allows numeric values, that a sub-portion of the form represents a date/time picker widget which a client could render to a user accordingly, or that an element represents a slider with a given range of admissible values, and the like.
What the actual representation format you have to send to the server looks like depends on the instructed media type. E.g. HAL Forms uses application/json by default and also specifies that application/x-www-form-urlencoded needs to be supported. Other media types have to be negotiated explicitly between client and server. Ion states that application/json or application/ion+json have to be negotiated via the Content-Type request header.
In plain application/json the url-encoded payload from above could simply be expressed as:
{
  "firstName": "Roman",
  "lastName": "Vottner",
  "role": "Dev"
}
and this is OK as the server basically instructed you to send this data in that format.
There are further media types available that are worth a closer look to see whether they fit your needs or not. E.g. Hydra has a bit of a different take on this matter: it connects Linked Data to REST and its affordances (called operations), and allows describing resources and their properties through Linked Data classes. So the presence of an affordance on a certain resource tells you what you can do with that resource, like updating its state, and therefore also which class it belongs to and which properties it has.
This should just illustrate how the negotiated media type finally decides what the actual representation sent to the server needs to look like.
In regard to whether to put resource identifiers in the payload or not: it depends. Usually a resource is identified by its URI/IRI, and this, as a whole, is the identifier of the resource. In your application, though, you will reference related domain objects through their IDs, which do not necessarily need to be, and probably should not be, part of the IRI itself. E.g. let's assume we retrieve a resource that represents an order. That order contains the user's name and address, the various items that were ordered including some metadata describing those items, and what not. It usually makes sense in such a case to add the orderId which you use in your application, even though the URI may contain that information already. Users of that API are usually not interested in those URIs but in the actual content, and might never even see those URIs if they are hidden behind automated processes or user interfaces. If a user now wants to print out that order, they have all the information needed to file a complaint later on via phone, for example. In other cases, e.g. if you design a resource to be an all-purpose, clipboard-like copy&paste location, an ID does not make any sense unless you allow the user to explicitly reference one of those states directly.
The reason why IDs should not be part of the URI itself stems from the fact that a URI shouldn't change if the actual resource does not change. E.g. we have a customer who went through a merger a couple of years ago. They used to expose all their products via their own URIs, which contained the productId as part of the URI. During the merger they tried to combine the various data models to reduce the number of systems they had to operate, while serving each of their customers the same data as before, as the underlying products didn't change. As they tried to stay "backwards" compatible to support their customers' legacy systems, they quickly noticed that exposing those productIds as part of the URI was causing them trouble. If they had earlier used a mapping table of, say, exposed UUIDs to internal productIds (again an introduction of indirection), they could have reduced their whole data model, and thus its complexity, by a lot, changing the mapping from internal productId to UUID on the fly while still allowing their clients to look up the product information.
Long story short, as can hopefully be seen, the structure of a representation depends on the exchanged media type. There are loads of different media types available. Use the ones that allow you to describe resources to clients, such as HAL/HAL Forms, Ion, Hydra, and so on. In regard to URIs: don't overengineer them. They are, as a whole, just a pointer to a resource, and clients are usually interested in the content, not the URI! As such, make use of indirection features like link-relation names, content-type negotiation and so forth to remove the direct coupling of clients to services, and instead rely more on the document type exchanged. The media type here basically becomes the contract of the message. Through mappings on the client and server side, resources in various representations can be "translated" to an object which you can use in your application.
As you've tagged your question with spring-boot and spring-data-jpa, you might want to look into spring-hateoas. It supports HAL out of the box; HAL Forms can be used via affordances, though the media type needs to be enabled explicitly for it, otherwise you might miss out on the form template in the responses. Hydra support in spring-hateoas seems to be added through hydra-java, which implements the Spring HATEOAS SPI. While Amazon provides Ion implementations for various programming languages, including Java, it does not yet support Spring HATEOAS or Spring in general; here a custom SPI implementation may be necessary.
For PUT operations you need to send the id of the entity that you want to update.
If you want to produce the same response as you would get from GET, you need to write a DTO and map the details accordingly.
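As a hedged, framework-agnostic sketch in TypeScript (all names are invented; in Spring this maps onto a @PutMapping handler plus a DTO class), the path id is treated as authoritative and a body id, if present, must match it:

```typescript
interface CustomerTypeDto { type: string; description: string }
interface CustomerDto { id: number; firstName: string; lastName: string; customerType: CustomerTypeDto }
// The PUT body carries no description: that is immutable lookup data.
interface UpdateCustomerRequest { id?: number; firstName: string; lastName: string; type: string }

// Stand-ins for the lookup table and the persistence layer.
const customerTypes: Record<string, CustomerTypeDto> = {
  P: { type: "P", description: "Premium Customer" },
};
function save(c: { id: number; firstName: string; lastName: string; type: string }) { return c; }

function handlePut(pathId: number, body: UpdateCustomerRequest): CustomerDto {
  // Common choice: reject a mismatch with 400 rather than silently picking one.
  if (body.id !== undefined && body.id !== pathId) {
    throw new Error("400 Bad Request: id in path and body do not match");
  }
  const saved = save({ id: pathId, firstName: body.firstName, lastName: body.lastName, type: body.type });
  // Map the entity to the same DTO shape the GET operation returns.
  return {
    id: saved.id,
    firstName: saved.firstName,
    lastName: saved.lastName,
    customerType: customerTypes[saved.type],
  };
}
```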
We've been using Apollo Federation for about 1.5 years as our main API. Behind the federation gateway are six child GraphQL services, which are all combined at the gateway. This configuration works really excellently when you have a result set of data which spans the different services, e.g. a list of tickets which references the user who purchased each one, the event it is associated with, etc.
One place we have experienced this breaking down is when a resolver needs a pre-set of data which is already defined in another child service (or across other child services). There is no way (that we have discovered) to query the federation from a child service, to get a federated set of data for a resolver to work with.
For example, say we have a GraphQL query defined which queries all tickets for an event and, through federation, returns the purchaser's data, the event's data and the product's data. If I need this data set from a resolver, I would need to make all those queries again myself, duplicating dataSource logic and having to match up the data in code.
One crazy thought which came up is to set up an apollo-datasource-rest dataSource that makes queries against our gateway endpoint, as a dataSource for our resolvers. This way we can request the data we need and let Apollo Federation stitch all the data together, as it is designed to do. So instead of the resolver querying the database for all the different pieces of data and then matching them up, we would request the data from our GraphQL gateway, where this query is already defined.
What we are trying to avoid by doing this is having a repeated set of queries in child services to get the details which are already available in (or across) other services.
The question
Is this a really bad idea?
Is it a plausible idea?
Has anyone tried something like this before?
Yes, we would have to ensure that there aren't circular dependencies between the resolvers. In our case I see the "dataSource accessing the gateway" being used to gather initial data in mutations.
Example of a federated query. In this query, event, allocatedTo, purchasedBy, and product are all types in other services: event is an Event type, allocatedTo and purchasedBy are Profile types, and product is a Product type. This query provides me with all the data I would use to, say, send an email notification to the people in the result set. But to get this data from a resolver in a mutation, in order to queue up those emails, I would need to make many queries and align all the data in code myself, instead of using the gateway/federation, which already does this with the established query. The thought behind using apollo-datasource-rest to query our own gateway is to get at this data in this form, not through separate queries and code to align ids etc.
query getRegisteredUsers($eventId: ID!) {
  communications {
    event(eventId: $eventId) {
      registered {
        event {
          name
        }
        isAllocated
        hasCheckedIn
        lastUpdatedAt
        allocatedTo {
          firstName
          lastName
          email
        }
        purchasedBy {
          id
          firstName
          lastName
        }
        product {
          __typename
          ... on Ticket {
            id
            name
          }
        }
      }
    }
  }
}
FYI, I didn't quite understand the question until I looked at your edits, which had some examples.
Is this a really bad idea?
In my experience, yes. Not as an idea, though: you're in good company with other very smart people who have done this.
Is it a plausible idea?
Absolutely it's plausible, but I don't recommend it.
Has anyone tried something like this before?
Yes, but I hope you don't.
Your Question
Having resolvers make requests back to the Gateway:
I do not recommend this. I've seen it happen, and I've personally worked to help companies out of the mess it takes you into. Circular dependencies are going to happen. Latency is just going to skyrocket as you add more and more hops, TLS handshakes, etc. Do orchestration instead. It feels weird to introduce non-GraphQL, but IMO in the end it's way simpler, faster, and more maintainable than where "just talk to the gateway" takes you.
What then?
When you're dealing with mutations which require data from across multiple data sources to process a single thing (like sending a transactional email to a person), you have some choices. Something that helped me figure this out was the question "how would I have done this before GraphQL?"
Orchestration: you have a single "orchestration service", which takes the mutation and makes calls (preferably non-GraphQL, so REST, gRPC, Lambda?) to the owner services to collect the data. The orchestration layer does NOT own data, but it can speak with the other services. It's like Federation, but for sending the data into the request, instead of into the response.
Choreography: you trigger roughly the same thing, but via an event stream. (doesn't work as well with the request / response model of GraphQL)
CQRS (projections): Copies of database data, used for things like reporting. CQRS is basically "the way you read data doesn't have to be the same as the way you write it", and it allows for things like event-sourced data. If all of your data sources actually share the same database, you don't even need "projections" as much as you would just want a read replica. If you're not at enough scale to do replicas, just skip it and promise never to write data that your current domain doesn't own.
What I Do
Where I work, I have gotten us to:
Queries
queries always start with "one database call".
if the "one database call" goes to one domain of data (most often true), that query goes into one service, and Federation fills in the leaves of your tree. If you really follow CQRS, this could go the same way as #3, but we don't.
if your "one database call" needs data from across domains (e.g. get all orders with Product X in it, but sorted by the customer's first name), you need a database projection. Preferably this can be handled by a "reporting service": it doesn't OWN any data, but it READS all data.
Mutations
if your top-level mutation acts only within one domain, the mutation goes into one service, it can use database transactions, and Federation fills in the leaves
if your mutation is required to write across multiple domains and requires immediate consistency (placing an order with inventory, payments, etc), we chose orchestration to write across multiple services (and roll-back when necessary, since we don't have database transactions to do it for us).
if your mutation requires data from many places to send further into the request (like sending an email), we chose orchestration to pull from the multiple services and to push that data down. This feels very much like Federation, but in reverse.
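For the last two cases, here is a hedged sketch of what an orchestration step could look like in TypeScript. Every URL, type, and service name below is an invented placeholder; the point is pulling from the owner services over plain HTTP and pushing the combined data onward:

```typescript
// Hypothetical orchestration for "send event emails": pull from the owning
// services (non-GraphQL calls), combine, then push the result onward.
interface Registration { profileId: string; productId: string }
interface Profile { id: string; firstName: string; email: string }

async function getJson<T>(url: string): Promise<T> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`GET ${url} failed: ${res.status}`);
  return res.json() as Promise<T>;
}

export async function sendEventEmails(eventId: string): Promise<void> {
  // 1. Ask the owning service for the registrations of this event.
  const registrations = await getJson<Registration[]>(
    `http://event-service/events/${eventId}/registrations`);

  // 2. Fan out in parallel to the profile service for the recipients.
  const profiles = await Promise.all(registrations.map(r =>
    getJson<Profile>(`http://profile-service/profiles/${r.profileId}`)));

  // 3. Push the combined data down into the email service.
  await fetch("http://email-service/send", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(profiles.map(p => ({ to: p.email, name: p.firstName, eventId }))),
  });
}
```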
Quick question on foreign keys in microservices. I already tried looking for an answer, but what I found did not give me exactly the answer I was looking for.
Use case: every blog post will have many comments. A traditional monolith would have a comments table with a foreign key to the blog post. However, with microservices we will have two services.
Service 1: Post microservice with these table fields (PostID, Name, Content)
Service 2: Comments microservice with these table fields (CommentID, PostID, Comment)
The question is: do we need "PostID" in service 2 (the comments microservice)? I guess the answer is yes, as we need to know which comment belongs to which post. But doesn't that create tight coupling? I mean, if I delete service 1 (the blog post service), will it impact service 2 (the comments service)?
I'm going to use another example I'm more familiar with to explain how I believe most people would do this.
Consider an Order Management System (OMS) and an Inventory Management System (IMS).
When a customer places an order in the company web site, we ask the OMS to create an order entry in the backend (e.g. via an HTTP endpoint).
The OMS system then broadcasts an event e.g. OrderPlaced containing all the details of the customer order. We may have a pub/sub (e.g. Redis), or a queue (e.g. RabbitMQ), or an event stream (e.g. Kafka) where we place the event (although this can be done in many other ways).
The thing is that we have one or more subscribers interested in this event. One of those could be the IMS, which has the responsibility of assigning the best inventory available every time an order is placed.
We can expect the IMS to keep a copy of the relevant order information it received when it processed the OrderPlaced event, such that it does not have to ask the OMS for every little detail of the order all the time. So, if the IMS needed a join with the order, instead of calling an endpoint in the Order API, it would probably just do a join with its local copy of the orders table.
Say now that our customer called to cancel her order. A customer service representative then cancels it in the OMS web user interface. At that point an OrderCanceled event is broadcast. Guess who is listening for that event? Correct: the IMS receives the notification and acts accordingly, reversing the inventory assignment and probably even deleting the order record because it is no longer necessary in this domain.
So, as you can see, the best way to do this is by using events and making copies of the relevant details on the other domain.
Since events need time to get broadcast and processed by interested parties, we say that the order data in the IMS is eventually consistent.
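A minimal sketch of this flow in TypeScript, with Node's EventEmitter standing in for the broker (Redis, RabbitMQ, Kafka, ...); all event and field names are assumptions for illustration:

```typescript
import { EventEmitter } from "node:events";

// Stand-in for the pub/sub broker.
const broker = new EventEmitter();

// --- OMS side: persist the order, then broadcast the event. ---
function placeOrder(orderId: string, items: { sku: string; qty: number }[]) {
  // ...persist the order in the OMS database...
  broker.emit("OrderPlaced", { orderId, items });
}

// --- IMS side: keep a local, eventually consistent copy of relevant data. ---
const imsOrders = new Map<string, { sku: string; qty: number }[]>();

broker.on("OrderPlaced", ({ orderId, items }: { orderId: string; items: { sku: string; qty: number }[] }) => {
  imsOrders.set(orderId, items); // local copy; no need to ask the OMS later
  // ...assign the best available inventory...
});

broker.on("OrderCanceled", ({ orderId }: { orderId: string }) => {
  imsOrders.delete(orderId); // reverse the assignment, drop the record
});
```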
Followup Questions
Q: So, if I understood right, in microservices we prefer to duplicate data and get better performance? Is that the concept? I mean, I know the concept is scaling and flexibility, but when we must share data we just duplicate it?
Not really. That's definitely not what I meant, although it may have sounded like that due to my poor choice of words in the original explanation. It appears to me that at the heart of your question lies a lack of sufficient understanding of the concept of a bounded context.
In my explanation I meant to indicate that the OMS has a domain concept known as the order, but so does the IMS. Therefore, they both have an entity within their domain that represents it. There is a good chance that the order entity in the OMS is much richer than the corresponding representation of the same concept in the IMS.
For example, if the system I was describing was not for retail, but for wholesale, then the same concept of a "sales order" in our system corresponds to the concept of a "purchase order" in that of our customers. So you see, the same data, mapped under a different name, simply because under a different bounded context the data may have a different perspective and meaning.
So, this is the realization that a given concept from our model may be represented in multiple bounded contexts, perhaps from a different perspective and under different names in each ubiquitous language.
Just to give another example, the OMS needs to know about the customer, but the representation of the idea of a customer in the OMS is probably different than the same representation of such a concept or entity in the CRM. In the OMS the customer's name, email, shipping and billing addresses are probably enough representation of this idea, but for the CRM the customer encompasses much more.
Another example: the IMS needs to know the shipping address of the customer to choose the best inventory (e.g. the one in a facility closest to its final destination), but probably does not care much about the billing address. On the other hand, the billing address is fundamental for the Payment Management System (PMS). So, both the IMS and the PMS may have a concept of an "order"; it is just not exactly the same concept, nor does it have the same meaning or perspective, even if we store the same data.
One final example: the accounting system cares about the inventory for accounting purposes, to be able to tell how much we own, but perhaps accounting does not care about the specific location of the inventory within the warehouse, that's a detail only the IMS cares about.
In conclusion, I would not say this is about "copying data". This is about appropriately representing a fundamental concept within your bounded context, and the realization that some concepts from the model may overlap between systems and have different representations, sometimes even under different names and levels of detail. That's why I suggested that you investigate the idea of context mapping some more.
In other words, from my perspective it would be a mistake to assume that the concept of an "order" only exists in the OMS. I could probably say that the OMS is the master of record for orders, and that if something happens to an order we should let other interested systems know about those events, since they care about some of that data: those other systems may have concepts mapped to orders and, when reacting to changes in the master of record, they probably want to change their own data as well.
From this point of view, copying some data is a side effect of having a proper design for the bounded context and not a goal in itself.
I hope that answers your question.
A GraphQL schema like this:
type User {
  id: ID!
  location: Location
}

type Location {
  id: ID!
  user: User
}
Now the client sends a GraphQL query. Theoretically, User and Location can reference each other circularly, to infinite depth.
I think it's an anti-pattern. As far as I know, there is no middleware or other way to limit the nesting depth of a query, either in GraphQL itself or in the Apollo community.
A query with infinite nesting depth will cost a lot of resources on my system: bandwidth, hardware, performance. Not only server-side, but also client-side.
So, if a GraphQL schema allows circular references, there should be some middleware or other way to limit the nesting depth of a query, or some constraints on the query.
Or maybe disallowing circular references is the better idea?
I prefer sending another query or doing multiple operations in one query. It's much simpler.
Update
I found this library: https://github.com/slicknode/graphql-query-complexity. If GraphQL itself doesn't limit circular references, this library can protect your application against resource exhaustion and DoS attacks.
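For reference, a minimal sketch of how that library can be wired in, based on its documented getComplexity/simpleEstimator helpers (the schema import and the budget of 100 are placeholders):

```typescript
import { parse } from "graphql";
import { getComplexity, simpleEstimator } from "graphql-query-complexity";
import { schema } from "./schema"; // placeholder: your executable schema

// Reject queries whose estimated complexity exceeds a budget.
export function assertQueryWithinBudget(query: string, variables: Record<string, unknown> = {}) {
  const complexity = getComplexity({
    schema,
    query: parse(query),
    variables,
    estimators: [simpleEstimator({ defaultComplexity: 1 })], // 1 point per field
  });
  if (complexity > 100) {
    throw new Error(`Query too complex: ${complexity} > 100`);
  }
}
```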
It depends.
It's useful to remember that the same solution can be a good pattern in some contexts and an antipattern in others. The value of a solution depends on the context that you use it. — Martin Fowler
It's a valid point that circular references can introduce additional challenges. As you point out, they are a potential security risk in that they enable a malicious user to craft potentially very expensive queries. In my experience, they also make it easier for client teams to inadvertently overfetch data.
On the other hand, circular references allow an added level of flexibility. Running with your example, if we assume the following schema:
type Query {
  user(id: ID): User
  location(id: ID): Location
}

type User {
  id: ID!
  location: Location
}

type Location {
  id: ID!
  user: User
}
it's clear we could potentially make two different queries to fetch effectively the same data:
{
  # query 1
  user(id: "1") {
    id
    location {
      id
    }
  }
  # query 2
  location(id: "2") {
    id
    user {
      id
    }
  }
}
If the primary consumers of your API are one or more client teams working on the same project, this might not matter much. Your front end needs the data it fetches to be of a particular shape and you can design your schema around those needs. If the client always fetches the user, can get the location that way and doesn't need location information outside that context, it might make sense to only have a user query and omit the user field from the Location type. Even if you need a location query, it might still not make sense to expose a user field on it, depending on your client's needs.
On the flip side, imagine your API is consumed by a larger number of clients. Maybe you support multiple platforms, or multiple apps that do different things but share the same API for accessing your data layer. Or maybe you're exposing a public API designed to let third-party apps integrate with your service or product. In these scenarios, your idea of what a client needs is much blurrier. Suddenly, it's more important to expose a wide variety of ways to query the underlying data to satisfy the needs of both current clients and future ones. The same could be said for an API for a single client whose needs are likely to evolve over time.
It's always possible to "flatten" your schema as you suggest and provide additional queries as opposed to implementing relational fields. However, whether doing so is "simpler" for the client depends on the client. The best approach may be to enable each client to choose the data structure that fits their needs.
As with most architectural decisions, there's a trade-off and the right solution for you may not be the same as for another team.
If you do have circular references, all hope is not lost. Some implementations have built-in controls for limiting query depth. GraphQL.js does not, but there are libraries out there, like graphql-depth-limit, that do just that. It's worth pointing out that breadth can be just as large a problem as depth: regardless of whether you have circular references, you should also look into implementing pagination with a max limit when resolving Lists, to prevent clients from potentially requesting thousands of records at a time.
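For illustration, graphql-depth-limit is applied as a validation rule; a minimal sketch (the limit of 10 is an arbitrary placeholder, as is the schema import) could look like this:

```typescript
import depthLimit from "graphql-depth-limit";
import { parse, validate, specifiedRules } from "graphql";
import { schema } from "./schema"; // placeholder: your executable schema

// Validate the incoming document with the standard rules plus a depth cap.
export function validateWithDepthCap(query: string) {
  return validate(schema, parse(query), [...specifiedRules, depthLimit(10)]);
}
```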
As @DavidMaze points out, in addition to limiting the depth of client queries, you can also use dataloader to mitigate the cost of repeatedly fetching the same record from your data layer. While dataloader is typically used to batch requests to get around the "n+1 problem" that arises from lazily loading associations, it can help here too. In addition to batching, dataloader caches the loaded records, which means subsequent loads of the same record (inside the same request) don't hit the db but are fetched from memory instead.
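A short sketch of a per-request loader (the batch function and User type are placeholders):

```typescript
import DataLoader from "dataloader";

interface User { id: string; name: string }

// Placeholder batch fetch: one query for all requested ids;
// results must be returned in the same order as `ids`.
async function fetchUsersByIds(ids: readonly string[]): Promise<User[]> {
  // e.g. SELECT * FROM users WHERE id IN (...ids)
  return ids.map(id => ({ id, name: `user-${id}` }));
}

// Create one loader per request so the cache doesn't leak across users.
export function createUserLoader() {
  return new DataLoader<string, User>(fetchUsersByIds);
}

// In a resolver: loader.load(parent.userId) -- repeated loads of the same id
// within the request are batched and then served from the loader's cache.
```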
TLDR; Circular references are an anti-pattern for non-rate-limited GraphQL APIs. APIs with rate limiting can safely use them.
Long Answer: Yes, true circular references are an anti-pattern on smaller/simpler APIs ... but when you get to the point of rate-limiting your API you can use that limiting to "kill two birds with one stone".
A perfect example of this was given in one of the other answers: GitHub's GraphQL API lets you request a repository, with its owner, with their repositories, with their owners ... infinitely ... or so you might think from the schema.
If you look at the API though (https://developer.github.com/v4/object/user/) you'll see their structure isn't directly circular: there are types in-between. For instance, User doesn't reference Repository, it references RepositoryConnection. Now, RepositoryConnection does have a RepositoryEdge, which does have a nodes property of type [Repository] ...
... but when you look at the implementation of the API (https://developer.github.com/v4/guides/resource-limitations/) you'll see that the resolvers behind the types are rate-limited (i.e. no more than X nodes per query). This guards both against consumers who request too much (breadth-based issues) and consumers who request infinitely (depth-based issues).
Whenever a user requests a resource on GitHub, the API can allow circular references because it puts the burden of not letting them be circular onto the consumer. If the consumer fails at that, the query fails because of the rate limiting.
This lets responsible users ask for the user, of the repository, owned by the same user ... if they really need that ... as long as they don't keep asking for the repositories owned by the owner of that repository, owned by ...
Thus, GraphQL APIs have two options:
avoid circular references (I think this is the default "best practice")
allow circular references, but limit the total nodes that can be queried per call, so that infinite circles aren't possible
If you don't want to rate-limit, GraphQL's approach of using different types can still give you a clue to a solution.
Let's say you have users and repositories: you need two types for each, a User and a UserLink (or UserEdge, UserConnection, UserSummary ... take your pick), and a Repository and a RepositoryLink.
Whenever someone requests a user via a root query, you return the User type. But that User type would not have:
repositories: [Repository]
it would have:
repositories: [RepositoryLink]
RepositoryLink would have the same "flat" fields as Repository, but none of its potentially circular object fields. Instead of owner: User, it would have owner: ID.
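Sketched below (SDL wrapped in a TypeScript buildSchema call so it is checkable; the field lists are invented placeholders), the link type carries the owner as a plain ID, so the cycle bottoms out:

```typescript
import { buildSchema } from "graphql";

// "Link" types break the cycle: RepositoryLink points back to its owner by ID
// only, so a query cannot recurse User -> Repository -> User -> ...
export const schema = buildSchema(`
  type User {
    id: ID!
    login: String!
    repositories: [RepositoryLink!]!
  }

  type Repository {
    id: ID!
    name: String!
    owner: User!
  }

  type RepositoryLink {
    id: ID!
    name: String!
    owner: ID!
  }

  type Query {
    user(id: ID!): User
    repository(id: ID!): Repository
  }
`);
```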
The pattern you show is fairly natural for a "graph" and I don't think it's especially discouraged in GraphQL. The GitHub GraphQL API is the thing I often look at when I wonder "how do people build larger GraphQL APIs", and there are routinely object cycles there: a Repository has a RepositoryOwner, which can be a User, which has a list of repositories.
At least graphql-ruby has a control to limit nesting depth. Apollo doesn't obviously have this control, but you might be able to build a custom data source or use the DataLoader library to avoid repeatedly fetching objects you already have.
The answers above provide a good theoretical discussion of the question. I would like to add some more practical considerations that occur in software development.
As @daniel-rearden points out, a consequence of circular references is that they allow multiple query documents to retrieve the same data. In my experience this is a bad practice, because it makes client-side caching of GraphQL requests less predictable and more difficult: a developer would have to explicitly specify that the documents are returning the same data in a different structure.
Furthermore, in unit testing, it is difficult to generate mock data for objects whose fields/properties contain circular references to the parent (at least in JS/TS; if there are languages that support this easily out of the box, I'd love to hear about it in a comment).
Maintaining a clear data hierarchy seems to be the obvious choice for understandable and maintainable schemas. If a reference to a field's parent is frequently needed, it is perhaps best to build a separate query.
Aside: Truthfully, if it were not for the practical consequences of circular references, I would love to use them. It would be beautiful and amazing to represent data structures as a "mathematically perfect" directed graph.
I want to plan a solution that manages enriched data in my architecture.
To be clearer: I have dozens of microservices, let's say Country, Building, Floor, and Worker, each running on top of its own separate NoSQL data store.
When I get data from the Worker service, I also want to present the name of the floor the worker is working on, the building name and the country name.
Solution 1.
The client queries all the microservices.
Problem: multiple requests, and the client has to be aware of the structure.
I know multiple requests shouldn't bother me, but I believe that returning a JSON describing the entity in one single call is better.
Solution 2.
Create an orchestration service that retrieves the data from the multiple services.
Problem: if the data (entity names, for example) is not stored in the same document in the DB, it is very hard to sort and filter by these fields.
Solution 3.
Before saving an entity, e.g. a worker, call all the other services and fill in the related data (building name, country name).
Problem: when the building name changes, the change is not reflected in the worker service.
Solution 4.
(This is the best one I can come up with.)
Create a process that subscribes to a broker and receives all entity changes.
For each change it updates all the relevant entities.
When an entity changes, let's say a building name, it updates all the documents that hold that building name.
Problems:
Each service has to know what can be updated.
A trailing update shouldn't hit the broker again (recursive updates), so this can add complexity to the microservices.
Solution 5.
Keep everything normalized; filter and sort in Elasticsearch.
Problem: keeping normalized data in ES is too expensive performance-wise.
One thing I saw Netflix do (which I like) is create intermediary services for stuff like this. So maybe a new intermediary service that can call the other services to gather all the data, then create the unified output with the Country, Building, Floor, and Worker.
You can even go one step further and try to come up with a scheme for providing as input which resources you want to include in the output.
So I guess this closely matches your solution 2. I notice you mention concerns with sorting/filtering in the DBs for solution 2. I think that if you are using NoSQL then it has to be for a reason, and more often than not the reason is performance. If this was done wrong then yes, you will have problems, but if all the appropriate searchable fields are properly keyed and indexed (as @Roman Susi mentioned in his bullet points 1 and 2) then I don't see this being a problem. This service will only be as fast as the culmination of your other services and data stores, though, so they have to be fast.
Now you keep your individual microservices as they are, keep the client calling one service, and encapsulate the complexity of merging the data into this new service.
This is the video I saw this in (https://www.youtube.com/watch?v=StCrm572aEs). It's a long video, but very informative.
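A hedged sketch of such an intermediary service in TypeScript (every URL and shape is invented; the chain is sequential because each call yields the id for the next):

```typescript
// Hypothetical aggregator: one call in, chained calls out, merged JSON back.
async function getJson<T>(url: string): Promise<T> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`GET ${url} failed: ${res.status}`);
  return res.json() as Promise<T>;
}

export async function getEnrichedWorker(workerId: string) {
  const worker = await getJson<{ id: string; name: string; floorId: string }>(
    `http://worker-service/workers/${workerId}`);
  const floor = await getJson<{ id: string; name: string; buildingId: string }>(
    `http://floor-service/floors/${worker.floorId}`);
  const building = await getJson<{ id: string; name: string; countryId: string }>(
    `http://building-service/buildings/${floor.buildingId}`);
  const country = await getJson<{ id: string; name: string }>(
    `http://country-service/countries/${building.countryId}`);

  // Single unified output for the client.
  return { ...worker, floorName: floor.name, buildingName: building.name, countryName: country.name };
}
```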
It is hard to give advice at the Solution N level, but certain problems can be avoided by following this advice:
Use globally unique identifiers for entities, for example by assigning some kind of URI as key values.
The global ids also simplify updates, because you can track what has actually changed: the name or the entity (an entity has a one-to-one relation with its global URI).
CAP theorem says you can choose only two from CAP. Do you want a CA architecture? Or CP? Or maybe AP? This will strongly affect the way you distribute data.
For "sort and filter" there is MapReduce approach, which can distribute the load of figuring out those things.
Think carefully about the balance of normalization / denormalization. If your services operate on URIs, you can have a service which turns URIs to labels (names, descriptions, etc), but you do not need to keep the redundant information everywhere and update it. Do not do preliminary optimization, but try to keep data normalized as long as possible. This way, worker may not even need the building name but it's global id. And the microservice looks up the metadata from another microservice.
In other words, minimize the number of keys, shared between services, as part of separation of concerns.
Focus on the underlying model, not the JSON going back and forth. Modelling the data in your system(s) correctly gains you more than saving JSON calls.
As for NoSQL, take a look at the Riak database: it has adjustable CAP properties, IIRC. Even if you do not use it as such, reading its documentation may help you come up with a suitable architecture for your distributed microservices system. (Of course, this applies if you have an essentially parallel system.)
First of all, thanks for your question. It is similar to the main problem of document DBs: how do you sort a collection by a field from another collection? I have my own answer for that, so I'll try to comment on all your solutions:
Solution 1: Good if the client wants to work with Countries/Buildings/Floors independently. But it does not solve the problem you mentioned in Solution 2: sorting 10k workers by building is going to be slow.
Solution 2: Similar to Solution 1, if all the client wants is a list of enriched workers without knowing how to combine it from multiple pieces.
Solution 3: As you said, unacceptable because of inconsistent data.
Solution 4: Going to work most of the time. But:
Huge data duplication. If you have 20 entities, you are going to have 20x the data.
Large complexity. 20 entities -> 20 different procedures to update related data.
High coupling. All your services must know about each other; a data-model change will propagate to every service because of the update procedures.
Questionable eventual consistency. It can be done so that data is consistent after failures, but it is not going to be easy.
Solution 5: Kind of the answer :-)
But you do not want everything in there. Keep separate services that serve separate entities, and build other services on top of them.
If the client wants enriched data, build a service that returns enriched data, as in Solution 2.
If the client wants to display a list of enriched data with filtering and sorting, build a service that provides enriched data with filtering and sorting capabilities! Likely, the implementation of such a service will contain an ES instance with cached and indexed data from the lower-level services. The point here is that ES does not have to contain everything or be shared between every service; it is up to you to find the right balance between performance and infrastructure resources.
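A small sketch of that read-side service, using the official Elasticsearch JavaScript client (index name, document shape and sort field are placeholders; sorting on workerName assumes a keyword mapping):

```typescript
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" }); // placeholder node

// Documents are denormalized copies pushed in by the lower-level services.
interface EnrichedWorker {
  workerId: string;
  workerName: string;
  floorName: string;
  buildingName: string;
  countryName: string;
}

// Filter and sort enriched workers without touching the owning services.
export async function searchWorkers(buildingName: string) {
  const result = await es.search<EnrichedWorker>({
    index: "enriched-workers",
    query: { match: { buildingName } },
    sort: [{ workerName: "asc" }],
  });
  return result.hits.hits.map(hit => hit._source);
}
```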
This is a case where Linked Data can help you.
Basically, the floor attribute of the worker would be a URI (a link) to the floor itself, and any other linked data should be expressed as URIs as well.
Modeled with some JSON-LD it would look like this:
worker = {
  '@id': '/workers/87373',
  name: 'John',
  floor: {
    '@id': '/floors/123'
  }
}

floor = {
  '@id': '/floors/123',
  level: 12,
  building: { '@id': '/buildings/87' }
}

building = {
  '@id': '/buildings/87',
  name: "John's home",
  city: { '@id': '/cities/908' }
}
This way all the client has to do is append the base URL (like api.example.com) to the @id and make a simple GET call.
To take the burden of the extra calls off the client (in case it's a slow mobile device), we use the gateway pattern with microservices. The gateway can expand those links with very little effort and augment the returned object. It can also make multiple calls in parallel.
So the gateway will make a GET /floors/123 call and replace the floor object on the worker with the reply.
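A hedged sketch of that link expansion in TypeScript (the base URL and the one-level expansion are illustrative assumptions):

```typescript
// Expand '@id' links one level deep by dereferencing them against the API.
const BASE_URL = "http://api.example.com"; // placeholder

type LinkedObject = { "@id": string } & Record<string, unknown>;

function isLink(value: unknown): value is LinkedObject {
  return typeof value === "object" && value !== null && "@id" in value;
}

export async function expandLinks(resource: LinkedObject): Promise<LinkedObject> {
  const entries = await Promise.all(
    Object.entries(resource).map(async ([key, value]) => {
      if (key !== "@id" && isLink(value)) {
        // e.g. floor: { '@id': '/floors/123' } -> the full floor document
        const res = await fetch(BASE_URL + value["@id"]);
        return [key, await res.json()] as const;
      }
      return [key, value] as const;
    })
  );
  return Object.fromEntries(entries) as LinkedObject;
}
```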