What is the practical difference between merging and stitching GraphQL schemas? The graphql-tools documentation (merging: https://www.graphql-tools.com/docs/schema-merging, stitching: https://www.graphql-tools.com/docs/schema-stitching/stitch-combining-schemas) is a bit ambiguous about each implementation's exact use cases. If I understood correctly, stitching is just a matter of organizational preference and each subschema becomes a 'proxy' to your schema, while the merge functionality seems pretty similar to me. Could you please explain the difference? Thank you!
Schema stitching is used when you want to retrieve data from multiple GraphQL APIs in the same query (which is basically the motivation behind GraphQL).
For example, you may have to pull data from two GraphQL APIs: one offers information about a location, the other gives information about the weather. To execute a single query that has access to both endpoints, you have to STITCH the two endpoints' schemas, which allows you to perform a query like this (which also shows the link between the two endpoints):
{
  event(id: "5983706debf3140039d1e8b4") {
    title
    description
    url
    location {
      city
      country
      weather {
        summary
        temperature
      }
    }
  }
}
On the other hand, schema merging refers to combining schemas that have been split up, usually by domain, into one schema, mainly for organizational purposes. Unlike stitching, merging does not keep the individual subschemas around as separate units.
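To make the distinction concrete, here is a minimal sketch using graphql-tools. The location/weather type definitions and resolvers are made up for illustration; in practice stitched subschemas are often remote services wrapped with executors:

  import { makeExecutableSchema } from '@graphql-tools/schema';
  import { stitchSchemas } from '@graphql-tools/stitch';
  import { mergeTypeDefs, mergeResolvers } from '@graphql-tools/merge';

  // Two small "services": one for locations, one for weather.
  const locationTypeDefs = `
    type Location { city: String country: String }
    type Query { location(city: String!): Location }
  `;
  const locationResolvers = {
    Query: { location: (_: unknown, { city }: { city: string }) => ({ city, country: 'N/A' }) },
  };

  const weatherTypeDefs = `
    type Weather { summary: String temperature: Float }
    type Query { weather(city: String!): Weather }
  `;
  const weatherResolvers = {
    Query: { weather: () => ({ summary: 'Sunny', temperature: 21.5 }) },
  };

  // Stitching: each subschema stays an intact, executable unit that the gateway
  // schema proxies queries to (remote schemas would be wrapped the same way).
  const gatewaySchema = stitchSchemas({
    subschemas: [
      { schema: makeExecutableSchema({ typeDefs: locationTypeDefs, resolvers: locationResolvers }) },
      { schema: makeExecutableSchema({ typeDefs: weatherTypeDefs, resolvers: weatherResolvers }) },
    ],
  });

  // Merging: the type definitions and resolvers are folded into one schema up
  // front; the individual subschemas are not kept as separate units afterwards.
  const mergedSchema = makeExecutableSchema({
    typeDefs: mergeTypeDefs([locationTypeDefs, weatherTypeDefs]),
    resolvers: mergeResolvers([locationResolvers, weatherResolvers]),
  });

Stitching also lets you add links across subschemas, for example a Location.weather field resolved against the weather subschema, via the typeDefs/resolvers options of stitchSchemas or via type merging; that is what makes the nested query above possible across two separate APIs.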
When embedding data from a document that lives in another collection, what is the best practice for who should be responsible for populating that data in a microservice architecture?
As an example, let's say I have basic information about an organization:
{
  "id": 1,
  "legalName": "Initech"
}
which I want to embed in an invoice like this to avoid doing two service requests to show the invoice:
{
  "type": "Payable",
  "invoiceStatus": "Preparing Preliminary Version",
  "applicablePeriod": {
    "startDateTime": "2020-07-08T00:10:59.618Z",
    "endDateTime": "2020-07-08T00:10:59.618Z"
  },
  "issuedDateTime": "2020-07-08T00:10:59.618Z",
  "issuingOrganization": {
    "id": 1,
    "legalName": "Initech"
  }
}
Would it be the caller's responsibility to supply the data while creating/updating the invoice or would it be the invoice service that would retrieve the external data using the organization id and then embed the data as necessary?
I feel like I should avoid cross-service dependencies in the backend as much as possible. I understand that maintenance of the embedded data could be handled through the change feed, but I was wondering about the initial population of the embedded data.
Did you get an answer back on this? I at least wanted to provide an answer to serve as general guidance. It comes down to the state of the data. Please see the following document, which covers this specific topic in greater detail: Data modeling in Azure Cosmos DB
In general, use embedded data models when (link):
There are contained relationships between entities.
There are one-to-few relationships between entities.
There is embedded data that changes infrequently.
There is embedded data that will not grow without bound.
There is embedded data that is queried frequently together.
Embedding data works nicely for many cases, but there are scenarios where denormalizing your data will cause more problems than it is worth. So what do we do now?
When to reference (link):
In general, use normalized data models when:
Representing one-to-many relationships.
Representing many-to-many relationships.
Related data changes frequently.
Referenced data could be unbounded.
Hybrid data models (link).
We've now looked at embedding (or denormalizing) and referencing (or normalizing) data; each has its upsides and each has compromises, as we have seen.
It doesn't always have to be either/or; don't be scared to mix things up a little.
Based on your application's specific usage patterns and workloads there may be cases where mixing embedded and referenced data makes sense and could lead to simpler application logic with fewer server round trips while still maintaining a good level of performance.
So, with the above data modeling information in mind, the other half of the equation is identifying microservice boundaries and designing the microservices architecture. But in a simpler scenario like this, the invoice service would perform the update to the root document, either by embedding the invoice or by linking to it.
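For the concrete question of who populates the embedded organization data, a common pattern is to let the invoice service resolve it at write time. Here is a rough sketch of that idea; organizationClient and invoicesCollection are hypothetical stand-ins for your actual service client and data store:

  interface OrganizationSnapshot {
    id: number;
    legalName: string;
  }

  interface CreateInvoiceRequest {
    type: string;
    invoiceStatus: string;
    issuingOrganizationId: number;
  }

  async function createInvoice(
    req: CreateInvoiceRequest,
    organizationClient: { getById(id: number): Promise<OrganizationSnapshot> },
    invoicesCollection: { insertOne(doc: object): Promise<unknown> }
  ) {
    // The invoice service, not the caller, owns the lookup: callers only pass the id.
    const org = await organizationClient.getById(req.issuingOrganizationId);

    const invoice = {
      type: req.type,
      invoiceStatus: req.invoiceStatus,
      issuedDateTime: new Date().toISOString(),
      // Embed only the small, stable fields needed to render the invoice.
      issuingOrganization: { id: org.id, legalName: org.legalName },
    };

    await invoicesCollection.insertOne(invoice);
    return invoice;
  }

The caller only supplies the organization id; the embedded snapshot can then be kept up to date later through the change feed, as you mentioned.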
We get FHIR bundles from a vendor, mostly Patient, Encounter, Observation, Flag and a few other resources (10 total). We have the option to store the resources as JSON values, or we can come up with a process to normalize all the nested structures into separate tables. We are going to use traditional BI tools to do some analytics and build some dashboards, and these tools do not support JSON natively. Should we do the former or the latter, and what is the best/easiest way to build/generate these normalized tables programmatically?
Ultimately how you decide to store these is not part of the scope of FHIR, and any answer you get on here is going to be one person's opinion. You need to figure out what method makes the most sense for the product/business you're building.
Here are some first principles that may help you:
Different vendors will send different FHIR your way. Fields may be missing, and different code systems may be used.
FHIR extensions contain a lot of valuable information, and their JSON representation is essentially Entity-Attribute-Value (EAV). EAV is an anti-pattern for relational databases.
FHIR versions will change over time: fields will be added or renamed, and new extensions will become relevant.
As for your second question about generating the tables: I think you will be best served by designing the data model you need and mapping the FHIR data to it. That said, there are a number of open-source FHIR implementations you can study for inspiration.
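As a rough illustration of that "design your model, then map FHIR into it" approach, here is a sketch that flattens an Observation stored as JSON into a BI-friendly row. The chosen columns are assumptions; real mappings depend on the profiles and extensions your vendors actually send:

  interface ObservationRow {
    id: string;
    patientId: string | null;
    code: string | null;
    codeSystem: string | null;
    valueQuantity: number | null;
    unit: string | null;
    effectiveDateTime: string | null;
  }

  function flattenObservation(resource: any): ObservationRow {
    const coding = resource.code?.coding?.[0] ?? {};
    return {
      id: resource.id,
      // "Patient/123" -> "123"; missing subjects become null instead of failing.
      patientId: resource.subject?.reference?.split('/')?.[1] ?? null,
      code: coding.code ?? null,
      codeSystem: coding.system ?? null,
      valueQuantity: resource.valueQuantity?.value ?? null,
      unit: resource.valueQuantity?.unit ?? null,
      effectiveDateTime: resource.effectiveDateTime ?? null,
    };
  }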
Modern databases like PostgreSQL, Oracle and MSSQL have good support for the JSON datatype. To flatten FHIR resources for BI, you can consider building relational (possibly normalized) views. We built a simple DSL that lets you describe a destination relation as a set of FHIRPath expressions against the resource.
I read the GraphQL spec and could not find a way to avoid 1 + N * number_of_nested calls. Am I missing something?
For example, a query has a type client which has nested orders and addresses; if there are 10 clients, it will do 1 call for the 10 clients + 10 calls for client.orders + 10 calls for client.addresses.
Is there a way to avoid this? Note that this is not the same as caching a UUID of something; those are all different values, and if your GraphQL server points to a database that can do joins, this would be pretty hard on it, because you could instead do 3 queries for any number of clients.
I ask this because I wanted to integrate GraphQL with an API that can fetch nested resources efficiently, and if there were a way to analyze the whole query before resolving it, it would be nice to try to fetch some of the nested data in just one call.
Or did I get it wrong, and GraphQL is meant to be used only with microservices?
This is one of the difficulties of GraphQL's "resolver architecture". You must avoid incurring a ton of network latency by doing a lot of I/O in each resolver. Apps using a SQL DBMS will often grapple with the N + 1 problem at first. You need to use some batching and/or caching techniques to get around this.
If you are using Node.js on the server, I have two tools to recommend:
DataLoader - A database-agnostic tool for batching resolvers for each field and caching individual records.
Join Monster - A SQL-tailored tool that reads each query and your schema and compiles a SQL query for you. It leverages JOINs and DataLoader-style batching to fetch the data from your tables in a few (or a single) SQL queries.
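For instance, here is a minimal sketch of the DataLoader approach for the client.orders case from the question; ordersByClientId is a hypothetical helper that runs a single batched database query:

  import DataLoader from 'dataloader';

  interface Order { id: string; clientId: string }

  // Hypothetical data-access helper: one database round trip for many clients,
  // e.g. SELECT * FROM orders WHERE client_id IN (...).
  declare function ordersByClientId(clientIds: readonly string[]): Promise<Order[]>;

  const ordersLoader = new DataLoader<string, Order[]>(async (clientIds) => {
    const rows = await ordersByClientId(clientIds);
    // DataLoader expects one result entry per key, in the same order as the keys.
    return clientIds.map((id) => rows.filter((order) => order.clientId === id));
  });

  const resolvers = {
    Client: {
      // Every Client.orders resolver in the same tick just calls load(); DataLoader
      // collects the ids and invokes the batch function once for the whole set.
      orders: (client: { id: string }) => ordersLoader.load(client.id),
    },
  };

In a real server you would typically create a fresh loader per request (for example in the GraphQL context) so that its cache is not shared between users.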
I assume you're talking about using GraphQL with a SQL database backend. The standard itself is database agnostic, and it doesn't care how you work out possible N+1 SELECT issues in your code. That being said, specific server-side GraphQL implementations introduce many different ways of mitigating that problem:
AFAIK, the Ruby implementation is able to make use of Active Record and gems such as bullet to apply horizontal batching of executed database calls.
The JavaScript implementation may make use of the DataLoader library, which has a similar technique of batching series of executed promises together. You can see it in action here.
The Elixir and Python implementations have a concept of runtime info about executed subqueries, which can be used to determine which data will be needed further on to execute the GraphQL query, and potentially prefetch it.
The F# implementation works similarly to Elixir, but the plugin itself can perform live analysis of the execution tree to better describe which fields may be used in code, allowing an easier split of the GraphQL domain model from the database model.
Many implementations (e.g. PostGraph) tie the underlying database model directly into the GraphQL schema. In this case the GraphQL query is often translated directly into the database query language.
We are trying to implement a FHIR REST server for our application. In our current data model (and thus live data) several FHIR resources are represented by multiple tables, e.g. everything that would be an Observation is stored in tables for vital values, laboratory values and diagnoses. Each table has an independent, auto-incrementing primary ID, so there are entries with the same ID in different tables. But for GET or DELETE calls to the FHIR server a unique ID is needed. What would be the most sensible way to handle this?
Searching didn't reveal an inherent way of doing this, so I'm considering these two options:
Add a prefix to all (or just the problematic) table IDs, e.g. lab-123 and vit-123
Add a UUID to every table and use that as the logical identifier
Both have drawbacks: an ID parser is necessary for the first one and the second requires multiple database calls to identify the correct record.
Is there a FHIR way that allows splitting a resource into several sub-resources, even in the REST URL? Ideally I'd get something like GET server:port/Observation/laboratory/123
Server systems will have all sorts of different divisions of data in terms of how data is stored internally. What FHIR does is provide an interface that tries to hide those variations. So Observation/laboratory/123 would be going against what we're trying to do - because every system would have different divisions and it would be very difficult to get interoperability happening.
Either of the options you've proposed could work. I have a slight leaning towards the first option because it doesn't involve changing your persistence layer and it's a relatively straightforward transformation to convert between the external/FHIR id and the internal one.
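A sketch of what that transformation might look like; the prefixes and table names are invented for illustration:

  const prefixToTable = {
    lab: 'laboratory_values',
    vit: 'vital_values',
    dia: 'diagnoses',
  } as const;

  // FHIR-facing ids like "lab-123" are parsed back into the internal table plus
  // its auto-increment id, and composed again on the way out.
  function parseObservationId(fhirId: string): { table: string; internalId: number } {
    const [prefix, rawId] = fhirId.split('-');
    const table = prefixToTable[prefix as keyof typeof prefixToTable];
    if (!table || !/^\d+$/.test(rawId ?? '')) {
      throw new Error(`Unknown or malformed Observation id: ${fhirId}`);
    }
    return { table, internalId: Number(rawId) };
  }

  function toFhirId(prefix: keyof typeof prefixToTable, internalId: number): string {
    return `${prefix}-${internalId}`;
  }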
Is there a FHIR way that allows to split a resource into several sub-resources, even in the Rest URL? Ideally I'd get something like GET server:port/Observation/laboratory/123
What would this mean for search? So, what would /Observation?code=xxx search through? Would that search labs, vitals etc. combined, or would you just allow access on /Observation/laboratory?
If these are truly "silos", maybe you could use http://servername/lab/Observation (so swap the last two path parts), which suggests your server has multiple "endpoints" for the different observations. I think more clients will be able to handle that URL than the URL you suggested.
Still, I think the best approach is one of your two original options, of which the first is indeed the easiest to implement.
For a multi-tenancy architecture for a web application using a document-oriented database I can see two conceivable options:
Having one database per tenant, and the collections logically separate different kinds of object.
Having one collection per tenant, and all user data is stored in one database, with some kind of flag or object type identifier on each record.
Have there been any studies or has any documentation been produced regarding these two options and the differences between them?
Is there a particular standard or good reason why someone designing a web application which allows multiple users to store vastly different kinds of data would choose one over the other?
Aside from speed/efficiency issues, are there any other things to be said about this that would influence the decision?
EDIT: I'm aware some of the terminology might be database-specific, so for anyone wondering, I am specifically referring to MongoDB.
I wouldn't want tenant-specific collections. In my application, I usually hard-code collection names, in the same way as I'd hard-code table names if I were using SQL tables. There'd be one comments collection that stores all comments for a blog. I would not want to deal with collection names like comments_tenant_1 and comments_tenant_2, because 1) that feels error-prone, 2) it would make the application code more complicated (collection names would have to be replaced with functions that compute the collection name), and 3) the number of collections in a single database could grow huge, which would make a list of all collections look daunting, and MongoDB isn't built for having very many collections (see the link in the comment below your question, which David B posted: https://docs.mongohq.com/use-cases/multi-tenant.html).
However, database names aren't coupled to application data structures, and you can grant permissions on databases (but not on single collections). So one database per tenant could be reasonable, as could a per-document tenant_id field in a single database shared by all tenants (see the above-mentioned link).
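To illustrate the two shapes side by side, here is a small sketch with the official Node.js MongoDB driver; the database and collection names are placeholders:

  import { MongoClient } from 'mongodb';

  const client = new MongoClient('mongodb://localhost:27017');

  // Option 1: one database per tenant; collection names stay fixed.
  async function commentsForTenantDb(tenantId: string) {
    return client.db(`tenant_${tenantId}`).collection('comments').find({}).toArray();
  }

  // Option 2: one shared database; every document carries a tenantId field that
  // every query must filter on.
  async function commentsForSharedDb(tenantId: string) {
    return client.db('app').collection('comments').find({ tenantId }).toArray();
  }

With the shared-database variant, the tenantId filter is usually paired with a compound index that leads with tenantId so tenant-scoped queries stay efficient.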