Defining a schema that will be compatible between multiple databases and different conventions - graphql

I want to define one schema that will be valid across teams and platforms. This is fairly simple and can be thought of as a kind of ontology. What I need is the ability to define what a field represents and, under it, the name of that field on each platform. I'd also like the schema to be able to generate data objects for each of the target languages, so I'd like to know whether my need can be met with Protobuf or GraphQL. Note: my naming conventions may differ from the conventional ones in the generated target language, since they need to be compatible with the databases. A simple example of what I need:
{
    "lastName": {
        "mssqlName": "LastName",
        "oracleName": "FamilyName",
        "elasticName": "lastName",
        "cassandraName": "last_name",
        "rocksDbName": "surname"
    },
    "age": {
        ...
    }
}
As you can see, on some platforms I have totally different names than on others. I'd like to know the usual ways / technologies to solve this problem, and whether it is possible with codegen-able technologies like Protobuf & GraphQL.
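Neither Protobuf nor GraphQL maps names per backend out of the box (Protobuf custom options can carry such metadata, but you'd still have to write the generator plugin yourself). The data-driven heart of the idea can be sketched in a few lines of Python; all names here are illustrative, not a real library:

```python
# Hypothetical single source of truth: logical field -> per-platform names.
SCHEMA = {
    "lastName": {
        "mssql": "LastName",
        "oracle": "FamilyName",
        "elastic": "lastName",
        "cassandra": "last_name",
        "rocksdb": "surname",
    },
}

def rename_map(platform):
    """Build a logical-name -> platform-name mapping for one backend."""
    return {field: names[platform] for field, names in SCHEMA.items()}

def to_platform(record, platform):
    """Translate a record keyed by logical names into platform column names."""
    mapping = rename_map(platform)
    return {mapping[key]: value for key, value in record.items()}
```

A code generator for each target language would consume the same `SCHEMA` structure, which is what keeps it a single point of truth.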

A single schema as the single point of truth for all object / message definitions across databases, comms links, multiple languages and platforms? It would be nice, wouldn't it?
The closest I can think of is XSD (XML Schema), but I don't think its tooling goes far enough. For example, I know of tools that will take an XSD schema and generate code that will serialise / deserialise objects to / from XML (e.g. Microsoft's xsd.exe). There are even some good ones.
And then there are tools that will create SQL tables from that XSD schema. But a code generator that builds classes to access those tables isn't also building them to serialise / deserialise objects to and from an XML wireformat.
Basically, I've not come across a schema language that has tooling that does everything. The ASN.1 tools are very good at creating serialisation classes, but I've never found one that also targets SQL interactions. Same with XSD.
My knowledge is of course not exhaustive, and there might be something in JSON-land that works.
Minimum Pain Compromise Approach
What I have settled on in the past is to accept that I'm having to do some manual coding around changes in schema, but probably not too much. I'd define messages fully in, say, Google Protocol Buffers, and use that for object exchange between applications / languages. Where I wanted to stash objects in a database, I'd accept that for that I'd be having to have a parallel definition of the object in the table columns, but only for critical fields that I'd want to search on. The last column would be an arbitrary container, able to store the serialised object whole.
For example, if a GPB message had an integer ID field and a string Name field, plus a bunch of other fields, my database table would then have an ID column, a Name column, and a column for storing bytes.
That way I could serialise an object, and push it into a row's Bytes column whilst also filling in the ID and Name columns. I could quickly search for objects, because of the Name / ID columns. If I then wanted access to the other fields in the object stored in the database, I'd have to retrieve the record from the database and deserialise the Bytes column.
This way one is essentially taking a bet that those key columns / field names (ID, Name) won't ever be changed during development of the schema. But it's quite likely a safe bet. Generally, one can settle things like that quite easily early on in a project; it's the rest of the schema that might change during development.
One small payoff is that if the reason to hunt out an object in the database is to be able to send it through a communications channel, it is already serialised in the database. No need to serialise it again before dispatch down the comms link.
So this approach can leave one with some duplication of code / points of truth, but can be quite performant in avoiding a serialisation step during parts of runtime.
You can also cheat a little. If the serialisation wireformat is text based (JSON, XML, some ASN.1 formats, etc), then there's a good chance that string searches on the bytes column will yield good results anyway. For instance, suppose a message field was MiddleName, but I'd not created that as a distinct table column in the database. I could find likely records for any given MiddleName by searching for the value in the Bytes column, as it's stored as text somewhere in there.
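The whole compromise can be sketched with SQLite and JSON standing in for the database and the wireformat (table and field names here are illustrative):

```python
import json
import sqlite3

# Key columns for searching, plus one column holding the whole serialised object.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT, body TEXT)")

obj = {"id": 1, "name": "Smith", "middleName": "Ann", "age": 40}
conn.execute("INSERT INTO objects VALUES (?, ?, ?)",
             (obj["id"], obj["name"], json.dumps(obj)))

# Fast lookup on the indexed key column...
row = conn.execute("SELECT body FROM objects WHERE name = ?", ("Smith",)).fetchone()
full = json.loads(row[0])  # ...then deserialise to reach the other fields.

# The "cheat": a text wireformat lets LIKE find fields with no column of their own.
hits = conn.execute("SELECT id FROM objects WHERE body LIKE ?",
                    ('%"middleName": "Ann"%',)).fetchall()
```

Note that the LIKE trick depends on the wireformat's exact punctuation, so it is a heuristic for narrowing candidates, not a substitute for a real column.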
Reflection Based Approach?
A potential other approach is to accept that the tooling does not exist to satisfy all needs, and adapt using language features (reflection) to exploit a common feature of code generators.
For example, consider GPB's proto compiler. In the generated code you end up with classes whose members are named after the fields in messages. And it'll be more or less the same with any code generated to access a database table that has columns by the same name.
So it is possible to use reflection to make an auto-transcriber between generated classes. You iterate down the tree of members in one class, and you can match that up to a member in a different generated class.
This avoids the need for code like:
Protobuf::MyClass myObj_g; // An object built using GPB
JSON::MyClass myObj_j;     // Equivalent object, to be copied from myObj_g
myObj_j.Field1 = myObj_g.Field1;
myObj_j.Field2 = myObj_g.Field2;
...
Instead:
Protobuf::MyClass myObj_g; // An object built using GPB
JSON::MyClass myObj_j; // equivalent object to be copied from myObj_g;
foreach (Protobuf::MyClass::Reflection::Field field in Protobuf::MyClass.Fields)
{
myObj_j.Reflection.FindByName(field.Name) = myObj_g.Reflection.FindByName(field.Name);
}
There'd be a fair bit of fiddling around to get this to work between each database and serialisation technology, per language, but the point is you'd only ever have to write it once. Any subsequent schema changes do not require code changes, at least not as far as exchanging objects between a serialisation technology and a database access technology is concerned.
Obviously, reflection is easier in some languages than others, and not possible in all.
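A minimal Python sketch of the auto-transcriber, with two hand-written classes standing in for the generated Protobuf and JSON classes:

```python
# Stand-ins for code a Protobuf compiler and a JSON binding would generate;
# what matters is only that the field names match between the two.
class ProtobufMyClass:
    def __init__(self):
        self.Field1 = 0
        self.Field2 = ""

class JsonMyClass:
    def __init__(self):
        self.Field1 = None
        self.Field2 = None

def transcribe(src, dst):
    """Copy every field of src onto the same-named field of dst, by reflection."""
    for name, value in vars(src).items():
        if hasattr(dst, name):
            setattr(dst, name, value)
    return dst
```

Because `transcribe` discovers the fields at runtime, adding a field to both generated classes requires no change to the copying code.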
The Fix It At Runtime Approach?
Apache Avro has the characteristic that serialised data describes its own shape. Basically, wireformat data comes with its own schema, so a consumer can build a representation of the data automatically. In some languages that's horrid (C, C++), but libraries exist.
Basically, it forces you to write applications so that they work out what to do with data for themselves.
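The principle can be illustrated in a few lines, with plain JSON standing in for the real Avro container format (this is not the Avro API, just the self-describing idea):

```python
import json

# A payload that carries its own schema alongside the data, Avro-style.
payload = json.dumps({
    "schema": {"fields": [{"name": "id", "type": "int"},
                          {"name": "surname", "type": "string"}]},
    "data": [1, "Smith"],
})

def decode(raw):
    """Rebuild a field-name -> value mapping purely from the embedded schema,
    with no compile-time knowledge of the message shape."""
    msg = json.loads(raw)
    names = [field["name"] for field in msg["schema"]["fields"]]
    return dict(zip(names, msg["data"]))
```

A consumer written this way keeps working when the producer adds fields, because it learns the shape from each payload rather than from generated code.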

Related

Normalize FHIR bundles data into separate database tables

We get FHIR bundles from a vendor, mostly patient, encounter, observation, flag and a few other resources (10 total). We have the option to store resources as JSON values, or we can come up with a process to normalize all the nested structures into separate tables. We are going to use traditional BI tools to do some analytics and build dashboards, and these tools do not support JSON natively. Should we do the former or the latter, and what is the best/easiest way to build/generate these normalized tables programmatically?
Ultimately how you decide to store these is not part of the scope of FHIR, and any answer you get on here is going to be one person's opinion. You need to figure out what method makes the most sense for the product/business you're building.
Here are some first principles that may help you:
Different vendors will send different FHIR at you. Fields may be missing, and different code systems may be used.
FHIR extensions contain a lot of valuable information, and their JSON representation is an Entity-Attribute-Value (EAV) structure. EAV is an anti-pattern for relational databases.
FHIR versions will change over time: fields will be added or renamed, and new extensions will become relevant.
As far as your second question about generating the tables: I think you will be best served by designing the data model you need and mapping the FHIR data to it. That said, there are a number of open-source FHIR implementations you can study for inspiration.
Modern databases like PostgreSQL, Oracle & MSSQL have good support for the JSON datatype. To flatten FHIR resources for BI you can consider building relational (maybe normalised) views. We built a simple DSL which allows you to describe the destination relation as a set of (FHIR)paths into the resource.
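The path-based flattening idea can be sketched in Python (a toy dotted-path walker, not a real FHIRPath engine; the column mapping is illustrative):

```python
def pluck(resource, path):
    """Walk a dotted path (e.g. 'name.0.family') through nested dicts/lists."""
    node = resource
    for part in path.split("."):
        node = node[int(part)] if part.isdigit() else node.get(part)
        if node is None:
            return None
    return node

# Each destination column is declared as a path into the resource.
COLUMNS = {"id": "id", "family": "name.0.family", "gender": "gender"}

def to_row(resource):
    """Flatten one FHIR resource into a relational row per the column spec."""
    return {col: pluck(resource, path) for col, path in COLUMNS.items()}
```

The same column spec can drive both the view definition and the extraction, so new columns are a configuration change rather than a code change.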

Create subsets for certain Resources to better fit existing data model?

We are trying to implement a FHIR REST server for our application. In our current data model (and thus live data) several FHIR resources are represented by multiple tables, e.g. what would all be Observations are stored in tables for vital values, laboratory values and diagnoses. Each table has an independent, auto-incrementing primary ID, so there are entries with the same ID in different tables. But for GET or DELETE calls to the FHIR server a unique ID is needed. What would be the most sensible way to handle this?
Searching didn't reveal an inherent way of doing this, so I'm considering these two options:
Add a prefix to all (or just the problematic) table IDs, e.g. lab-123 and vit-123
Add a UUID to every table and use that as the logical identifier
Both have drawbacks: an ID parser is necessary for the first one, and the second requires multiple database calls to identify the correct record.
Is there a FHIR way that allows splitting a resource into several sub-resources, even in the REST URL? Ideally I'd get something like GET server:port/Observation/laboratory/123
Server systems will have all sorts of different divisions of data in terms of how data is stored internally. What FHIR does is provide an interface that tries to hide those variations. So Observation/laboratory/123 would be going against what we're trying to do - because every system would have different divisions and it would be very difficult to get interoperability happening.
Either of the options you've proposed could work. I have a slight leaning towards the first option because it doesn't involve changing your persistence layer and it's a relatively straightforward transformation to convert between external/FHIR and internal.
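The first option amounts to a small translation layer at the FHIR boundary; a sketch, with hypothetical prefixes and table names:

```python
# Hypothetical mapping from ID prefix to the internal table it belongs to.
TABLES = {"lab": "laboratory_values", "vit": "vital_values", "dia": "diagnoses"}

def parse_fhir_id(logical_id):
    """Split a prefixed logical ID like 'lab-123' into (table, internal id)."""
    prefix, _, raw = logical_id.partition("-")
    if prefix not in TABLES or not raw.isdigit():
        raise ValueError(f"unknown resource id: {logical_id}")
    return TABLES[prefix], int(raw)

def to_fhir_id(table, row_id):
    """Reverse direction: turn (table, internal id) back into a logical ID."""
    prefix = next(p for p, t in TABLES.items() if t == table)
    return f"{prefix}-{row_id}"
```

Because the mapping is applied only at the REST boundary, the persistence layer and its auto-incrementing IDs stay untouched.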
Is there a FHIR way that allows to split a resource into several
sub-resources, even in the Rest URL? Ideally I'd get something like
GET server:port/Observation/laboratory/123
What would this mean for search? So, what would /Observation?code=xxx search through? Would that search labs, vitals etc. combined, or would you just allow access on /Observation/laboratory?
If these are truly "silos", maybe you could use http://servername/lab/Observation (so swap the last two path parts), which suggests your server has multiple "endpoints" for the different observations. I think more clients will be able to handle that url than the url you suggested.
Best, still, I think, is one of your two other options, of which the first is indeed the easiest to implement.

Multidimensional data types

So I was thinking... Imagine you have to write a program that would represent a schedule of a whole college.
That schedule has several dimensions (e.g.):
time
location
individual(s) attending it
lecturer(s)
subject
You would have to be able to display the schedule from several standpoints:
everything held in one location in a certain timeframe
everything attended by an individual in a certain timeframe
everything lectured by a certain lecturer in a certain timeframe
etc.
How would you save such data, and yet keep the ability to view it from different angles?
Only way I could think of was to save it in every form you might need it:
E.g. you have a folder "students" and in it each student has a file containing when and why and where he has to be. However, you also have a folder "locations", and each location has a file containing who and why and when has to be there. The more angles you have, the more the size-per-info ratio increases.
But that seems highly inefficient, space-wise.
Is there any other way?
My knowledge of Javascript is 0, but I wonder if such things would be possible with it, even in this space inefficient form.
If not that, I wonder if it would work in any other standard (C++, C#, Java, etc.) language, primarily in Java...
EDIT: Could this be done by using MySQL database?
Basically, you are trying to first store data and then present it under different views.
SQL databases were made exactly for that: on one side you build a schema and instantiate it in a database to store your data (that language is called the Data Definition Language, DDL); then you make requests on it with the query language (SQL), which gives you what you call "views". There are even "view" objects in SQL databases to build these views inside the database (rather than having the request code live in the user code).
MySQL can do that for sure; note that it is also possible to compile a SQL engine to Javascript (SQLite, for example) and use local web storage to store the data.
There is another aspect to your question: optimization of the queries. While SQL can do most of the request work for your views, it is sometimes preferable to create actual copies of the request results in so-called "datamarts" (this is called de-normalizing a request), so that the hard work of selecting or computing aggregate/group functions and so on is done once per period of time (imagine that a specific view changes only on Mondays); requesters then just have to read those results. It is important in this case to separate, at least semantically, primary data from secondary data (and for performance/user-rights reasons, physical separation is often a good idea).
Note that as you cited MySQL I wrote about SQL, but almost any database technology could do what you're looking for (hierarchical, object-oriented, XML...), as long as the particular implementation you use is flexible enough for your data and requests.
So in short:
I would use a SQL database to store the data
make appropriate views / requests
if I need huge request performance, make appropriate de-normalized data available
the language is not important there, any will do
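The store-once / view-many approach above can be sketched with SQLite (schema and sample rows invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table holds every scheduled event exactly once.
conn.execute("""CREATE TABLE schedule (
    time TEXT, location TEXT, student TEXT, lecturer TEXT, subject TEXT)""")
conn.executemany("INSERT INTO schedule VALUES (?, ?, ?, ?, ?)", [
    ("Mon 09:00", "Room A", "Alice", "Dr. Jones", "Maths"),
    ("Mon 10:00", "Room B", "Bob",   "Dr. Jones", "Physics"),
])

# One stored view per "angle"; no data is duplicated.
conn.execute("""CREATE VIEW by_lecturer AS
    SELECT lecturer, time, location, subject FROM schedule""")

rows = conn.execute(
    "SELECT time FROM by_lecturer WHERE lecturer = ? ORDER BY time",
    ("Dr. Jones",)).fetchall()
```

Each additional standpoint (by location, by student, ...) is just another view over the same rows, which is exactly what avoids the file-per-angle duplication described in the question.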

A Spring DAO that can adapt to changes in the data

For application developers, I suppose the traditional paradigm for writing an application with domain objects that can be persisted to an underlying data store (an SQL database, for argument's sake) is to write the domain objects and then write (or generate) the table structure. There is a tight coupling between what the domain object looks like and the structure of the underlying data store. So if you want to add a piece of information to your domain object, you add the field to your code and then add a column to the appropriate database table. All familiar?
This is all well and good for data stores that have a well defined structure (I'm mainly talking about SQL databases whereby the tables and columns are pre-defined and fixed), but now a number of alternatives to the ubiquitous SQL database exist and these often do not constrain the data in this way. For instance, MongoDB is a NoSQL database whereby you divide data into collections but aside from that there is no structuring of the data. You don't define new columns when you want to add a new field.
Now to the question: given the flexibility of a data store like MongoDB, how would one go about achieving a similar kind of flexibility in the domain objects that represent this data? For instance, if I'm using Spring and creating my own domain objects, when I add a "middleName" field to my data, how can I avoid having to add a "middleName" field to my domain object? I'm looking for some kind of mechanism/approach/framework to dynamically inspect the data and have access to it in my domain object without having to make a code change every time. All ideas welcome.
I think you have a couple of choices:
You can use a dynamic programming language and not have domain objects (clojure for example)
If you're fixed on using Java, the Mongo Java driver returns data in a DBObject, which is essentially a Map. So the default behaviour already provides what you want. It's only when you map the DBObject into domain objects, using a library like Morphia (or Spring Data), that you have to worry about domain objects at all.
But if I were using Java, I would stick with the standard convention of domain objects mapped via Morphia, because I think adding a field is a very minor inconvenience compared against the benefits.
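The Map-like access described above can also be wrapped so that document fields appear as attributes without ever being declared; a Python sketch, with a plain dict standing in for the Map/DBObject a Mongo driver would return:

```python
class DynamicDomainObject:
    """Expose document fields as attributes without declaring them.
    A plain dict stands in for a fetched Mongo document (illustrative only)."""

    def __init__(self, document):
        self._doc = document

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. for any
        # field that exists in the document but not in the class.
        try:
            return self._doc[name]
        except KeyError:
            raise AttributeError(name)

# A "middleName" field appears in the data; no code change was needed.
user = DynamicDomainObject({"name": "Ann", "middleName": "B"})
```

The trade-off is the one the answers below point out: you gain schema flexibility but lose the compile-time checking and attached behaviour of real domain objects.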
I think the question is inherently paradoxical:
On one hand, you want to have domain objects, i.e. objects that represent the data (and behaviour) of your problem domain.
On the other hand, you say that you don't want your domain objects to be explicitly influenced by changes to the data.
But when you have objects that represent your problem domain, you want to do just that- to represent your problem domain.
So if, for example, a middle name is added, then your representation of the real-life "User" entity should change to accommodate this change to the real-life user; perhaps not only by adding this piece of data to your object, but also by adding some related behaviour (validation of the middle name, or some functionality related to it).
In essence, what I'm trying to say is that when you have (classic OO) domain objects, you may need to change your behaviour / functionality along with your data, and since you don't have any automatic way of changing your behaviour, the question of automatically changing your data becomes irrelevant.
If you don't want behaviour associated with your data, then you essentially have DTOs, and @Kevin's answer is what you're looking for.
Honestly, it sounds more like you're looking for some kind of blackbox DTO where, like you describe, fields are added or removed "arbitrarily" depending on the data. This makes me inclined to suggest a simple Map to do the job. You can't really have a domain-driven design if your domain model is constantly changing.

Recommended data structure for a Data Access layer

I am building a data access layer to a DB; what data structure is recommended for passing and returning a collection?
I use a list of data access objects mapped to the db tables.
I'm not sure what language you're using, but in general, there are tradeoffs of simplicity vs extensibility.
If you return the DataSet directly, you have now coupled yourself to database specific classes. This leaves little room for extension - what if you allow access to files or to other types of data sources? But, it is also very simple. This is the recordset pattern and C#/VB provide a lot of built-in support for this. The GUI layer can access the recordset and easily manipulate the data. This works well for simple applications.
On the other hand, you can wrap the datasets in a custom object and provide gateway methods (see the Gateway pattern, http://martinfowler.com/eaaCatalog/gateway.html). This method is more complex, but provides a lot more extensibility. In a larger application, when you need to separate the business logic, data logic, and GUI logic, this is a more robust way to go.
For larger enterprise applications, you can look into using Object Relational Mapping tools (ORM). They help to automatically map java objects to database tables. They hide a lot of the painful SQL details. Frameworks such as Spring provide excellent support for ORMs.
I tend to use arrays of objects, so that I can disconnect the DAO from the business logic.
You can store the data in the DAO as a dataset, for example, and give callers an easy way to queue additions before doing an update, so they can pass in information for modification operations and then commit all the changes in one shot.
I prefer that the user can't add/modify the structure themselves, as it makes it harder to determine what must be changed in the database.
By initially returning an array they can then display what is in the database.
Then, as the presentation layer makes changes, the DAO can be updated by the controller. By having a loose coupling the entire system becomes more flexible, as you can change the DAO from a dataset to something else, and the rest of the application doesn't care.
There are two choices that are the most generic.
The first way to look at a ResultSet is as a List of Maps, where each Map represents a row in the ResultSet. The keys are the column names listed in the SELECT clause; the values are the database values.
The second way to look at a ResultSet is as a Map of Lists, where each List represents a column in the ResultSet. The Map keys are the column names listed in the SELECT clause; the values are the Lists of database values.
If you don't want to do full-blown ORM, these can carry you a long way.
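Both generic shapes can be built from the same result set; a sketch with illustrative column names and rows:

```python
# Stand-ins for what a database cursor would report and return.
columns = ["id", "name"]
rows = [(1, "Ann"), (2, "Bob")]

# A List of Maps: one dict per row, keyed by column name.
list_of_maps = [dict(zip(columns, row)) for row in rows]

# A Map of Lists: one list per column, keyed by column name.
map_of_lists = {col: [row[i] for row in rows]
                for i, col in enumerate(columns)}
```

The row-wise shape suits record-at-a-time display and editing; the column-wise shape suits aggregation and bulk operations over a single field.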
