Normalize FHIR bundle data into separate database tables - hl7-fhir

We get FHIR bundles from a vendor, mostly Patient, Encounter, Observation, Flag and a few other resources (10 total). We have the option to store resources as JSON values, or we can come up with a process to normalize all the nested structures into separate tables. We are going to use traditional BI tools to do some analytics and build dashboards, and these tools do not support JSON natively. Should we do the former or the latter, and what is the best/easiest way to build/generate these normalized tables programmatically?

Ultimately how you decide to store these is not part of the scope of FHIR, and any answer you get on here is going to be one person's opinion. You need to figure out what method makes the most sense for the product/business you're building.
Here are some first principles that may help you:
Different vendors will send you different FHIR. Fields may be missing, and different code systems may be used.
FHIR extensions contain a lot of valuable information, and their JSON representation is essentially Entity-Attribute-Value (EAV). EAV is an anti-pattern for relational databases.
FHIR versions will change over time - fields will be added or renamed, and new extensions will become relevant.
As far as your second question about generating the tables - I think you will be best served by designing the data model you need and mapping the FHIR data onto it. That said, there are a number of open-source FHIR implementations you can study for inspiration.

Modern databases like PostgreSQL, Oracle & MSSQL have good support for a JSON datatype. To flatten FHIR resources for BI you can consider building relational (maybe normalised) views. We built a simple DSL which lets you describe a destination relation as a set of (FHIR)paths into the resource.
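To make that concrete, here is a minimal Python sketch of the path-driven flattening idea (not the DSL mentioned above; the column names, paths and sample resource are illustrative assumptions, not your vendor's actual format):

# Describe the destination relation as column -> path, then flatten each resource.
def get_path(resource, path):
    """Walk a dotted path through nested dicts/lists; return None if missing."""
    current = resource
    for part in path.split("."):
        if isinstance(current, list):
            current = current[0] if current else None
        if not isinstance(current, dict):
            return None
        current = current.get(part)
    return current

OBSERVATION_COLUMNS = {
    "id": "id",
    "patient_ref": "subject.reference",
    "code": "code.coding.code",
    "value": "valueQuantity.value",
    "unit": "valueQuantity.unit",
    "effective": "effectiveDateTime",
}

def flatten(resource, columns):
    return {col: get_path(resource, path) for col, path in columns.items()}

obs = {
    "resourceType": "Observation",
    "id": "obs-1",
    "subject": {"reference": "Patient/123"},
    "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4"}]},
    "valueQuantity": {"value": 72, "unit": "beats/min"},
    "effectiveDateTime": "2023-01-01T10:00:00Z",
}

print(flatten(obs, OBSERVATION_COLUMNS))
# {'id': 'obs-1', 'patient_ref': 'Patient/123', 'code': '8867-4', 'value': 72, ...}

Each flattened dict is then just a row for your BI tool's Observation table; repeated elements (e.g. multiple codings) would need their own child tables in a properly normalised model.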

Related

Defining a schema that will be compatible between multiple databases and different conventions

I want to define one schema that will be valid across teams & platforms. This is pretty simple and can be thought of as a kind of ontology. What I need is the ability to define what a field represents and, under it, the name of that field on each platform. I'd like the schema to be able to generate data objects for each of the languages we use, and therefore I'd like to know whether my need can be met with Protobuf or GraphQL. Note - my naming conventions can differ from the defaults of the generated target language, since they need to be compatible with the databases. A simple example of my need:
{
  "lastName": {
    "mssqlName": "LastName",
    "oracleName": "FamilyName",
    "elasticName": "lastName",
    "cassandraName": "last_name",
    "rocksDbName": "surname"
  },
  "age": {
    ...
  }
}
As you can see, on some platforms I have totally different names than on others. I'd like to know what the usual ways/technologies are to solve this problem, and whether it is possible with codegen-able technologies like Proto & GraphQL.
A single schema as the single point of truth for all object / message definitions across databases, comms links, multiple languages and platforms? It would be nice, wouldn't it?
The closest I can think of is XSD (XML Schema), but I don't think the tooling takes it all the way. For example, I know of tools that will take an XSD schema and generate code that serialises / deserialises objects to / from XML (e.g. Microsoft's xsd.exe). There are even some good ones.
And then there are tools that will create SQL tables from that XSD schema. But a code generator that builds classes to access those tables isn't also building them to serialise / deserialise objects to and from an XML wireformat.
Basically, I've not come across a schema language that has tooling that does everything. The ASN.1 tools are very good at creating serialisation classes, but I've never found one that also targets SQL interactions. Same with XSD.
My knowledge is of course not exhaustive, and there might be something in JSON-land that works.
Minimum Pain Compromise Approach
What I have settled on in the past is to accept that I'm going to do some manual coding around changes in schema, but probably not too much. I'd define messages fully in, say, Google Protocol Buffers, and use that for object exchange between applications / languages. Where I wanted to stash objects in a database, I'd accept a parallel definition of the object in the table columns, but only for the critical fields I'd want to search on. The last column would be an arbitrary container, able to store the serialised object whole.
For example, if a GPB message had an integer ID field and a string Name field, plus a bunch of other fields, my database table would have an ID column, a Name column, and a Bytes column.
That way I could serialise an object and push it into a row's Bytes column whilst also filling in the ID and Name columns. I could quickly search for objects because of the Name / ID columns. If I then wanted access to the other fields of an object stored in the database, I'd have to retrieve the record and deserialise the Bytes column.
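As a rough illustration (not the author's code), this is what the "key columns plus serialised blob" table looks like with sqlite3 in Python; JSON stands in for the GPB wireformat so the sketch stays self-contained, but with real protobuf you would store message.SerializeToString() in the blob column:

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT, body BLOB)")

# Store the searchable key fields as real columns, and the whole object as a blob.
obj = {"id": 42, "name": "Alice", "middleName": "Beth", "notes": "lots of other fields"}
conn.execute(
    "INSERT INTO objects (id, name, body) VALUES (?, ?, ?)",
    (obj["id"], obj["name"], json.dumps(obj).encode("utf-8")),
)

# Fast lookup on the indexed key column...
row = conn.execute("SELECT body FROM objects WHERE name = ?", ("Alice",)).fetchone()
# ...then deserialise the blob to reach the rest of the fields.
print(json.loads(row[0])["middleName"])   # Beth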
This way one is essentially taking a bet that those key columns / field names (ID, Name) won't ever be changed during development in the schema. But it's quite likely a safe bet. Generally, one can settle things like that quite easily, early on in a project, it's the rest of the schema that might be changed during development.
One small payoff is that if the reason to hunt out an object in the database is to be able to send it through a communications channel, it is already serialised in the database. No need to serialise it again before dispatch down the comms link.
So this approach can leave one with some duplication of code / points of truth, but can be quite performant in avoiding a serialisation step during parts of runtime.
You can also cheat a little. If the serialisation wireformat is text based (JSON, XML, some ASN.1 formats, etc), then there's a good chance that string searches on the bytes column will yield good results anyway. For instance, suppose a message field was MiddleName, but I'd not created that as a distinct table column in the database. I could find likely records for any given MiddleName by searching for the value in the Bytes column, as it's stored as text somewhere in there.
Reflection Based Approach?
A potential other approach is to accept that the tooling does not exist to satisfy all needs, and adapt using language features (reflection) to exploit a common feature of code generators.
For example, consider GPB's proto compiler. In the generated code you end up with classes whose members are named after the fields in messages. And it'll be more or less the same with any code generated to access a database table that has columns by the same name.
So it is possible to use reflection to make an auto-transcriber between generated classes. You iterate down the tree of members in one class, and you can match that up to a member in a different generated class.
This avoids the need for code like:
Protobuf::MyClass myObj_g; // An object built using GPB
JSON::MyClass myObj_j; // equivalent object to be copied from myObj_g;
myObj_j.Field1 = myObj_g.Field1;
myObj_j.Field2 = myObj_g.Field2;
// ... and so on, for every field
Instead:
Protobuf::MyClass myObj_g; // An object built using GPB
JSON::MyClass myObj_j; // equivalent object to be copied from myObj_g;
foreach (Protobuf::MyClass::Reflection::Field field in Protobuf::MyClass.Fields)
{
myObj_j.Reflection.FindByName(field.Name) = myObj_g.Reflection.FindByName(field.Name);
}
There'd be a fair bit of fiddling around to get this to work between each database and serialisation technology, per language, but the point is you'd only ever have to write it once. Subsequent schema changes do not require code changes, at least not as far as exchanging objects between a serialisation technology and a database access technology is concerned.
Obviously, reflection is easier / possible in some languages and not others.
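For what it's worth, in a language like Python the protobuf descriptor API makes this kind of transcription very short. A minimal sketch, using the well-known Timestamp type merely as a stand-in for any protoc-generated message class (nested and repeated fields are ignored for brevity):

from google.protobuf.timestamp_pb2 import Timestamp  # any generated message works

def message_to_row(message):
    """Copy every field of a protobuf message into a dict, keyed by field name."""
    return {field.name: getattr(message, field.name)
            for field in message.DESCRIPTOR.fields}

ts = Timestamp(seconds=1700000000, nanos=500)
print(message_to_row(ts))   # {'seconds': 1700000000, 'nanos': 500}

The resulting dict can then be pushed into a database row, or copied onto any other generated class whose attributes share the same field names - which is exactly the auto-transcriber idea above.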
The Fix It At Runtime Approach?
Apache Avro has the characteristic that serialised data describes its own shape. Basically, wireformat data comes with its own schema, so a consumer can build a representation of the data automatically. In some languages that's horrid (C, C++), but libraries exist.
Basically, it forces you to write applications so that they work out what to do with the data for themselves.
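A minimal sketch of that self-describing property in Python, assuming the third-party fastavro library (any Avro library exposes something similar):

import io
from fastavro import writer, reader   # pip install fastavro

# A producer somewhere defines a schema and ships records...
schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}
buf = io.BytesIO()
writer(buf, schema, [{"id": 1, "name": "Ada"}])

# ...and the consumer recovers both the schema and the data from the bytes alone.
buf.seek(0)
avro_in = reader(buf)
print(avro_in.writer_schema)      # the schema embedded in the container file
for record in avro_in:
    print(record)                 # {'id': 1, 'name': 'Ada'}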

Salesforce Table Relationships for Business Analyst

I am a business analyst. I use Tableau a lot but have limited knowledge about the back end of Salesforce. The majority of our company's data is stored in Salesforce, and our data team does not support business users in understanding such topics.
In many of my projects, I use the Salesforce connector inside Tableau to extract Salesforce tables, but it requires knowledge of the join relationships among tables. Most of the time I can guess the primary keys correctly, but I still want to learn the data structure systematically and have my data independence.
So, how do I learn the data structure by myself? Or how do I ask the data team specific structure questions so I don't trouble them as much?
Do you have a Salesforce account with the "Customize Application" permission? If you don't have it in production - maybe they'll be willing to promote you to sysadmin in one of the sandboxes.
If you do - Setup -> Schema Builder might be the easiest tool to visualise relations. It's a bit old and Flash-based, but a pretty neat way to model relationships. https://trailhead.salesforce.com/en/content/learn/modules/data_modeling/schema_builder
Another one might be Workbench, http://workbench.developerforce.com/ It's not as neat, but it lets you experiment with metadata & queries and learn which object has which child relationships...
For standard objects, if you have a primary key / foreign key you can use some lookup tables to learn more about the target table. All Account Ids in all SF instances start with 001. Contacts with 003, Users with 005... Combine some blogs like http://www.fishofprey.com/2011/09/obscure-salesforce-object-key-prefixes.html with https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/sforce_api_objects_account.htm and it's a good start. It won't help much with custom objects and fields (specific to your company), but still.
It's a bit "meta", but you can query info about tables and columns too. After all - you might be more comfortable in Tableau ;) Querying Salesforce Object Column Names w/SOQL might give you some hints.
If your job is to build advanced reports off these data sources, I would imagine you need to understand the data structure to some extent. This would mean you need to have authorization to view and access the database table list to get familiar with it and possibly run raw queries to verify data integrity.
If they are not comfortable with you touching the production system, ask for access to a development system which is a copy of production or even just realistic test data.

Create subsets for certain Resources to better fit existing data model?

We are trying to implement a FHIR REST server for our application. In our current data model (and thus live data) several FHIR resources are represented by multiple tables, e.g. what would all be Observations are stored in separate tables for vital values, laboratory values and diagnoses. Each table has an independent, auto-incrementing primary ID, so there are entries with the same ID in different tables. But for GET or DELETE calls to the FHIR server a unique ID is needed. What would be the most sensible way to handle this?
Searching didn't reveal an inherent way of doing this, so I'm considering these two options:
Add a prefix to all (or just the problematic) table IDs, e.g. lab-123 and vit-123
Add a UUID to every table and use that as the logical identifier
Both have drawbacks: an ID parser is necessary for the first one and the second requires multiple database calls to identify the correct record.
Is there a FHIR way that allows splitting a resource into several sub-resources, even in the REST URL? Ideally I'd get something like GET server:port/Observation/laboratory/123
Server systems will have all sorts of different divisions of data in terms of how data is stored internally. What FHIR does is provide an interface that tries to hide those variations. So Observation/laboratory/123 would be going against what we're trying to do - because every system would have different divisions and it would be very difficult to get interoperability happening.
Either of the options you've proposed could work. I have a slight leaning towards the first option because it doesn't involve changing your persistence layer, and it's a relatively straightforward transformation to convert between external/FHIR and internal IDs.
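For illustration, the prefix scheme boils down to something like this (Python sketch; the prefixes and table names are made up):

PREFIX_TO_TABLE = {
    "lab": "laboratory_values",
    "vit": "vital_values",
    "dia": "diagnoses",
}

def parse_fhir_id(logical_id):
    """Map an external id like 'lab-123' back to (internal table, primary key)."""
    prefix, _, local_id = logical_id.partition("-")
    table = PREFIX_TO_TABLE.get(prefix)
    if table is None or not local_id.isdigit():
        raise ValueError(f"Unknown or malformed Observation id: {logical_id}")
    return table, int(local_id)

def make_fhir_id(prefix, local_id):
    """The reverse mapping, used when rendering resources."""
    return f"{prefix}-{local_id}"

print(parse_fhir_id("lab-123"))   # ('laboratory_values', 123)
print(make_fhir_id("vit", 123))   # 'vit-123'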
"Is there a FHIR way that allows splitting a resource into several sub-resources, even in the REST URL? Ideally I'd get something like GET server:port/Observation/laboratory/123"
What would this mean for search? What would /Observation?code=xxx search through? Would that search labs, vitals etc. combined, or would you just allow access via /Observation/laboratory?
If these are truly "silos", maybe you could use http://servername/lab/Observation (so swap the last two path parts), which suggests your server has multiple "endpoints" for the different kinds of observations. I think more clients will be able to handle that URL than the one you suggested.
Still, I think the best choice is one of your two other options, of which the first is indeed the easiest to implement.

Multi-tenant database. One collection or one db per tenant?

For a multi-tenancy architecture for a web application using a document-oriented database I can see two conceivable options:
Having one database per tenant, with collections logically separating the different kinds of objects.
Having one collection per tenant, with all user data stored in one database and some kind of flag or object-type identifier on each record.
Have there been any studies or has any documentation been produced regarding these two options and the differences between them?
Is there a particular standard or good reason why someone designing a web application which allows multiple users to store vastly different kinds of data would choose one over the other?
Aside from speed/efficiency issues, are there any other things to be said about this that would influence the decision?
EDIT I'm aware some of the terminology might be database specific, so for all wondering I am specifically referring to MongoDB.
I wouldn't want tenant-specific collections. In my applications, I usually hard-code collection names, in the same way I'd hard-code table names if I were using SQL tables. There'd be one comments collection that stores all comments for a blog. I would not want to deal with collection names like comments_tenant_1 and comments_tenant_2, because 1) that feels error-prone, 2) it would make the application code more complicated (collection names would have to be replaced with functions that compute the collection name), and 3) the number of collections in a single database could grow huge, which would make listing all collections daunting, and MongoDB isn't built for having very many collections (see the link in the comment below your question, which David B posted, https://docs.mongohq.com/use-cases/multi-tenant.html).
However, database names aren't coupled to application data structures, and you can grant permissions on databases (but not on single collections). So one database per tenant could be reasonable. As could be a per document tenant_id field in a single database for all tenants (see the above-mentioned link).
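A minimal pymongo sketch of the per-document tenant_id approach (database, collection and field names are illustrative):

from pymongo import ASCENDING, MongoClient   # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
db = client["app"]

# One shared collection; every document is stamped with its tenant, and the
# tenant field leads the index so per-tenant queries stay fast.
db.comments.create_index([("tenant_id", ASCENDING), ("post_id", ASCENDING)])
db.comments.insert_one({"tenant_id": "tenant_1", "post_id": 7, "body": "First!"})

# Every read must include the tenant filter so tenants never see each other's data.
for comment in db.comments.find({"tenant_id": "tenant_1", "post_id": 7}):
    print(comment["body"])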

How to stop thinking "relationally"

At work, we recently started a project using CouchDB (a document-oriented database). I've been having a hard time un-learning all of my relational db knowledge.
I was wondering how some of you overcame this obstacle? How did you stop thinking relationally and start thinking documentally (I apologise for making up that word)?
Any suggestions? Helpful hints?
Edit: If it makes any difference, we're using Ruby & CouchPotato to connect to the database.
Edit 2: SO was hassling me to accept an answer. I chose the one that helped me learn the most, I think. However, there's no real "correct" answer, I suppose.
I think, after perusing a couple of pages on this subject, it all depends on the types of data you are dealing with.
RDBMSes represent a top-down approach, where you, the database designer, assert the structure of all data that will exist in the database. You define that a Person has a First, Last and Middle Name and a Home Address, etc. You can enforce this using an RDBMS. If you don't have a column for a Person's HomePlanet, tough luck, wanna-be Person with a HomePlanet other than Earth; you'll have to add a column at a later date or the data can't be stored in the RDBMS. Most programmers make assumptions like this in their apps anyway, so this isn't a dumb thing to assume and enforce. Defining things can be good. But if you need to log additional attributes in the future, you'll have to add them in. The relational model assumes that your data attributes won't change much.
"Cloud" type databases using something like MapReduce, in your case CouchDB, do not make the above assumption, and instead look at data from the bottom-up. Data is input in documents, which could have any number of varying attributes. It assumes that your data, by its very definition, is diverse in the types of attributes it could have. It says, "I just know that I have this document in database Person that has a HomePlanet attribute of "Eternium" and a FirstName of "Lord Nibbler" but no LastName." This model fits webpages: all webpages are a document, but the actual contents/tags/keys of the document vary soo widely that you can't fit them into the rigid structure that the DBMS pontificates from upon high. This is why Google thinks the MapReduce model roxors soxors, because Google's data set is so diverse it needs to build in for ambiguity from the get-go, and due to the massive data sets be able to utilize parallel processing (which MapReduce makes trivial). The document-database model assumes that your data's attributes may/will change a lot or be very diverse with "gaps" and lots of sparsely populated columns that one might find if the data was stored in a relational database. While you could use an RDBMS to store data like this, it would get ugly really fast.
To answer your question then: you can't think "relationally" at all when looking at a database that uses the MapReduce paradigm, because it doesn't actually have enforced relations. It's a conceptual hump you'll just have to get over.
A good article I ran into that compares and contrasts the two databases pretty well is MapReduce: A Major Step Backwards, which argues that MapReduce-paradigm databases are a technological step backwards and inferior to RDBMSes. I have to disagree with the author's thesis and would submit that the database designer simply has to select the right one for his/her situation.
It's all about the data. If you have data which makes most sense relationally, a document store may not be useful. A typical document-based system is a search server: you have a huge data set and want to find a specific item/document, and the document is static, or versioned.
In an archive-type situation, the documents might literally be documents that don't change and have very flexible structures. It doesn't make sense to store their metadata in a relational database, since the documents are all very different and very few would share those tags. Document-based systems don't store null values.
Non-relational/document-like data makes sense when denormalized: it doesn't change much, or you don't care as much about consistency.
If your use case fits a relational model well then it's probably not worth squeezing it into a document model.
Here's a good article about non-relational databases.
Another way of thinking about it is: a document is a row. Everything about a document is in that row, and it is specific to that document. Rows are easy to split on, so scaling is easier.
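A tiny illustration of that point (plain Python dicts with made-up fields):

# Relational shape: the author lives in another table and gets joined in.
author_row  = {"author_id": 9, "name": "Ada"}
comment_row = {"comment_id": 1, "author_id": 9, "body": "Nice post"}

# Document shape: everything about the comment travels inside the one document,
# so it can be read - and sharded - without a join.
comment_doc = {
    "_id": "comment-1",
    "body": "Nice post",
    "author": {"name": "Ada"},
    "post": {"title": "How to stop thinking relationally"},
}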
In CouchDB, like Lotus Notes, you really shouldn't think about a Document as being analogous to a row.
Instead, a Document is a relation (table).
Each document has a number of rows--the field values:
ValueID(PK)    Document ID(FK)   Field Name        Field Value
=============  ================  ================  ===========
92834756293    MyDocument        First Name        Richard
92834756294    MyDocument        States Lived In   TX
92834756295    MyDocument        States Lived In   KY
Each View is a cross-tab query that selects across a massive UNION ALL of every Document.
So, it's still relational, but not in the most intuitive sense, and not in the sense that matters most: good data management practices.
Document-oriented databases do not reject the concept of relations; they just sometimes let applications dereference the links (CouchDB) or even have direct support for relations between documents (MongoDB). What's more important is that DODBs are schema-less. In table-based storage this property can be achieved with significant overhead (see the answer by richardtallent), but here it's done more efficiently. What we really should learn when switching from an RDBMS to a DODB is to forget about tables and start thinking about data. That's what sheepsimulator calls the "bottom-up" approach. It's an ever-evolving schema, not a predefined Procrustean bed. Of course this does not mean that schemata should be completely abandoned in any form. Your application must interpret the data and somehow constrain its form -- this can be done by organizing documents into collections or by making models with validation methods -- but this is now the application's job.
Maybe you should read this:
http://books.couchdb.org/relax/getting-started
I've only just heard of it myself, and it is interesting, but I have no idea how to implement it in a real-world application ;)
One thing you can try is getting a copy of Firefox and Firebug and playing with the map and reduce functions in JavaScript. They're actually quite cool and fun, and they appear to be the basis of how things get done in CouchDB.
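CouchDB views are written in JavaScript, but the shape of the idea is easy to sketch in a few lines of Python: a map step that emits (key, value) pairs per document, and a reduce step that folds each key's values (a toy illustration, not CouchDB's actual API):

from collections import defaultdict

docs = [
    {"type": "comment", "post": "couchdb-intro", "stars": 4},
    {"type": "comment", "post": "couchdb-intro", "stars": 5},
    {"type": "comment", "post": "map-reduce", "stars": 3},
]

def map_fn(doc):
    # emit(key, value) for each document we care about
    if doc.get("type") == "comment":
        yield doc["post"], doc["stars"]

def reduce_fn(values):
    return sum(values)

grouped = defaultdict(list)
for doc in docs:
    for key, value in map_fn(doc):
        grouped[key].append(value)

view = {key: reduce_fn(values) for key, values in grouped.items()}
print(view)   # {'couchdb-intro': 9, 'map-reduce': 3}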
Here's Joel's little article on the subject: http://www.joelonsoftware.com/items/2006/08/01.html
