Looking for a faster way to export Resources in HAPI FHIR - hl7-fhir

Following the documentation on the $export API (link), it is possible to export FHIR data, e.g. Patient, Observation, etc. However, this process is very slow (it can take many days) if the Resources to be exported are substantial in size. Querying the database, serializing the Resources, writing them out to NDJSON files, and then communicating the data over the Internet is an expensive process.
Is there any other way to export FHIR data? E.g. a more efficient technique, low-level exporting (a customized program that exports data directly from the database on the server), or perhaps increasing the server capacity (CPU, memory, process priority/multi-threading configuration, etc.)?
Feel free to suggest any solution. The ultimate goal is to have NDJSON files (after the export) in order to ingest them into 3rd-party data warehouses for further analysis.
Environment
HAPI FHIR 6.0.4 REST Server

If you have access to the database, you can write whatever program you want to get the export done, in any format you like - provided you know how to convert the FHIR data. Or maybe you can use existing database tools that can dump data to files.
If you really need to go through the REST API, you can also request the resources with regular searches and write a program that iterates over the result Bundles to get all the data out and into NDJSON files. However, this will lead to a lot of calls and network load. One of the reasons for $export is to prevent clogging the system/network and to keep the server available for other requests while exporting. Maybe after the initial export you can use the $export parameters to export only changed data, which should be quicker than exporting the whole set.
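If you do go the search route, below is a rough sketch of that paging loop using the HAPI FHIR client. It assumes R4; the server URL, resource type and page size are placeholders to adapt to your setup:

```java
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.parser.IParser;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import org.hl7.fhir.r4.model.Bundle;

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NdjsonExport {
    public static void main(String[] args) throws Exception {
        FhirContext ctx = FhirContext.forR4();
        // Placeholder base URL - point this at your HAPI FHIR server
        IGenericClient client = ctx.newRestfulGenericClient("http://localhost:8080/fhir");
        IParser parser = ctx.newJsonParser();

        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("Patient.ndjson"))) {
            // First page of a plain search; bigger pages mean fewer round trips
            Bundle page = client.search()
                    .forResource("Patient")
                    .count(500)
                    .returnBundle(Bundle.class)
                    .execute();

            while (page != null) {
                for (Bundle.BundleEntryComponent entry : page.getEntry()) {
                    // NDJSON: one serialized resource per line
                    out.write(parser.encodeResourceToString(entry.getResource()));
                    out.newLine();
                }
                // Follow the "next" link until the server has no more pages
                page = page.getLink("next") != null
                        ? client.loadPage().next(page).execute()
                        : null;
            }
        }
    }
}
```

The same loop can be repeated per resource type, and for later incremental runs a `_lastUpdated=ge<timestamp>` search parameter (or `_since` on $export) limits the output to changed data.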

Related

Golang Cache HTTP GET Results In Memory

I am working on a CLI in Go that scrapes a webpage to collect the href attributes of all the links on the page into a slice. I want to store this slice in memory for some time so that the scraper is not being called on every execution of the CLI command. Ideally, the scraper would only be called after the cache expires or the user provides some sort of --update flag.
I came across the library go-cache and other similar libraries, but from what I could tell they only work for something that is continuously running, like a server.
I thought about writing the links to a file, but then how would I expire the results after a specific duration? Would it make sense to create a small server in the background that shuts down after a while in order to use a library like go-cache? Any help is appreciated.
There are two main approaches in these scenarios:
1. Create a daemon, service or background application that acts as your data repository. You can run it as an HTTP server / RPC server depending on your requirements. Your CLI application then interacts with this daemon as required.
2. Implement a persistence mechanism that will allow data to be written and read across multiple CLI application executions. You may use normal text files, databases or even an implementation of Go's encoding/gob to write and read your slice (a map would probably be better) to and from a binary file.
You can timestamp entries and remove them after their TTL expires, either by explicitly deleting them or by simply not rewriting them during subsequent executions, according to the strategy/approach selected above.
The scope and number of possible examples for such an open-ended question are too broad to cover in a single answer and will most likely require multiple, more specific questions.
Use a database and store as much detail as you can (fetched_at, host, path, title, meta_desc, anchors, etc.). You'll be able to query the data later, and it will be useful to have it in a structured format. If you don't want to deal with a database dependency, you could embed something like boltdb (pure Go) or sqlite (cgo).

Setting up multiple network layers in Relay Modern

I am working on a React Native app with Relay Modern.
Currently, our app's fetchQuery implementation just does a fetch over the network (as in https://facebook.github.io/relay/docs/en/network-layer.html), although there is also the possibility of a local network layer such as https://github.com/relay-tools/relay-local-schema, which returns data from a local DB like SQLite/Realm.
Is there a way to set up an offline-first response from the local network layer, followed by an automatic request to the real network that also populates the store with fresher data (and writes to the local DB)?
Also should/can they share the same store?
From the requirements of Network.create(), it should return a promise containing the payload; there does not seem to be a way to return multiple values.
Any ideas/help/suggestions are appreciated.
What you are trying to achieve is complex, so I'll go for the easy approach, which is a long-lived cache.
As you may know, Relay Modern uses a local store that is an exact copy of the data you fetch. You can configure this store cache to suit your needs, e.g. no caching on mutations.
The best library around for customising the Relay Modern or Classic network layer is https://github.com/nodkz/react-relay-network-modern
My recommendation: set up your cache and watch your requests (you are going to love it).
See also Thinking in Relay:
https://facebook.github.io/relay/docs/en/thinking-in-relay.html

Live update with Apache Jena

I have a requirement wherein I need to make real-time updates to my ontological data in Jena (around 30 inserts/updates per minute).
I wanted to know if Jena is a good fit for such frequent updates.
If not, is there any other semantic web technology that supports frequent updates?
Also, if I want to insert a lot of resources into my model, is there any way to automatically (sequentially) generate URIs for the new resources?
I can't say for certain - it may well work for you. It will depend on the amount of data in an update, the storage system used, and the capabilities of the machine.
There is no server-side automatic generation of resource names. The Jena library contains a URN UUID generator (for type 1 UUIDs), and Java provides type 4. These can help you generate unique names in the application.
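To illustrate the last point, here is a minimal sketch that mints urn:uuid: URIs with java.util.UUID (type 4) and adds the resources to a Jena model; the label property and values are just placeholders (Jena's own generator could be used instead if type 1 UUIDs are preferred):

```java
import java.util.UUID;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDFS;

public class MintResources {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // Mint a unique URI per new resource in the application, since the
        // server will not generate names for you.
        for (int i = 0; i < 3; i++) {
            String uri = "urn:uuid:" + UUID.randomUUID();
            Resource r = model.createResource(uri);
            r.addProperty(RDFS.label, "auto-generated resource " + i); // placeholder property
        }

        // Print the resulting triples for inspection
        model.write(System.out, "TURTLE");
    }
}
```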

Costs for setting up data integration tool for Salesforce

I'm writing a report and thought you could help by providing the costs of company support for setting up and training a client on a data integrator for Salesforce. For example, if someone wants to use Salesforce but first needs a tool to consolidate and transfer data from back-office systems into Salesforce, how much would that support service cost?
Salesforce actually comes with a very good integration tool called Data Loader. It can be run as an interactive application under Windows or Macintosh, or it can be run as a command-line tool on Windows, Mac or Linux.
In interactive mode, it can import & export CSV files.
In batch mode it can also read data from, and write data to, a database.
For example, I have a Linux server where a daily cron job activates the Data Loader which runs several jobs. Some of these jobs run SQL against a database and upload the resulting data into Salesforce. Other jobs extract from Salesforce (using their SOQL query language, which is SQL-like) and store the information into a database.
Data Loader has a bit of a learning curve for batch mode (mostly around creating some XML configuration files), but the Interactive mode is very easy to use.
So, to answer your question: if it's a one-time data load, just run the interactive version - it's easy. If you want regularly updated data, use batch mode. Support costs for operating the integration are really all in the setup; once it's running, there shouldn't be any ongoing costs unless the data structures change and you want to change the data being transferred. Better yet, if the system is set up by somebody who has done it before, you'll avoid a big learning curve.
If you want a figure to put into your report, allow 3 days for the initial integration (which allows for the learning curve) and then half a day for each additional one. That's generous, but it provides extra time to debug problems.
To some degree, it depends on two factors:
Where is the data's source of truth?
How often do you want to sync the data?
If the answers are "it's a weird place and I only need to sync it once," then you probably want to figure out how to get it in CSV form and then use tools built into Salesforce to import it.
However, if the data lives in a database or data warehouse (postgres, mysql, mongo, redshift, snowflake, big query, etc) and especially if you want to keep Salesforce up to date with that source of truth continuously, then you could look into so-called "Reverse ETL" tools made for this purpose.
Costs depend on the tool chosen and the data volumes and other factors, but here are some options:
Grouparoo is an open source Reverse ETL tool. You can host it yourself for free. Paid plans start at $150/month.
Census is a SaaS Reverse ETL tool. Paid plans start at $300/month.
Hightouch is a SaaS Reverse ETL tool. Paid plans start at $350/month.

Performance problems with external data dependencies

I have an application that talks to several internal and external sources using SOAP, REST services, or database stored procedures. Obviously, performance and stability are major issues that I am dealing with. Even when the endpoints are performing at their best, for large sets of data I easily see calls that take tens of seconds.
So I am trying to improve the performance of my application by prefetching the data and storing it locally, so that at least the read operations are fast.
While my application is the major consumer and producer of this data, some of it can also be changed from outside my application, which I have no control over. If I used caching, I would never know when to invalidate the cache when such data changes outside my application.
So I think my only option is to have a job scheduler running that continually updates the local database. I could prioritize users based on how often they log in and use the application.
I am talking about 50 thousand users and at least 10 endpoints that are terribly slow and can sometimes take a minute for a single call. Would something like Quartz give me the scale I need? And how would I get around the scheduler becoming a single point of failure?
I am just looking for something that doesn't require high maintenance and speeds up at least some of the less complicated subsystems, if not most of them. Any suggestions?
This does sound like you might need a data warehouse. You would update the data warehouse from the various sources on whatever schedule is necessary. However, all the read-only transactions would come from the data warehouse and would not require immediate calls to the various external sources.
This assumes you don't need real-time access to the most up-to-date data. Even if you needed data accurate to within the past hour from a particular source, that would only mean updating from that source every hour.
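Since the question asks about Quartz specifically: a clustered Quartz setup is one way to run these scheduled refreshes without the scheduler becoming a single point of failure. A minimal sketch follows; the job class, endpoint URL and interval are placeholder assumptions, not a prescribed design:

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

// A job that refreshes one slow endpoint's data into the local store/warehouse.
public class RefreshEndpointJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        String endpoint = context.getMergedJobDataMap().getString("endpoint");
        // ... call the SOAP/REST source here and upsert the result locally ...
    }
}

class SchedulerSetup {
    public static void main(String[] args) throws SchedulerException {
        // With a JDBC job store and org.quartz.jobStore.isClustered=true in
        // quartz.properties, several scheduler nodes share the job table, so a
        // surviving node picks up the triggers if one instance dies.
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        JobDetail job = JobBuilder.newJob(RefreshEndpointJob.class)
                .withIdentity("refresh-orders", "prefetch")
                .usingJobData("endpoint", "https://example.internal/orders") // placeholder
                .build();

        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("hourly-refresh", "prefetch")
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInHours(1)
                        .repeatForever())
                .build();

        scheduler.scheduleJob(job, trigger);
    }
}
```

One such job per slow endpoint, each on its own schedule, keeps the refresh work independent of user requests.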
You haven't said what platforms you're using. If you were using SQL Server 2005 or later, I would recommend SQL Server Integration Services (SSIS) for updating the data warehouse. It's made for just this sort of thing.
Of course, depending on your platform choices, there may be alternatives that are more appropriate.
Here are some resources on SSIS and data warehouses. I know you've stated you will not be using Microsoft products. I include these links as a point of reference: these are the products I was talking about above.
SSIS Overview
Typical Uses of Integration Services
SSIS Documentation Portal
Best Practices for Data Warehousing with SQL Server 2008
