Given a Resource such as DeviceObservationReport, a number of fields have cardinality 0..many. In some cases these contain reference(s) to other Resource(s) which may also have cardinality 0..many. I am having considerable difficulty in deciding how to support 'chained' queries over referenced Resources which may be two or three steps 'deep' (for want of a better term).
For example, in a single DeviceObservationReport there may be multiple Observation Resource references. It is entirely probable that a client may wish to perform a query requesting all instances of an Observation with a specific code which have a timestamp (appliesDate) later than a specific instant in time. The named Search Parameter observation would appear to be the obvious starting point, and the Path to the observation is specified as virtualDevice.channel.metric.observation. Given that the virtualDevice, channel, and metric fields have cardinality 0..*, would a 'simple' query to retrieve all DeviceObservationReport instances containing observations with code TESTCODE that were observed later than 14:00 on 10 October 2014 look something like:
../data/DeviceObservationReport?virtualDevice[0].channel[0].metric[0].observation.name=TESTCODE&virtualDevice[0].channel[0].metric[0].observation.date>2014-10-10%2014:00
Secondly, if the client requests that the result set be sorted on date, how would that be expressed in the query? From the various attempts I have made to implement this, at this point support for the query becomes rather more complex, and thus far I have not been able to come up with a satisfactory solution.
Firstly, the path for the parameter is the path within the resource, and chaining links between the defined search parameter names. So your query would look like this:
../data/DeviceObservationReport?observation.name=TESTCODE&observation.date=>2014-10-10%2014:00
i.e. the search parameters are aliases within the resource. However, the problem with this search is that the parameters are ANDed at the root, not the leaf - which means this finds all device observation reports that have an observation with code TESTCODE, and that have an observation with a date later than DATE, which is subtly different from what you probably want: all device observation reports that have a single observation with both code TESTCODE and a date later than DATE. This will be addressed in the next major release of FHIR.
Sorting is tough with a chained query. I ended up extracting the field that I sort by, but not actually sorting by it - I insert the raw matches into a holding table, and then sort by the sort field as I access that secondary table. The principal reason for this is to make paging robust against ongoing changes to the resources.
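As an illustration only, the holding-table idea might look roughly like this in SQL (the table and column names are invented for the example):

-- Hypothetical holding table: one row per raw match, plus the extracted sort key.
CREATE TABLE search_result_holding (
    search_id    VARCHAR(64) NOT NULL,   -- identifies one client's search
    resource_id  VARCHAR(64) NOT NULL,   -- matched DeviceObservationReport id
    sort_value   TIMESTAMP,              -- extracted observation date
    PRIMARY KEY (search_id, resource_id)
);

-- Pages are then served from this snapshot, so paging stays stable even if
-- the underlying resources change between page requests.
SELECT resource_id
FROM search_result_holding
WHERE search_id = :search_id            -- bind parameter for this search
ORDER BY sort_value
OFFSET 50 ROWS FETCH NEXT 50 ROWS ONLY;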
I'm trying to determine whether it's my usage of Entity Framework that's at fault, or whether I just can't get the performance I want using EF.
I have a structure where "Transactions" have a collection (usually 1-5 items) of "Actions" which in turn have a collection of "Responses" (usually 2-6 items). In most cases, I care about Transactions where any of the Actions meet the specific search criteria, with the most recent "Response" (or lack thereof) determining the status, which is what we're querying based upon.
So, most of my queries (IQueryable, built dynamically) look something like:
Context.Transactions.AsNoTracking().Where(w => w.Actions.Any(action => action.Responses.OrderByDescending(o => o.Id).FirstOrDefault().Type == [Search Type]) || **(1-5 similar filters)**)
I'm executing the query 3 times... 1) to get the total result count, 2) to get distinct values of a few fields for the entire result set to populate some search criteria in the UI, and 3) to grab the full view model of 25-100 results to display. Maybe there's a better way to do this?
I tried querying against the "Transaction" table directly (as above) and alternatively projecting to an intermediate model that does my ordering and finds the "last response type" before applying the various filters. That looks something like:
Context.Transactions.AsNoTracking().ToSummaryModel().Where(w => w.Actions.Any(action => action.LastResponseType == [search type] || ...))
... where the "ToSummaryModel" projects a condensed version of Actions with the latest Response type from the Responses sub-collection.
In the most frequent use cases, we'll be looking at about 8,000-12,000 Transaction records, with occasional queries of 100,000+.
In expected use cases, the results are taking 15-20 seconds to execute, with outlier cases taking a minute+. Can I approach anything differently to potentially cut execution down to 25-30% (or better) of current?
Without seeing the whole query (which may be impractical) it will be hard to narrow down an exact cause, but I would say a 20-second execution time for a structure/data volume like that, let alone a minute+, is certainly unreasonable. A few things to check:
When it comes to optional filters, you can either handle the optional nature (i.e. a null check) inside the expression, or outside the expression with an `if (filter != null)` check. In almost all cases it is better for performance to perform the null check in code and only append the supplied filters, rather than encoding the checks into the expressions.
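For example, a rough sketch of the "append only the supplied filters" approach (the variable and property names here are assumed, not taken from your model):

// Start from the base query and append only the filters that were supplied.
IQueryable<Transaction> query = Context.Transactions.AsNoTracking();

if (!string.IsNullOrEmpty(searchType))
{
    query = query.Where(t => t.Actions.Any(a =>
        a.Responses.OrderByDescending(r => r.Id)
                   .Select(r => r.Type)
                   .FirstOrDefault() == searchType));
}

if (fromDate.HasValue)
{
    query = query.Where(t => t.CreatedOn >= fromDate.Value);   // hypothetical column
}

// ...rather than baking the null checks into a single expression, e.g.
// .Where(t => searchType == null || t.Actions.Any(...))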
Finding "latest" children is bound to be expensive. When considering computed values like this, I would consider opting for a database View to formulate the "search" aspect of locating these transactions where-by once a transaction is selected, loading that singular transaction and related entities from the respective tables rather than searching based on the table relationships. Alternatively this may be a case to justify denormalizing the latest action reference on the transaction.
However, prior to that it can help to record any/all SQL queries with a profiler on the database to build a picture of what EF is doing to get your data. Running those queries with an execution plan (SQL Server) could reveal missing indexes that would make a big difference. Looking at the queries being run might also reveal additional issues, such as lazy loads creeping in after one or more of the queries has run. Digging further would require seeing the exact LINQ expressions being built, because each situation is unique when it comes to performance issues.
I'm trying to figure out my options for a large query which is taking a somewhat long but sensible amount of time considering what it does. It has many joins and can be filtered by up to a predefined number of parameters. Some of these parameter values are predefined (select box) while some come from a free-form text box (unfortunately LIKE with prefixed and suffixed wildcards). The result sets returned are large and the filter options are very likely to change frequently. The order of the result set is also controlled by the user. Additionally, user access must be restricted to only the results the user is authorized to see. This authorization is handled as part of a baseline WHERE clause which is applied regardless of the chosen filters.
I'm not really looking for query optimization advice, as I've already reviewed the query and examined/optimized the query plan as much as I can given my requirements. I'm more interested in alternative solutions intended for after the query has been optimized. Outside of trying to break the query up into separate smaller pieces (which unfortunately is not an acceptable solution), I can only think of two options, but I don't think they are a good fit for this situation:
• Caching first came to my mind, but I don't think it is viable given how likely the filters are to vary and the large data sets returned.
• From my research, options such as Elasticsearch and Solr would not be the right fit either, as the data can be manipulated by multiple programs and those data stores would quickly become outdated.
Are there other options to improve the perceived performance of a search feature with these requirements?
You don't provide enough information about your tables and queries for a concrete solution.
As mentioned in a comment by @jmarkmurphy, DB2 on IBM i does its own "caching". I agree that it's unlikely you'd be able to improve upon it when dealing with large and varied result sets. But you need to make sure you're using what's provided by IBM. For example, if using SQL embedded in RPGLE, make sure you don't have SET OPTION CLOSQLCSR=*ENDMOD. Also check the QAQQINI settings you're using.
You've mentioned using Visual Explain and building some of the requested indexes. That's a good start. But as the queries are run in production, keep an eye on the plan cache, index usage and the advised indexes.
Lastly, you mentioned that you're seeing full table scans due to the use of LIKE '%SOMETHING%'. Again, without details of the columns and data involved, it's a guess as to what may be useful. As suggested in my comment, OmniFind for IBM i may be an improvement.
However, OmniFind is NOT an improved LIKE. OmniFind is designed to handle linguistic searches. From the article "i Can … Find a Needle in a Haystack using OmniFind Text Search Server for DB2 for i":
SELECT story_id FROM story_library.story_table
WHERE CONTAINS(story_doc, 'blind mouse') = 1;
This query result will include matches that we’d expect from a typical search engine. The search is case insensitive, and linguistic variations on the search words will be matched. In other words, the previous query will indicate a match for documents that contain “Blind Mice.” In a similar manner, a search for “bad wolves” would return documents that contained “the Big Bad Wolf.”
I am a Java developer working with a MarkLogic database. A key function of my code is its capacity to dynamically generate 4-6 SPARQL queries and run them via HTTP GET requests. The results of each are added together and then returned. I now need these results sorted consistently.
Since I am paging the results of each query (using the LIMIT and OFFSET statements) each query has its own ORDER BY statement. Without embedding sorting into the queries the pages of results will be returned out of order.
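For context, each of the generated queries has roughly this shape (the predicate and page size here are only placeholders, not the real query):

# One of the 4-6 generated queries; e.g. the third page of 25 results
SELECT ?id ?label
WHERE { ?id <http://example.org/label> ?label }
ORDER BY ?label
LIMIT 25
OFFSET 50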
However, each query returns its own results, which are individually sorted and need to be merged into a single sorted list. My preference would be an alphanumeric sort that considers characters before considering case and that sorts empty and null values to the end. (Example: “0123456789AaBbCc…WwXxYyZz ”)
I have already done this in my Java code using a custom compare method, but I recently ran into a problem: my results still aren’t returning sorted. The issue I’m having stems from the fact that my custom ordering scheme is completely separate from the one used by SPARQL, resulting in a decidedly unsorted set of results. While I have considered sorting the results from scratch before returning them instead of assuming MarkLogic is returning sorted results, this seems unnecessarily wasteful and it may not even fix my problem.
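For illustration, a comparator along the lines described above might look like this (the class and constant names are mine, not from the actual code):

import java.util.Comparator;

// Sketch: compare characters before case, break ties by case, and push
// null or empty strings to the end of the list.
public final class ResultOrdering {
    public static final Comparator<String> ALPHANUMERIC =
        Comparator.nullsLast(
            Comparator.comparing((String s) -> s.isEmpty())                 // empties last
                      .thenComparing(String.CASE_INSENSITIVE_ORDER)         // characters first
                      .thenComparing(Comparator.<String>naturalOrder()));   // then case
}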
In my research I have not been able to find any way to set the Collation for SPARQL, nor have I found a way to write a custom Collation. The documentation on this page (https://www.w3.org/TR/rdf-sparql-query/#modOrderBy) specifically states that SPARQL's ORDER BY is based on a comparison method driven by XPath's fn:compare. That function references this page (https://www.w3.org/TR/xpath-functions/#collations), which specifically mentions options for specifying the Collation as well as using alternative implementations of the Unicode Collation Algorithm. What I can't find is anything detailing how to actually do this.
In short, is there any way for me to manipulate or control how a SPARQL query compares characters to affect the final order?
If I understand what you're asking, you want to use ORDER BY, OFFSET, and LIMIT to select which results you're going to show, and then you want another ORDER BY to determine the order in which you'll show those results (which might be different than the order that you used to select them). You can do that with a nested query:
select ?result {
{ select ?result where {
#-- ...
}
order by #-- ...
offset #-- ...
limit #-- ...
}
}
order by #-- ...
There's not a whole lot of support for custom orderings, but you can use functions in the order expressions, and you can provide multiple expressions to sort first by one thing, then by another. In your case, it looks like you might want to do something like ORDER BY LCASE(?value) to order case-insensitively. (That won't be perfect, of course. For instance, it's not clear to me whether you want a numeric sort for numeric prefixes or not, e.g., should the order be 1, 10, 2, or 1, 2, 10?)
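For example, as the final order by in the nested query above:

# case-insensitive primary sort, with a case-sensitive tie-break
ORDER BY LCASE(?result) ?result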
I just got a definitive answer from SPARQL implementers.
The SPARQL spec doesn't really address collations. MarkLogic uses unicode codepoint collation for SPARQL ordering.
HOWEVER, we need to know your requirements. MarkLogic as you know supports all kinds of collations, and that support is built into the code backing SPARQL -- we simply have not exposed an interface as to how to leverage collations from SPARQL.
MarkLogic is watching this thread, so feel free to make that request, perhaps with a suggestion of how you would consider accessing collations from the query, and we'll see it.
I contacted Kevin Morgan from MarkLogic about this, and he was extremely helpful. We had a WebEx meeting yesterday discussing various solutions to the problem and it went very well.
Their engineers confirmed that so far there is no means of forcing SPARQL to use a particular sorting order. They proposed two promising solutions to my problem:
• Embed your triples within your documents and leverage document searches and range indexes: While this works for multiple system designs, it does not work for ours. Sorting and Pagination fall under a product upgrade and we cannot require our clients to completely re-ingest their data so we can apply this new standard.
• Wrap your SPARQL queries within an XQuery statement: This approach uses SPARQL to determine the entire result set, and then utilizes a custom collation within the XQuery to handle sorting. Pagination is also handled in the XQuery (for the obvious reason that paginating before sorting breaks both).
The second solution seems like it will work for us, but I will need to look into the performance costs before we can seriously consider implementing it. Incidentally, I find it very odd that SPARQL’s sorting does not support collations when the XQuery functions it is built upon do. It seems illogical to assume that its users will never want to sort untagged literal values with anything other than the basic Unicode Codepoint sorting. At what point does it become reasonable for me to take something built upon XQuery and embed it within XQuery because it seems the creators “left something out?”
I am designing a few dimensions with multiple data sources and wonder what other people have done to align the multiple business keys per data source.
My Example:
I have 2 data sources - the Ordering System and the Execution System. The Ordering system has details about payment and what should happen; the Execution System has details on what actually happened (how long it took, who executed the order, etc.). Data from both systems is needed to create a single fact.
In both the Ordering and Execution systems there is a Location table. The business keys from both systems are mapped via an ESB. There are attributes in both systems that make up the complete picture of a single location: billing information is in the Ordering system, latitude and longitude are in the Execution system, and Location Name exists in both systems.
How do you design an SCD to accommodate changes to the dimension from both systems?
FYI, we follow a fairly strict Kimball methodology, but I am open to looking at everyone's solutions.
Not necessarily an answer but here are my thoughts:
You've already covered the real options in your comment. Either:
A. Merge it beforehand
You need some merge functionality in staging which matches the two (or more) records, creates a new common merge key and uses that in the dimension. This requires some form of lookup or reference to be stored in addition to the normal DW data.
OR
B. Merge it in the dimension
Put both records in the dimension and allow the reporting tool to 'merge' them by, for example, grouping by location name. This means you don't need prior merging logic; you just dump it in the dimension.
However, you have two constraints that I feel make the choice between A & B clearer.
Firstly, you need an SCD (Type 2, I assume). This means Option B could get very complicated: when there is a change in one source record you have to go and find the other record and change it as well - very unpleasant for option B. You still need some kind of pre-stored key to link them, which means option B is no longer simple.
Secondly, given that you have two sources for one attribute (Location Name), you need some kind of staging logic to pick a single name when these don't match.
So given these two circumstances, I suggest that option A would be best - build some pre-merging logic, as the complexity of your requirements warrants it.
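As a rough illustration of option A, the staging key map and merge step might look something like this (table and column names are invented for the example):

-- Key map linking each source system's business key to one shared merge key.
CREATE TABLE stg_location_keymap (
    merge_key            INT         NOT NULL,  -- key carried into the Location dimension
    source_system        VARCHAR(20) NOT NULL,  -- 'ORDERING' or 'EXECUTION'
    source_business_key  VARCHAR(50) NOT NULL,
    PRIMARY KEY (source_system, source_business_key)
);

-- Merge the attributes from both sources onto one row per merge_key,
-- preferring the Execution system for the shared Location Name attribute.
SELECT  km_e.merge_key,
        e.location_name,
        e.latitude,
        e.longitude,
        o.billing_info
FROM    stg_execution_location e
JOIN    stg_location_keymap km_e
        ON  km_e.source_system = 'EXECUTION'
        AND km_e.source_business_key = e.business_key
LEFT JOIN stg_location_keymap km_o
        ON  km_o.merge_key = km_e.merge_key
        AND km_o.source_system = 'ORDERING'
LEFT JOIN stg_ordering_location o
        ON  o.business_key = km_o.source_business_key;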
You'd think this would be a common issue but I've never found a good online reference explaining how someone solved this before.
My thinking is actually quite simple. First, you need to be able to decide which system is your master dataset for Geo + Location, and at what granularity.
My method will be:
DIM loading
Say below is my target
Dim_Location = {Business_key, Longitude, Latitude, Location Name}
Dictionary
Business_key = always maps to the master record from the source system (in this case the Execution system). Imagine that the unique business key for this table is the combination (longitude, latitude).
Location Name = again, since we assume the "Execution System" is the master for our data, this attribute is sourced from the Execution System.
The above table is now loaded for Fact lookup.
Fact Loading
You have already integrated the records between the Execution system and the Ordering (billing) system, so it's a straightforward lookup and load in staging, since each record carries the necessary geo_location combination.
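A minimal sketch of that lookup (column names are illustrative, and it assumes the dimension also carries a surrogate key and an SCD2 current-row flag):

-- Resolve the Location surrogate key for each incoming fact row via the
-- combined (longitude, latitude) business key, taking the current SCD2 row.
SELECT  f.order_id,
        d.location_key
FROM    stg_fact_orders f
JOIN    Dim_Location d
        ON  d.longitude = f.longitude
        AND d.latitude  = f.latitude
WHERE   d.is_current = 1;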
Challenging scenarios
What if the Execution system has late-arriving records for orders?
What if the same geo_location points to multiple location names? This should not be possible, but it is worth profiling the data for errors.
We are trying to implement a FHIR Rest Server for our application. In our current data model (and thus live data) several FHIR resources are represented by multiple tables, e.g. what would all be Observations are stored in tables for vital values, laboratory values and diagnosis. Each table has an independent, auto-incrementing primary ID, so there are entries with the same ID in different tables. But for GET or DELETE calls to the FHIR server a unique ID is needed. What would be the most sensible way to handle this?
Searching didn't reveal an inherent way of doing this, so I'm considering these two options:
Add a prefix to all (or just the problematic) table IDs, e.g. lab-123 and vit-123
Add a UUID to every table and use that as the logical identifier
Both have drawbacks: an ID parser is necessary for the first one and the second requires multiple database calls to identify the correct record.
Is there a FHIR way that allows to split a resource into several sub-resources, even in the Rest URL? Ideally I'd get something like GET server:port/Observation/laboratory/123
Server systems will have all sorts of different divisions of data in terms of how data is stored internally. What FHIR does is provide an interface that tries to hide those variations. So Observation/laboratory/123 would be going against what we're trying to do - because every system would have different divisions and it would be very difficult to get interoperability happening.
Either of the options you've proposed could work. I have a slight leaning towards the first option because it doesn't involve changing your persistence layer, and it's a relatively straightforward transformation to convert between the external/FHIR IDs and the internal ones.
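For what it's worth, the transformation for the prefixed-id option could be as small as this sketch (the prefixes and table names are made up for the example):

// Map a logical id like "lab-123" back to the internal table and numeric id.
public record InternalId(String table, long id) {

    public static InternalId parse(String logicalId) {
        int dash = logicalId.indexOf('-');
        if (dash < 0) {
            throw new IllegalArgumentException("Unprefixed id: " + logicalId);
        }
        String prefix = logicalId.substring(0, dash);
        long id = Long.parseLong(logicalId.substring(dash + 1));
        return switch (prefix) {
            case "lab" -> new InternalId("laboratory_values", id);
            case "vit" -> new InternalId("vital_values", id);
            case "dia" -> new InternalId("diagnoses", id);
            default -> throw new IllegalArgumentException("Unknown prefix: " + prefix);
        };
    }
}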
Is there a FHIR way that allows to split a resource into several
sub-resources, even in the Rest URL? Ideally I'd get something like
GET server:port/Observation/laboratory/123
What would this mean for search? That is, what would /Observation?code=xxx search through? Would that search labs, vitals etc. combined, or would you just allow access on /Observation/laboratory?
If these are truly "silos", maybe you could use http://servername/lab/Observation (so swap the last two path parts), which suggests your server has multiple "endpoints" for the different kinds of observations. I think more clients will be able to handle that URL than the URL you suggested.
Still, I think the best approach is one of your two other options, of which the first is indeed the easiest to implement.