How do you set the Collation for a SPARQL query? - sorting

I am a Java developer working with a MarkLogic database. A key function of my code is that it dynamically generates 4-6 SPARQL queries and runs them via HTTP GET requests. The results of each query are combined and then returned. I now need these combined results sorted consistently.
Since I am paging the results of each query (using the LIMIT and OFFSET clauses), each query has its own ORDER BY statement. Without embedding sorting into the queries, the pages of results would be returned out of order.
However, each query returns its own results, which are individually sorted and need to be merged into a single sorted list. My preference would be an alphanumeric sort that considers characters before considering case and that sorts empty and null values to the end. (Example: “0123456789AaBbCc…WwXxYyZz ”)
I have already done this in my Java code using a custom compare method, but I recently ran into a problem: my results still aren't coming back sorted. The issue stems from the fact that my custom ordering scheme is completely separate from the one used by SPARQL, resulting in a decidedly unsorted set of results. I have considered sorting the results from scratch before returning them, rather than assuming MarkLogic returns sorted results, but that seems unnecessarily wasteful and it may not even fix my problem.
In my research I have not been able to find any way to set the collation for SPARQL, nor have I found a way to write a custom collation. The documentation on this page (https://www.w3.org/TR/rdf-sparql-query/#modOrderBy) specifically states that SPARQL's ORDER BY is based on a comparison method driven by XPath's fn:compare. That function references this page (https://www.w3.org/TR/xpath-functions/#collations), which specifically mentions options for specifying the collation as well as using alternative implementations of the Unicode Collation Algorithm. What I can't find is anything detailing how to actually do this.
In short, is there any way for me to manipulate or control how a SPARQL query compares characters to affect the final order?

If I understand what you're asking, you want to use ORDER BY, OFFSET, and LIMIT to select which results you're going to show, and then you want another ORDER BY to determine the order in which you'll show those results (which might be different than the order that you used to select them). You can do that with a nested query:
select ?result {
  { select ?result where {
      #-- ...
    }
    order by #-- ...
    offset #-- ...
    limit #-- ...
  }
}
order by #-- ...
There's not a whole lot of support for custom orderings, but you can use functions in the order expressions, and you can provide multiple expressions to sort first by one thing, then by another. In your case, it looks like you might want something like ORDER BY lcase(?value) to order case-insensitively. That won't be perfect, of course; for instance, it's not clear to me whether you want a numeric sort for numeric prefixes or not (e.g., should the order be 1, 10, 2, or 1, 2, 10?).

I just got a definitive answer from SPARQL implementers.
The SPARQL spec doesn't really address collations. MarkLogic uses Unicode codepoint collation for SPARQL ordering.
HOWEVER, we need to know your requirements. MarkLogic, as you know, supports all kinds of collations, and that support is built into the code backing SPARQL; we simply have not exposed an interface for leveraging collations from SPARQL.
MarkLogic is watching this thread, so feel free to make that request, perhaps with a suggestion of how you would consider accessing collations from the query, and we'll see it.

I contacted Kevin Morgan from MarkLogic about this, and he was extremely helpful. We had a WebEx meeting yesterday discussing various solutions to the problem and it went very well.
Their engineers confirmed that so far there is no means of forcing SPARQL to use a particular sorting order. They proposed two promising solutions to my problem:
• Embed your triples within your documents and leverage document searches and range indexes: While this works for multiple system designs, it does not work for ours. Sorting and Pagination fall under a product upgrade and we cannot require our clients to completely re-ingest their data so we can apply this new standard.
• Wrap your SPARQL queries within an XQuery statement: This approach uses SPARQL to determine the entire result set, and then utilizes a custom collation within the XQuery to handle sorting. Pagination is also handled in the XQuery (for the obvious reason that paginating before sorting breaks both).
The second solution seems like it will work for us, but I will need to look into the performance costs before we can seriously consider implementing it. Incidentally, I find it very odd that SPARQL's sorting does not support collations when the XQuery functions it is built upon do. It seems illogical to assume that its users will never want to sort untagged literal values with anything other than basic Unicode codepoint ordering. At what point does it become reasonable for me to take something built upon XQuery and embed it within XQuery because it seems the creators "left something out"?

Related

Caching? Large Query performance for multiple, optional filters

I'm trying to figure out my options for a large query which is taking a somewhat long but sensible amount of time considering what it does. It has many joins and can be filtered by up to a predefined number of parameters. Some of these parameter values are predefined (select box) while some are free-form text boxes (unfortunately LIKE with prefixed and suffixed wildcards). The data sets returned are large, and the filter options are very likely to be changed frequently. The order of the result sets is also controlled by the user. Additionally, user access must be restricted to only the results the user is authorized to see. This authorization is handled as part of a baseline WHERE clause which is applied regardless of the chosen filters.
I'm not really looking for query optimization advice as I've already reviewed the query and examined/optimized the query plan as much as I can given my requirements. I'm more interested in alternative solutions intended for after the query has been optimized. Outside of trying to break up the query into separate smaller bits (which unfortunately is not an acceptable solution), I can only think of two options. But, I don't think they are a good fit for this situation.
Caching first came to my mind, but I don't think it is viable given how frequently the filters will vary and the large datasets returned. From my research, options such as Elasticsearch and Solr would not be the right fit either, as the data sets can be manipulated by multiple programs and these data stores would quickly become outdated.
Are there other options to improve the perceived performance of a search feature with these requirements?
You don't provide enough information about your tables and queries for a concrete solution.
As mentioned in a comment by @jmarkmurphy, DB2 and IBM i do their own "caching". I agree that it's unlikely you'd be able to improve upon it when dealing with large and varied result sets. But you need to make sure you're using what's provided by IBM. For example, if using SQL embedded in RPGLE, make sure you don't have SET OPTION CLOSQLCSR=*ENDMOD. Also check the QAQQINI settings you're using.
You've mentioned using Visual Explain and building some of the requested indexes. That's a good start. But as the queries are run in production, keep an eye on the plan cache, index usage and the advised indexes.
Lastly, you mentioned that you're seeing full table scans due to the use of LIKE '%SOMETHING%'. Again, without details of the columns and data involved, it's a guess as to what may be useful. As suggested in my comment, OmniFind for IBM i may be an improvement.
However, OmniFind is NOT an improved LIKE. OmniFind is designed to handle linguistic searches. From the article i Can … Find a Needle in a Haystack using OmniFind Text Search Server for DB2 for i:
SELECT story_id FROM story_library.story_table
WHERE CONTAINS(story_doc, 'blind mouse') = 1;
This query result will include matches that we’d expect from a typical search engine. The search is case insensitive, and linguistic variations on the search words will be matched. In other words, the previous query will indicate a match for documents that contain “Blind Mice.” In a similar manner, a search for “bad wolves” would return documents that contained “the Big Bad Wolf.”

Can we boost the performance of COUNT, DISTINCT and LIKE queries?

As far as I understand, when we run an SQL query with COUNT, DISTINCT, or LIKE '%query%' (wildcards on both sides), indexes cannot be used and the database has to do a full table scan.
Is there some way to boost the performance of these queries?
Can they really not use indexes, or can we fix this somehow?
Can we do an index-only scan if we need to return only one column? For example: select count(id) from MY_TABLE. In this case we could probably do an index-only scan and avoid hitting the whole table if we have an index on 'id'?
My question is meant generally: could you give me some performance guidelines for queries that use the mentioned operators?
UPDATE
In my case, I use PostgreSQL.
With PostgreSQL, you can create GIN trigram (pg_trgm) indexes on text columns to make LIKE '%foo%' faster, though this requires the pg_trgm extension and PostgreSQL 9.1 or higher.
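A minimal sketch of what that can look like (the table and column names here are hypothetical, and the pg_trgm extension must be available to install):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- trigram GIN index on a hypothetical text column
CREATE INDEX my_table_name_trgm_idx
    ON my_table USING gin (name gin_trgm_ops);

-- LIKE with wildcards on both sides can now use the index instead of a full scan
SELECT * FROM my_table WHERE name LIKE '%foo%';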
I doubt DISTINCT by itself will ever use an index; in fact, I tried and could not get it to use one. You can sort of force an index to be used with a recursive CTE that pulls individual records out one at a time (what can be called a "sparse scan"). We do something like this when pulling individual years out of the accounting records. This requires writing special queries, though, and so isn't really the general case.
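As an illustration of that recursive-CTE trick only (the acc table and its indexed year column are made up), the usual shape is:
WITH RECURSIVE years AS (
    (SELECT year FROM acc ORDER BY year LIMIT 1)           -- smallest year via the index
    UNION ALL
    SELECT (SELECT a.year FROM acc a
            WHERE a.year > y.year
            ORDER BY a.year LIMIT 1)                        -- next distinct year, one index probe
    FROM years y
    WHERE y.year IS NOT NULL
)
SELECT year FROM years WHERE year IS NOT NULL;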
count(*) is never going to be able to use an index due to MVCC rules. You can, however, get approximate results by looking in the appropriate system catalogs.
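For example, a rough sketch of reading the planner's estimate instead of counting (the table name is a placeholder; reltuples is an estimate maintained by VACUUM/ANALYZE, not an exact figure):
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'my_table';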

Parsing structured and semi-structured text with hundreds of tags with Ruby

I will be processing batches of 10,000-50,000 records with roughly 200-400 characters in each record. I expect the number of search terms I could have would be no more than 1500 (all related to local businesses).
I want to create a function that compares the structured tags with a list of terms to tag the data.
These terms are based on business descriptions. So, for example, a [Jazz Bar], [Nightclub], [Sports Bar], or [Wine Bar] would all correspond to queries for [Bar].
Usually this data has some kind of existing tag, so I can also create a strict hierarchy for the first pass and then do a second pass if there is no definitive existing tag.
What is the most performant way to implement this? I could have a table with all the keywords and try to match them against each piece of data. This is straightforward in the case where I am matching the existing tag, and less straightforward when processing free text.
I'm using Heroku/PostgreSQL.
It's a pretty safe bet to use the Sphinx search engine and the ThinkingSphinx Ruby gem. Yes, there is some configuration overhead, but I have yet to find a scenario where Sphinx has failed me. :-)
If you have 30-60 minutes to tinker with setting this up, give it a try. I have been using Sphinx to search a DB table with 600,000+ records with complex queries (3 separate search criteria + 2 separate field groupings/sortings) and I was getting results in 0.625 secs, which is not bad at all and, I am sure, a lot better than anything you could accomplish yourself with pure Ruby code.

performance with IN clause in postgresql

What are the performance implications if you have something like this in your query:
... AND x.somfield IN (
33620,262,394,450,673,674,675,2331,2370,2903,4191,4687,5153,6776,6898,6899,7127,7217,7225,
7227,7757,8830,8889,8999,9036,9284,9381,9382,9411,9412,9423,10088,10089,10304,10333,10515,
10527,10596,10651,11442,12636,12976,13275,14261,14262,14382,14389,14567,14568,15792,16557,
17043,17459,17675,17699,17700,17712,18240,18370,18591,18980,19023,19024,19025,19026,19211,
19272,20276,20426,20471,20494,20833,21126,21315,21990,22168,22284,22349,22563,22796,23739,
24006,24321,24642,24827,24867,25049,25248,25249,25276,25572,25665,26000,26046,26646,26647,
26656,27343,27406,27753,28560,28850,29796,29817,30026,30090,31020,31505,32188,32347,32629,
32924,32931,33062,33254,33600,33601,33602,33603,33604,33605,33606,33607,33608,34010,34472,
35800,35977,36179,37342,37439,37459,38425,39592,39661,39926,40376,40561,41226,41279,41568,
42272,42481,43483,43867,44958,45295,45408,46022,46258) AND ...
Should I avoid this, or is it okay and fast enough?
Thanks.
You certainly want to check the execution plan. Depending on data, it may or may not be "okay".
If the table is large enough, it's possible that PG converts that into an "array contains" operation and decides not to use an index for it. This could lead to a seq scan (if you don't have other WHERE criteria on this table).
In some cases OR is better than IN, because it can be executed as separate index scans whose results are combined. That may not work in your case, though, because you have so many values in there. Again, it depends on the data.
Unless your table is small, in such cases you usually need to rely on other criteria which are easily indexed, such as dates, states, "types" etc. Then this IN is merely a "recheck" filter on limited data.
If the query uses an index on x.somfield, it will be fast enough.
As was mentioned, you should use EXPLAIN and EXPLAIN ANALYZE to really understand what's going on there.
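For instance, something along these lines, with the IN list shortened and the extra date criterion invented purely for illustration:
EXPLAIN ANALYZE
SELECT *
FROM my_table x                                    -- placeholder table name
WHERE x.somfield IN (33620, 262, 394 /* , ... */)  -- the full list from above goes here
  AND x.created_at >= DATE '2015-01-01';           -- hypothetical additional criterion
The output shows whether the planner chose an index scan, a bitmap index scan, or a sequential scan for the IN list, and how long each step actually took.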

How can I optimize a dynamic search query in Oracle

I am writing a stored procedure to perform a dynamic search that spans 10+ database tables. With millions of records in each table and a dynamic set of search parameters*, I am having some trouble optimizing the procedure.
Is there a "best practice" for building these kinds of queries? E.g. use strings to build a dynamic query, use a huge list of IF THEN .. ELSE statements, etc.? Can anyone provide a simple example or point me to some literature that will help? Here's some pseudocode for the stored procedure I am developing, which accepts a collection of parameters and a ref cursor.
v_query := 'SELECT .....';
v_name  := ...;  -- retrieve "name" parameter from collection
IF v_name IS NOT NULL THEN
    v_query := v_query || ' AND table.Name = ''' || v_name || '''';  -- quote the string value
END IF;
OPEN search_cursor FOR v_query;
...
*By "dynamic set of search parameters," I mean that I pass in a collection of parameters. I figured this would be easier than making the caller pass in 20 parameters if they only want to search on one.
There are problems with using the static query approach; also be very careful about using the CURSOR_SHARING=FORCE option - it can really raise hell with your system if you haven't done a coverage test to ensure that all your other queries will work the way you want.
Problems with static queries:
The (x is null or x = col) predicates tend to kill any chance of using indexes. Since the query plan is computed the first time the query is parsed, the indexes used will be based on the values from the first run of the query; later runs, which may not constrain on the same columns, will still use the same indexes.
Having one static statement with substitution variables will prevent the optimizer from making an intelligent choice about which index to use based on the data distribution. In a dynamic query (or in the first run of a query with bind variables), Oracle will see how selective your constraint is; a highly selective constraint will become a prime candidate for index use. For example, if your table had a row for every person in the U.S., STATE='Alaska' will be much more likely to use the index on STATE than STATE='California'.
Of course, in both these cases, if the dynamic columns in your WHERE clause are not indexed anyway, it doesn't matter, although I'd be surprised if that were the case in a database the size you're talking about.
Also, consider the real cost of all that hard parsing. Yes, hard parses serialize system resources, which makes them expensive, but only in the context of high volume queries. By their nature, ad-hoc queries do not get run very often. The cost you pay for all the hard parses you incur in an entire day will likely be hundreds of times less than the cost of a single query that uses the wrong indexes.
In the past, I've implemented these systems pretty much like you've done here - a base query portion, then iterating over a constraint list and adding WHERE clause predicates. I don't think it's hard for someone to maintain or understand, especially if you're talking about constraints that don't involve adding a lot of subqueries or extra tables to the FROM clause.
One thing to consider: If this system is primarily an offline one (in other words, not constantly being updated or inserted into - populated by periodic loads of bulk data), you may want to look into using BITMAP indexes. Bitmap indexes differ from regular b-tree indexes in that multiple indexes on a single table can be used simultaneously, and bitmap indexes are much, much smaller on disk than b-trees. They work very well for applications like this - where you will have a variety of constraints that can't be defined at design time. You will only want to put bitmap indexes on columns that have relatively few distinct values - say, one value constitutes no less than 1/1000 of the table - so don't use bitmaps on unique columns.
However, the downside is that bitmap indexes will noticeably degrade the performance of inserts and updates. The best practice for bitmaps is to use them in data warehouse applications, and they are dropped prior to loads and recreated afterwards.
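A minimal sketch of the idea in Oracle syntax (the table and column names are invented, and this assumes the periodic-load pattern described above):
-- bitmap index on a low-cardinality column
CREATE BITMAP INDEX orders_state_bix ON orders (state);

-- typical warehouse pattern: drop before the bulk load, recreate afterwards
DROP INDEX orders_state_bix;
-- ... bulk load runs here ...
CREATE BITMAP INDEX orders_state_bix ON orders (state);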
Except in very particular cases, I don't think it is advisable (or even possible) to try to generate an optimized query. My advice is not to use dynamic SQL if you can avoid it: it's hard to read, hard to debug, hard to optimize, and hard to maintain.
First, write a generic query that will work with any parameters sent to your procedure. Based on your example, that would give something like:
SELECT * FROM table WHERE ((v_name IS NULL) OR (table.Name=v_name));
As you see, you could easily add other parameters to this query without using dynamic SQL. This query is much easier to read and debug. Ask your DBA for optimization tips.
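For instance, a second optional parameter just adds another predicate of the same shape (the table name and the v_city parameter are purely illustrative):
SELECT *
FROM my_table t
WHERE ((v_name IS NULL) OR (t.name = v_name))
  AND ((v_city IS NULL) OR (t.city = v_city));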
Then, if you have a particular set of parameters that you know are often passed together, you could write a specific query for that set and optimize it individually. Pseudocode:
IF particular_set THEN
    /* Specific query */
ELSE
    /* Generic query */
END IF;
The difficult part is to try not to have too many specific queries here, or you could fall into a maintenance hell.
We've had a similar requirement for one of our clients. They have half a dozen tables with millions of rows, and they wanted ad hoc search capability on most of the columns.
The solution was a separate package for each table, which would take the search criteria and construct the SQL to run the search. We took advantage of the old system that was being replaced to discover what the most common types of searches were, and made sure those searches ran best by tuning the queries that were being generated (supported by the strategic use of indexes). Because each package was only responsible for queries against one table, it could have code specifically designed to work with that table (including the odd hint, in a few rare cases).
One question/problem that you'll need to address is whether to hard-code the criteria (e.g. WHERE SURNAME='SMITH') or use bind variables. Using bind variables reduces hard parsing, which reduces load on the database server; however, it can be impractical to use bind variables when the SQL is dynamically generated. The way we ended up going was to set CURSOR_SHARING=FORCE (which has its own disadvantages), which was a reasonable compromise in our case.
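Purely as a sketch of that trade-off (every table, column, and variable name below is invented), the generated statement with a bind might be opened like this when a single criterion is supplied:
DECLARE
    v_query   VARCHAR2(4000) := 'SELECT id FROM customers t WHERE 1 = 1';
    v_surname VARCHAR2(100)  := 'SMITH';   -- hypothetical search input
    c         SYS_REFCURSOR;
    v_id      NUMBER;
BEGIN
    IF v_surname IS NOT NULL THEN
        -- bind variable: the literal value is not glued into the SQL text
        v_query := v_query || ' AND t.surname = :surname';
        OPEN c FOR v_query USING v_surname;
    ELSE
        OPEN c FOR v_query;
    END IF;

    LOOP
        FETCH c INTO v_id;
        EXIT WHEN c%NOTFOUND;
        DBMS_OUTPUT.PUT_LINE(v_id);
    END LOOP;
    CLOSE c;
END;
/
With several optional criteria, the USING list has to be kept in step with whichever binds were actually appended, which is exactly the bookkeeping that makes bind variables awkward in generated SQL.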
Read http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6711305251199
