to_tsquery() validation - validation

I'm currently developing a website that allows a search on a PostgreSQL
database, the search works with to_tsquery() and I'm trying to find a way to validate the input before it's being sent as a query.
Other than that I'm also trying to add a phrasing capability, so that if someone searches for HELLO | "I LIKE CATS" it will only find results with "hello" or the entire phrase "i like cats" (as opposed to I & LIKE & CATS that will find you articles that have all 3 words,
regardless where they might appear).

Is there some reason why it's too expensive to let the DB server validate it? It does seem a bit excessive to duplicate the ts_query parsing algorithm in the client.
If the concern is that you don't want it to try running the whole query (which presumably will involve table access) each time it validates, you could use the input in a smaller query, just in pseudocode (which may look a bit like Python, but that's just coincidence):
is_valid_query(input):
try:
execute("SELECT ts_query($1)", input);
return True
except DatabaseError:
return False
With regard to phrasing, it's probably easiest to search by the non-phrased query first (using indexes), then filter those for having the phrase. That could be done server side or client side. Depending on the language being parsed, it might be easiest to construct a simple regex of the phrase that deals with repeated whitespace or other ignorable symbols.
Search for to_tsquery('HELLO|(I&LIKE&CATS)'), getting back a list of documents which loosely match.
In the client, filter that to those matching the regex "HELLO|(I\s+LIKE\s+CATS)".
The downside is you do need some additional code for translating your query into the appropriate looser query, and then for translating it into a regex.
Finally, there might be a technique in PostgreSQL to do proper phrase searching using the lexeme positions that are stored in ts_vectors. I'm guessing that phrase searches are one of the intended uses, but I couldn't find an example of it in my cursory search. There's a section on it near the bottom of http://linuxgazette.net/164/sephton.html at least.

Related

Search-For Utility Mainframe Algorithm

Can someone please give me some pointers on how the IBM mainframe Search-For Utility algorithm works?
How does it compare strings? What kind of matching algorithm does it use? How should I enter different strings in order to make the less comparisons possible?
I am using the utility but I do not know how it works, and I believe I am not using it as well as I should.
Thank you very much for your help!
Think of it as a very dumb search.
It doesn't have the capacity to enter a REGEX or anything like that. I don't think anyone will be able to tell you what algorithm is used.
Search-For uses the SuperC program to actually perform the search. What it appears to do is search line by line for a match to the string you provided. So if I do a search for:
'PIC 9(9)'
I am going to get back results for every line that has that string in it. The only way I could bring back less search results, would be to add more to that string. So maybe search for:
'PIC 9(9).' 'PIC 9(9) VALUE 'PIC 9(9) COMP'
any of these 3 would provide less results than the first search. So if that string breaks a line like:
05 WS-SOME-VARIABLE PIC 9(9)
VALUE 123456.
a search for 'PIC 9(9) VALUE' will not return anything, but a search for 'PIC 9(9)' would.
The more specific you are, the less search results you will get back. Depending on what you are looking for, you may be able to get better results by using Search-For in batch, or using File-Aid instead. Every specific scenario is different. So without knowing exactly what you are searching for and what your requirement it, its hard to tell you how to proceed.
You might consider IBM Developer for z, which which can do regular expression based searches. When the Remote Systems Explorer Daemon (RSED) is setup and running on the z/OS lpar, you can do searches across a single PDS or groups of PDS's using IDz filters. Very powerful. It also searches in the background so you can do other tasks while it searches. The searches can be saved for future ease of reference.

Search algorithm options for ontology querying?

I have developed a tool that enables searching of an ontology I authored. It submits the searches as SPARQL queries.
I have received some feedback that my search implementation is all-or-none, or "binary". In other words, if a user's input doesn't exactly match a term in the ontology, they won't get any hit at all.
I have been asked to add some more flexible, or "advanced" search algorithms. Indexing and bag-of-words searching were suggested.
Can anyone give some examples of implementing search methods on an ontology that don't require a literal match?
FIrst of all, what kind of entities are you trying to match (literals, or string casts of URIs?), and what kind of SPARQL queries are you running now? Something like this?
?term ?predicate "user input" .
If you are searching across literals, you can make the search more flexible right off the bat by using case-insensitive regular expression filtering, although this will probably make your searches slower, and it won't catch cases where some of the word tokens are present but in a different order. In the following example, your should probably constrain the types of ?term and ?predicate first, or even filter on a string datatype on ?userInput
?term ?predicate ?someLiteral .
FILTER(regex(?someLiteral), "user input", "i"))
Several triplestores offer support for full-text searching and result scoring. These are often extensions to the SPARQL language.
For example, Virtuoso and some others offer a bif:contains predicate. Virtuoso also offers the faceted search web interface (plus a service, I think.) I have been pleased with the web-based full text search in Blazegraph and Stardog, but I can't say anything at this point about using them with a SPARQL query to get a score on a search pattern. Some (GraphDB) even support explicit integration with Lucene or Solr*, so you may be able to take advantage of their search languages.
Finally... are you using a library like the OWL API or RDF4J to access your ontology? If so, you could certainly save the relationships between your terms and any literals in a Java native data structure, and then directly use a fuzzy search component like Lucene to index each literal as a "document" and then search the user input across the index.
Why don't you post your ontology and give an example of a search you would like to peform in a non-binary way. I (or someone else) can try to show you a minimal implementation.
*Solr integration only appears to be offered in the commercially-licensed version of GraphDB

CouchDB, all_docs and filter design documents with endkey

First, this question - filter design documents from all_docs - already seemed to be solved like described here:
https://plus.google.com/+JasonDeRose/posts/1iP5tu3wVqw
/mydb/_all_docs?endkey=%22_%22
and worked in first place. However, suddenly in a different setup (actually just different deploy), the query only returns an empty collection []. It seems like the ordering changed, without endkey="_" the full collection is returned (including design documents). I tried various combinations of endkey/startkey but cannot achieve to filter the design documents again.
Finally I added a filter and switched to _changes?include_docs=true to load the initial documents. I also thought about defining a view, but don't like that this results in data replication and some inconveniences with the changes feed (needed in another context). The filter on the other hand will be executed for every document.
Is it a bug that endkey=%22_%22 doesn't work anymore and is there a more convenient, still working way?
/_all_docs is a special case for CouchDB. Instead of the normal Unicode Collation, it uses ASCII collation.
The '_' character in ASCII order shows up between uppercase letters and lowercase letters. So if your doc id starts with lowercase letters (default behaviour), they will show up after any design docs. If your doc ids start with uppercase letters, they will show up before design docs.
Try creating a document with an id of: "ABC" You will see it show up before the design doc and your trick to filter design docs would work in this case.
However, I recommend you stop using the `_all_docs view altogether. Instead use the normal view functionality. When you create a view, CouchDB automatically skips design docs for you. So if your view looked like:
function(doc){
emit(doc._id, null);
}
You could query this with no start or end key, and get all docs without design docs.
Also, please look at Unicode Collation order, this is the order all your other views will be in, and it's important to understand as you work with CouchDB. You can read all about it here:
http://docs.couchdb.org/en/stable/ddocs/views/collation.html

In XQuery, how do I avoid deteriorating performance in a paged simple query?

I have a ML database with a few tens of thousands of documents in it, and a query that returns some simple calculated values for either all or a subset of those documents. The document count has grown to the point that the "all documents" option no longer reliably runs without timing out, and is only going to get worse as the document count grows. The obvious solution is for the client application to use the other form and paginate the results. It's an offline batch process, so overall speed isn't an issue - we'd just like to keep individual requests sane.
The paged version of the query is very simple:
declare namespace ns = "http://some.namespace/here"
declare variable $fromCount external;
declare variable $toCount external;
<response> {
for $doc in fn:doc()/ns:entity[$fromCount to $toCount]
return
<doc> omitted for brevity </doc>
} </response>
The problem is that the query is slower the further through the document set the requested page is; presumably because it's having to load every document in order, check whether it's the right type and iterate until its found $fromCount ns:entitys before it even begins building the response.
One wrinkle is that there are other types of document in the database, so just using fn:doc isn't a realistic option (although, they are in different directories, so xdmp:directory() might be an option; something I'll look into.)
There also isn't currently an index on the ns:entity element; would that help? It's always the root-node of a document, and the documents are quite large, so I'm concerned about the size of the index. Also, (the slow part of) this query isn't interested in the value of the element, just that it exists.
I thought about using the search: api for it's built-in paging, but it seems overkill for a query that is intended to match all documents; surely it's possible to manually construct the query that search:search() would build internally.
It seems like what I really need is an efficient list of all root-nodes of a certain type in the database. Does Marklogic maintain such a thing? If Not would an index solve the problem?
Edit: It turns out that the answer in my case is use the xdmp:directory() option, since ML apparently stores a fast, in-memory list of all documents. Still, if there is a more general solution to the problem, it's bound to be of interest, so I'll leave the question here.
Your analysis is correct:
presumably because it's having to load every document in order, check whether it's the right type and iterate until its found $fromCount ns:entitys before it even begins building the response
The usual answer is cts:search plus the unfiltered option. You found that xdmp:directory was faster, but you should still be able to measure pagination times as O(n) even if the scale is smaller. See http://docs.marklogic.com/guide/performance/unfiltered#chapter - basically the database is guarding against returning false positives, unless you tell it not to.
Another approach might be to use cts:uris and its limit option, but this might require managing pagination state in terms of start values rather than page counts. For example if the last item on page 1 was "cat", you would use "cat" as arg2 when calling cts:uris for the next page. You could still use pagination start-stop values, too. That would still be O(n) - but at a much smaller scale.

How does spell checker and spell fixer of Google (or any search engine) work?

When searching for something in Google, if you misspell a word (may be by mistake or may be when you really mean this non-dictionary word), Google says:
"Showing results for ..... Search instead for .......".
I am trying to figure out how this would work.
This basically means being able to find the closest dictionary word to the non-dictionary word entered. How does it work? One way I can guess is :
count no. of instances of each character and then scan dictionary to find a word with same no. of instances of each character (only with +-1 difference). But this will also return anagrams.
Is some kind of probabilistic model of any use here such as Markov etc. I don't understand Markov well enough to throw it around but just a very wild guess.
Any insights?
You're forgetting that google has a lot more information available to it then you do. They track when people type in a word, don't select a result, and then do another search shortly afterwards. They then use this information to suggest better searches for you.
See How does the Google "Did you mean?" Algorithm work? for a fuller explanation.
Note that this approach makes sense when you consider that Google aren't actually doing spell-checking. Instead, they are trying to work out what search term will give you the answer you are looking for. Obviously there is a lot of overlap between this and spell-checking, but it means they are not always trying to correct a search for, e.g., "Flickr".
When you search something which is related to other searches performed earlied closed to yours and got more results, google shows suggest on them.
We are sure that it is not spell checking but it shows what other people queried the related keywords.

Resources