MarkLogic: Xpath vs searches - xpath

Consider the following XPath expression:
/book/metadata/title[. = "Good Will Hunting"]
And the following search expression:
cts:search(/book/metadata, cts:element-value-query(xs:QName("title"), "Good Will Hunting"), "unfiltered")
XPath will make use of the relationship indexes and the value indexes.
Does search make use of both term list indexes and value indexes? Which of the above queries is more efficient and scalable?

I'd suggest looking at xdmp:plan of each of these. This will show you exactly what questions we are sending to the index given your particular index settings. These would usually be fairly comparable, except your cts:search is missing the first argument. I'm assuming it would be /book/metadata so that you pick up those constraints in search as well. A key difference is that XPaths will always be filtered. OTOH, the main cost of that is pulling all the fragments off disk, so if you are doing that anyway in consuming the results, that won't make a huge difference unless there are a lot of false positives, or you only consume the top N results.

Related

How term not present queries work in lucene?

I have started reading about indexing in Lucene and sharding in Elastic search.
One thing I have not been able to understand is how queries like these look up indexes.
field-x contains term1 but not term2
Does it look up the stored field for this?
The data in a stored field could be relatively large (it could be the text of an entire book). How would you efficiently search that text for an "exclusion" term? By indexing it!
You've already done that, to support field-x contains term1. So, no, you would not use a stored field for this. Instead, you would just use the indexed data to find term2 - and remove those results from the term1 results.
(I'm not saying this is the exact algorithm Lucene uses, because there may be significant optimizations Lucene makes, behind the scenes. But it will not be using the contents of the stored field.)
Also, if your indexed data does not contain any stored fields, the query would still work. You can try that for yourself.
Stored fields are useful when presenting results. From the Field documentation:
StoredField: Stored-only value for retrieving in summary results
In reality you would probably never want to store a large amount of data (e.g. a complete book) in a stored field. You could store a summary of the data - and that would make it unsuitable for use by queries, anyway.
Another consideration: You might as well ask "how does field-x contains term1 and also term2 work?" It works the same way as the first example, except you aren't removing the term2 results from the term1 results. Instead, you are finding the intersection between the two sets of results (if both terms are mandatory) or the union of the two sets (if both terms are optional), and so on.
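The set operations described above can be sketched with a toy inverted index. This is only an illustration of the idea, not Lucene's actual algorithm or data structures; the documents and the build_index and postings helpers are made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical documents; only the indexed terms are used, no stored field.
docs = {
    1: "quick brown fox",
    2: "quick red fox",
    3: "lazy brown dog",
}
index = build_index(docs)


def postings(term):
    """Return the document ids for a term, or an empty set if unknown."""
    return index.get(term, set())


# "contains term1 but not term2": remove the term2 results from term1's.
quick_not_brown = postings("quick") - postings("brown")
# Both terms mandatory: intersection of the two result sets.
quick_and_brown = postings("quick") & postings("brown")
# Both terms optional: union of the two result sets.
quick_or_brown = postings("quick") | postings("brown")
```

Every answer here comes from the term-to-documents mapping alone, which is why the query still works when nothing is stored.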

Why are Lucene/Elasticsearch prefix queries slower than term queries?

I've been recently reading about Lucene and Elasticsearch and it seems the following are true (correct me if i'm wrong):
1. prefix queries are slower than term queries
2. suffix queries (*ing) are slower than prefix queries (ing*)
This seems like a strange combination of properties. Perhaps I need to broaden my scope of data structures I'm considering, but if a segment were structured like a hash table, I could easily see that 1 would be true (the term query would be O(1) and a prefix query would require a full scan) however 2 would not be true (both prefix and suffix would require a full scan). If the segment were laid out like a sorted array, I could easily see that 2 would be true (a prefix query could be performed with a binary search O(log n) and the suffix would require a full scan) however 1 would no longer be true (both a term and prefix query would require a binary search).
My only other thought is that there might be some combination of both hash and sort going on to account for this behavior (ex. hash to some partition and sort within that partition). However my understanding is that Elasticsearch partitions by a document identifier but the inverted index key is a term. So a query for a term still requires the request being sent to all partitions.
Can anyone provide me with some intuition as to how/why this behavior exists?
Note:
https://www.youtube.com/watch?v=T5RmMNDR5XI would suggest that a segment is structured similar to a sorted array rather than a hash table.
The reason I believe 1 is true is https://medium.com/#mourjo_sen/a-detailed-comparison-between-autocompletion-strategies-in-elasticsearch-66cb9e9c62c4 mentions "The most important reason why prefix-like queries should never be used in production environments is that they are extremely slow. The reason for this is that the tokens in ES are not directly prefix-able"
The reason I believe 2 is true is https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html mentions "Allowing a wildcard at the beginning of a word (eg "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match"
I'm not that familiar with ES specific details so they might be doing something else than plain Solr - but #1 is not the case usually.
A prefix match will be more expensive than looking up a single term, but it's not that much more expensive. It can be compared to doing a range search (which you can perform if you want to): field:[aa TO ab} could, in theory, be compared to doing field:aa*, effectively retrieving all tokens that lie within that range, then resolving the document set that matches those tokens.
The fact that more tokens match means that you can't simply take the list attached to a single token (a matching term) and retrieve those documents; you have to retrieve a possibly large set of matching tokens and then compute the document set for them. This is not a very expensive computation, but it is more expensive than a single match. The lookup can be done by finding the start and end positions of the matching tokens in the index, then retrieving all terms between those two and finding the set of matching document ids.
A query of foo* against an index with the following terms:
bar, baz, foo, foobar, spam
          ^----------^
will collect the list of documents attached to foo and foobar, merge it and then retrieve the documents.
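As a rough sketch of that lookup, a prefix query over a sorted term list can be modelled with two binary searches. This is a simplification under stated assumptions: a plain Python list stands in for the term dictionary, the posting sets are invented, and prefix_terms and prefix_query are hypothetical helper names, not Lucene or Solr APIs.

```python
from bisect import bisect_left

# Sorted term dictionary and made-up posting lists (term -> document ids).
terms = ["bar", "baz", "foo", "foobar", "spam"]
postings = {
    "bar": {1},
    "baz": {2},
    "foo": {3, 4},
    "foobar": {4, 5},
    "spam": {6},
}


def prefix_terms(sorted_terms, prefix):
    """Return the contiguous slice of terms that start with prefix."""
    if not prefix:
        return list(sorted_terms)
    start = bisect_left(sorted_terms, prefix)
    # Smallest string that sorts after every string with this prefix,
    # e.g. "foo" -> "fop". Assumes the last character is not the maximum
    # code point.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    end = bisect_left(sorted_terms, upper)
    return sorted_terms[start:end]


def prefix_query(prefix):
    """Collect the matching terms and merge their posting lists."""
    matched = prefix_terms(terms, prefix)
    doc_ids = set()
    for term in matched:
        doc_ids |= postings[term]
    return matched, doc_ids


matched, doc_ids = prefix_query("foo")
```

The extra cost compared with a term query is the merge over every matching term, not the search for the range itself.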
Slower does not mean that it's catastrophic or not optimised in any way; just that it's more expensive than a direct match where the set of documents has already been determined. However, you probably have more than one term in your query already, so the same process (albeit slightly higher up in the hierarchy) happens there as well.
A postfix match (your #2), i.e. matching a wildcard at the beginning of the token, is expensive, since all tokens in the index usually have to be considered. The index has the terms sorted alphanumerically, so when you want to look only at the end of the string you have to consider that each token could match, regardless of where it's located in the index, so you get a full index scan. However, if this is a use case you see happening often, you can use the reversed wildcard filter (ReversedWildcardFilterFactory in Solr). This works by also indexing each token in reverse, so that foo is indexed as oof and a wildcard search for *foo gets turned into oof* instead.
A query of *ar against an index with the following terms:
bar, baz, foo, foobar, spam
 ?!   ?    ?     ?!     ?
will have to look at each term to decide if it ends with ar.
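The contrast between the full scan and the reversed-token trick can be sketched as follows. This is a toy model with hypothetical helper names (suffix_scan, suffix_via_reversed), not the actual Solr filter implementation, and it assumes a non-empty suffix.

```python
from bisect import bisect_left

terms = sorted(["bar", "baz", "foo", "foobar", "spam"])


def suffix_scan(sorted_terms, suffix):
    """Full scan: every term must be checked, sorted order does not help."""
    return [t for t in sorted_terms if t.endswith(suffix)]


# Second term list holding each term reversed, sorted again.
reversed_terms = sorted(t[::-1] for t in terms)


def suffix_via_reversed(sorted_reversed, suffix):
    """Turn the suffix query into a prefix range over the reversed terms."""
    prefix = suffix[::-1]  # "*ar" becomes "ra*"
    start = bisect_left(sorted_reversed, prefix)
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    end = bisect_left(sorted_reversed, upper)
    # Reverse the hits back into their original form.
    return sorted(t[::-1] for t in sorted_reversed[start:end])


scan_hits = suffix_scan(terms, "ar")
reversed_hits = suffix_via_reversed(reversed_terms, "ar")
```

Both functions return the same terms; the second pays for it with a larger index instead of a scan at query time.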
The reason for using an EdgeNGramFilter (your comment / #3) is that you move as much of the required processing as possible to indexing time (doing work up front that you would otherwise do at query time; even if prefix queries aren't really expensive, they still have a cost). Additionally, wildcard queries do not support most analysis. Many people end up running wildcard queries against a set of tokens that have been stemmed or otherwise processed, and are then surprised when their wildcard queries don't produce a match. Only a small subset of filters can be applied to wildcard queries (such as the LowerCaseFilter). Those filters are known as being "multi-term aware", since the terms they process can end up being expanded to multiple terms before the collection of documents happens.
Another reason is that using an EdgeNGramFilter will give you proper frequency scores for each prefix, giving you effective scoring for prefixed terms as well.
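A minimal sketch of the edge n-gram idea, assuming whitespace tokenization and lowercasing as the only analysis. The helper names, the gram lengths and the sample documents are illustrative and do not reflect the real filter's defaults.

```python
from collections import defaultdict


def edge_ngrams(token, min_len=1, max_len=10):
    """Return the leading substrings of token, shortest first."""
    upper = min(len(token), max_len)
    return [token[:n] for n in range(min_len, upper + 1)]


def build_prefix_index(docs, min_len=1, max_len=10):
    """Index every edge n-gram of every token at indexing time."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for gram in edge_ngrams(token, min_len, max_len):
                index[gram].add(doc_id)
    return index


docs = {1: "foo", 2: "foobar", 3: "spam"}
index = build_prefix_index(docs)

# A prefix search is now a plain single-term lookup.
foo_hits = index.get("foo", set())
# Each prefix is a real term, so it has its own document frequency.
foo_doc_freq = len(foo_hits)
```

The trade-off is a larger index in exchange for cheaper lookups and per-prefix statistics at query time.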

List items in some indices first in Elasticsearch search results

I'm scraping a few sites and relisting their products; each site has its own index in Elasticsearch. Some sites have affiliate programs, and I'd like to list those first in my search results.
Is there a way for me to "boost" results from a certain index?
Should I write a field hasAffiliate: true into ES when I'm scraping and then boost the query clauses that match that value? Or is there a better way?
Using boost could be difficult to guarantee that they appear first in the search. According to the official guide:
Practically, there is no simple formula for deciding on the “correct”
boost value for a particular query clause. It’s a matter of
try-it-and-see. Remember that boost is just one of the factors
involved in the relevance score
https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html
It depends on the type of queries you are doing, but here are a couple of other options:
A score function with weights: could be a more predictable option.
Simply using a sort by hasAffiliate (the easiest one).
Note: I'm not sure whether sorting by a boolean field is possible. If it isn't, you could map hasAffiliate as a byte (the smallest integer type) and set it to 1 when true.
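The two options could look roughly like the request bodies below, built here as plain Python dictionaries. This is a sketch: the title field, the weight value and the function names are assumptions for illustration, while hasAffiliate comes from the question.

```python
import json


def sort_by_affiliate(user_query):
    """Affiliate results first, relevance decides the order within a group."""
    return {
        "query": {"match": {"title": user_query}},  # "title" is hypothetical
        "sort": [
            {"hasAffiliate": {"order": "desc"}},
            "_score",
        ],
    }


def weighted_affiliate(user_query, weight=10):
    """Multiply the score of affiliate results by a fixed weight."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": user_query}},
                "functions": [
                    {
                        # Use 1 instead of True if the field is mapped as byte.
                        "filter": {"term": {"hasAffiliate": True}},
                        "weight": weight,
                    }
                ],
                "boost_mode": "multiply",
            }
        }
    }


sort_body = sort_by_affiliate("laptop")
weighted_body = weighted_affiliate("laptop")
print(json.dumps(sort_body, indent=2))
```

Only the sort variant guarantees that affiliate results come first; the weighted variant still lets a strong non-affiliate match outrank a weak affiliate one.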

Solr Boosting Logic Concepts

I'm trying to understand boosting and if boosting is the answer to my problem.
I have an index and that has different types of data.
E.g.: an index Animals. One of the fields is animaltype. This value can be Carnivorous, Herbivorous etc.
Now when we query in search, I want to show results of type Carnivorous at the top, and then the Herbivorous type.
Also, would it be possible to show only, say, the top 3 results from a type and then the remaining from other types?
Let's assume that for a Herbivorous type we have a field named vegetables. This will have values only for a Herbivorous animaltype.
Now, can it be possible to have boosting rules specified as follows:
Boost Levels:
animaltype:Carnivorous
then animaltype:Herbivorous and vegetablesfield: spinach
then animaltype:Herbivorous and vegetablesfield: carrot
etc. Basically boosting on various fields at various levels. I'm new to this concept. It would be really helpful to get some inputs/guidance.
Thanks,
Kasturi Chavan
Your example is closer to sorting than boosting, as you have a priority list for how important each document is, while boosting (in Solr) is usually applied more fluidly, meaning that there is no hard line between documents of type X and type Y.
However - boosting with appropriately large values will in effect give you the same result, putting the documents into different score "areas" which will then give you the sort order you're looking for. You can see the score contributed by each term by appending debugQuery=true to your query. Boosting says that 'a document with this value is z times more important than those with a different value', but if the document only contains low scoring tokens from the search (usually words that are very common), while other documents contain high scoring tokens (words that are infrequent), the latter document might still be considered more important.
Example: Searching for "city paris", where most documents contain the word 'city', but only a few contain the word 'paris' (and do not contain 'city'). Even if you boost all documents assigned to country 'germany', the score contributed by 'city' might still be lower, even with the boost factor, than what 'paris' contributes alone. This might not occur in real life, but you should know what the boost actually changes.
Using the edismax handler, you can apply the boost in two different ways - one is to use boost=, which is multiplicative, or to use either bq= or bf=, which are additive. The difference is how the boost contributes to the end score.
For your example, the easiest way to get something similar to what you're asking, is to use bq (boost query):
bq=animaltype:Carnivorous^1000&
bq=animaltype:Herbivorous^10
These boosts will probably be large enough to move all documents matching these queries into their own buckets, without moving between groups. To create "different levels" as your example shows, you'll need to tweak these values (and remember, multiple boosts can be applied to the same document if something is both herbivorous and eats spinach).
A different approach would be to create a function query using query, if and similar functions to result in a single integer value that you can use as a sorting value. You can also calculate this value when indexing the document if it's static (which your example is), and then sort by that field instead. It will require you to reindex your documents if the sorting values change, but it might be an easy and effective solution.
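A sketch of the precomputed sort value, assuming the priority levels from the question. The sort_rank field name, the priority_rank helper, the sample documents and their scores are all hypothetical; in Solr the resulting order would correspond to something like sort=sort_rank asc, score desc.

```python
def priority_rank(doc):
    """Map a document to a static priority level; lower sorts first."""
    animaltype = doc.get("animaltype", "").lower()
    vegetables = doc.get("vegetables", "").lower()
    if animaltype == "carnivorous":
        return 0
    if animaltype == "herbivorous" and vegetables == "spinach":
        return 1
    if animaltype == "herbivorous" and vegetables == "carrot":
        return 2
    return 3  # everything else goes last


# Made-up documents; "score" stands in for the relevance score.
docs = [
    {"id": "rabbit", "animaltype": "Herbivorous", "vegetables": "carrot", "score": 9.0},
    {"id": "lion", "animaltype": "Carnivorous", "score": 1.5},
    {"id": "goat", "animaltype": "Herbivorous", "vegetables": "spinach", "score": 4.0},
    {"id": "bear", "animaltype": "Omnivorous", "score": 7.0},
    {"id": "wolf", "animaltype": "Carnivorous", "score": 3.0},
]

# Computed once at indexing time and stored with the document.
for doc in docs:
    doc["sort_rank"] = priority_rank(doc)

# Priority level first, relevance only breaks ties inside a level.
ordered = sorted(docs, key=lambda d: (d["sort_rank"], -d["score"]))
ordered_ids = [d["id"] for d in ordered]
```

Unlike a large boost, this keeps the levels strictly separate no matter how high an individual relevance score gets.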
To achieve the "Top 3 results from a type" you're probably going to want to look at Result grouping support, which makes it possible to get "x documents" for each value in a single field. There is, as far as I know, no way to say "I want three of these at the top, then the rest from other values", except for doing multiple queries (and excluding the three you've already retrieved from the second query). Usually issuing multiple queries works just as well (or better) performance-wise.

Elasticsearch filter only if no matches to first filter

My use case is searching UK addresses, where there is a well-defined postal code system; however, my users may still make mistakes in the postcode. I want to use a filter, as in most cases the user will get the postcode right and I do not want to make Elasticsearch work harder than it needs to, but I also want to avoid round trips from my application to ES.
I am using an edge n-gram analyzer as described in the docs, so, taking the postcode ME4 4NR as an example I have ME4 4NR, ME4 4N, ME4 4 and ME4 indexed. I want to first filter by ME4 4NR and only widen to ME4 4N if this yields no matches.
Can I achieve this in my ES query or do I need to implement this in my application logic? Any advice would be much appreciated. I could use a boolean filter with a must on the ME4 and shoulds on the others but I wondered if there is a better way?
I think you are over-complicating the matter a bit here. This if-this-then-that-else-something-else logic can be achieved with ES, but the cases where it is possible are limited. For example, in this question the "else" part was a must in which the statement was a bool filter that first checked another must with a missing "condition". So something must still be true in order for the other part of an "if-then-else" statement to be applied. It is not strictly a matter of doing something only if "a certain condition" is true or false, as in programming. You need to approach this the Elasticsearch way, not the programming way.
Your solution - use a must on ME4 and shoulds on the others - is not necessary imo. If you have analyzer set to an edge n-gram, then the same analyzer is used at indexing time but also at search time. Which means that, depending on the query/filter used, your input text will be analyzed before the search is performed.
For example, if you use a match query at search time, then the input text you provide is analyzed. This means that if you input ME4 4N as search text, ES will first edge n-gram the input text and then use the resulting tokens to search the inverted index. So there is no need to do this in your own code or to come up with multiple shoulds in your ES query.
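The effect of analyzing both the indexed postcode and the search text with the same edge n-gram scheme can be modelled roughly like this. It is a simplified sketch: ranking by the count of shared tokens is an assumption standing in for real relevance scoring, and the analyze, build_index and match helpers and the sample addresses are invented for the example.

```python
from collections import defaultdict


def analyze(postcode, min_gram=3):
    """Edge n-gram a whole postcode: ME4, ME4 4, ME4 4N, ME4 4NR."""
    text = " ".join(postcode.upper().split())
    grams = (text[:n].rstrip() for n in range(min_gram, len(text) + 1))
    # dict.fromkeys removes duplicates while keeping the order.
    return list(dict.fromkeys(grams))


def build_index(addresses):
    """Index time: store every token of every postcode."""
    index = defaultdict(set)
    for doc_id, postcode in addresses.items():
        for token in analyze(postcode):
            index[token].add(doc_id)
    return index


def match(index, text):
    """Search time: analyze the input, rank documents by shared tokens."""
    overlap = defaultdict(int)
    for token in analyze(text):
        for doc_id in index.get(token, ()):
            overlap[doc_id] += 1
    return sorted(overlap, key=lambda d: (-overlap[d], d))


addresses = {1: "ME4 4NR", 2: "ME4 4NP", 3: "ME4 6AB", 4: "SW1A 1AA"}
index = build_index(addresses)

exact = match(index, "ME4 4NR")  # correct postcode
typo = match(index, "ME4 4NX")   # mistake in the last character
```

In this model a single query returns the exact postcode first and widens to nearby postcodes without an explicit if-then-else step.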
My suggestion here is to have a well-defined set of requirements set up properly first. Meaning, know what you want your search to do: think about the tokens that should be put in the inverted index and think about what users input. Decide if you need analysis at index time, but also at search time. Depending on this, think about the ways to use filters/queries at search time, meaning which ones analyze the input text and which don't (term doesn't, for example, while match does). Then test your approaches and measure the performance. Don't assume something is putting more work on ES than it should, because you might be wrong. Test and compare the results, then start improving and coming up with other ideas.

Resources