Paging in Elasticsearch when results have equal scores - elasticsearch

Is it possible to implement reliable paging of elasticsearch search results if multiple documents have equal scores?
I'm experimenting with custom scoring in elasticsearch. Many of the scoring expressions I try yield result sets where many documents have equal scores. They seem to come in the same order each time I try, but can it be guaranteed?
AFAIU it can't, especially not if there is more than one shard in a cluster. Documents with equal score wrt. a given elasticsearch query are returned in random, non-deterministic order that can change between invocations of the same query, even if the underlying database does not change (and therefore paging is unreliable) unless one of the following holds:
I use function_score to guarantee that the score is unique for each document (e.g. by using a unique number field).
I use sort and guarantee that the sorting defines a total order (e.g. by using a unique field as fallback if everything else is equal).
Can anyone confirm (and maybe point at some reference)?
Does this change if I know that there is only one primary shard without any replicas (see this other, similar question: Inconsistent ordering of results across primary/replica for documents with equivalent score)? E.g. if I guarantee that there is one shard AND there is no change in the database between two invocations of the same query, will that query return results in the same order?
What are other alternatives (if any)?

I ended up using additional sort in cases where equal scores are likely to happen - for example searching by product category. This additional sort could be id, creation date or similar. The setup is 2 servers, 3 shards and 1 replica.
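A minimal sketch of such a request body, built in Python, is shown here. The field name product_id and the example term query are hypothetical; the only assumption that matters is that the tiebreaker field is unique per document, so that the sort defines a total order.

```python
def build_paged_query(query, page, page_size, tiebreaker_field="product_id"):
    """Build a search request body whose sort defines a total order.

    tiebreaker_field is a hypothetical field name; it is assumed to hold
    a value that is unique per document.
    """
    return {
        "query": query,
        "from": page * page_size,  # offset of the first hit on this page
        "size": page_size,
        "sort": [
            {"_score": {"order": "desc"}},         # primary: relevance
            {tiebreaker_field: {"order": "asc"}},  # fallback when scores tie
        ],
    }


# Third page (zero-based page index 2) of a category search.
body = build_paged_query({"term": {"category": "shoes"}}, page=2, page_size=20)
```

With the fallback in place, two documents with equal scores are always ordered by the tiebreaker, so consecutive pages should not overlap or skip hits as long as the index itself does not change.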

Related

How does Elasticsearch 7 track_total_hits improve query speed?

I recently upgraded from Elasticsearch 6 to 7 and stumbled across the 10000 hits limit.
I read the changelog and the documentation, and I also found a single blog post from a company that tried this new feature and measured their performance gains.
But I'm still not sure how and why this feature works. Or does it only improve performance under special circumstances?
Especially when sorting is involved, I can't get my head around it. Because (at least in my world) when sorting a collection you have to visit every document, and that's exactly what they are trying to avoid according to the Documentation: "Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents."
Hopefully someone can explain how things work under the hood and which important point I am missing.
There are at least two different contexts in which not all documents need to be sorted:
A. When index sorting is configured, the documents are already stored in sorted order within the index segment files. So whenever a query specifies the same sort as the one in which the index was pre-sorted, only the top N documents of each segment file need to be visited and returned. In this case, if you are only interested in the top N results and you don't care about the total number of hits, you can simply set track_total_hits to false. That's a big optimization since there's no need to visit all the documents of the index.
B. When querying in the filter context (i.e. bool/filter) because no scores will be calculated. The index is simply checked for documents that match a yes/no question and that process is usually very fast. Since there is no scoring, only the top N matching documents are returned per shard.
If track_total_hits is set to false (because you don't care about the exact number of matching docs), then there's no need to count the docs at all, hence no need to visit all documents.
If track_total_hits is set to N (because you only care to know whether there are at least N matching documents), then the counting will stop after N documents per shard.
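To illustrate case A, here is a toy model in Python (not how Lucene is actually implemented): each segment is represented as a list that is already sorted by the sort key, so only the first N entries of each segment can possibly reach the global top N. The function name and the sample data are made up for illustration.

```python
import heapq
from itertools import islice


def top_n_from_sorted_segments(segments, n):
    """Return the n smallest sort keys across pre-sorted segments.

    Toy model: every segment is a list already sorted ascending by the
    sort key, mimicking an index sorted at write time.
    """
    # Only the first n entries of each segment can be in the global top n,
    # so the rest of each segment is never looked at.
    heads = [segment[:n] for segment in segments]
    # Merge the already-sorted heads and stop after n results.
    return list(islice(heapq.merge(*heads), n))


segments = [[1, 4, 9, 12], [2, 3, 10], [5, 6, 7, 8]]
top3 = top_n_from_sorted_segments(segments, 3)
visited = sum(len(segment[:3]) for segment in segments)
```

In this sketch 9 of the 11 entries are touched for N = 3; the saving grows with segment size. The price is that the total number of matches is never learned, which is exactly what track_total_hits set to false gives up.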
Relevant links:
https://github.com/elastic/elasticsearch/pull/24864
https://github.com/elastic/elasticsearch/issues/33028
https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand

How to count users who visited more than X times in a period

I'm trying to count active users for the service. We consider a user active if he did more than X actions in a span of a particular time period. Count will do fine, the list of user ids is not necessary.
I couldn't find a suitable query in Elasticsearch, not just Grafana. Terms aggregation can't do that because it only returns the top 10 buckets. Composite and cardinality aggregations don't allow a minimum document count.
Value count and top hits don't have the necessary data and/or filters. Regular and extended stats work only with numeric fields.
What am I missing?
There's an answer from a person who contributes to Elasticsearch. Basically he says there's no built-in query to do this.
P.S. It's my understanding that Elasticsearch is not the solution for this type of query. Redis and/or Druid might be a better fit.

Terminate After in elasticsearch

I have the intention to use the Terminate After feature of elasticsearch in order to reduce the result set.
The question is, the documents retrieved when using Terminate After, are ranked among the complete set of documents, or just among the reduced returned set?
Terminate after limits the number of documents collected per shard, so a document that would have matched later could have had a higher ranking (higher score) than the highest ranked document returned, since each document's score is independent of the other hits.
So yes the document will be ranked depending upon only the result set returned, but this would not affect how the actual score was calculated which takes into account all the documents.
Wanting a reduced result set and wanting it to be ranked depending on all the hits that may have occurred is a contradiction in itself.
Terminate after is generally used for filter type queries where the score of all returned docs is the same so that ranking doesn't matter.
For match type queries ES uses pagination, so it's already quite efficient and you don't really need to restrict the document set anyway.
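The effect can be sketched with a toy model of a single shard in Python. This is only an illustration under the assumption that collection simply stops after the first terminate_after matching documents in index order; the function name and the data are made up.

```python
def search_shard(docs, terminate_after=None):
    """Toy model of one shard.

    docs is a list of (doc_id, score) pairs in the order the shard would
    collect them. Collection stops after terminate_after documents, and
    only the collected documents are ranked.
    """
    collected = docs if terminate_after is None else docs[:terminate_after]
    return sorted(collected, key=lambda hit: hit[1], reverse=True)


shard = [("a", 1.2), ("b", 0.7), ("c", 3.5), ("d", 2.0)]
full = search_shard(shard)                      # ranks all four documents
early = search_shard(shard, terminate_after=2)  # only ever sees "a" and "b"
```

Here document "c" has the highest score, but the early-terminated search never collects it, so the best hit it returns is "a". The scores of "a" and "b" themselves are unchanged.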

Solr Boosting Logic Concepts

I'm trying to understand boosting and if boosting is the answer to my problem.
I have an index and that has different types of data.
EG: Index Animals. One of the fields is animaltype. This value can be Carnivorous, herbivorous etc.
Now when we run a search query, I want to show results of type carnivorous at the top, and then the herbivorous type.
Also would it be possible to show only say top 3 results from a type and then remaining from other types?
Let's assume that for the herbivorous type we have a field named vegetables. This will have values only for a herbivorous animaltype.
Now, can it be possible to have boosting rules specified as follows:
Boost Levels:
animaltype:Carnivorous
then animaltype:Herbivorous and vegetablesfield: spinach
then animaltype:Herbivorous and vegetablesfield: carrot
etc. Basically boosting on various fields at various levels. I'm new to this concept. It would be really helpful to get some input/guidance.
Thanks,
Kasturi Chavan
Your example is closer to sorting than boosting, as you have a priority list for how important each document is, while boosting (in Solr) is usually applied more fluidly, meaning that there is no hard line between documents of type X and type Y.
However - boosting with appropriately large values will in effect give you the same result, putting the documents into different score "areas" which will then give you the sort order you're looking for. You can see the score contributed by each term by appending debugQuery=true to your query. Boosting says that 'a document with this value is z times more important than those with a different value', but if the document only contains low scoring tokens from the search (usually words that are very common), while other documents contain high scoring tokens (words that are infrequent), the latter document might still be considered more important.
Example: Searching for "city paris", where most documents contain the word 'city', but only a few contain the word 'paris' (and those do not contain 'city'). Even if you boost all documents assigned to country 'germany', the score contributed by 'city' might still be lower, even with the boost factor, than what 'paris' contributes alone. This might not occur in real life, but you should know what the boost actually changes.
Using the edismax handler, you can apply the boost in two different ways - one is to use boost=, which is multiplicative, or to use either bq= or bf=, which are additive. The difference is how the boost contributes to the end score.
For your example, the easiest way to get something similar to what you're asking, is to use bq (boost query):
bq=animaltype:Carnivorous^1000&
bq=animaltype:Herbivorous^10
These boosts will probably be large enough to move all documents matching these queries into their own buckets, without moving between groups. To create "different levels" as your example shows, you'll need to tweak these values (and remember, multiple boosts can be applied to the same document if something is both herbivorous and eats spinach).
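As a rough sketch, the request parameters could be assembled like this in Python. The search term "lion" is a made-up example; defType, q, bq and debugQuery are the parameters mentioned above. A list of pairs is used rather than a dict because bq appears twice.

```python
from urllib.parse import urlencode

# A list of pairs keeps both bq parameters; a dict would keep only one.
params = [
    ("defType", "edismax"),
    ("q", "lion"),  # hypothetical search term
    ("bq", "animaltype:Carnivorous^1000"),
    ("bq", "animaltype:Herbivorous^10"),
    ("debugQuery", "true"),  # shows the score contributed by each term
]

# Percent-encodes ':' and '^' so the string is safe to append to a URL.
query_string = urlencode(params)
```

The resulting string can be appended to the select handler URL of your own core or collection; comparing the debugQuery output for a few documents is a practical way to check whether the boost values really separate the groups.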
A different approach would be to create a function query using query, if and similar functions to result in a single integer value that you can use as a sorting value. You can also calculate this value when indexing the document if it's static (which your example is), and then sort by that field instead. It will require you to reindex your documents if the sorting values change, but it might be an easy and effective solution.
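The index-time variant might look like the following Python sketch. The rank values, the function name sort_rank, and the assumption that vegetables is a list of strings are all illustrative; the field names animaltype and vegetables come from the question.

```python
def sort_rank(doc):
    """Compute a static sort value for a document before indexing it.

    Lower values sort first. Assumes 'vegetables' is a list of strings
    that is only present for herbivorous animals.
    """
    animaltype = doc.get("animaltype", "").lower()
    vegetables = [v.lower() for v in doc.get("vegetables", [])]
    if animaltype == "carnivorous":
        return 0
    if animaltype == "herbivorous" and "spinach" in vegetables:
        return 1
    if animaltype == "herbivorous" and "carrot" in vegetables:
        return 2
    return 3  # everything else goes last


docs = [
    {"name": "rabbit", "animaltype": "Herbivorous", "vegetables": ["carrot"]},
    {"name": "lion", "animaltype": "Carnivorous"},
    {"name": "goat", "animaltype": "Herbivorous", "vegetables": ["spinach"]},
]
ordered = sorted(docs, key=sort_rank)
```

In practice the computed value would be stored in its own field on each document and used in the sort parameter, with score as a secondary sort if relevance should still matter inside each level.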
To achieve the "Top 3 results from a type" you're probably going to want to look at Result grouping support, which makes it possible to get "x documents" for each value in a single field. There is, as far as I know, no way to say "I want three of these at the top, then the rest from other values", except for doing multiple queries (and excluding the three you've already retrieved from the second query). Usually issuing multiple queries works just as well (or better) performance-wise.

Way to factor in search locality in Solr/Elasticsearch/Sphinx?

My problem is to search data of thousands of users, e.g. mailboxes. Almost all the time search is filtered by user id. How this locality of searches could be taken into consideration? I'm trying to achieve performance comparable to a case where each user has dedicated index.
Sharding by itself is not the answer because it will already be used (total number of users ~1M); I'm looking for a solution to use inside a shard of ~4k users.
Well, it can be done in Sphinx with attributes. Most of the time you can make the search more efficient by also adding the user id as a fake keyword. Then the documents can be filtered during the full-text stage. (Still keep the attribute too, so as to avoid the possibility of someone manipulating results by constructing a careful query that returns results from other users.)
E.g., add _user1234 to a full-text field, then add it to the query: WHERE MATCH('example _user1234') AND user = 1234 then finds documents from just that user.
One possible solution is to group documents of the same user in inverted index block. Given that inverted index block is sorted by document id, such grouping can be done only by assigning ids to documents appropriately. Same user's documents should have monotonic ids. There could be minor violations of this rule - it would not harm performance significantly.
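A minimal Python sketch of the id assignment, assuming the documents are available as a batch before indexing and that each one carries a user_id field (both the field names and the function name are hypothetical):

```python
def assign_doc_ids(docs):
    """Assign sequential ids so that each user's documents are contiguous.

    sorted() is stable, so documents of the same user keep their
    original relative order.
    """
    ordered = sorted(docs, key=lambda doc: doc["user_id"])
    return [dict(doc, doc_id=i) for i, doc in enumerate(ordered)]


mails = [
    {"user_id": 7, "subject": "hello"},
    {"user_id": 3, "subject": "invoice"},
    {"user_id": 7, "subject": "re: hello"},
    {"user_id": 3, "subject": "reminder"},
]
indexed = assign_doc_ids(mails)
ids_by_user = {}
for doc in indexed:
    ids_by_user.setdefault(doc["user_id"], []).append(doc["doc_id"])
```

With ids laid out this way, a query filtered by one user only has to touch a narrow, contiguous range of each postings list. Documents that arrive later break the strict ordering, which matches the remark above that minor violations are tolerable.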
Implementations.
Index sorting has just become a first-class citizen in Lucene 6.2.
Could be achieved in elasticsearch 2.3 (see here). And I think it's achievable in Solr in the same way.
As for sphinx, I suppose the same technique of assigning monotonic document ids should work.
For more technical reasoning see previous link.
