Sorting and scoring in Elasticsearch

I need to use machine learning algorithms in order to sort / rank query results.
Our queries are running on elasticsearch.
For that, we need to combine data from the document itself, from the explanation section (although the explanation should not be returned) and from external sources.
This is pretty heavy computation, so I don't want to run the ranking algorithms on all documents, only on the top 1000, and then return my top 100.
A scoring plugin would run on all documents, and I didn't see any option to create a plugin for the rescore phase.
So it seems like I must create a sorting plugin.
My question is: how many documents run through the sorting phase? Is there any way to control that (like window_size in rescore)? What happens if I have pagination? Does my sorting run again?
Is it possible to get 1000 docs with the explanation section into the sorting phase and return only 100 without the explanation?
Thanks!

- This is pretty heavy computation, so I don't want to run the ranking algorithms on all documents, only on the top 1000, and then return my top 100.
Use rescoring combined with your scoring plugin; the rescore algorithm runs only on the top N results.
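For illustration, a minimal sketch of such a rescore request, where window_size caps how many top hits per shard get rescored (the index name, field names, and the script standing in for a custom scoring plugin are all assumptions):

```python
import requests

# Sketch only: the "docs" index and "body" field are hypothetical, and the
# script_score stands in for the heavy custom ranking logic.
body = {
    "size": 100,  # return only the top 100
    "query": {"match": {"body": "some query"}},
    "rescore": {
        "window_size": 1000,  # rescore only the top 1000 hits per shard
        "query": {
            "rescore_query": {
                "function_score": {
                    "script_score": {
                        "script": {"source": "_score * 2"}  # heavy ranking goes here
                    }
                }
            },
            "query_weight": 0,          # drop the original score...
            "rescore_query_weight": 1,  # ...and rank purely by the rescore query
        },
    },
}
resp = requests.post("http://localhost:9200/docs/_search", json=body)
print(len(resp.json()["hits"]["hits"]))
```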
- How many documents run through the sorting phase?
All documents that match your query. If you ask for N docs, each shard sends its top N and then they are merged together.
- What happens if I have pagination? Does my sorting run again?
Yes, sorting runs again, and it gets worse: if you ask for documents 100,000 to 100,010, sorting happens over 100,010 docs per shard.

Related

Elasticsearch Track total hits alternative with approximation

Based on this article (link), there are some serious performance implications of having the track_total_hits property set to true.
We currently use it to get the number of matching documents after a user searches. The user can then paginate through the results. The number of documents for such a search usually ranges from 10k to 5M.
Example of a user workflow:
The user performs a search which matches 150,000 documents.
We show him the first 200 results, which he can scroll through, but we also show him the total number of documents found by the search.
Since we always show the number of documents found, and those numbers can often be quite high, we need some way to get that count. I'm not sure, but since we almost always perform paginated searches, I would assume a lot of the data would already be in memory? Maybe then this actually affects us less than the provided article suggests?
Some kind of approximation rather than an exact count would be OK for us if it improved performance.
Is there such an option in Elasticsearch where we can get an approximate count on search requests?
There is no option to get an approximate count, but you may want to consider assigning track_total_hits a lower bound instead of true, which is a good compromise from a performance standpoint (https://www.elastic.co/guide/en/elasticsearch/reference/master/search-your-data.html#track-total-hits).
That way, you can show users that there are at least k results, though there could be more.
Also, try using search_after (if you are not using it already) for pagination.
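A hedged sketch combining both suggestions; the docs index, the timestamp field, and the unique id tie-breaker field are assumptions, and the integer form of track_total_hits requires Elasticsearch 7.x:

```python
import requests

URL = "http://localhost:9200/docs/_search"  # hypothetical index

body = {
    "size": 200,
    "track_total_hits": 10000,  # stop counting past 10k; much cheaper than true
    "query": {"match": {"title": "user search"}},
    # search_after needs a deterministic sort; "id" is a hypothetical
    # unique keyword field acting as a tie-breaker.
    "sort": [{"timestamp": "desc"}, {"id": "asc"}],
}
resp = requests.post(URL, json=body).json()

total = resp["hits"]["total"]  # {"value": 10000, "relation": "gte"} once capped
hits = resp["hits"]["hits"]
if hits:
    # Next page: pass the last hit's sort values instead of a "from" offset.
    body["search_after"] = hits[-1]["sort"]
    next_page = requests.post(URL, json=body).json()
```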

Using Search After without Index Sorting

I use Elasticsearch 6.2.4 and currently use Scroll to deal with queries that return more than 10,000 documents. Since most of the queries (90%) return fewer than that and involve real-time usage, building a scroll context is inefficient, so I'm considering switching to the Search After feature.
I noticed in the example in the Search After doc that a sort is used, and if I understand correctly it will be applied on every query. Will Elasticsearch order the results again and again every time? That could have a huge performance implication.
I read about index sorting. Can this solve my problem, since each shard will always be sorted?
Is there any reason to use search_after without index sorting?
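For reference, index sorting is configured at index-creation time. A minimal sketch, assuming a hypothetical sorted_docs index with a timestamp field (the single _doc mapping type matches the 6.x API); with the index sorted on the same key the query sorts on, segments can terminate early instead of sorting every matching document:

```python
import requests

# Sketch only: index name and field are assumptions.
settings = {
    "settings": {
        "index": {
            "sort.field": "timestamp",  # keep segments pre-sorted on this field
            "sort.order": "desc",
        }
    },
    "mappings": {
        "_doc": {  # single mapping type, as required in 6.x
            "properties": {"timestamp": {"type": "date"}}
        }
    },
}
requests.put("http://localhost:9200/sorted_docs", json=settings)
```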

Paging elasticsearch aggregation results

Imagine I have two kinds of records: a bucket and an item, where an item is contained in a bucket, and a bucket may have a relatively small number of items (normally no more than 4, never more than 10). Those records are squashed into one (an item with extra bucket information) and placed inside Elasticsearch.
The task I am trying to solve is to find 500 buckets (at most), with all their related items, in one filtered query that relies on the items' attributes, and I'm stuck on limiting / offsetting aggregations. How do I perform such a task? I see the top_hits aggregation, which allows me to control the number of related items returned, but I can't find a clue as to how I can control the number of returned buckets.
Update: okay, I'm terribly stupid. The size parameter of the terms aggregation gives me the limiting. Is there any way to perform the offset part? I don't need 100% precision and probably won't ever page those results, but anyway I'd like to see this functionality.
I don't think we'll be seeing this feature any time soon; see the relevant discussion on GitHub:
Paging is tricky to implement because document counts for terms aggregations are not exact when shard_size is less than the field cardinality and sorting on count desc. So weird things may happen like the first term of the 2nd page having a higher count than the last element of the first page, etc.
An interesting approach is mentioned there: you could request, say, the top 20 results on the 1st page, then on the 2nd page run the same aggregation but exclude those 20 terms you already saw on the previous page, and so forth (sketched in the code below). But this doesn't allow you "random" access to an arbitrary page; you must go through the pages in order.
...if you only have a limited number of unique values compared to the number of matched documents, doing the paging on client-side would be more efficient. On the other hand, on high-cardinality-fields, your first approach based on an exclude would probably be better.
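A rough sketch of that exclude-based paging, assuming a hypothetical items index with a bucket_id keyword field (the filter and page sizes are placeholders):

```python
import requests

def fetch_bucket_page(seen_bucket_ids, page_size=20):
    """Fetch one 'page' of buckets, skipping terms already shown."""
    terms_agg = {"field": "bucket_id", "size": page_size}
    if seen_bucket_ids:
        terms_agg["exclude"] = seen_bucket_ids  # exact-values form of exclude
    body = {
        "size": 0,
        "query": {"term": {"attr": "some-filter"}},  # hypothetical item filter
        "aggs": {
            "buckets": {
                "terms": terms_agg,
                "aggs": {"items": {"top_hits": {"size": 10}}},  # <= 10 items/bucket
            }
        },
    }
    resp = requests.post("http://localhost:9200/items/_search", json=body).json()
    return resp["aggregations"]["buckets"]["buckets"]

page1 = fetch_bucket_page([])
page2 = fetch_bucket_page([b["key"] for b in page1])  # page 2 excludes page 1
```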

Top 10% of results with sort

I'm looking for a setup that returns only the top 10% of results for a certain query. We also want to sort that subset afterwards.
Is there an easy way to do this?
Can anyone provide a simple example?
I was thinking of scaling the result scores between 0 and 1.0 and basically specifying a min_score of 0.9.
I was trying to create function_score queries, but those seem a bit complex for a simple requirement such as this one. Plus, I was not sure how sorting would affect the results, since I want the sort functions to always work on the 10% most relevant articles, of course.
Thanks,
Peter
As you want to slice the response as a percentage of the overall doc count, you need to know that count anyway. Using the from / size params will then cut off the required amount at query time.
Given that, it seems the easiest way to achieve your goal is to make two queries:
1. A filtered query with all your filters, no queries, and search_type=count, to get the overall document count.
2. Your regular matching query, applying {"from": 0, "size": count/10} with the count taken from the first response.
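A rough sketch of the two-step flow, with hypothetical index, filter, and field names. Note that on current Elasticsearch versions the deprecated search_type=count is expressed as "size": 0 instead:

```python
import requests

URL = "http://localhost:9200/articles/_search"  # hypothetical index

# Step 1: count the matching docs ("size": 0 replaces search_type=count).
count_body = {
    "size": 0,
    "track_total_hits": True,  # needed on 7.x for an exact count
    "query": {"term": {"lang": "en"}},  # hypothetical filter
}
total = requests.post(URL, json=count_body).json()["hits"]["total"]
count = total["value"] if isinstance(total, dict) else total  # 7.x wraps the count

# Step 2: the real scored query, keeping only the top 10%.
search_body = {
    "query": {
        "bool": {
            "must": {"match": {"title": "query text"}},  # hypothetical query
            "filter": [{"term": {"lang": "en"}}],
        }
    },
    "from": 0,
    "size": max(1, count // 10),
}
top_ten_percent = requests.post(URL, json=search_body).json()["hits"]["hits"]
```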
As for tweaking the scoring: that seems like a bad idea to me. Getting multiple documents with the same score is a pretty common situation, so cutting the dataset by min_score will probably result in skewed data.

How to turn off sorting in Solr?

I have a number of documents stored in a Solr cluster and want to get a large number of them (about 20 million) via a particular query. I use the standard approach of reading batches of rows (say, 10,000) and moving to the next batch with the start parameter. However, after about 1,400,000 docs I start to get an OutOfMemoryError. I believe this is because of the way Solr sorts docs before sending them to the client. As far as I know, it uses a priority queue to get only the top N results, and thus need not load the headers of all documents into memory. However, when I ask it to return results, say, from 1,000,000 to 1,010,000, it has to load the headers of all the previous 1,000,000 docs too.
I'm looking for a way to avoid this and just get all the results satisfying the query, without sorting. Is there a way to do it? If not, what is the appropriate way to get a large number of results from Solr?
Your assumptions are correct. When you search for results from 1,000,000 to 1,010,000, Solr instantiates a priority queue of size 1,010,000.
This is really not a natural use case for Solr, which has been designed to return the top-k list of results rather than an exhaustive list.
You could work around this by filtering by ranges of your primary key (q=yourquery&fq=ID:[1 TO 1000]&rows=1000, then q=yourquery&fq=ID:[1001 TO 2000]&rows=1000, and so on), but this is an ugly hack. :-)
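A rough sketch of that workaround, assuming a hypothetical mycore core with an integer ID primary-key field; each request is a fresh top-1000 query, so Solr never builds a million-entry priority queue:

```python
import requests

SOLR = "http://localhost:8983/solr/mycore/select"  # hypothetical core
BATCH = 1000
MAX_ID = 20_000_000  # assumed upper bound on the key space

for lo in range(1, MAX_ID + 1, BATCH):
    params = {
        "q": "yourquery",
        "fq": f"ID:[{lo} TO {lo + BATCH - 1}]",  # walk the key space in slices
        "rows": BATCH,
        "wt": "json",
    }
    docs = requests.get(SOLR, params=params).json()["response"]["docs"]
    for doc in docs:
        pass  # process each document here
```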
Why do you need to get all results? For example, if you need to compute facets or statistics, Solr has two components that can do that efficiently.
