Elasticsearch number of results changes with pagination

I'm using Elasticsearch 7.6.0 and have paginated one of my queries. It seems to work well, and I can vary the number of results per page and the selected page using the from and size search parameters.
query = 'sample query'
items_per_page = 12
page = 0
es_query = {
    'query': {
        'bool': {
            'must': [{
                'multi_match': {
                    'query': query,
                    'fuzziness': 'AUTO',
                    'operator': 'and',
                    'fields': ['title^2', 'description']
                }
            }]
        }
    },
    'min_score': 5.0
}
res = es.search(index='my-index', body=es_query, size=items_per_page, from_=items_per_page * page)
hits = sorted(res['hits']['hits'], key=lambda x: x['_score'], reverse=True)
print(res['hits']['total']['value'])  # This changes depending on the page provided
I've noticed that the number of results returned depends on the page provided, which makes no sense to me! The number of results also oscillates which further confuses me: Page 0, 233 items. Page 1, 157 items. Page 2, 157 items. Page 3, 233 items...
Why does res['hits']['total']['value'] depend on the size and from parameters?

The search is distributed and sent to all the nodes holding shards of the searched indices. Then all the results are merged and returned. Sometimes not all shards can be searched. This happens when:
The cluster is very busy
The specific shard is not available due to a recovery process
The search has been optimized and the shard has been omitted
In the response, there is a _shards section like this:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {...}
}
Check if there is any value other than 0 for failed shards. If so, check the logs and cluster and index status.
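With the Python client, a quick per-response sanity check might look like the minimal sketch below (it assumes the res dict returned by es.search as in the question):
shards = res['_shards']
if shards['failed'] > 0 or shards['skipped'] > 0:
    # Totals computed from a partial shard set can differ between requests
    print(f"Only {shards['successful']} of {shards['total']} shards responded "
          f"({shards['failed']} failed, {shards['skipped']} skipped)")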

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-track-total-hits
Generally the total hit count can't be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as "there are at least 10000 hits", the default is set to 10,000. This means that requests will count the total hits accurately up to 10,000 hits. It's a good trade-off to speed up searches if you don't need the accurate number of hits after a certain threshold.
When set to true the search response will always track the number of hits that match the query accurately (e.g. total.relation will always be equal to "eq" when track_total_hits is set to true). Otherwise the "total.relation" returned in the "total" object in the search response determines how the "total.value" should be interpreted. A value of "gte" means that the "total.value" is a lower bound of the total hits that match the query and a value of "eq" indicates that "total.value" is the accurate count.
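Applied to the query from the question, that looks like the sketch below (it assumes the es_query dict and client from the question); track_total_hits in the request body is supported on 7.x and trades a bit of speed for an exact count:
# Force an accurate total count instead of the default 10,000 lower bound
es_query['track_total_hits'] = True
res = es.search(index='my-index', body=es_query,
                size=items_per_page, from_=items_per_page * page)
total = res['hits']['total']
print(total['value'], total['relation'])  # relation is now always 'eq'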

len(res['hits']['hits']) will always return the number specified in items_per_page (i.e. 12 in your case), except on the last page, where it might return a number smaller than or equal to 12.
However, res['hits']['total']['value'] is the total number of documents matching your query across the whole index, not the number of results returned on the current page. If that number grows between requests, it means new matching documents were indexed between the last query and the current one.

Related

ElasticSearch Segment merge not happening when deleted documents count is greater than 50%

Elasticsearch version: 7.10.0
I have an Elasticsearch index with 8 shards on 8 different nodes and a document count greater than 25 million (nested documents not included). It's a heavy-update index, and the index size grows over time because of deleted documents. I searched for this issue and read blog posts like the one below, which say a segment will automatically be merged when the deleted-docs count in that segment is greater than 50%.
https://discuss.elastic.co/t/too-many-deleted-docs/84964/4
I did a /_segments call for the index and found segments like the one below:
"segments": {
"_bbx": {
"generation": 14685,
"num_docs": 27901732,
"deleted_docs": 23290932,
"size_in_bytes": 5071187083,
"memory_in_bytes": 137008,
"committed": true,
"search": true,
"version": "8.7.0",
"compound": false,
"attributes": {
"Lucene87StoredFieldsFormat.mode": "BEST_SPEED"
}
},
The full response of the /_segments call can be found here:
https://drive.google.com/file/d/1mLE2xw0u7lnogHnfzz65rWCBS8JrcnNm/view?usp=sharing
In many segments like the one above, the deleted_docs count is more than 75% of num_docs, but they are still not getting merged. We haven't set max_merged_segment, so the default of 5gb applies. We also haven't changed the MergePolicy and are using the defaults as of ES version 7.10.0.
Is my understanding correct?
Any thoughts on this would be helpful. Thanks in advance.
num_docs contains only the live documents and doesn't include the deleted ones.
So in this case there are 23,290,932 deleted documents out of a total of 51,192,664 (27,901,732 + 23,290,932), which means only 45.5% of that segment is deleted. Hence the segment merge didn't happen.
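If you want to verify that ratio across all segments, here is a small sketch against the Python client (it assumes the /_segments response shape shown above; 'my-index' is a placeholder name):
stats = es.indices.segments(index='my-index')
for shard_copies in stats['indices']['my-index']['shards'].values():
    for copy in shard_copies:
        for name, seg in copy['segments'].items():
            total = seg['num_docs'] + seg['deleted_docs']
            if total == 0:
                continue
            # Merges consider live + deleted docs, not num_docs alone
            ratio = seg['deleted_docs'] / total
            print(f"{name}: {ratio:.1%} deleted ({seg['deleted_docs']:,}/{total:,})")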
Note: I posted the same question in the Elasticsearch forums and got this reply:
https://discuss.elastic.co/t/elasticsearch-segment-merge-not-happening-when-deleted-documents-count-is-greater-than-50/277209

ElasticSearch Sorted Index not working as expected with multiple shards

I have an Elasticsearch index with a default sort mapping on price:
shop_prices_sort_index
"sort" : {
"field" : "enrich.price",
"order" : "desc"
},
If I insert 10 documents:
100, 98, 10230, 34, 1, 23, 777, 2323, 3, 109
and fetch results using /_search, by default it returns documents in descending order of price:
10230, 2323...
But if I distribute my documents across 3 shards, the same query returns some other sequence of products:
100, 98, 34...
I am really stuck here; I am not sure if I am missing something basic or need extra settings to make a sorted index behave correctly.
PS: I also tried 'routing' and 'preference', but no luck.
Any help much appreciated.
When configuring index sorting, you're only making sure that each segment inside each shard is properly sorted. The goal of index sorting is to provide some extra optimization during searches.
Due to the distributed nature of ES, when your index has many shards, each shard will be properly sorted, but your search query still needs to specify the sort explicitly.
So if your index settings contain the following to apply sorting at indexing time:
"sort" : {
  "field" : "enrich.price",
  "order" : "desc"
}
your search queries will also need to contain the same sort specification at query time:
"sort" : {
  "field" : "enrich.price",
  "order" : "desc"
}
By using index sorting you'll pay a little overhead at indexing time, but your search queries will be a bit faster in the end.
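In other words, something like the sketch below (assuming a Python client and the index from the question); the index sort only speeds this query up, it doesn't replace it:
# Query-time sort is still required even though segments are sorted on disk
res = es.search(index='shop_prices_sort_index', body={
    'query': {'match_all': {}},
    'sort': [{'enrich.price': {'order': 'desc'}}]
})
print([hit['_source'] for hit in res['hits']['hits']])  # globally sorted by price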

Search After (pagination) in Elasticsearch when sorting by score

search_after in Elasticsearch must match its sort parameters in count and order. So I was wondering how to get the score from the previous result (say page 1) to use as search_after for the next page.
I faced an issue when using the score of the last document in the previous search: the score was 1.0, and since all documents have a score of 1.0, the result for the next page turned out to be null (empty).
That actually makes sense, since I am asking Elasticsearch for results that have a lower rank (score) than 1.0, of which there are zero. So which score do I use to get the next page?
Note:
I am sorting by score then by TieBreakerID, so one possible solution is using a high value (say 1000) for the score.
What you're doing sounds like it should work, as explained by an Elastic team member. It works for me (in ES 7.7) even with tied scores when using the document ID (copied into another indexed field) as a tiebreaker. It's true that indexing additional documents while paginating will make your scores slightly unstable, but not likely enough to cause a significant problem for an end user. If you need it to be reliable for a batch job, the Scroll API is the better choice.
{
  "query": {
    ...
  },
  "search_after": [
    12.276552,
    14173
  ],
  "sort": [
    { "_score": "desc" },
    { "id": "asc" }
  ]
}
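A paging loop built on that idea could look like the sketch below (assuming a Python client, an index with the id field copied from the document ID as described above, and a placeholder query); the key point is feeding the sort array of the last hit straight back in as search_after:
body = {
    'query': {'match': {'title': 'sample query'}},  # placeholder query
    'sort': [{'_score': 'desc'}, {'id': 'asc'}],    # score + tiebreaker
    'size': 100,
}
search_after = None
while True:
    if search_after is not None:
        body['search_after'] = search_after
    res = es.search(index='my-index', body=body)
    hits = res['hits']['hits']
    if not hits:
        break
    print([h['_id'] for h in hits])   # handle the page here
    search_after = hits[-1]['sort']   # [score, id] of the last hit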

Does "from" parameter in ElasticSearch Impact the ElasticSearch Cluster?

I have a large number of documents (around 34,719,074) in one type of an index (ES 2.4.4). While searching, my ES cluster comes under heavy load (search latency, CPU usage, JVM memory and load average all spike) when the from parameter is high (greater than 100000, with the size parameter held constant). Any specific reason for it? My query looks like:
{
  "explain": false,
  "size": 100,
  "from": <>,
  "_source": {
    "excludes": [],
    "includes": [
      <around 850 fields>
    ]
  },
  "sort": [
    <sorting on a string field>
  ]
}
This is a classic problem of deep pagination. You may read the Elasticsearch documentation on pagination. Essentially, getting the next set of documents after skipping 100,000 documents is a memory-intensive task: to build a result set of 100,000+ documents, 100,000+ documents need to be fetched from each shard and then processed (ranked, sorted, etc.). Ranking/sorting a smaller result set takes less time than doing the same on a larger one.
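A back-of-the-envelope sketch of the numbers involved (the shard count here is an assumption for illustration, not from the question):
shards = 8                       # assumed shard count
from_, size = 100_000, 100
per_shard = from_ + size         # each shard collects from+size candidates
merged = shards * per_shard      # the coordinating node merges and sorts all of them
print(per_shard, merged)         # 100100 per shard, 800800 merged per request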

Offset calculation in Spring Pagination

I have a service with a pagination index starting from 1. I get the list of entities, and after some logic I return them (responses) as below:
totalCount = responses.size();
return new PageImpl<>(responses, pageable, totalCount);
When I request the 1st page as
new PageRequest(1, 100)
I get back the following response:
{"content": [
{
"id": "e1",
}{
"id": "2",
}
],
"last": false,
"totalElements": 102,
"totalPages": 2,
"size": 100,
"number": 1,
"sort": null,
"first": true,
"numberOfElements": 2
}
Here, even though I have "numberOfElements": 2, I get back "totalElements": 102.
The issue which I found is because of pageable.getOffset() calculation in PageImpl
this.total = !content.isEmpty() && pageable != null
        && pageable.getOffset() + pageable.getPageSize() > total
        ? pageable.getOffset() + content.size() : total;
In my scenario, for the 1st page I'm getting the offset as 100 (1*100). How do I resolve this?
Note: I use a third-party service to get the responses, which is 1-indexed, so I am trying to align my service with it so that the entire logic follows the same indexing.
The result you get is correct, since PageRequest uses zero-based page indices, as stated in the API docs:
Parameters:
page - zero-based page index.
size - the size of the page to be returned.
So that means you're retrieving the second page (not the first one), and since you have a limit of 100 records and a total of 102 records, you'll only retrieve the last two of them.
You can still expose a 1-based number though:
new PageRequest(page-1, 100);
Alternatively, you can customize this by implementing Pageable. This allows you to override the actual offset being used by Spring data.
Nonetheless, this doesn't change the fact that Spring data expects getPageNumber() to be a zero based number. You cannot change that, you can only add an abstraction layer on top of it to make it meet your requirements.
And what's wrong with that? totalElements tells you how many elements are stored in the data source; numberOfElements tells you how many elements the current page contains.
When you have 102 elements in total and you request page 2 with size 100, you should get exactly the response you received.
What probably confuses you: with new PageRequest(1, 100) you are requesting the 2nd page, as the index starts at 0.
