ElasticSearch paginate over 10K result - elasticsearch

Elasticsearch's search feature only support 10K result by default. I know I can specific the "size" parameter in the search query, but this only applies to number of result to get back in one call.
If I want to iterate over 20K results using size=100, making 200 calls total. How should I do it?

Related

Elastic Search Version 7.17 Java Rest API returns incorrect totalElements and total pages using queryBuilder

We are currently upgrading our system from ElasticSearch 6.8.8 to ElasticSearch 7.17. When we run pageable queries using the Java Rest API, the results are incorrect.
For example, in version 6.8.8, if we query for data with and request page 2 with a page size of 10, the query return the 10 items on page 2 and give us a totalElement of 10000 records which is correct. When we run this same exact query on Version 7.17, it returns 10 items on page 2 but only gives us a totalElement of 10 instead of the correct number. We need the correct number, so that our gridview handles paging correctly. Is there a setting I am missing in ElasticSearch version 7.17?
Elasticsearch implemented an option of Track_total_hits in all search in ES 7.X.
Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as "there are at least 10000 hits", the default is set to 10,000. This means that requests will count the total hit accurately up to 10,000 hits. It is a good trade-off to speed up searches if you don’t need the accurate number of hits after a certain threshold.
So to force ES to calculate all the hit documents you should set Track_total_hits to true. For more information, you can check the ES official documentation page here.

ES returns only 10 records by default. How to get all records without using scroll API

When we query ES for records it returns 10 records by default. How can I get all the records in the same query without using any scroll API.
There is an option to specify the size, but size is not known in advance.
You can retrieve up to 10k results in one request (setting "size": 10000). If you have less than 10k matching documents then you can paginate over them using a combination of from/size parameters. If it is more, then you would have to use other methods:
Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000. See the Scroll or Search After API for more efficient ways to do deep scrolling.
To be able to do pagination over unknown number of documents you will have to get the total count from the first returned query.
Note that if there are concurrent changes in the data, results of paginated retrieval may not be consistent (for example, if one document gets inserted or deleted while you are paginating). Scroll is consistent since it is created from a "snapshot" of the index created at query start time.

ElasticSearch aggregations for all pages

I use size and from keywords for pagination across ElasticSearch results and each page change requires another search query to be executed.
I would like to compute facets with the aggregations feature, however aggregations are computed only based on the results constrained by size and from keywords e.g. when I ask for records 20-30 from the list, the aggregations are computed only on these 10 records that are returned. And I would like of course to have global facets computed on all the matching records that do not change while I switch the pages.
Any ideas how to do it apart from performing an additional global (uncostrained by size and from) search?
Aggregations are computed on all documents that match "query". The scope of aggregations has nothing to do with "size" and "from" values.

Loading all documents from ElasticSearch takes too long

In order to load all the documents index by ElasticSearch, I am using the following query through tire.
def all
max = total
Tire.search 'my_documents' do
query { all }
size max
end.results.map { |entry| entry.to_hash }
end
Where max, respectively total is a count query to return the number of present documents. I have indexed about 10,000 documents. Currently, the request takes too long.
I am aware, that I should not query all documents like this. What is the best alternative here? Using pagination, if yes, toward which metric would I define the number of documents per page?
I am also planning to extend the size of the documents, to 100,000 or even 1,000,000 and I don't see yet how this can scale.
I appreciate every comment.
Rationale: I do this, because I am running calculations over these data. Hence, I need all the data, run the computations and save the results back into the documents.
Have a look at the scroll API, which is highly optimized to fetch a large amount of results. It uses the scan search type and doesn't support sorting but let you provide a query to filter the documents you want to fetch. Have a look at the reference to know more about it. Remember the size that you define in the request is per shard; that means that if you have 5 primary shards, setting 10 would lead to have 50 results back per request.

Unexplainable count results in ElasticSearch

We have an index running with 241.047 items in it. These items can have any number of subitems, which are indexed as nested documents. The total number of subitems is 381.705.
Both include_in_parent and include_in_root are not set in the mapping, which means that each nested document is indexed as additional documents. This should mean that there will be a total number of 241.047 + 381.705 = 622.752 documents in the index.
When I run the following Curl command to look up the number of documents in the index I get a different number, it's not far off but I'm wondering why it's giving me a different number and it's not returning the number I'm expecting.
curl -XGET
'http://localhost:9200/catawiki_development/_status?pretty' returns 622.861
Next to that, when I'm running a Curl command to get the number of root documents I get a different number than if I run a match_all query and ask for the number of documents returned
curl -XGET 'http://localhost:9200/elasticsearch_development/_count?pretty' returns 241.156
The match_all query returns the correct number of documents, 241.047
How can these difference be explained?
The path of a count api request is quite different from the path of a normal search request. In fact it is a shortcut that allows to only get the count of the documents matching a query, thats' it. It differs from a search with search_type=count too, which is effectively only the first part of a search: broadcast the search request to all shards, but no reduce/fetch since we only want to return the total number of matching documents. You can also add facets etc. to a search request (when using search_type=count too), which is something that you cannot do with the count api.
That said, I'm not that surprised you see a difference for the above reason, it would be nice to understand exactly what the problem is though. The best would be to be able to reproduce the problem with a small number of documents and open an issue including a curl recreation so that we can have a look at it.
In the meantime, I would suggest to use a search request with search_type=count if you have problems with the count api. That one is guaranteed to return the same number of documents as a normal search, just because it is exactly the same logic.

Resources