How to fetch more than 10000 docs in Elasticsearch 2.4 with the Jest client - elasticsearch

I am trying to fetch more than 10000 docs with the Jest client.
I used the scroll feature with a query size of 50, but my program goes into an infinite loop and every iteration returns the same 50 docs.
I guess the problem is that I am not passing the scroll id. Can somebody help?

The following call retrieves the first 50 records:
POST <host_name>:<port_num>/<index_name>/_search?scroll=1m&size=50
Here size is 50 and scroll is 1m, which means each scroll request returns up to 50 records and the search context is kept alive for 1 minute. The response also contains a scroll id (_scroll_id), which must be passed in the next request to retrieve the following batch. Sending the original search request again instead opens a new scroll and returns the same first 50 records every time. A follow-up request looks like this:
POST <host_name>:<port_num>/_search/scroll?scroll=1m&scroll_id=<scroll_id>
Note: For follow-up scroll API calls, the index name need not be mentioned. The scroll_id and the scroll time are sufficient.
For more information, please refer to the Elasticsearch documentation on the scroll API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
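The loop described above can be sketched as follows. This is a minimal Python sketch, not Jest code: send is a hypothetical transport function (path, query parameters, body) that returns the parsed JSON response, and the follow-up path /_search/scroll is taken from the Elasticsearch scroll documentation. The point it illustrates is that each follow-up request carries the scroll id from the latest response, and the loop stops on the first empty page.

```python
from typing import Callable, Dict, Iterator

# Hypothetical transport: (path, query params, JSON body) -> parsed JSON response.
Send = Callable[[str, Dict[str, str], dict], dict]


def scroll_all(send: Send, index: str, size: int = 50,
               keep_alive: str = "1m") -> Iterator[dict]:
    """Yield every hit by following the scroll id returned with each page."""
    # First request: goes to the index and opens the scroll context.
    response = send(f"/{index}/_search",
                    {"scroll": keep_alive, "size": str(size)}, {})
    while True:
        hits = response["hits"]["hits"]
        if not hits:
            break  # an empty page means the scroll is exhausted
        yield from hits
        # Follow-up request: no index name, only the scroll id taken
        # from the most recent response and the keep-alive time.
        response = send("/_search/scroll",
                        {"scroll": keep_alive,
                         "scroll_id": response["_scroll_id"]}, {})
```

Repeating the first request inside the loop, instead of the follow-up request, is what produces the same 50 docs on every iteration.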

Related

How to get the previous page with search_after in Elasticsearch with the Java client?

I am using search_after.
I have more than 100,000 documents in Elasticsearch and I need to be able to go to both the previous page and the next page,
but my problem is that I cannot get the previous page with search_after. Is it possible to use search_after together with from and size to paginate over more than 10,000 documents? Or do I have other options?
I tried search_after from Java code with userId and date (HH:mm:ss.SSS) as the sort fields, but when I click the previous button or jump from page 1 to page 5, it does not show the proper values.

What is the maximum number of results that can be returned by the Bing Custom Search API per query?

From reading the documentation I can see that the maximum number of results per response can be set to 50. See link below.
https://learn.microsoft.com/en-us/rest/api/cognitiveservices-bingsearch/bing-custom-search-api-v7-reference#count
What is the maximum number of total results this API can return?
If a search returned a total of 400 results, would the Bing Custom Search API be able to return them all?
For example, the Google Custom Search API will only return a maximum of 100 results per query (10 pages of 10 results).

How to retrieve all documents (more than 10000) in an Elasticsearch index

I am trying to get all documents in an index. I tried the following:
1) Getting the total number of records first and then setting the /_search?size= parameter. This doesn't work, as the size parameter is restricted to 10000.
2) Paginating with multiple calls using the parameters '?size=1000&from=9000'.
This worked while 'from' was at most 9000, but once it exceeds 9000 I get the size restriction error again:
"Result window is too large, from + size must be less than or equal to: [10000] but was [100000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting"
So how can I retrieve all documents in the index? I read some answers suggesting the scroll API, and even the documentation states:
"While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database."
But I couldn't find any sample query to get all records in a single request.
I have a total of 388794 documents in the index.
Also note, this is a one time call so I am not worried about performance concerns.
Figured out the solution.
The scroll API is the proper way to do it. Here is how it works:
In the first call to fetch the documents, provide a size, say 1000, and a scroll parameter specifying how long the search context is kept alive before it times out (here 1 minute).
POST /index/type/_search?scroll=1m
{
"size": 1000,
"query": {....
}
}
For all subsequent calls, use the scroll_id returned in the response of the previous call to get the next chunk of records.
POST /_search/scroll
{
"scroll" : "1m",
"scroll_id" : "DnF1ZXJ5VGhIOLSJJKSVNNZZND344D123RRRBNMBBNNN==="
}
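As a rough check on how many round trips this takes, the arithmetic for the 388794 documents mentioned in the question at 1000 per page can be sketched as below. The helper name scroll_plan is hypothetical and only illustrates the calculation.

```python
def scroll_plan(total_docs: int, page_size: int) -> tuple:
    """Return (number of pages that contain data, size of the last page)."""
    # Integer ceiling division: pages needed to cover every document.
    pages = (total_docs + page_size - 1) // page_size
    # Whatever is left after all the full pages.
    last = total_docs - (pages - 1) * page_size if pages else 0
    return pages, last


print(scroll_plan(388794, 1000))  # (389, 794)
```

If the client stops only when it sees an empty page, one further request is made after the last page that contains data.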

Pagination with multi match query

I'm trying to figure out how to accomplish pagination with a multi match query using elasticsearch.
The scroll and search_after APIs seem like they won't work. According to the documentation, scroll isn't meant for real-time user requests. search_after requires a field with a unique value per document and requires you to sort on that field, but with a multi match query you are basically sorting by score.
So, the only thing I've thought of so far is the following:
Send back the last document id and score, and use the score as the sort field. But this could return duplicate documents if other documents were added between the two queries.
If you want to paginate, the first option is to use the from and size parameters in your query. The documentation says:
Pagination of results can be done by using the from and size
parameters. The from parameter defines the offset from the first
result you want to fetch. The size parameter allows you to configure
the maximum amount of hits to be returned.
Though from and size can be set as request parameters, they can also
be set within the search body. from defaults to 0, and size defaults
to 10.
Note that from + size can not be more than the index.max_result_window
index setting which defaults to 10,000. See the Scroll or Search After
API for more efficient ways to do deep scrolling.
If you don't need to paginate beyond 10k results, this is your best choice. The max_result_window setting can be raised, but performance degrades as the requested page number increases.
Of course, if documents are added while the user is paginating, they will show up in the results, so the pages can be slightly inaccurate.
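The from/size arithmetic and the 10,000 limit quoted above can be sketched as follows. This is a minimal illustration, assuming 1-based page numbers and the default index.max_result_window. The helper name page_params is hypothetical.

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def page_params(page: int, size: int = 10,
                window: int = MAX_RESULT_WINDOW) -> dict:
    """Translate a 1-based page number into from/size, rejecting deep pages."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    start = (page - 1) * size  # offset of the first hit on this page
    if start + size > window:
        # Elasticsearch would reject this request with
        # "Result window is too large".
        raise ValueError(
            f"from + size = {start + size} exceeds window {window}")
    return {"from": start, "size": size}
```

With the defaults, page 1000 is the last page that fits (from 9990, size 10), and page 1001 is rejected before any request is sent.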

Fetch data less than a score in Elasticsearch

I am trying to build an Instagram-like Explore page using Elasticsearch. The content is scored based on time as well as the number of likes. Since the like counts are updated frequently, pagination is difficult using from/size and search_after. Suppose I fetch the first 10 posts using from 0, size 10. By the time I fetch the second page, another 10 posts have gained more likes and moved to the top. Now the posts I fetched on the first page sit at positions 10 to 20, so the second page returns them again. This creates a lot of duplicates on my Explore page.
I am more concerned about avoiding duplicates in pagination than about missing some content, because if the user refreshes the Explore page, the top content will be displayed again. The best way, I think, is to fetch all posts below a particular score. Is there anything like a max_score API? If not, how can I solve this problem?
