I want to get all the documents in a bucket, but there are a lot of them, about 40 million. So I created a view and fetch 1 million documents at a time by setting the limit argument, but it always hits a timeout error partway through. How can I get all the documents?
thank you
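A common cause of this is offset-based paging: the deeper into the view each request reaches, the more work it does, until it times out. The usual fix is to page by key instead, with a much smaller page size, restarting each page at the last key/doc id seen. A rough sketch of the pattern in Python, where query_view is a hypothetical stand-in for your SDK's view call (all names here are illustrative, not a real client API):

    # Page through a view by key instead of by offset. `query_view` is a
    # hypothetical wrapper around your SDK's view query; adapt the names
    # to whatever client you are using.
    PAGE_SIZE = 5000  # far smaller than 1 million; tune to stay under the timeout

    def fetch_all(query_view):
        start_key, start_docid = None, None
        while True:
            rows = query_view(limit=PAGE_SIZE,
                              startkey=start_key,
                              startkey_docid=start_docid,
                              # startkey_docid is inclusive, so skip the row
                              # already returned on the previous page
                              skip=0 if start_key is None else 1)
            if not rows:
                break
            for row in rows:
                yield row
            start_key, start_docid = rows[-1]["key"], rows[-1]["id"]

Because each page restarts at a known key, every request does roughly the same amount of work no matter how far into the 40 million documents you are.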
Related
Elasticsearch's search feature only supports 10K results by default. I know I can specify the "size" parameter in the search query, but that only applies to the number of results returned by a single call.
If I want to iterate over 20K results using size=100, making 200 calls in total, how should I do it?
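One approach that scales past the 10K window is search_after: give the search a deterministic sort and feed the sort values of the last hit of each page into the next request. A rough sketch with the official Python client (index name and sort field are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder address

    results = []
    search_after = None
    while True:
        body = {
            "size": 100,
            # the sort must be deterministic; _id is used as a tiebreaker
            # here for brevity (recent versions recommend a dedicated field
            # or a point-in-time instead)
            "sort": [{"_id": "asc"}],
            "query": {"match_all": {}},
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = es.search(index="my-index", body=body)["hits"]["hits"]
        if not hits:
            break
        results.extend(h["_source"] for h in hits)
        search_after = hits[-1]["sort"]  # resume after the last hit of this page

Unlike from/size, this keeps working past 10,000 results because each request starts from a sort position rather than an offset.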
I am trying to fetch about 2.5 million records from Elasticsearch using its Java High Level REST Client. Fetching all the records via the scroll API takes too much time (15 to 22 minutes depending on the number of records), since it is limited to 10,000 records per request. I tried sliced scroll as well, but that takes even more time than a normal scroll. My assumptions about the sliced scroll API are as follows:
I divided my scroll request into five slices, which creates 5 requests.
I send the 5 requests in different threads.
Because every sliced scroll request is an individual request, I guess each slice first fetches all the records (2.5 million) and then filters out the records that belong to that particular slice.
This is resulting in more time.
Can anyone tell me a more efficient way to fetch all the records?
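For reference, a sliced scroll looks roughly like the sketch below (shown here with the official Python client; index name, slice count, and batch size are placeholders). Each slice is an independent scroll, so the slices really can run in parallel; keeping the number of slices at or below the number of primary shards usually avoids the expensive per-slice filtering you describe.

    from concurrent.futures import ThreadPoolExecutor
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder address
    SLICES = 5  # keep at or below the number of primary shards

    def scan_slice(slice_id):
        docs = []
        resp = es.search(
            index="my-index",
            scroll="2m",
            body={
                "slice": {"id": slice_id, "max": SLICES},
                "size": 10000,
                "query": {"match_all": {}},
            },
        )
        while resp["hits"]["hits"]:
            docs.extend(h["_source"] for h in resp["hits"]["hits"])
            resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="2m")
        es.clear_scroll(scroll_id=resp["_scroll_id"])  # release server-side state
        return docs

    with ThreadPoolExecutor(max_workers=SLICES) as pool:
        all_docs = [d for part in pool.map(scan_slice, range(SLICES)) for d in part]

Two things that often matter more than the number of slices: trim _source to only the fields you actually need, and use the largest batch size your heap tolerates.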
When we query ES for records, it returns 10 records by default. How can I get all the records in the same query without using any scroll API?
There is an option to specify the size, but the size is not known in advance.
You can retrieve up to 10k results in one request (setting "size": 10000). If you have fewer than 10k matching documents, you can paginate over them using a combination of the from/size parameters. If there are more, you will have to use other methods:
Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000. See the Scroll or Search After API for more efficient ways to do deep scrolling.
To paginate over an unknown number of documents, you will have to get the total count from the first returned query.
Note that if there are concurrent changes in the data, results of paginated retrieval may not be consistent (for example, if one document gets inserted or deleted while you are paginating). Scroll is consistent since it is created from a "snapshot" of the index created at query start time.
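Putting that together, a count-then-paginate sketch with the Python client might look like this (index name and query are placeholders; it only works while from + size stays within index.max_result_window):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder address
    PAGE = 100
    query = {"match_all": {}}

    # get the total count first, then page through it with from/size;
    # from + size must stay within index.max_result_window (10,000 by default)
    total = es.count(index="my-index", body={"query": query})["count"]

    docs = []
    for offset in range(0, min(total, 10_000), PAGE):
        page = es.search(index="my-index",
                         body={"query": query, "from": offset, "size": PAGE})
        docs.extend(h["_source"] for h in page["hits"]["hits"])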
I have to build an index in Elasticsearch that will have more than 500,000 unique documents. The documents have nested fields as well.
All the documents in the index are updated every 10 minutes (using PUT).
I read that updating a document involves reindexing it, which can affect search performance.
Has anyone faced a similar scenario using ES? Can someone share their experience of the search/query response time across such an index, given that the expected response time for a query is under 2 seconds?
Update:
Now, I indexed a document with id 1 using an update request. Then I updated the document (id=1) using PUT to /_update with
"doc_as_upsert": true and a doc field. I see that the response contains the same version as before the update and has the attribute result = "noop".
I assume that no reindexing happened, since the document's version was not incremented.
If I do the same for 500,000 documents every 10 minutes instead of using PUT (the index API), does this reduce the impact on search response times (assuming 100 requests/second) and on indexing?
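For what it's worth, the "noop" you saw is the update API's detect_noop behaviour, which is on by default: if merging the partial doc changes nothing, Elasticsearch skips reindexing and the version stays the same. A minimal sketch with the Python client (index, id, and fields are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder address

    resp = es.update(
        index="my-index",
        id="1",
        body={
            "doc": {"status": "active"},  # partial document to merge
            "doc_as_upsert": True,        # create the doc if it doesn't exist
            # "detect_noop" defaults to true: identical docs skip reindexing
        },
    )
    print(resp["result"])  # "created", "updated", or "noop"

For 500,000 documents every 10 minutes you would want to send these as bulk update actions rather than one request per document.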
In order to load all the documents indexed by Elasticsearch, I am using the following query through Tire.
def all
  max = total
  Tire.search 'my_documents' do
    query { all }
    size max
  end.results.map { |entry| entry.to_hash }
end
Here max (i.e. total) is the result of a count query returning the number of documents currently indexed. I have indexed about 10,000 documents. Currently, the request takes too long.
I am aware that I should not query all documents like this. What is the best alternative here? Pagination, and if so, on which metric should I base the number of documents per page?
I am also planning to grow the index to 100,000 or even 1,000,000 documents, and I don't yet see how this can scale.
I appreciate every comment.
Rationale: I do this because I am running calculations over these data. Hence, I need to get all the data, run the computations, and save the results back into the documents.
Have a look at the scroll API, which is highly optimized for fetching a large number of results. It uses the scan search type and doesn't support sorting, but it lets you provide a query to filter the documents you want to fetch. Have a look at the reference to learn more about it. Remember that the size you define in the request is per shard; that means that if you have 5 primary shards, setting 10 would give you up to 50 results back per request.
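In case it helps, current clients wrap that pattern for you; for example, a sketch with the Python client's scan helper (the query and batch size are placeholders):

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")  # placeholder address

    # helpers.scan drives the scroll API for you: it keeps fetching batches
    # until the result set is exhausted and yields one hit at a time.
    for hit in helpers.scan(es,
                            index="my_documents",
                            query={"query": {"match_all": {}}},
                            size=1000):  # batch size per scroll request
        doc = hit["_source"]
        # ... run the computation over `doc` and collect the results ...

Since you want to write the computed results back afterwards, the client's bulk helper is the usual counterpart for the save step.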