If I have requests 1, 2, 3 in the bulk API of Elasticsearch, am I guaranteed that they are executed sequentially, i.e. 1 first, then 2, and then 3?
This article says that
Each subrequest is executed independently, so the failure of one subrequest won’t affect the success of the others.
This implies that you should not count on the order of the requests, because some of them might not finish successfully at all.
However, the response contains the status for each subrequest in the same order as they were submitted.
Also note that the index is refreshed only once per second (by default), so I would expect that individual subrequests would not see the changes of other operations from the same batch.
After reading the source code, we found that the order is preserved for operations on the same doc id. The Elasticsearch server first sorts the items of the bulk request and groups them by shard, then sends each group to its shard. Once a shard receives its shard-level bulk request, it executes the items one by one.
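The grouping described above can be pictured with a small toy model. This is only a sketch and not Elasticsearch code: `shard_for`, `group_by_shard` and `execute_bulk` are made-up names, and `zlib.crc32` stands in for the Murmur3 hash that Elasticsearch applies to the routing value.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id, num_shards):
    # Stand-in hash (assumption): Elasticsearch uses Murmur3 on the routing value.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def group_by_shard(ops, num_shards):
    """Split bulk items into per-shard lists, keeping request order inside each list."""
    groups = defaultdict(list)
    for position, op in enumerate(ops):
        groups[shard_for(op["id"], num_shards)].append((position, op))
    return dict(groups)


def execute_bulk(ops, num_shards):
    """Apply each shard's items one by one; report results in request order."""
    store = {}
    results = [None] * len(ops)
    for shard, items in group_by_shard(ops, num_shards).items():
        for position, op in items:
            store[op["id"]] = op["doc"]
            results[position] = {"id": op["id"], "shard": shard, "status": "ok"}
    return store, results


ops = [
    {"id": "a", "doc": {"v": 1}},
    {"id": "b", "doc": {"v": 1}},
    {"id": "a", "doc": {"v": 2}},
    {"id": "a", "doc": {"v": 3}},
]
store, results = execute_bulk(ops, num_shards=3)
print(store["a"])                   # the last write to "a" in request order
print([r["id"] for r in results])   # results line up with the request order
```

Since all operations for one id hash to the same shard, they end up in the same per-shard list and keep their relative order, whatever happens on the other shards.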
We are using Elasticsearch 6.0 and bulk indexing to index many documents in a single request using the “index” action. A single request can contain multiple “index” operations on the same document. Will ES fail the bulk request in that case, or will it process all of them in order?
Edit1: I use a script for indexing in the bulk request, in which we handle out-of-order updates. So as long as all “index” requests get processed, we don’t have any issue.
ES will not fail, but it is not necessarily clear which indexing operation will "win". It might be the last one, but the operations in the bulk batch might be spread over several nodes, and not all of those nodes process indexing operations at the same rate, so it might not be clear which operation is processed first and which is processed last.
The only guarantee that you have is that in the response, you'll get the state of each operation in the same order as specified in the request batch.
If your index has only one primary shard, the operations are processed in the same order in which you submitted them, hence the last one wins. If you have more than one primary shard on more than one node, then you can't really know.
A better question would be why you submit several indexing operations per document, knowing in advance that only one will win.
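To make the positional guarantee mentioned above concrete, here is a rough sketch of building a bulk body and pairing each response item with the request it belongs to. The helper names, the index name `my-index` and the sample response are made up for illustration; on 6.x the action line (or the URL) also needs a type, which is why `doc_type` is an optional argument.

```python
import json


def build_bulk_body(index, docs, doc_type=None):
    """Build the newline-delimited body for POST /_bulk from (id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        meta = {"_index": index, "_id": doc_id}
        if doc_type is not None:
            meta["_type"] = doc_type  # needed on 6.x unless given in the URL
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def pair_with_requests(docs, response):
    """The nth item of the response describes the nth operation of the request."""
    return [
        (doc_id, item["index"]["status"])
        for (doc_id, _), item in zip(docs, response["items"])
    ]


# Two "index" operations on the same document id.
docs = [("1", {"v": 1}), ("1", {"v": 2})]
body = build_bulk_body("my-index", docs)

# Hand-written sample in the shape of a bulk response, not real server output.
sample_response = {
    "errors": False,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "1", "status": 200}},
    ],
}
pairs = pair_with_requests(docs, sample_response)
print(pairs)
```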
I am using _msearch in Elasticsearch 6.4: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/search-multi-search.html.
I can send multiple searches in one API call and get the combined response. I'd like to sort and limit the response. That can easily be done by adding the sort and size parameters to a single query. But how can I do that in _msearch? Queries in _msearch run in parallel, so can I attach a sort and size after all queries complete?
_msearch provides an API to run multiple queries in a single request, but those queries are independent and not related. The order of the query responses is the same as the order of the requests, and you have to correlate the responses with your queries (the nth response is for the nth query), so you can't sort the responses.
As you said, you can add sort and size to each of the queries, and control each response independently.
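As a sketch of what this looks like in practice: each search in the _msearch body carries its own sort and size, and if one combined, sorted list is still needed, the merge has to happen on the client. The client-side merge is my own suggestion and not something _msearch does for you. The index names, the `price` field and the sample response below are invented for the example.

```python
import json


def build_msearch_body(searches):
    """Build the newline-delimited _msearch body from (index, search_body) pairs."""
    lines = []
    for index, search_body in searches:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(search_body))       # body line with its own sort/size
    return "\n".join(lines) + "\n"


def merge_hits(msearch_response, sort_field, size, descending=True):
    """Client-side merge: flatten the hits of all responses, sort them, keep the top ones."""
    hits = [
        hit
        for response in msearch_response["responses"]
        for hit in response["hits"]["hits"]
    ]
    hits.sort(key=lambda hit: hit["_source"][sort_field], reverse=descending)
    return hits[:size]


per_query = {"query": {"match_all": {}}, "sort": [{"price": "desc"}], "size": 2}
body = build_msearch_body([("shop-a", per_query), ("shop-b", per_query)])

# Hand-written sample in the shape of an _msearch response.
sample = {
    "responses": [
        {"hits": {"hits": [{"_id": "a", "_source": {"price": 30}},
                           {"_id": "b", "_source": {"price": 10}}]}},
        {"hits": {"hits": [{"_id": "c", "_source": {"price": 20}}]}},
    ]
}
top = merge_hits(sample, "price", size=2)
print([hit["_id"] for hit in top])
```

For the merged top N to be correct, each individual query has to return at least N hits sorted on the same field.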
According to the ES documentation, document indexing/deletion happens as follows:
1. The request is received at one of the nodes.
2. The request is forwarded to the document's primary shard.
3. The operation is performed on the primary shard, and parallel requests are sent to the replica nodes.
4. The primary shard's node waits for responses from the replica nodes and then sends the response to the node where the request was originally received.
5. The response is sent back to the client.
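The steps above can be pictured with a toy model. It is a simplification for illustration only: the `Shard` class and `handle_write` function are made up, and the replicas are updated one after another here rather than in parallel.

```python
class Shard:
    """A toy shard copy: just a dict of documents."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def apply(self, op, doc_id, source=None):
        if op == "index":
            self.docs[doc_id] = source
        elif op == "delete":
            self.docs.pop(doc_id, None)
        else:
            raise ValueError("unsupported operation: " + op)
        return True  # acknowledgement


def handle_write(primary, replicas, op, doc_id, source=None):
    """Coordinating node -> primary -> replicas -> back to the client."""
    log = ["forwarded to " + primary.name]
    primary.apply(op, doc_id, source)
    log.append("applied on " + primary.name)
    acks = []
    for replica in replicas:  # sequential in this sketch
        acks.append(replica.apply(op, doc_id, source))
        log.append("acknowledged by " + replica.name)
    log.append("response returned to client")
    return {"acknowledged": all(acks), "log": log}


primary = Shard("primary")
replicas = [Shard("replica-1"), Shard("replica-2")]

created = handle_write(primary, replicas, "index", "1", {"title": "hello"})
deleted = handle_write(primary, replicas, "delete", "1")
print(created["log"])
```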
Now in my case, I am sending a create document request to ES at time t and then sending a request to delete the same document (using delete_by_query) at approximately t+800 milliseconds. These requests are sent via a messaging system (an internal implementation of Kafka), which ensures that the delete request is sent to ES only after a 200 OK response has been received from ES for the indexing operation.
According to ES documentation, delete_by_query throws a 409 version conflict only when the documents present in the delete query have been updated during the time delete_by_query was still executing.
In my case, it is always guaranteed that the delete_by_query request will be sent to ES only when a 200 OK response has been received for all the documents that have to be deleted. Hence there is no possibility of an update/create of a document that has to be deleted during delete_by_query operation.
Please let me know if I am missing something or this is an issue with ES.
A possible reason is that when a document is created, it is not "committed" to the index immediately.
Elasticsearch indices operate on a refresh_interval, which defaults to 1 second.
This documentation around refresh cycles is old, but I cannot for the life of me find anything as descriptive in the more modern ES versions.
A few things you can try:
Call the _refresh API on the index after indexing and before deleting
Add the ?refresh=wait_for or ?refresh=true parameter to the indexing request
Note that refreshing the index on every indexing request is terrible for performance, which raises the question of why you are trying to delete a document immediately after indexing it.
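To illustrate the refresh behaviour, here is a toy model of near-real-time visibility. It is not how Elasticsearch is implemented; `ToyIndex` is a made-up class in which a newly indexed document only becomes visible to search-based operations, such as a delete by query, after a refresh.

```python
class ToyIndex:
    """Toy model: writes go to a buffer and become searchable only on refresh."""

    def __init__(self):
        self._buffer = {}
        self._searchable = {}

    def index(self, doc_id, source, refresh=False):
        self._buffer[doc_id] = source
        if refresh:  # plays the role of ?refresh=true on the indexing request
            self.refresh()

    def refresh(self):
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def delete_by_query(self, matches):
        # Like a search, this only sees documents that have been refreshed.
        hits = [doc_id for doc_id, src in self._searchable.items() if matches(src)]
        for doc_id in hits:
            del self._searchable[doc_id]
        return len(hits)

    def search_ids(self):
        return sorted(self._searchable)


def is_stale(source):
    return source["status"] == "stale"


no_refresh = ToyIndex()
no_refresh.index("1", {"status": "stale"})
deleted_before_refresh = no_refresh.delete_by_query(is_stale)  # finds nothing yet
no_refresh.refresh()
left_behind = no_refresh.search_ids()  # the document shows up after the refresh

with_refresh = ToyIndex()
with_refresh.index("2", {"status": "stale"}, refresh=True)
deleted_after_refresh = with_refresh.delete_by_query(is_stale)

print(deleted_before_refresh, left_behind, deleted_after_refresh)
```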
Add the following so that the delete-by-query request does not abort on version conflicts:
deleteByQueryRequest.setAbortOnVersionConflict(false);
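If you call the REST API directly instead of using the Java client, the corresponding setting is the conflicts=proceed query parameter of _delete_by_query. A small sketch of building the request path (the helper name and the index name are made up):

```python
from urllib.parse import urlencode


def delete_by_query_path(index, proceed_on_conflict=True, refresh=False):
    """Build the path for POST /<index>/_delete_by_query with optional parameters."""
    params = {}
    if proceed_on_conflict:
        params["conflicts"] = "proceed"  # count version conflicts instead of aborting
    if refresh:
        params["refresh"] = "true"       # refresh the affected shards when done
    query = urlencode(params)
    return "/" + index + "/_delete_by_query" + ("?" + query if query else "")


path = delete_by_query_path("my-index")
print(path)
```

With conflicts=proceed the request keeps going and reports the number of version conflicts in the response instead of failing with a 409.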
I was going through Elasticsearch and wanted to get consistent responses from ES clusters.
I read "Elasticsearch read and write consistency",
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-index_.html
and some other posts, and I conclude that ES returns success for a write operation only after completing the write on all shard copies (primary + replicas), irrespective of the consistency param.
Let me know if my understanding is wrong.
I am also wondering if anyone knows how Elasticsearch adds a node/shard that was down transiently back into a cluster. Will it start serving read requests immediately after it is available, or does it ensure it has up-to-date data before serving read requests?
I looked for the answer to the above question, but could not find any.
Thanks
Gopal
If a node is removed from the cluster and joins again, Elasticsearch checks whether its data is up to date. If it is not, the shard will not be made available for search until it is brought up to date again (which could mean the whole shard gets copied again).
The consistency parameter is just an additional pre-index check of whether the expected number of shard copies is available in the cluster (if the index is configured to have 4 replicas and consistency is set to quorum, then the primary shard plus two replicas need to be available). However, this parameter never changes the behaviour that a write must be applied to all available shard copies before returning to the client.
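As a quick check of the numbers above, here is a sketch of the pre-index check as I understand it for the 2.x consistency parameter: quorum means int((primary + number_of_replicas) / 2) + 1 copies, and it is only enforced when more than two copies are configured. The function name is made up.

```python
def required_active_copies(number_of_replicas, consistency="quorum"):
    """Shard copies that must be active before a write is attempted (2.x-style check)."""
    total_copies = 1 + number_of_replicas  # the primary plus its replicas
    if consistency == "one":
        return 1
    if consistency == "all":
        return total_copies
    if consistency == "quorum":
        if total_copies <= 2:
            # With a single replica a quorum of 2 would block writes whenever
            # one copy is missing, so only the primary is required.
            return 1
        return total_copies // 2 + 1
    raise ValueError("unknown consistency level: " + consistency)


# 4 replicas -> 5 copies -> quorum of 3, i.e. the primary plus two replicas.
print(required_active_copies(4))
```

This only decides whether the write is attempted at all. It does not change how many copies the write is applied to once it goes ahead.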
I was going through the details about how updates to a document are propagated through the primary shard and sent to replica shards as provided here: https://www.elastic.co/guide/en/elasticsearch/guide/current/_partial_updates_to_a_document.html
It is written that updates to the document are communicated asynchronously to the replica shards and that the update is retried up to retry_on_conflict times to make sure that it is executed successfully.
Why does it have to try this many times? It could have returned an error on the first try itself. Please provide some examples where the update would fail on the first attempt and would succeed after some retries.
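One way to picture a case where the first attempt fails and a later one succeeds is a toy model of optimistic concurrency control. This is a sketch of the general idea, not of Elasticsearch internals: `ToyStore`, `update_with_retry` and the `interfere` hook (which simulates another client writing between our read and our write) are all made up.

```python
class VersionConflict(Exception):
    pass


class ToyStore:
    """Toy versioned store: a write must name the version it was based on."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        version, source = self._docs[doc_id]
        return version, dict(source)  # return a copy so callers cannot mutate the store

    def put(self, doc_id, source, expected_version):
        current_version, _ = self._docs.get(doc_id, (0, {}))
        if current_version != expected_version:
            raise VersionConflict(doc_id)
        self._docs[doc_id] = (current_version + 1, dict(source))


def update_with_retry(store, doc_id, change, retry_on_conflict=0, interfere=None):
    """Read, modify, write; on a version conflict start over, up to the retry limit."""
    attempts = 0
    while True:
        attempts += 1
        version, source = store.get(doc_id)
        if interfere is not None:
            interfere(attempts)  # another client may write here
        try:
            store.put(doc_id, change(source), expected_version=version)
            return attempts
        except VersionConflict:
            if attempts > retry_on_conflict:
                raise


store = ToyStore()
store.put("1", {"views": 0}, expected_version=0)


def other_writer(attempt):
    # A concurrent client bumps the counter during our first attempt only.
    if attempt == 1:
        version, source = store.get("1")
        source["views"] += 1
        store.put("1", source, expected_version=version)


def increment(source):
    source["views"] += 1
    return source


attempts = update_with_retry(store, "1", increment, retry_on_conflict=3,
                             interfere=other_writer)
print(attempts, store.get("1"))
```

With retry_on_conflict=0 the same call would raise VersionConflict. For an order-independent change such as incrementing a counter, re-reading the document and applying the change again is safe, which is why retrying can be preferable to failing on the first conflict.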