We are using Elasticsearch 6.0 and use bulk indexing to index many documents in a single request with the “index” action. A single request can contain multiple “index” operations on the same document. Will ES fail the bulk request in that case, or will it process all of them in order?
Edit 1: I use a script in the bulk request to handle out-of-order updates, so as long as all “index” operations are processed, we don’t have any issue.
ES will not fail, but it is not necessarily clear which indexing operation will "win". It might be the last one, but since the operations in the bulk batch might be spread over several nodes, and not all of those nodes process indexing operations at the same rate, you cannot always tell which operation will be processed first and which last.
The only guarantee that you have is that in the response, you'll get the state of each operation in the same order as specified in the request batch.
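To make the response-order guarantee concrete, here is a minimal Python sketch. It assumes the usual bulk response shape (an "items" array with one entry per operation); the helper names build_bulk_body and pair_with_results are made up for illustration and are not part of any client library.

```python
import json


def build_bulk_body(operations):
    """Serialize (action_metadata, source) pairs into the NDJSON body used by _bulk."""
    lines = []
    for meta, source in operations:
        lines.append(json.dumps({"index": meta}))  # action line
        lines.append(json.dumps(source))           # document line
    return "\n".join(lines) + "\n"                 # the body must end with a newline


def pair_with_results(operations, bulk_response):
    """Match each submitted operation with its result purely by position."""
    items = bulk_response["items"]
    if len(items) != len(operations):
        raise ValueError("response does not cover every submitted operation")
    paired = []
    for (meta, source), item in zip(operations, items):
        result = item["index"]  # keyed by the action name used in the request
        paired.append((meta["_id"], source, result["status"]))
    return paired
```

Because the pairing is positional, two "index" operations on the same _id can still be told apart in the response, even though the response alone does not say which one was applied last.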
If your index has only one primary shard, the operations are processed in the same order in which you submitted them, hence the last one wins. If you have more than one primary shard on more than one node, then you can't really know.
A better question would be: why do you submit several indexing operations per document, knowing in advance that only one will win?
Curious if there is some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I’ll have a group of related document IDs and only want to return them if they are part of a larger query. I'm hoping to do this on the database side. It seems theoretically possible, since ES has to cache data related to large scrolls.
It's an interesting use case, but you need to understand that Elasticsearch (ES) doesn't return all matching document IDs in the search result. By default it returns only 10 documents in the response, which can be changed with the size parameter.
If you increase the size parameter and your query matches millions of documents, query performance will be very bad, and firing such queries frequently might even bring the entire cluster down (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches data, but if you try to cache a huge amount of data that is invalidated very frequently, you will not get the expected performance benefit, so it is better to benchmark it.
You are already on the right path with the scroll API for iterating over millions of search results; the points below may help you improve further.
First get the count of search results. It is included in the default search response, marked either as exact (eq) or as a lower bound (gte), and it tells you how many results there are, based on which you can set the size parameter for subsequent calls that check whether your ID is present.
Check whether you are making effective use of filter context in your query; filters are cached by ES by default.
Benchmark some of your heavy scroll API calls against your data.
Refer to this thread to fine-tune your cluster and index configuration and optimize ES response times further.
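As one way to apply the filter-context point to the original question, the sketch below wraps the large query and an ids query in a single bool filter, so only the candidate documents that also match the large query come back. This is an illustration rather than something stated in the answer above; the function names are hypothetical, and it assumes the group of candidate IDs is small enough to fit under index.max_result_window (10,000 by default).

```python
def build_membership_query(large_query, candidate_ids):
    """Build a search body that returns only candidate IDs matching large_query."""
    candidate_ids = list(candidate_ids)
    return {
        "size": len(candidate_ids),  # at most one hit per candidate ID
        "_source": False,            # only the IDs are needed, not the documents
        "query": {
            "bool": {
                # Both clauses run in filter context: no scoring, cacheable.
                "filter": [
                    large_query,
                    {"ids": {"values": candidate_ids}},
                ]
            }
        },
    }


def matching_ids(search_response):
    """Extract the set of document IDs from a search response."""
    return {hit["_id"] for hit in search_response["hits"]["hits"]}
```

The body would be sent to the index's _search endpoint; comparing matching_ids(response) with the candidate list shows which IDs are part of the larger query without scrolling through millions of hits.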
If I have requests 1, 2, 3 in the bulk API of Elasticsearch, am I guaranteed that they are executed sequentially, i.e. 1 first, then 2 and then 3?
This article says that
Each subrequest is executed independently, so the failure of one subrequest won’t affect the success of the others.
This implies that you should not count on the order of the requests, because some of them might not finish successfully at all.
However, the response contains the status for each subrequest in the same order as they were submitted.
Also note that the index is refreshed only once per second (by default), so I would expect that individual subrequests would not see the changes made by other operations from the same batch.
After reading the source code, we found that for operations on the same doc ID, the order is assured. The Elasticsearch server first sorts the items of the bulk request and groups them by shard. The per-shard requests are then sent to their shards. Once a shard receives its shard bulk request, it executes the items one by one.
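The grouping described above can be illustrated with a small simulation. This is only a sketch of the idea, not Elasticsearch's code: the real server routes with a Murmur3 hash of the routing value, whereas zlib.crc32 is used here as a stand-in, and group_by_shard is a made-up name.

```python
import zlib
from collections import defaultdict


def group_by_shard(operations, num_primary_shards):
    """Group (doc_id, source) operations by shard, keeping submission order.

    ASSUMPTION: crc32 stands in for the hash Elasticsearch really uses.
    """
    groups = defaultdict(list)
    for position, (doc_id, source) in enumerate(operations):
        shard = zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards
        # Appending in submission order means each shard sees its items
        # in the same relative order as in the original bulk request.
        groups[shard].append((position, doc_id, source))
    return dict(groups)
```

Since the shard is derived only from the document ID, every operation on the same ID lands in the same group, and within that group the original order is preserved. Operations on different IDs may end up on different shards, where no relative order is implied.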
I'm using ES as a backend, so my architecture is client-server.
Very often, maybe too often, I notice that when I perform two operations from the client, an index and a search almost immediately after each other, the indexed document is not returned by ES.
When I refresh the results, the last indexed document is returned by the server.
Is there something I should keep in mind to avoid this behavior?
Is this behavior usual?
Yes, this is the usual behaviour. Elasticsearch refreshes each shard every 1 second by default.
Elasticsearch can become really slow if you refresh after every index operation.
In a single request, I want to retrieve documents from a SOR, store them in Elasticsearch, and then search those documents using the ES search API.
There seems to be some lag between the time a document is indexed and the time it is analyzed and ready to be searched.
Is there any way to configure ES to not return from the request to index a document until the analyzer has analyzed it and so that it can immediately be searched?
Elasticsearch is "near real-time" by nature, i.e. all indices are refreshed every second (by default). While it may seem enough in a majority of cases, it might not, such as in your case.
If you need your documents to be available immediately, you need to refresh your indices explicitly by calling
POST /_refresh
or if you only want to refresh one index
POST /my_index/_refresh
The refresh needs to happen after the indexing call has returned and before the search call is sent off.
Note that doing this on every document indexing will hurt the performance of your system. It might be better to make your application aware of the near real-time nature of ES and handle this on the client-side.
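The ordering of the three calls can be sketched as follows. The send argument is a hypothetical callable (method, path, body) that performs the HTTP request and returns the parsed response; it is injected so the sketch does not depend on any particular client library. The document type is passed explicitly because 6.x URLs include it.

```python
def index_refresh_search(send, index, doc_type, doc_id, document, query):
    """Index one document, refresh the index, then search it.

    `send` is an assumed helper: send(method, path, body) -> parsed JSON.
    """
    # 1. Index the document and wait for the call to return.
    send("PUT", "/{}/{}/{}".format(index, doc_type, doc_id), document)
    # 2. Explicitly refresh only the affected index.
    send("POST", "/{}/_refresh".format(index), None)
    # 3. Only now send the search; the document should be visible.
    return send("POST", "/{}/_search".format(index), {"query": query})
```

As the note above says, this is fine for occasional use, but calling it for every single document will hurt indexing throughput.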
The refresh API, as suggested in the accepted answer, is heavy in nature and you may not want to call this API after every index operation, if you are going to do a significant number of indexing operations.
What happens under the hood is that the documents held in Elasticsearch's in-memory indexing buffer are written to a new segment, which is what makes them searchable. This operation is best left to the discretion of Elasticsearch; however, there are some configuration parameters you can play around with.
There is an alternative approach you can take, it may or may not suit your specific use case, but here it goes.
Query the index/_stats/refresh API and note the refresh statistics it returns, index your document, and then keep running the same stats query. If the refresh count has increased since you indexed, your document should be ready to be searched.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
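A rough sketch of that polling loop is shown below. It assumes the stats response exposes the refresh counter at _all.primaries.refresh.total, and that fetch_stats is a hypothetical callable returning the parsed response of GET index/_stats/refresh. Treat it as a heuristic: on an index with several shards the counter is a sum over all primaries, so an increase does not strictly prove that the shard holding your document has refreshed.

```python
import time


def refresh_total(stats_response):
    """Read the cumulative refresh count for all primaries from a stats response."""
    return stats_response["_all"]["primaries"]["refresh"]["total"]


def wait_for_refresh(fetch_stats, baseline_total, timeout_s=5.0,
                     poll_interval_s=0.2, sleep=time.sleep):
    """Poll until the refresh count exceeds baseline_total or the timeout expires.

    `fetch_stats` is an assumed helper that returns the parsed stats response.
    Returns True if a refresh was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if refresh_total(fetch_stats()) > baseline_total:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

Typical use would be to record refresh_total(fetch_stats()) as the baseline, index the document, and then call wait_for_refresh before searching.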
With routing we can place a particular file/doc/JSON on a particular shard, which makes it easy to extract data.
But I am wondering whether it would be possible to store a particular field of a JSON document on a particular shard.
For example:
I have three fields: username, message and time. I created 3 shards for indexing.
Now I want
username stored on one shard, message on another shard and time on another shard.
Thanks
No, this is not possible. The whole document (the JSON doc) will be stored on one shard. If you want to do what you describe, then you should split the data up into separate docs, and then you can route them differently.
As for the reasoning, imagine there was a username query which matched document5. If document5 was spread over many shards, these would all have to be queried to get the other parts of document5 back to compile the results. Imagine further a complex AND query across different fields: there would be a lot of traffic (and waiting) to find out whether both fields match and so whether the document is a hit or not.
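The splitting workaround mentioned above could look roughly like this: one small document per field, each with its own routing value, serialized as bulk actions. The function name, the ID scheme and the parent_id field are all made up for illustration. Note that distinct routing values are not guaranteed to map to distinct shards, since the shard is chosen by hashing the routing value.

```python
import json


def split_into_routed_actions(index, doc_type, doc_id, document):
    """Turn one document into per-field documents, each routed by its field name."""
    lines = []
    for field, value in document.items():
        meta = {
            "_index": index,
            "_type": doc_type,
            "_id": "{}-{}".format(doc_id, field),  # hypothetical ID scheme
            "routing": field,                      # one routing value per field
        }
        lines.append(json.dumps({"index": meta}))
        # parent_id is an illustrative field for stitching the parts back together.
        lines.append(json.dumps({"parent_id": doc_id, field: value}))
    return "\n".join(lines) + "\n"
```

Reading the original document back then means querying all the parts by parent_id, which is exactly the cross-shard traffic the reasoning above warns about.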