Elasticsearch: non-deterministic data corruption on replicas

I want to refer to this part of the documentation: https://www.elastic.co/guide/en/elasticsearch/guide/current/_partial_updates_to_a_document.html
Namely the last blue box and its last sentence.
When a primary shard forwards changes to its replica shards, it
doesn’t forward the update request. Instead it forwards the new
version of the full document. Remember that these changes are
forwarded to the replica shards asynchronously, and there is no
guarantee that they will arrive in the same order that they were sent.
If Elasticsearch forwarded just the change, it is possible that
changes would be applied in the wrong order, resulting in a corrupt
document.
So what does this mean?
It says that if the random asynchronous delivery order is unfortunate, the replica will contain a corrupt document (?) That does not sound reliable for an industry solution.
If the above is true, how can one eliminate this problem of non-deterministic data corruption on replicas?
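Note that the last sentence describes the failure mode that forwarding the full document *avoids*, not a behaviour Elasticsearch actually has. Here is a minimal plain-Python sketch of both behaviours (a toy model, not Elasticsearch code; all names are illustrative):

```python
# Toy simulation of replication. "Forwarding just the change" corrupts the
# replica when messages arrive out of order; forwarding the full versioned
# document converges on the correct state regardless of arrival order.

def apply_update(doc, update):
    # Naive replication of the raw update request: last arrival wins,
    # whatever order the requests happen to come in.
    doc.update(update)

def apply_versioned_doc(replica, doc, version):
    # What the docs say Elasticsearch forwards: the complete new document
    # plus its version. Stale versions are simply discarded.
    if version > replica.get("_version", 0):
        replica.clear()
        replica.update(doc, _version=version)

# The primary executes two updates in order: "shipped", then "delivered".
# The replica receives the messages in the opposite order.
bad_replica = {"status": "new"}
apply_update(bad_replica, {"status": "delivered"})
apply_update(bad_replica, {"status": "shipped"})
assert bad_replica["status"] == "shipped"      # corrupt: stale value sticks

good_replica = {"status": "new"}
apply_versioned_doc(good_replica, {"status": "delivered"}, version=2)
apply_versioned_doc(good_replica, {"status": "shipped"}, version=1)  # ignored
assert good_replica["status"] == "delivered"   # converges on the latest state
```

The version check makes delivery order irrelevant, which is exactly why the full document (with its version) is forwarded instead of the update request.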

Related

How does Elasticsearch bring back a node which is down

I was going through Elasticsearch and wanted to get consistent responses from ES clusters.
I read about Elasticsearch read and write consistency
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-index_.html
and some other posts, and I conclude that ES returns success for a write operation only after completing the write on all shards (primary + replicas), irrespective of the consistency param.
Let me know if my understanding is wrong.
I am wondering if anyone knows how Elasticsearch adds a node/shard back into a cluster after it was down transiently. Will it start serving read requests immediately after it is available, or does it ensure it has up-to-date data before serving read requests?
I looked for the answer to above question, but could not find any.
Thanks
Gopal
If a node is removed from the cluster and then joins again, Elasticsearch checks whether its data is up to date. If it is not, the shard will not be made available for search until it has been brought up to date again (which could mean the whole shard gets copied over again).
The consistency parameter is just an additional pre-index check that the expected number of shard copies is available in the cluster (if the index is configured with 4 replicas and consistency is set to quorum, then the primary shard plus two replicas need to be available). However, this parameter never changes the behaviour that a write needs to be written to all available shards before returning to the client.

elastic query returns same results after insert

I'm using elasticsearch.js to move a document from one index to another.
1a) Query index_new for all docs and display on the page.
1b) Use query of index_old to obtain a document by id.
2) Use an insert to index_new, inserting result from index_old.
3) Delete document from index_old (by id).
4) Re-query index_new to see all docs (including the new one). However, at this point it returns the same list of results as in 1a, not including the new document.
Is this because of caching? When I refresh the whole page and 1a is triggered, the new document is there... but not without a refresh.
Thanks,
Daniel
This is due to the segment merging and refreshing that happens inside the Elasticsearch indices, per shard and replica.
Whenever you write to an index you never write to the original index files; instead you write to newer, smaller files called segments, which then get merged into the bigger files by background jobs.
The next question you might have is:
How often does this happen, and how can one control it?
There is an index-level setting called refresh_interval, which controls how often newly written segments are made searchable. It can take multiple values depending on the strategy you want to use:
refresh_interval:
-1 : stops Elasticsearch from refreshing automatically; you control refreshes yourself with the _refresh API.
X : a duration in seconds (e.g. 30s), in which case Elasticsearch refreshes the index every X seconds.
If you have replication enabled on your indices you may also see query results toggling between values. This happens because an index has multiple shards, a shard has multiple replicas, and different replicas refresh at different times. Queries are routed to different shard replicas in the meantime, so they can show different states within that time window.
Hence, if you use a periodic refresh interval, expect a consistent state within the next X to 2X seconds at most.
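The refresh behaviour described above can be modelled in a few lines of plain Python (a toy, not Lucene): documents are buffered on index and only become searchable after a refresh, which is exactly why the immediate re-query in step 4 returns stale results:

```python
# Toy model of index-time visibility: writes land in a buffer and are only
# visible to searches after a refresh (triggered by refresh_interval or the
# _refresh API in real Elasticsearch).

class ToyIndex:
    def __init__(self):
        self._searchable = []   # segments already visible to queries
        self._buffer = []       # indexed but not yet refreshed

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # Make everything written so far visible to searches.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self):
        return list(self._searchable)

idx = ToyIndex()
idx.index({"id": 1})
assert idx.search() == []           # immediate re-query: stale results
idx.refresh()
assert idx.search() == [{"id": 1}]  # visible only after the refresh
```

In the real API you can also force this explicitly (POST /index_new/_refresh) between steps 2 and 4 if you need read-your-own-writes behaviour on searches.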
Segment Merge Background details
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/indices-update-settings.html

ElasticSearch: Synchronization between replicas and primary shard

I was going through the details about how updates to a document are propagated through the primary shard and sent to replica shards as provided here: https://www.elastic.co/guide/en/elasticsearch/guide/current/_partial_updates_to_a_document.html
It is written that updates to the document are communicated asynchronously to the replica shards, and that Elasticsearch retries up to retry_on_conflict times to make sure the update is successfully executed.
Why does it have to try this many times? It could have returned an error on the first try. Please provide some examples where the update would fail on the first attempt but succeed after some retries.
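One concrete scenario: the update reads the document, another writer bumps its version before our write lands, the write fails with a version conflict, and the retry re-reads the new version and reapplies the change on top of it. A plain-Python sketch of this optimistic-concurrency loop (illustrative names, not the ES client API):

```python
# Toy optimistic concurrency control with a retry loop, in the spirit of
# retry_on_conflict: a conflict means the document changed between our read
# and our write, so re-reading and reapplying can succeed.

class VersionConflict(Exception):
    pass

doc = {"_version": 1, "views": 0}

def get():
    # Read the current document, including its version.
    return dict(doc)

def index(new_doc, expected_version):
    # Reject writes that are based on a stale version.
    if doc["_version"] != expected_version:
        raise VersionConflict()
    doc.update(new_doc, _version=expected_version + 1)

def increment_views(max_retries):
    attempt = 0
    while attempt <= max_retries:
        attempt += 1
        snapshot = get()
        if snapshot["_version"] == 1:
            # Simulate a concurrent writer winning the race on attempt 1.
            index({"views": snapshot["views"] + 10}, expected_version=1)
        try:
            index({"views": snapshot["views"] + 1}, snapshot["_version"])
            return True
        except VersionConflict:
            continue   # re-read and reapply, like retry_on_conflict does
    return False

assert increment_views(max_retries=1)   # first attempt conflicts, retry wins
assert doc["views"] == 11               # concurrent +10, then our +1
```

With max_retries=0 the same race would surface as a conflict error to the caller, which is why a counter-style update on a hot document benefits from a few retries.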

Retrieving a document at replica shard while it is not yet updated data

I have a question with Elasticsearch as below:
When the replication mode is async, while a document is being indexed it will already be present on the primary shard but not yet copied to the replica shards. At this time, a GET request for this document may be forwarded to a replica shard.
How does Elasticsearch handle the cases where the document is not yet indexed on the replica shards, or not yet updated on the replica shards?
Will a failed request be returned when indexing a new document, or the old document returned when updating an existing one? Or will the requesting node re-forward the request to the primary shard to get the data?
First of all, I wouldn't recommend using the async mode. It doesn't really provide any benefits that you couldn't achieve in a safer way, but it creates issues when it comes to reliability. Because of this, the feature was deprecated in v1.5 and completely removed in v2.0 of Elasticsearch.
That said, if you still want to use async and care about getting the latest results, you have to use it with the primary preference.
In the case of an update operation, you don't have to do anything. An update is always performed on the primary shard first, and the result of the operation is then replicated to all replicas.
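A toy illustration of why pinning reads to the primary helps under async replication (plain Python, not client code; the lagging replica is simulated):

```python
# The primary always has the latest write; a replica may lag behind until
# the async replication message arrives.

shard_copies = {
    "primary": {"counter": 2},    # already has the latest update
    "replica": {"counter": 1},    # replication not yet applied
}

def get(preference=None):
    # Without a preference the request may land on any copy; here we model
    # the unlucky case where it hits the lagging replica.
    copy = "primary" if preference == "_primary" else "replica"
    return dict(shard_copies[copy])

assert get() == {"counter": 1}                        # stale replica read
assert get(preference="_primary") == {"counter": 2}   # latest state
```

In the REST API this corresponds to the preference request parameter (note that the _primary preference value was itself removed in later Elasticsearch versions).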

I am looking for storing particular field in particular shard in elasticsearch

With routing we can allocate a particular file/doc/JSON to a particular shard, which makes it easy to extract data.
But I am wondering whether it would be possible to store a particular field of a JSON file in a particular shard.
For example:
I have three fields: username, message and time. I have created 3 shards for indexing.
Now I want:
username stored in one shard, message in another shard and time in another shard.
Thanks
No, this is not possible. The whole document (the JSON doc) will be stored on one shard. If you want to do what you describe, you should split the data up into separate docs, which you can then route differently.
As for the reasoning, imagine there was a username query which matched document5. If document5 were spread over many shards, all of them would have to be queried to get the other parts of document5 back to compile the results. Imagine further a complex AND query across different fields: there would be a lot of traffic (and waiting) to find out whether both fields match, just to compute whether the document was a hit or not.
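For reference, shard selection is a hash of the routing value modulo the number of primary shards (real Elasticsearch uses murmur3; the sketch below uses crc32 just to stay dependency-free), which is why a single document can never be split across shards:

```python
# Sketch of document-level routing: one routing value -> one shard, so all
# fields of a document always land together. Splitting into per-field docs
# with their own routing values is the only way to spread fields out.

import zlib

NUM_SHARDS = 3

def shard_for(routing_value):
    # Real ES: murmur3(routing) % number_of_primary_shards.
    return zlib.crc32(routing_value.encode()) % NUM_SHARDS

doc = {"username": "alice", "message": "hi", "time": "2017-01-01"}

# One doc -> one shard: every field lands together.
whole_doc_shard = shard_for("doc-5")

# Split into three docs, each routed by its field name (hashes are not
# guaranteed to be distinct, since this is just a hash mod 3).
per_field_shards = {field: shard_for(field) for field in doc}
```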
