Elasticsearch incremental snapshot is ambiguous

The Elasticsearch snapshot/restore documentation states that the index snapshot process is incremental.
Could you please explain what that means and confirm that every snapshot is self-contained when it comes to restoration?
Use case:
Let's say I have created a repository and a first snapshot, snapshotA, containing all indexes at moment A.
Some time later (for example, one hour later) I create a new snapshot, snapshotB, of all the indexes at moment B, which have changed since moment A.
My questions:
Will the size of snapshotB equal the actual size of all indexes and contain all the data at moment B, or will it contain only partial data: the difference between snapshotA and snapshotB?
If the latter, how does Elasticsearch calculate that difference?
If the latter, can we safely delete snapshotA without losing data for snapshotB?
Thanks.

The snapshots are incremental at file level, not document level.
Each shard is a Lucene index, and each Lucene index automatically merges its segments in the background. These segments are the files that are considered for a snapshot.
If at time A your index has 5 segments and by time B 3 of them have merged into a bigger one, the snapshot taken at time B will only add this new segment to the snapshot repository. In the metadata of the snapshot it will record that it needs this file and the 2 other files that were already added when snapshot A was created.
If you use the normal DELETE snapshot API, Elasticsearch will delete only those files that are not needed by any other existing snapshot. In this example, deleting snapshot A removes the 3 segments that were merged into the larger one, and snapshot B remains fully restorable. Any other way of deleting a snapshot is not recommended and could lead to data loss.
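
The bookkeeping described above can be sketched as a toy model. This is not Elasticsearch code: the class `SnapshotRepository`, its methods, and the segment names `seg1` to `seg6` are hypothetical and only illustrate the idea that files are stored once and deleted when no snapshot references them any more.

```python
class SnapshotRepository:
    """Toy model of a repository that stores each segment file only once."""

    def __init__(self):
        self.files = set()      # segment files physically stored in the repository
        self.snapshots = {}     # snapshot name -> set of segment files it references

    def create(self, name, segments):
        """Snapshot the given segments; return only the files that had to be copied."""
        segments = set(segments)
        copied = segments - self.files      # files already stored are not copied again
        self.files |= copied
        self.snapshots[name] = segments
        return copied

    def delete(self, name):
        """Delete a snapshot; return the files that no remaining snapshot references."""
        removed = self.snapshots.pop(name)
        still_needed = set().union(*self.snapshots.values())
        orphaned = removed - still_needed
        self.files -= orphaned
        return orphaned


repo = SnapshotRepository()
copied_a = repo.create("snapshotA", {"seg1", "seg2", "seg3", "seg4", "seg5"})

# Between A and B, seg1..seg3 are merged into the new segment seg6.
copied_b = repo.create("snapshotB", {"seg4", "seg5", "seg6"})
deleted = repo.delete("snapshotA")

print(sorted(copied_b))     # ['seg6']
print(sorted(deleted))      # ['seg1', 'seg2', 'seg3']
print(sorted(repo.files))   # ['seg4', 'seg5', 'seg6']
```

After snapshotA is deleted, every file that snapshotB references is still in the repository, which is why snapshotB can still be restored on its own.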

Related

Elasticsearch snapshot how it works

I want to understand how snapshots work in Elasticsearch.
Case 1
Snapshots are taken every day and snapshots older than 1 month are deleted.
I have an index cities containing, for example, 3 documents
{ barcelona, madrid, urumqi }. Suppose I delete the barcelona document from the index. Once a month has passed and the last snapshot that still contained this document has been deleted, can I no longer recover the document?
Case 2
I have an Elasticsearch cluster with a fairly large number of indexes and a snapshot rotation of 3 months. If, for example, a couple of indexes change or all of them are deleted, and I then restore from a snapshot that was taken 3 months ago, will my cluster be fully restored to its data from 3 months ago? Will the restore process rewrite all the data or not?
If you delete the snapshots that cover an index, you cannot recover any of that index's data. So no, you cannot recover the document.
A restore brings back the data as it was when the snapshot was taken. So yes, you will see the full data from 3 months ago.

Elasticsearch index is taking up too much disk space

I have an index in production with 1 replica (it takes ~1 TB in total). New data arrives in this index all the time (a lot of updates and creates).
When I created a copy of this index by running _reindex (with the same data and also 1 replica), the new index took 600 GB.
It looks like there is a lot of junk and some kind of logs in the original index that could be cleaned up, but I am not sure how to do it.
The questions: how can I clean up the index (without _reindex), why is this happening, and how can I prevent it in the future?
Lucene segment files are immutable, so when you delete or update a document (Lucene can't update a document in place), the old version is only marked as deleted and is not actually removed from disk. ES runs merge operations periodically to "defragment" the data, but you can also trigger a merge manually with _forcemerge (try running it with only_expunge_deletes as well: it might be faster).
Also, make sure your shards are sized correctly and use ILM rollover to keep index size under control.
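
As a minimal sketch of the manual merge mentioned above, the snippet below only builds the HTTP request for `POST /<index>/_forcemerge?only_expunge_deletes=true` without sending it. The helper name `build_forcemerge_request`, the URL `http://localhost:9200`, and the index name `my-index` are assumptions for illustration.

```python
import urllib.parse
import urllib.request


def build_forcemerge_request(base_url, index, only_expunge_deletes=True):
    """Build (but do not send) a POST /<index>/_forcemerge request."""
    query = urllib.parse.urlencode(
        {"only_expunge_deletes": str(only_expunge_deletes).lower()}
    )
    url = f"{base_url.rstrip('/')}/{urllib.parse.quote(index)}/_forcemerge?{query}"
    return urllib.request.Request(url, method="POST")


# Hypothetical cluster address and index name.
req = build_forcemerge_request("http://localhost:9200", "my-index")
print(req.get_method(), req.full_url)

# To actually run it against a cluster you would call:
# urllib.request.urlopen(req)
```

A force merge can be expensive on a large index that is still receiving writes, so it is probably best tried during a quiet period.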

How to maintain all the changes made to Elasticsearch Mapping?

How do people keep track of all the changes made to an Elasticsearch index over time, so that if I have to rebuild the index from scratch to match the existing one, I can do so in minutes? Do people keep logs of all the PUT calls made over time to update the mappings and other settings?
I guess one way is to use snapshots. A snapshot is a backup taken from a running Elasticsearch cluster. You can take a snapshot of an individual index or of the entire cluster and store it in a repository, for example on a shared filesystem. It contains a copy of the on-disk data structures and mappings that make up an index. In addition, when you create a snapshot of an index, Elasticsearch avoids copying any data that is already stored in the repository as part of an earlier snapshot, so you can rebuild or recover an index from scratch to the state of the latest snapshot very quickly.

Is it possible to append (instead of restore) a snapshot of indices?

Suppose we have some indices in our cluster. I can take a snapshot of my favorite index, and I can restore that index to my cluster if it does not exist or is closed. But what if the index currently exists and I need to add/append extra data/documents to it?
Suppose I currently have 100000 documents in my index on my server. I add 100 documents to an index on my local system that has the same name, the same mappings, the same settings, the same number of shards, and so on. Now I want to add those 100 documents to the current index on my server (100000 documents). What is the best way?
In MySQL I can export to CSV or Excel and so on, and it is easy to import or append data to an existing table.
There is no append API for Elasticsearch, but I suggest restoring the indices under temporary names and using the Reindex API to index the local data into the bigger indices, then deleting the temporary indices.
You can also use Logstash for this purpose (reindexing): build a pipeline that reads data from the temporary indices (Elasticsearch input plugin) and writes it to the primary indices (Elasticsearch output plugin).
Note: you can't have two indices with the same name in a cluster.
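
The restore-then-reindex approach could look roughly like the sketch below, which only builds and prints the three requests. The repository name `my_repository`, the snapshot name `my_snapshot`, the index name `my-index`, and the `restored_` prefix are hypothetical; `rename_pattern` and `rename_replacement` are the restore options used to give the restored index a temporary name.

```python
import json


def restore_body(index, temp_prefix="restored_"):
    """Body for POST /_snapshot/<repository>/<snapshot>/_restore."""
    return {
        "indices": index,
        "rename_pattern": "(.+)",                   # match the whole index name
        "rename_replacement": temp_prefix + "$1",   # restore it under a temporary name
    }


def reindex_body(source_index, dest_index):
    """Body for POST /_reindex: copy documents from source into dest."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


index = "my-index"                  # hypothetical index name
temp_index = "restored_" + index

steps = [
    ("POST", "/_snapshot/my_repository/my_snapshot/_restore", restore_body(index)),
    ("POST", "/_reindex", reindex_body(temp_index, index)),
    ("DELETE", "/" + temp_index, None),
]
for method, path, body in steps:
    print(method, path, json.dumps(body) if body else "")
```

With the default reindex settings, a document whose _id already exists in the destination index is overwritten, so this behaves like an append only when the new documents have new ids.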
In addition to the answer by Hamid Bayat:
Is it possible to append (instead of restore) a snapshot of indices?
Snapshots are incremental by nature: a new snapshot only copies data that is not already in the repository. See this and also this. Note that this works at the level of segment files, not documents: if your index has 1000 docs, you snapshot it, and later add 100 more docs, the next snapshot copies only the segment files created or changed since the previous snapshot rather than the whole index again. However, restore is not incremental, i.e. you cannot restore only those recently added 100 docs. If you restore an index, you restore all the docs.
From your description of the question, it seems you are looking for something like: when you add 100 docs to local ES Cluster, you also want those 100 docs to be added in the remote (other) ES Cluster as well. Am I correct?
As for exporting to CSV or Excel, there's an excellent tool called es2csv that lets you export data from ES to CSV. You can then use Kibana to import the CSV data, or use a tool called Elasticsearch_Loader. You might also want to look at another excellent tool called elasticdump.

elastic query returns same results after insert

I'm using elasticsearch.js to move a document from one index to another.
1a) Query index_new for all docs and display on the page.
1b) Use query of index_old to obtain a document by id.
2) Use an insert to index_new, inserting result from index_old.
3) Delete document from index_old (by id).
4) Requery index_new to see all docs (including the new one). However, at this point, it returns the same list of results as returned in 1a. Not including the new document.
Is this because of caching? When I refresh the whole page, and 1a is triggered, the new document is there.. But not without a refresh.
Thanks,
Daniel
This is due to the refresh cycle that runs inside Elasticsearch indexes, per shard and per replica.
Whenever you write to the index, you never write to an existing index file. Instead, new documents go into new, smaller files called segments, which are later merged into bigger files by background jobs. A newly indexed document only becomes visible to search once a refresh has made its segment searchable.
The next question you might have is:
How often does this happen, and how can one control it?
There is an index-level setting called refresh_interval. It can take different values depending on the strategy you want to use.
refresh_interval:
-1 : Disables automatic refresh; you trigger refreshes yourself with the _refresh API.
X : A time value such as 1s or 30s. Elasticsearch then refreshes the index every X.
If your indexes have replicas, you might also see result values toggling. This happens because an index has multiple shards and each shard can have multiple copies (a primary and its replicas), and the copies do not refresh at the same moment. A query can be routed to different copies of a shard on successive requests, so it may see different states within that time window.
Hence, if you rely on a periodic refresh interval of X seconds, expect a consistent state within X to 2X seconds at most.
Refresh background details
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/indices-update-settings.html
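
The visibility rule behind the behaviour in the question can be sketched with a toy model. This is not how Elasticsearch is implemented: `ToyShard` and its methods are hypothetical and only show that an indexed document is not returned by search until a refresh happens. In a real cluster, the closest equivalents are the `refresh` query parameter on the index request (`true` or `wait_for`) or a call to `POST /index_new/_refresh` before step 4.

```python
class ToyShard:
    """Toy model: indexed documents become searchable only after a refresh."""

    def __init__(self):
        self.buffer = {}        # indexed, but not yet visible to search
        self.searchable = {}    # visible to search

    def index(self, doc_id, doc, refresh=False):
        self.buffer[doc_id] = doc
        if refresh:             # mirrors the idea of ?refresh=true on an index request
            self.refresh()

    def refresh(self):
        """Make everything indexed so far visible to search."""
        self.searchable.update(self.buffer)
        self.buffer.clear()

    def search(self):
        return sorted(self.searchable)


shard = ToyShard()
shard.index("1", {"city": "madrid"}, refresh=True)
before = shard.search()

shard.index("2", {"city": "urumqi"})    # no refresh yet
stale = shard.search()                  # same result as before the insert

shard.refresh()                         # periodic refresh, or an explicit _refresh call
fresh = shard.search()

print(before, stale, fresh)             # ['1'] ['1'] ['1', '2']
```

Forcing a refresh on every write is costly under heavy indexing, so `wait_for` or a single explicit refresh after a batch of writes is usually the gentler choice.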
