Elasticsearch snapshot: how it works

I want to understand how snapshots work in Elasticsearch.
Case 1
Snapshots are taken every day, and snapshots older than 1 month are deleted.
I have an index called cities that contains, for example, 3 documents:
{ barcelona, madrid, urumqi }. Suppose I delete the barcelona document from the index. If a month passes, so that the last snapshot taken before the deletion has itself been deleted, does that mean I can no longer recover this document?
Case 2
I have an Elasticsearch cluster with a fairly large number of indexes, and the snapshot rotation is 3 months. If, for example, a couple of indexes change, or all of them are deleted, and I restore from a snapshot that was taken 3 months ago, will my cluster be fully restored to the data as it was 3 months ago? Will the restore process rewrite all the data or not?

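Case 1: If you delete the snapshots that cover an index, you cannot recover any of the data in that index from them. Once every snapshot taken before the deletion has been removed, nothing in the repository still contains the barcelona document. So no, you cannot recover it.
Case 2: A restore brings back the data as it was at the time the snapshot was taken. So yes, after restoring you will see the full data as it was 3 months ago.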
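The reasoning for case 1 can be sketched as a small retention check. This is only a toy model under stated assumptions, not Elasticsearch behavior: days are plain integers, exactly one snapshot is taken per day, and the document is deleted before that day's snapshot runs. The function name snapshots_containing and its parameters are hypothetical.

```python
def snapshots_containing(deleted_on_day, today, retention_days=30):
    """Days whose retained daily snapshot still contains the deleted document.

    Assumptions (toy model): one snapshot per day, days are integers, and the
    document is deleted before the snapshot of `deleted_on_day` is taken.
    """
    # Snapshots older than the retention window have already been deleted.
    oldest_retained = max(0, today - retention_days + 1)
    retained = range(oldest_retained, today + 1)
    # Only snapshots taken before the deletion still hold the document.
    return [day for day in retained if day < deleted_on_day]


# 10 days after the deletion, snapshots from days 0 to 9 still hold the document.
print(snapshots_containing(deleted_on_day=10, today=20))

# 35 days after the deletion, every retained snapshot was taken after it.
print(snapshots_containing(deleted_on_day=10, today=45))  # []
```

An empty list means no retained snapshot predates the deletion, which is the situation described in case 1.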

Related

ElasticSearch deletes documents in an index automatically

I have configured an ELK cluster with 5 nodes, one being the master and the others slaves.
I index logs into the cluster once a day using Logstash. I use a cron job (script) to copy
the log files to the configured Logstash directory. I have also manually set a .sincedb path for Logstash.
However, something odd happens. Roughly every 3 days, the index seems to lose documents, with everything prior to a certain date being deleted. I haven't configured any ILM policy, nor is there any script performing a delete by query or deleting the whole index. When I call _cat/indices formatted to show the creation date of the index, I see that it was created almost 2 weeks ago. However, the documents that should be there for those 2 weeks are gone, and today the index only had documents from the last 3 days.
Does anyone know why this behaviour could be happening or what could trigger it?

How to check the index is used for searching or indexing

I have a lot of Elasticsearch clusters that hold historical indices (more than 10 years old). Some of these indices have been recreated with the latest settings and fields, but the old ones were not deleted.
Now I need to delete the old indices that are not receiving any search or index requests.
I have already looked at Elasticsearch Curator, but it does not work with older versions of ES.
Is there any API that simply gives the time of the last index and search request in ES? That would serve my purpose very well.
EDIT: I've also checked https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html but it doesn't give the last time an indexing or search request came in either. All it gives is the number of these requests since the last restart.

ElasticSearch incremental snapshot is ambiguous

Elasticsearch snapshot/restore doc states that the index snapshot process is incremental.
Could you please explain what that means, and confirm that every snapshot is autonomous in terms of restoration?
Use case:
Let's say I have created a repository and a first snapshot, snapshotA, containing all indexes at moment A.
Some time later (for example one hour later) I create a new snapshot, snapshotB, of all the indexes at moment B that have changed since moment A.
There are three questions:
Will the size of snapshotB be equal to the actual size of all indexes and contain all the data at moment B, or will it contain only partial data: the difference between snapshotA and snapshotB?
If the latter, how does Elasticsearch calculate that difference?
If the latter, can we safely delete snapshotA without losing the data for snapshotB?
Thanks.
The snapshots are incremental at file level, not document level.
Each shard is a Lucene index, and each Lucene index performs automatic segment merging in the background. These segments are the files that are considered for a snapshot.
If at time A your index has 5 segments and by time B 3 of them have merged into a bigger one, the snapshot taken at time B will only add this new segment to the snapshot repository. In the metadata of the snapshot it will record that it needs this file and the 2 other files that were already added when snapshot A was created.
If you use the normal DELETE snapshot API, Elasticsearch will delete only those files that are not needed by any other existing snapshot. In this example, deleting snapshot A makes ES delete the 3 segments that were merged into the larger one. Any other way of deleting a snapshot is not recommended and could lead to data loss.
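The file-level bookkeeping described above can be illustrated with a toy model. This is a simplified sketch, not Elasticsearch's real repository format: the class ToyRepository and the segment names s1 to s6 are made up, and a snapshot is reduced to the set of segment file names it references.

```python
class ToyRepository:
    """Toy model of a snapshot repository that is incremental at file level."""

    def __init__(self):
        self.files = set()    # segment files physically stored in the repository
        self.snapshots = {}   # snapshot name -> set of segment files it references

    def create_snapshot(self, name, live_segments):
        live = set(live_segments)
        uploaded = live - self.files   # only files not yet in the repository are copied
        self.files |= uploaded
        self.snapshots[name] = live    # metadata: every file this snapshot needs
        return uploaded

    def delete_snapshot(self, name):
        removed = self.snapshots.pop(name)
        still_needed = set().union(*self.snapshots.values())
        garbage = removed - still_needed   # files no remaining snapshot references
        self.files -= garbage
        return garbage


repo = ToyRepository()
repo.create_snapshot("A", ["s1", "s2", "s3", "s4", "s5"])
# By time B, s1, s2 and s3 have merged into the new segment s6.
print(repo.create_snapshot("B", ["s4", "s5", "s6"]))  # {'s6'}
# Deleting A removes only the three merged-away segments; B stays restorable.
print(sorted(repo.delete_snapshot("A")))              # ['s1', 's2', 's3']
print(sorted(repo.files))                             # ['s4', 's5', 's6']
```

In this model snapshot B uploads only one file, yet it remains fully restorable after snapshot A is deleted, because the shared files s4 and s5 are kept as long as any snapshot references them.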

ElasticSearch 5, document time to live, create and update

I'm using ElasticSearch 5 and I need my documents that are older than X days/weeks, or older than a given date, to be deleted automatically. I am not sure _ttl is available in 5, but from what I read Elastic does not recommend it anyway.
I will be updating my documents; only the ones that have not been updated for a defined period need to be deleted.
Any ideas?
If you need to do that for all docs that are older than a date X, then it's definitely better to create one index per period (let's say per day) and then, after X days, simply drop the index.
It's far more efficient than deleting individual documents.
If it depends on a given query, i.e. docs that are older than X days and match XYZ, then add a timestamp field to your docs yourself and run a delete by query call every day.
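Both suggestions can be sketched in a few lines. This is only an illustration under assumptions: the index prefix docs-, the date-suffixed naming scheme, and the field name updated_at are hypothetical, and nothing is sent to a cluster. The first helper picks the daily indices that are past the retention period; the second builds a request body that could be posted to the _delete_by_query endpoint of an index, using a range query with date math.

```python
from datetime import date, timedelta


def expired_daily_indices(existing, today, keep_days, prefix="docs-"):
    """Daily indices (named like docs-2024-01-31) older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in existing:
        if not name.startswith(prefix):
            continue
        try:
            day = date.fromisoformat(name[len(prefix):])
        except ValueError:
            continue  # suffix is not a date, so the index is not part of the scheme
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


def delete_by_query_body(field, keep_days):
    """Body matching docs whose timestamp field is older than keep_days."""
    return {"query": {"range": {field: {"lt": f"now-{keep_days}d"}}}}


names = ["docs-2024-01-01", "docs-2024-01-20", "docs-2024-02-01", "other-2024-01-01"]
print(expired_daily_indices(names, today=date(2024, 2, 10), keep_days=30))
# ['docs-2024-01-01']
print(delete_by_query_body("updated_at", 30))
# {'query': {'range': {'updated_at': {'lt': 'now-30d'}}}}
```

For documents that are updated over time, as in the question, the timestamp approach probably fits better: if the field is refreshed on every update, only documents that have gone untouched for the whole period match the query.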

Apache Lucene / Elasticsearch snapshot restore with merge

I have successfully snapshotted and restored data multiple times in ElasticSearch (ES) using its APIs. But now I want to merge two snapshots, in ES or directly in Lucene, to restore a 'larger' chunk of data.
Details:
I take weekly snapshots of my data, and as soon as restoration is done I delete the index, so essentially the workflow looks like this:
Create index abc
Snapshot index abc
Delete index abc
-----
Create index abc (again)
Snapshot index abc
Delete index abc
I have looked around and it seems there is no way to do that, but those posts are a year old, so I wanted to reach out to the community again.
Also, if not in ElasticSearch, is there a way to do this in Lucene directly and then configure ES to use the 'new combined' index for restoration?
My language of choice for development is Python, so I am looking into PyLucene as well, but I haven't explored it much yet.
