How to maintain all the changes made to Elasticsearch Mapping? - elasticsearch

How do people keep track of all the changes made to an Elasticsearch index over time, so that if I have to rebuild the index from scratch to match the existing one, I can do so in minutes? Do people keep logs of all the PUT calls made over time to update the mappings and other settings?

I guess one way is to use snapshots. A snapshot is a backup taken from a running Elasticsearch cluster or index. You can take a snapshot of an individual index or of the entire cluster and store it in a repository on a shared filesystem. It contains a copy of the on-disk data structures and mappings that make up an index. Besides that, when you create a snapshot of an index, Elasticsearch avoids copying any data that is already stored in the repository as part of an earlier snapshot, so you can rebuild or recover an index from scratch up to the latest snapshot very quickly.
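For illustration, here is a minimal sketch of registering a shared-filesystem repository and taking a snapshot; the repository name my_backup, the mount path, and the snapshot name are placeholders, and the location must also be listed under path.repo in elasticsearch.yml:
curl -XPUT "http://127.0.0.1:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d '
{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups/my_backup" }
}'
curl -XPUT "http://127.0.0.1:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
The second call waits until the snapshot finishes; later snapshots into the same repository only copy segment files that are not already there.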

Related

Elasticsearch restore indices from data-node snapshot files only

I was upgrading from Elasticsearch 7.10 to 8.4. I wanted to make a Filesystem snapshot, copy the data, install a new version and restore the data from the snapshot files I created earlier.
I have a setup with two node roles: master and data.
I didn't know that, in such a setup, when Elastic is making a Filesystem snapshot, it'll create a structure with raw indices on the data node, something like this:
indices/
  8wPAc89lSrqFunOTSkShSQ/
    0/
      __LHqdmaHLQU6WWpJVlqFY4w
      index-AXVMDc2DQZyBZihEeGOM9g
      snap-7Mv54vkoRjS9YLLgSaokDw.dat
      ...
  I25vR794SZmFJ3TvjF3d-Q/
    0/
      __-f2Sb1onSlaj9XSAhc84LQ
      index-sc-iDaI7TRGX0BKg7Mzk2w
      snap-7Mv54vkoRjS9YLLgSaokDw.dat
and a structure with some metadata on the master node, like this:
index-0
index.latest
indices/
  I25vR794SZmFJ3TvjF3d-Q/
    0/
      meta-oHtfvYQBIjpWMF5xqR1L.dat
      meta-7Mv54vkoRjS9YLLgSaokDw.dat
      snap-7Mv54vkoRjS9YLLgSaokDw.dat
When I was copying the files, I only copied the ones from the data node (not knowing that Elasticsearch is also writing metadata information to the master node). So I now have raw indices data without metadata information for it.
I wanted to re-create some of the metadata myself (index-0 is a JSON file with some mappings), but there are also some encoded files for each snapshot, so I assume they are probably calculated control hashes and my approach might not work.
Is there a way to restore all these indices in Elasticsearch without the metadata information?
Unfortunately, I don't think it's possible to rebuild the metadata without knowing everything that needs to go into it.
Also, between 7.10 and 8.4 there have been significant changes in the index format, so you will probably not be able to get 8.4 to read your 7.10 raw files without issues.
Also note that when upgrading from 7.x to 8.4, you must first upgrade to 7.17 before moving to 8.4.

How to properly delete AWS ElasticSearch index to free disk space

I am using AWS Elasticsearch and publishing data to it from an AWS Kinesis Firehose delivery stream.
In the Kinesis Firehose settings I specified an index rotation period of one month, so every month Firehose creates a new index for me with a month timestamp appended to the name. As I understand it, the old index will still be present; it won't be deleted.
Questions I have:
With a new index being created each month with a different name, do I need to recreate my Kibana dashboards each month?
Do I need to manually delete the old index every month to free disk space?
In order to free disk space, is it enough to just run a curl command to delete the old index?
With a new index being created each month with a different name, do I need to recreate my Kibana dashboards each month?
No. You will need to create an index pattern in Kibana, something like kinesis-*, and then create your visualizations and dashboards against that index pattern.
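If you prefer to script it, here is a minimal sketch using Kibana's saved objects API; the Kibana URL, the kinesis-* pattern, and the @timestamp time field are assumptions about your setup:
curl -XPOST "http://localhost:5601/api/saved_objects/index-pattern" -H 'kbn-xsrf: true' -H 'Content-Type: application/json' -d '
{ "attributes": { "title": "kinesis-*", "timeFieldName": "@timestamp" } }'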
Do I need to manually delete the old index every month to free disk space?
It depends on which version of Elasticsearch you are using. Recent versions have Index Lifecycle Management built into the Kibana UI; if your version does not have it, you will need to do it manually or use Curator, an Elasticsearch Python application that handles these tasks.
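If ILM is available on your cluster, a minimal sketch of a policy that deletes indices 30 days after creation looks like this (the policy name and retention period are placeholders; note that the AWS managed service may expose Index State Management instead of ILM):
curl -XPUT "http://127.0.0.1:9200/_ilm/policy/delete-after-30d" -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}'
You would then attach the policy to your indices (or to an index template) so each new monthly index picks it up automatically.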
In order to free disk space, is it enough to just run a curl command to delete the old index?
Yes, if you delete an index it will free the space used by that index.
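For example, assuming your domain endpoint and an old monthly index named kinesis-2019-08 (both placeholders):
curl -XDELETE "https://your-es-domain-endpoint/kinesis-2019-08"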

ElasticSearch incremental snapshot is ambiguous

The Elasticsearch snapshot/restore documentation states that the index snapshot process is incremental.
Could you please explain what that means and confirm that every snapshot is self-contained in terms of restoration?
Use case:
Let's say I have created a repository and a first snapshot, snapshotA, containing all indexes at moment A.
Some time later (for example one hour later) I create a new snapshot, snapshotB, of all the indexes at moment B, which have changed since moment A.
My questions:
Will the size of snapshotB be equal to the actual size of all the indexes and contain all the data at moment B, or will it contain only partial data: the difference between snapshotA and snapshotB?
If the latter, how does Elasticsearch calculate that difference?
If the latter, can we safely delete snapshotA without losing the data for snapshotB?
Thanks.
The snapshots are incremental at file level, not document level.
Each shard is a Lucene index and each Lucene index is performing automatic segments merging in the background. These segments are the files that are considered for a snapshot.
If at time A your index has 5 segments and by the time B 3 of them have merged into a bigger one, the snapshot taken at time B will only add this new segment in the snapshots repository. And in the metadata of the snapshot it will record that it needs this file and the 2 other files that were already added when snapshot A was created.
If you use the normal DELETE snapshot API, Elasticsearch will delete only those files that are not needed by any other existing snapshot. In this example, ES would delete the 3 segments that were merged into the larger one. Any other way of deleting a snapshot is not recommended and could lead to data loss.
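A minimal sketch of the flow described above, assuming a repository named my_backup is already registered:
# take the two snapshots an hour apart
curl -XPUT "http://127.0.0.1:9200/_snapshot/my_backup/snapshotA?wait_for_completion=true"
curl -XPUT "http://127.0.0.1:9200/_snapshot/my_backup/snapshotB?wait_for_completion=true"
# later, delete snapshotA; segment files still referenced by snapshotB remain in the repository
curl -XDELETE "http://127.0.0.1:9200/_snapshot/my_backup/snapshotA"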

Backing up, Deleting, Restoring Elasticsearch Indexes By Index Folder

Most of the ElasticSearch documentation discusses working with the indexes through the REST API - is there any reason I can't simply move or delete index folders from the disk?
You can move data around on disk, to a point.
If Elasticsearch is running, it is never a good idea to move or delete the index folders, because Elasticsearch will not know what happened to the data, and you will get all kinds of FileNotFoundExceptions in the logs as well as indices that stay red until you manually delete them.
If Elasticsearch is not running, you can move index folders to another node (for instance, if you were decommissioning a node permanently and needed to get the data off). However, if you delete the folder or move it to a place where Elasticsearch cannot see it when the service is restarted, Elasticsearch will be unhappy. This is because Elasticsearch writes what is known as the cluster state to disk, and the indices are recorded in this cluster state, so if ES starts up and expects to find index "foo" but you have deleted the "foo" index directory, the index will stay in a red state until it is deleted through the REST API.
Because of this, I would recommend that if you want to move or delete individual index folders from disk, you use the REST API whenever possible, as it's easy to get ES into an unhappy state if you delete a folder where it expects to find an index.
EDIT: I should mention that it's safe to copy (for backups) an index folder, from the perspective of Elasticsearch, because copying doesn't modify the contents of the folder. Sometimes people do this to perform backups outside of the snapshot & restore API.
I use this procedure: I close the index, back it up, then delete it.
curl -XPOST "http://127.0.0.1:9200/<index_name>/_close"
After this point all index data is on disk and in a consistent state, and no writes are possible. I copy the directory where the index is stored and then delete the index:
curl -XDELETE "http://127.0.0.1:9200/<index_name>"
By closing the index, Elasticsearch stops all access to the index. Then I send a command to delete the index (and all the corresponding files on disk).

Is solr cloud applicable for use case where indexing is offline?

SolrCloud seems to be the suggested method to scale Solr going forward. I understand that legacy scaling methods (like master-slave replication) still exist. My use case with Solr does not have to be near real time (NRT); it is fine if the newly indexed data is visible to searchers after about one day.
In the master-slave (legacy scaling) setup, I could replicate once a day. Do I have an option like this in SolrCloud?
Also, I don't want indexing to impact searcher performance at index time. Is there a way to isolate the indexer from the searcher shards in SolrCloud?
You could skip SolrCloud and just index into a dedicated, separate collection.
Then you bring the new content to each machine individually and do a Core Swap.
Or do a similar thing using aliases to point to the newest core/collection, which also allows you to segment old content and new content into different collections and search them together.
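For the core swap approach, a minimal sketch with the CoreAdmin API; the core names live_core and rebuild_core are placeholders:
curl "http://localhost:8983/solr/admin/cores?action=SWAP&core=live_core&other=rebuild_core"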
I have also used collection aliases in such cases. You can build your index once a day and, when it is ready, simply switch the alias. Here's an example.
At the very beginning you create an index called index_2014_12_01 and point an alias at it. The next day you build index_2014_12_02 and change the alias to point to index_2014_12_02 instead of index_2014_12_01.
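A minimal sketch of the alias switch with the Solr Collections API; the alias name index is an assumption, and the collection names match the example above:
# day 1: point the alias at the first day's collection
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=index&collections=index_2014_12_01"
# day 2: re-running CREATEALIAS with the same name atomically repoints the alias to the new collection
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=index&collections=index_2014_12_02"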
