How to properly delete AWS ElasticSearch index to free disk space - elasticsearch

I am using AWS ElasticSearch, and publishing data to it from AWS Kinesis Firehose delivery stream.
In Kinesis Firehose settings I specified rotation period for ES index as 1 month. Every month Firehose will create new index for me appending month timestamp. As I understand, old index will be still presented, It wouldn’t be deleted.
Questions I have:
With new index being created each month with different name, do I need to recreate my Kibana dashboards each month?
Do I need to manually delete old index every month to clean disk space?
In order to clean disk space, is it enough just to run CURL command to delete the old index?

With new index being created each month with different name, do I need to recreate my Kibana dashboards each month?
No, you will need to create an index pattern on kibana, something like kinesis-*, then you will create your visualizations and dashboards using this index pattern.
Do I need to manually delete old index every month to clean disk space?
It depends of which version of Elasticsearch you are using, the last versions have a Index Lifecycle Management built-in in the Kibana UI, if your version does not have it you will need to do it manually or use curator, an elasticsearch python application to deal with theses tasks.
In order to clean disk space, is it enough just to run CURL command to delete the old index?
Yes, if you delete an index it will free the space used by that index.

Related

How to Customize Index name in ILM(index lifecycle management) of Elastic Search?

I am using ILM (Index Lifecycle Management) of Elastic to Index my live data(Email recieved).
The policy is created to rollover to new index on every 30 days.
The Index template is : WikiEmail-*.
So, Index is getting created every 30 days named as : WikiEmail-000001 and so forth.
Now I have an requirement wherein I need to index historical data(Older Email from past few years).
How do I index the Older data in the monthly index fashion ?
IS there a way we can have cusotmied IndexName in ILM , so that the starting Index name is : WikiEmail-0000099.
In that case , I can index the older document by creating corresponding indices in the Warm Phase named as WikiEmail-0000098 ,WikiEmail-0000097 and likewise.
you will run into issues here as the ILM policy will look at the index creation date when it comes to retention. so your old data may actually be around for longer than more recent data
if you want to have this data accessible under the ILM read alias, then you should index the data into whatever named indices you want, then attach them to that read alias
the only caveat is you will need to manage retention manually for those indices

Setting up a daily partitioned index

I'm looking to setup my index such that it is partitioned into daily sub-indices that I can adjust the individual settings of depending on the age of that index, i.e. >= 30 days old should be moved to slower hardware etc. I am aware I can do this with a lifecycle policy.
What I'm unable to join-the-dots on is how to setup the original index to be partitioned by day. When adding data/querying, do I need to specify the individual daily indicies or is there something in Elasticsearch that will do this for me? If the later, how does it work with adding/querying (assuming they are different?)...how does it determine the partitions that are relevant for the query/partition to add a document to? (I'm assuming there is a timestamp field - but I can't see from the docs how its all linked together)
I'm using the base Elasticsearch OSS v7.7.1 without any plugins installed.
there's no such thing as sub indices or partitions in Elasticsearch. if you want to use ilm, which you should, then you are using aliases and multiple indices
you will need to upgrade from 7.7 - which is EOL - and use the default distribution to get access to ilm as well
getting back to your conceptual questions, https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html and the following few chapters dive into it. but to your questions;
the major assumption of using ilm is that data being ingested is current, so on a rough level, data from today will end up in an index from today
if you are indexing historic data then you may want to put that into "traditional" index names, eg logs-2021.08.09 and then attach them to the ilm policy as per https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html
when querying, Elasticsearch will handle accessing all the indices it needs based on the request it receives. it does this via https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html

Is it possible to append (instead of restore) a snapshot of indices?

Suppose we have some indices in our cluster. I can make a snapshot of my favorite index and I can restore the same index again to my cluster if the same index is not exists or is closed. But what if the index currently exists and I need to add/append extra data/documents to it ?
Suppose I currently have 100000 documents in my index in my server. I create/add 100 documents to my index in my local system which has the same name, the same mappings and the same settings, the same number of shards and . . ., now I want to add 100 documents to my current index in my server (100000 documents) . What is the best way ?
In MySQL I use export to csv or excel and ... and it is so easy to import or append data to currently existed index.
There is no Append API for Elasticsearch but I suggest to restore indices with temporary name and use Reindex API to index local data to bigger indices. then delete temporary indices.
also you can use Logstash for this purpose (reindex). build a pipeline which read data from temp indices (Elasticsearch input plugin ) and write data to primary indices (Elasticsearch output plugin)
note: you can't have two indices with the same name in cluster.
In addition to answer by Hamid Bayat, :
Is it possible to append (instead of restore) a snapshot of indices?
Snapshots by nature are incremental i.e append-only. See this and also this. Thus, if your index has 1000 docs and you snapshot it and later add 100 more docs, then when you trigger another snapshot, only the recently added 100 docs will be snapshotted and not all the 1100. However, restore is not incremental. I.e. you cannot restore only those recently added 100 docs. If you restore an index, you restore all the docs.
From your description of the question, it seems you are looking for something like: when you add 100 docs to local ES Cluster, you also want those 100 docs to be added in the remote (other) ES Cluster as well. Am I correct?
As for export csv or excel, there's an excellent tool called es2csv that allows to export data from ES to csv. And then you can use Kibana to import the CSV data. Or use this tool called Elasticsearch_Loader. You might also want to look at another excellent tool called elasticdump

Is there archive mechanizem for ElasticSearch docs?

I have elastic search container which holds 12 indexes (index per month) when must of the data is history data. I looking for mechanizem which save only the information about the current day and when the day passed It will pass the data to the index of the current month.
What you are looking for is index lifecycle management(ILM) where you can define various policies on the management of your indices, like moving the indices, deleting the indices etc.
More specifically you can look for a automate rollover of ILM which seems to be the use-case of yours.

How to handle Elasticsearch data when it fills up dedicated volume

I am creating an EFK stack on a k8s cluster. I am using an EFK helm chart described here. This creates two PVC's: one for es-master and one for es-data.
Let's say I allocated 50 Gi for each of these PVC's. When these eventually fill up, my desired behavior is to have new data start overwriting the old data. Then I want the old data stored to, for example, an s3 bucket. How can I configure Elasticsearch to do this?
One easy tool that can help you do that is Elasticsearch Curator:
https://www.elastic.co/guide/en/elasticsearch/client/curator/5.5/actions.html
you can use it to:
Rollover the indices that hold the data, by size/time. This will cause each PVC to hold few indices, based on time.
snapshot the rolled over indices to backup in S3
delete old indices based on their date - delete the oldest indices in order to free up space for the new indices.
Curator can help you do all these.

Resources