How to handle Elasticsearch data when it fills up dedicated volume - elasticsearch

I am creating an EFK stack on a k8s cluster. I am using an EFK helm chart described here. This creates two PVC's: one for es-master and one for es-data.
Let's say I allocated 50 Gi for each of these PVC's. When these eventually fill up, my desired behavior is to have new data start overwriting the old data. Then I want the old data stored to, for example, an s3 bucket. How can I configure Elasticsearch to do this?

One easy tool that can help you do that is Elasticsearch Curator:
https://www.elastic.co/guide/en/elasticsearch/client/curator/5.5/actions.html
you can use it to:
Rollover the indices that hold the data, by size/time. This will cause each PVC to hold few indices, based on time.
snapshot the rolled over indices to backup in S3
delete old indices based on their date - delete the oldest indices in order to free up space for the new indices.
Curator can help you do all these.

Related

Setting up a daily partitioned index

I'm looking to setup my index such that it is partitioned into daily sub-indices that I can adjust the individual settings of depending on the age of that index, i.e. >= 30 days old should be moved to slower hardware etc. I am aware I can do this with a lifecycle policy.
What I'm unable to join-the-dots on is how to setup the original index to be partitioned by day. When adding data/querying, do I need to specify the individual daily indicies or is there something in Elasticsearch that will do this for me? If the later, how does it work with adding/querying (assuming they are different?)...how does it determine the partitions that are relevant for the query/partition to add a document to? (I'm assuming there is a timestamp field - but I can't see from the docs how its all linked together)
I'm using the base Elasticsearch OSS v7.7.1 without any plugins installed.
there's no such thing as sub indices or partitions in Elasticsearch. if you want to use ilm, which you should, then you are using aliases and multiple indices
you will need to upgrade from 7.7 - which is EOL - and use the default distribution to get access to ilm as well
getting back to your conceptual questions, https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html and the following few chapters dive into it. but to your questions;
the major assumption of using ilm is that data being ingested is current, so on a rough level, data from today will end up in an index from today
if you are indexing historic data then you may want to put that into "traditional" index names, eg logs-2021.08.09 and then attach them to the ilm policy as per https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html
when querying, Elasticsearch will handle accessing all the indices it needs based on the request it receives. it does this via https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html

Is it possible to append (instead of restore) a snapshot of indices?

Suppose we have some indices in our cluster. I can make a snapshot of my favorite index and I can restore the same index again to my cluster if the same index is not exists or is closed. But what if the index currently exists and I need to add/append extra data/documents to it ?
Suppose I currently have 100000 documents in my index in my server. I create/add 100 documents to my index in my local system which has the same name, the same mappings and the same settings, the same number of shards and . . ., now I want to add 100 documents to my current index in my server (100000 documents) . What is the best way ?
In MySQL I use export to csv or excel and ... and it is so easy to import or append data to currently existed index.
There is no Append API for Elasticsearch but I suggest to restore indices with temporary name and use Reindex API to index local data to bigger indices. then delete temporary indices.
also you can use Logstash for this purpose (reindex). build a pipeline which read data from temp indices (Elasticsearch input plugin ) and write data to primary indices (Elasticsearch output plugin)
note: you can't have two indices with the same name in cluster.
In addition to answer by Hamid Bayat, :
Is it possible to append (instead of restore) a snapshot of indices?
Snapshots by nature are incremental i.e append-only. See this and also this. Thus, if your index has 1000 docs and you snapshot it and later add 100 more docs, then when you trigger another snapshot, only the recently added 100 docs will be snapshotted and not all the 1100. However, restore is not incremental. I.e. you cannot restore only those recently added 100 docs. If you restore an index, you restore all the docs.
From your description of the question, it seems you are looking for something like: when you add 100 docs to local ES Cluster, you also want those 100 docs to be added in the remote (other) ES Cluster as well. Am I correct?
As for export csv or excel, there's an excellent tool called es2csv that allows to export data from ES to csv. And then you can use Kibana to import the CSV data. Or use this tool called Elasticsearch_Loader. You might also want to look at another excellent tool called elasticdump

How to properly delete AWS ElasticSearch index to free disk space

I am using AWS ElasticSearch, and publishing data to it from AWS Kinesis Firehose delivery stream.
In Kinesis Firehose settings I specified rotation period for ES index as 1 month. Every month Firehose will create new index for me appending month timestamp. As I understand, old index will be still presented, It wouldn’t be deleted.
Questions I have:
With new index being created each month with different name, do I need to recreate my Kibana dashboards each month?
Do I need to manually delete old index every month to clean disk space?
In order to clean disk space, is it enough just to run CURL command to delete the old index?
With new index being created each month with different name, do I need to recreate my Kibana dashboards each month?
No, you will need to create an index pattern on kibana, something like kinesis-*, then you will create your visualizations and dashboards using this index pattern.
Do I need to manually delete old index every month to clean disk space?
It depends of which version of Elasticsearch you are using, the last versions have a Index Lifecycle Management built-in in the Kibana UI, if your version does not have it you will need to do it manually or use curator, an elasticsearch python application to deal with theses tasks.
In order to clean disk space, is it enough just to run CURL command to delete the old index?
Yes, if you delete an index it will free the space used by that index.

Elasticseach & Kibana - best practice for visualizing one year's logs

I am using ElasticSearch with Kibana to store and visualize data from my logs. I know it is customary to use Logstash, but I just use the elasticsearch Rest API and POST new elements to it.
I am trying to look for best practices in terms of how I should manage my indices, given I have about 50k logs per day, and I want to visualize sometimes weekly, sometimes monthly and sometimes yearly data. And also I have no need for more than one node. I don't need a high available cluster.
So I am basically trying to determine:
-How should I store my indexes, by time? Monthly? Weekly? One index for everything?
-What are the disadvantages of a huge index (one index that contains all my data)? Does it mean that the entire index is in memory?
Thank you.
I like to match indexes to the data retention policy. Daily indexes work very well for log files, so you can expire one day's worth after X days of retention.
The fewer indexes/shards you have, the less RAM is used in overhead by Elasticsearch to manage them.
The mapping for a field is frozen when the field is added to the index. With a daily index, I can update the mapping and have it take effect for the new indexes, and wait for the old ones to expire. With a longer-term indexes, you'd probably need to reindex the data, which I always try to avoid.
The settings for shards and replicas are also frozen when you create the index.
You can visualize them in Kibana regardless of how they're stored. Use the #timestamp field as your X-axis and change the "interval" to the period you want.
Using logstash would be important if you wanted to alter your logs at all. We do a lot of normalization and creation of new fields, so it's very helpful. If it's not a requirement for you, you might also look into filebeats, which can write directly to elasticsearch.
Lots to consider...

Is solr cloud applicable for use case where indexing is offline?

Solr cloud seems to be the suggested method to scale solr in future. I understand that legacy scaling methods (like master slave and replication) still exists. My use case with solr does not have to be near real time (NRT). It is fine if the newly indexed data is visible for searchers after about 1 day.
In the master slave (legacy scaling), I could replicate it once a day. In Solr cloud do i have an option like this?
Also i don't want the indexing to impact the searcher performance during index time. Is there a way to isolate the indexer from searcher shards in solr cloud?
You could skip SolrCloud and just index on a dedicate separate collection.
Then, you bring the new content to each machine individually and do a Core Swap.
Or similar thing using Aliases to point to the newest core/collection. Which also allows you to segment old content and new content into different collections and search them together.
I also used collection aliases in such cases. You can build your index once a day and when it is ready you simply change the alias. I'll give an example
At very begining you create index called: index_2014_12_01. This index is aliased by index_2014_12_01. The next day you build index_2014_12_02 and changing the alias now to point index_2014_12_02 instead of index_2014_12_01.

Resources