Unknown source of daily cleanup of indices - Elasticsearch

I have two separate Elasticsearch clusters. Each Elasticsearch node is a Docker container running in Docker Swarm. I aggregate logs from various microservices into indices, and one group of them is named in the format "logs-timestamp".
In one cluster I still have those indices from previous days; in the other I only have the one from the current day.
This affects only the indices in the "logs-timestamp" format.
Do you have any idea what could cause this, or a point from which I can start looking?
Does Elasticsearch have some form of built-in garbage collector?
P.S. I didn't start this project, so I have fairly little knowledge of the whole infrastructure.

You should check the documentation on ILM (index lifecycle management) policies (here), which are one way of automatically removing old indices.
In short, check the result of this command in Kibana:
GET _ilm/policy
It will tell you if you have some policy configured.
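If you would rather check this from a script than from Kibana, here is a minimal Python sketch. It assumes the response has the usual shape of GET _ilm/policy (policy name mapped to a "policy" object with "phases"); the policy names in the sample and the base URL passed to fetch_policies are made up, and no authentication is handled.

```python
import json
from urllib.request import urlopen


def policies_that_delete(ilm_response):
    """Return {policy_name: min_age} for every policy that has a delete phase."""
    found = {}
    for name, entry in ilm_response.items():
        phases = entry.get("policy", {}).get("phases", {})
        if "delete" in phases:
            # min_age is optional in a phase, so this may be None.
            found[name] = phases["delete"].get("min_age")
    return found


def fetch_policies(base_url):
    """Call GET _ilm/policy on the cluster at base_url (assumed reachable, no auth)."""
    with urlopen(f"{base_url}/_ilm/policy") as resp:
        return json.load(resp)


# Hand-written sample in the shape of a GET _ilm/policy response (names are hypothetical).
sample = {
    "logs-cleanup": {
        "version": 1,
        "policy": {
            "phases": {
                "hot": {"min_age": "0ms", "actions": {}},
                "delete": {"min_age": "1d", "actions": {"delete": {}}},
            }
        },
    },
    "keep-forever": {
        "version": 1,
        "policy": {"phases": {"hot": {"min_age": "0ms", "actions": {}}}},
    },
}

print(policies_that_delete(sample))
```

Against a real cluster you would call policies_that_delete(fetch_policies("http://localhost:9200")) on each cluster and compare the results. A policy with a short delete phase on only one of the two clusters would explain the difference.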
The other way I know of to curate indices automatically is Curator (see here and here). You should check whether Curator is installed somewhere in your infrastructure and, if so, check its configuration.
Hope it helps.

Related

Can I use a single Elasticsearch/Kibana for multiple k8s clusters?

Do you know of any gotchas or requirements that would prevent using a single ES/Kibana as a target for Fluentd in multiple k8s clusters?
We are rolling out a new Kubernetes model. I have a requirement to run multiple Kubernetes clusters, let's say 4-6. Even though the workload is split across multiple k8s clusters, I have no requirement to split the logging, and I believe it would be easier to find the logs for pods in all clusters in a centralized location. It would also mean less maintenance for Kibana/Elasticsearch.
Using EFK for Kubernetes, can I point Fluentd from multiple k8s clusters at a single Elasticsearch/Kibana? I don't think I'm the first one with this thought, but I haven't been able to find any discussion of doing this. I found lots of discussions of setting up EFK, but all of them only cover a single k8s cluster with its own Elasticsearch/Kibana.
Has anyone else gone down the path of using a single ES/Kibana to service logs from multiple Kubernetes clusters? We'll plunge ahead with testing it out, but I'd like to know whether anyone else has already gone down this road.
I don't think you should create an Elasticsearch instance for each Kubernetes cluster; you can run one main Elasticsearch instance and index all the logs into it.
But even if you don't have an Elasticsearch instance for each Kubernetes cluster, I think you should have a disaster recovery plan (DRP). So instead of moving the logs of all pods to Elasticsearch directly, maybe move them to Kafka and then fan them out to two Elasticsearch clusters.
It also depends heavily on the use case: if every Kubernetes cluster is in a different region and you need the pods' logs with low latency (<1s), then a single Elasticsearch instance may not be the right answer.
Based on [1] we can read:
Fluentd collects logs from pods running on cluster nodes, then routes
them to a centralized Elasticsearch.
Then Elasticsearch ingests these logs from Fluentd and stores them in a central location. It is also used to efficiently search text files.
Kibana is the UI; the user can visualize the collected logs and metrics and create custom dashboards based on queries.
There are several ways to solve your dilemma:
a) Create a centralized dashboard and use each cluster's Elasticsearch as a backend, so you can see all your clusters' logs in one place.
b) Create an Elasticsearch cluster and add each Elasticsearch into it. This is NOT the best option, since you will duplicate your data several times, you will need to handle each index's shards, and you will need to fight with the split-brain problem, but it's great for data resiliency.
c) Use another solution, such as an APM (New Relic, Instana, etc.), to fully centralize your logs in one place.
[1] https://techbeacon.com/enterprise-it/9-top-open-source-tools-monitoring-kubernetes

Is there any way to restore Elasticsearch snapshots apart from using the Elasticsearch restore API?

My company wants to use an existing Elasticsearch snapshot repository (several hundred gigabytes in size) to obtain the original documents and store them elsewhere. I should state that the snapshots were taken using the Elasticsearch snapshot API.
My company is somewhat reluctant to use Elasticsearch to restore the snapshots, as they fear that would involve creating a new Elasticsearch cluster that would consume considerable resources. So far, I have not seen any way to restore the snapshots other than using Elasticsearch, but, given my company's insistence, I ask here: is there any other tool that I could use to restore said snapshots? Thank you in advance for any help resolving this issue.
What I would do in your shoes is spin up a local cluster and restore the existing snapshot into it (here is the relevant Elastic documentation: Restoring to a different cluster). Then, from there, I would export the data either by using the Kibana Reporting plugin (https://www.elastic.co/what-is/kibana-reporting) or by writing a Logstash pipeline to export the data from the local cluster to, say, a CSV file.
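As a rough illustration of the restore step, here is a small Python sketch that builds the restore request for the local cluster. It assumes the existing repository has already been registered on that cluster; the repository name "my_backup", the snapshot name "snapshot_1", the index name and the localhost URL are all placeholders, not values from the question.

```python
import json
from urllib.request import Request, urlopen


def build_restore_request(base_url, repository, snapshot, indices):
    """Build a POST _snapshot/<repository>/<snapshot>/_restore request."""
    body = {
        # Restore only the listed indices and leave the cluster state alone.
        "indices": ",".join(indices),
        "include_global_state": False,
    }
    return Request(
        f"{base_url}/_snapshot/{repository}/{snapshot}/_restore",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON answer (not called here)."""
    with urlopen(request) as resp:
        return json.load(resp)


# Placeholder names; replace them with your repository, snapshot and indices.
req = build_restore_request(
    "http://localhost:9200", "my_backup", "snapshot_1", ["logs-2020.01.01"]
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Restoring a few indices at a time, exporting them, and deleting them before restoring the next batch keeps the disk and memory footprint of the temporary cluster small.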

What's the easiest way of moving Elasticsearch data between servers

I've got Elasticsearch v6.1.0 installed on Windows and CentOS 7 machines. The goal is to migrate data from the Windows machine to the CentOS 7 machine.
Since they both run the same ES version, I simply dragged the "data" folder from machine A to machine B. When I checked the cluster health, its status was red and active_primary_shards was 0, so I reverted the changes I had made.
What other methods are there? Can the snapshot/restore method be used for this purpose? I thought it was for migrating between different versions.
So the question is: what's the best/easiest method for moving data between two servers with the same ES version?
Using snapshot/restore
You can perfectly well use snapshot/restore for this task, as long as you have a shared file system or a single-node cluster. The shared FS should meet the following requirement:
In order to register the shared file system repository it is necessary
to mount the same shared filesystem to the same location on all master
and data nodes.
So it's not a problem if you have a single-node cluster. In this case just take a snapshot and copy it over to the other machine.
It might be a challenging task, though, if you have many nodes running.
In that case you may use one of the supported repository plugins for S3, HDFS and other cloud storage.
The advantage of this approach is that the data and the indices are snapshotted entirely.
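For the single-node case, the sequence of REST calls could look roughly like the Python sketch below. The repository name, snapshot name and both directory paths are placeholders, and the helper only lists the calls rather than sending them. Each location also has to be allowed via path.repo in elasticsearch.yml on the node that uses it.

```python
import json


def snapshot_restore_plan(repository, snapshot, source_location, target_location):
    """Return the REST calls, in order, as (cluster, method, path, body) tuples."""
    def repo_body(location):
        # A shared file system ("fs") repository pointing at a local directory.
        return {"type": "fs", "settings": {"location": location}}

    return [
        # Source machine: register the repository, then take the snapshot.
        ("source", "PUT", f"/_snapshot/{repository}", repo_body(source_location)),
        ("source", "PUT",
         f"/_snapshot/{repository}/{snapshot}?wait_for_completion=true", None),
        # (Copy the repository directory to the target machine at this point.)
        # Target machine: register the copied directory, then restore from it.
        ("target", "PUT", f"/_snapshot/{repository}", repo_body(target_location)),
        ("target", "POST", f"/_snapshot/{repository}/{snapshot}/_restore", None),
    ]


# Placeholder names and paths: a Windows source and a CentOS target.
plan = snapshot_restore_plan(
    "my_backup", "snapshot_1", "C:\\es_backups", "/mnt/es_backups"
)
for cluster, method, path, body in plan:
    print(cluster, method, path, json.dumps(body) if body else "")
```

The two calls that do the actual work are the snapshot and the restore; registering the repository is a one-time setup on each side.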
Using _reindex API
It might be easier to use the _reindex API to transfer data from one ES cluster to another. There is a special Reindex from Remote mode that covers exactly this use case.
What reindex actually does is a scroll on the source index and a lot of bulk inserts to the target index (which can be remote).
There are a couple of issues you should take care of:
- you have to set up the target index yourself (no mappings and no settings will be set by reindex)
- if some fields on the source index are excluded from _source, their contents won't be copied to the target index
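A reindex-from-remote call is sent to the target cluster and names the source cluster in its body. The Python sketch below only builds that body; the host and index names are invented for illustration. The target cluster also has to list the source host under reindex.remote.whitelist in its elasticsearch.yml.

```python
import json


def reindex_from_remote_body(remote_host, source_index, dest_index):
    """Build the JSON body for POST _reindex, pulling from a remote cluster."""
    return {
        "source": {
            # Address of the cluster that currently holds the data.
            "remote": {"host": remote_host},
            "index": source_index,
        },
        # Index on the cluster that receives the POST _reindex call.
        "dest": {"index": dest_index},
    }


# Hypothetical host and index names.
body = reindex_from_remote_body("http://windows-host:9200", "my_index", "my_index")
print("POST /_reindex")
print(json.dumps(body, indent=2))
```

Create the target index with the desired mappings and settings before sending this request, since reindex will not carry them over.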
Summing up
For snapshot/restore
Pros:
- all data and the indices are saved/restored as they are
- 2 calls to the ES API are needed
Cons:
- if the cluster has more than 1 node, you need to set up a shared FS or use some cloud storage
For _reindex
Pros:
- works for a cluster of any size
- data is copied directly (no intermediate storage required)
- 1 call to the ES API is needed
Cons:
- data excluded from _source will be lost
Here's also a similar SO question from some three years ago.
Hope that helps!

Elasticsearch snapshots to S3

I have an Elasticsearch 5.6.2 cluster with one master and two data nodes, and I am using Kibana for visualization. I want to enable automatic snapshots of the cluster to Amazon S3 every 30 minutes. How can I accomplish this? There is no proper documentation. I have also referred to the Curator docs and have a question: do I need to configure Curator on each node?
Please help, guys.
Curator is an external process.
You only need to install it on a single machine. It can be one of the nodes or any other machine.
It will send REST requests to Elasticsearch when needed.
Put it in your crontab and you should be fine.
You can also call the snapshot endpoint yourself from a shell script every 30 minutes and not use Curator at all.
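A rough sketch of the "no Curator" variant, written in Python rather than shell: it assumes an S3 repository has already been registered under the made-up name "s3_backup", that the cluster is reachable at the URL you pass in, and that no authentication is needed. Snapshot names must be lowercase, so the name is built from a timestamp.

```python
from datetime import datetime, timezone
from urllib.request import Request, urlopen


def snapshot_path(repository, now):
    """Return the REST path for a snapshot named after the given time."""
    name = now.strftime("snapshot-%Y%m%d-%H%M")
    return f"/_snapshot/{repository}/{name}"


def take_snapshot(base_url, repository):
    """Send PUT _snapshot/<repository>/<name> to start a snapshot (not called here)."""
    path = snapshot_path(repository, datetime.now(timezone.utc))
    with urlopen(Request(base_url + path, method="PUT")) as resp:
        return resp.status


# Example crontab entry (the script path is hypothetical):
# */30 * * * * python3 /opt/scripts/es_snapshot.py
example = snapshot_path("s3_backup", datetime(2020, 1, 2, 3, 30, tzinfo=timezone.utc))
print(example)
```

In the script run by cron you would call take_snapshot("http://localhost:9200", "s3_backup") on the one machine you chose.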
Elastic Cloud takes a backup every 30 minutes (in case you don't want to manage the cluster yourself and want advanced features such as rolling upgrades, Kibana, security...).

Setting up a single backup node for an elasticsearch cluster?

Given an Elasticsearch cluster with several machines, I want to have a single machine (a special node) located in a different geographical region that can effectively sync with the cluster for read-only purposes (i.e. no writes to the special node, and the special node should be able to handle all queries on its own). Is this possible, and how can it be done?
With Elasticsearch 1.0 (currently available in RC1) you can use the snapshot & restore API; have a look at this blog too to learn more.
You can basically make a snapshot of your indices, then copy the snapshot over to the secondary location and restore it into a different cluster. The nice part is that snapshots are incremental, which means that only the files that have changed since the last snapshot are actually backed up. You can then create snapshots at regular intervals and import them into the secondary cluster.
If you are not using 1.0 yet, I would suggest having a look at it; snapshot & restore is a great addition. You can still make backups and restore them with 0.90, but there is no nice API for that and you need to do pretty much everything manually.

Resources