Strategy to persist the node's data for dynamic Elasticsearch clusters

I'm sorry that this is probably a rather broad question, but I haven't found a solution to this problem yet.
I'm trying to run an Elasticsearch cluster on Mesos through Marathon with Docker containers. To that end, I built a Docker image that can start on Marathon and scale dynamically via either the frontend or the API.
This works great for test setups, but the question remains how to persist the data so that if the cluster is scaled down (I know this is also about the index configuration itself) or stopped, I can restart later (or scale up) with the same data.
The thing is that Marathon decides where (on which Mesos slave) the nodes run, so from my point of view it's not predictable whether all the data would be available to the "new" nodes upon restart if I persisted the data to the Docker hosts via Docker volumes.
The only things that come to my mind are:
Using a distributed file system like HDFS or NFS, with volumes mounted either on the Docker host or in the Docker images themselves. That still leaves the question of how to load all the data during startup of the new cluster if the "old" cluster had, for example, 8 nodes and the new one only has 4.
Using the Snapshot API of Elasticsearch to save to a common drive somewhere on the network. I assume this would carry a performance penalty...
Is there any other way to approach this? Are there any recommendations? Unfortunately, I didn't find a good resource on this topic. Thanks a lot in advance.

Elasticsearch and NFS are not the best of pals ;-). You don't want to run your cluster on NFS; it's much too slow, and Elasticsearch works better the faster the storage is. If you introduce the network into this equation you'll get into trouble. I have no idea about Docker or Mesos, but I definitely recommend against NFS. Use snapshot/restore.
The first snapshot will take some time, but subsequent snapshots should take less space and less time. Also, note that "incremental" means incremental at the file level, not the document level.
The snapshot itself needs all the nodes that hold the primaries of the indices you want snapshotted, and those nodes all need access to a common location (the repository) that they can write to. This shared access to the same location is usually not that obvious, which is why I'm mentioning it.
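For illustration, a minimal sketch of that workflow with curl, assuming every node can reach a shared mount at /mnt/es_backups (the repository and snapshot names are placeholders):

# Register a shared filesystem repository; /mnt/es_backups must be mounted
# on every node and whitelisted via path.repo in elasticsearch.yml
curl -X PUT "localhost:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}'

# Take a snapshot of all indices and block until it completes
curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"

# Later, restore into the new (possibly smaller) cluster; Elasticsearch
# redistributes the restored shards across however many nodes are available
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore"

This also answers the 8-nodes-to-4-nodes concern: restore doesn't care how many nodes wrote the snapshot, it just reallocates shards onto the nodes it finds.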

The best way to run Elasticsearch on Mesos is to use a specialized Mesos framework. The first effort in this area is https://github.com/mesosphere/elasticsearch-mesos. There is a more recent project which is, AFAIK, currently under development: https://github.com/mesos/elasticsearch. I don't know what its status is, but you may want to give it a try.

Related

Image pull over multiple K8s nodes

When I create a pod, the corresponding image is pulled to the node where the pod is created.
Can I have those images shared among the cluster nodes instead of being stored locally on each node?
Thanks a lot
Best Regards
It's possible if you have shared storage across all the Kubernetes nodes. However, it's not a good idea 🙅, since the place where images get stored is typically also the place where the container runtime stores its files when it's actually running containers. For example, if you are using Docker, everything gets stored under /var/lib/docker; in the case of containerd it's /var/lib/containerd.
So in summary: it's possible with shared/cluster file systems like NFS, Ceph, GlusterFS, AWS EFS, etc., but in my opinion it's not a good idea 🚫.
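If you want to confirm where your own runtime keeps image layers, a quick check for Docker looks like this:

# Print the Docker data root (typically /var/lib/docker)
docker info --format '{{ .DockerRootDir }}'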
Update (#BMitch):
Make sure that the container storage driver you are using supports the filesystem that you are using.
✌️

Unknown source of daily clean-up of indices

I have two separate Elasticsearch clusters; each Elasticsearch node is a Docker container living in a Docker swarm. I aggregate logs from various microservices into indices, one of which is in the "logs-timestamp" format.
In one cluster I have these indices from previous days; in the other I only have today's.
This affects only the ones in the "logs-timestamp" format.
Do you have any idea, or a pointer to where I can start looking?
Does Elasticsearch have some form of built-in garbage collector?
PS: I didn't start this project, so basically I have quite limited knowledge of the whole infrastructure.
You should check the ILM (index lifecycle management) policies documentation (here), which is one way of automatically removing old indices.
In short, check the result of this command in Kibana:
GET _ilm/policy
It will tell you whether you have any policies configured.
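For reference, a policy that silently deletes old indices looks roughly like this (a sketch created with curl rather than the Kibana console; the policy name and 7-day retention are illustrative), so a delete phase like this is the kind of thing to look for in the output:

# Illustrative ILM policy that deletes indices 7 days after creation/rollover
curl -X PUT "localhost:9200/_ilm/policy/logs-cleanup" -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "7d",
        "actions": { "delete": {} }
      }
    }
  }
}'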
The other way I know of for automatic index curation is Curator (see here and here). You should check whether Curator is installed somewhere in your infrastructure and review its configuration.
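If Curator turns out to be the culprit, its action files look roughly like this (a sketch with illustrative values, assuming your index names embed a date):

# Sketch of a Curator action file deleting week-old "logs-*" indices;
# it would be run as: curator --config config.yml action.yml
cat > action.yml <<'EOF'
actions:
  1:
    action: delete_indices
    description: Delete logs-* indices older than 7 days
    filters:
    - filtertype: pattern
      kind: prefix
      value: logs-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 7
EOF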
Hope it helps.

Closing down amazon EMR temporarily

I wanted to know if I can temporarily shut down my EMR EC2 instances to avoid extra charges. Can I take a snapshot of my cluster and close the EC2 instances temporarily?
You cannot currently terminate/stop your master instance without losing everything on your cluster, including the data in HDFS, but one thing you might be able to do is shrink your core/task node instance groups when you don't need them. You must still keep at least one core instance (or more if you have a lot of data in HDFS that you want to keep), but you can resize your task instance groups down to zero if your cluster is not in use.
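With the AWS CLI, that resize looks something like this (the cluster and instance group IDs are placeholders):

# Find the ID of the task instance group
aws emr list-instance-groups --cluster-id j-XXXXXXXXXXXXX

# Scale the task group down to zero while the cluster is idle
aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=0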
On the other hand, unless you really have something on your cluster that you need to keep, you might just want to terminate your cluster when you no longer need it, then clone it to a new cluster when you require it again.
For instance, if you only ever store your output data in S3 or another service external to your EMR cluster, there is no reason to keep your EMR cluster running while idle and no reason to "snapshot" it, because your cluster is essentially stateless.
If you do have data/configuration stored on your cluster that you don't want to lose, consider moving it off the cluster so that you can shut the cluster down when not in use. (Of course, how you would do this depends on what exactly we're talking about.)
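The terminate-and-recreate cycle is likewise scriptable; a sketch, with all IDs, names, and sizes illustrative:

# Terminate the idle cluster
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX

# Recreate an equivalent cluster later; with output in S3 the cluster itself stays stateless
aws emr create-cluster --name "my-cluster" --release-label emr-5.30.0 \
  --applications Name=Hadoop --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles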

How to set up Hadoop in Docker Swarm?

I would like to be able to start a Hadoop cluster in Docker, distributing the Hadoop nodes to the different physical nodes, using swarm.
I have found the sequenceiq image that lets me run Hadoop in a Docker container, but this doesn't allow me to use multiple nodes. I have also looked at the Cloudbreak project, but it seems to need an OpenStack installation, which feels like overkill, because Swarm alone should be enough to do what we need.
I also found this Stack Overflow question+answer, which relies on Weave, which needs sudo rights that our admin won't give to everyone.
Is there a solution so that starting the hadoop cluster comes down to starting a few containers via swarm?
I cannot give a definitive answer, but if you are looking to set this up without administrator privileges and all answers to this question fail, I fear you might be out of luck.
Consider asking the admin why they don't want to give out sudo access; chances are that you can either allay their doubts, or it will turn out that what you want to do is undesirable.

Running multiple elasticsearch instances

I need to set up 2 Elasticsearch instances:
one for kibana logs (my separate application will throw logs at it)
one for search for my production application
My plan is to create separate folders with Elasticsearch in them. They don't talk to each other, which means they are separate databases, and if one goes down the other still runs. Is this a good solution, or should I use only one Elasticsearch folder with multiple elasticsearch.yml configuration files? What is the best practice for multiple Elasticsearch instances?
The best practice is NOT to run two Elasticsearch instances on the SAME server.
Your production search will probably need a lot of RAM to work fast and stay responsive. You don't want your logging system interfering with that.
