How can I deploy Elasticsearch on K8S with autoscaling enabled? - elasticsearch

I am planning to deploy Elasticsearch to K8S. I deployed a StatefulSet with replicas: 3 in its configuration. That means 3 pods will be deployed, and I am hoping these 3 pods will work as data nodes for the Elasticsearch cluster.
But I got this error: [1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured.
It means I need to specify the IP or name of all three hosts in the ES configuration. But I'd like to enable autoscaling on the ES cluster, which means I don't know all the node names when I deploy. I'd like the ES cluster to pick up new nodes automatically. How can I achieve that?
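One common approach, offered here as a sketch rather than a confirmed answer from this thread: StatefulSet pods get stable names of the form <statefulset-name>-<ordinal>, and a headless Service gives them a shared DNS name. discovery.seed_hosts can then point at the Service name instead of individual IPs, while cluster.initial_master_nodes only has to list the pods that bootstrap the cluster the first time. The names es-cluster, es-discovery and default below are hypothetical, and the sketch assumes each node's node.name is set to its pod name and that the initial pods are master-eligible.

```python
def discovery_settings(statefulset: str, headless_service: str,
                       namespace: str, initial_replicas: int) -> dict:
    """Build discovery settings for an Elasticsearch 7+ StatefulSet (illustrative)."""
    # StatefulSet pods are named <statefulset>-0, <statefulset>-1, ...
    initial_masters = [f"{statefulset}-{i}" for i in range(initial_replicas)]
    # The headless Service name resolves to the pod IPs behind it, so nodes
    # added later by scaling can find the cluster through the same name.
    seed_host = f"{headless_service}.{namespace}.svc"
    return {
        "discovery.seed_hosts": [seed_host],
        "cluster.initial_master_nodes": initial_masters,
    }


if __name__ == "__main__":
    settings = discovery_settings("es-cluster", "es-discovery", "default", 3)
    for key, value in settings.items():
        print(f"{key}: {value}")
```

With this layout, scaling the StatefulSet up does not require editing the configuration, because new pods look up the same Service name to find existing nodes.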

Related

Kubernetes - Apply pod affinity rule to live deployment

I guess I am just asking for confirmation really, as we have had some major issues in the past with our Elasticsearch cluster on Kubernetes.
Is it fine to add a pod affinity rule to an already running deployment? This is a live production Elasticsearch cluster and I want to pin the Elasticsearch pods to specific nodes with large storage.
I kind of understand Kubernetes but not really Elasticsearch, so I don't want to cause any production issues or outages, as there is no one around who could really help to fix them.
We are currently running 6 replicas but want to reduce to 3 that run on 3 worker nodes with plenty of storage.
I have labelled my 3 worker nodes with the label 'priority-elastic-node=true'.
This is the podAffinity I will add to my YAML file and apply:
podAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: priority-elastic-node
        operator: In
        values:
        - "true"
    topologyKey: "kubernetes.io/hostname"
What I assume will happen is that nothing changes right after I apply it, but when I start scaling down the Elasticsearch replicas, the remaining pods stay on the preferred worker nodes.
Any change to the pod template will cause the deployment to roll all pods. That includes a change to those fields. So it’s fine to change, but your cluster will be restarted. This should be fine as long as your replication settings are cromulent.
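One point worth checking before applying the rule: podAffinity label selectors match labels on other pods, whereas labels placed on worker nodes are matched by nodeAffinity. Below is a minimal sketch of a node affinity stanza for the label from the question, built as a Python dictionary and printed as JSON. The function name and the weight of 100 are assumptions made for illustration.

```python
import json


def preferred_node_affinity(label_key: str, label_value: str, weight: int = 100) -> dict:
    """Return an affinity stanza that prefers nodes carrying the given label."""
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    # Weight ranges from 1 to 100; higher means a stronger preference.
                    "weight": weight,
                    "preference": {
                        "matchExpressions": [
                            {"key": label_key, "operator": "In", "values": [label_value]}
                        ]
                    },
                }
            ]
        }
    }


if __name__ == "__main__":
    affinity = preferred_node_affinity("priority-elastic-node", "true")
    print(json.dumps({"affinity": affinity}, indent=2))
```

Because the rule is only preferred, the scheduler may still place pods on other nodes; requiredDuringSchedulingIgnoredDuringExecution is the strict variant.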

How to change cluster IP in a replication controller run time

I am using Kubernetes 1.0.3 with a master and 5 minion nodes deployed.
I have an Elasticsearch application that is deployed on 3 nodes using a replication controller, and a service is defined.
Now I have added a new minion node to the cluster and want to run the Elasticsearch container on the new node.
I am scaling my replication controller to 4 so that, based on the node label, the Elasticsearch container is deployed on the new node. Below is my issue; please let me know if there is any solution.
The cluster IP defined in the RC is wrong, as it is not the same as in the service.yaml file. Now when I scale the RC, the new node gets an ES container pointing to the wrong cluster IP, so the new node does not join the ES cluster. Is there any way I can modify the cluster IP of the deployed RC so that when I scale the RC the image is deployed on the new node with the correct cluster IP?
Since I am using an old version I don't have the kubectl edit command, and I tried changing it with the kubectl patch command but the IP didn't change.
The problem is that I need to do this on a production cluster, so I can't delete the existing pods. The only option is to change the cluster IP of the deployed RC and then scale, so that the new pod takes the new IP and the image is started accordingly.
Please let me know if there is any way I can do this.
Kubernetes creates that (virtual) ClusterIP for every service.
Whatever you defined in your service definition (which you should have posted along with your question) is being ignored by Kubernetes, if I recall correctly.
I don't quite understand the issue with scaling, but basically, you want to point at the service name (resolved by Kubernetes's internal DNS) rather than the ClusterIP.
E.g., http://myelasticsearchservice instead of http://1.2.3.4
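As a rough illustration of addressing the service by name, the helper below builds the in-cluster DNS URL for a Service. The function name is hypothetical, and the defaults are assumptions: the default namespace, port 9200 (Elasticsearch's usual HTTP port) and the cluster.local cluster domain, which is a common default but can differ per cluster.

```python
def service_url(service: str, namespace: str = "default", port: int = 9200,
                cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS URL for a Kubernetes Service."""
    # Form: <service>.<namespace>.svc.<cluster-domain>
    return f"http://{service}.{namespace}.svc.{cluster_domain}:{port}"


if __name__ == "__main__":
    print(service_url("myelasticsearchservice"))
```

Pods in the same namespace can usually use the short name alone, as in the example above; the fully qualified form is only needed across namespaces.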

Elasticsearch in production with kubernetes

I am working on a product in which we use Elasticsearch for search. Our production setup is on K8S (1.7.7) and we are able to scale it pretty well. The only thing I am not sure about is whether we should host Elasticsearch in K8S (it can go on a dedicated host as well, using node label selectors) or whether it is advisable to host Elasticsearch on a VM rather than in Docker.
Our data set size is 2-3 GB and will grow further, but this is the benchmark we can consider.
The Elasticsearch cluster I am planning to have is: 3 master (with 2 as eligible master), one client node, and one data node. We can scale the data node and client node as data increases.
Has anyone done this before? Thanks in advance.
IMO the best resource for Elasticsearch on Kubernetes is https://github.com/pires/kubernetes-elasticsearch-cluster
Note that while there are official Docker containers, no official solution for orchestration is being provided at the moment. This is currently covered by the community only.
3 master (with 2 as eligible master)
This doesn't make much sense. You'll want 3 master eligible nodes with the setting discovery.zen.minimum_master_nodes: 2 and one of the 3 nodes will be the actual master.
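The value 2 follows the usual quorum rule for discovery.zen.minimum_master_nodes: a majority of the master-eligible nodes, i.e. (N / 2) + 1 with integer division. A small sketch of that arithmetic (the function name is only illustrative):

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Return the quorum (strict majority) for a number of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    # Integer division, then add one: 3 -> 2, 5 -> 3.
    return master_eligible // 2 + 1


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, "master-eligible nodes -> minimum_master_nodes =", minimum_master_nodes(n))
```

With 3 master-eligible nodes the cluster can lose one of them and still elect a master, while a partition holding only one node cannot form its own cluster.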

How to sync up two ElasticSearch clusters

I need to set up a replicated ES clusterII in data centerII. ES clusterII just needs to stay in sync with ES clusterI, which is in data centerI. So far my idea is to store a snapshot in clusterII and restore it there in order to sync up with clusterI. But this approach introduces some delay. Is there a better way, please?
The ability to cluster is a concept baked into ElasticSearch. However it was not designed to be scaled across datacenters because this involves network latency, but it can do it.
The idea behind ElasticSearch is to have a highly-available cluster that replicates shards within itself (i.e. a replica level of 2 in a cluster means that you have 2 copies of the data across your cluster). This means one cluster alone is its own backup.
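To make the copy counting concrete: in Elasticsearch's index settings, number_of_replicas counts the extra copies beyond the primary, so each shard exists 1 + number_of_replicas times. A small sketch of that arithmetic, where the function name and the example values are illustrative assumptions:

```python
def shard_copies(number_of_shards: int, number_of_replicas: int) -> dict:
    """Count shard copies for an index with the given settings."""
    # Each primary shard plus its replicas.
    copies_per_shard = 1 + number_of_replicas
    return {
        "copies_per_shard": copies_per_shard,
        "total_shards": number_of_shards * copies_per_shard,
    }


if __name__ == "__main__":
    print(shard_copies(number_of_shards=5, number_of_replicas=1))
```

So two copies of the data correspond to number_of_replicas: 1, and at least two data nodes are needed for the replica to be allocated, since a replica is never placed on the same node as its primary.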
First, if you don't have it configured as a cluster, do so by adding the following to your /etc/elasticsearch/elasticsearch.yml (or wherever you put your config):
/etc/elasticsearch/elasticsearch.yml:
cluster.name: thisismycluster
node.name: ${HOSTNAME}
Alternatively, you can make node.name whatever you want, but it's best to put in your hostname.
You also want to make sure you have the ElasticSearch service bound to a particular address and/or interface, where the interface is probably your best bet because you need a point-to-point link across those datacenters:
/etc/elasticsearch/elasticsearch.yml:
network.host: [_tun1_]
You will need to set a list of discovery hosts on every host in the cluster. Hosts that can reach each other through this list and whose cluster.name parameter matches will be discovered and assigned to that cluster. ElasticSearch takes care of the rest, it's magical!
You may add the host by name (only if defined in your /etc/hosts or DNS across your datacenters can resolve it) or IP:
/etc/elasticsearch/elasticsearch.yml:
discovery.zen.ping.unicast.hosts: ["ip1", "ip2", "..."]
Save the config and restart ElasticSearch:
sudo systemctl restart elasticsearch
OR
sudo service elasticsearch restart
If you aren't using systemd (depending on your OS), I would highly suggest using it.
I will tell you though that doing snapshots with ElasticSearch is a terrible idea, and you should avoid it at all costs, because ElasticSearch has already built the mentality of high availability into the application. This is why the application is so powerful and is being heavily adopted by the community and companies alike.

Logstash cluster output to Elasticsearch cluster without multicast

I want to run logstash -> elasticsearch with high availability and cannot find an easy way to achieve it. Please review how I see it and correct me:
Goal:
5 machines each running elasticsearch united into a single cluster.
5 machines each running logstash server and streaming data into elasticsearch cluster.
N machines under monitoring each running lumberjack and streaming data into logstash servers.
Constraint:
It is supposed to be run on PaaS (CoreOS/Docker), so multicast discovery does not work.
Solution:
Lumberjack allows specifying a list of logstash servers to forward data to. Lumberjack will randomly select the target server and switch to another one if this server goes down. It works.
I can use the zookeeper discovery plugin to construct the elasticsearch cluster. It works.
With multi-casting each logstash server discovers and joins the elasticsearch cluster. Without multicasting it allows me to specify a single elasticsearch host. But it is not high availability. I want to output to the cluster, not a single host that can go down.
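The behaviour wanted for the output side mirrors what is described for Lumberjack: pick a target at random and move on to another if it is down. A minimal sketch of that selection logic follows. This is not Lumberjack's or Logstash's actual code; the function name, the hostnames and the is_up health check are hypothetical.

```python
import random
from typing import Callable, Optional, Sequence


def pick_server(servers: Sequence[str], is_up: Callable[[str], bool],
                rng: Optional[random.Random] = None) -> str:
    """Try servers in random order and return the first one that is reachable."""
    rng = rng or random.Random()
    candidates = list(servers)
    rng.shuffle(candidates)  # random order spreads load across healthy servers
    for server in candidates:
        if is_up(server):
            return server
    raise ConnectionError("no server is reachable")


if __name__ == "__main__":
    hosts = ["es1:9200", "es2:9200", "es3:9200"]  # hypothetical hosts
    # Pretend es1 is down; the function returns one of the other two.
    print(pick_server(hosts, is_up=lambda host: host != "es1:9200"))
```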
Question:
Is it realistic to add a zookeeper discovery plugin to logstash's embedded elasticsearch? How?
Is there an easier (natural) solution for this problem?
Thanks!
You could potentially run a separate (non-embedded) Elasticsearch instance within the Logstash container, but configure Elasticsearch not to store data, maybe set these as the master nodes.
node.data: false
node.master: true
You could then add your Zookeeper plugin to all Elasticsearch instances so they form the cluster.
Logstash then logs over HTTP to the local Elasticsearch, which works out where in the 5 data-storing nodes to actually index the data.
Alternatively, this question explains how to get plugins working with the embedded version of Elasticsearch: "Logstash output to Elasticsearch on AWS EC2".
