Best practices for data storage with Elasticsearch and Kubernetes

After reading some documentation on Persistent Volumes in Kubernetes, I am wondering what the best storage setup would be for running a highly available Elasticsearch cluster. I am not running the typical EFK (or ELK) setup; I am using Elasticsearch as a proper full-text search engine.
I've read the official Elastic documentation, but I find it lacking in clarity. According to "Kubernetes in Action", Chapter 6:
When an application running in a pod needs to persist data to disk and
have that same data available even when the pod is rescheduled to
another node, you can’t use any of the volume types we’ve mentioned so
far. Because this data needs to be accessible from any cluster node,
it must be stored on some type of network-attached storage (NAS).
So, if I am not mistaken, I need a Volume that I access through a PersistentVolume and a PersistentVolumeClaim with a Retain reclaim policy.
When looking at the official Volumes documentation, I get the feeling that one has to define the volume type oneself. However, the DigitalOcean guide does not seem to set up any Volume at all.
I picked that tutorial, but there are dozens on Medium that all do the same thing.
So: which is the best setup for an Elasticsearch cluster? Of course, I must not lose any data within an index, and I need to be able to add pods (Kubernetes) or nodes (Elasticsearch) that can access the index.

A good pattern for deploying an Elasticsearch cluster in Kubernetes is to define a StatefulSet.
Because a StatefulSet runs more than one Pod replica, you cannot simply reference a single persistent volume claim. Instead, you add a persistent volume claim template to the StatefulSet definition, so that each replica gets its own claim.
For these per-replica persistent volumes to work, you need dynamic volume provisioning, configured through a StorageClass, which allows storage volumes to be created on demand.
In the DigitalOcean guide, the persistent volume claim template is as follows:
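As a rough illustration of what such a StorageClass can look like, here is a minimal sketch. The class name `es-block-retain` and the provisioner `csi.example.com` are placeholders, not values from the guide; the provisioner must be replaced with the driver your platform actually provides. The `Retain` reclaim policy is included because the question asks about it.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: es-block-retain          # hypothetical name
provisioner: csi.example.com     # placeholder: use your platform's provisioner
reclaimPolicy: Retain            # keep the underlying volume when the claim is deleted
allowVolumeExpansion: true       # only effective if the driver supports resizing
```

With `Retain`, a released volume is not deleted automatically, so cleaning it up (or reusing its data) becomes a manual step.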
volumeClaimTemplates:
- metadata:
    name: data
    labels:
      app: elasticsearch
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: do-block-storage
    resources:
      requests:
        storage: 100Gi
Here, the StorageClass is do-block-storage. You can replace it with your own storage class.
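To show where that template sits, here is a trimmed StatefulSet sketch. The StatefulSet name `es-cluster`, the headless Service name `elasticsearch`, the replica count and the image tag are assumptions for illustration; Elasticsearch discovery settings, resource limits and init containers are left out, so this is not a complete production manifest.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es-cluster                 # hypothetical name
spec:
  serviceName: elasticsearch       # assumes a headless Service with this name exists
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0  # assumed tag
        volumeMounts:
        - name: data               # must match the claim template name below
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: do-block-storage
      resources:
        requests:
          storage: 100Gi
```

Each replica gets its own claim, named after the template, the StatefulSet and the ordinal (here `data-es-cluster-0`, `data-es-cluster-1`, `data-es-cluster-2`), and a rescheduled Pod is bound to the same claim again.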

Very interesting question,
Think of an Elasticsearch node as equivalent to an Elasticsearch Pod in Kubernetes.
Kubernetes needs to keep track of each Pod's identity so that it can reattach the correct PersistentVolumeClaim after an outage. This is where the StatefulSet comes in.
A StatefulSet will ensure the same PersistentVolumeClaim stays bound to the same Pod throughout its lifetime.
A PersistentVolume (PV) is a Kubernetes abstraction for storage on the provided hardware. This can be AWS EBS, DigitalOcean Volumes, etc.
I'd recommend having a look at the official Elasticsearch Helm chart: https://github.com/elastic/helm-charts/tree/master/elasticsearch
Also Elasticsearch Operator: https://operatorhub.io/operator/elastic-cloud-eck

Related

Kubernetes - Apply pod affinity rule to live deployment

I guess I am just asking for confirmation, really, as we have had some major issues in the past with our Elasticsearch cluster on Kubernetes.
Is it fine to add a pod affinity rule to an already running deployment? This is a live production Elasticsearch cluster, and I want to pin the Elasticsearch pods to specific nodes with large storage.
I kind of understand Kubernetes but not really Elasticsearch, so I don't want to cause any production issues or outages, as there is no one around who could really help to fix them.
We are currently running 6 replicas, but I want to reduce that to 3, running on 3 worker nodes with plenty of storage.
I have labelled my 3 worker nodes with the label 'priority-elastic-node=true'.
This is the podAffinity I will add to my YAML file and apply:
podAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: priority-elastic-node
        operator: In
        values:
        - "true"
    topologyKey: "kubernetes.io/hostname"
What I assume will happen is that nothing changes when I apply it, but when I start scaling down the Elasticsearch replicas, the remaining Elasticsearch nodes will stay on the preferred worker nodes.
Any change to the pod template will cause the deployment to roll all pods. That includes a change to those fields. So it’s fine to change, but your cluster will be restarted. This should be fine as long as your replication settings are cromulent.
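One caveat about the snippet in the question, offered as a sketch rather than a definitive fix: podAffinity selects on the labels of other pods, whereas the label priority-elastic-node=true was applied to nodes, which is what nodeAffinity matches. Preferred terms also need a weight. Assuming the goal is to favour the three labelled workers, the fragment under the pod template's `spec` might look like this:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                      # 1-100; higher means stronger preference
      preference:
        matchExpressions:
        - key: priority-elastic-node   # node label from the question
          operator: In
          values:
          - "true"
```

This is a soft preference: the scheduler may still place pods elsewhere if the labelled nodes are full. For a hard constraint, `requiredDuringSchedulingIgnoredDuringExecution` with `nodeSelectorTerms` is the stricter variant.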

Elastic search cluster on Kubernetes Cluster vs VM

I want to set up the Elastic Stack (Elasticsearch, Logstash, Beats and Kibana) to monitor my Kubernetes cluster, which runs on on-prem bare metal. I need recommendations on the following two approaches: which one would be more robust, fault-tolerant and production-grade? Let's say I have a K8s cluster named K8-abc.
Approach 1: Would it be good to set up the Elastic Stack outside the Kubernetes cluster?
In this approach, all the logs from pods running in the kube-system namespace and in user-defined namespaces would be fetched by Beats (running on K8-abc) and put into the ES cluster, which is configured on Linux bare metal, via Logstash (which is also running on VMs). To fetch the Kubernetes node logs, the Beats running on the respective VMs (which participate in forming K8-abc) would fetch the logs and put them into the ES cluster, which is configured on VMs. The thing to note here is that the VMs used to form the ES cluster are not part of K8-abc.
Approach 2: Would it be good to set up the Elastic Stack on the Kubernetes cluster K8-abc itself?
In this approach, all the logs from pods running in the kube-system namespace and in user-defined namespaces would be sent to the Elasticsearch cluster configured on K8-abc via Logstash and Beats (both running on K8-abc). To fetch the K8-abc node logs, the Beats running on the VMs (which participate in forming K8-abc) would put the logs into ES running on K8-abc via Logstash, which is also running on K8-abc.
Can someone help me evaluate the pros and cons of these two approaches? Links to relevant blogs and case studies would also be helpful.
I would be more inclined towards the second solution. It has many advantages over the first one, although it may seem more complex when it comes to the initial setup. You could ask a similar question about migrating any other type of workload to Kubernetes. It has many advantages over VMs. To name just a few:
self-healing cluster,
service discovery and integrated load balancing,
much easier scaling (HPA) in comparison with VMs,
storage orchestration: Kubernetes allows you to automatically mount a storage system of your choice, such as local storage, public cloud providers, and many more, including the Dynamic Volume Provisioning mechanism.
All the above points apply to any other workload and may be seen as Kubernetes advantages in general, so let's look at why to use it for implementing the Elastic Stack:
It looks like Elastic is actively promoting use of Kubernetes on their website. See also this article.
They also provide an official elasticsearch helm chart so it is already quite well supported by Elastic.
There are probably many other reasons in favour of the Kubernetes solution that I didn't mention here. Here you can find a hands-on article about setting up Highly Available and Scalable Elasticsearch on Kubernetes.
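For the log-shipping side of the second approach, a Filebeat configuration along the following lines is a common starting point. This is only a sketch: the Logstash address `logstash.logging.svc:5044` is a hypothetical in-cluster Service name, and `NODE_NAME` is assumed to be injected into the Filebeat pods as an environment variable.

```yaml
filebeat.inputs:
- type: container
  paths:
    - /var/log/containers/*.log        # container logs on each node
  processors:
    - add_kubernetes_metadata:         # enrich events with pod and namespace info
        host: ${NODE_NAME}
        matchers:
        - logs_path:
            logs_path: "/var/log/containers/"

output.logstash:
  hosts: ["logstash.logging.svc:5044"] # hypothetical Service address
```

Filebeat would typically run as a DaemonSet so that one instance collects the logs on every node of K8-abc.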

How to attach storage volume with elasticsearch nodes in kubernetes?

I am setting up Elasticsearch on Kubernetes. I have created an Elasticsearch cluster of 2 nodes, and I want to attach storage to both of them: for example, 80Gi to the first node and 100Gi to the second node.
My Kubernetes cluster is on EC2 and I am using EBS as storage.
In order to attach persistence, you need:
A StorageClass Object (Define the Storage)
A PersistentVolume Object (Provision the Storage)
A PersistentVolumeClaim Object (Attach the storage)
You need these for each Elasticsearch node, and you attach the claim to the pods in the Deployment or Pod object definition.
An easier way is deploying ES cluster using Helm Chart.
As per helm chart documentation:
Automated testing of this chart is currently only run against GKE (Google Kubernetes Engine). If you are using a different Kubernetes provider you will likely need to adjust the storageClassName in the volumeClaimTemplate
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: elast
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4
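Claims against that class, using the two sizes from the question, could be sketched as below. The claim names `es-data-0` and `es-data-1` are hypothetical. With a dynamic provisioner such as the one above, the matching PersistentVolume is created automatically when the claim is made, so it does not have to be written by hand.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: es-data-0            # hypothetical name, for the first node
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: elast
  resources:
    requests:
      storage: 80Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: es-data-1            # hypothetical name, for the second node
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: elast
  resources:
    requests:
      storage: 100Gi
```

Each pod then references its claim through `volumes[].persistentVolumeClaim.claimName`. Note that a single `volumeClaimTemplate` gives every replica the same size, so differently sized volumes need separate claims like these.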
Hope this helps.

How Do I Make A Persistent Volume Accessible to Multiple Kubernetes Pods?

I've done quite a bit of research and have yet to find an answer to this. Here's what I'm trying to accomplish:
I have an ELK stack container running in a pod on a k8s cluster in GCE - the cluster also contains a PersistentVolume (format: ext4) and a PersistentVolumeClaim.
In order to scale the ELK stack to multiple pods/nodes and keep persistent data in ElasticSearch, I either need to have all pods write to the same PV (using the node/index structure of the ES file system), or have some volume logic to scale up/create these PVs/PVCs.
Currently what happens is if I spin up a second pod on the replication controller, it can't mount the PV.
So I'm wondering if I'm going about this the wrong way, and what is the best way to architect this solution to allow for persistent data in ES when my cluster/nodes autoscale.
Persistent Volumes have access semantics. On GCE, I'm assuming you are using a Persistent Disk, which can be mounted either as writable by a single pod or as read-only by multiple pods. If you want multi-writer semantics, you need to set up NFS or some other storage that lets you write from multiple pods.
In case you are interested in running NFS - https://github.com/kubernetes/kubernetes/blob/release-1.2/examples/nfs/README.md
FYI: We are still working on supporting auto-provisioning of PVs as you scale your deployment. As of now it is a manual process.
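To make the access-mode point concrete, here is a sketch of an NFS-backed volume and a claim that several pods could mount read-write. The server address `nfs.example.internal`, the export path and the object names are all hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: es-shared-nfs              # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany                  # multiple pods may mount read-write
  nfs:
    server: nfs.example.internal   # hypothetical NFS server
    path: /exports/elasticsearch   # hypothetical export
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: es-shared
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""             # disable dynamic provisioning for this claim
  volumeName: es-shared-nfs        # bind to the volume defined above
  resources:
    requests:
      storage: 100Gi
```

A GCE Persistent Disk, by contrast, would be declared with `ReadWriteOnce` (or `ReadOnlyMany`), which is why the second pod in the question cannot mount it.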

Does elasticsearch need a persistent storage when deployed on kubernetes?

In the Kubernetes example of an Elasticsearch production deployment, there is a warning about using emptyDir. It advises that the volume should "be adapted according to your storage needs" and links to the documentation on persistent storage in Kubernetes.
Is it better to use a persistent storage, which is an external storage for the node, and so needs (high) I/O over network, or can we deploy a reliable Elasticsearch using multiple data nodes with local emptyDir storage?
Context: We're deploying our Kubernetes on commodity hardware, and we prefer not to use SAN for the storage layer (because it doesn't seem like commodity).
The warning is there so that folks don't assume that emptyDir provides a persistent storage layer. An emptyDir volume will persist as long as the pod is running on the same host. But if the host is replaced or its disk becomes corrupted, then all data would be lost. Using network-mounted storage is one way to work around both of these failure modes. If you want to use replicated storage instead, that works as well.
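For reference, the kind of definition the warning refers to looks roughly like this. The pod name and image tag are assumptions for illustration only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: es-data-ephemeral        # hypothetical name
spec:
  containers:
  - name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0  # assumed tag
    volumeMounts:
    - name: data
      mountPath: /usr/share/elasticsearch/data
  volumes:
  - name: data
    emptyDir: {}                 # node-local scratch space, deleted together with the pod
```

The data lives on the node's local disk and is removed when the pod is deleted or moved to another node, so durability has to come from somewhere else, such as replication.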
