Which algorithm does Kubernetes use to assign pods to nodes?

This is more of a cost-estimation question than a question about how to use features like node affinity.
So basically there are m pods with some constraints, like:
- each pod of specific Deployments / StatefulSets should be on a different Kubernetes node
- pods of specific Deployments / StatefulSets should be balanced over 3 availability zones
Now, I want to find out how many nodes (all of the same type) I will need to host a given set of Deployments / StatefulSets.
I first thought of this as more of an assignment problem to be solved using the Hungarian algorithm, but it seems much more complex because of the multi-dimensional constraints.

To my knowledge, the algorithm used by default by the kube-scheduler is described on GitHub here.
It explains how it works. It first filters out nodes that do not meet the requirements of the pod, e.g. resource requests greater than the resources available on the node, affinity rules, etc.
Then it uses a ranking (scoring) algorithm to determine the best-fitting node.
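To turn this into the capacity estimate the question is actually after, a crude lower bound can be worked out by hand before the scheduler is involved at all. Here is a rough sketch in shell, with entirely hypothetical numbers for pod requests, node size, and zone count:

# Hypothetical figures -- substitute your own pod requests and node capacity.
PODS=20                       # total replicas across all Deployments / StatefulSets
CPU_PER_POD=500               # CPU request per pod, in millicores
MEM_PER_POD=1024              # memory request per pod, in MiB
NODE_CPU=3500                 # allocatable millicores per node (after system overhead)
NODE_MEM=13000                # allocatable MiB per node
ZONES=3                       # availability zones to balance over
MAX_ANTI_AFFINITY_REPLICAS=5  # largest replica count that must be one-pod-per-node

# Lower bound from resource requests alone (ceiling division).
CPU_NODES=$(( (PODS * CPU_PER_POD + NODE_CPU - 1) / NODE_CPU ))
MEM_NODES=$(( (PODS * MEM_PER_POD + NODE_MEM - 1) / NODE_MEM ))
NODES=$(( CPU_NODES > MEM_NODES ? CPU_NODES : MEM_NODES ))

# A hard pod anti-affinity rule needs at least one node per replica of that workload.
NODES=$(( NODES > MAX_ANTI_AFFINITY_REPLICAS ? NODES : MAX_ANTI_AFFINITY_REPLICAS ))

# Balancing over zones: round up to a multiple of the zone count.
NODES=$(( (NODES + ZONES - 1) / ZONES * ZONES ))
echo "Estimated minimum node count: $NODES"

The real scheduler bin-packs pod by pod, so treat this as a floor rather than an exact answer.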

Kubernetes assigns pods based on many constraints, such as:
Resource requirements
Resource availability (node capacity)
Node selectors, if any, or affinity rules
Weight of affinity rules
This is a good article on the topic: https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/
Also: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
I would suggest reading: https://kubernetes.io/docs/concepts/scheduling-eviction/
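As a concrete illustration of those constraints, here is a minimal sketch (workload name, image, replica count, and requests are hypothetical) of a Deployment that asks for one pod per node via podAntiAffinity and an even spread across zones via topologySpreadConstraints:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                         # hypothetical workload name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx                # placeholder image
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
      affinity:
        podAntiAffinity:            # hard rule: at most one pod of this app per node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:    # keep replicas balanced across the 3 zones
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
EOF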

In reference to the very good answer from user Harsh Manvar, I will add a bit more information of my own. This topic is covered in the documentation, as described in the previous answer. Beyond that, you can find very good material here:
What Happens When You Create a Pod On a Kubernetes Cluster?
In a matter of seconds, the Pod is up and running on one of the cluster nodes. However, a lot has happened within those seconds. Let’s see:
While scanning the API server (which it does continuously), the Kubernetes Scheduler detects that there is a new Pod without a nodeName parameter. The nodeName is what indicates which node should own this Pod.
The Scheduler selects a suitable node for this Pod and updates the Pod definition with the node name (through the nodeName parameter).
The kubelet on the chosen node is notified that there is a pod that is pending execution.
The kubelet executes the Pod, and the latter starts running on the node.
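You can watch this happen yourself; a quick sketch (the pod name is hypothetical):

# The scheduler fills in spec.nodeName once it has picked a node:
kubectl get pod my-pod -o jsonpath='{.spec.nodeName}'
# The decision is also recorded as an event, typically something like
# "Successfully assigned default/my-pod to <node name>":
kubectl describe pod my-pod | grep -A5 Events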
You can also find a tutorial about Scheduling Process and Scheduler Algorithms.

Related

Elasticsearch cluster setup

I'm currently running a single-node ES instance. As there are some limitations with a single-server setup in ES, and the queries sometimes become pretty slow, I want to upgrade to a full cluster.
The ES instance currently only stores data and is not doing any fancy stuff (transformations, ingest pipelines, ...). All I currently need is a place to store my data and to retrieve it (search + aggregations). There are more reads than writes.
In a lot of forums and blog posts I have read about the "split-brain" issue. To circumvent this, the minimum node count should be 3.
The idea is to keep the number of machines low, because this is a private project and I do not want to also manage a lot of operating systems in my spare time.
The structure I thought about was:
- 1 Coordinator + Voting-only Node
- 2 Master-eligible + Data Nodes
minimum_master_nodes: 2 to circumvent Split-Brains
Send all ES queries to the coordinator, which will then issue the requests to the data nodes and reduce the final results.
My question is: does this make sense? Or is it better to use 3 master-eligible + data nodes?
Online I found no guidance for ES newbies on how to structure a simple cluster.
You are heading in the right direction, and I can see most of your thinking is also right, so don't consider yourself an ES newbie :).
Anyway, as you are going to have 3 nodes in your cluster, why not make all three nodes master-eligible? And why create a dedicated coordinating node when, by default, every ES node works as a coordinating node, and in your small project you won't need a dedicated one? This way you will have a simple configuration; just don't assign any explicit role to any node, as by default every ES node is a master-eligible, data, and coordinating node.
Also, you should invest some time in identifying the slow queries and their causes to make the cluster more performant, rather than adding more resources, especially in a personal project; please refer to my short tips on improving search performance.
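To make that concrete, here is a minimal sketch of the per-node configuration for a 3-node cluster with default roles, assuming Elasticsearch 7.x or later (cluster and host names are placeholders); on 7.x, minimum_master_nodes is no longer used and quorum is handled automatically:

# Run on each node, changing node.name (and using real hostnames/IPs in the lists).
cat > /etc/elasticsearch/elasticsearch.yml <<'EOF'
cluster.name: my-cluster                # same value on all three nodes
node.name: es-node-1                    # unique per node
network.host: 0.0.0.0
discovery.seed_hosts: ["es-node-1", "es-node-2", "es-node-3"]
cluster.initial_master_nodes: ["es-node-1", "es-node-2", "es-node-3"]   # first bootstrap only
# No node.roles setting: each node keeps the defaults (master-eligible, data, coordinating).
EOF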

Are there any downsides to running Elasticsearch on a multi-purpose (i.e. non-dedicated) cluster?

I just set up an Elasticsearch (ES) 3-node cluster using one of GKE's click-to-deploy configurations. Each node is of the n1-standard-4 machine type (4 vCPUs / 15 GB RAM). I have always run ES on clusters dedicated to that single purpose (performance reasons, separation of concerns, making my life easier when debugging machine faults), and currently this GKE cluster is the same.
However, I have a group of batch jobs I would like to port to run on a GKE cluster. Since they update several large files, I would like this to also run on a stateful cluster (just like ES) so I can move updated files to the cloud once a day rather than round-tripping on every run. The batch jobs in question run at 5-minute, 15-minute, or daily frequency for about 18 hours every day.
My question now is, what is the best way to deploy this batch process given the existing ES cluster...
Create an entirely new cluster?
Create another node pool?
Create a separate namespace and increase the cluster's autoscaling?
Some other approach I'm missing?
Note: I'm a few days into using GKE and containerization in general
Based on my knowledge, I would go for another node pool or the autoscaler.
Create an entirely new cluster?
For me that would be overkill just for running the jobs.
Create another node pool?
I would say it's the best option, on par with the autoscaler: create a new node pool just for the jobs, which scales down to 0 when there is nothing left to do.
Create a separate namespace and increase the cluster's autoscaling?
Same as another node pool, but from my point of view, if you would like to do that, then you would have to label your nodes for Elasticsearch so the jobs can't take any resources from them. So, answering your question from the comment:
my question is more about if doing this with autoscaler within the same cluster would in any way affect elasticsearch esp with all the ES specific yaml configs?
It shouldn't. As I said above, you can always label the 3 specific nodes (the default node pool) to work only with Elasticsearch; then nothing else will take their resources, and the cluster will scale up when it needs more resources for the jobs and scale back down to the 3 ES nodes when the jobs finish their 18-hour run.
Also with regards to the 6h node pool doing nothing comment, wouldn't I be able to avoid this on a new cluster or node pool with a minimum scaling parameter of zero?
Based on the GCP documentation, it would work for a node pool, but not for a new cluster.
If you specify a minimum of zero nodes, an idle node pool can scale down completely. However, at least one node must always be available in the cluster to run system Pods.
tl;dr: Go for the autoscaler or another node pool; if you're worried about resources for your ES, label the 3 nodes just for ES.
I hope that answers your question. Let me know if you have any more questions.
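For reference, a minimal sketch of creating such a scale-to-zero node pool (cluster name, zone, machine type, and sizes are hypothetical); the label/taint pair keeps the batch jobs and the ES pods from mixing:

gcloud container node-pools create batch-pool \
  --cluster=es-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --enable-autoscaling --min-nodes=0 --max-nodes=3 \
  --node-labels=workload=batch \
  --node-taints=workload=batch:NoSchedule
# The batch jobs then need a matching toleration plus a nodeSelector on workload=batch,
# so they only land on this pool, which scales back to zero once they finish.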

Is the etcd clustering design suitable for reliable key/value replication applications?

I know this isn't a pure programming question, but I'm developing a distributed appliance management application that has a need for reliably sharing/replicating data across multiple servers/VMs, without the heavy orchestration of a typical database.
Is etcd clustering actually a reliable architecture for key/value replication in a cluster?
This is all supposed to be automated and since I'm starting with Atomic/Core OS, and etcd is already present, I stood up a 4 node cluster with full TLS and some user/role configuration to play with. I went as far as developing a library to read/write/export the data structures my application requires.
The etcd v3 documentation isn't as clear on some of the more revealing points that the v2 documentation was.
I was hoping for peer-to-peer replication of data, assuming connectivity to at least one other node in the cluster. If a node was unable to reach the rest of the cluster, its store would still be available locally, and I could codify how I want my appliance to deal with being "offline" for too long, until it re-establishes its availability.
I'm going to describe some of the limitations I've encountered with clustered etcd, and maybe someone can comment on whether it's the right solution for my job. This isn't a complaint session; I just want to understand if I'm using the wrong solution.
Availability/fail-over is impossible when only 2 nodes are in a cluster: I started with a 2-node configuration and set up a working cluster (e.g. put/get data from one node to the other, replication works, etc.). When one node goes down, the other enters endless leader-election mode. It's the only node left, yet Raft needs at least one other node available to elect a leader. So fail-over needs at least 3 etcd nodes, so that if 1 goes down, the other 2 can still pick a leader. I'm really floored by this limitation. I must be doing something wrong.
Joining a new etcd node to a cluster requires almost all the details of the existing cluster: When joining a new node, the member needs to be added in advance (with its peering connection details); then, on the new node, you have to start etcd with --initial-cluster-state "existing" and --initial-cluster containing exactly all the nodes in the cluster with their names and peer URLs. This is a pretty synchronized procedure, because after you join, everything is stored in the data directory, so subsequent starts don't need any cluster configuration at all. I find this makes adding a node kind of fragile, and kind of unnecessary, since the cluster already knows about you (you had to add it as a member).
During the leader-election process you can't access any node's keys/values: When a node goes down, no get/put requests are serviced. If you only have 2 nodes in the cluster, the election process never ends (see #1), so your application will never get access to that data, resulting in a bit of a cascading failure.
If anyone is interested in testing etcd for this kind of thing, I've included the scripts I used below (I left out the TLS configuration bits). I don't expect anyone to test this for me.
# On the 1st node (I used Centos7 minimal, with etcd installed)
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --reload
export CL_NAME=etcd1
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
export TOKEN=$(date +%s | sha256sum | base64 | head -c 32 ; echo)
# turn on etcdctl v3 api support, why is this not default?!
export ETCDCTL_API=3
# NOTE: --initial-cluster must include this node's own name and peer URL,
# otherwise etcd refuses to start with a name not found in the initial cluster configuration.
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster="$CL_NAME=https://$IP_ADDR:2380" --initial-cluster-state new --initial-cluster-token $TOKEN
# Next come up with a name for the next node, etcd2 then etcd3, etc... Get their hostname/IP and add them as a future member
# Peer URLs must use the same scheme the peers actually listen on (https here).
etcdctl --endpoints="https://127.0.0.1:2379" member add etcd2 --peer-urls="https://<next node's IP address>:2380"
# 1st etcd is now running, with peering available and members
# added for next nodes
# copy the output of "echo $TOKEN" for the next steps where it's needed
# On the 2nd/next node (I used Centos7 minimal, with etcd installed)
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --reload
export CL_NAME=etcd2
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
export ETCDCTL_API=3
export TOKEN=<TOKEN string from above>
# NOTE: because this member was added to an existing cluster via "member add",
# --initial-cluster-state must be "existing", and the peer URL scheme must match
# what the peers actually listen on (https here).
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state existing --initial-cluster-token $TOKEN --initial-cluster="etcd1=https://<IP of 1st node>:2380,etcd2=https://$IP_ADDR:2380"
# NOTE the --initial-cluster ALWAYS MUST have all nodes in the
# cluster, with their names and peer urls, otherwise it won't join
# Here's an example for the 3rd node
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state existing --initial-cluster-token $TOKEN --initial-cluster="etcd1=https://<IP of 1st node>:2380,etcd2=https://<IP of 2nd node>:2380,etcd3=https://<IP of 3rd node>:2380"
When you get down to 2 running nodes (kill one of them), a new leader will be elected right away. When you're down to 1, the entire effort is useless. If you bring 1 back up, it'll all work again.
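A quick way to watch this behaviour while you kill members (endpoints are placeholders) is to ask each endpoint for its view of the leader:

# Shows per-endpoint member ID, leader ID, raft term, etc.; once quorum is lost,
# requests simply fail instead of returning stale data.
etcdctl --endpoints="https://<IP of 1st node>:2379,https://<IP of 2nd node>:2379,https://<IP of 3rd node>:2379" endpoint status --write-out=table
etcdctl --endpoints="https://<IP of 1st node>:2379" endpoint health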
Thank you for your insightful and well-asked question. I'm currently doing a similar investigation. I'm using etcd 3.4, where the API is v3 (not v3beta etc.). I have some partial answers for you at least (formed in part by my own understanding of distributed systems). I'll answer your questions 1 and 3 first, then 2 at the end.
2-node clusters are particularly troublesome where mutable state is concerned, largely because of the 'split-brain' problem: when there is no quorum, neither side (of a network split) has enough knowledge to know it is still authoritative, and neither side can safely write any data because there is no useful (or at least generic) way to reconcile the cluster when quorum returns.
2-node clusters in practice are more tenable for active/passive deployments, such as hosting a floating IP. You can use the likes of Pacemaker to implement HA solutions that coordinate resources such as a floating IP, DRBD block device, filesystem and MariaDB stack (I've done so)... BUT this still comes with the proviso that a 2-node cluster is a theoretical compromise (in the sense that sometimes the risk of split-brain can be accepted).
To make 2-node 'clusters' more robust (to mitigate the risk of split-brain), something else is needed. Here are a few examples:
Pacemaker recommends the use of STONITH (Shoot The Other Node In The Head), which is an example of a 'fencing' mechanism. If the network between the two nodes goes away and both nodes now think they are the sole remaining node, the idea is that each node will try to power off the other (first one wins). This might mean interacting with a power controller in a cluster, or with a hypervisor. As you can see, this requires a certain level of privilege with the supporting infrastructure.
Network appliances do often operate as a 2-node cluster, generally active/passive (it's hard enough at times selling the idea of 2 identical pieces of equipment while only using one at a time). In a physical networking deployment, the HA link may be required to be a point-to-point link, such as a single fibre pair over which heartbeats are sent. In routing protocols such as BGP this is even a requirement, the idea being that if the remote side powers off, the link is seen to go down... There are lots of subtleties here, such as what happens if one fibre in a fibre pair were to be broken or degraded.
I've also seen HA solutions that use some other data to decide which side of a two-node cluster gets to be the leader. Things like (possibly in combination, and in a particular order):
comparison of IP address (which is numerically greater or lesser)
uptime (last observed uptime)
appliance serial number
statically configured weighting
All this concern over quorum and split-brain etc. is related to CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem):
Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
In the event of a network partition, you have to choose between Consistency and Availability. Etcd has made the choice of Consistency, which is good, because etcd is a data store for critical infrastructure. After an event, you really want things to come back up ASAP and without manual, error-prone recovery (i.e. just guessing which side of the cluster should be marked as the authoritative side).
Now, you might be thinking that etcd could at least revert to being able to satisfy a read attempt in the event of a partition; but then it would violate Consistency by risking the delivery of out-of-date information. Etcd also provides other fundamental cluster services, such as making it easier to do leader elections and locks for your applications; so if etcd were to offer a potentially stale response, that would likely break promises and offload a lot of the correctness responsibility onto the clients (so you can bet that would tend not to be done correctly, particularly by anything that consumes its data via some REST/gRPC interface).
This is no different for systems such as ZooKeeper, which makes the requirement for a three- (or five-) node cluster much more overt, along with cautions about performance concerns that might cause heartbeat failures (e.g. due to garbage-collection pauses).
You mention you're on CoreOS, so your application is presumably running in a container. So in terms of HA strategies, you have another option that people often haven't really considered: just restart it automatically. At least in Kubernetes (or OpenShift, which embodies Kubernetes -- and CoreOS now), if you have HA built into your cluster, then when a container or node fails the workload can be rescheduled onto a different node.
So if you don't want to run a three-node etcd service, run a single-node etcd service that can be restarted on failure. That's a viable HA strategy; it's not a load-balancing strategy though.
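A minimal sketch of that idea on Kubernetes (image tag, names, and storage size are placeholders; a matching headless Service and any TLS settings are left out for brevity):

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd-single
spec:
  serviceName: etcd-single           # assumes a headless Service of the same name
  replicas: 1                        # a single member, restarted/rescheduled on failure
  selector:
    matchLabels:
      app: etcd-single
  template:
    metadata:
      labels:
        app: etcd-single
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.4.0          # placeholder tag
        command:
        - etcd
        - --data-dir=/var/lib/etcd
        - --listen-client-urls=http://0.0.0.0:2379
        - --advertise-client-urls=http://etcd-single:2379
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd   # the data survives restarts via the PVC below
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
EOF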
There is also the concept of mirroring, which is supported by etcd; etcdctl has built-in functionality for this (make-mirror). You could use this mirroring functionality to create multiple independent HA clusters (say, for a remote site or availability zone). You should generally never operate a single cluster over a WAN; instead you should replicate a cluster.
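For example, etcdctl can keep a second, independent cluster in sync (endpoints are placeholders):

# Continuously copies keys from the cluster behind --endpoints into the destination cluster.
etcdctl --endpoints="https://<source cluster IP>:2379" make-mirror https://<destination cluster IP>:2379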
You'll find this same sort of pattern for etcd, elasticsearch, kafka, etc.
With regard to your second question, etcd3 does have a cluster bootstrap facility (https://etcd.io/docs/v3.4.0/dev-internal/discovery_protocol/). In short, you can point your node at a shared discovery store and it will obtain all the same information you would otherwise have had to specify yourself.
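A short sketch of that bootstrap flow using the public discovery service (cluster size and addresses are placeholders):

# Ask the discovery service for a token sized for a 3-member cluster.
DISCOVERY_URL=$(curl -s "https://discovery.etcd.io/new?size=3")
# Every member starts with the same --discovery URL instead of an --initial-cluster list.
etcd --name etcd1 --data-dir /var/lib/etcd \
  --initial-advertise-peer-urls http://<this node's IP>:2380 \
  --listen-peer-urls http://<this node's IP>:2380 \
  --advertise-client-urls http://<this node's IP>:2379 \
  --listen-client-urls http://0.0.0.0:2379 \
  --discovery "$DISCOVERY_URL"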
To understand the thinking around why etcd3 doesn't work the same way as, say, Kafka or Elasticsearch cluster membership, you have to think of etcd as providing an infrastructure service -- the base of the pyramid. You don't want too many dependencies (or assumptions about what traffic might be able to pass or what discovery features might be available).
Etcd could in theory use tools such as cloud-init to obtain things like AWS inventory, but that flexibility is devolved to the user. You should perhaps expect it to be embodied in a script or deployment playbook; it is too entangled to do within etcd, doing so would violate separation of concerns, and the expected deployment mechanism should already have all that inventory knowledge.
Kafka and Elasticsearch are kind of interesting to compare here. Kafka gets its cluster membership information from ZooKeeper (which is what etcd was designed to replace, and which solves much the same sort of problem).
Elasticsearch is interesting because it makes clustering remarkably easy. But that's because it uses sharding and replication, and it ends up being quite different in terms of the CAP theorem. That's not to say that it sits in the same space as etcd or ZooKeeper, though. In my experience with Elasticsearch I always specify all the nodes manually in my playbook, but Elasticsearch does have more support for discovering its other nodes in environments such as AWS (cf. https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-hosts-providers.html).
Hope that's useful for you,
Cameron

Elasticsearch one big cluster VS tribe node?

Problem description:
- Multiple machines producing logs.
- On each machine we have Logstash, which filters the log files and sends them to a local Elasticsearch.
- We would like to keep the machines as separate as possible and avoid intercommunication.
- But we would also like to be able to visualize all of these logs with a single Kibana instance.
Approaches:
Make each machine a single-node ES cluster, and have one of the machines act as a tribe node with Kibana installed on it (of course avoiding index conflicts).
Make all machines (nodes) part of a single cluster, with each node writing to a unique index of one shard, statically map each shard to its node, and finally of course have one instance of Kibana for the cluster.
Question:
Which approach is more appropriate for the described scenario in terms of limiting inter-machine communication, cluster management, and maybe other aspects that I haven't thought about?
The tribe node exists precisely because of this kind of requirement, so my advice is to use the tribe node setup (a minimal configuration sketch is below).
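For older Elasticsearch releases that still support tribe nodes (the feature was later deprecated in favour of cross-cluster search), a minimal configuration sketch for the tribe/Kibana machine might look like this (cluster names and hosts are hypothetical):

# Appended to elasticsearch.yml on the tribe/Kibana machine.
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
tribe:
  machine_a:
    cluster.name: logs-machine-a
    discovery.zen.ping.unicast.hosts: ["<machine A IP>"]
  machine_b:
    cluster.name: logs-machine-b
    discovery.zen.ping.unicast.hosts: ["<machine B IP>"]
EOF
# Kibana then points at this tribe node and can read the indices of both clusters,
# provided the index names do not conflict.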
With the second option:
There will be a cluster, but you will not use its benefits (replica shards, shard relocation, query performance, etc.).
The benefits mentioned above will instead become pain points that generate configuration complexity and troubleshooting hell.
Besides shard allocation and node communication, there will be other things to configure that nodes only have when they are in a cluster.

What cluster node should be active?

There is a cluster, and there is a Unix network daemon. This daemon is started on each cluster node, but only one instance can be active.
When the active daemon breaks (whether the program breaks or the node breaks), another node should become active.
I could think of a few possible algorithms, but I think there is already some research on this and some ready-to-go algorithms? Am I right? Can you point me to the answer?
Thanks.
JGroups is a Java network stack which includes DistributedLockManager-type support and cluster voting capabilities. These allow any number of Unix daemons to agree on who should be active. All of the nodes could be trying to obtain a lock (for example), and only one will succeed until the application or the node fails.
JGroups also has the concept of the coordinator of a specific communication channel. Only one node can be coordinator at a time, and when a node fails, another node becomes coordinator. It is simple to test whether you are the coordinator, in which case you would be the active one.
See: http://www.jgroups.org/javadoc/org/jgroups/blocks/DistributedLockManager.html
If you are going to implement this yourself there is a bunch of stuff to keep in mind:
Each node needs to have a consistent view of the cluster.
All nodes will need to inform all of the rest of the nodes that they are online -- maybe with multicast.
Nodes that go offline (because of app or node failure) will need to be removed from all other nodes' "view".
You can then have the node with the lowest IP, or something similar, be the active node (see the toy sketch below).
If this isn't appropriate then you will need to have some sort of voting exchange so the nodes can agree who is active. Something like: http://en.wikipedia.org/wiki/Two-phase_commit_protocol
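As a toy illustration of the "lowest IP wins" rule, assuming every node already agrees on the same membership list (one IP per line in a hypothetical members.txt):

MY_IP=$(hostname -I | awk '{print $1}')
# Numeric sort on each octet so that e.g. 10.0.0.9 sorts before 10.0.0.10.
ACTIVE=$(sort -t. -k1,1n -k2,2n -k3,3n -k4,4n members.txt | head -n1)
if [ "$MY_IP" = "$ACTIVE" ]; then
  echo "this node is active"     # start the daemon here
else
  echo "this node is standby"    # monitor the active node here
fi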
