Does the kubernetes scheduler support anti-affinity? - elasticsearch

I'm looking at deploying Kubernetes on top of a CoreOS cluster, but I think I've run into a deal breaker of sorts.
If I'm using just CoreOS and fleet, I can specify within the unit files that I want certain services to not run on the same physical machine as other services (anti-affinity). This is sort of essential for high availability. But it doesn't look like kubernetes has this functionality yet.
In my specific use-case, I'm going to need to run a few clusters of elasticsearch machines that need to always be available. If, for any reason, kubernetes decides to schedule all of my elasticsearch node containers for a given ES cluster on a single machine, (or even the majority on a single machine), and that machine dies, then my elasticsearch cluster will die with it. That can't be allowed to happen.
It seems like there could be work-arounds. I could set up the resource requirements and machine specs such that only one elasticsearch instance could fit on each machine. Or I could probably use labels in some way to specify that certain elasticsearch containers should go on certain machines. I could also just provision way more machines than necessary, and way more ES nodes than necessary, and assume kubernetes will spread them out enough to be reasonably certain of high availability.
But all of that seems awkward. It's much more elegant from a resource-management standpoint to just specify required hardware and anti-affinity, and let the scheduler optimize from there.
So does Kubernetes support anti-affinity in some way I couldn't find? Or does anyone know if it will any time soon?
Or should I be thinking about this another way? Do I have to write my own scheduler?

Looks like there are a few ways that kubernetes decides how to spread containers, and these are in active development.
Firstly, of course there have to be the necessary resources on any machine for the scheduler to consider bringing up a pod there.
After that, kubernetes spreads pods by replication controller, attempting to keep the different instances created by a given replication controller on different nodes.
It seems like there was recently implemented a method of scheduling that considers services and various other parameters. https://github.com/GoogleCloudPlatform/kubernetes/pull/2906 Though I'm not completely clear on exactly how to use it. Perhaps in coordination with this scheduler config? https://github.com/GoogleCloudPlatform/kubernetes/pull/4674
Probably the most interesting issue to me is that none of these scheduling priorities are considered during scale-down, only scale-up. https://github.com/GoogleCloudPlatform/kubernetes/issues/4301 That's a bit of big deal, it seems like over time you could weird distributions of pods because they stay whereever they are originally placed.
Overall, I think the answer to my question at the moment is that this is an area of kubernetes that is in flux (as to be expected with pre-v1). However, it looks like much of what I need will be done automatically with sufficient nodes, and proper use of replication controllers and services.

Anti-Affinity is when you don’t want certain pods to run on certain nodes. For the present scenario I think TAINTS AND TOLERATIONS can come handy. You can mark nodes with a taint, then ,only those pods which explicitly have tolerance for that taint are going to be allowed to run on that specific node.
Below I am describing how can anti-affinity concept be implemented:
Taint any node you want.
kubectl taint nodes gke-cluster1-default-pool-4db7fabf-726z env=dev:NoSchedule
here, env=dev is the key:value pair or rather the label for this node ,
NoSchedule is the effect which describes not to schedule any pod on this node unless with the specific toleration.
Create a deployment
kubectl run newdep1 --image=nginx
Add the same label as the label to the node to this deployment so that all pods of this deployment are hosted on this node and this node will not host any other pod without the matching label.
kubectl label deployments/newdep1 env=dev
kubectl scale deployments/newdep1 --replicas=5
kubectl get pods -o wide
running this you will see all pods of newdep1 will run on your tainted node.

Related

How to share large files between two microservices in Mesos?

I have a mesos cluster and I need to run two types of microservices, one is producing very large files (might be more than 2GB for file) the other one is analyzing those files. The analyzing microservice is taking more time than the producer service.
After the analysis service is done - the file can be deleted.
I thought of two options:
NFS - producer service creates all files on NFS and the analysis service is taking it directly from the shared folder. (I'm concerned that this approach will consume all internal bandwidth in my cluster)
Local Disk (my preferred) - in this case I need to somehow enforce the analysis micoroservice to run on the same Mesos slave as the producer service that created this specific file. (I'm not sure this approach is possible)
What would be best practice in this case?
I guess this can be implemented in different ways, depending on your requirements:
If you want to be able to handle a host (agent) failure, I think there is no other way than using a shared filesystem such as NFS. Otherwise, if you use Marathon to schedule your Microservices, the task will be restarted on another agent (where the data isn't locally available). Also, you would then need to make sure that the same mount points are available on each agent, and use these as host volumes in your containers. Unfortunately, the POD feature availability for co-locating tasks starts to be available in Mesos 1.1.0 and Marathon 1.4 (not yet finally released), as a side note...
If you don't care about host (agent) failures, then you possible could co-locate the two Microservices on the same agent if you use hostname constraints in Marathon, and mount the host volumes which then can be shared across the services. I guess you'd need some orchestration to only start the analysis service only after the producing service has finished.

Clearing out 'finished' service history in DC/OS

I am currently setting up a POC network of dockerised services using DC/OS. As this is a POC, I have had to make many revisions to the containers in order to get things working.
Consequently, when drilling into some services using the DC/OS (v1.8.7) web UI, I can see hundreds of old tasks - the vast majority of which have a status of 'finished'.
I realise that I can filter out and just see the 'active' containers by clicking on the appropriate tab, but this is not what I am after because I would like to be able to see when a container is staging, and also - all the history for the finished tasks is worthless to me.
How do I purge DC/OS of these finished tasks, as they are clogging up the UI?
Is there a CLI command for this, or do I have to clear out stuff on the master nodes... or is there an handy plug-in that will manage this for me? I've been looking around both on the web and the master nodes, but can't work out what I need to do.
In DC/OS 1.8.x there is no UI or CLI method to influence garbage collection. You can however, with a custom install, influence some parameter, such as gc_delay (default value: 2 days in DC/OS) and others are using the Mesos defaults, like gc_disk_headroom (which is unchanged set to 0.1, which means, Mesos targets to have 10% of the assigned disk as free space).
For parameters you can change at install time see the Install Configuration Parameters docs for more details.

File sync between n web servers in cluster

There are n nodes in a web cluster. Files may be uploaded to any node and then must be distributed to every other node. This distribution does not have to happen in a transaction (in fact it must not, distributed transactions don't scale) and some latency is acceptable, although must be minimal. Conflicts can be resolved arbitrarily (typically last write wins) provided that the resolution is also distributed to all nodes so that eventually all nodes have the same set of files. Nodes can be added and removed dynamically without having to reconfigure existing nodes. There must be no single point of failure and no additional boxes required to solve this (such as RabbitMQ)
I am thinking along the lines of using consul.io for dynamic configuration so that each node can refer to consul to determine what other nodes are available and writing a daemon (Golang) that monitors the relevant folders and communicates with other nodes using ZeroMQ.
Feels like I would be re-inventing the wheel though. This is a common problem and I expect there are solutions available already that I don't know about? Or perhaps my approach is wrong and there is another way to solve this?
Yes, there has been some stuff going on with distributed synchronization lately:
You could use syncthing (open source) or BitTorrent Sync.
Syncthing is node-based, i.e. you add nodes to a cluster and choose which folders to synchronize.
BTSync is folder-based, i.e. you obtain a "secret" for a folder and can synchronize with everyone in the swarm for that folder.
From my experience, BTSync has a better discovery and connectivity, but the whole synchronization process is closed source and nobody really knows what happens. Syncthing is written in go, but sometimes has trouble discovering peers.
Both syncthing and BTSync use LAN discovery via broadcast and a tracker for discovery, AFAIK.
EDIT: Or, if you're really cool, use IPFS to host the latest version, IPNS to "name" that and mount the IPNS on the servers. You can set the IPFS bootstrap list to some of your servers, which would even make you independent of external trackers. :)

How to select CPU parameter for Marathon apps ran on Mesos?

I've been playing with Mesos cluster for a little bit, and thinking of utilizing Mesos cluster in our production environment. One problem I can't seem to find an answer to: how to properly schedule long running apps that will have varying load?
Marathon has "CPUs" property, where you can set weight for CPU allocation to particular app. (I'm planning on running Docker containers) But from what I've read, it is only a weight, not a reservation, allocation, or limitation that I am setting for the app. It can still use 100% of CPU on the server, if it's the only thing that's running. The problem is that for long running apps, resource demands change over time. Web server, for example, is directly proportional to the traffic. Coupled to Mesos treating this setting as a "reservation," I am choosing between 2 evils: set it too low, and it may start too many processes on the same host and all of them will suffer, with host CPU going past 100%. Set it too high, and CPU will go idle, as reservation is made (or so Mesos think), but there is nothing that's using those resources.
How do you approach this problem? Am I missing something in how Mesos and Marathon handle resources?
I was thinking of an ideal way of doing this:
Specify weight for CPU for different apps (on the order of, say, 0.1 through 1), so that when going gets tough, higher priority gets more (as is right now)
Have Mesos slave report "Available LA" with its status (e.g. if 10 minute LA is 2, with 8 CPUs available, report 6 "Available LA")
Configure Marathon to require "Available LA" resource on the slave to schedule a task (e.g. don't start on particular host if Available LA is < 2)
When available LA goes to 0 (due to influx of traffic at the same time as some job was started on the same server before the influx) - have Marathon move jobs to another slave, one that has more "Available LA"
Is there a way to achieve any of this?
So far, I gather that I can possible write a custom isolator module that will run on slaves, and report this custom metric to the master. Then I can use it in resource negotiation. Is this true?
I wasn't able to find anything on Marathon rescheduling tasks on different nodes if one becomes overloaded. Any suggestions?
As of Mesos 0.23.0 oversubscription is supported. Unfortunately it is not yet implemented in Marathon: https://github.com/mesosphere/marathon/issues/2424
In order to dynamically do allocation, you can use the Mesos slave metrics along with the Marathon HTTP API to scale, for example, as I've done here, in a different context. My colleague Niklas did related work with nibbler, which might also be of help.

What keeps the cluster resource manager running?

I would like to use Apache Marathon to manage resources in a clustered product. Mesos and Marathon solves some of the "cluster resource manager" problems for additional components that need to be kept running with HA, failover, etc.
However, there are a number of services that need to be kept running to keep mesos and marathon running (like zookeeper, mesos itself, etc). What can we use to keep those services running with HA, failover, etc?
It seems like solving this across a cluster (managing how many instances of zookeeper, etc, and where they run and how they fail over) is exactly the problem that mesos/marathon are trying to solve.
As the Mesos HA doc explains, you can start multiple Mesos masters and let ZK elect the leader. Then if your leading master fails, you still have at least 2 left to handle things. It is common to use something like systemd to automatically restart the mesos-master on the same host if it's still healthy, or something like Amazon AutoScalingGroups to ensure you always have 3 master machines even if a host dies.
The same can be done for Marathon in its HA mode (on by default if you start multiple instances pointing to the same znode). Many users start these on the same 3 nodes as their Mesos masters, using systemd to restart failed Marathon services, and the same ASG to ensure there are 3 Mesos/Marathon master nodes.
These same 3 nodes are often configured to be the ZK quorum as well, so there are only 3 nodes you have to manage for all these services running outside of Mesos.
Conceivably, you could bootstrap both Mesos-master and Marathon into the cluster as Marathon/Mesos tasks. Spin up a single Mesos+Marathon master to get the cluster started, then create a Mesos-master app in Marathon to launch 2-3 masters as Mesos tasks, and a Marathon-master app in Marathon to launch a couple of HA Marathon instances (as Mesos tasks). Once those are healthy, you can kill the original standalone Mesos/Marathon master and the cluster would failover to the self-hosted Mesos and Marathon masters, which would be automatically restarted elsewhere on the cluster if they failed. Maybe this would work with ZK too. You'd probably need something like Mesos-DNS and/or ELB to let other services find Mesos/Marathon. I doubt anybody's running Mesos this way, but it's crazy enough it just might work!
In order to understand this, I suggest you spend a few minutes reading up on the architecture and the HA part in the official Mesos doc. There, it is clearly explained how HA/failover in Mesos core is handled (which is, BTW, nothing magic—many systems I know of use pretty much exactly this model, incl. HBase, Storm, Kafka, etc.).
Also, note that—naturally—the challenge keeping a handful of the Mesos masters/Zk alive is not directly comparable with keeping potentially 10000s of processes across a cluster alive, evict them or fail them over (in terms of fan out, memory footprint, throughput, etc.).

Resources