k8s tasks slowdown with no excess CPU or RAM usage - performance

I have a small virtualised k8s cluster running on top of KVM on 2 physical machines. After deploying Ceph (a storage framework), all the k8s tasks like creating containers or starting containers became insufferably slow, like taking over a minute to get from creating to starting a container.
I checked the nodes for excess CPU or RAM usage; both the worker nodes and the master node are consuming well below half of their assigned resources. I have about 10-20 pods running on each node at the moment.
I am not sure what to google and, given my level of k8s knowledge, am completely out of ideas. If anyone has had a similar experience or could point me in the right direction, it would be much appreciated!

Related

What are the resource requirements to run Logstash in a k8s pod?

I have been running an ELK stack on a Raspberry Pi that hosts a Kubernetes cluster, and I noticed that it didn't have the resources to run all three containers. I read that Kubernetes lets you put limits and requests on CPU and memory, and it got me thinking: what are the minimum requirements? To me, applications are greedy, so is there a way to cut down the requirements for Logstash in order to prioritize resources for Elasticsearch?
Right now, I am running a Raspberry Pi 4 with 4 GB RAM and a 32 GB disk.
If I can put min and max requirements on the containers, it will let me manage the resources better. The thing I noticed, though, is that as far as I can tell there is no guidance on the minimum requirements for the different containers.
https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-managing-compute-resources.html
The above link, I believe, tells me that CPU consumption is greedy, but that the default memory for Elasticsearch and Kibana is 2Gi and 1Gi respectively. It mentions nothing about Logstash, though, or whether there is a minimum requirement for CPU.
I wasn't sure if I should set each ELK container to 1 CPU and 1Gi RAM. I can try it to see if it functions, but since CPU can be throttled down, I am curious what the happy medium would be.
Logstash is not part of the Elastic Cloud, that is why there is no mention of it in the Elastic Cloud on Kubernetes documentation link that you shared.
Logstash is far more CPU bound than memory bound, but how much memory it needs depends entirely on your pipelines.
In Logstash, memory usage depends on the pipelines, the batch size, the filters used, the number of events per second, the queue type, and so on. If you are running a dev or lab environment, I think you can try giving Logstash 1 CPU and 512 MB of RAM and see if it fits your use case.
But I would say that 4 GB is pretty small for a full stack, since you need memory for the applications and still have to leave some for the system.
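To make the "4 GB is pretty small" point concrete, here is a rough memory budget using the numbers from this thread. It is only a sketch: it treats 4 GB as 4096 MiB (the OS will actually see less), the Elasticsearch and Kibana values are the defaults quoted in the question, the Logstash value is the trial figure suggested above, and none of them are official minimums.

```python
# Rough memory budget for a single 4 GB node, in MiB.
# Assumption: 4 GB is treated as 4096 MiB, which overstates usable RAM.
NODE_MEMORY_MIB = 4 * 1024

# Elasticsearch/Kibana: defaults quoted in the question.
# Logstash: the 512 MB trial value suggested in the answer.
requests_mib = {
    "elasticsearch": 2 * 1024,
    "kibana": 1 * 1024,
    "logstash": 512,
}


def headroom_mib(node_mib, requests):
    """Return the memory left over for the OS and Kubernetes components."""
    return node_mib - sum(requests.values())


left = headroom_mib(NODE_MEMORY_MIB, requests_mib)
print(f"requested: {sum(requests_mib.values())} MiB, left for the system: {left} MiB")
```

With these values the three containers request 3584 MiB, which leaves at most 512 MiB for the kubelet, the container runtime and the OS itself.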

Is there a way to force re-election in Apache mesos master quorum?

We have an Apache Mesos master running in HA mode with 3 nodes (each with 4 CPUs and 15 GB of memory). This cluster stops offering resources when the memory gets completely exhausted, which happens every week.
We have more than 200 agents connected to this master and the number is growing, so the long-term solution is to increase the CPU and memory. But until we get bigger VMs, we have to babysit the cluster every day, monitoring the CPU load and memory and restarting the Mesos master service (which forces a re-election) as a precaution.
To avoid this manual effort, we are planning to force a re-election of this cluster at a fixed interval, say every 2 days.
So my question is whether the Mesos master supports forcing a re-election like this. If so, how is it done, is it recommended, and does it have any caveats?
Appreciate your time to answer and help me here!

Mesos master configuration on DC/OS

I am in the process of creating a DC/OS cluster on AWS for running Kafka->Spark->Cassandra workloads.
I would like to know the minimum specification for a master node in a DC/OS environment. I see that DC/OS suggests m3.xlarge instances, but I do not understand why I need 4 processors and 15 GB of RAM when the master only runs the processes described at: https://docs.mesosphere.com/overview/architecture/
-> There is no actual data processing performed by the master.
I would maybe go with m3.large or r3.large instances.
Kindest regards,
Srdjan
DC/OS masters are not used to run any compute-heavy loads, but their memory usage tends to be quite high, and so a large instance is recommended.
There might be a way to use smaller instances and compensate for the missing memory using swap files, but straying from the supplier's recommendations should only be done after careful consideration of the potential consequences.

Running Hadoop in virtual environment

I would like to know whether I should expect problems when running a Hadoop cluster on virtual instead of physical machines.
I'm mostly worried about sharing the same hard drive. I read that I should plan for 1-2 containers per drive, but in my case only one drive will exist. Could that be a problem?
I think it depends on how much memory you are allocating to the containers. Of course, the number of containers will be limited if memory is restricted.
Here are a few points to consider when running a Hadoop cluster in a virtual environment:
Network configuration, in the case of a multi-node cluster
Application performance, obviously
The effect of limited resources on scalability, if you plan to run the cluster on a host with low-end hardware

How to select CPU parameter for Marathon apps ran on Mesos?

I've been playing with Mesos cluster for a little bit, and thinking of utilizing Mesos cluster in our production environment. One problem I can't seem to find an answer to: how to properly schedule long running apps that will have varying load?
Marathon has a "CPUs" property, where you can set the weight for CPU allocation to a particular app. (I'm planning on running Docker containers.) But from what I've read, it is only a weight, not a reservation, allocation, or limitation that I am setting for the app. The app can still use 100% of the CPU on the server if it's the only thing running. The problem is that for long-running apps, resource demands change over time. A web server's load, for example, is directly proportional to the traffic. Since Mesos nevertheless treats this setting as a "reservation" when placing tasks, I am choosing between 2 evils: set it too low, and Mesos may start too many processes on the same host and all of them will suffer, with the host CPU going past 100%. Set it too high, and the CPU will sit idle, because a reservation has been made (or so Mesos thinks) but nothing is using those resources.
How do you approach this problem? Am I missing something in how Mesos and Marathon handle resources?
I was thinking of an ideal way of doing this:
Specify weight for CPU for different apps (on the order of, say, 0.1 through 1), so that when going gets tough, higher priority gets more (as is right now)
Have Mesos slave report "Available LA" with its status (e.g. if 10 minute LA is 2, with 8 CPUs available, report 6 "Available LA")
Configure Marathon to require "Available LA" resource on the slave to schedule a task (e.g. don't start on particular host if Available LA is < 2)
When available LA goes to 0 (due to influx of traffic at the same time as some job was started on the same server before the influx) - have Marathon move jobs to another slave, one that has more "Available LA"
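The "Available LA" figure in the list above is just CPU count minus load average. A minimal sketch of that arithmetic and of the scheduling check, assuming hypothetical helper names (`available_la`, `can_schedule`) that are not part of Mesos or Marathon:

```python
def available_la(cpu_count, load_average):
    """CPUs minus load average, floored at zero (hypothetical metric)."""
    return max(0.0, cpu_count - load_average)


def can_schedule(cpu_count, load_average, required=2.0):
    """True if the host still has at least `required` "Available LA".

    Mirrors the rule "don't start on a host if Available LA is < 2".
    """
    return available_la(cpu_count, load_average) >= required


# The example from the list: LA of 2 on an 8-CPU host leaves 6.
print(available_la(8, 2.0))   # 6.0
print(can_schedule(8, 2.0))   # True
print(can_schedule(8, 6.5))   # False, only 1.5 left
```

On Linux the inputs could come from `os.cpu_count()` and `os.getloadavg()`, though the latter reports 1, 5 and 15 minute averages rather than a 10 minute one.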
Is there a way to achieve any of this?
So far, I gather that I could possibly write a custom isolator module that runs on the slaves and reports this custom metric to the master. Then I could use it in resource negotiation. Is this true?
I wasn't able to find anything on Marathon rescheduling tasks on different nodes if one becomes overloaded. Any suggestions?
As of Mesos 0.23.0 oversubscription is supported. Unfortunately it is not yet implemented in Marathon: https://github.com/mesosphere/marathon/issues/2424
In order to dynamically do allocation, you can use the Mesos slave metrics along with the Marathon HTTP API to scale, for example, as I've done here, in a different context. My colleague Niklas did related work with nibbler, which might also be of help.
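As a rough illustration of that idea (not the code linked above, and not nibbler), the sketch below picks an instance count from per-agent utilisation and builds the Marathon request to apply it. The utilisation fractions are assumed to have been derived from the agents' metrics already; the function names, the 0.5 target and the instance cap are all assumptions. The request uses Marathon's `PUT /v2/apps/{appId}` with an `instances` field, and nothing is actually sent here.

```python
import json
import math
import urllib.request


def desired_instances(current, utilisations, target=0.5, max_instances=20):
    """Scale the instance count in proportion to mean utilisation / target.

    `utilisations` is a list of fractions (0.0-1.0), one per agent running
    the app. How they are derived from agent metrics is left out here.
    """
    if not utilisations:
        return current
    mean = sum(utilisations) / len(utilisations)
    # round() guards against float noise pushing ceil() up by one.
    wanted = math.ceil(round(current * mean / target, 6))
    return max(1, min(max_instances, wanted))


def build_scale_request(marathon_url, app_id, instances):
    """Build (but do not send) the PUT request that sets `instances`."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url=f"{marathon_url}/v2/apps/{app_id.strip('/')}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical Marathon address and app id, for illustration only.
n = desired_instances(current=4, utilisations=[0.75, 0.75])
req = build_scale_request("http://marathon.example:8080", "/web", n)
print(n, req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would apply the change. Note that this scales the app rather than moving tasks off an overloaded host, so it only approximates the rescheduling asked about in the question.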

Resources