Mesos cgroups isolation not killing tasks when limit is reached - mesos

I was testing Mesos cgroups isolation to see what kind of error gets thrown.
I ran the shell program below with Marathon, assigning 1 MB of memory and 1 CPU.
#!/bin/sh
temp=a
while :
do
temp=$temp$temp
echo ${#temp}
sleep 1
done
A single character takes 1 B of space, so the program above should fail once the length of the temp string reaches about 1 MB. But the task seems to get killed at an unpredictable point: sometimes at length 1048576, sometimes at 2097152 or 4194304.
Ideally, since 1 MB is the limit, it should have stopped when the length reached 524288.
Additional info -
Slave is run with --isolation='cgroups/cpu,cgroups/mem'
Mesos version - 0.25

The variance you are seeing can be explained by the following:
The amount of memory taken up by your script is not entirely deterministic, as it depends on the implementation of the shell interpreter as well as the size of your system's shared libraries (i.e. the parts of those libraries loaded into your program's resident set).
A 1 MB task in Mesos is accompanied by 32 MB for the executor. Because the executor requires slightly less than 32 MB, you will have slightly more than 1 MB available for your task.
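One way to see this concretely is to look at the memory limit Mesos actually writes into the container's cgroup on the slave. A minimal sketch, assuming a cgroup v1 hierarchy mounted at /sys/fs/cgroup and the default cgroups root; the exact paths and the container ID are placeholders:
# On the slave, list the per-container memory cgroups created by Mesos
# (path is an assumption: default cgroup v1 mount and default --cgroups_root)
ls /sys/fs/cgroup/memory/mesos/
# Read the hard limit for one container; expect roughly 33 MB
# (1 MB task + 32 MB executor overhead) rather than 1 MB
cat /sys/fs/cgroup/memory/mesos/<container-id>/memory.limit_in_bytes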

Related

GNU parallel: limit CPUs and RAM per Job

I have a command, for example:
cat sample_name.list | parallel -j 5 --max-args=1 --progress --keep-order --results logs --joblog logs.txt echo {1}
I cannot find an option that helps me limit the number of CPUs and the amount of RAM assigned to each job.
According to lscpu I have 12 CPUs and 16 GB of RAM; I want to give each job 2 CPUs and 1 GB of RAM.
Any help? Thanks!
I would use -j 50%.
This will run 6 jobs in parallel on a 12-core machine, so there will be 2 cores per job.
It is unclear what you want to happen if a job uses more than 1 GB.
Maybe --memsuspend 1G is what you are looking for?
This will start suspending jobs when the free memory falls below 2 GB. If there is only 1 GB free, only a single job will be allowed to run. When the jobs free up more memory, the suspended jobs will be resumed.
The idea here is that suspended jobs will be swapped out, thus freeing up memory.
It is particularly useful if your program runs for a long time with low memory usage but needs a lot of memory when finishing up; ideally you want only a single job to be in that finishing-up state at a time.
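Put together with your original pipeline, a sketch of the combined invocation (unchanged apart from the two flags discussed above) might look like:
# ~2 cores per job (6 jobs on 12 cores); suspend jobs when free RAM drops below 2 x 1G
cat sample_name.list | parallel -j 50% --memsuspend 1G --max-args=1 --progress --keep-order --results logs --joblog logs.txt echo {1}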

Docker Container CPU usage Monitoring

As per the Docker documentation, we can get the CPU usage of a Docker container with the docker stats command.
The CPU % column gives the percentage of the host's CPU the container is using.
Let's say I limit the container to 50% of a single host CPU. I can specify a 50% single-core limit with the --cpus=0.5 option, as per https://docs.docker.com/config/containers/resource_constraints/
How can we get the container's CPU% usage relative to its allowed CPU with any docker command?
E.g. out of 50% of a single CPU core, 99% is used.
Is there any way to get it with cadvisor or prometheus?
How can we get the container's CPU% usage relative to its allowed CPU with any docker command? E.g. out of 50% of a single CPU core, 99% is used.
Docker has the docker stats command, which shows CPU/memory usage and a few other stats:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c43f085dea8c foo_test.1.l5haec5oyr36qdjkv82w9q32r 0.00% 11.15MiB / 100MiB 11.15% 7.45kB / 0B 3.29MB / 8.19kB 9
Though it does show memory usage relative to the limit out of the box, there is no such feature for CPU yet. It is possible to solve that with a script that calculates the value on the fly (a sketch follows), but I'd rather choose the second option.
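If you do want the script route, here is a rough sketch. It is a hypothetical helper, not a Docker feature, and it assumes the limit was set with --cpus, which Docker stores as HostConfig.NanoCpus, and that docker stats reports CPU% relative to a single host core:
#!/bin/sh
# Usage: ./cpu_of_limit.sh <container>
c="$1"
# CPU% relative to one host core, e.g. "43.50%"
used=$(docker stats --no-stream --format '{{.CPUPerc}}' "$c" | tr -d '%')
# NanoCpus is the --cpus value * 10^9; 0 means no limit was set
nano=$(docker inspect -f '{{.HostConfig.NanoCpus}}' "$c")
# print usage as a percentage of the allowed CPU (e.g. 87% of a 0.5-core cap)
awk -v u="$used" -v n="$nano" 'BEGIN { if (n > 0) printf "%.1f%%\n", u / (n / 1e9) }'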
Is there any way to get it with cadvisor or prometheus?
Yes, there is:
irate(container_cpu_usage_seconds_total{cpu="total"}[1m])
/ ignoring(cpu)
(container_spec_cpu_quota/container_spec_cpu_period)
The first line is a typical irate function that calculates how much of CPU seconds a container has used. It comes with a label cpu="total", which the second part does not have, and that's why there is ignoring(cpu).
The bottom line calculates how many CPU cores a container is allowed to use. There are two metrics:
container_spec_cpu_quota - the actual quota value. The value is computed as the fraction of CPU cores you've set as the limit, multiplied by container_spec_cpu_period.
container_spec_cpu_period - comes from the CFS scheduler and acts as the unit for the quota value.
I know it may be hard to grasp at first, so allow me to explain with an example:
Consider that you have container_spec_cpu_period set to the default value, which is 100,000 microseconds, and container CPU limit is set to half a core (0.5). In this case:
container_spec_cpu_period 100,000
container_spec_cpu_quota 50,000 # =container_spec_cpu_period*0.5
With CPU limit set to two cores you will have this:
container_spec_cpu_quota 200,000
And so by dividing one by the other we get back the fraction of CPU cores the container is allowed, which is then used in another division to calculate how much of the limit is used.
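Those two values can also be read straight from the container's cgroup, which is where cAdvisor picks them up. A sketch assuming cgroup v1 with the cgroupfs driver (paths differ under cgroup v2 or the systemd driver; the image name is just an example):
docker run -d --name capped --cpus=0.5 nginx
# quota / period = 50000 / 100000 = 0.5 core,
# i.e. container_spec_cpu_quota / container_spec_cpu_period
cat /sys/fs/cgroup/cpu/docker/$(docker inspect -f '{{.Id}}' capped)/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu/docker/$(docker inspect -f '{{.Id}}' capped)/cpu.cfs_period_us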

How to measure performance in Docker?

Is it possible to have performance issues in Docker?
I know VMs, where you have to specify how much RAM you want to use and so on, but I don't know how that works in Docker. It's just running. Will it automatically use the RAM it needs, or how does this work?
Will it automatically use the RAM it needs, or how does this work?
No configuration is needed by default: it will use the minimum memory it needs, up to a limit.
You can use docker stats to see it against a running container:
$ docker stats redis1 redis2
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
redis1 0.07% 796 KB / 64 MB 1.21% 788 B / 648 B 3.568 MB / 512 KB
redis2 0.07% 2.746 MB / 64 MB 4.29% 1.266 KB / 648 B 12.4 MB / 0 B
When you use docker run, you can specify those limits with Runtime constraints on resources.
That includes RAM:
-m, --memory=""
Memory limit (format: <number>[<unit>], where unit = b, k, m or g)
Under normal circumstances, containers can use as much of the memory as needed and are constrained only by the hard limits set with the -m/--memory option.
When memory reservation is set, Docker detects memory contention or low memory and forces containers to restrict their consumption to a reservation limit.
By default, kernel kills processes in a container if an out-of-memory (OOM) error occurs.
To change this behaviour, use the --oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option.
Note: the upcoming (1.10) docker update command might include dynamic memory changes. See docker update.
By default, docker containers are not limited in the amount of resources they can consume from the host. Containers are limited in what permissions / capabilities they have (that's the "container" part).
You should always set constraints on a container: for example, the maximum amount of memory a container is allowed to use, the amount of swap space, and the amount of CPU. Not setting such limits can potentially lead to the host running out of memory and the kernel killing off random processes (OOM kill) to free up memory. "Random" in that case can also mean that the kernel kills your ssh server, or the docker daemon itself.
Read more on constraining resources on a container in Runtime constraints on resources in the manual.
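As a concrete illustration, a sketch of starting a container with explicit memory, swap, and CPU constraints (the values here are arbitrary examples):
# 256 MB hard memory limit, 512 MB memory+swap in total,
# half the default CPU share weight (the default is 1024)
docker run -d -m 256m --memory-swap 512m --cpu-shares 512 redis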

discrepancy between htop and golang readmemstats

My program loads a lot of data at start up and then calls debug.FreeOSMemory() so that any extra space is given back immediately.
loadDataIntoMem()
debug.FreeOSMemory()
After loading into memory, htop shows me the following for the process:
VIRT RES SHR
11.6G 7629M 8000
But a call to runtime.ReadMemStats shows me the following
Alloc 5593336608 5.3G
BuckHashSys 1574016 1.6M
HeapAlloc 5593336610 5.3G
HeapIdle 2607980544 2.5G
HeapInuse 7062446080 6.6G
HeapReleased 2607980544 2.5G
HeapSys 9670426624 9.1G
MCacheInuse 9600 9.4K
MCacheSys 16384 16K
MSpanInuse 106776176 102M
MSpanSys 115785728 111M
OtherSys 25638523 25M
StackInuse 589824 576K
StackSys 589824 576K
Sys 10426738360 9.8G
TotalAlloc 50754542056 48G
Alloc is the amount obtained from the system and not yet freed (this is resident memory, right?), but there is a big difference between the two.
I rely on HeapIdle to kill my program, i.e. if HeapIdle is more than 2 GB, restart. In this case it is 2.5 GB and isn't going down even after a while. Golang should allocate from heap idle memory when it needs more in the future, thus reducing HeapIdle, right?
If assumption 1 is wrong, which stat can accurately tell me what the RES value in htop is?
What can I do to reduce the value of HeapIdle ?
This was tried on go 1.4.2, 1.5.2 and 1.6.beta1
The effective memory consumption of your program will be Sys-HeapReleased. This still won't be exactly what the OS reports, because the OS can choose to allocate memory how it sees fit based on the requests of the program.
If your program runs for any appreciable amount of time, the excess memory will be offered back to the OS so there's no need to call debug.FreeOSMemory(). It's also not the job of the garbage collector to keep memory as low as possible; the goal is to use memory as efficiently as possible. This requires some overhead, and room for future allocations.
If you're having trouble with memory usage, it would be a lot more productive to profile your program and see why you're allocating more than expected, instead of killing your process based on incorrect assumptions about memory.
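If you take the profiling route, a minimal sketch, assuming your program already writes a heap profile (e.g. via runtime/pprof; the binary and profile names are hypothetical):
# text report of the top allocation sites by in-use space
go tool pprof -text -inuse_space ./yourbinary heap.prof
# or by cumulative allocations over the program's lifetime
go tool pprof -text -alloc_space ./yourbinary heap.prof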

Using time command for benchmarking

I'm trying to use the time command as a simple solution for benchmarking some scripts that do a lot of text processing and make a number of network calls. To evaluate if it's a good fit, I tried:
/usr/bin/time -f "\n%E elapsed,\n%U user,\n%S system,\n%P CPU,\n%M max-mem footprint in KB,\n%t avg-mem footprint in KB,\n%K Average total (data+stack+text) memory,\n%F major page faults,\n%I file system inputs by the process,\n%O file system outputs by the process,\n%r socket messages received,\n%s socket messages sent,\n%x status" yum install nmap
and got:
1:35.15 elapsed,
3.17 user,
0.40 system,
3% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
127 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
0 socket messages received,
0 socket messages sent,
0 status
which is not exactly what I was expecting, especially the 0 values. Even when I change the command to, say, ping google.com, the socket messages are 0. What's going on? Is there any alternative?
[And I'm confused whether this should stay here or be posted on Server Fault]
I think it's not working on Linux (I assume you're using Linux since you're installing with yum). The manual page says:
Bugs
Not all resources are measured by all versions of Unix, so some of the values might be reported as zero. The present selection was mostly inspired by the data provided by 4.2 or 4.3BSD.
I tried "wget" on an OSX system (which is BSD-ish) to check if it report socket statistics, and there at least socket works:
0.00 user,
0.01 system,
1% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
0 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
151 socket messages received,
8 socket messages sent,
0 status
Hope that helps,
Alex.
Do not use time to benchmark. Some of the fields of the time command are broken, as explained in [1]. However, the basic functionality of time (real, user, and CPU time) is still intact.
[1] Maximum resident set size does not make sense
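If you only need the portable pieces, a sketch that sticks to the fields that do work reliably on Linux (the script name is hypothetical):
# wall-clock, user, and system time are reported correctly by GNU time
/usr/bin/time -f "%e real, %U user, %S sys" ./process_text.sh
# -v prints the full verbose report, but treat the memory fields with caution (see [1])
/usr/bin/time -v ./process_text.sh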

Resources