Docker Container CPU usage Monitoring - performance

As per the Docker documentation, we can get the CPU usage of a container with the docker stats command.
The CPU % column gives the percentage of the host's CPU the container is using.
Let's say I limit the container to 50% of a single host CPU core. I can specify that limit with the --cpus=0.5 option, as per https://docs.docker.com/config/containers/resource_constraints/
How can we get the container's CPU % usage relative to its allowed CPU limit with any docker command?
E.g. out of 50% of a single CPU core, 99% is used.
Is there any way to get it with cadvisor or prometheus?

How can we get the container's CPU % usage relative to its allowed CPU limit with any docker command? E.g. out of 50% of a single CPU core, 99% is used.
Docker has the docker stats command, which shows CPU/memory usage and a few other stats:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c43f085dea8c foo_test.1.l5haec5oyr36qdjkv82w9q32r 0.00% 11.15MiB / 100MiB 11.15% 7.45kB / 0B 3.29MB / 8.19kB 9
Though it does show memory usage relative to the limit out of the box, there is no such feature for CPU yet. It would be possible to solve that with a script that calculates the value on the fly, but I'd rather choose the second option.
Is there any way to get it with cadvisor or prometheus?
Yes, there is:
irate(container_cpu_usage_seconds_total{cpu="total"}[1m])
/ ignoring(cpu)
(container_spec_cpu_quota/container_spec_cpu_period)
The first line is a typical irate expression that calculates how many CPU-seconds per second a container has used. It carries the label cpu="total", which the second part does not have, and that's why there is ignoring(cpu).
The bottom line calculates how many CPU cores the container is allowed to use. There are two metrics:
container_spec_cpu_quota - the actual quota value. It equals the fraction of CPU cores you've set as the limit multiplied by container_spec_cpu_period.
container_spec_cpu_period - comes from the CFS scheduler and is the unit in which the quota value is expressed.
I know it may be hard to grasp at first, so allow me to explain with an example:
Consider that you have container_spec_cpu_period set to the default value, which is 100,000 microseconds, and the container CPU limit set to half a core (0.5). In this case:
container_spec_cpu_period 100,000
container_spec_cpu_quota 50,000 # =container_spec_cpu_period*0.5
With CPU limit set to two cores you will have this:
container_spec_cpu_quota 200,000
And so by dividing one by the other we get the fraction of CPU cores back, which then serves as the divisor when calculating how much of the limit is used.
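For intuition, the same arithmetic the query performs can be sketched in plain Python (the sample numbers are illustrative, not from a real scrape):

```python
# Two samples of container_cpu_usage_seconds_total, one minute apart.
usage_t0, usage_t1 = 120.0, 150.0   # cumulative CPU-seconds used
interval = 60.0                      # seconds between the two samples

quota = 50_000    # container_spec_cpu_quota  (limit of 0.5 cores)
period = 100_000  # container_spec_cpu_period (CFS period, microseconds)

cores_used = (usage_t1 - usage_t0) / interval   # what irate() computes: 0.5 cores
cores_allowed = quota / period                   # the limit: 0.5 cores

print(cores_used / cores_allowed)  # 1.0 -> the container is at 100% of its limit
```

With these made-up samples the container burned 30 CPU-seconds in 60 seconds, i.e. exactly its 0.5-core allowance, so the ratio comes out as 1.0 (100% of the limit).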

Related

Spark UI: How to understand the min/med/max in DAG

I would like to fully understand the meaning of the min/med/max information, for example:
scan time total(min, med, max)
34m(3.1s, 10.8s, 15.1s)
means that, of all cores, the min scan time is 3.1s and the max is 15.1s, and the total accumulated time is 34 minutes, right?
then for
data size total (min, med, max)
8.2GB(41.5MB, 42.2MB, 43.6MB)
means that, of all the cores, the max usage is 43.6MB and the min usage is 41.5MB, right?
So by the same logic, for the Sort step at the left, 80MB of RAM has been used per core.
Now, the executor has 4 cores and 6G of RAM. According to the metrics, I think a lot of RAM has been set aside, since each core could use up to around 1G of RAM. So I would like to try reducing the partition number to force each executor to process more data and reduce the shuffle size; do you think it is theoretically possible?
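A sketch of how those summary figures relate to the underlying per-task values (the task durations below are made up for illustration; the Spark UI aggregates one value per task, and the cores just run those tasks):

```python
from statistics import median

# Hypothetical per-task scan times in seconds for one stage.
task_scan_times = [3.1, 9.8, 10.8, 12.4, 15.1]

total = sum(task_scan_times)
summary = (min(task_scan_times), median(task_scan_times), max(task_scan_times))

print(round(total, 1))  # 51.2 -- the "total" figure is a sum over all tasks
print(summary)          # (3.1, 10.8, 15.1) -- the (min, med, max) triple
```

So "34m (3.1s, 10.8s, 15.1s)" reads the same way: the total is accumulated across all tasks, while min/med/max summarize the distribution of individual task times.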

TensorFlow object detection limiting memory and cpu usage

I managed to run the TensorFlow pet example from the tutorial. I decided to use the slowest model (because I want to use it for my own data). However, when I start the training it gets killed after running for a bit. It used all my CPUs (4) and all my memory (8GB). Do you know any way I can limit the number of CPUs to 2 and limit the amount of memory used? Should I reduce the batch size? My batch size is already 1.
I managed to run it by reducing the resize dimensions:
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 300
    max_dimension: 612
  }
}
Thanks in advance.
Another idea to reduce memory usage is to reduce the queue sizes for input data. Specifically, in the object_detection/protos/train.proto file, you will see entries for batch_queue_capacity and prefetch_queue_capacity --- consider setting these fields explicitly in your config file to smaller numbers.
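For illustration only, a hedged sketch of what setting those fields could look like in the train_config block of the pipeline config (the values here are placeholders, not recommendations; the defaults are larger):

```
train_config: {
  batch_size: 1
  # Illustrative values -- tune downward until training fits in memory.
  batch_queue_capacity: 10
  prefetch_queue_capacity: 5
}
```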

Calculating actual flop/core when using actual memory bandwidth

I want to calculate the actual amount of mflop/s/core using the following information:
I have measured the actual memory bandwidth of each core on one node, which is 4371 MB/s.
I have also measured mflop/s/core on one node when using only one core (in this case the whole memory bandwidth of the node is available to that core); the result is 2094.45. The memory bandwidth available to that core was 10812.3 MB/s.
So now I want to calculate the actual mflop/s/core when the core has its real memory bandwidth (4371 MB/s).
Do you think it would be correct if I calculate it like this:
actual mflop/s/core= (mflop/s/core * actual memory bw) / used memory bandwidth
Any help would be appreciated.
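A sketch of the proposed proportional estimate with the numbers from the question. Note the built-in assumption: this is only valid if the kernel is entirely memory-bandwidth-bound, so that the flop rate scales linearly with the bandwidth available to the core.

```python
def scaled_mflops(measured_mflops, actual_bw_mbs, measured_bw_mbs):
    """Scale a measured flop rate by the ratio of available memory bandwidths.

    Assumes the workload is fully memory-bound; compute-bound kernels will
    not degrade proportionally with bandwidth.
    """
    return measured_mflops * actual_bw_mbs / measured_bw_mbs

estimate = scaled_mflops(2094.45, 4371.0, 10812.3)
print(round(estimate, 1))  # 846.7 mflop/s/core
```

If the code is partly compute-bound, the real figure will land somewhere between this estimate and the measured 2094.45, so treating it as a lower bound is safer than treating it as exact.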

How to measure performance in Docker?

Is it possible to have performance issues in Docker?
Because I know VMs, where you have to specify how much RAM you want to use, etc.
But I don't know about that in Docker. It's just running. Will it automatically use the RAM it needs, or how does this work?
Will it automatically use the RAM it needs, or how does this work?
No; by default it will use the minimum memory needed, up to a limit.
You can use docker stats to see this for a running container:
$ docker stats redis1 redis2
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
redis1 0.07% 796 KB / 64 MB 1.21% 788 B / 648 B 3.568 MB / 512 KB
redis2 0.07% 2.746 MB / 64 MB 4.29% 1.266 KB / 648 B 12.4 MB / 0 B
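The MEM % column is just usage divided by limit; a quick sketch of the arithmetic behind the two rows above:

```python
def mem_percent(usage_mb, limit_mb):
    """Memory usage as a percentage of the configured limit, as docker stats shows it."""
    return round(usage_mb / limit_mb * 100, 2)

print(mem_percent(796 / 1024, 64))  # redis1: 796 KB of 64 MB -> 1.21
print(mem_percent(2.746, 64))       # redis2: 2.746 MB of 64 MB -> 4.29
```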
When you use docker run, you can specify those limits with Runtime constraints on resources.
That includes RAM:
-m, --memory=""
Memory limit (format: <number>[<unit>], where unit = b, k, m or g)
Under normal circumstances, containers can use as much memory as needed and are constrained only by the hard limits set with the -m/--memory option.
When memory reservation is set, Docker detects memory contention or low memory and forces containers to restrict their consumption to a reservation limit.
By default, the kernel kills processes in a container if an out-of-memory (OOM) error occurs.
To change this behaviour, use the --oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option.
Note: the upcoming (1.10) docker update command may include dynamic memory changes. See docker update.
By default, docker containers are not limited in the amount of resources they can consume from the host. Containers are limited in what permissions / capabilities they have (that's the "container" part).
You should always set constraints on a container, for example, the maximum amount of memory a container is allowed to use, the amount of swap space, and the amount of CPU. Not setting such limits can potentially lead to the host running out of memory, and the kernel killing off random processes (OOM kill), to free up memory. "Random" in that case, can also mean that the kernel kills your ssh server, or the docker daemon itself.
Read more on constraining resources on a container in Runtime constraints on resources in the manual.

What can be the reason for CPU load to NOT scale linearly with the number of traffic processing workers?

We are writing a front end that is supposed to process a large volume of traffic (in our case Diameter traffic, but that may be irrelevant to the question). As a client connects, its server socket gets assigned to one of the Worker processes that perform all the actual traffic processing. In other words, the Worker does all the work, and more Workers should be added as more clients connect.
One would expect the CPU load per message to be the same for different numbers of Workers, because the Workers are completely independent and serve disjoint sets of client connections. Yet our tests show that it takes more CPU time per message as the number of Workers grows.
To be more precise, the CPU load depends on the TPS (Transactions or Request-Responses per second) as follows.
For 1 Worker:
60K TPS - 16%, 65K TPS - 17%... i.e. ~0.26% CPU per KTPS
For 2 Workers:
80K TPS - 29%, 85K TPS - 30%... i.e. ~0.35% CPU per KTPS
For 4 Workers:
85K TPS - 33%, 90K TPS - 37%... i.e. ~0.41% CPU per KTPS
What is the explanation for this? The Workers are independent processes and there is no inter-process communication between them. Also, each Worker is single-threaded.
The programming language is C++.
This effect is observed on any hardware close to this one: 2 Intel Xeon CPUs, 4-6 cores, 2-3 GHz.
OS: RedHat Linux (RHEL) 5.8, 6.4
CPU load measurements are done using mpstat and top.
If either the size of the program code used by a worker or the size of the data processed by a worker (or both) is not small, the reason could be the reduced effectiveness of the various caches: the temporal locality with which a single worker accesses its program code and/or its data is disturbed by other workers intervening.
The effect can be quite complicated to understand, because:
it depends massively on the structure of your code's computations,
modern CPUs have about three levels of cache,
each cache has a different size,
some caches are local to one core, others are not,
how often the workers intervene depends on your operating system's scheduling strategy
which gets even more complicated if there are multiple cores,
unless your programming language's run-time system also intervenes,
in which case it is more complicated still,
your network interface is a computer of its own and has a cache, too,
and probably more.
Caveat: Given the relatively coarse granularity of process scheduling, the effect of this ought not to be as large as it is, I think.
But then: Have you looked up how "percent of CPU" is even defined?
Until you reach CPU saturation on your machine you cannot be sure that the effect is actually as large as it looks. And when you do reach saturation, it may not be the CPU at all that is the bottleneck here, so are you sure you need to care about CPU load?
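A toy sketch of the access-pattern idea behind the cache explanation (Python, so the cache effect itself is largely masked by interpreter overhead; in the C++ server it would be far more pronounced). Summing a matrix row-by-row walks memory sequentially, while column-by-column strides across rows, which on real hardware defeats the caches:

```python
# Both traversals compute the same sum; only the memory-access order differs.
N = 500
matrix = [[1] * N for _ in range(N)]

def row_major(m):
    """Sequential access: consecutive elements of a row sit next to each other."""
    return sum(m[i][j] for i in range(N) for j in range(N))

def col_major(m):
    """Strided access: each step jumps to a different row's storage."""
    return sum(m[i][j] for j in range(N) for i in range(N))

assert row_major(matrix) == col_major(matrix) == N * N
```

The same logic applies to the workers: each one's working set is cache-friendly in isolation, but interleaved scheduling makes the combined access pattern look strided from the cache's point of view.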
I completely agree with @Lutz Prechelt. Here I just want to add the method for investigating the issue, and the answer is perf.
Perf is a performance-analysis tool on Linux which collects both kernel and userspace events and provides some nice metrics. It's been widely used in my team to find bottlenecks in CPU-bound applications.
The output of perf looks like this:
Performance counter stats for './cache_line_test 0 1 2 3':
1288.050638 task-clock # 3.930 CPUs utilized
185 context-switches # 0.144 K/sec
8 cpu-migrations # 0.006 K/sec
395 page-faults # 0.307 K/sec
3,182,411,312 cycles # 2.471 GHz [39.95%]
2,720,300,251 stalled-cycles-frontend # 85.48% frontend cycles idle [40.28%]
764,587,902 stalled-cycles-backend # 24.03% backend cycles idle [40.43%]
1,040,828,706 instructions # 0.33 insns per cycle
# 2.61 stalled cycles per insn [51.33%]
130,948,681 branches # 101.664 M/sec [51.48%]
20,721 branch-misses # 0.02% of all branches [50.65%]
652,263,290 L1-dcache-loads # 506.396 M/sec [51.24%]
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
4,846,815 LLC-loads # 3.763 M/sec [40.18%]
301 LLC-load-misses # 0.01% of all LL-cache hits [39.58%]
It outputs your cache-miss rate, which makes it easy to tune your program and see the effect.
I wrote an article about cache-line effects and perf; you can read it for more details.
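The percentages perf prints are derived from the raw counters; for example, redoing the arithmetic on the numbers shown above:

```python
# Raw counters copied from the perf output above.
instructions = 1_040_828_706
cycles = 3_182_411_312
l1_loads = 652_263_290
l1_misses = 10_055_747

ipc = instructions / cycles            # instructions retired per cycle
l1_miss_rate = l1_misses / l1_loads    # fraction of L1-dcache loads that miss

print(round(ipc, 2))                 # 0.33, matching "0.33 insns per cycle"
print(round(l1_miss_rate * 100, 2))  # 1.54, matching "1.54% of all L1-dcache hits"
```

A low IPC together with a high stalled-cycles percentage is the typical signature of the cache-contention scenario described in the previous answer.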
