GNU parallel: limit CPUs and RAM per job

I have a command, for example
cat sample_name.list | parallel -j 5 --max-args=1 --progress --keep-order --results logs --joblog logs.txt echo {1}
I cannot find an option that lets me limit the number of CPUs and the amount of RAM assigned to each job.
According to lscpu I have 12 CPUs and 16 GB of RAM; I want to give each job 2 CPUs and 1 GB of RAM.
Any help? Thanks!

I would use -j 50%.
This will run 6 jobs in parallel on a 12-core machine, so there will be 2 cores per job.
It is unclear what you want to happen if a job uses more than 1 GB.
Maybe --memsuspend 1G is what you are looking for?
This will start suspending jobs when the free memory falls below 2 GB. If there is only 1 GB free, only a single job will be allowed to run. When the jobs free up more memory, the suspended jobs will be resumed.
The idea here is that suspended jobs will be swapped out, thus freeing up memory.
It is particularly useful if your program runs for a long time with low memory usage but needs a lot of memory when finishing up. Here you ideally want only a single job to be in the finishing-up state at a time.
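Putting the two together with your original command would look like this (a sketch; --memsuspend requires a reasonably recent version of GNU parallel, and echo {1} stands in for your real job):
cat sample_name.list | parallel -j 50% --memsuspend 1G --max-args=1 --progress --keep-order --results logs --joblog logs.txt echo {1}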

Related

Docker Container CPU usage Monitoring

As per the Docker documentation, we can get the CPU usage of a Docker container with the docker stats command.
The column CPU % will give the percentage of the host’s CPU the container is using.
Let's say I limit the container to 50% of a single host CPU. I can specify a 50% single-CPU-core limit with the --cpus=0.5 option as per https://docs.docker.com/config/containers/resource_constraints/
How can we get the CPU% usage of container out of allowed CPU core by any docker command?
E.g. Out of 50% Single CPU core, 99% is used.
Is there any way to get it with cadvisor or prometheus?
How can we get the CPU% usage of container out of allowed CPU core by any docker command? E.g. Out of 50% Single CPU core, 99% is used.
Docker has the docker stats command, which shows CPU/memory usage and a few other stats:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c43f085dea8c foo_test.1.l5haec5oyr36qdjkv82w9q32r 0.00% 11.15MiB / 100MiB 11.15% 7.45kB / 0B 3.29MB / 8.19kB 9
Though it does show memory usage relative to the limit out of the box, there is no such feature for CPU yet. It is possible to solve that with a script that calculates the value on the fly (a rough sketch follows), but I'd rather go with the second option.
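Here is what that script approach might look like (a sketch; mycontainer is a placeholder name, and .HostConfig.NanoCpus is where --cpus stores the limit, in billionths of a core):
LIMIT=$(docker inspect --format '{{.HostConfig.NanoCpus}}' mycontainer)        # 500000000 for --cpus=0.5
USAGE=$(docker stats --no-stream --format '{{.CPUPerc}}' mycontainer | tr -d '%')
echo "$USAGE $LIMIT" | awk '{printf "%.1f%% of the allowed CPU\n", $1 / ($2 / 1000000000)}'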
Is there any way to get it with cadvisor or prometheus?
Yes, there is:
irate(container_cpu_usage_seconds_total{cpu="total"}[1m])
/ ignoring(cpu)
(container_spec_cpu_quota/container_spec_cpu_period)
The first line is a typical irate function that calculates how many CPU seconds per second a container has used. It comes with a label cpu="total", which the second part does not have, and that's why there is ignoring(cpu).
The bottom line calculates how many CPU cores a container is allowed to use. There are two metrics:
container_spec_cpu_quota - the actual quota value. It is computed as the fraction of CPU cores you've set as the limit, multiplied by container_spec_cpu_period.
container_spec_cpu_period - comes from the CFS scheduler and is effectively the unit of the quota value.
I know it may be hard to grasp at first, so allow me to explain with an example:
Consider that you have container_spec_cpu_period set to the default value, which is 100,000 microseconds, and the container CPU limit is set to half a core (0.5). In this case:
container_spec_cpu_period 100,000
container_spec_cpu_quota 50,000 # =container_spec_cpu_period*0.5
With CPU limit set to two cores you will have this:
container_spec_cpu_quota 200,000
And so by dividing one by the other we get the fraction of CPU cores back, which is then used in the outer division to calculate how much of the limit is used.
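To make that concrete, suppose the irate in the numerator comes out to 0.45 CPU seconds per second for the half-core container above (0.45 is just an illustrative number):
0.45 / (50,000 / 100,000) = 0.45 / 0.5 = 0.9
i.e. the container is using 90% of the CPU it is allowed.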

Different profiling modes for different cores using perf

I have the following questions regarding perf.
a) Is it possible to run different profiling modes on different cores simultaneously, e.g. Core 0 with event-based sampling (sampling every N events) and Core 1 with free-running counter-based sampling?
b) If a) is not possible, is it then possible to get a snapshot of the PMU counters on the other cores (Core 1) for every sample (overflow at N events) on Core 0?
P.S.: The platform is an RPi 3B+ based on the Arm Cortex-A53.
It is possible to operate different profiling modes on different cores of the CPU simultaneously.
perf also has a processor-wide mode wherein all threads running on the designated processors are monitored. Counts and samples are thus aggregated per CPU/core.
-C, --cpu=
Count only on the list of CPUs provided. Multiple CPUs can be
provided as a comma-separated list with no space: 0,1. Ranges of
CPUs are specified with -: 0-2. In per-thread mode, this option
is ignored. The -a option is still necessary to activate
system-wide monitoring. Default is to count on all CPUs.
Running both the free-running counter and the sampling mechanism of perf simultaneously is possible on different cores of the CPU, like below.
E.g. for CPU 0:
perf stat --cpu 0 -B dd if=/dev/zero of=/dev/null count=1000000
and for CPU 1:
perf record --cpu 1 sleep 20
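If you want both measurements to cover the same time window, one of the two can simply be put in the background (a sketch; the output file name is just illustrative):
perf stat --cpu 0 -B dd if=/dev/zero of=/dev/null count=1000000 &
perf record --cpu 1 -o perf_core1.data sleep 20
wait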

Why does GNU parallel affect script speed?

I have a Fortran program. I compile it with gfortran and then run it as time ./a.out.
The program completes, and time reports the runtime as,
real 0m36.037s
user 0m36.028s
sys 0m0.004s
i.e. ~36 seconds
Now suppose I want to run this program multiple times, in parallel. For this I am using GNU Parallel.
Using the lscpu command tells me that I have 8 CPUs, with 2 threads per core and 4 cores per socket.
I create some file example.txt of the form,
time ./a.out
time ./a.out
time ./a.out
time ./a.out
...
which goes on for 8 lines.
I can then run these in parallel on 8 cores as,
parallel -j 8 :::: example.txt
In this case I would expect the runtime for each copy to still be 36 seconds, and the total runtime to be ~36 seconds. However, in actuality the runtime for each copy roughly doubles.
If I instead run on 4 cores instead of 8 (-j 4) the problem disappears, and each copy reverts to taking 36 seconds to run.
What is the cause of this? I have heard talk in the past on 'overheads' but I am not sure exactly what is meant by this.
What is happening is that you have only one socket with 4 physical cores in it.
Those are the real cores of your machine.
The total number of CPUs you see as output of lscpu is calculated using the following formula: #sockets * #cores_per_socket * #threads_per_core.
In your case it is 1*4*2=8.
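You can read those three factors straight from lscpu; on a machine like the one described it would look roughly like this:
$ lscpu | grep -E 'Thread|Core|Socket'
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1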
Threads per core are a sort of virtual CPU, and they do not always perform like real CPUs, especially for compute-intensive processing (this feature is called hyper-threading).
Hence, when you try to squeeze two threads onto each core, they end up running almost serially.
Take a look at this article for more info.

Mesos cgroups isolation not killing tasks when limit is reached

I was testing Mesos cgroups isolation to see what kind of error gets thrown.
I ran the below shell program with Marathon, assigning 1 MB of memory and 1 CPU.
#!/bin/sh
temp=a
while :
do
    temp=$temp$temp
    echo ${#temp}
    sleep 1
done
A single character takes 1 B of space, so the program above should fail once the length of the temp string reaches about 1 MB. But the task seems to get killed at random points; sometimes it is killed at length 1048576, sometimes 2097152 or 4194304.
Ideally, since 1 MB is the limit, it should have stopped when the length is 524288.
Additional info -
Slave is run with --isolation='cgroups/cpu,cgroups/mem'
Mesos version - 0.25
The variance you are seeing can be explained with the following:
The amount of memory taken up by your script is not entirely deterministic, as it depends on the implementation of the shell interpreter as well as the size of your system's shared libraries (i.e. the parts of those libraries loaded into your program's resident set).
A 1 MB task in Mesos is accompanied by 32 MB for the executor. Because the executor requires slightly less than 32 MB, you will have slightly more than 1 MB for your task.
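If you want to verify the limit the agent actually applied, you can read it straight from the memory cgroup (the path below is an assumption; it depends on your agent's cgroup root and the container ID):
cat /sys/fs/cgroup/memory/mesos/<container-id>/memory.limit_in_bytes
# expect roughly (1 + 32) MB = 34603008 bytes rather than 1048576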

How is CPU time measured on Windows?

I am currently creating a program which identifies processes that are hung or out of control and are using an entire CPU core. The program then terminates them, so the CPU usage can be kept under control.
However, I have run into a problem: When I execute the 'tasklist' command on Windows, it outputs this:
Image Name: Blockland.exe
PID: 4880
Session Name: Console
Session#: 6
Mem Usage: 127,544 K
Status: Running
User Name: [removed]\[removed]
CPU Time: 0:00:22
Window Title: C:\HammerHost\Blockland\Blockland.exe
So I know that the line which says "CPU Time" is an indication of the total time, in seconds, used by the program ever since it started.
But let's suppose there are 4 CPU cores on the system. Does this mean that it used up 22 seconds of one core, and therefore used 5.5 seconds on the entire CPU in total? Or does this mean that the process used up 22 seconds on the entire CPU?
It's the total CPU time across all cores. So, if the task used 10 seconds on one core and then 15 seconds later on a different core it would report 25 seconds. If it used 5 seconds on all four cores simultaneously, it would report 20 seconds.
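If you want a live number normalized per core rather than the accumulated total, the Process performance counter gives it directly; there, 100% means one fully used core, so the value can exceed 100% for multi-threaded processes (the instance name below is just the example process from the question):
typeperf "\Process(Blockland)\% Processor Time" -sc 5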
