GNU parallel: does -k (keep output order) affect speed?

As said in the title, I'm wondering if the -k option (strongly) affects the speed of GNU parallel.
In man parallel_tutorial there is a discussion of --ungroup and --line-buffer, which claims that --line-buffer, which prevents output lines from different jobs from being mixed, is much slower than --ungroup. So maybe -k will also cause a major slowdown when the job count is large?
(I didn't find this topic in man parallel or man parallel_tutorial, nor did I find anything by searching Google. I haven't finished reading man parallel yet, though, so please excuse me if I missed something obvious.)

-k does not slow anything down, but it needs 4 file handles for each job. If GNU Parallel runs out of file handles, it will wait until one of the running jobs finishes.
-g compared to -u slows things down by around 1-2 milliseconds per job (plus the time it takes to write the output to disk and read it back), so the slowdown will only be noticeable if you run very short jobs or jobs with a lot of output.
--line-buffer can be faster or slower than -g. It does not buffer on disk, but it takes more CPU time to run - especially if your jobs output data slowly.
My recommendation would be to use what is easiest for you to use, and only if that proves to be too slow, look into the other options.
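A quick way to check on your own workload is to time the same job list with each option. This is only a sketch, with seq 1000 and echo standing in for your real job list and command:
time seq 1000 | parallel -k echo > /dev/null
time seq 1000 | parallel -g echo > /dev/null
time seq 1000 | parallel -u echo > /dev/null
time seq 1000 | parallel --line-buffer echo > /dev/null
With a trivial command like echo the results are dominated by per-job overhead; with your real jobs the picture may look quite different.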

Related

In bash, what happens when an infinite output stream is redirected to a file?

As an experiment, I redirected the output of yes, which is a command that outputs a string repeatedly until killed, as follows:
yes > a.txt
I interrupted the process with Ctrl+C within a split second, but even in that short time a.txt ended up being a couple of hundred megabytes in size.
So, I have several questions:
How can such a large file be generated in such a short time? Aren't there write-speed limits, especially on an HDD like the one I am using?
How can I avoid this when I unintentionally redirect the output of a program with an endless loop?
Assume the process above runs for a long time. Will all the remaining storage space eventually be filled, or are there any "safety measures" set by bash, the OS, etc.?
Output is kept temporarily in kernel buffers. When memory fills up, writing slows down to the speed of the hard disk, which is still pretty fast.
Most Unix filesystems reserve a percentage of space for root processes. You can see this by comparing the output of df with sudo df. But unless you have user quotas enabled, you can certainly use all your disk space up like that. Fortunately, the rm command-line tool doesn't need to consume disk storage to do its work.
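As for the "how can I avoid this" part of the question, two simple safeguards, sketched below (the 100 MB figure is arbitrary):
# Cap how much output ever reaches the file by truncating the stream
# (GNU head accepts size suffixes like M):
yes | head -c 100M > a.txt
# Or limit the maximum size of files the shell and its children may create.
# bash's ulimit -f counts 512-byte blocks, so 204800 blocks = 100 MB;
# a process that exceeds the limit receives SIGXFSZ and is terminated:
ulimit -f 204800
yes > a.txt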

Serial program runs slower with multiple instances or in parallel

I have a Fortran code that I am using to calculate some quantities related to the work that I do. The code itself involves several nested loops and requires very little disk I/O. Whenever the code is modified, I run it against a suite of several input files (just to make sure it's working properly).
To make a long story short, the most recent update has increased the run time of the program by about a factor of four, and running through the input files serially on one CPU now takes about 45 minutes (a long time to wait just to see whether anything was broken). Consequently, I'd like to run the input files in parallel across the 4 CPUs on the system. I've been attempting to implement the parallelism via a bash script.
The interesting thing I have noted is that, when only one instance of the program is running on the machine, it takes about three and a half minutes to crank through one of the input files. When four instances are running, it takes more like eleven and a half minutes per input file (bringing my total run time down from about 45 minutes to 36 minutes - an improvement, yes, but not quite what I had hoped for).
I've tried implementing the parallelism using GNU parallel, xargs, wait, and even just starting four instances of the program in the background from the command line. Regardless of how the instances are started, I see the same slowdown. Consequently, I'm pretty sure this isn't an artifact of the shell scripting, but something going on with the program itself.
I have tried rebuilding the program with debugging symbols turned off, and also using static linking. Neither of these had any noticeable impact. I'm currently building the program with the following options:
$ gfortran -Wall -g -O3 -fbacktrace -ffpe-trap=invalid,zero,overflow,underflow,denormal -fbounds-check -finit-real=nan -finit-integer=nan -o [program name] {sources}
Any help or guidance would be much appreciated!
On modern CPUs you cannot expect a linear speedup. There are several reasons:
Hyperthreading: GNU/Linux will see a hyperthread as a core even though it is not a real core. It is more like 30% of a core.
Shared caches: If your cores share the same cache and a single instance of your program uses the full shared cache, then you will get more cache misses if you run more instances.
Memory bandwidth: A similar case to the shared cache is the shared memory bandwidth. If a single thread uses the full memory bandwidth, then running more jobs in parallel may congest the bandwidth. This can partly be solved by running on a NUMA system where each CPU has some RAM that is "closer" than other RAM.
Turbo mode: Many CPUs can run a single thread at a higher clock rate than multiple threads. This is due to heat limits.
All of these will exhibit the same symptom: running a single thread will be faster than each of the multiple threads, but the total throughput of the multiple threads will be higher than that of the single thread.
Though I must admit your case sounds extreme: With 4 cores I would have expected a speedup of at least 2.
How to identify the reason
Hyperthreading: Use taskset to select which cores to run on. If you use 2 of the 4 cores, is there any difference between using #1+#2 and #1+#3? (See the sketch after this list.)
Turbo mode: Use cpufreq-set to force a low frequency. Is the speed now the same whether you run 1 or 2 jobs in parallel?
Shared cache: Not sure how to do this, but if it is somehow possible to disable the cache, then comparing 1 job to 2 jobs run at the same low frequency should give an indication.
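A minimal sketch of the first two checks; ./prog and the input file names are placeholders, and the 1.6 GHz figure is only an example:
# Which logical CPUs are hyperthread siblings of CPU 0?
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# Pin two instances to two chosen logical CPUs and time them together;
# compare a sibling pair against two separate physical cores:
taskset -c 0 ./prog input1.dat &
taskset -c 2 ./prog input2.dat &
time wait
# Rule out turbo mode by forcing a fixed low frequency
# (needs root, the cpufrequtils package and the userspace governor; repeat per CPU):
sudo cpufreq-set -c 0 -f 1.6GHz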

Script that benchmarks CPU performance giving a result that can be used to compare against other servers?

Any help with the below will be very much appreciated:
The script should not overrun the box to the extent that all of its resources are used.
As an example, the script could run a fixed integer calculation and use its response time as the measure of performance.
My idea is to attach it to a monitoring system and periodically run the test across multiple servers.
Any ideas?
Performance testing isn't nearly as simple as that. All you measure by timing a particular operation is how fast the system completes that particular operation on that particular day.
CPUs are fast now, and they're rarely the bottleneck any more.
So this comes back to the question - what are you trying to accomplish? If you really want a relative measure of CPU speed, then you could try something like using the system utility time.
Something like:
time ssh-keygen -b 8192 -f /tmp/test_ssh -N '' -q
Following on from the comments: this will prompt you to overwrite if the key file already exists; you can always delete /tmp/test_ssh first. (I wouldn't suggest using a variable filename, as that'll be erratic.)
As an alternative - if it's installed - then I'd suggest using openssl.
E.g.
time openssl genrsa 4096
Time returns 3 numbers by default:
'real' - elapsed wallclock time.
'user' - cpu-seconds spent running the user elements of the task.
'sys' - cpu-seconds spent running system stuff.
For this task I'm not entirely sure how the metrics pan out; in particular, I'm not 100% sure whether VMware "fakes" the CPU time in the virtualisation layer. Real and user should normally be pretty similar for this operation.
I've just tested this on a few of my VMs, and have had a fairly pronounced amount of 'jitter' in the results. But then, this might be what you're actually testing for. Bear in mind though - CPU isn't the only contended resource on a VM. Memory/disk/network are also shared.
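If the end goal is to feed a single number to a monitoring system, a small wrapper around the openssl example could look like the sketch below; the metric name is invented, so adapt it to whatever your collector expects:
# Print the elapsed wall-clock seconds of the benchmark as one number:
t=$( { time -p openssl genrsa 4096 >/dev/null 2>&1; } 2>&1 | awk '/^real/ {print $2}' )
echo "cpu_benchmark_seconds $t"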

How to determine a good value for --load-average using gnu Make?

In Make this flag exists:
-l [load], --load-average[=load]
Specifies that no new jobs (commands) should be started if there are other jobs running and the load average is at least load (a floating-point number). With no argument, removes a previous load limit.
Do you have a good strategy for what value to use for the load limit? It seems to differ a lot between my machines.
Acceptable load depends on the number of CPU cores. If there is one core, then a load average of more than 1 is overload. If there are four cores, then a load average of more than 4 is overload.
People often just specify the number of cores using the -j switch.
See some empirical numbers here: https://stackoverflow.com/a/17749621/412080
I recommend against using the -l option.
In principle, -l seems superior to -j. -j says: start this many jobs. -l says: make sure this many jobs are running. Often, those are almost the same thing, but when you have I/O-bound jobs or other oddities, then -l should be better.
That said, the concept of load average is a bit dubious. It is necessarily a sampling of what goes on on the system. So if you run make -j -l N (for some N) and you have a well-written makefile, then make will immediately start a large number of jobs and run out of file descriptors or memory before even the first sample of the system load can be taken. Also, the accounting of the load average differs across operating systems, and some obscure ones don't have it at all.
In practice, you'll be as well off using -j and will have less headaches. To get more performance out of the build, tune your makefiles, play with compiler options, and use ccache or similar.
(I suspect the original reason for the -l option stems from a time when multiple processors were rare and I/O was really slow.)
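To make the two answers concrete, a couple of illustrative invocations (the +1 headroom is arbitrary):
# Usually sufficient: one job per core.
make -j"$(nproc)"
# If you do want a load ceiling, pair -l with -j so make cannot spawn an
# unbounded number of jobs before the first load-average sample is taken:
make -j"$(nproc)" -l"$(( $(nproc) + 1 ))"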

Possible to force a script to use only a certain amount of CPU and memory?

Is it possible to force a Ruby script to use only up to a certain amount of CPU and memory?
I don't want the script to be killed when it exceeds the specified amount; I just want it to run within the given constraints.
EDIT:
Yes, it's an endless recursive loop that seems to use a lot of CPU.
I noticed that doing return at the end of each recursion is causing this. After I remove it, the high CPU usage is gone. What else can I use to terminate the loop? exit?
Yes, calling a sleep function in most programming systems (Ruby included) will cause the program to wait for that amount of time, using little to no CPU power.
Alternatively, you could run your program at a lower priority (in *nix systems, this is done with nice or renice).
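A minimal sketch of the nice/renice suggestion; the script name and PID are hypothetical:
nice -n 19 ruby myscript.rb    # start the script at the lowest CPU priority
renice -n 19 -p 12345          # or lower the priority of an already-running process by PID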
sleep will sleep the current thread for some period of time. Your CPU load goes down because your programme isn't doing anything for that time. The kernel should handle making sure that all the running programmes get sufficient CPU time.
It really depends on what you do in your script. If it's some sort of endless loop, you are just "hibernating" the script and allowing less processing time to be spent on it.
In short, "sleeping" is not a particularly clean or proper solution. It would help if you posted details on exactly what your script does; typically there would be a much more sensible solution available.
You should almost never need to do this. Why would you waste that time doing nothing? There's nothing wrong with the CPU being at high utilisation; it doesn't need to rest. If there are multiple processes running then the operating system deals with dividing the CPU time between them.
