Can GNU Parallel execute more parallel processes?

Can I for example execute:
parallel -j 200 < list0
Where "list" has:
nice -n -20 parallel -j 100 < list2
nice -n -20 parallel -j 100 < list1
Would this be feasible/possible?

The number of processes is limited. In practice, don't run more than a few dozen processes in parallel (at least on a personal computer; on million-dollar machines the limit is considerably higher).
Read the fork(2), execve(2), and setrlimit(2) man pages to understand when and how they can fail.
And before you reach the limit where fork would fail, you'll hit a threshold where adding more parallel processes slows down the entire computation, because the scheduler inside the kernel becomes overstressed.
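If you want to see where these limits sit on your own system, a quick check from the shell (a sketch; the limits are per-user and distribution-dependent) is:
# Per-user process limit and per-process filehandle limit (bash builtins)
ulimit -u    # max user processes (relevant to fork failing)
ulimit -n    # max open filehandles per process
# System-wide ceiling on the number of process IDs
cat /proc/sys/kernel/pid_max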
If you have access to a network of computers (perhaps just two of them), consider also MPI. If you can recode some of your applications, read also about pthreads.

You would normally do:
parallel -j0 < joblist
Then GNU Parallel will start as many jobs as it possibly can. It will typically be limited by either filehandles or processes.
If you cannot raise the number of filehandles per process, you can often cheat with:
parallel -j10 --pipe parallel -j0 < joblist
This will start 10 times as many jobs. The same trick is useful if the runtime of each job is less than 10 ms (in which case the 2-3 ms overhead from GNU Parallel may exceed the time spent actually running the job).
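A minimal sketch of the nested setup (the joblist of 100,000 tiny jobs is made up purely for illustration):
# Generate a joblist of very short-running commands (hypothetical workload)
seq 100000 | awk '{print "echo job" $1 " >/dev/null"}' > joblist
# Plain invocation: one GNU Parallel instance, limited by its own
# filehandle/process limits
parallel -j0 < joblist
# Nested invocation: the outer parallel chops stdin into chunks and feeds
# each chunk to one of 10 inner parallels, multiplying the job slots by 10
parallel -j10 --pipe parallel -j0 < joblist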

Related

GNU parallel: limit CPUs and RAM per Job

I have a command, for example:
cat sample_name.list | parallel -j 5 --max-args=1 --progress --keep-order --results logs --joblog logs.txt echo {1}
I cannot find an option that helps me limit the number of CPUs and the amount of RAM assigned to each job.
According to lscpu I have 12 CPUs, and 16 GB RAM; I want to give each job 2 CPUs and 1 GB RAM.
Any help? Thanks!
I would use -j 50%.
This will run 6 jobs in parallel on a 12-core machine, so there will be 2 cores per job.
It is unclear what you want to happen if a job uses more than 1 GB.
Maybe --memsuspend 1G is what you are looking for?
This will start suspending jobs when the free memory falls below 2 GB. If there is only 1 GB free, only a single job will be allowed to run. When the jobs free up more memory, the suspended jobs will be resumed.
The idea here is that suspended jobs will be swapped out, thus freeing up memory.
It is particularly useful if your program runs for a long time with low memory usage but needs a lot of memory when finishing up, since you ideally want only a single job to be in the finishing-up state at a time.
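Putting the two suggestions together with the command from the question, the invocation might look like this (a sketch, not a tested drop-in):
cat sample_name.list | parallel -j 50% --memsuspend 1G --max-args=1 --progress --keep-order --results logs --joblog logs.txt echo {1}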

Different profiling modes for different cores using perf

I have the following questions regarding perf.
a) Is it possible to run different profiling modes on different cores simultaneously? E.g. Core 0 with event-based sampling (sampling every N events) and Core 1 with free-running counter-based sampling?
b) In case a) is not possible, is it then possible to get a snapshot of the PMU counters on the other cores (Core 1) for every sample (overflow at N events) on Core 0?
P.S.: The platform is a Raspberry Pi 3B+ based on the Arm Cortex-A53.
It is possible to operate different profiling modes on different cores of the CPU simultaneously.
perf also has a processor-wide mode wherein all threads running on the designated processors are monitored. Counts and samples are thus aggregated per CPU/core.
-C, --cpu=
Count only on the list of CPUs provided. Multiple CPUs can be
provided as a comma-separated list with no space: 0,1. Ranges of
CPUs are specified with -: 0-2. In per-thread mode, this option
is ignored. The -a option is still necessary to activate
system-wide monitoring. Default is to count on all CPUs.
Running the free-running counter (perf stat) and the sampling mechanism (perf record) of perf simultaneously on different cores of the CPU is possible, like below.
e.g. for CPU 0:
perf stat --cpu 0 -B dd if=/dev/zero of=/dev/null count=1000000
and for CPU 1:
perf record --cpu 1 sleep 20
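To run the two modes at the same time, you can background one of them and wait for both; a minimal sketch (the cycles event and the sample period of 100000 are illustrative choices, not recommendations):
# Count events system-wide on CPU 0 while sampling on CPU 1
perf stat -a --cpu 0 -e cycles,instructions sleep 20 &
perf record --cpu 1 -e cycles -c 100000 sleep 20 &
wait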

Parallel computing in Julia - running a simple for-loop on multiple cores

For starters, I have to say I'm completely new to parallel computing (and know close to nothing about computer science), so my understanding of what things like "workers" or "processes" actually are is very limited. I do, however, have a question about running in parallel a simple for-loop that presumably has no dependencies between iterations.
Let's say I wanted to do the following:
for N in 1:5:20
println("The N of this iteration in $N")
end
If I simply wanted these messages to appear on screen and the order of appearance didn't matter, how could one achieve this in Julia 0.6, and for future reference in Julia 0.7 (and therefore 1.0)?
Just to add an example to the answer of Chris: since the release of Julia 1.3, you can do this easily with Threads.@threads
Threads.@threads for N in 1:5:20
println("The number of this iteration is $N")
end
Here you are running only one julia session with multiple threads instead of using Distributed where you run multiple julia sessions.
See, e.g., the Julia blog post on multithreading for more information.
Distributed Processing
Start julia with e.g. julia -p 4 if you want to use 4 CPUs (or use the function addprocs(4)). In Julia 1.x, you make a parallel loop as follows:
using Distributed
@distributed for N in 1:5:20
println("The N of this iteration is $N")
end
Note that every process has its own variables by default.
For any serious work, have a look at the manual https://docs.julialang.org/en/v1.4/manual/parallel-computing/, in particular the section about SharedArrays.
Other options for distributed computing are the function pmap or the package MPI.jl.
Threads
Since Julia 1.3, you can also use Threads as noted by wueli.
Start julia with e.g. julia -t 4 to use 4 threads. Alternatively, you can set the environment variable JULIA_NUM_THREADS before starting julia.
For example Linux/Mac OS:
export JULIA_NUM_THREADS=4
On Windows, you can use set JULIA_NUM_THREADS=4 in the cmd prompt.
Then in julia:
Threads.@threads for N = 1:20
println("N = $N (thread $(Threads.threadid()) out of $(Threads.nthreads()))")
end
In the examples above, all CPUs are assumed to have access to shared memory ("OpenMP-style" parallelism), which is the common case for multi-core machines.

Why does GNU parallel affect script speed?

I have a Fortran program, which I compile with gfortran and then run as time ./a.out.
The program completes, and time reports the runtime as
real 0m36.037s
user 0m36.028s
sys 0m0.004s
i.e. ~36 seconds
Now suppose I want to run this program multiple times, in parallel. For this I am using GNU Parallel.
Using the lscpu command tells me that I have 8 CPUs, with 2 threads per core and 4 cores per socket.
I create some file example.txt of the form,
time ./a.out
time ./a.out
time ./a.out
time ./a.out
...
which goes on for 8 lines.
I can then run these in parallel on 8 cores as,
parallel -j 8 :::: example.txt
In this case I would expect the runtime of each instance to still be ~36 seconds, and the total runtime to be ~36 seconds. However, what actually happens is that the runtime of each instance roughly doubles.
If I instead run on 4 cores (-j 4), the problem disappears, and each instance reverts to taking 36 seconds.
What is the cause of this? I have heard talk in the past of 'overheads', but I am not sure exactly what is meant by this.
What is happening is that you have only one socket with 4 physical cores in it.
Those are the real cores of your machine.
The total number of CPUs you see as output of lscpu is calculated using the following formula: #sockets * #cores_per_socket * #threads_per_core.
In your case it is 1*4*2=8.
The threads per core act as a kind of virtual CPU, and they do not always perform like real cores, especially for compute-intensive processing (this feature is called hyper-threading).
Hence, when you try to squeeze two compute-bound threads onto one core, they end up executing almost serially.
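If you want to see the topology for yourself and pin the job count to the physical cores, a quick check is the extended lscpu output (a sketch; logical CPUs sharing a CORE value are hyperthread siblings):
# Show which logical CPUs map to which physical cores
lscpu --extended=CPU,CORE,SOCKET
# One job per physical core avoids the hyperthreading slowdown
parallel -j 4 :::: example.txt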

When/how to benefit from parallel processing of scripts?

I have a sequence of scripts to run on a computer with 1 physical & logical core.
I have tried running them in sequence, and also forking them with something like the bash script below. The background processes, running in parallel, actually took longer than running them in sequence.
My question is: Under what circumstances should a workload like this be run in parallel? That is, must I have particular hardware or can I do this more efficiently with the single-processor computer that I have?
#!/bin/bash
# Run them in sequence...
T1=$(date +%s)
python proc_test1.py
python proc_test2.py
python proc_test3.py
T2=$(date +%s)
T=$((T2-T1))
echo "Scripts took ${T} seconds."
# Now fork them...
T1=$(date +%s)
python proc_test1.py &
python proc_test2.py &
python proc_test3.py &
wait
T2=$(date +%s)
T=$((T2-T1))
echo "Scripts took ${T} seconds."
exit 0
Parallel processing usually becomes beneficial when the hardware to support it is available. For example, if you had three logical CPUs, then three computations could occur simultaneously. Thus, ideally, if you ran proc_test1.py three times with forks on three available processors, all three would finish in the same time it would take to run just one instance of proc_test1.py.
In other words, given sufficient hardware, running three proc_test1.py's in serial will take three times as long as running them with forks.
Now, given that you have just one hardware CPU, it makes sense that the parallel jobs run more slowly than the serial ones, as the python programs compete with each other for CPU time, and the CPU stopping one job and resuming another costs CPU time itself.
For example, say you had 6 oranges and two hands, and you had to hold each of the 6 oranges for 5 seconds. Say it takes you one second to pick up or swap out oranges. You could do this task in serial, picking up two oranges at a time and holding them for five seconds before swapping in a new pair. This would take you
1 + 5 + 1 + 5 + 1 + 5
= 3 * 5 + 3 = 18
seconds to complete.
Now suppose the parallel analogy. All 6 oranges are begging to be picked up, and holding one does not mean you won't immediately drop it for an alternative. There isn't necessarily an upper bound on how long the task will take as we have defined it, so suppose that you have to hold the oranges for at least 2.5 seconds before you swap them out, and that you only swap in pairs. Then, it will take you
1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1
= 3 * 5 + 7 = 22
seconds to complete. Note that by "forking", it takes you 22% longer to hold 6 oranges for 5 seconds each. Since you have two hands, it still takes you 15 seconds to complete the task, but there is a variable overhead in switching time based on your strategy. Note that if you had 6 hands, it would take only 7 seconds to complete the task.
Thus, when you have more processors, fork processes; otherwise you're just juggling jobs on limited hardware.
Your simple question is in practice extremely hard to answer in general. In real life I would always measure to see if reality agrees with my theory.
An example where reality did not agree with my theory was on my Intel Core i7. It has 4 cores and hyperthreading. That would suggest that running 8 threads in parallel would be optimal: you would be using the 4 processing units and 4 additional threads to keep the pipelines filled.
However, the Core i7 has 6 MB of shared cache, and it just so happened that the working set of my data fit inside those 6 MB. So I saw an extreme speedup by running 1 thread instead of 2 or even 8: running more than 1 would simply flush the cache all the time. This would not have been true if the cache had not been shared, but it just shows that it is not simple to say when parallelizing will be faster.
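A minimal way to do that measurement with the scripts from the question (assuming GNU Parallel is installed; the job counts are illustrative):
# Time the same three scripts at increasing degrees of parallelism
for j in 1 2 3; do
  echo "== $j parallel job(s) =="
  time parallel -j "$j" python {} ::: proc_test1.py proc_test2.py proc_test3.py
done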
