GNU Parallel: Does parallel reload the program for every job? - bash

Suppose I have a program that loads significant content before running... but this is a one-time slowdown.
Next, I write:
cat ... | parallel -j 8 --spreadstdin --block $sz ... ./mycode
Will this induce the load overhead every single job?
If it does induce the overhead, is there a way to avoid it?

As @Barmar says, ./mycode is started for each block in your example.
But since you do not use -k in your example, you may be able to use --round-robin.
... | parallel -j 8 --spreadstdin --round-robin --block $sz ... ./mycode
This will start 8 ./mycodes (but not one per block) and give blocks to any process that is ready to read.
This example shows that more blocks are given to processes 11 and 10 than to processes 4 and 5, because 4 and 5 read more slowly:
seq 1000000 |
parallel -j8 --tag --roundrobin --pipe --block 1k 'pv -qL {}0000 | wc' ::: 11 4 5 6 9 8 7 10

parallel doesn't know anything about the internal workings of the program you're running with it. Each instance runs independently; there's no way for one invocation's initialization to be copied over to the others.
If you want the application to initialize once and then run multiple instances in parallel, you need to design that into the application itself. It should load the data, then use fork() to create multiple processes that use this data.
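A minimal shell sketch of that load-once-then-fork pattern (the data file, the block names, and the grep body are placeholders; in a real program you would load the data and then call fork() directly, and here each background subshell plays that role, since it is a fork of the parent and inherits the already-loaded data):
#!/bin/bash
# Load the expensive data once, in the parent shell.
bigdata=$(cat huge_reference_file)        # hypothetical one-time load

work() {                                  # hypothetical per-block worker
    # $bigdata is inherited from the parent via fork, so it is never reloaded
    printf '%s\n' "$bigdata" | grep -c "$1"
}

for block in block1 block2 block3 block4; do
    work "$block" &                       # each & forks a child sharing the loaded data
done
wait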

Related

parallel computing in multiple cores for data which is independently run with the program

I have a simulation program in Fortran which takes its input from a .dat file. This file has 100,000 lines and takes a really long time to run. The program takes the first line, runs all the simulations, writes the result to a .out file, and moves on to the next line. I have a computer with 16 CPUs, so how can I split my data into 16 parts and run them separately, one on each CPU? I am running on an Ubuntu machine. Each line is totally independent of the others.
For example, my data is HeadData10000.dat, and I have a file simulation.ini containing the name of the input data (in this case HeadData10000.dat) and the name of the output data. So the file simulation.ini looks like this:
HeadData10000.dat
outputdata.out
Right now I have two computers, so I split my HeadData10000.dat into two files, make a simulation.ini for each input file, and run it like this on each computer: ./simulation.exe < ./simulation.ini
Assuming your list of 100,000 jobs is called "jobs.txt" and looks like this:
JobA
JobB
JobC
JobD
You could run this:
parallel 'printf "{}\n{.}.out" | ./simulation.exe' < jobs.txt
If you want to do a dry run to see what that would do without doing anything:
parallel --dry-run 'printf "{}\n{.}.out" | ./simulation.exe' < jobs.txt
Sample Output
printf "JobA\nJobA.out" | ./simulation.exe
printf "JobB\nJobB.out" | ./simulation.exe
printf "JobC\nJobC.out" | ./simulation.exe
printf "JobD\nJobD.out" | ./simulation.exe
If you have multiple servers available, look at using the -S parameter to GNU Parallel to spread the jobs across the machines. Also, look at the --eta and --bar parameters for getting progress reports.
I used printf "line1\nline2" to generate the two lines of input in order to avoid having to create, and later delete, 100,000 files.
By default, GNU Parallel will keep 1 job per CPU core running, so there will always be 16 jobs running on your 16-core machine, but you can change that to, say, 8 if you want to with parallel -j 8. You can also specify the number of jobs to run on your second (and subsequent) machines.
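If you do spread the jobs across machines, a hedged sketch might look like this (server1 and server2 are placeholder hostnames, ./simulation.exe is assumed to exist in the same path on each machine, and the 16/8 job-slot counts are arbitrary):
parallel -S 16/server1,8/server2 --eta 'printf "{}\n{.}.out" | ./simulation.exe' < jobs.txt
Note that each job's .out file is written on whichever machine ran it; GNU Parallel's --return and --cleanup options can bring results back, but they are not shown here.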

optimize parallelisation in a SLURM cluster: the case of genome alignment

I would like to understand the best way of using bwa in parallel on a SLURM cluster. Obviously, this will depend on the computational limits that I have as a user.
The bwa software has a "-t" argument specifying the number of threads. Let's imagine that I use bwa mem -t 3 ref.fa sampleA.fq.gz; this means that bwa splits the job across three threads. In other words, it will align three reads at a time in parallel (I guess).
Now, if I want to run this command on several samples in a SLURM cluster, shall I specify the number of tasks to match bwa mem, and specify the number of CPUs per task (for instance 2)? That would be:
sbatch -c 2 -n 3 bwa.sh
where bwa.sh contains:
cat data.info | while read indv; do
bwa mem -t 3 ref.fa sample${indv}.fq.gz
done
Do you have any suggestions? Or can you improve/correct my reasoning?
With -c 2 you are asking to have 2 CPUs per task.
With -n 3 you are asking to have 3 tasks.
That configuration prepares a set of resources comprising 6 CPUs on up to 3 different nodes. But your script only uses 3 CPUs (-t 3), so you are wasting resources and probably using resources that do not belong to you (because the task will use 3 CPUs while you only asked for 2 CPUs per task).
For that specific script, -c 3 is the proper parameter (-n defaults to one task):
sbatch -c 3 bwa.sh
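A minimal sketch of bwa.sh with the resource request embedded as #SBATCH directives (the redirect to a .sam file is an addition here, since bwa mem writes its alignment to stdout; ref.fa, data.info and the sample naming come from the question):
#!/bin/bash
#SBATCH -c 3            # 3 CPUs for the single task, one per bwa thread
#SBATCH -n 1            # one task (the default anyway)

while read indv; do
    bwa mem -t 3 ref.fa sample${indv}.fq.gz > sample${indv}.sam
done < data.info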

Is there a way to flush stdout on process termination for parallel processes

I'm running several independent programs on a single machine in parallel.
The processes (say 100) are all relatively short (<5 minutes) and their output is limited to a few hundred lines (~kilobytes).
Usually the output in a terminal then becomes mangled because the processes write directly to the same buffer. I would like these outputs to be un-mangled so that it's easier to debug certain processes. I could write the outputs to temporary files, but I would like to limit disk IO and would prefer another method if possible; it would also require cleanup and probably wouldn't really improve code readability.
Is there any shell-native method that keeps buffers separated per PID and flushes them to stdout/stderr when the process terminates? Do you see any other way to do this?
Update
I ended up using the tail -n 1000000 trick from @Gem's comment. Since the commands I'm using are long (covering multiple lines) and I was already using subshells ( ... ) &, it was quite a minimal change from ( ... ) & to ( ... ) 2>&1 | tail -n 1000000 &.
You can do that with GNU Parallel. Use -k to keep the output in order and ::: to separate the arguments you want passed to your program.
Here we run 5 instances of echo in parallel:
parallel -k echo {} ::: {0..4}
0
1
2
3
4
Now add in --tag to tag your output lines with the filenames or parameters you are using:
parallel --tag -k 'echo "Line 1, param {}"; echo "Line 2, param {}"' ::: {1..4}
1 Line 1, param 1
1 Line 2, param 1
2 Line 1, param 2
2 Line 2, param 2
3 Line 1, param 3
3 Line 2, param 3
4 Line 1, param 4
4 Line 2, param 4
You should notice that each line is tagged on the left side with the parameters and that the two lines from each job are kept together.
You can now specify how your output is organised; a short sketch follows the list below.
Use --group to group output by job
Use --line-buffer to buffer a line at a time
Use --ungroup if you want output all mixed up, but as soon as available
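A hedged sketch of the difference, using sleep as a stand-in for a slow job: with --line-buffer the "start" lines of both jobs appear immediately and interleave line by line, whereas with the default --group each job's output appears in one piece only when that job finishes, which is exactly the "flush on termination" behaviour asked about.
parallel --tag --line-buffer 'echo start {}; sleep {}; echo done {}' ::: 2 1
parallel --tag --group 'echo start {}; sleep {}; echo done {}' ::: 2 1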
Sounds like you just want syslog, or rather logger, its command-line interface. Example:
echo "Something happened!" | logger -i -p local0.notice
If you insist on getting the output to stderr too, use --stderr. rsyslog will handle buffering, atomic writes, etc., and is presumably pretty good at optimizing disk I/O. However, you could also easily configure rsyslog to route the log facility (i.e. local0 or whatever you choose to use) wherever you want, such as to a tmpfs or a dedicated disk, or even over TCP. See /etc/rsyslog.conf.
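As a hedged example of that routing, a single rsyslog rule like the one below (the log path is a placeholder, and the right file to edit may be /etc/rsyslog.conf itself or a drop-in under /etc/rsyslog.d/) sends everything logged to local0 into its own file:
local0.*    /var/log/parallel-jobs.log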

Why does GNU parallel become less and less effective?

I have a file containing 1 000 000 domain names and I'm currently launching the script testssl.sh (http://testssl.sh) on each domain in the list (i.e. each line of the file). I'm using GNU parallel to improve performance. Here is how I launch testssl.sh with GNU parallel:
cat listDomainNames.txt | parallel --no-notice -j0 --workdir $PWD ./testMX.sh
where testMX.sh launches testssl.sh:
./testssl.sh --starttls smtp --vulnerable --server-preference -mx --append --csvfile result.csv $1
At the beginning, my script tests domain names very quickly (5 000 in a single hour), but after several hours it becomes really slow (like 1 domain per minute). Any idea what is happening? Thanks in advance!
More and more processes will be hanging, waiting for a timeout: with -j0 you run as many jobs as possible, so hung processes from slow or unresponsive domains accumulate over time.
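One possible mitigation, sketched here rather than taken from the answer above (the -j 50 and --timeout 300 values are arbitrary placeholders): cap the number of simultaneous jobs instead of using -j0, and let GNU Parallel kill jobs that run longer than the timeout.
cat listDomainNames.txt | parallel --no-notice -j 50 --timeout 300 --workdir $PWD ./testMX.sh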

GNU Parallel with processes that further fork

Consider the file Processes.txt
./MyProcess 1 -nbThreads 2
./MyProcess 2 -nbThreads 2
./MyProcess 3 -nbThreads 2
where each MyProcess will attempt to use two cores. Now consider running
parallel -j 3 :::: Processes.txt
The call to parallel specifically says to use no more than 3 cores. Will parallel allow each MyProcess to fork further so that the whole thing uses 6 cores, or will it somehow force the three MyProcess instances to use only one core each?
It will run three processes at once and if they choose to create further processes it will neither know nor care.
(Hat tip to Mark Setchell)
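A hedged sketch of doing that bookkeeping yourself, assuming each entry in Processes.txt really uses 2 cores: divide the job slots so that jobs times threads roughly matches the core count, since parallel will not do this for you.
parallel -j $(( $(nproc) / 2 )) :::: Processes.txt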
