bash script only starts new jobs when old batch is finished

I am trying to run 24 versions of the same code on an 8 core machine. The code takes many many hours to run and I only want to run 8 at a time so I was wondering if it was possible to write a bash script which would run 8 and then when those were complete immediately start the next 8 and so on?
I basically don't want all 24 to start and then run incredibly slowly!
Thanks,
Jack
EDIT 1: (More details on the run)
The code runs with the following command:
nohup ./MyCode MyInputFile 2> Myoutput

You could use GNU parallel:
seq 1 24 | parallel -P 8 ./myscript
Or with xargs:
seq 1 24 | xargs -l -P 8 ./myscript
Update:
If you want to run the script with Myinput1 Myinput2 Myinput3 .. as parameters, you can do:
find . -name 'Myinput*' -print0 | parallel -0 -P 8 ./myscript {1}
or with your command:
find . -name 'Myinput*' -print0 | parallel -0 -P 8 nohup ./myscript {1} 2> Myoutput
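One caveat with that last command: the shell applies the unquoted 2> Myoutput to parallel itself, so all jobs' error output lands interleaved in a single file. If a separate error file per job is wanted, one option is to quote the whole command so each job gets its own redirection (a sketch; the {}.err names are an assumption, not part of the answer above):
find . -name 'Myinput*' -print0 |
  parallel -0 -P 8 './myscript {} 2> {}.err'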

Yet another way:
for f1 in {1..3}; do
    for f2 in {1..8}; do
        echo "$f1,$f2;"
        nohup ./MyCode MyInputFile 2> Myoutput &
    done
    wait
done
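The loop above starts the same MyInputFile and overwrites the same Myoutput in every job; a sketch of the same batching idea with numbered inputs and outputs (MyInputFile1 through MyInputFile24 and Myoutput1 through Myoutput24 are assumed names) would be:
for batch in 0 1 2; do
    for j in {1..8}; do
        n=$(( batch * 8 + j ))
        nohup ./MyCode "MyInputFile$n" 2> "Myoutput$n" &
    done
    wait    # block until all 8 jobs in this batch have finished
done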

Related

GNU Parallel: Run bash code that reads (seq number) from pipe?

I would like parallel to read the (seq numbers) pipe, so I would like to run something like this:
seq 2000 | parallel --max-args 0 --jobs 10 "{ read test; echo $test; }"
which would be equivalent to running:
echo 1
echo 2
echo 3
echo 4
...
echo 2000
But unfortunately, the pipe was not read by parallel, meaning that it instead ran like:
echo
echo
echo
...
echo
And the output is empty.
Does anyone know how to make parallel read the (seq numbers) pipe? Thanks.
An alternative with GNU xargs that does not require GNU parallel:
seq 2000 | xargs -P 10 -I {} "echo" "hello world {}"
Output:
hello world 1
hello world 2
hello world 3
hello world 4
hello world 5
...
From man xargs:
-P max-procs: Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time.
-I replace-str: Replace occurrences of replace-str in the initial-arguments with names read from standard input.
You want the input to be piped into the command you run, so use --pipe:
seq 2000 |
parallel --pipe -N1 --jobs 10 'read test; echo $test;'
But if you really just need it for a variable, I would do one of these:
seq 2000 | parallel --jobs 10 echo
seq 2000 | parallel --jobs 10 echo {}
seq 2000 | parallel --jobs 10 'test={}; echo $test'
I encourage you to spend 20 minutes reading chapters 1+2 of https://doi.org/10.5281/zenodo.1146014. Your command line will love you for it.
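To make the distinction concrete: with --pipe -N1 each job receives one line on its standard input (which is why read works), whereas without --pipe each input line is passed as a command-line argument. A small sketch of the two behaviours:
printf '1\n2\n3\n' | parallel --pipe -N1 --jobs 10 'read x; echo "stdin: $x"'
printf '1\n2\n3\n' | parallel --jobs 10 'echo "arg: {}"'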
Using xargs instead of parallel, while still using a shell (instead of starting up a new copy of the /bin/echo executable per line), would look like:
seq 2000 | xargs -P 10 \
  sh -c 'for arg in "$@"; do echo "hello world $arg"; done' _
This is likely to be faster than the existing answer by Cyrus, because starting executables takes time. Even though starting a new copy of /bin/sh takes longer than starting a copy of /bin/echo, because this isn't using -I {} it can pass many arguments to each copy of /bin/sh, amortizing that start-up cost over more numbers; and this way we're able to use the echo built into sh instead of the separate echo executable.
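If you want to control how many numbers each sh invocation handles, the batch size can be made explicit with -n (a sketch; 100 per batch is an arbitrary choice):
seq 2000 | xargs -n 100 -P 10 \
  sh -c 'for arg in "$@"; do echo "hello world $arg"; done' _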

Run jobs in sequence rather than consecutively using bash

So I work a lot with Gaussian 09 (the computational chemistry software) on a supercomputer.
To submit a job I use the following command line
g09sub input.com -n 2 -m 4gb -t 200:00:00
Where n is the number of processors used, m is the memory requested, and t is the time requested.
I was wondering if there was a way to write a script that will submit the first 10 .com files in the folder and then submit another .com file as each finishes.
I have a script that will submit all the .com files in a folder at once, but I have a limit to how many jobs I can queue on the supercomputer I use.
The current script looks like
#!/bin/bash
#SBATCH --partition=shared
for i in *.com
do
    g09sub $i -n 2 -m 4gb -t 200:00:00
done
So 1.com, 2.com, 3.com, etc would be submitted all at the same time.
What I want is to have 1.com, 2.com, 3.com, 4.com, 5.com, 6.com, 7.com, 8.com, 9.com, and 10.com all start at the same time and then as each of those finishes have another .com file start. So that no more than 10 jobs from any one folder will be running at the same time.
If it would be useful, each job creates a .log file when it is finished.
Though I am unsure if it is important, the supercomputer uses a PBS queuing system.
Try xargs or GNU parallel.
xargs:
ls *.com | xargs -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
-I {} tells xargs that {} stands for the input file name
-P 10 sets the maximum number of jobs to run at once
parallel:
ls *.com | parallel -P 10 g09sub {} -n 2 -m 4gb -t 200:00:00 # GNU parallel supports -P too
ls *.com | parallel --jobs 10 g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
{} stands for the input file name
--jobs 10 sets the maximum number of jobs to run at once
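If any of the .com names could contain spaces, parsing the output of ls is fragile; a NUL-delimited variant built on the same idea (a sketch) would be:
printf '%s\0' *.com | xargs -0 -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00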
Not sure about the availability on your supercomputer, but the GNU bash manual offers a parallel example under 3.2.6 GNU Parallel, at the bottom.
There are ways to run commands in parallel that are not built into Bash. GNU Parallel is a tool to do just that.
...
Finally, Parallel can be used to run a sequence of shell commands in parallel, similar to ‘cat file | bash’. It is not uncommon to take a list of filenames, create a series of shell commands to operate on them, and feed that list of commands to a shell. Parallel can speed this up. Assuming that file contains a list of shell commands, one per line,
parallel -j 10 < file
will evaluate the commands using the shell (since no explicit command
is supplied as an argument), in blocks of ten shell jobs at a time.
Where that option was not available to me, using the jobs builtin worked, if rather crudely, e.g.:
for entry in *.com; do
    while [ $(jobs | wc -l) -gt 9 ]; do
        sleep 1  # this is in seconds; your sleep may support an arbitrary floating point number
    done
    g09sub ${entry} -n 2 -m 4gb -t 200:00:00 &
done
$(jobs | wc -l) counts the number of jobs spawned into the background by the trailing &.
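On bash 4.3 or newer, the polling sleep can be replaced with wait -n, which blocks until any one background job exits (a sketch of that variant, not tested on the cluster in question):
for entry in *.com; do
    while (( $(jobs -r | wc -l) >= 10 )); do
        wait -n    # returns as soon as one background job finishes
    done
    g09sub "${entry}" -n 2 -m 4gb -t 200:00:00 &
done
wait    # let the final batch drain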

Running programs in parallel using xargs

I currently have the following script.
#!/bin/bash
# script.sh
for i in {0..99}; do
    script-to-run.sh input/ output/ $i
done
I wish to run it in parallel using xargs. I have tried
script.sh | xargs -P8
But doing the above only executes one at a time. No luck with -n8 either.
Adding & at the end of the line to be executed inside the script's for loop would try to run the script 99 times at once. How do I execute the loop only 8 at a time, up to 100 total?
From the xargs man page:
This manual page documents the GNU version of xargs. xargs reads items
from the standard input, delimited by blanks (which can be protected
with double or single quotes or a backslash) or newlines, and executes
the command (default is /bin/echo) one or more times with any initial-
arguments followed by items read from standard input. Blank lines on
the standard input are ignored.
Which means that for your example xargs is waiting and collecting all of the output from your script and then running echo <that output>. Not exactly all that useful nor what you wanted.
The -n argument is how many items from the input to use with each command that gets run (nothing, by itself, about parallelism here).
To do what you want with xargs you would need to do something more like this (untested):
printf %s\\n {0..99} | xargs -n 1 -P 8 script-to-run.sh input/ output/
Which breaks down like this:
printf %s\\n {0..99} prints one number per line, from 0 to 99.
xargs then takes at most one argument per command invocation (-n 1) and runs up to eight processes at a time (-P 8).
With GNU Parallel you would do:
parallel script-to-run.sh input/ output/ {} ::: {0..99}
Add in -P8 if you do not want to run one job per CPU core.
Unlike xargs, it will do The Right Thing, even if the input contains spaces, ', or " (not the case here, though). It also makes sure the output from different jobs is not mixed together, so if you use the output you are guaranteed that you will not get half a line from two different jobs.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
You can use this simple one-line command:
seq 1 500 | xargs -n 1 -P 8 script-to-run.sh input/ output/
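To match the question's range of 0 through 99 exactly, the same idea would be:
seq 0 99 | xargs -n 1 -P 8 script-to-run.sh input/ output/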
Here's an example running commands in parallel in conjunction with find:
find -name "*.wav" -print0 | xargs -0 -t -I % -P $(nproc) flac %
-print0 terminates filenames with a null byte rather than a newline, so we can use -0 in xargs to prevent filenames with spaces being treated as two separate arguments.
-t means verbose: it makes xargs print every command it executes, which can be useful; remove it if not needed.
-I % means replace occurrences of % in the command with arguments read from standard input.
-P $(nproc) means run a maximum of nproc instances of our command in parallel (nproc prints the number of available processing units).
flac % is our command, the -I % from earlier means this will become flac foo.wav
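If you need an explicit output name per file, one option (a sketch; the ${1%.wav}.flac naming is my assumption, not part of the answer above) is to wrap the command in a small shell so it can derive the name:
find . -name "*.wav" -print0 |
  xargs -0 -P "$(nproc)" -I % sh -c 'flac -o "${1%.wav}.flac" "$1"' _ %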
See also: Manual for xargs(1)

How to run parallel for loops

I'm not very familiar with bash, but I would like to split up this code so that I can run it on a server with 12 processors:
#!/bin/bash
#bashScript.sh
for i in {1..209}
do
    Rscript Compute.R $i
done
How would I go about achieving this?
Thanks!
Use xargs with the option --max-procs (-P). If there are enough arguments, xargs will use exactly this number of concurrent processes to process the input:
#! /bin/bash
seq 209 |
xargs -P12 -r -n1 Rscript Compute.R
Try:
#!/bin/bash
#bashScript.sh
for i in {1..209}
do
    Rscript Compute.R $i &
done
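Note that this backgrounds all 209 runs at once. A sketch that keeps the plain loop but caps it at 12 concurrent jobs would be to wait after each batch of 12:
for i in {1..209}; do
    Rscript Compute.R $i &
    (( i % 12 == 0 )) && wait    # pause until the current batch of 12 finishes
done
wait    # wait for the final partial batch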
Use GNU Parallel:
parallel Rscript Compute.R ::: {1..209}
10-second installation:
wget -O - pi.dk/3 | sh
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Shell Scripting: Using xargs to execute parallel instances of a shell function

I'm trying to use xargs in a shell script to run parallel instances of a function I've defined in the same script. The function times the fetching of a page, and so it's important that the pages are actually fetched concurrently in parallel processes, and not in background processes (if my understanding of this is wrong and there's negligible difference between the two, just let me know).
The function is:
function time_a_url ()
{
    oneurltime=$($time_command -p wget -p $1 -O /dev/null 2>&1 1>/dev/null | grep real | cut -d" " -f2)
    echo "Fetching $1 took $oneurltime seconds."
}
How does one do this with an xargs pipe, in a form that can take the number of times to run time_a_url in parallel as an argument? And yes, I know about GNU parallel, I just don't have the privilege to install software where I'm writing this.
Here's a demo of how you might be able to get your function to work:
$ f() { echo "[$*]"; }
$ export -f f
$ echo -e "b 1\nc 2\nd 3 4" | xargs -P 0 -n 1 -I{} bash -c f\ \{\}
[b 1]
[d 3 4]
[c 2]
The keys to making this work are to export the function so the bash that xargs spawns will see it and to escape the space between the function name and the escaped braces. You should be able to adapt this to work in your situation. You'll need to adjust the arguments for -P and -n (or remove them) to suit your needs.
You can probably get rid of the grep and cut. If you're using the Bash builtin time, you can specify an output format using the TIMEFORMAT variable. If you're using GNU /usr/bin/time, you can use the --format argument. Either of these will allow you to drop the -p also.
You can replace this part of your wget command: 2>&1 1>/dev/null with -q. In any case, you have those reversed. The correct order would be >/dev/null 2>&1.
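Putting it together for the original function, an adaptation might look like this (an untested sketch; urls.txt is a hypothetical list of URLs, and the degree of parallelism is taken from the script's first argument):
export -f time_a_url
export time_command    # the function references this variable, so the child shells need it too
xargs -P "$1" -I{} bash -c 'time_a_url "$@"' _ {} < urls.txt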
On Mac OS X, xargs -P 0 fails with:
xargs: max. processes must be >0 (for: xargs -P [>0])
so give -P an explicit positive number instead:
f() { echo "[$*]"; }
export -f f
echo -e "b 1\nc 2\nd 3 4" | sed 's/ /\\ /g' | xargs -P 10 -n 1 -I{} bash -c f\ \{\}
Or, passing the replacement string through as positional arguments:
echo -e "b 1\nc 2\nd 3 4" | xargs -P 10 -I '{}' bash -c 'f "$@"' arg0 '{}'
If you install GNU Parallel on another system, you will see the functionality is in a single file (called parallel).
You should be able to simply copy that file to your own ~/bin.
