GNU Parallel: Run bash code that reads (seq number) from pipe? - bash

I would like parallel to read the (seq numbers) pipe, so I would like running something like that:
seq 2000 | parallel --max-args 0 --jobs 10 "{ read test; echo $test; }"
Would be equivalent to running:
echo 1
echo 2
echo 3
echo 4
...
echo 2000
But unfortunately, the pipe was not read by parallel, meaning that it was instead ran like:
echo
echo
echo
...
echo
And the output is empty.
Does anyone know how to make parallel read (seq numbers) pipe? Thanks.

An alternative with GNU xargs that does not require GNU parallel:
seq 2000 | xargs -P 10 -I {} "echo" "hello world {}"
Output:
hello world 1
hello world 2
hello world 3
hello world 4
hello world 5
.
.
.
From man xargs:
-P max-procs: Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time.
-I replace-str: Replace occurrences of replace-str in the initial-arguments with names read from standard input.

You want the input to be piped into the command you run, so use --pipe:
seq 2000 |
parallel --pipe -N1 --jobs 10 'read test; echo $test;'
But if you really just need it for a variable, I would do one of these:
seq 2000 | parallel --jobs 10 echo
seq 2000 | parallel --jobs 10 echo {}
seq 2000 | parallel --jobs 10 'test={}; echo $test'
I will encourage you to spend 20 minutes on reading chapter 1+2 of https://doi.org/10.5281/zenodo.1146014 Your command line will love you for it.

Using xargs instead of parallel while still using a shell (instead of starting up a new copy of the /bin/echo executable per line to run) would look like:
seq 2000 | xargs -P 10 \
sh -c 'for arg in "$#"; do echo "hello world $arg"; done' _
This is likely to be faster than the existing answer by Cyrus, because starting executables takes time; even though starting a new copy of /bin/sh takes longer than starting a copy of /bin/echo, because this isn't using -I {}, it's able to pass many arguments to each copy of /bin/sh, thus amortizing that time cost over more numbers; and that way we're able to used the copy of echo built into sh, instead of the separate echo executable.

Related

Run jobs in sequence rather than consecutively using bash

So I work a lot with Gaussian 09 (the computational chemistry software) on a supercomputer.
To submit a job I use the following command line
g09sub input.com -n 2 -m 4gb -t 200:00:00
Where n is the number of processors used, m is the memory requested, and t is the time requested.
I was wondering if there was a way to write a script that will submit the first 10 .com files in the folder and then submit another .com file as each finishes.
I have a script that will submit all the .com files in a folder at once, but I have a limit to how many jobs I can queue on the supercomputer I use.
The current script looks like
#!/bin/bash
#SBATCH --partition=shared
for i in *.com
do g09sub $i -n 2 -m 4gb -t 200:00:00
done
So 1.com, 2.com, 3.com, etc would be submitted all at the same time.
What I want is to have 1.com, 2.com, 3.com, 4.com, 5.com, 6.com, 7.com, 8.com, 9.com, and 10.com all start at the same time and then as each of those finishes have another .com file start. So that no more than 10 jobs from any one folder will be running at the same time.
If it would be useful, each job creates a .log file when it is finished.
Though I am unsure if it is important, the supercomputer uses a PBS queuing system.
Try xargs or GNU parallel
xargs
ls *.com | xargs -I {} g09sub -P 10 {} -n 2 -m 4gb -t 200:00:00
Explanation:
-I {} tell that {} will represent input file name
-P 10 set max jobs at once
parallel
ls *.com | parallel -P 10 g09sub {} -n 2 -m 4gb -t 200:00:00 # GNU parallel supports -P too
ls *.com | parallel --jobs 10 g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
{} represent input file name
--jobs 10 set max jobs at once
Not sure about the availability on your supercomputer, but the GNU bash manual offers a parallel example under 3.2.6 GNU Parallel, at the bottom.
There are ways to run commands in parallel that are not built into Bash. GNU Parallel is a tool to do just that.
...
Finally, Parallel can be used to run a sequence of shell commands in parallel, similar to ‘cat file | bash’. It is not uncommon to take a list of filenames, create a series of shell commands to operate on them, and feed that list of commands to a shell. Parallel can speed this up. Assuming that file contains a list of shell commands, one per line,
parallel -j 10 < file
will evaluate the commands using the shell (since no explicit command
is supplied as an argument), in blocks of ten shell jobs at a time.
Where that option was not available to me, using the jobs function worked rather crudely. eg:
for entry in *.com; do
while [ $(jobs | wc -l) -gt 9 ]; do
sleep 1 # this is in seconds; your sleep may support 'arbitrary floating point number'
done
g09sub ${entry} -n 2 -m 4gb -t 200:00:00 &
done
$(jobs | wc -l) counts the number of jobs spawned in the background by ${cmd} &

xargs output buffering -P parallel

I have a bash function that i call in parallel using xargs -P like so
echo ${list} | xargs -n 1 -P 24 -I# bash -l -c 'myAwesomeShellFunction #'
Everything works fine but output is messed up for obvious reasons (no buffering)
Trying to figure out a way to buffer output effectively. I was thinking I could use awk, but I'm not good enough to write such a script and I can't find anything worthwhile on google? Can someone help me write this "output buffer" in sed or awk? Nothing fancy, just accumulate output and spit it out after process terminates. I don't care the order that shell functions execute, just need their output buffered... Something like:
echo ${list} | xargs -n 1 -P 24 -I# bash -l -c 'myAwesomeShellFunction # | sed -u ""'
P.s. I tried to use stdbuf as per
https://unix.stackexchange.com/questions/25372/turn-off-buffering-in-pipe but did not work, i specified buffering on o and e but output still unbuffered:
echo ${list} | xargs -n 1 -P 24 -I# stdbuf -i0 -oL -eL bash -l -c 'myAwesomeShellFunction #'
Here's my first attempt, this only captures first line of output:
$ bash -c "echo stuff;sleep 3; echo more stuff" | awk '{while (( getline line) > 0 )print "got ",$line;}'
$ got stuff
This isn't quite atomic if your output is longer than a page (4kb typically), but for most cases it'll do:
xargs -P 24 bash -c 'for arg; do printf "%s\n" "$(myAwesomeShellFunction "$arg")"; done' _
The magic here is the command substitution: $(...) creates a subshell (a fork()ed-off copy of your shell), runs the code ... in it, and then reads that in to be substituted into the relevant position in the outer script.
Note that we don't need -n 1 (if you're dealing with a large number of arguments -- for a small number it may improve parallelization), since we're iterating over as many arguments as each of your 24 parallel bash instances is passed.
If you want to make it truly atomic, you can do that with a lockfile:
# generate a lockfile, arrange for it to be deleted when this shell exits
lockfile=$(mktemp -t lock.XXXXXX); export lockfile
trap 'rm -f "$lockfile"' 0
xargs -P 24 bash -c '
for arg; do
{
output=$(myAwesomeShellFunction "$arg")
flock -x 99
printf "%s\n" "$output"
} 99>"$lockfile"
done
' _

Maintaining a set number of concurrent jobs w/ args from a file in bash

I found this script on the net, I don't know to work in bash too much is too weird but..
Here's my script:
CONTOR=0
for i in `cat targets`
do
CONTOR=`ps aux | grep -c php`
while [ $CONTOR -ge 250 ];do
CONTOR=`ps aux | grep -c php`
sleep 0.1
done
if [ $CONTOR -le 250 ]; then
php b $i > /dev/null &
fi
done
My targets are urls, and the b php file is a crawler which save some links into a file. The problem is max numbers of threads is 50-60 and that's because the crawler finish very fast and that bash script code doesn't have time to open my all 250 threads. It's any chance to do something to open all threads (250) ? It is possible to run more than one thread per ps -aux process? Right know seems he open 1 thread after execute ps -aux.
First: Bash has no multithreading support whatsoever. foo & starts a separate process, not a thread.
Second: launching ps to check for children is both prone to false positives (treating unrelated invocations of php as if they were jobs in the current process) and extremely inefficient if done in a loop (since every invocation involves a fork()/exec()/wait() cycle).
Thus, don't do it that way: Use a release of GNU xargs with -P, or (if you must) GNU parallel.
Assuming your targets file is newline-delimited, and has no special quoting or characters, this could be as simple as:
xargs -d $'\n' -n 1 -P 250 php b <targets
...or, for pure POSIX shells:
xargs -d "
" -n 1 -P 250 php b <targets
With GNU Parallel it looks like this (choose the style you like best):
cat targets | parallel -P 250 php b
parallel -a targets -P 250 php b
parallel -P 250 php b :::: targets
There is no risk of false positives if there are other php processes running. And unlike xargs there is no risk if the file targets contain space, " or '.

How to correctly wrap multiple command calls in bash?

My problem can be summed up by making this simple command works :
nice -n 10 "ls|xargs -I% echo \"%\""
Which fails :
nice: ls|xargs -I% echo "%": No such file or directory
Removing the quotes makes it works, but my point is to wrap multiple quoted commands into one to do something more complex like :
ftphost="192.168.1.1"
dirinputtopush="/tmp/archivedir/"
ftpoutputdir="mydir/"
nice -n 19 ls $dirinputtopush | xargs -I% "lftp $ftphost -e \"mirror -R $dirinputtopush% $ftpoutputdirrecent ;quit\"; sleep 10"
Try using nice -n 10 bash -c 'your; commands | or_complex pipelines' as command. This way bash is the binary and the string after -c contains a sequence interpreted by bash so it can contain pipelines, loops etc. Watch out for proper quoting. You need to do it this way because nice expects a binary, not expressions interpreted by the shell. In contrast, shell builtins such as time (but not /usr/bin/time which is a separate binary) will accept shell expressions as the command to execute. They can because they're built into the shell. nice is not, so it requires a binary to execute.
Children inherit nice value:
nice -n 10 bash -c 'ls | xargs -I% echo %'
Nice each command separately:
nice -n 10 ls | nice -n 10 xargs -I% echo %

Shell Scripting: Using xargs to execute parallel instances of a shell function

I'm trying to use xargs in a shell script to run parallel instances of a function I've defined in the same script. The function times the fetching of a page, and so it's important that the pages are actually fetched concurrently in parallel processes, and not in background processes (if my understanding of this is wrong and there's negligible difference between the two, just let me know).
The function is:
function time_a_url ()
{
oneurltime=$($time_command -p wget -p $1 -O /dev/null 2>&1 1>/dev/null | grep real | cut -d" " -f2)
echo "Fetching $1 took $oneurltime seconds."
}
How does one do this with an xargs pipe in a form that can take number of times to run time_a_url in parallel as an argument? And yes, I know about GNU parallel, I just don't have the privilege to install software where I'm writing this.
Here's a demo of how you might be able to get your function to work:
$ f() { echo "[$#]"; }
$ export -f f
$ echo -e "b 1\nc 2\nd 3 4" | xargs -P 0 -n 1 -I{} bash -c f\ \{\}
[b 1]
[d 3 4]
[c 2]
The keys to making this work are to export the function so the bash that xargs spawns will see it and to escape the space between the function name and the escaped braces. You should be able to adapt this to work in your situation. You'll need to adjust the arguments for -P and -n (or remove them) to suit your needs.
You can probably get rid of the grep and cut. If you're using the Bash builtin time, you can specify an output format using the TIMEFORMAT variable. If you're using GNU /usr/bin/time, you can use the --format argument. Either of these will allow you to drop the -p also.
You can replace this part of your wget command: 2>&1 1>/dev/null with -q. In any case, you have those reversed. The correct order would be >/dev/null 2>&1.
On Mac OS X:
xargs: max. processes must be >0 (for: xargs -P [>0])
f() { echo "[$#]"; }
export -f f
echo -e "b 1\nc 2\nd 3 4" | sed 's/ /\\ /g' | xargs -P 10 -n 1 -I{} bash -c f\ \{\}
echo -e "b 1\nc 2\nd 3 4" | xargs -P 10 -I '{}' bash -c 'f "$#"' arg0 '{}'
If you install GNU Parallel on another system, you will see the functionality is in a single file (called parallel).
You should be able to simply copy that file to your own ~/bin.

Resources