Running bash script in parallel - bash

I have a very simple command that I would like to execute in parallel rather than sequential.
for i in ../data/*; do ./run.sh "$i"; done
run.sh processes the input files from the ../data directory, and I would like to process them all at the same time using a shell script rather than a Python program or something like that. Is there a way to do this using GNU Parallel?

You can try this:
shopt -s nullglob
FILES=(../data/*)
[[ ${#FILES[@]} -gt 0 ]] && printf '%s\0' "${FILES[@]}" | parallel -0 --jobs 2 ./run.sh

I have not used GNU Parallel but you can use & to run your script in the background. Add a wait (optional) later if you want to wait for all the scripts to finish.
for i in ../data/*; do ./run.sh "$i" & done
# Below wait command is optional
wait
echo "All scripts executed"

You can try this:
find ../data -maxdepth 1 -name '[^.]*' -print0 | parallel -0 --jobs 2 ./run.sh
The -name argument of the find command is needed because your example used the shell glob ../data/*, which skips files starting with a dot, so find has to ignore them as well.
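If run.sh should only ever receive regular files (an assumption; the glob would also match directories), -type f can be added to the same command:
find ../data -maxdepth 1 -type f -name '[^.]*' -print0 | parallel -0 --jobs 2 ./run.sh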

Related

How to run on BASH a python script that takes multiple arguments using GNU parallel?

I have a python script which I normally execute from BASH shell like this:
pychimera $(which dockprep.py) -rec receptor1.pdb -lig ligand1.mol -cmethod gas -neut
As you can see, some of the arguments need an input (e.g. -rec) while others don't (e.g. -neut). I must execute this script 154 times with different inputs. How can I run 8 of these in parallel using GNU parallel?
pychimera $(which dockprep.py) -rec receptor1.pdb -lig ligand1.mol -cmethod gas -neut
pychimera $(which dockprep.py) -rec receptor2.pdb -lig ligand2.mol -cmethod gas -neut
pychimera $(which dockprep.py) -rec receptor3.pdb -lig ligand3.mol -cmethod gas -neut
...
I think you want this:
parallel 'pychimera $(which dockprep.py) -rec receptor{}.pdb -lig ligand{}.mol -cmethod gas -neut' ::: {1..154}
If you have other than 8 CPU cores, and specifically want 8 processes at a time, use:
parallel -j8 ...
If you want to see the commands that would be run without actually running anything, use:
parallel --dry-run ...
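Putting those together, a combined sketch that previews the 154 commands first and then runs them 8 at a time:
parallel -j8 --dry-run 'pychimera $(which dockprep.py) -rec receptor{}.pdb -lig ligand{}.mol -cmethod gas -neut' ::: {1..154}
parallel -j8 'pychimera $(which dockprep.py) -rec receptor{}.pdb -lig ligand{}.mol -cmethod gas -neut' ::: {1..154}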
Example commands.txt generator script:
#!/usr/bin/env bash
if [ "$#" -ne 1 ]; then
echo "missing parameter: n"
exit 1
fi
rm commands.txt 2> /dev/null
dockp=$(which dockprep.py)
for((i=1;i<=$1;i++)); do
echo "pychimera $dockp -rec receptor$i.pdb -lig ligand$i.mol -cmethod gas -neut" >> commands.txt
done
If you save the above bash script as cmdgen.sh, you can run it as:
bash cmdgen.sh 100
if you need n to be 100.
To run commands in parallel:
$ module load parallel
$ parallel < commands.txt
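Since the question asks for exactly 8 processes at a time, and a record of what ran can be useful, a possible variant using standard GNU Parallel options:
parallel -j8 --joblog run.log < commands.txt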

running each element in array in parallel in bash script

Let's say I have a bash script that looks like this:
array=( 1 2 3 4 5 6 )
for each in "${array[@]}"
do
echo "$each"
command --arg1 $each
done
If I want to run everything in the loop in parallel, I could just change command --arg1 $each to command --arg1 $each &.
But now let's say I want to take the results of command --arg1 $each and do something with those results like this:
array=( 1 2 3 4 5 6 )
for each in "${array[@]}"
do
echo "$each"
lags=($(command --arg1 $each))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
done
If I just add a & to the end of command --arg1 $each, everything after it will run without command --arg1 $each finishing first. How do I prevent that from happening? Also, how do I limit the number of threads the loop can occupy?
Essentially, this block should run in parallel for 1,2,3,4,5,6
echo "$each"
lags=($(command --arg1 $each))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
-----EDIT--------
Here is the original code:
#!/bin/bash
export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka.client.jaas.conf"
IFS=$'\n'
array=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --list --command-config /etc/kafka/client.properties --new-consumer))
lngth=${#array[*]}
echo "array length: " $lngth
timestamp=$(($(date +%s%N)/1000000))
log_time=`date +%Y-%m-%d:%H`
echo "log time: " $log_time
log_file="/home/ec2-user/laglogs/laglog.$log_time.log"
echo "log file: " $log_file
echo "timestamp: " $timestamp
get_lags () {
echo "$1"
lags=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --describe --group $1 --command-config /etc/kafka/client.properties --new-consumer))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
}
for each in "${array[@]}"
do
get_lags $each &
done
------EDIT 2-----------
Trying with answer below:
#!/bin/bash
export KAFKA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.auth.login.config=/etc/kafka/kafka.client.jaas.conf"
IFS=$'\n'
array=($(kafka-consumer-groups --bootstrap-server kafka1:9092 --list --command-config /etc/kafka/client.properties --new-consumer))
lngth=${#array[*]}
echo "array length: " $lngth
timestamp=$(($(date +%s%N)/1000000))
log_time=`date +%Y-%m-%d:%H`
echo "log time: " $log_time
log_file="/home/ec2-user/laglogs/laglog.$log_time.log"
echo "log file: " $log_file
echo "timestamp: " $timestamp
max_proc_count=8
run_for_each() {
local each=$1
echo "Processing: $each" >&2
IFS=$'\n' read -r -d '' -a lags < <(kafka-consumer-groups --bootstrap-server kafka1:9092 --describe --command-config /etc/kafka/client.properties --new-consumer --group "$each" && printf '\0')
for result in "${lags[@]}"; do
printf '%(%Y-%m-%dT%H:%M:%S)T\t%s\t%s\n' -1 "$each" "$result"
done >>"$log_file"
}
export -f run_for_each
export log_file # make log_file visible to subprocesses
printf '%s\0' "${array[@]}" |
xargs -P "$max_proc_count" -n 1 -0 bash -c 'run_for_each "$@"' _
The convenient thing to do is to push your background code into a separate script -- or an exported function. That way xargs can create a new shell, and access the function from its parent. (Be sure to export any other variables that need to be available in the child as well).
array=( 1 2 3 4 5 6 )
max_proc_count=8
log_file=out.txt
run_for_each() {
local each=$1
echo "Processing: $each" >&2
IFS=$' \t\n' read -r -d '' -a lags < <(yourcommand --arg1 "$each" && printf '\0')
for result in "${lags[@]}"; do
printf '%(%Y-%m-%dT%H:%M:%S)T\t%s\t%s\n' -1 "$each" "$result"
done >>"$log_file"
}
export -f run_for_each
export log_file # make log_file visible to subprocesses
printf '%s\0' "${array[@]}" |
xargs -P "$max_proc_count" -n 1 -0 bash -c 'run_for_each "$@"' _
Some notes:
Using echo -e is bad form. See the APPLICATION USAGE and RATIONALE sections in the POSIX spec for echo, which explicitly advise using printf instead (the spec does not define an -e option, and explicitly states that echo must not accept any options other than -n).
We're including the each value in the log file so it can be extracted from there later.
You haven't specified whether the output of yourcommand is space-delimited, tab-delimited, line-delimited, or otherwise. I'm thus accepting all these for now; modify the value of IFS passed to the read to taste.
printf '%(...)T' to get a timestamp without external tools such as date requires bash 4.2 or newer. Replace with your own code if you see fit.
read -r -a arrayname < <(...) is much more robust than arrayname=( $(...) ). In particular, it avoids treating emitted values as globs -- replacing *s with a list of files in the current directory, or Foo[Bar] with FooB should any file by that name exist (or, if the failglob or nullglob options are set, triggering a failure or emitting no value at all in that case).
Redirecting stdout to your log_file once for the entire loop is somewhat more efficient than redirecting it every time you want to run printf once. Note that having multiple processes writing to the same file at the same time is only safe if all of them opened it with O_APPEND (which >> will do), and if they're writing in chunks small enough to individually complete as single syscalls (which is probably happening unless the individual lags values are quite large).
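To make the globbing hazard mentioned in the read note concrete, here is a minimal demonstration; the file name and the emit helper are hypothetical stand-ins:
cd "$(mktemp -d)" || exit 1
touch FooBar                               # a file that happens to match a glob
emit() { printf 'FooB[a]r *\n'; }          # stand-in for a command's output

unsafe=( $(emit) )                         # word splitting + globbing: (FooBar FooBar)
IFS=$' \t\n' read -r -a safe < <(emit)     # no globbing: (FooB[a]r *)
declare -p unsafe safe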
A lot of lengthy and theoretical answers here, I'll try to keep it simple - what about using | (pipe) to connect the commands as usual? ;) (And GNU parallel, which excels at this type of task.)
seq 6 | parallel -j4 "command --arg1 {} | command2 > results/{}"
The -j4 will limit the number of threads (jobs) as requested. You DON'T want to write to a single file from multiple jobs; output one file per job and join them after the parallel processing is finished, as sketched below.
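A minimal sketch of that join step, assuming the same hypothetical command and a results/ directory:
mkdir -p results
seq 6 | parallel -j4 "command --arg1 {} | command2 > results/{}"
cat results/* > combined.out    # parallel only returns once every job is done, so the join is safe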
Using GNU Parallel it looks like this:
array=( 1 2 3 4 5 6 )
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > output
GNU Parallel makes sure output from different jobs is not mixed.
If you prefer the output from jobs mixed:
parallel -0 --bar --line-buffer --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > output-linebuffer
Again GNU Parallel makes sure to only mix with full lines: You will not see half a line from one job and half a line from another job.
It also works if the array is a bit more nasty:
array=( "new
line" 'quotes" '"'" 'echo `do not execute me`')
Or if the command prints long lines in halves:
command() {
echo Input: "$#"
echo '" '"'"
sleep 1
echo -n 'Half a line '
sleep 1
echo other half
superlong_a=$(perl -e 'print "a"x1000000')
superlong_b=$(perl -e 'print "b"x1000000')
echo -n $superlong_a
sleep 1
echo $superlong_b
}
export -f command
GNU Parallel strives to be a general solution. This is because I have designed GNU Parallel to care about correctness and try vehemently to deal correctly with corner cases, too, while staying reasonably fast.
GNU Parallel guards against race conditions and does not split a job's output mid-word.
array=( $(seq 30) )
max_proc_count=30
command() {
# If 'a', 'b' and 'c' mix: Very bad
perl -e 'print "a"x3000_000," "'
perl -e 'print "b"x3000_000," "'
perl -e 'print "c"x3000_000," "'
echo
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
# 'abc' should always stay together
# and there should only be a single line per job
cat parallel.out | tr -s abc
GNU Parallel works fine if the output has a lot of words:
array=(1)
command() {
yes "`seq 1000`" | head -c 10M
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
GNU Parallel does not eat all your memory - even if the output is bigger than your RAM:
array=(1)
outputsize=1000M
export outputsize
command() {
yes "`perl -e 'print \"c\"x30_000'`" | head -c $outputsize
}
export -f command
parallel -0 --bar --tagstring '{= $_=localtime(time)."\t".$_; =}' \
command --arg1 {} ::: "${array[@]}" > parallel.out
You know how to execute commands in separate processes. The missing part is how to allow those processes to communicate, as separate processes cannot share variables.
Basically, you must choose whether to communicate using regular files or inter-process communication/FIFOs (which still boils down to using files).
The general approach:
Decide how you want to present the tasks to be executed. You could have them as separate files on the filesystem, as a FIFO special file that can be read from, etc. This could be as simple as writing each command to be executed to a separate file, or writing each command to a FIFO (one command per line).
In the main process, prepare the files describing tasks to perform or launch a separate process in the background that will feed the FIFO.
Then, still in the main process, launch worker processes in the background (with &), as many of them as you want parallel tasks being executed (not one per task to perform). Once they have been launched, use wait to, well, wait until all processes are finished. Since separate processes cannot share variables, you will have to write any output that needs to be used later to separate files, or a FIFO, etc. If using a FIFO, remember that more than one process can write to it at the same time, so use some kind of mutex mechanism (I suggest looking into the use of mkdir/rmdir for that purpose).
Each worker process must fetch the next task (from a file/FIFO), execute it, generate the output (to a file/FIFO), loop until there are no new tasks, then exit. If using files, you will need to use a mutex to "reserve" a file, read it, and then delete it to mark it as taken care of. This would not be needed for a FIFO.
Depending on the case, your main process may have to wait until all tasks are finished before handling the output, or in some cases may launch a worker process that will detect and handle output as it appears. This worker process would have to either be stopped by the main process once all tasks have been executed, or figure out for itself when all tasks have been executed and exit (while being waited on by the main process).
This is not detailed code, but I hope it gives you an idea of how to approach problems like this.
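A minimal sketch of that approach, using one file per task and a mkdir-based mutex as suggested above; every name and the task commands themselves are hypothetical:
K=4
taskdir=tasks outdir=results lockdir=.lock
mkdir -p "$taskdir" "$outdir"

# the main process writes one small script per task
for i in 1 2 3 4 5 6; do
    echo "echo processing $i; sleep 1" > "$taskdir/task.$i"
done

worker() {
    while :; do
        # mkdir is atomic, so only one worker can hold the lock at a time
        until mkdir "$lockdir" 2>/dev/null; do sleep 0.1; done
        task=$(ls "$taskdir" 2>/dev/null | head -n 1)
        [ -n "$task" ] && mv "$taskdir/$task" "claimed.$BASHPID"
        rmdir "$lockdir"
        [ -z "$task" ] && break                        # no tasks left: exit
        bash "claimed.$BASHPID" > "$outdir/$task.out"  # run the task outside the lock
        rm -f "claimed.$BASHPID"
    done
}

# launch K workers, then wait for all of them before reading the results
for _ in $(seq "$K"); do worker & done
wait
cat "$outdir"/*.out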
(Community Wiki answer with the OP's proposed self-answer from the question -- now edited out):
So here is one way I can think of doing this; I'm not sure if it is the most efficient way, and I also can't control the number of threads (or is it processes?) this would use:
array=( 1 2 3 4 5 6 )
lag_func () {
echo "$1"
lags=($(command --arg1 $1))
lngth_lags=${#lags[*]}
for (( i=1; i<=$(( $lngth_lags -1 )); i++))
do
result=${lags[$i]}
echo -e "$timestamp\t$result" >> $log_file
echo "result piped"
done
}
for each in "${array[@]}"
do
lag_func $each &
done

How to process files concurrently with bash?

Suppose I have 10K files and a bash script which processes a single file. Now I would like to process all these files concurrently, with only K scripts running in parallel. I do not want (obviously) to process any file more than once.
How would you suggest implementing this in bash?
One way of executing a limited number of parallel jobs is with GNU parallel. For example, with this command:
find . -type f -print0 | parallel -0 -P 3 ./myscript {1}
You will pass all files in the current directory (and its subdirectories) as parameters to myscript, one at a time. The -0 option sets the delimiter to be the null character, and the -P option sets the number of jobs that are executed in parallel. The default number of parallel processes is equal to the number of cores in the system. There are other options for parallel processing in clusters etc., which are documented in the GNU Parallel documentation.
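For example, if the 10K files all sit directly in the current directory and you want exactly K jobs at a time (K=4 here is arbitrary):
K=4
find . -maxdepth 1 -type f -print0 | parallel -0 -P "$K" ./myscript {}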
In bash you can easily run part of the script in a separate process just by using ( and ). If you add &, then the parent process will not wait for the child. So you in fact use ( command1; command2; command3; ... ) &:
while ... do
(
your script goes here, executed in a separate process
) &
CHILD_PID=$!
done
The $! gives you the PID of the child process. What else do you need to know? Once you have launched k processes, you need to wait for them before launching more. This is done using wait <PID>:
wait $CHILD_PID
If you want to wait for all of them, just use wait.
This should be sufficient for you to implement the system.
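A minimal sketch of that idea, processing the files in batches of k; the per-file script name is hypothetical:
k=3
count=0
for f in *; do
    (
        ./process_one.sh "$f"    # hypothetical per-file work
    ) &
    count=$((count + 1))
    if [ "$count" -ge "$k" ]; then
        wait                     # wait for the current batch of k children
        count=0
    fi
done
wait                             # wait for the final, possibly partial, batch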
cnt=0
for f1 in *; do
    (( cnt = cnt + 1 ))
    if [ "$cnt" -le "$k" ]; then
        nohup ./script1 "$f1" &
        continue
    fi
    wait            # a batch of k is running: wait for it before starting the next one
    nohup ./script1 "$f1" &
    cnt=1
done
wait
Please test it, I don't have time to.

GNU parallel processing

I have the following script that I want to run using GNU parallel; it is a for loop that needs to run n times. How can I do this using GNU parallel?
SHARK=tshark
# Create file list
FILELIST=`ls $1`
TEMPDIR=/tmp/foobar
mkdir $TEMPDIR
i=1
for I in $FILELIST; do
echo "$i $I $2"
$SHARK -r $I -w $TEMPDIR/~$I-$i -R "$2" &>/dev/null
i=`echo $i+1|bc`
done
There are a number of ways of doing this, either with sub-shells and sub-processes, see e.g.
Running shell script in parallel
or by installing neat utilities designed to do this, e.g:
|P|P|S|S| - (Distributed) Parallel Processing Shell Script
GNU Parallel
I would try to get it done first with sub-shells, and then try the others if you still need more power.
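For the GNU Parallel route, a possible sketch of the same loop; {} is the input file and {#} is GNU Parallel's job sequence number, standing in for the original $i counter (treat this as a sketch, not a drop-in replacement):
SHARK=tshark
TEMPDIR=/tmp/foobar
mkdir -p "$TEMPDIR"
ls "$1" | parallel -j4 "$SHARK -r {} -w $TEMPDIR/~{}-{#} -R '$2' > /dev/null 2>&1"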

How to correctly wrap multiple command calls in bash?

My problem can be summed up by making this simple command work:
nice -n 10 "ls|xargs -I% echo \"%\""
Which fails:
nice: ls|xargs -I% echo "%": No such file or directory
Removing the quotes makes it work, but my point is to wrap multiple quoted commands into one to do something more complex, like:
ftphost="192.168.1.1"
dirinputtopush="/tmp/archivedir/"
ftpoutputdir="mydir/"
nice -n 19 ls $dirinputtopush | xargs -I% "lftp $ftphost -e \"mirror -R $dirinputtopush% $ftpoutputdirrecent ;quit\"; sleep 10"
Try using nice -n 10 bash -c 'your; commands | or_complex pipelines' as command. This way bash is the binary and the string after -c contains a sequence interpreted by bash so it can contain pipelines, loops etc. Watch out for proper quoting. You need to do it this way because nice expects a binary, not expressions interpreted by the shell. In contrast, shell builtins such as time (but not /usr/bin/time which is a separate binary) will accept shell expressions as the command to execute. They can because they're built into the shell. nice is not, so it requires a binary to execute.
Children inherit nice value:
nice -n 10 bash -c 'ls | xargs -I% echo %'
Nice each command separately:
nice -n 10 ls | nice -n 10 xargs -I% echo %
