mpi resubmit script in bash shell error [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
This is the job script I use to resubmit an MPI job. It was originally written for the tcsh shell; when I tried to rewrite it for bash, I got errors. Please help me correct the script.
##============================================================================
#!/bin/bash
#PBS -l mem=10GB
#PBS -l walltime=12:00:00
#PBS -l nodes=2:ppn=6
#PBS -v NJOBS,NJOB
if [ X$NJOBS == X ]; then
$ECHO "NJOBS (total number of jobs in sequence) is not set - defaulting to 1"
export NJOBS=1
fi
if [ X$NJOB == X ]; then
$ECHO "NJOB (current job number in sequence) is not set - defaulting to 1"
export NJOB=1
fi
#
# Quick termination of job sequence - look for a specific file
#
if [ -f STOP_SEQUENCE ] ; then
$ECHO "Terminating sequence at job number $NJOB"
exit 0
fi
#
# Pre-job file manipulation goes here ...
# =============================================================================
# INSERT CODE
# =============================================================================
module load openmpi/1.4.3
startnum= 0
x=1
i= $(($NJOB + $startnum - $x))
j= $(($i + $x))
$ECHO "This is job $i"
#$ECHO floobuks.$i.blah
#$ECHO flogwhilp.$j.txt
#===========================================================================
# actual execution code
#===========================================================================
# this is just a sample
echo "job $i is followed by $j"
#===========================================================================
RUN COMPLETE
#===========================================================================
#
# Check the exit status
#
errstat=$?
if [ $errstat -ne 0 ]; then
# A brief nap so PBS kills us in normal termination
# If execution line above exceeded some limit we want PBS
# to kill us hard
sleep 5
$ECHO "Job number $NJOB returned an error status $errstat - stopping job sequence."
exit $errstat
fi
#
# Are we in an incomplete job sequence - more jobs to run ?
#
if [ $NJOB -lt $NJOBS ]; then
#
# Now increment counter and submit the next job
#
NJOB=$(($NJOB+1))
$ECHO "Submitting job number $NJOB in sequence of $NJOBS jobs"
qsub recur2.bash
else
$ECHO "Finished last job in sequence of $NJOBS jobs"
fi
#==============================================================================
I get the following errors when I run
qsub -v NJOBS=4 recur2.bash
ModuleCmd_Load.c(200):ERROR:105: Unable to locate a modulefile for 'openmpi/1.4.3'
/var/spool/PBS/mom_priv/jobs/1833549.epic.SC: line 115: 0: command not found
/var/spool/PBS/mom_priv/jobs/1833549.epic.SC: line 117: 0: command not found
/var/spool/PBS/mom_priv/jobs/1833549.epic.SC: line 118: 1: command not found
/home/nsubramanian/bin/gromacs_3.3.3/bin/grompp_mpi: error while loading shared libraries: libmpi.so.0: cannot open shared object file: No such file or directory
/var/spool/PBS/mom_priv/jobs/1833549.epic.SC: line 128: mpirun: command not found
I was able to figure out the cause of the openmpi error, but not the rest, and I do not know how to make the script work.
Note: please ignore the line numbers; they differ from those in the original file.

There's no such module as openmpi/1.4.3 on your system; and in these lines
startnum= 0
i= $(($NJOB + $startnum - $x))
j= $(($i + $x))
there shouldn't be a space after the equals sign.
All you would have had to do to find this out is to try to run the script line by line in a bash shell.
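To illustrate the fix, here is a minimal sketch of the arithmetic section with the spaces after the equals signs removed. The `${NJOB:-1}` default is an assumption added only to make the example self-contained; it is not part of the original script.

```bash
#!/bin/bash
# Assumption: default NJOB to 1 when it is not set in the environment.
NJOB=${NJOB:-1}

# No whitespace on either side of '=' in a bash assignment.
startnum=0
x=1
i=$((NJOB + startnum - x))
j=$((i + x))

echo "job $i is followed by $j"
```

With a space after the equals sign, bash reads `startnum=` as an empty assignment that applies only to the command that follows, and then tries to run `0` as that command. That is where the `0: command not found` messages come from.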

Related

Exit script only AFTER running all commands [duplicate]

This question already has answers here:
How to wait in bash for several subprocesses to finish, and return exit code !=0 when any subprocess ends with code !=0?
(35 answers)
Closed 2 months ago.
I would like my shell script to fail if a specific command fails, but in any case to run the entire script. I thought about using return 1 after the command I want to "catch" and adding a condition at the end, like: if return 1; then exit 1. I'm a bit lost as to how this should look.
#!/bin/bash
command 1
# I want THIS command to make the script fail.
# It runs test and in parallel regex
command 2 &
# bash regex that HAS to run in parallel with command 2
regex='^ID *\| *([0-9]+)'
while ! [[ $(command 3) =~ $regex ]] && jobs -rp | awk 'END{exit(NR==0)}'
do
sleep 1
done
...
# Final command for creation of a report
# Script has to run this command also if command 2 fails
command 4
trap is your friend, but you have to be careful of those background tasks.
$: cat tst
#!/bin/bash
trap 'let err++' ERR
{ trap 'let err++' ERR; sleep 1; false; } & pid=$!
for str in foo bar baz ;do echo $str; done
wait $pid
echo happily done
exit $err
$: ./tst && echo ok || echo no
foo
bar
baz
happily done
no
Just make sure you test all your logic.
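An alternative that avoids trap altogether is to remember the PID of the background command and collect its status with wait before exiting. The sketch below assumes the hypothetical stand-ins `cmd2`, `cmd3` and `cmd4` for the question's "command 2/3/4"; `run_all` is likewise an illustrative name.

```bash
#!/bin/bash
# Hypothetical stand-ins for the question's "command 2/3/4".
cmd2() { sleep 0.2; return 3; }   # the command whose failure must be reported
cmd3() { echo "polling"; }        # work that runs alongside cmd2
cmd4() { echo "report written"; } # final step that must always run

run_all() {
    local rc=0 pid
    cmd2 & pid=$!            # start in the background, remember its PID
    cmd3                     # runs in parallel with cmd2
    wait "$pid" || rc=$?     # collect cmd2's exit status without aborting
    cmd4                     # always runs, even if cmd2 failed
    return "$rc"             # propagate cmd2's status to the caller
}
```

A script built this way would end with `run_all; exit $?`, so the report step still runs and the script's exit status reflects the background command.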

Get exit code from multiple bash scripts running in parallel

I am running 4 bash scripts in parallel, all at the same time:
./script1.sh & ./script2.sh & ./script3.sh & ./script4.sh
I would like to exit as soon as any of them fails. I tried to use something like an exit code, but it doesn't seem to work for parallel scripts. Is there a workaround?
Any bash/python solution would be welcome.
TL;DR
parallel --line-buffer --halt now,fail=1 ::: ./script?.sh
echo $?
42
Actual answer
When running jobs in parallel, I find it useful to consider GNU Parallel because it makes so many aspects easy for you:
resource allocation
load spreading across multiple CPUs and across networks
logging and output tagging
error-handling - this aspect is of particular interest here
scheduling, restarting
input & output file name derivation and renaming
progress reporting
So, I have made 4 dummy jobs script1.sh through script4.sh like this:
#!/bin/bash
echo "script1.sh starting..."
sleep 5
echo "script1.sh complete"
Except script3.sh which fails before the others:
#!/bin/bash
echo "script3.sh starting..."
sleep 2
echo "script3.sh dying"
exit 42
So, here's the default way to run 4 jobs in parallel, with the outputs of each all gathered and presented one after the other:
parallel ::: ./script*.sh
script3.sh starting...
script3.sh dying
script1.sh starting...
script1.sh complete
script4.sh starting...
script4.sh complete
script2.sh starting...
script2.sh complete
You can see script3.sh dies first and all its output is gathered and shown first, followed by the grouped output of the others. In simple terms, output is grouped by job and presented as each job finishes.
Now let's do it again, but only buffer the output by line rather than waiting for the jobs to finish and gather it on a per-job basis:
parallel --line-buffer ::: ./script*.sh
script1.sh starting...
script2.sh starting...
script3.sh starting...
script4.sh starting...
script3.sh dying
script1.sh complete
script2.sh complete
script4.sh complete
We can clearly see that script3.sh dies and exits before the others, but they still run to completion. In simple terms, output is presented line-by-line in the order it occurs.
Now we want GNU Parallel to kill any running jobs the moment any single one dies:
parallel --line-buffer --halt now,fail=1 ::: ./script?.sh
script2.sh starting...
script1.sh starting...
script3.sh starting...
script4.sh starting...
script3.sh dying
parallel: This job failed:
./script3.sh
You can see that script3.sh died and none of the other jobs completed because GNU Parallel killed them.
You can also get the failing exit status:
echo $?
42
It is far more flexible than I have shown. You can change now to soon and instead of killing other jobs, it will just not start any new ones. You can change fail=1 to success=50% so it will stop when half the jobs exit successfully, and so on.
You can also add --eta or --bar for progress reports and distribute jobs across your network and so on. Well worth reading up, in these days where CPUs are getting fatter (more cores) rather than taller (more GHz) - there is an excellent PDF available here.
Note: By default, GNU Parallel will keep as many jobs running in parallel as you have CPU cores. So, if you have fewer than 4 cores, you should probably add -j 4 to my suggested answer to tell it to run up to 4 jobs in parallel even if only 1 or 2 cores are present.
Here is a script that will do it for you.
I borrowed (and modified) non_blocking_wait function from here.
#!/bin/bash
# Run your scripts here... Following sleep commands as an example
sleep 5 &
sleep 3 &
sleep 3 &
# Here, we get the pid of each running process and put it in the array "pids"
pids=( $(jobs -p | tr '\n' ' ') )
echo "pids = ${pids[@]}"
non_blocking_wait()
{
PID=$1
if [ ! -d "/proc/$PID" ]; then
wait $PID
CODE=$?
else
CODE=127
fi
echo $CODE
}
while true; do
# Check if all processes are still running
n_running=$(jobs -l | grep -c "Running")
if [ "${n_running}" -ne "3" ]; then
# At least one process finished/returned here,
# check if exited in error
for pid in ${pids[@]}; do
ret=$(non_blocking_wait ${pid})
echo "non_blocking_wait ${pid} ret = ${ret}"
if [ "${ret}" -ne "0" ] && [ "${ret}" -ne "127" ]; then
echo "Process ${pid} exited with error ${ret}"
# Here we can take any desirable action such as
# killing all children and exiting the program:
kill $(jobs -p) > /dev/null 2>&1
exit 1
fi
done
if [ "${n_running}" -eq "0" ]; then
echo "All processes finished successfully"
exit 0
fi
fi
sleep 1
done
If you simply run it, it will exit 0 when all processes end:
$ ./script.sh
pids = 17913 17914 17915
non_blocking_wait 17913 ret = 127
non_blocking_wait 17914 ret = 0
non_blocking_wait 17915 ret = 0
non_blocking_wait 17913 ret = 127
non_blocking_wait 17914 ret = 0
non_blocking_wait 17915 ret = 0
non_blocking_wait 17913 ret = 0
All processes finished successfully
You can remove the parameter from one of the sleep commands to make it fail and see the program returning immediately:
$ ./script.sh
sleep: missing operand
Try 'sleep --help' for more information.
pids = 18005 18006 18007
non_blocking_wait 18005 ret = 127
non_blocking_wait 18006 ret = 1
Process 18006 exited with error 1
One solution is to use subprocess:
import subprocess
import time
def do_that(scripts):
    ps = [subprocess.Popen('./'+s, shell=True) for s in scripts]
    while True:
        done = True
        for p in ps:
            rc = p.poll()
            if rc is None: # Script is still running
                done = False
            elif rc:
                # if rc==0, script success to finish
                # otherwise it failed
                print('This script run failed:', p.args)
                running = set(ps) - {p}
                for i in running:
                    i.terminate()
                    print('Force terminate', i.args)
                return 1
        if done:
            print('All done.')
            return 0
def timeit(func):
    def runner(*args, **kwargs):
        start = time.time()
        res = func(*args, **kwargs)
        end = time.time()
        print(func.__name__, 'cost:', round(end-start,1))
        return res
    return runner
@timeit
def main():
    scripts = ('script1.sh', 'script2.sh')
    do_that(scripts)
if __name__ == '__main__':
    main()
wait -n waits for the next program to exit and returns its exit status.
pids=( )
./script1.sh & pids+=( $! )
./script2.sh & pids+=( $! )
./script3.sh & pids+=( $! )
./script4.sh & pids+=( $! )
for _ in "${pids[@]}"; do
wait -n || { rc=$?; kill "${pids[@]}"; exit "$rc"; }
done

Does pushing a block of code to background in Bash result in parallelization? [duplicate]

Let's say I have a loop in Bash:
for foo in `some-command`
do
do-something $foo
done
do-something is cpu bound and I have a nice shiny 4 core processor. I'd like to be able to run up to 4 do-something's at once.
The naive approach seems to be:
for foo in `some-command`
do
do-something $foo &
done
This will run all do-somethings at once, but there are a couple of downsides, mainly that do-something may also have some significant I/O, which might slow things down if it is all performed at once. The other problem is that this code block returns immediately, so there is no way to do other work once all the do-somethings are finished.
How would you write this loop so there are always X do-somethings running at once?
Depending on what you want to do xargs also can help (here: converting documents with pdf2ps):
cpus=$( ls -d /sys/devices/system/cpu/cpu[[:digit:]]* | wc -w )
find . -name \*.pdf | xargs --max-args=1 --max-procs=$cpus pdf2ps
From the docs:
--max-procs=max-procs
-P max-procs
Run up to max-procs processes at a time; the default is 1.
If max-procs is 0, xargs will run as many processes as possible at a
time. Use the -n option with -P; otherwise chances are that only one
exec will be done.
With GNU Parallel http://www.gnu.org/software/parallel/ you can write:
some-command | parallel do-something
GNU Parallel also supports running jobs on remote computers. This will run one per CPU core on the remote computers - even if they have different number of cores:
some-command | parallel -S server1,server2 do-something
A more advanced example: here we have a list of files that we want my_script to run on. The files have an extension (maybe .jpeg). We want the output of my_script to be put next to the files in basename.out (e.g. foo.jpeg -> foo.out). We want to run my_script once for each core the computer has, and we want to run it on the local computer, too. For the remote computers, we want the file to be processed to be transferred to the given computer. When my_script finishes, we want foo.out transferred back, and we then want foo.jpeg and foo.out removed from the remote computer:
cat list_of_files | \
parallel --trc {.}.out -S server1,server2,: \
"my_script {} > {.}.out"
GNU Parallel makes sure the output from each job does not mix, so you can use the output as input for another program:
some-command | parallel do-something | postprocess
See the videos for more examples: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
maxjobs=4
parallelize () {
while [ $# -gt 0 ] ; do
jobcnt=(`jobs -p`)
if [ ${#jobcnt[@]} -lt $maxjobs ] ; then
do-something $1 &
shift
else
sleep 1
fi
done
wait
}
parallelize arg1 arg2 "5 args to third job" arg4 ...
Here is an alternative solution that can be inserted into .bashrc and used for everyday one-liners:
function pwait() {
while [ $(jobs -p | wc -l) -ge $1 ]; do
sleep 1
done
}
To use it, all one has to do is put & after the jobs and a pwait call, the parameter gives the number of parallel processes:
for i in *; do
do_something $i &
pwait 10
done
It would be nicer to use wait instead of busy waiting on the output of jobs -p, but there doesn't seem to be an obvious way to wait until any one of the given jobs has finished instead of all of them.
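Newer versions of bash (4.3 and later) provide `wait -n`, which returns as soon as any one background job exits, so the pool can be throttled without polling. The following is a minimal sketch under that assumption; `do_something` is a hypothetical stand-in for the real workload and `run_throttled` is an illustrative name.

```bash
#!/bin/bash
# Hypothetical stand-in for the real workload.
do_something() { sleep 0.1; echo "done $1"; }

# Run do_something once per argument, keeping at most $1 jobs alive.
run_throttled() {
    local max=$1 running=0 arg
    shift
    for arg in "$@"; do
        if (( running >= max )); then
            wait -n                    # block until any one job exits
            running=$((running - 1))
        fi
        do_something "$arg" &
        running=$((running + 1))
    done
    wait                               # let the remaining jobs finish
}
```

It would be called as, for example, `run_throttled 4 file1 file2 file3`. Because the arguments are passed as "$@" and quoted, values containing spaces survive intact.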
Instead of a plain bash, use a Makefile, then specify number of simultaneous jobs with make -jX where X is the number of jobs to run at once.
Or you can use wait ("man wait"): launch several child processes, call wait - it will exit when the child processes finish.
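maxjobs=10
jobsrunning=0
job() {
...
}
while read -r line; do
job "$line" &
jobsrunning=$((jobsrunning + 1))
# once maxjobs jobs have been started, wait for the whole batch
if [ "$jobsrunning" -ge "$maxjobs" ]; then
wait
jobsrunning=0
fi
done < file.txt
wait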
If you need to store the job's result, then assign their result to a variable. After wait you just check what the variable contains.
If you're familiar with the make command, most of the time you can express the list of commands you want to run as a makefile. For example, if you need to run $SOME_COMMAND on files *.input each of which produces *.output, you can use the makefile
INPUT = a.input b.input
OUTPUT = $(INPUT:.input=.output)
%.output : %.input
	$(SOME_COMMAND) $< $@
all: $(OUTPUT)
and then just run
make -j<NUMBER>
to run at most NUMBER commands in parallel.
While doing this right in bash is probably impossible, you can do a semi-right fairly easily. bstark gave a fair approximation of right but his has the following flaws:
Word splitting: You can't pass any jobs to it that use any of the following characters in their arguments: spaces, tabs, newlines, stars, question marks. If you do, things will break, possibly unexpectedly.
It relies on the rest of your script to not background anything. If you do, or later you add something to the script that gets sent in the background because you forgot you weren't allowed to use backgrounded jobs because of his snippet, things will break.
Another approximation which doesn't have these flaws is the following:
scheduleAll() {
local job i=0 max=4 pids=()
for job; do
(( ++i % max == 0 )) && {
wait "${pids[@]}"
pids=()
}
bash -c "$job" & pids+=("$!")
done
wait "${pids[@]}"
}
Note that this one is easily adaptable to also check the exit code of each job as it ends so you can warn the user if a job fails or set an exit code for scheduleAll according to the amount of jobs that failed, or something.
The problem with this code is just that:
It schedules four (in this case) jobs at a time and then waits for all four to end. Some might be done sooner than others which will cause the next batch of four jobs to wait until the longest of the previous batch is done.
A solution that takes care of this last issue would have to use kill -0 to poll whether any of the processes have disappeared instead of the wait and schedule the next job. However, that introduces a small new problem: you have a race condition between a job ending, and the kill -0 checking whether it's ended. If the job ended and another process on your system starts up at the same time, taking a random PID which happens to be that of the job that just finished, the kill -0 won't notice your job having finished and things will break again.
A perfect solution isn't possible in bash.
Maybe try a parallelizing utility instead rewriting the loop? I'm a big fan of xjobs. I use xjobs all the time to mass copy files across our network, usually when setting up a new database server.
http://www.maier-komor.de/xjobs.html
function for bash:
parallel ()
{
awk "BEGIN{print \"all: ALL_TARGETS\\n\"}{print \"TARGET_\"NR\":\\n\\t@-\"\$0\"\\n\"}END{printf \"ALL_TARGETS:\";for(i=1;i<=NR;i++){printf \" TARGET_%d\",i};print\"\\n\"}" | make $@ -f - all
}
using:
cat my_commands | parallel -j 4
Really late to the party here, but here's another solution.
A lot of solutions don't handle spaces/special characters in the commands, don't keep N jobs running at all times, eat cpu in busy loops, or rely on external dependencies (e.g. GNU parallel).
With inspiration for dead/zombie process handling, here's a pure bash solution:
function run_parallel_jobs {
local concurrent_max=$1
local callback=$2
local cmds=("${@:3}")
local jobs=( )
while [[ "${#cmds[@]}" -gt 0 ]] || [[ "${#jobs[@]}" -gt 0 ]]; do
while [[ "${#jobs[@]}" -lt $concurrent_max ]] && [[ "${#cmds[@]}" -gt 0 ]]; do
local cmd="${cmds[0]}"
cmds=("${cmds[@]:1}")
bash -c "$cmd" &
jobs+=($!)
done
local job="${jobs[0]}"
jobs=("${jobs[@]:1}")
local state="$(ps -p $job -o state= 2>/dev/null)"
if [[ "$state" == "D" ]] || [[ "$state" == "Z" ]]; then
$callback $job
else
wait $job
$callback $job $?
fi
done
}
And sample usage:
function job_done {
if [[ $# -lt 2 ]]; then
echo "PID $1 died unexpectedly"
else
echo "PID $1 exited $2"
fi
}
cmds=( \
"echo 1; sleep 1; exit 1" \
"echo 2; sleep 2; exit 2" \
"echo 3; sleep 3; exit 3" \
"echo 4; sleep 4; exit 4" \
"echo 5; sleep 5; exit 5" \
)
# cpus="$(getconf _NPROCESSORS_ONLN)"
cpus=3
run_parallel_jobs $cpus "job_done" "${cmds[@]}"
The output:
1
2
3
PID 56712 exited 1
4
PID 56713 exited 2
5
PID 56714 exited 3
PID 56720 exited 4
PID 56724 exited 5
For per-process output handling $$ could be used to log to a file, for example:
function job_done {
cat "$1.log"
}
cmds=( \
"echo 1 \$\$ >\$\$.log" \
"echo 2 \$\$ >\$\$.log" \
)
run_parallel_jobs 2 "job_done" "${cmds[@]}"
Output:
1 56871
2 56872
The project I work on uses the wait command to control parallel shell (ksh actually) processes. To address your concerns about IO, on a modern OS, it's possible parallel execution will actually increase efficiency. If all processes are reading the same blocks on disk, only the first process will have to hit the physical hardware. The other processes will often be able to retrieve the block from OS's disk cache in memory. Obviously, reading from memory is several orders of magnitude quicker than reading from disk. Also, the benefit requires no coding changes.
This might be good enough for most purposes, but is not optimal.
#!/bin/bash
n=0
maxjobs=10
for i in *.m4a ; do
# ( DO SOMETHING ) &
# limit jobs
if (( $(($((++n)) % $maxjobs)) == 0 )) ; then
wait # wait until all have finished (not optimal, but most times good enough)
echo $n wait
fi
done
Here is how I managed to solve this issue in a bash script:
#! /bin/bash
MAX_JOBS=32
FILE_LIST=($(cat ${1}))
echo Length ${#FILE_LIST[@]}
for ((INDEX=0; INDEX < ${#FILE_LIST[@]}; INDEX=$((${INDEX}+${MAX_JOBS})) ));
do
JOBS_RUNNING=0
while ((JOBS_RUNNING < MAX_JOBS))
do
I=$((${INDEX}+${JOBS_RUNNING}))
FILE=${FILE_LIST[${I}]}
if [ "$FILE" != "" ];then
echo $JOBS_RUNNING $FILE
./M22Checker ${FILE} &
else
echo $JOBS_RUNNING NULL &
fi
JOBS_RUNNING=$((JOBS_RUNNING+1))
done
wait
done
You can use a simple nested for loop (substitute appropriate integers for N and M below):
for i in {1..N}; do
(for j in {1..M}; do do_something; done & );
done
This executes do_something N*M times: it starts N background subshells that run in parallel, each of which runs do_something M times in sequence. You can make N equal the number of CPUs you have.
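Brace expansion happens before variable expansion, so `{1..$n}` does not expand to a range; that is why literal integers have to be substituted above. If N and M are held in variables, a C-style for loop can be used instead. A minimal sketch, where `do_something` is a hypothetical stand-in and `run_grid` is an illustrative name:

```bash
#!/bin/bash
# Hypothetical stand-in for the real workload.
do_something() { echo "chain $1 step $2"; }

# Start n background chains; each chain runs m steps sequentially.
run_grid() {
    local n=$1 m=$2 i j
    for ((i = 1; i <= n; i++)); do
        (
            for ((j = 1; j <= m; j++)); do
                do_something "$i" "$j"
            done
        ) &
    done
    wait    # block until every chain has finished
}
```

Calling `run_grid 4 3` starts four chains of three steps each. The trailing `wait` is an addition so that the caller does not continue before the chains are done.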
My solution to always keep a given number of processes running, keep track of errors and handle uninterruptible / zombie processes:
function log {
echo "$1"
}
# Takes a list of commands and runs them with up to numberOfProcesses commands running simultaneously
# Returns the number of non zero exit codes from commands
function ParallelExec {
local numberOfProcesses="${1}" # Number of simultaneous commands to run
local commandsArg="${2}" # Semi-colon separated list of commands
local pid
local runningPids=0
local counter=0
local commandsArray
local pidsArray
local newPidsArray
local retval
local retvalAll=0
local pidState
local commandsArrayPid
IFS=';' read -r -a commandsArray <<< "$commandsArg"
log "Running ${#commandsArray[@]} commands in $numberOfProcesses simultaneous processes."
while [ $counter -lt "${#commandsArray[@]}" ] || [ ${#pidsArray[@]} -gt 0 ]; do
while [ $counter -lt "${#commandsArray[@]}" ] && [ ${#pidsArray[@]} -lt $numberOfProcesses ]; do
log "Running command [${commandsArray[$counter]}]."
eval "${commandsArray[$counter]}" &
pid=$!
pidsArray+=($pid)
commandsArrayPid[$pid]="${commandsArray[$counter]}"
counter=$((counter+1))
done
newPidsArray=()
for pid in "${pidsArray[@]}"; do
# Handle uninterruptible sleep state or zombies by omitting them from the running process array (how to kill what is already dead? :)
if kill -0 $pid > /dev/null 2>&1; then
pidState=$(ps -p$pid -o state= 2> /dev/null)
if [ "$pidState" != "D" ] && [ "$pidState" != "Z" ]; then
newPidsArray+=($pid)
fi
else
# pid is dead, get its exit code from wait command
wait $pid
retval=$?
if [ $retval -ne 0 ]; then
log "Command [${commandsArrayPid[$pid]}] failed with exit code [$retval]."
retvalAll=$((retvalAll+1))
fi
fi
done
pidsArray=("${newPidsArray[@]}")
# Add a trivial sleep time so bash won't eat all CPU
sleep .05
done
return $retvalAll
}
Usage:
cmds="du -csh /var;du -csh /tmp;sleep 3;du -csh /root;sleep 10; du -csh /home"
# Execute 2 processes at a time
ParallelExec 2 "$cmds"
# Execute 4 processes at a time
ParallelExec 4 "$cmds"
DOMAINS="list of some domain in commands"
i=1
for foo in $DOMAINS
do
eval "some-command for $foo" &
job[$i]=$!
i=$(( i + 1 ))
done
Ndomains=$(echo $DOMAINS | wc -w)
for i in $(seq 1 1 $Ndomains)
do
echo "wait for ${job[$i]}"
wait "${job[$i]}"
done
This concept works for parallelizing. The important thing is that the eval line ends with '&',
which puts the commands in the background.

"Argument list too long" for every command [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
Occasionally, when I have a program that generates large arrays I get this bug where every command throws the error
"Argument list too long"
even if I just type:
$ cp
-bash: /bin/cp: Argument list too long
$
I can't use ls, or even open a new file with vim:
$ vim test.txt
-bash: /usr/bin/vim: Argument list too long
$
I tried using "wait" for all bg processes to finish, but no change. It seems to happen inconsistently, but when it does, the only fix is to restart the shell.
Any ideas what might be going on?
Update: I did some further testing and got the error to be repeatable. It happens when a recursively defined array reaches 85 elements in length. The first command that throws the error is a bc that doesn't even depend on the array! From there on, almost every other command throws the same error.
Update: The program I'm using has many bash scripts working together, but I've determined the problem always arises in this one:
function MPMDrun_prop()
{
PARDIR=$1
COMPDIR=$2
runSTR=$3
NUMNODES=$4
ForceRun=$5
if [ $# -le 3 ] ; then
echo "USAGE: MPMDrun_prop \$PARDIR \$COMPDIR \$runSTR \$NUMNODES \$ForceRun"
fi
echo "in MPMDrun_Prop"
. $PARDIR/ParameterScan.inp
. $MCTDHBDIR/Scripts/get_NumberOfJobs.sh
if [ "$MPMD" != "T" ]; then
MPMDnodes=1
fi
## If no runscripts in the $PARDIR, copy one and strip of the line which runs the program
if [ -z "$(ls $PARDIR/run*.sh 2> /dev/null)" ] ; then
if [ "$forhost" == "maia" ]; then
cp $MCTDHBDIR/../PBS_Scripts/run-example-maia.sh $PARDIR/run.tmp
sed 's|mpirun.*||' < $PARDIR/run.tmp > $PARDIR/run.sh
jobtime=86400
elif [ "$forhost" == "hermit" ]; then
cp $MCTDHBDIR/../PBS_Scripts/run-example-hermit.sh $PARDIR/run.tmp
sed 's|aprun.*||' < $PARDIR/run.tmp > $PARDIR/run.sh
jobtime=86400
elif [ "$forhost" == "hornet" ]; then
cp $MCTDHBDIR/../PBS_Scripts/run-example-hornet.sh $PARDIR/run.tmp
sed 's|aprun.*||' < $PARDIR/run.tmp > $PARDIR/run.sh
jobtime=86400
elif [ "$forhost" == "bwgrid" ]; then
cp $MCTDHBDIR/../PBS_Scripts/run-example-BWGRID.sh $PARDIR/run.tmp
sed 's|mpirun.*||' < $PARDIR/run.tmp > $PARDIR/run.sh
jobtime=86400
fi
sed 's|nodes=[0-9]*|nodes=0|' < $PARDIR/run.sh > $PARDIR/run.tmp
sed 's|#PBS -N.*|#PBS -N MONSTER_'$MonsterName'|' < $PARDIR/run.tmp > $PARDIR/run.sh_
rm $PARDIR/run.sh
rm $PARDIR/run.tmp
chmod 755 $PARDIR/run.sh_
echo ". $MCTDHBDIR/Scripts/RunFlagSleeper.sh" >> $PARDIR/run.sh_
## Include check_convergence.sh for mixed relax/prop compatibility
echo ". $MCTDHBDIR/Scripts/check_convergence.sh" >> $PARDIR/run.sh_
echo "RunFlagSleeper $jobtime " >> $PARDIR/run.sh_
echo "(" >> $PARDIR/run.sh_
cp $PARDIR/run.sh_ $PARDIR/run1.sh
fi
### Add $runSTR to the most recent runscript
### find runscript$N.sh (run1.sh, run 2.sh, etc) that has numnodes less than $MPMDnodes
for qq in $(ls $PARDIR/run[0-9]*.sh | sort -g ); do
NodesInRun=$(cat $qq | grep -o "nodes *= *[0-9]*" | grep -o "[0-9]*")
if [ "$NodesInRun" -lt "$MPMDnodes" ]; then
## The number of nodes already specified in the runscript doesnt exceed the maximum, so add on another job
NewNodes=$(echo "$NodesInRun+$NUMNODES" | bc)
## Start each aprun command in its own subshell
## wait for 24 hrs after aprun, to guarantee that no subshell finishes before the job is done
sed 's|nodes=[0-9]*|nodes='$NewNodes'|' < $qq > $qq-1
sed 's|\(RunFlagSleeper .*\)|\1 '$COMPDIR'|' <$qq-1 >$qq
rm $qq-1
echo " (" >> $qq
## Sleeps for $jobtime - 5 mins, then removes runflag. in case aprun doesnt finish in $jobtime
echo " cd $COMPDIR" >> $qq
echo " $runSTR" >> $qq
## remove runflag after aprun command has finished
echo " rm $COMPDIR/RunFlag" >> $qq
# echo "sleep $jobtime" >> $qq-1
echo " ) &" >> $qq
# mv $qq-1 $qq
## put a flag in the computation directory so it isnt computed multiple times
touch $COMPDIR/RunFlag
if [[ "$NewNodes" -ge "$MPMDnodes" || "$ForceRun" == "T" ]]; then
## This last process made the nodecount exceed the maximum, or there is a ForceRun flag passed
## So now, exceute the runscript and start another
echo " wait" >> $qq
echo ") &" >> $qq
echo "PID=\$!" >> $qq
echo "wait \$PID" >> $qq
## Ensure the queue has room for the next job, if not, wait for it
Njobs=$(get_NumberOfJobs $runhost)
while [ "$Njobs" -ge "$maxjobs" ]; do
echo "Njobs=$Njobs and maxjobs=$maxjobs"
echo "Waiting 30 minutes for que to clear"
sleep 1800
done
echo "qsub $qq"
# qsub $qq
RunCount=$(echo $qq | grep -o 'run[0-9]*.sh' | grep -o '[0-9]*')
let "RunCount++"
cp $PARDIR/run.sh_ $PARDIR/run$RunCount.sh
fi
fi
done
}
The error typically starts at the 80-90'th call of this function at the first cp or bc. I've commented ALL array manipulations, so there is zero chance this is caused by the array being too large. The environment stays at ~100-200 Kb so that isn't the problem either.
That error message is a bit misleading. It should say something like "Argument list and environment use too much space".
The environment consists of all the environment variables you have exported, plus the environment your shell was started with. Normally, the environment should only be a few kilobytes, but there is nothing stopping you from exporting a million-byte string, and if you do that, you'll use up all the space allowed.
It's not totally obvious how much space the system allows for arguments + environment. You should be able to query the limit with getconf ARG_MAX, and with GNU xargs you can get more information from xargs --show-limits </dev/null (in both cases, assuming you haven't exceeded the limit :) ), but sometimes the actual space available will turn out to be less than what is indicated.
In any event, it's not a good idea to try to stuff megabytes into the environment. If you're tempted to do that, put the data in a temporary file instead, and just export the name of the file.
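As a rough check, the size of the exported environment can be compared against the limit reported by getconf, and a large value can be handed over through a file rather than an exported variable. This is a minimal sketch; `BIG_DATA_FILE` and the payload string are hypothetical names chosen for illustration.

```bash
#!/bin/bash
# Rough size of the exported environment, in bytes.
env_bytes=$(( $(env | wc -c) ))
# System limit on arguments plus environment for a new process.
arg_max=$(getconf ARG_MAX)
echo "environment: ${env_bytes} bytes; ARG_MAX: ${arg_max} bytes"

# Instead of exporting a large value, write it to a temporary file
# and export only the file name (BIG_DATA_FILE is a hypothetical name).
BIG_DATA_FILE=$(mktemp)
printf '%s\n' "some large payload" > "$BIG_DATA_FILE"
export BIG_DATA_FILE
```

Running the first three lines inside the affected shell shows whether the environment is approaching the limit. Note that the check itself starts external commands, so it can fail with the same error once the limit has already been exceeded.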
You stated that when a program generates large arrays, every command throws the error "Argument list too long". So I presume that the last command you executed is causing the problem for the next command. My suggestion is not to use a large argument list for any command. This could overflow the environment, causing problems even for the next command. Instead of a large argument list, put the data in a file and redirect the file as input, as in:
command < inputfile

how to write a process-pool bash shell

I have more than 10 tasks to execute, and the system restricts me to at most 4 tasks running at the same time.
My task can be started like:
myprog taskname
How can I write a bash shell script to run these tasks? The most important thing is that when one task finishes, the script starts another immediately, so that the number of running tasks stays at 4 the whole time.
Use xargs:
xargs -P <maximum-number-of-process-at-a-time> -n <arguments-per-process> <command>
Details here.
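Filled in for the question's case, the command could look like the sketch below. It assumes an xargs that supports `-P` (GNU and BSD xargs do). Here `myprog` is a hypothetical shell function standing in for the real program, which is why it is exported and invoked through `bash -c`; `run_pool` is an illustrative name.

```bash
#!/bin/bash
# Hypothetical stand-in for "myprog taskname".
myprog() { sleep 0.1; echo "finished $1"; }
export -f myprog    # make the function visible to the bash started by xargs

run_pool() {
    # Reads one task name per line on stdin; at most 4 run at once.
    # As soon as one task exits, xargs starts the next.
    xargs -n 1 -P 4 bash -c 'myprog "$1"' _
}

printf '%s\n' task1 task2 task3 task4 task5 task6 | run_pool
```

If myprog is an executable on the PATH rather than a shell function, the `bash -c` wrapper is unnecessary and `xargs -n 1 -P 4 myprog` is enough.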
I chanced upon this thread while looking into writing my own process pool and particularly liked Brandon Horsley's solution, though I couldn't get the signals working right, so I took inspiration from Apache and decided to try a pre-fork model with a fifo as my job queue.
The following function is the function that the worker processes run when forked.
# \brief the worker function that is called when we fork off worker processes
# \param[in] id the worker ID
# \param[in] job_queue the fifo to read jobs from
# \param[in] result_log the temporary log file to write exit codes to
function _job_pool_worker()
{
local id=$1
local job_queue=$2
local result_log=$3
local line=
exec 7<> ${job_queue}
while [[ "${line}" != "${job_pool_end_of_jobs}" && -e "${job_queue}" ]]; do
# workers block on the exclusive lock to read the job queue
flock --exclusive 7
read line <${job_queue}
flock --unlock 7
# the worker should exit if it sees the end-of-job marker or run the
# job otherwise and save its exit code to the result log.
if [[ "${line}" == "${job_pool_end_of_jobs}" ]]; then
# write it one more time for the next sibling so that everyone
# will know we are exiting.
echo "${line}" >&7
else
_job_pool_echo "### _job_pool_worker-${id}: ${line}"
# run the job
{ ${line} ; }
# now check the exit code and prepend "ERROR" to the result log entry
# which we will use to count errors and then strip out later.
local result=$?
local status=
if [[ "${result}" != "0" ]]; then
status=ERROR
fi
# now write the error to the log, making sure multiple processes
# don't trample over each other.
exec 8<> ${result_log}
flock --exclusive 8
echo "${status}job_pool: exited ${result}: ${line}" >> ${result_log}
flock --unlock 8
exec 8>&-
_job_pool_echo "### _job_pool_worker-${id}: exited ${result}: ${line}"
fi
done
exec 7>&-
}
You can get a copy of my solution at Github. Here's a sample program using my implementation.
#!/bin/bash
. job_pool.sh
function foobar()
{
# do something
true
}
# initialize the job pool to allow 3 parallel jobs and echo commands
job_pool_init 3 0
# run jobs
job_pool_run sleep 1
job_pool_run sleep 2
job_pool_run sleep 3
job_pool_run foobar
job_pool_run foobar
job_pool_run /bin/false
# wait until all jobs complete before continuing
job_pool_wait
# more jobs
job_pool_run /bin/false
job_pool_run sleep 1
job_pool_run sleep 2
job_pool_run foobar
# don't forget to shut down the job pool
job_pool_shutdown
# check the $job_pool_nerrors for the number of jobs that exited non-zero
echo "job_pool_nerrors: ${job_pool_nerrors}"
Hope this helps!
Using GNU Parallel you can do:
cat tasks | parallel -j4 myprog
If you have 4 cores, you can even just do:
cat tasks | parallel myprog
From http://git.savannah.gnu.org/cgit/parallel.git/tree/README:
Full installation
Full installation of GNU Parallel is as simple as:
./configure && make && make install
Personal installation
If you are not root you can add ~/bin to your path and install in
~/bin and ~/share:
./configure --prefix=$HOME && make && make install
Or if your system lacks 'make' you can simply copy src/parallel
src/sem src/niceload src/sql to a dir in your path.
Minimal installation
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
Test the installation
After this you should be able to do:
parallel -j0 ping -nc 3 ::: foss.org.my gnu.org freenetproject.org
This will send 3 ping packets to 3 different hosts in parallel and print
the output when they complete.
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
I would suggest writing four scripts, each one of which executes a certain number of tasks in series. Then write another script that starts the four scripts in parallel. For instance, if you have scripts, script1.sh, script2.sh, script3.sh, and script4.sh, you could have a script called headscript.sh like so.
#!/bin/sh
./script1.sh &
./script2.sh &
./script3.sh &
./script4.sh &
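Note that headscript.sh as written returns as soon as the four scripts have been started. If the caller needs to know when everything is done, one option is to add `wait`. The sketch below uses a hypothetical `run_batch` function in place of script1.sh through script4.sh:

```shell
#!/bin/bash
# run_batch is a hypothetical stand-in for script1.sh .. script4.sh:
# each batch would run its own tasks in series.
run_batch() {
  local batch=$1
  sleep 0.1                 # placeholder for the real serial work
  echo "batch $batch done"
}

for batch in 1 2 3 4; do
  run_batch "$batch" &      # start each batch in the background
done

wait                        # block until every background batch has exited
echo "all batches finished"
```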
I found the best solution in the blog A Foo Walks into a Bar..., which uses the built-in functionality of the well-known xargs tool.
First create a file commands.txt with the list of commands you want to execute:
myprog taskname1
myprog taskname2
myprog taskname3
myprog taskname4
...
myprog taskname123
and then pipe it to xargs like this to execute the commands in a pool of 4 processes:
cat commands.txt | xargs -I CMD --max-procs=4 bash -c CMD
You can change the number of processes by adjusting --max-procs.
Following @Parag Sardas' answer and the linked documentation, here's a quick script you might want to add to your .bash_aliases.
Relinking the doc link because it's worth a read
#!/bin/bash
# https://stackoverflow.com/a/19618159
# https://stackoverflow.com/a/51861820
#
# Example file contents:
# touch /tmp/a.txt
# touch /tmp/b.txt
if [ "$#" -eq 0 ]; then
echo "$0 <file> [max-procs=0]"
exit 1
fi
FILE=${1}
MAX_PROCS=${2:-0}
cat $FILE | while read line; do printf "%q\n" "$line"; done | xargs --max-procs=$MAX_PROCS -I CMD bash -c CMD
For example, the following runs the commands read from jobs.txt with a maximum of 4 processes:
./xargs-parallel.sh jobs.txt 4
You could probably do something clever with signals.
Note this is only to illustrate the concept, and thus not thoroughly tested.
#!/usr/local/bin/bash
this_pid="$$"
jobs_running=0
sleep_pid=
# Catch alarm signals to adjust the number of running jobs
trap 'decrement_jobs' SIGALRM
# When a job finishes, decrement the total and kill the sleep process
decrement_jobs()
{
jobs_running=$(($jobs_running - 1))
if [ -n "${sleep_pid}" ]
then
kill -s SIGKILL "${sleep_pid}"
sleep_pid=
fi
}
# Check to see if the max jobs are running, if so sleep until woken
launch_task()
{
if [ ${jobs_running} -gt 3 ]
then
(
while true
do
sleep 999
done
) &
sleep_pid=$!
wait ${sleep_pid}
fi
# Launch the requested task, signalling the parent upon completion
(
"$@"
kill -s SIGALRM "${this_pid}"
) &
jobs_running=$((${jobs_running} + 1))
}
# Launch all of the tasks, this can be in a loop, etc.
launch_task task1
launch_task task2
...
launch_task task99
This tested script runs 5 jobs at a time and starts a new job as soon as one finishes (the SIGCHLD trap kills the sleep 10.9, so the polling loop wakes up immediately). A simpler version could use plain polling: change the sleep 10.9 to sleep 1 and get rid of the trap.
#!/usr/bin/bash
set -o monitor
trap "pkill -P $$ -f 'sleep 10\.9' >&/dev/null" SIGCHLD
totaljobs=15
numjobs=5
worktime=10
curjobs=0
declare -A pidlist
dojob()
{
slot=$1
time=$(echo "$RANDOM * 10 / 32768" | bc -l)
echo Starting job $slot with args $time
sleep $time &
pidlist[$slot]=`jobs -p %%`
curjobs=$(($curjobs + 1))
totaljobs=$(($totaljobs - 1))
}
# start
while [ $curjobs -lt $numjobs -a $totaljobs -gt 0 ]
do
dojob $curjobs
done
# Poll for jobs to die, restarting while we have them
while [ $totaljobs -gt 0 ]
do
for ((i=0;$i < $curjobs;i++))
do
if ! kill -0 ${pidlist[$i]} >&/dev/null
then
dojob $i
break
fi
done
sleep 10.9 >&/dev/null
done
wait
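A shorter variant of the same idea, swapping the trap and polling for the `wait -n` builtin. This assumes bash 4.3 or later, where `wait -n` returns as soon as any one background job exits. The `run_all` name and the task body are hypothetical placeholders:

```shell
#!/bin/bash
# Requires bash 4.3 or later for "wait -n".
max_jobs=4

run_all() {
  local running=0 task
  for task in "$@"; do
    if [ "$running" -ge "$max_jobs" ]; then
      wait -n               # returns once any one background job has exited
      running=$((running - 1))
    fi
    # Hypothetical task body; replace with: myprog "$task" &
    { sleep 0.1; echo "done $task"; } &
    running=$((running + 1))
  done
  wait                      # collect the jobs that are still running
}

run_all t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
```

The counter is only an approximation of the number of live jobs, but it never lets more than `max_jobs` run at once.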
The other answer about 4 shell scripts does not fully satisfy me, because it assumes that all tasks take approximately the same time and because it requires manual setup. But here is how I would improve it.
The main script creates symbolic links to the executables, following a certain naming convention. For example,
ln -s executable1 ./01-task.01
The prefix is for sorting and the suffix identifies the batch (01-04).
Now we spawn 4 shell scripts that each take a batch number as input and do something like this:
for t in $(ls ./*-task.$batch | sort); do
"$t"
rm "$t"
done
Look at my implementation of job pool in bash: https://github.com/spektom/shell-utils/blob/master/jp.sh
For example, to run at most 3 processes of cURL when downloading from a lot of URLs, you can wrap your cURL commands as follows:
./jp.sh "My Download Pool" 3 curl http://site1/...
./jp.sh "My Download Pool" 3 curl http://site2/...
./jp.sh "My Download Pool" 3 curl http://site3/...
...
Here is my solution. The idea is quite simple. I create a fifo as a semaphore, where each line stands for an available resource. When reading the queue, the main process blocks if there is nothing left. And, we return the resource after the task is done by simply echoing anything to the queue.
function task() {
local task_no="$1"
# doing the actual task...
echo "Executing Task ${task_no}"
# which takes a long time
sleep 1
}
function execute_concurrently() {
local tasks="$1"
local ps_pool_size="$2"
# create an anonymous fifo as a Semaphore
local sema_fifo
sema_fifo="$(mktemp -u)"
mkfifo "${sema_fifo}"
exec 3<>"${sema_fifo}"
rm -f "${sema_fifo}"
# every 'x' stands for an available resource
for i in $(seq 1 "${ps_pool_size}"); do
echo 'x' >&3
done
for task_no in $(seq 1 "${tasks}"); do
read dummy <&3 # blocks until a resource is available
(
trap 'echo x >&3' EXIT # returns the resource on exit
task "${task_no}"
)&
done
wait # wait until all forked tasks have finished
}
execute_concurrently 10 4
The script above runs 10 tasks, at most 4 of them concurrently. You can replace the $(seq 1 "${tasks}") sequence with the actual task queue you want to run.
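For instance, here is a sketch of the same fifo-as-semaphore approach driven by a list of task names instead of a counter. The `run_queue` name and the echoed task body are hypothetical. In practice the body would be something like `myprog "${task}"`:

```shell
#!/bin/bash
# Same fifo-as-semaphore idea, iterating over task names passed as arguments.
run_queue() {
  local pool_size=$1; shift
  local sema_fifo task token i
  sema_fifo="$(mktemp -u)"
  mkfifo "${sema_fifo}"
  exec 3<>"${sema_fifo}"
  rm -f "${sema_fifo}"
  # one token per available slot
  for i in $(seq 1 "${pool_size}"); do
    echo 'x' >&3
  done
  for task in "$@"; do
    read -r token <&3          # blocks until a slot is free
    (
      trap 'echo x >&3' EXIT   # give the slot back when the task exits
      echo "running ${task}"   # hypothetical task body
    ) &
  done
  wait                         # wait for the remaining tasks
  exec 3>&-                    # close the semaphore descriptor
}

run_queue 4 taskA taskB taskC taskD taskE taskF
```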
I made my modifications based on the methods introduced in Writing a process pool in Bash.
#!/bin/bash
#set -e # this doesn't work here for some reason
POOL_SIZE=4 # number of workers running in parallel
#######################################################################
# populate jobs #
#######################################################################
declare -a jobs
for (( i = 1988; i < 2019; i++ )); do
jobs+=($i)
done
echo '################################################'
echo ' Launching jobs'
echo '################################################'
parallel() {
local proc procs jobs cur
jobs=("$@") # input jobs array
declare -a procs=() # processes array
cur=0 # current job idx
morework=true
while $morework; do
# if process array size < pool size, try forking a new proc
if [[ "${#procs[@]}" -lt "$POOL_SIZE" ]]; then
if [[ $cur -lt "${#jobs[@]}" ]]; then
proc=${jobs[$cur]}
echo "JOB ID = $cur; JOB = $proc."
###############
# do job here #
###############
sleep 3 &
# add to current running processes
procs+=("$!")
# move to the next job
((cur++))
else
morework=false
continue
fi
fi
for n in "${!procs[@]}"; do
kill -0 "${procs[n]}" 2>/dev/null && continue
# if process is not running anymore, remove from array
unset procs[n]
done
done
wait
}
parallel "${jobs[@]}"
xargs with -P and -L options does the job.
You can extract the idea from the example below:
#!/usr/bin/env bash
workers_pool_size=10
set -e
function doit {
cmds=""
for e in 4 8 16; do
for m in 1 2 3 4 5 6; do
cmd="python3 ./doit.py --m $m -e $e -m $m"
cmds="$cmd\n$cmds"
done
done
echo -e "All commands:\n$cmds"
echo "Workers pool size = $workers_pool_size"
echo -e "$cmds" | xargs -t -P $workers_pool_size -L 1 time > /dev/null
}
doit
#! /bin/bash
doSomething() {
<...>
}
getCompletedThreads() {
_runningThreads=("$@")
removableThreads=()
for pid in "${_runningThreads[@]}"; do
if ! ps -p $pid > /dev/null; then
removableThreads+=($pid)
fi
done
echo "${removableThreads[@]}"
}
releasePool() {
while [[ ${#runningThreads[@]} -eq $MAX_THREAD_NO ]]; do
echo "releasing"
removableThreads=( $(getCompletedThreads "${runningThreads[@]}") )
if [ ${#removableThreads[@]} -eq 0 ]; then
sleep 0.2
else
for removableThread in "${removableThreads[@]}"; do
runningThreads=( ${runningThreads[@]/$removableThread} )
done
echo "released"
fi
done
}
waitAllThreadComplete() {
while [[ ${#runningThreads[@]} -ne 0 ]]; do
removableThreads=( $(getCompletedThreads "${runningThreads[@]}") )
for removableThread in "${removableThreads[@]}"; do
runningThreads=( ${runningThreads[@]/$removableThread} )
done
if [ ${#removableThreads[@]} -eq 0 ]; then
sleep 0.2
fi
done
}
MAX_THREAD_NO=10
runningThreads=()
sequenceNo=0
for i in {1..36}; do
releasePool
((sequenceNo++))
echo "added $sequenceNo"
doSomething &
pid=$!
runningThreads+=($pid)
done
waitAllThreadComplete

Resources