shell script, for loop: does the loop wait for the command to finish before iterating? - bash

I have a shell script with a for loop. Does the loop wait for the command in its body to finish before iterating?
Thanks in advance.
Here is my code. Will the commands execute sequentially or in parallel?
for m in "${mode[@]}"
do
    cmd="exec $perlExecutablePath $perlScriptFilePath --owner $j -rel $i -m $m"
    $cmd
    eval "$cmd"
done

Assuming that you haven't backgrounded the command, then yes.
For example:
for i in {1..10}; do cmd; done
waits for cmd to complete before continuing the loop, whereas:
for i in {1..10}; do cmd & done
doesn't.
If you want to run your commands in parallel, I would suggest changing your loop to something like this:
for m in "${mode[@]}"
do
    "$perlExecutablePath" "$perlScriptFilePath" --owner "$j" -rel "$i" -m "$m" &
done
This runs each command in the background, so it doesn't wait for one command to finish before the next one starts.
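If you also need the script to pause until all of those backgrounded commands have finished, you can follow the loop with wait. A minimal sketch, reusing the variables from the question:
for m in "${mode[@]}"
do
    "$perlExecutablePath" "$perlScriptFilePath" --owner "$j" -rel "$i" -m "$m" &
done
wait   # blocks here until every backgrounded job has exited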
An alternative would be to look at GNU Parallel, which is designed for this purpose.

Using GNU Parallel it looks like this:
parallel $perlExecutablePath $perlScriptFilePath --owner $j -rel $i -m {} ::: "${mode[@]}"
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Related

Run jobs in sequence rather than consecutively using bash

So I work a lot with Gaussian 09 (the computational chemistry software) on a supercomputer.
To submit a job I use the following command line
g09sub input.com -n 2 -m 4gb -t 200:00:00
Where n is the number of processors used, m is the memory requested, and t is the time requested.
I was wondering if there was a way to write a script that will submit the first 10 .com files in the folder and then submit another .com file as each finishes.
I have a script that will submit all the .com files in a folder at once, but I have a limit to how many jobs I can queue on the supercomputer I use.
The current script looks like
#!/bin/bash
#SBATCH --partition=shared
for i in *.com
do g09sub $i -n 2 -m 4gb -t 200:00:00
done
So 1.com, 2.com, 3.com, etc would be submitted all at the same time.
What I want is to have 1.com, 2.com, 3.com, 4.com, 5.com, 6.com, 7.com, 8.com, 9.com, and 10.com all start at the same time and then as each of those finishes have another .com file start. So that no more than 10 jobs from any one folder will be running at the same time.
If it would be useful, each job creates a .log file when it is finished.
Though I am unsure if it is important, the supercomputer uses a PBS queuing system.
Try xargs or GNU parallel
xargs
ls *.com | xargs -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
-I {} tells xargs that {} will represent the input file name
-P 10 sets the maximum number of jobs to run at once
parallel
ls *.com | parallel -P 10 g09sub {} -n 2 -m 4gb -t 200:00:00 # GNU parallel supports -P too
ls *.com | parallel --jobs 10 g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
{} represents the input file name
--jobs 10 sets the maximum number of jobs to run at once
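A variant that avoids parsing the output of ls (a sketch, assuming GNU xargs) feeds NUL-delimited file names instead:
printf '%s\0' *.com | xargs -0 -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00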
Not sure about the availability on your supercomputer, but the GNU bash manual offers a parallel example under 3.2.6 GNU Parallel, at the bottom.
There are ways to run commands in parallel that are not built into Bash. GNU Parallel is a tool to do just that.
...
Finally, Parallel can be used to run a sequence of shell commands in parallel, similar to ‘cat file | bash’. It is not uncommon to take a list of filenames, create a series of shell commands to operate on them, and feed that list of commands to a shell. Parallel can speed this up. Assuming that file contains a list of shell commands, one per line,
parallel -j 10 < file
will evaluate the commands using the shell (since no explicit command is supplied as an argument), in blocks of ten shell jobs at a time.
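Applied to the question, the file of commands could be generated first; a minimal sketch (the cmds.txt name is just an example):
# build one g09sub command per .com file, then run ten at a time
for f in *.com; do
    printf 'g09sub %q -n 2 -m 4gb -t 200:00:00\n' "$f"
done > cmds.txt
parallel -j 10 < cmds.txt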
Where that option was not available to me, using the jobs builtin worked, rather crudely. For example:
for entry in *.com; do
    while [ "$(jobs | wc -l)" -gt 9 ]; do
        sleep 1 # seconds; your sleep may support arbitrary floating point numbers
    done
    g09sub "${entry}" -n 2 -m 4gb -t 200:00:00 &
done
$(jobs | wc -l) counts the number of jobs the script has spawned in the background with the trailing &.
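On bash 4.3 or newer, the same throttle can avoid the polling sleep by using wait -n; a sketch along the same lines:
for entry in *.com; do
    while (( $(jobs -rp | wc -l) >= 10 )); do
        wait -n    # block until any one background job exits
    done
    g09sub "${entry}" -n 2 -m 4gb -t 200:00:00 &
done
wait               # wait for whatever is still running at the end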

How to issue shell commands to slave machines from master and wait until all are finished?

I have 4 shell commands I need to run and they do not depend on each other.
I have 4 slave machines. So, I want to run one of the 4 commands on each of the 4 machines, and then I want to wait until all 4 of them are finished.
How do I distribute this processing? This is what I tried:
$1 is a file containing a list of IP addresses of the slave machines.
for host in $(cat $1)
do
    echo $host
    # ssh into each machine and launch command
    ssh username@$host <command>;
done
But this seems to wait for the command to finish before moving on to the next host and launching the next command.
How do I accomplish this distributed processing, given that the commands don't depend on each other?
I would use GNU Parallel like this - running hostname in parallel on each of 4 servers:
parallel -j 4 --nonall -S 192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4 hostname
If you need to pass parameters, use --onall and put arguments after :::
parallel -j 4 --onall -S 192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4 echo ::: hello
Add --tag if you want the output lines tagged by the hostname/IP.
Add -k if you want to keep the output in order.
Add : to the server list to run on local host too.
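Putting those options together, a tagged, ordered run that also includes the local host might look like this (same placeholder IPs as above):
parallel --tag -k --nonall -S 192.168.0.1,192.168.0.2,192.168.0.3,192.168.0.4,: hostname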
If you aren't concerned about how many commands run concurrently, just put each one in the background with &, then wait on them as a group.
while IFS= read -r host; do
    ssh -n username@"$host" <command> &   # -n keeps ssh from consuming the host list on stdin
done < "$1"
wait
Note the use of a while loop instead of a for loop; see Bash FAQ 001.
The ssh part of your script needs to look something like this:
$ ssh -f user@host "sh -c 'sleep 30 ; nohup ls > foo 2>&1 &'"
This one sleeps for 30 seconds and writes the output of ls to the file foo; 30 seconds is enough for you to go and see it for yourself. Just build your loop around that.
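Built into the question's loop, that might look like the following sketch (<command> stays a placeholder, and out.log is just an example output file):
while IFS= read -r host; do
    # -f backgrounds ssh after authentication; nohup detaches the remote command
    ssh -f username@"$host" "sh -c 'nohup <command> > out.log 2>&1 &'"
done < "$1"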

Remote task queue using bash & ssh for variable number of live workers

I want to distribute the work from a master server to multiple worker servers using batches.
Ideally I would have a tasks.txt file with the list of tasks to execute
cmd args 1
cmd args 2
cmd args 3
cmd args 4
cmd args 5
cmd args 6
cmd args 7
...
cmd args n
and each worker server will connect using ssh, read the file and mark each line as in progress or done
#cmd args 1 #worker1 - done
#cmd args 2 #worker2 - in progress
#cmd args 3 #worker3 - in progress
#cmd args 4 #worker1 - in progress
cmd args 5
cmd args 6
cmd args 7
...
cmd args n
I know how to make the ssh connection, read the file, and execute remotely, but I don't know how to make the read and write an atomic operation, so that no two servers start the same task, nor how to update the line.
I would like each worker to go to the list of tasks and lock the next available task in the list, rather than having the server actively command the workers, as I will have a flexible number of worker clones that I will start or stop depending on how fast I need the tasks to complete.
UPDATE:
My idea for the worker script would be:
#!/bin/bash
taskCmd=""
taskLine=0
masterSSH="ssh usr@masterhost"
tasksFile="/path/to/tasks.txt"

function getTask(){
    while [[ $taskCmd == "" ]]
    do
        sleep 1;
        taskCmd_and_taskLine=$($masterSSH "#read_and_lock_next_available_line $tasksFile;")
        taskCmd=${taskCmd_and_taskLine[0]}
        taskLine=${taskCmd_and_taskLine[1]}
    done
}

function updateTask(){
    message=$1
    $masterSSH "#update_currentTask $tasksFile $taskLine $message;"
}

function doTask(){
    return $taskCmd;
}

while [[ 1 -eq 1 ]]
do
    getTask
    updateTask "in progress"
    doTask
    taskErrCode=$?
    if [[ $taskErrCode -eq 0 ]]
    then
        updateTask "done, finished successfully"
    else
        updateTask "done, error $taskErrCode"
    fi
    taskCmd="";
    taskLine=0;
done
You can use flock to coordinate concurrent access to the file:
exec 200>>/some/any/file ## open a file descriptor on the file
flock -w 30 200          ## take an exclusive lock on fd 200, waiting up to 30 sec.
You can point the file descriptor to your tasks list or any other file, but of course it must be the same file for flock to work. The lock will be removed as soon as the process that created it finishes or fails. You can also release the lock yourself when you no longer need it:
flock -u 200
A usage sample:
ssh user@x.x.x.x '
    set -e
    exec 200>>f
    echo locking...
    flock -w 10 200
    echo working...
    sleep 5
'
set -e makes the script fail if any step fails. Play with the sleep time and run several copies of this script in parallel: only one sleep will execute at a time.
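For the question itself, the placeholder #read_and_lock_next_available_line could be sketched roughly like this (a hypothetical, untested helper built on the same flock idea; it claims the first unmarked line of the tasks file and prints its line number and command):
# Hypothetical sketch: atomically claim the first line of tasks.txt not yet marked with '#'.
tasksFile="/path/to/tasks.txt"

exec 200>>"$tasksFile.lock"   # dedicated lock file
flock -w 30 200 || exit 1     # give up if the lock cannot be taken within 30 sec.

lineNo=$(grep -n -m 1 -v '^#' "$tasksFile" | cut -d: -f1)
if [[ -n $lineNo ]]; then
    taskCmd=$(sed -n "${lineNo}p" "$tasksFile")
    sed -i "${lineNo}s/^/#/" "$tasksFile"     # mark the line so no other worker takes it
    printf '%s\n%s\n' "$lineNo" "$taskCmd"    # hand the line number and command back to the worker
fi

flock -u 200                  # release the lock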
Check if you are reinventing GNU Parallel:
parallel -S worker1 -S worker2 command ::: arg1 arg2 arg3
GNU Parallel is a general parallelizer; see the installation instructions and learning resources listed earlier on this page.
Try to implement something like this:
while read -r line; do
    echo "$line"
    # check if the line starts with the # char; if not, execute the ssh, else nothing to do
    checkAlreadyDone=$(grep "^#" <<< "$line")
    if [ -z "${checkAlreadyDone}" ]; then
        <insert here the command to execute the ssh call>
        <here, if everything has been executed without issue, you should
         add a command to update the file taskList.txt;
         one option could be to insert a sed command, but it should be tested>
    else
        echo "nothing to do for $line"
    fi
done < taskList.txt
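The sed update mentioned above could be sketched like this (hypothetical and untested; $taskLine would be the number of the claimed line, as in the question's worker script):
# mark line $taskLine of taskList.txt as done and record which worker finished it
sed -i "${taskLine}s/^\(.*\)$/#\1 #$(hostname) - done/" taskList.txt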
Regards
Claudio
I think I have successfully implemented one: https://github.com/guo-yong-zhi/DistributedTaskQueue
It is mainly based on bash, ssh and flock, and python3 is required for string processing.

Maintaining a set number of concurrent jobs w/ args from a file in bash

I found this script on the net. I don't know much about working in bash, and it looks a bit weird to me, but...
Here's my script:
CONTOR=0
for i in `cat targets`
do
    CONTOR=`ps aux | grep -c php`
    while [ $CONTOR -ge 250 ]; do
        CONTOR=`ps aux | grep -c php`
        sleep 0.1
    done
    if [ $CONTOR -le 250 ]; then
        php b $i > /dev/null &
    fi
done
My targets are URLs, and the b PHP file is a crawler which saves some links into a file. The problem is that the maximum number of concurrent processes I reach is only 50-60, because the crawler finishes very fast and the bash script doesn't have time to start all 250 of them. Is there any way to get all 250 running? Is it possible to start more than one process per ps aux check? Right now it seems to start only one process after each ps aux.
First: Bash has no multithreading support whatsoever. foo & starts a separate process, not a thread.
Second: launching ps to check for children is both prone to false positives (treating unrelated invocations of php as if they were jobs in the current process) and extremely inefficient if done in a loop (since every invocation involves a fork()/exec()/wait() cycle).
Thus, don't do it that way: Use a release of GNU xargs with -P, or (if you must) GNU parallel.
Assuming your targets file is newline-delimited, and has no special quoting or characters, this could be as simple as:
xargs -d $'\n' -n 1 -P 250 php b <targets
...or, for pure POSIX shells:
xargs -d "
" -n 1 -P 250 php b <targets
With GNU Parallel it looks like this (choose the style you like best):
cat targets | parallel -P 250 php b
parallel -a targets -P 250 php b
parallel -P 250 php b :::: targets
There is no risk of false positives if there are other php processes running. And unlike xargs, there is no risk if the file targets contains spaces, " or '.

GNU parallel processing

I have the following script that I want to run using GNU parallel; it is a for loop that needs to run n times. How can I do this using GNU parallel?
SHARK=tshark
# Create file list
FILELIST=`ls $1`
TEMPDIR=/tmp/foobar
mkdir $TEMPDIR
i=1
for I in $FILELIST; do
    echo "$i $I $2"
    $SHARK -r $I -w $TEMPDIR/~$I-$i -R "$2" &>/dev/null
    i=`echo $i+1|bc`
done
There are a number of ways of doing this, either with sub-shells and sub-processes, see e.g.
Running shell script in parallel
or by installing neat utilities designed to do this, e.g:
|P|P|S|S| - (Distributed) Parallel Processing Shell Script
GNU Parallel
I would try to get it done first with sub-shells, and then try the others if you still need more power.
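For reference, a rough GNU Parallel equivalent of the question's loop might look like this untested sketch ($1 is the same file-list argument and $2 the same tshark read filter as above; {} is each input file, {/} its basename, and {#} the job sequence number, which stands in for the $i counter; the &>/dev/null redirection is left out for clarity):
SHARK=tshark
TEMPDIR=/tmp/foobar
mkdir -p "$TEMPDIR"
# -q makes parallel shell-quote the arguments, so a filter containing spaces survives
ls $1 | parallel -q "$SHARK" -r {} -w "$TEMPDIR"/~{/}-{#} -R "$2"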

Resources