Delay between runs in bash on a cluster

I have to submit a large number of jobs on a cluster. I have a script like:
#!/bin/bash
for runname in bcc BNU Can CNRM GFDLG GFDLM
do
    cd given_directory/$runname
    cat another_directory | while read LINE ; do
        qsub $LINE
    done
done
There are 4000 lines in the file, i.e. 4000 jobs for each runname.
The number of jobs a single user can have submitted on the cluster at any given time is limited.
So I want to pause between runs in the for-loop until one batch, e.g. all the runs in the bcc directory, is done.
How can I do that? Is there a command I can put after the first done to make the script wait until bcc is finished before moving on to BNU?

One option is to use a counter to track how many jobs are currently submitted, and wait when the limit is reached. Querying the number of jobs can be costly for the head node, so it is better not to do it after every submitted job. Here, it is done at most once every SLEEP seconds.
#!/bin/bash
TARGET=4000
SLEEP=300

# Count this user's current jobs, pending or running.
get_job_count(){
    # The grep removes the header; there may be a better way.
    qstat -u "$USER" | grep -c "$USER"
}

# Wait until the number of jobs is under the limit, then submit.
submit_when_possible(){
    while [ "$COUNTER" -ge "$TARGET" ]; do
        sleep "$SLEEP"
        COUNTER=$(get_job_count)
    done
    qsub "$1"
    COUNTER=$((COUNTER + 1))
}

# Global job counter
COUNTER=$(get_job_count)
for RUNNAME in bcc BNU Can CNRM GFDLG GFDLM
do
    cd given_directory/$RUNNAME
    # Read the job list via redirection rather than a pipeline: a pipeline
    # would run the loop in a subshell and the COUNTER updates would be lost.
    while read -r JOB ; do
        submit_when_possible "$JOB"
    done < another_directory
done
Note: the script is untested, so it may need minor fixes, but the idea should work.
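The question actually asks for something simpler: submit one directory's jobs, then wait until they have all finished before moving on to the next directory. A minimal sketch of that variant, assuming Torque/PBS-style qstat output and a job-list file named job_list in each run directory (both are assumptions):

```shell
#!/bin/bash
# Block until qstat reports no more jobs for this user.
drain_queue() {
    while [ "$(qstat -u "$USER" | grep -c "$USER")" -gt 0 ]; do
        sleep 300
    done
}

for runname in bcc BNU Can CNRM GFDLG GFDLM
do
    cd "given_directory/$runname" || continue
    while read -r line; do
        qsub "$line"
    done < job_list
    drain_queue          # all jobs of this run must finish before the next starts
    cd - > /dev/null
done
```

This trades throughput for simplicity: the queue drains completely between batches, so the cluster may sit partly idle near the end of each run.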

Related

How to use timeout for this nested Bash script?

I wrote the following bash script, which works all right apart from some random moments when it freezes completely and does not progress past a certain value of a0:
export OMP_NUM_THREADS=4
N_SIM=15000
N_NODE=1
for ((i = 1; i <= $N_SIM; i++))
do
    index=$((i))
    a0=$(awk "NR==${index} { print \$2 }" Intensity_Wcm2_versus_a0_10_20_10_25_range.txt)
    dirname="a0_${a0}"
    if [ -d "${dirname}" ]; then
        cd -P -- "${dirname}" # enter the directory because it exists already
        if [ -f "ParticleBinning0.h5" ]; then # move on because the sim has already been done and the results are there
            cd ..
            echo ${a0}
            echo We move to the next directory because ParticleBinning0.h5 exists in this one already.
            continue 1
        else
            awk -v s="a0=${a0}" 'NR==6 {print s} 1 {print}' ../namelist_for_smilei.py > namelist_for_smilei_a0included.py
            echo ${a0}
            mpirun -n 1 ../smilei namelist_for_smilei_a0included.py 2&> smilei.log
            cd ..
        fi
    else
        mkdir -p $dirname
        cd $dirname
        awk -v s="a0=${a0}" 'NR==6 {print s} 1 {print}' ../namelist_for_smilei.py > namelist_for_smilei_a0included.py
        echo ${a0}
        mpirun -n 1 ../smilei namelist_for_smilei_a0included.py 2&> smilei.log
        cd ..
    fi
done
I need to let this run for 12 hours or so in order to complete all 15,000 simulations.
One mpirun -n 1 ../smilei namelist_for_smilei.py 2&> smilei.log command takes 4 seconds to run on average.
Sometimes it just stops at one value of a0: the last value of a0 printed on the screen is, say, a0_12.032131, and it stays like that indefinitely, for no apparent reason.
There is no output being written to smilei.log in that particular faulty a0_12.032131 folder, so I don't know what happened with that value of a0.
No single value of a0 is particularly important; I can live without the computation for one value.
I have tried to use the timeout utility in Ubuntu to advance past any value of a0 that takes more than 2 minutes to run: if it takes longer than that, it has clearly failed and is blocking the whole process from moving forward.
It is beyond my capabilities to write such a script.
What would a template look like for my particular pipeline?
Thank you!
It seems that this mpirun program is hanging. As you said, you can use the timeout utility to terminate its execution after a reasonable amount of time has passed:
timeout --signal INT 2m mpirun...
Depending on how mpirun handles signals, it may be necessary to use KILL instead of INT to terminate the process.
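Applied to the pipeline above, a minimal sketch might look like this. The helper name is made up and the limit is passed as a parameter; timeout exits with status 124 when the limit is hit, which lets the loop log the skip and move on:

```shell
#!/bin/bash
# run_with_limit LIMIT CMD... : run CMD, killing it after LIMIT.
# timeout returns 124 when the time limit expires.
run_with_limit() {
    local limit=$1
    shift
    timeout --signal INT "$limit" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after $limit, moving on" >&2
    fi
    return "$status"
}

# Inside the loop, the mpirun line would then become something like:
# run_with_limit 2m mpirun -n 1 ../smilei namelist_for_smilei_a0included.py > smilei.log 2>&1
```

Because the function returns timeout's exit status, the loop can also test for 124 and, say, record the skipped a0 value in a file for later inspection.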

Monitoring a log file until it is complete

I am a high school student attempting to write a script in bash that will submit jobs using the "qsub" command on a supercomputer utilizing a different number of cores. This script will then take the data on the number of cores and the time it took for the supercomputer to complete the simulation from each of the generated log files, called "log.lammps", and store this data in a separate file.
Because it will take each log file a different amount of time to be completely generated, I followed the steps from
https://superuser.com/questions/270529/monitoring-a-file-until-a-string-is-found
to have my script proceed when the last line of the log file with the string "Total wall time: " was generated.
Currently, I am using the following code in a loop so that this can be run for all the specified number of cores:
( tail -f -n0 log.lammps & ) | grep -q "Total wall time:"
However, running the script with this piece of code resulted in the log.lammps file being truncated and the script not completing even when the log.lammps file was completely generated.
Is there any other method for my script to only proceed when the submitted job is completed?
One way to do this is to touch a marker file once you're complete, and wait for that:
#start process:
rm -f finished.txt
( sleep 3 ; echo "scriptdone" > log.lammps ; true ) && touch finished.txt &
# wait for the above to complete
while [ ! -e finished.txt ]; do
    sleep 1
done
echo safe to process log.lammps now...
You could also use inotifywait, or a flock if you want to avoid busy waiting.
EDIT:
to handle the case where one of the first commands might fail, I grouped the first commands and appended true so the group always returns true, then did && touch finished.txt. This way finished.txt gets created even if one of the first commands fails, and the loop below does not wait forever.
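The inotifywait alternative mentioned above avoids the one-second polling loop. A sketch, assuming the inotify-tools package is installed and the marker file is the finished.txt from the example:

```shell
#!/bin/bash
# Block until finished.txt exists, waking on filesystem events instead
# of polling. Requires inotifywait from the inotify-tools package.
wait_for_marker() {
    while [ ! -e finished.txt ]; do
        # -t 60 wakes the loop up once a minute even if no event arrives,
        # so a marker created before inotifywait started is still noticed.
        inotifywait -qq -e create -t 60 . || true
    done
}
```

The re-check of the file inside the loop matters: without it, a marker created between the existence test and the inotifywait call could be missed.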
Try the following approach:
# run tail -f in background
( tail -f -n0 log.lammps | grep -q "Total wall time:" ) > out 2>&1 &
# process id of the background pipeline
tailpid=$!
# wait for some time, or until the out file has data
sleep 10
# now kill the tail process
kill $tailpid
I tend to do this sort of thing with:
http://stromberg.dnsalias.org/~strombrg/notify-when-up2.html
and
http://stromberg.dnsalias.org/svn/age/trunk/
So something like:
notify-when-up2 --greater-than-or-equal-to 0 'age /etc/passwd' 10
This doesn't look for a specific pattern in your file - it looks for when the file stops changing for a 10 seconds. You can look for a pattern by replacing the age with a grep:
notify-when-up2 --true-command 'grep root /etc/passwd'
notify-when-up2 can do things like e-mail you, give a popup, or page you when a state changes. It's not a pretty approach in some cases, compared to using wait or whatever, but I find myself using it several times a day.
HTH.

Shell script to rsync a file every week without cronjob (school assignment)

#!/bin/bash
z=1
b=$(date)
while [[ $z -eq 1 ]]
do
    a=$(date)
    if [ "$a" == "$b" ]
    then
        b=$(date -d "+7 days")
        rsync -v -e ssh user@ip_address:~/sample.tgz /home/kartik2
        sleep 1d
    fi
done
I want to rsync a file every week. But if I start this script on every boot, the file will be rsynced every time the system starts! How do I alter the code so the rsync happens on a weekly basis? (PS: I don't want to do this through a cronjob; it's a school assignment.)
You are talking about having this run for weeks, right? So, we have to take into account that the system will be rebooted and it needs to be run unattended. In short, you need some means of ensuring the script is run at least once every week even when no one is around. The options look like this
Option 1 (worst)
You set a reminder for yourself and you log in every week and run the script. While you may be reliable as a person, this doesn't allow you to go on vacation. Besides, it goes against our principle of "when no one is around".
Option 2 (okay)
You can background the process (./once-a-week.sh &) but this will not be reliable over time. Among other things, if the system restarts then your script will not be running and you won't know.
Option 3 (better)
For this to be reliable over weeks one option is to daemonize the script. For a more detailed discussion on the matter, see: Best way to make a shell script daemon?
You would need to make sure the daemon is started after reboot or system failure. For more discussion on that matter, see: Make daemon start up with Linux
Option 4 (best)
You said no cron, but it really is the best option. In particular, it consumes no system resources for the 6 days, 23 hours and 59 minutes when it does not need to be running. Additionally, it is naturally resilient to reboots and the like. So, I feel compelled to say that creating a crontab entry like the following would be my top vote: @weekly /full/path/to/script
If you do choose option 2 or 3 above, you will need to modify your script to keep a variable holding the week number (date +%V) in which the script last successfully completed a run. The problem is, keeping that only in memory means it will not survive a reboot.
To make any of the above more resilient, it might be best to create a directory where you can store a file to serve as a semaphore (e.g. week21.txt) or a file to store the state of the last run. Something like once-a-week.state to which you would write a value when run:
date +%V > once-a-week.state # write the week number to a file
Then to read the value, you would:
file="/path/to/once-a-week.state" # the file where the week number is stored
read -d $'\x04' name < "$file"
echo "$name"
You would then need to check to see if the week number matched this present week number and handle the needed action based on match or not.
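The week-number check described above could be sketched as follows; the helper name and the state-file location are assumptions, and the command to run is passed as arguments so the state handling stays in one place:

```shell
#!/bin/bash
# weekly_run CMD... : run CMD at most once per ISO week, tracked in a
# state file so the decision survives reboots.
weekly_run() {
    local state_file="$HOME/.once-a-week.state"   # assumed location
    local current_week last_week
    current_week=$(date +%V)
    last_week=$(cat "$state_file" 2>/dev/null)
    if [ "$current_week" != "$last_week" ]; then
        # Record the week only if the command succeeded, so a failed
        # rsync is retried on the next pass.
        "$@" && echo "$current_week" > "$state_file"
    fi
}

# e.g. weekly_run rsync -v -e ssh user@ip_address:~/sample.tgz /home/kartik2
```

Calling this from a boot script or a simple sleep loop then does the right thing: the rsync fires at most once per week regardless of how often the script itself is started.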
#!/bin/bash
z=1
b=$(cat f1.txt)
while [[ $z -eq 1 ]]
do
    a=$(date +"%d-%m-%y")
    if [ "$a" == "$b" ] || [ "$b" == "" ] || [ "$a" -ge "$b" ]
    then
        b=$(date +"%d-%m-%y" -d "+7 days")
        echo $b > f1.txt
        rsync -v -e ssh HOST@ip:~/sample.tgz /home/user
        if [ $? -eq 0 ]
        then
            sleep 1d
        fi
    fi
done
This code seems to work well! If you see any changes it needs, let me know.

How to start a large number of quick jobs in Bash

I have 3000 very quick jobs to run that on average take 2 to 3 seconds each.
The list of jobs is in a file, and I want to control how many of them are running at once.
However, starting a job in the background (the & line) seems to take some time itself, so some jobs are already finishing before "INTOTAL" of them have been started...
Therefore, I am not using my 32 cores efficiently.
Is there a better approach than the one below?
#!/bin/sh
#set -x
INTOTAL=28
while true
do
    NUMRUNNING=`tasklist | egrep Prod.exe | wc -l`
    JOBS=`cat jobs.lst | wc -l`
    if [ $JOBS -gt 0 ]
    then
        MAXSTART=$(($INTOTAL-$NUMRUNNING))
        NUMTOSTART=$JOBS
        if [ $NUMTOSTART -gt $MAXSTART ]
        then
            NUMTOSTART=$MAXSTART
        fi
        echo 'Starting: '$NUMTOSTART
        for ((i=1;i<=$NUMTOSTART;i++))
        do
            JOB=`head -n1 jobs.lst`
            sed -i 1d jobs.lst
            /Prod $JOB &
        done
        sleep 2
    fi
    sleep 3
done
You may want to have a look at parallel, which you should be able to install on Cygwin according to the release notes. Then running the tasks in parallel can be as easy as:
parallel /Prod {} < jobs.lst
See here for an example of this in its man page (and have a look through the plethora of examples for more about the many options it has).
To control how many jobs to run at a time use the -j flag. By default it will run 1 job per core at a time, so 32 for you. To limit to 16 for instance:
parallel -j 16 /Prod {} < jobs.lst

Looping files in bash

I want to loop over these kinds of files, where files with the same Sample_ID have to be used together:
Sample_51770BL1_R1.fastq.gz
Sample_51770BL1_R2.fastq.gz
Sample_52412_R1.fastq.gz
Sample_52412_R2.fastq.gz
e.g. Sample_51770BL1_R1.fastq.gz and Sample_51770BL1_R2.fastq.gz are used together in one command to create an output.
Similarly, Sample_52412_R1.fastq.gz and Sample_52412_R2.fastq.gz are used together to create output.
I want to write a for loop in bash to iterate over and create output.
sourcedir=/sourcepath/
destdir=/destinationpath/
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta Sample_52412_R1.fastq.gz Sample_52412_R2.fastq.gz > $destdir/Sample_52412_R1_R2.sam
How should I pattern match the file names Sample_ID_R1 and Sample_ID_R2 to be used in one command?
Thanks,
for fname in *_R1.fastq.gz
do
    base=${fname%_R1*}
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" > "$destdir/${base}_R1_R2.sam"
done
In the comments, you ask about running several, but not too many, jobs in parallel. Below is my first stab at that:
#!/bin/bash
# Limit background jobs to no more than $maxproc at once.
maxproc=3
for fname in *_R1.fastq.gz
do
    while [ $(jobs | wc -l) -ge "$maxproc" ]
    do
        sleep 1
    done
    base=${fname%_R1*}
    echo starting new job with ongoing=$(jobs | wc -l)
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" > "$destdir/${base}_R1_R2.sam" &
done
wait    # let the last batch of jobs finish
The optimal value of maxproc will depend on how many processors your PC has. You may need to experiment to find what works best.
Note that the above script uses jobs, which is a bash builtin. It therefore has to be run under bash, not dash, which is the default /bin/sh for scripts on Debian-like distributions.
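If the shell is a reasonably recent bash (4.3+), the sleep-polling loop can also be replaced with wait -n, which blocks until any one background job exits. A sketch under that assumption, where some_command stands in for the real bwa mem invocation:

```shell
#!/bin/bash
# Throttle background jobs with wait -n (bash 4.3+) instead of polling.
workdir=$(mktemp -d)
some_command() { sleep 0.1; touch "$workdir/done_$1"; }  # placeholder job

maxproc=3
running=0
for fname in a_R1.fastq.gz b_R1.fastq.gz c_R1.fastq.gz d_R1.fastq.gz
do
    if [ "$running" -ge "$maxproc" ]; then
        wait -n                    # returns as soon as any one job finishes
        running=$((running - 1))
    fi
    base=${fname%_R1*}
    some_command "$base" &
    running=$((running + 1))
done
wait    # wait for the remaining jobs before moving on
```

Compared to the sleep loop, this starts the next job the moment a slot frees up, which matters when the individual jobs are short.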
