how to write a bash script that creates new scripts iteratively - bash

How would I write a script that loops through all of my subjects and creates a new script per subject? The goal is to create a script that runs a program called FreeSurfer per subject on a supercomputer. The supercomputer queue restricts how long each script/job will take, so I will have each job run 1 subject. Ultimately I would like to automate the job submitting process since I cannot submit all the jobs at the same time. In my subjects folder I have three subjects: 3123, 3315, and 3412.
I am familiar with MATLAB scripting, so I was envisioning something like this
for i=1:length(subjects)
nano subjects(i).sh
<contents of FreeSurfer script>
input: /subjects(i)/scan_name.nii
output: /output/subjects(i)/<FreeSurfer output folders>
end
I know I mixed aspects of MATLAB and linux but hopefully it's relatively clear what the goal is. Please let me know if there is a better method.
Here is an example of the FreeSurfer script for a given subject
#!/bin/bash
#PBS -l walltime=25:00:00
#PBS -q long
export FREESURFER_HOME=/gpfs/software/freesurfer/6.0.0/freesurfer
source $FREESURFER_HOME/SetUpFreeSurfer.sh
export SUBJECTS_DIR=/gpfs/projects/Group/ppmi/freesurfer/subjects/
recon-all -i /gpfs/projects/Group/ppmi/all_anat/3105/Baseline/*.nii \
    -s $SUBJECTS_DIR/freesurfer/subjects/3105 -autorecon-all
The -i option gives the input and the -s option gives the output.

Change your script to accept the subject as an argument, so that you have only one generic script:
#!/bin/bash
#PBS -l walltime=25:00:00
#PBS -q long
subject="$1"
export FREESURFER_HOME=/gpfs/software/freesurfer/6.0.0/freesurfer
source $FREESURFER_HOME/SetUpFreeSurfer.sh
export SUBJECTS_DIR=/gpfs/projects/Group/ppmi/freesurfer/subjects/
recon-all -i /gpfs/projects/Group/ppmi/all_anat/"$subject"/Baseline/*.nii \
    -s $SUBJECTS_DIR/freesurfer/subjects/"$subject" -autorecon-all
and you can call it for all your subjects
for s in 3123 3315 3412; do
    ./yourscriptnamehere.sh "$s"
done
Add error handling as desired.
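If you do want one file per subject (for instance so that each job's exact script is preserved, or so each file can be handed to the queue separately), the loop can also generate the per-subject scripts with a heredoc and then submit them. A minimal sketch, assuming a PBS-style qsub and the generic script above; the jobs/ directory and filenames are illustrative:

```shell
#!/bin/bash
# Sketch: generate one job script per subject with a heredoc, then submit it.
# Subject IDs come from the question; adapt paths and PBS headers to your cluster.
mkdir -p jobs
for s in 3123 3315 3412; do
    cat > "jobs/freesurfer_${s}.sh" <<EOF
#!/bin/bash
#PBS -l walltime=25:00:00
#PBS -q long
./yourscriptnamehere.sh ${s}
EOF
    chmod +x "jobs/freesurfer_${s}.sh"
    # qsub "jobs/freesurfer_${s}.sh"   # uncomment on the cluster to submit
done
```

The qsub call is commented out so the generation step can be checked locally before anything is queued.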

Related

Iterations of a bash script to run in parallel

I have a bash script that looks like below.
$TOOL is another script which runs twice with different inputs (VAR1 and VAR2).
#Iteration 1
${TOOL} -ip1 ${VAR1} -ip2 ${FINAL_PML}/$1$2.txt -p ${IP} -output_format ${MODE} -o ${FINAL_MODE_DIR1}
rename mods mode_c_ ${FINAL_MODE_DIR1}/*.xml
#Iteration 2
${TOOL} -ip1 ${VAR2} -ip2 ${FINAL_PML}/$1$2.txt -p ${IP} -output_format ${MODE} -o ${FINAL_MODE_DIR2}
rename mods mode_c_ ${FINAL_MODE_DIR2}/*.xml
Can I make these 2 iterations in parallel inside a bash script without submitting it in a queue?
If I read this right, what you want is to run them in the background.
c.f. https://linuxize.com/post/how-to-run-linux-commands-in-background/
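Applied to the script above, that means launching each iteration with a trailing & and then waiting for both before continuing. A minimal sketch, where sleep stands in for the real ${TOOL} invocations from the question:

```shell
#!/bin/bash
# Sketch: run both iterations in the background, then wait for both to finish.
# "sleep 1" is a placeholder for the real ${TOOL} command lines.
run_iteration() {
    sleep 1          # replace with: ${TOOL} -ip1 ... -o ... ; rename ...
}

run_iteration &      # iteration 1 runs in the background
pid1=$!
run_iteration &      # iteration 2 runs concurrently
pid2=$!

wait "$pid1" "$pid2" # block until both background jobs complete
echo "both iterations finished"
```

Note that the rename step belongs inside each backgrounded unit (here, the function), so it only runs after its own ${TOOL} invocation completes.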
More importantly, if you are going to be writing scripts, PLEASE read the following closely:
https://www.gnu.org/software/bash/manual/html_node/index.html#SEC_Contents
https://mywiki.wooledge.org/BashFAQ/001

For Sun Grid Engine qsub, can you let the job notify multiple email addresses?

I have been using the qsub system for a while now, but this is the first time I have encountered this problem: is there a way to send job notification emails to two or more addresses?
Here is my script header:
#!/bin/bash
#PBS -V
#PBS -l nodes=1:ppn=20
#PBS -l walltime=12:00:00
#PBS -M email1@school.edu,email2@gmail.com
#PBS -N Model_sim
I have tried the following methods:
I searched online for a while for whether I can do something with the -M option, and found only one example here: http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html. However, it did not really work when I listed my two emails the way it is done at the end of that page.
I have also tried some Bash tricks, but it seems I can't put a Bash list after the #PBS -M flag, or it shows a syntax error.
If anyone knows anything regarding this case, thank you for sharing the knowledge! Any suggestions would also be appreciated!
Thanks
It is actually pretty easy, using ; instead of ,:
#PBS -M email1@school.edu;email2@gmail.com

Submit SGE job array with random file names

I have a script that was kicking off ~200 jobs for each sub-analysis. I realized that a job array would probably be much better for this, for several reasons. It seems simple enough but is not quite working for me. My input files are not numbered, so following examples I've seen, I do this first:
INFILE=`sed -n ${SGE_TASK_ID}p <pathto/listOfFiles.txt`
My qsub command takes in quite a few variables, as it is both pulling from and outputting to different directories. $res does not change; $INFILE is what I am looping through.
qsub -q test.q -t 1-200 -V -sync y -wd ${res} -b y perl -I /master/lib/ myanalysis.pl -c ${res}/${INFILE}/configFile-${INFILE}.txt -o ${res}/${INFILE}/
Since this was not working, I was curious as to what exactly was being passed. So I did an echo on this and saw that it only seems to expand up to the first time $INFILE is used. So I get:
perl -I /master/lib/ myanalysis.pl -c mydirectory/fileABC/
instead of:
perl -I /master/lib/ myanalysis.pl -c mydirectory/fileABC/configFile-fileABC.txt -o mydirectory/fileABC/
Hoping for some clarity on this and welcome all suggestions. Thanks in advance!
UPDATE: It doesn't look like $SGE_TASK_ID is set on the cluster. I looked for any variable that could be used for an array ID and couldn't find anything. If I see anything else I will update again.
Assuming you are using a grid engine variant, SGE_TASK_ID should be set within the job. It looks like you are expecting it to be set to something useful before you run qsub. Submitting a script like this would do roughly what you appear to be trying to do:
#!/bin/bash
INFILE=$(sed -n ${SGE_TASK_ID}p <pathto/listOfFiles.txt)
exec perl -I /master/lib/ myanalysis.pl -c ${res}/${INFILE}/configFile-${INFILE}.txt -o ${res}/${INFILE}/
Then submit this script with
res=${res} qsub -q test.q -t 1-200 -V -sync y -wd ${res} myscript.sh
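To see how the per-task lookup works without the cluster, you can fake SGE_TASK_ID locally: sed -n "Np" prints only line N of the file, so each array task picks exactly one entry from the list. A small sketch with illustrative file names:

```shell
#!/bin/bash
# Sketch: each array task selects its own line from the file list.
# The scheduler sets SGE_TASK_ID inside the job; here we set it by hand.
printf 'fileABC\nfileDEF\nfileGHI\n' > listOfFiles.txt

SGE_TASK_ID=2
INFILE=$(sed -n "${SGE_TASK_ID}p" listOfFiles.txt)
echo "$INFILE"   # prints fileDEF
```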

How to iterate over files in many folders

I have 15 folders, and each folder contains a *.gz file. I would like to use that file with one of the packages to do some filtering.
For this I would like to write something that can open each folder, read that specific file, perform the actions mentioned, and then save the results in the same folder with a different extension.
What I did is(PBS Script):
#!/bin/bash
#PBS -N Trimmomatics_filtering
#PBS -l nodes=1:ppn=8
#PBS -l walltime=04:00:00
#PBS -l vmem=23gb
#PBS -q ext_chem_guest
# Go to the Trimmomatics directory
cd /home/tb44227/bioinfo_packages/Trimmomatic/Trimmomatic-0.36
# Java module load
module load java/1.8.0-162
# Input File (I have a list of 15 folders and each contained fastq.gz file)
inputFile= for f in /home/tb44227/nobackup/small_RNAseq_260917/support.igatech.it/sequences-export/536-RNA-seq_Disco_TuDO/delivery_25092017/754_{1..15}/*fastq.gz; $f
# Start the code to filter the file and save the results in the same folder where the input file is
java -jar trimmomatic-0.36.jar SE -threads ${PBS_NUM_PPN} -phred33 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:17 $inputFile $outputFile
# Output File
outputFile=$inputFile{.TRIMMIMG}
My question is: how should I define $inputFile and $outputFile so that the job reads all 15 files?
Thanks
If your application only processes a single input file at a time, you have two options:
Process all files in one single job
Process each file in a different job
From the user's perspective, the second option is usually more interesting, as multiple jobs may run simultaneously if resources are available. However, this depends on the number of files you need to process and your system's usage policy, as sending too many jobs in a short amount of time can cause problems in the job scheduler.
The first option is, more or less, what you already have. You can use the find program and a simple bash loop: store the find output in a variable, then iterate over it, as in this example:
#!/bin/bash
# PBS job parameters
module load java
root_dir=/home/tb44227/nobackup/small_RNAseq_260917/support.igatech.it/sequences-export/536-RNA-seq_Disco_TuDO/delivery_25092017
# Get all files to be processed
files=$(find "$root_dir" -type f -name "*fastq.gz")
for inputfile in $files; do
outputfile="${inputfile}.TRIMMING"
# Process one file at a time
java -jar ... $inputfile $outputfile
done
Then, you just submit your job script, which will generate a single job.
$ qsub myjobscript.sh
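As an aside, the output name is best derived with bash parameter expansion rather than the $inputFile{...} form from the question, which bash does not interpret as a suffix. A sketch, where the path and the .trimmed suffix are illustrative:

```shell
#!/bin/bash
# Sketch: derive an output path next to the input via parameter expansion.
inputfile=/data/754_1/sample.fastq.gz                  # illustrative input path
outputfile="${inputfile%.fastq.gz}.trimmed.fastq.gz"   # strip suffix, re-append
echo "$outputfile"   # prints /data/754_1/sample.trimmed.fastq.gz
```

Because the expansion only rewrites the filename string, the result lands in the same folder as the input, which is what the question asks for.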
The second option is more powerful, but requires a different jobscript for each file. Most job managers let you pass the job script through standard input. This is really helpful because it avoids generating intermediate files, which would pollute your directories.
#!/bin/bash
function submit_job() {
# Submit job. Jobscript passed through standard input using a HEREDOC.
# Must define $inputfile and $outputfile before calling the function.
qsub - <<- EOF
# PBS job parameters
module load java
# Process a single file only
java -jar ... $inputfile $outputfile
EOF
}
root_dir=/home/tb44227/nobackup/small_RNAseq_260917/support.igatech.it/sequences-export/536-RNA-seq_Disco_TuDO/delivery_25092017
# Get all files to be processed
files=$(find "$root_dir" -type f -name "*fastq.gz")
for inputfile in $files; do
outputfile="${inputfile}.TRIMMING"
submit_job
done
Since you are calling qsub inside the script, you just need to call the script itself, like any regular shell script file.
$ bash multijobscript.sh

Variable Not getting recognized in shell script

I use the following shell script to run a simulation on my cluster.
#PBS -N 0.05_0.05_m_1_200k
#PBS -l nodes=1:ppn=1,pmem=1000mb
#PBS -S /bin/bash
#$ -m n
#$ -j oe
FOLDER= 0.57
WDIR=/home/vikas/ala_1_free_energy/membrane_200k/restraint_decoupling_pullinit_$FOLDER
cd /home/vikas/ala_1_free_energy/membrane_200k/restraint_decoupling_pullinit_$FOLDER
LAMBDA= 0.05
/home/durba/gmx455/bin/mdrun -np 1 -deffnm md0.05 -v
############################
Now my problem is that my script doesn't recognize the variable FOLDER and throws an error
couldn't find md0.05.tpr
even though the file exists in the folder. If I write 0.57 in place of $FOLDER, it works fine, which makes me think it's not recognizing the variable FOLDER. LAMBDA is recognized perfectly in both cases. If somebody can help me here, I will be grateful.
There should not be a space between the = and the value you wish to assign to the variables:
FOLDER="0.57"
WDIR="/home/vikas/ala_1_free_energy/membrane_200k/restraint_decoupling_pullinit_$FOLDER"
cd "/home/vikas/ala_1_free_energy/membrane_200k/restraint_decoupling_pullinit_$FOLDER"
LAMBDA="0.05"
/home/durba/gmx455/bin/mdrun -np 1 -deffnm md0.05 -v
############################
The double quotes "" I added are not strictly necessary for this example, but it is good practice to get into the habit of using them.
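The difference is easy to demonstrate: with a space after the =, bash does not perform an assignment at all. Instead it runs "0.57" as a command, with FOLDER set to the empty string in that command's environment. A minimal sketch:

```shell
#!/bin/bash
# Correct: no spaces around "=", so the value is assigned to FOLDER.
FOLDER="0.57"
echo "FOLDER is $FOLDER"

# Wrong: `FOLDER= 0.57` would instead try to execute a program named "0.57"
# with FOLDER empty in its environment ("0.57: command not found"),
# which is why the later cd built an incomplete path and md0.05.tpr was not found.
```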
