Running different tasks on individual resource sets within same node - bash

I asked about this issue before using a different approach (Having issues running mpi4py on large HPC system. Receiving startup errors and sometimes variable errors); I am now trying two other approaches, without success. All of the examples below still put the same task on each of the six resource sets.
Background: I'm attempting to distribute predictions across the resource sets (RS) on a node. Each resource set contains 1 GPU and 7 CPUs, and there are six sets per node. Once an RS task completes, it should move on to the next prediction in its list (part00.lst through part05.lst; in theory one list per RS).
The first approach looks something like this (a submission bash script calls it using jsrun -r6 -g1 -a1 -c7 -b packed:7 -d packed -l gpu-cpu):
#!/bin/bash
output=/path/ ##where completed predictions will be collected
for i in {0..5}; do
target=part0${i}.lst
........ ##the singularity job script to execute using $target and $output variables
done
The next attempt uses simultaneous job steps via UNIX backgrounding (which others have been able to adapt for similar purposes, but for different jobs and tasks). Here I created six separate bash files, each with its corresponding input file ($target, i.e. part00.lst through part05.lst):
#!/bin/bash
## Various submission flags here
for i in {0..5}; do
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_0${i}.sh &
done
wait
I also attempted just hardcoding the six separate bash files:
#!/bin/bash
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_00.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_01.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_02.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_03.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_04.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_05.sh &
wait
Thanks for any help! I'm still quite new to all of this!

Okay, attempt number two using simultaneous job steps/UNIX process backgrounding was nearly correct!
It now works. An example for one node:
Submission script:
#!/bin/bash
## Various submission flags here
for i in {0..5}; do
jsrun -n 1 -r 1 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_0${i}.sh &
done
wait
It was only a matter of incorrect flags (-n 1 -r 1, not -n 1 -r 6).
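For context, each batch_run_0N.sh then only has to walk its own list. Below is a minimal sketch of what such a script could contain, assuming each partNN.lst holds one prediction target per line. The names run_prediction and process_list are hypothetical, and the echo stands in for the real singularity command, which the question does not show.

```shell
#!/bin/bash
# Hypothetical body of batch_run_00.sh: one resource set works through one list.

# Placeholder for the real singularity call (not shown in the question).
run_prediction() {
    local target=$1 output=$2
    echo "predict ${target} -> ${output}"
}

# Read the list line by line and run one prediction per entry.
process_list() {
    local list=$1 output=$2 target
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r target || [ -n "$target" ]; do
        [ -n "$target" ] || continue      # skip blank lines
        run_prediction "$target" "$output"
    done < "$list"
}

# Example call (commented out so the sketch has no side effects):
# process_list part00.lst /path/
```

Because each jsrun step starts one such script in its own resource set, the six lists are processed side by side, and each set moves to its next entry as soon as the previous prediction returns.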

Using bash how to iterate through lines in a txt file and sequentially pair up every two lines

Hi, I am attempting to use bash to iterate through a .txt file that contains the following lines. This is a small subset of the full list of fastq files, but all samples follow the same pattern.
/path/10-xxx-111-sample_S1_R1.fastq.gz
/path/10-xxx-111-sample_S1_R2.fastq.gz
/path/12-xxx-222-sample_S2_R1.fastq.gz
/path/12-xxx-222-sample_S2_R2.fastq.gz
/path/13-xxx-333-sample_S3_R1.fastq.gz
/path/13-xxx-333-sample_S3_R2.fastq.gz
And the aim is to pair every two lines and use the paths to provide information to further code in bash.
bwa mem ${index} ${r1} ${r2} -M -t 8 \
-R "@RG\tID:FlowCell.${name}\tSM:${name}\tPL:illumina\tLB:${Job}.${name}" | \
samtools sort -O bam -o ${bam}/${name}_bwa_output.bam
The first R1 and R2 should correspond to ${r1} and ${r2} respectively, and the pairs should be processed in sequential order.
The ${name} values are contained in another file and consist of "10-xxx-111-sample_S1_" type information.
Any help in iterating through this text file to inform the downstream code would be really appreciated.
Intended output: First two lines of the .txt file will inform downstream code. e.g.
bwa mem ${index} /path/10-xxx-111-sample_S1_R1.fastq.gz /path/10-xxx-111-sample_S1_R2.fastq.gz -M -t 8 \
Following this, the next two lines will inform the downstream code and so forth. e.g.
bwa mem ${index} /path/12-xxx-222-sample_S2_R1.fastq.gz /path/12-xxx-222-sample_S2_R2.fastq.gz -M -t 8 \
Why not read two lines at a time? Remove the echo before bwa once you are satisfied with the result.
$ cat myScript.sh
#!/usr/bin/env bash
while IFS= read -r r1 && IFS= read -r r2; do
echo bwa mem "${index}" "$r1" "$r2" -M -t 8 ...
done < fastq_filepaths.txt
$ index=myIndex ./myScript.sh
bwa mem myIndex /path/10-xxx-111-sample_S1_R1.fastq.gz /path/10-xxx-111-sample_S1_R2.fastq.gz -M -t 8 ...
bwa mem myIndex /path/12-xxx-222-sample_S2_R1.fastq.gz /path/12-xxx-222-sample_S2_R2.fastq.gz -M -t 8 ...
bwa mem myIndex /path/13-xxx-333-sample_S3_R1.fastq.gz /path/13-xxx-333-sample_S3_R2.fastq.gz -M -t 8 ...
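The question keeps the ${name} values in a separate file, but they look derivable from the R1 path itself. The sketch below assumes that every R1 path ends in R1.fastq.gz and that the sample name is the file name with that suffix removed. pair_reads is a hypothetical helper that prints name, R1 and R2 as tab-separated fields instead of calling bwa.

```shell
#!/usr/bin/env bash
# pair_reads (hypothetical name) reads paths from stdin, two lines per sample.
pair_reads() {
    local r1 r2 name
    while IFS= read -r r1 && IFS= read -r r2; do
        name=${r1##*/}              # drop the directory part
        name=${name%R1.fastq.gz}    # drop the read suffix, keep the trailing "_"
        printf '%s\t%s\t%s\n' "$name" "$r1" "$r2"
    done
}

# Example call:
# pair_reads < fastq_filepaths.txt
```

An unpaired last line is silently skipped, because the second read fails and ends the loop.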

Issue downloading multiple files with given names in Bash

I'm trying to download multiple files in parallel using xargs. This works well if I only download the files without giving them names: echo ${links[@]} | xargs -P 8 -n 1 wget. Is there any way to download with a file name, as in wget -O [filename] [URL], but in parallel?
Below is my attempt. Thank you.
links=(
"https://apod.nasa.gov/apod/image/1901/sombrero_spitzer_3000.jpg"
"https://apod.nasa.gov/apod/image/1901/orionred_WISEantonucci_1824.jpg"
"https://apod.nasa.gov/apod/image/1901/20190102UltimaThule-pr.png"
"https://apod.nasa.gov/apod/image/1901/UT-blink_3d_a.gif"
"https://apod.nasa.gov/apod/image/1901/Jan3yutu2CNSA.jpg"
)
names=(
"file1.jpg"
"file2.jpg"
"file3.jpg"
"file4.jpg"
"file5.jpg"
)
echo ${links[@]} ${names[@]} | xargs -P 8 -n 1 wget
With GNU Parallel you can do:
parallel wget -O {2} {1} ::: "${links[@]}" :::+ "${names[@]}"
If a download fails, GNU Parallel can also retry the command with --retries 3.
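If GNU Parallel is not available, a similar effect can probably be had with xargs alone by feeding it name/URL pairs and using -n 2, so that each wget call receives two arguments. This is a sketch under assumptions: emit_pairs is a hypothetical helper, the URLs are placeholders, and names and URLs are assumed to contain no whitespace or quotes (xargs would split or reinterpret them).

```shell
#!/usr/bin/env bash
# emit_pairs (hypothetical name) prints "name" then "url" on alternating lines,
# so that xargs -n 2 hands each pair to one "wget -O name url" call.
emit_pairs() {
    local i
    for i in "${!links[@]}"; do
        printf '%s\n%s\n' "${names[i]}" "${links[i]}"
    done
}

links=( "https://example.org/a.jpg" "https://example.org/b.png" )   # placeholder URLs
names=( "file1.jpg" "file2.png" )

# Dry run: remove "echo" to download for real.
emit_pairs | xargs -P 8 -n 2 echo wget -O
```

With -P 8 the commands run concurrently, so the order in which they finish (and print) is not guaranteed.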

Why do I get a "-bash: ROUGE-1.5.5.pl: command not found" error?

I tried to evaluate system-generated summaries using ROUGE. I used the command line below, but I get this error: -bash: ROUGE-1.5.5.pl: command not found. What is the problem?
ROUGE-1.5.5.pl -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -a -d rougejk.in
Bash only searches the folders listed in your PATH variable, and the folder containing "ROUGE-1.5.5.pl" is not one of them. You can check the current value with:
$> echo $PATH
To add the folder containing "ROUGE-1.5.5.pl" to PATH:
PATH="$PATH:/path/to/folder"
You will then be able to run your command as written.
Add this line to your .bashrc if you want the change to be permanent.
Otherwise, run the command with an explicit path (from the folder that contains the script):
./ROUGE-1.5.5.pl -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -a -d rougejk.in
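To check whether the change took effect, command -v reports whether bash can resolve a name. The sketch below uses a throwaway directory and a dummy script named demo-tool.pl in place of the real ROUGE folder; both names and the on_path helper are made up for illustration. Note that the file also has to be executable.

```shell
#!/bin/bash
# on_path (hypothetical helper) succeeds if bash can resolve the command name.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Throwaway directory standing in for the folder that holds ROUGE-1.5.5.pl.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho "demo tool ran"\n' > "$demo_dir/demo-tool.pl"
chmod +x "$demo_dir/demo-tool.pl"      # without this, bash cannot execute it

on_path demo-tool.pl || echo "before: not found"
PATH="$PATH:$demo_dir"                 # same idea as PATH="$PATH:/path/to/folder"
on_path demo-tool.pl && echo "after: found"
```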

Run netlogo in parallel mpi using Sun Grid Engine

#!/bin/bash
#$ -N new
#$ -q all.q
#$ -pe mpi 30
unset SGE_ROOT
/opt/mpi/1.8.1/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines /home/abhishekb/netlogo/netlogo-5.2.0/netlogo-headless.sh \
--model /home/abhishekb/scale_med/try4.nlogo \
--experiment experiment1 \
--table /home/abhishekb/Trash/anything.csv
Error:
The: Command not found.
queuing: Command not found.
time-to-exit: Command not found.
Badly placed ()'s.
Output:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
--------------------------------------------------------------------------
A hostfile was provided that contains at least one node not
present in the allocation:
hostfile: /tmp/8396.1.all.q/machines
node: compute-0-1
If you are operating in a resource-managed environment, then only
nodes that are in the allocation can be used in the hostfile. You
may find relative node syntax to be a useful alternative to
specifying absolute node names see the orte_hosts man page for
further information.
--------------------------------------------------------------------------
PE file:
rm: cannot remove `/tmp/8396.1.all.q/rsh': No such file or directory
Earlier, I used to run the below:
#!/bin/bash
#$ -N new
#$ -q all.q
#$ -pe mpi 30
/home/abhishekb/netlogo/netlogo-5.2.0/netlogo-headless.sh \
--model /home/abhishekb/std_low/try4.nlogo \
--experiment experiment1 \
--table /home/abhishekb/Trash/anything.csv \
--threads 30
This simply runs on just one core (as checked on the HPC side), even though it reserves 30 slots.
Edit:
Doc for submitting jobs:
http://it.iiitd.edu.in/HPC_final_doc.pdf (please refer to pages 4 and 5, section 10, "Job submission steps").
The job was submitted with qsub <filename.sh>.

Create a new batch .txt file with specified content for every current file in a directory

I have a huge list of files on a cluster and I need to create a .txt file for each "pair". Each pair consists of filename_R1.fq.gz and filename_R2.fq.gz. For each pair of R1 and R2 files I need to create a text file that contains:
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \
--phred33 \
--fast-local \
-X 1000 \
-p 12 \
-x /usr3/graduate/dhc285/reference_files/21G6 \
-1 $i -2 ${i%_R1.fq.gz}_R2.fq.gz \
| samtools view -bS - > ${i%_R1.fq.gz}.bam
where $i refers to my filenames. I would also like each file to be named ${i%_R1.fq.gz}.txt. Thanks!
Using GNU Parallel it looks like this:
sge_jobfile() {
i="$1"
cat <<EOF > ${i%_R1.fq.gz}.txt
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \\
--phred33 \\
--fast-local \\
-X 1000 \\
-p 12 \\
-x /usr3/graduate/dhc285/reference_files/21G6 \\
-1 $i -2 ${i%_R1.fq.gz}_R2.fq.gz \\
| samtools view -bS - > ${i%_R1.fq.gz}.bam
EOF
}
export -f sge_jobfile
parallel sge_jobfile ::: *_R1.fq.gz
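For comparison, the same job files can be produced with a plain for loop when GNU Parallel is not installed. This is a sketch that reuses the template from the question; write_jobfile is a hypothetical name, and the loop assumes it is run in the directory that holds the *_R1.fq.gz files.

```shell
#!/bin/bash
# write_jobfile (hypothetical name) writes one SGE job file per R1 file.
write_jobfile() {
    local i=$1
    local base=${i%_R1.fq.gz}
    # Unquoted EOF: $i and ${base} expand now, "\\" becomes a single "\",
    # and "#$ " stays as-is because "$" followed by a space is literal.
    cat <<EOF > "${base}.txt"
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \\
--phred33 \\
--fast-local \\
-X 1000 \\
-p 12 \\
-x /usr3/graduate/dhc285/reference_files/21G6 \\
-1 $i -2 ${base}_R2.fq.gz \\
| samtools view -bS - > ${base}.bam
EOF
}

for f in *_R1.fq.gz; do
    [ -e "$f" ] || continue     # the glob matched nothing
    write_jobfile "$f"
done
```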
GNU Parallel is a general parallelizer that makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs to run on 4 CPUs, a straightforward way to parallelize is to assign 8 jobs to each CPU up front. With that fixed split, a CPU whose jobs happen to be short sits idle while the others are still working.
GNU Parallel instead spawns a new process whenever one finishes, which keeps the CPUs active and thus saves time.
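The scheduling idea can be illustrated in plain bash with wait -n. This is only a sketch of the principle, not how GNU Parallel is implemented; run_pool is a hypothetical helper and it assumes bash 4.3 or newer, where wait -n is available.

```shell
#!/bin/bash
# run_pool MAX CMD... (hypothetical helper): keeps at most MAX commands running
# and starts the next one as soon as any running command exits.
run_pool() {
    local max=$1; shift
    local running=0 cmd
    for cmd in "$@"; do
        if [ "$running" -ge "$max" ]; then
            wait -n                      # returns once any one background job exits
            running=$((running - 1))
        fi
        bash -c "$cmd" &                 # each command string runs in its own shell
        running=$((running + 1))
    done
    wait                                 # let the remaining jobs finish
}

# Example call: five jobs, at most two at a time.
# run_pool 2 "sleep 3" "sleep 1" "sleep 2" "sleep 1" "sleep 1"
```

In the example call, the second slot picks up the third job as soon as the one-second sleep ends, instead of waiting for the three-second job to finish.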
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
