Create a new batch .txt file with specified content for every current file in a directory - bash

I have a huge list of files on a cluster and I need to create a .txt file for each "pair". Each pair is specified by filename_R1.fq.gz and filename_R2.fq.gz. For each pair of R1 and R2 files I need to create a text file that contains:
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \
--phred33 \
--fast-local \
-X 1000 \
-p 12 \
-x /usr3/graduate/dhc285/reference_files/21G6 \
-1 $i -2 ${i%_R1.fq.gz}_R2.fq.gz \
| samtools view -bS - > ${i%_R1.fq.gz}.bam
Here the $i variable refers to each R1 filename. I would also like each file to be named ${i%_R1.fq.gz}.txt. Thanks!

Using GNU Parallel it looks like this:
sge_jobfile() {
i="$1"
cat <<EOF > "${i%_R1.fq.gz}.txt"
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \\
--phred33 \\
--fast-local \\
-X 1000 \\
-p 12 \\
-x /usr3/graduate/dhc285/reference_files/21G6 \\
-1 $i -2 ${i%_R1.fq.gz}_R2.fq.gz \\
| samtools view -bS - > ${i%_R1.fq.gz}.bam
EOF
}
export -f sge_jobfile
parallel sge_jobfile ::: *_R1.fq.gz
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
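For comparison, the same job files can also be written with a plain for loop. This is only a minimal sketch: the function name write_jobfiles is hypothetical, and skipping an R1 file that has no matching R2 file is an assumption, not something the question asked for.

```shell
#!/usr/bin/env bash
# Write one SGE job file per R1/R2 pair in the current directory.
write_jobfiles() {
    local i base
    for i in *_R1.fq.gz; do
        [ -e "$i" ] || continue               # glob matched nothing
        base="${i%_R1.fq.gz}"
        if [ ! -e "${base}_R2.fq.gz" ]; then  # assumption: skip incomplete pairs
            echo "skipping $i: ${base}_R2.fq.gz not found" >&2
            continue
        fi
        # Unquoted EOF: $i and $base expand now; "\\" is written as "\".
        cat > "${base}.txt" <<EOF
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \\
--phred33 \\
--fast-local \\
-X 1000 \\
-p 12 \\
-x /usr3/graduate/dhc285/reference_files/21G6 \\
-1 $i -2 ${base}_R2.fq.gz \\
| samtools view -bS - > ${base}.bam
EOF
    done
}
```

Calling write_jobfiles from the directory that holds the .fq.gz files produces one .txt file per complete pair.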
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
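The "start the next job as soon as a slot is free" behaviour can be tried out without installing anything. The sketch below uses xargs -P as a stand-in rather than GNU Parallel itself; the function name run_jobs and the dummy echo job are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Run N dummy jobs with at most SLOTS of them running at the same time.
# xargs -P starts a new job whenever a running one finishes, instead of
# pre-assigning a fixed share of the jobs to each CPU.
run_jobs() {
    local njobs="$1" slots="$2"
    seq 1 "$njobs" | xargs -P "$slots" -I{} sh -c 'echo "job {} done"'
}
```

run_jobs 32 4 mirrors the 32 jobs on 4 CPUs case; with GNU Parallel the number of simultaneous jobs is set with the -j option.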
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Related

Running different tasks on individual resource sets within same node

I previously asked about this issue using a different approach (Having issues running mpi4py on large HPC system. Receving startup errors and sometimes variable errors). I am now attempting two other approaches, with no success: all examples below still put the same task on each of the six resource sets.
Background: I'm attempting to distribute predictions across resource sets on a node. Each resource set (RS) contains 1 GPU and 7 CPUs, and there are six sets per node. Once an RS task completes, it should move on to the next prediction in its list (part00.lst through part05.lst; in theory one per RS).
First approach looks something like this (a submission bash script calls this using jsrun -r6 -g1 -a1 -c7 -b packed:7 -d packed -l gpu-cpu):
#!/bin/bash
output=/path/ ##where completed predictions will be collected
for i in {0..5}; do
target=part0${i}.lst
........ ##the singularity job script to execute using $target and $output variables
done
The next attempt uses simultaneous job steps via UNIX backgrounding (which others have been able to adapt to do similar things to what I want, but for different jobs and tasks). Here I created six separate bash files, each with its corresponding input file ($target, i.e. part00.lst through part05.lst):
#!/bin/bash
## Various submission flags here
for i in {0..5}; do
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_0${i}.sh &
done
wait
I also attempted just hardcoding the six separate bash files:
#!/bin/bash
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_00.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_01.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_02.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_03.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_04.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_05.sh &
wait
Thanks for any help! I'm still quite new to all of this!
Okay, attempt number two using simultaneous job steps/UNIX process backgrounding was nearly correct!
It now works. An example for one node:
Submission script:
#!/bin/bash
## Various submission flags here
for i in {0..5}; do
jsrun -n 1 -r 1 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_0${i}.sh &
done
wait
It was only a matter of incorrect flags (-n 1 -r 1, not -n 1 -r 6).
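Before submitting, the six job steps can be checked with a dry run. The following is only a sketch: the function name launch_steps and the LAUNCH variable are hypothetical, and the jsrun flags are copied from this thread rather than verified against a particular system.

```shell
#!/usr/bin/env bash
# Start one backgrounded job step per resource set, then wait for all of them.
# LAUNCH is a hypothetical switch: unset means dry run (the commands are only
# printed); set it to an empty string to really call jsrun.
launch_steps() {
    local n="$1" i script
    for ((i = 0; i < n; i++)); do
        printf -v script 'batch_run_%02d.sh' "$i"
        ${LAUNCH-echo} jsrun -n 1 -r 1 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu \
            bash "$script" &
    done
    wait
}
```

launch_steps 6 prints the six commands (possibly in a different order, since each one is backgrounded); LAUNCH= launch_steps 6 would execute them inside a real allocation.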

Output multiple files SGE

I have a .sh script that looks like this:
#$ -t 1-8
#$ -tc 8
#list of tasks
task_list=$( sed "${SGE_TASK_ID}q;d" list_of_jobs.txt )
#python script
./ldsc.py \
--h2 ${task_list} \
--ref-ld-chr /baselineLD. \
--out $cts_name
I am running 8 jobs in parallel, but need each of them to output a separate file using the --out flag.
How can I do this?
The list_of_jobs.txt is a list of eight files (tasks) that get analyzed.
file1.txt
file2.txt
file3.txt
…
file8.txt
Figured it out.
basename="${cts_name##*/}"
and then
--h2 $basename was the trick!
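Another way to give every array task its own output file is to derive the --out prefix from the line that the task reads. This is a minimal sketch, assuming each line of list_of_jobs.txt is a file path; the helper names task_input and out_prefix are hypothetical.

```shell
#!/usr/bin/env bash
# Print line number $1 of file $2 (the array task's input).
task_input() {
    sed -n "${1}p" "$2"
}

# Turn an input path into an output prefix: drop the directory, then the
# last extension, e.g. /data/file3.txt -> file3.
out_prefix() {
    local name="${1##*/}"
    printf '%s\n' "${name%.*}"
}
```

Inside the job script this could be used as input=$(task_input "$SGE_TASK_ID" list_of_jobs.txt), followed by --h2 "$input" --out "$(out_prefix "$input")" on the ldsc.py command line, so that task 1 writes to file1, task 2 to file2, and so on.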

Using bash how to iterate through lines in a txt file and sequentially pair up every two lines

Hi I am attempting to use bash to iterate through a .txt file which contains the following lines. This is a smaller subset of the full list of fastq files, but all samples follow the same patterns.
/path/10-xxx-111-sample_S1_R1.fastq.gz
/path/10-xxx-111-sample_S1_R2.fastq.gz
/path/12-xxx-222-sample_S2_R1.fastq.gz
/path/12-xxx-222-sample_S2_R2.fastq.gz
/path/13-xxx-333-sample_S3_R1.fastq.gz
/path/13-xxx-333-sample_S3_R2.fastq.gz
And the aim is to pair every two lines and use the paths to provide information to further code in bash.
bwa mem ${index} ${r1} ${r2} -M -t 8 \
-R "@RG\tID:FlowCell.${name}\tSM:${name}\tPL:illumina\tLB:${Job}.${name}" | \
samtools sort -O bam -o ${bam}/${name}_bwa_output.bam
The first R1 and R2 should correspond to ${r1} and ${r2} respectively, and in sequential order.
The ${name}'s are contained in another file and consist of "10-xxx-111-sample_S1_" type information.
Any help in iterating through this text file to inform the downstream code would be really appreciated.
Intended output: First two lines of the .txt file will inform downstream code. e.g.
bwa mem ${index} /path/10-xxx-111-sample_S1_R1.fastq.gz /path/10-xxx-111-sample_S1_R2.fastq.gz -M -t 8 \
Following this, the next two lines will inform the downstream code and so forth. e.g.
bwa mem ${index} /path/12-xxx-222-sample_S2_R1.fastq.gz /path/12-xxx-222-sample_S2_R2.fastq.gz -M -t 8 \
Why not read two lines at a time? Remove the echo before bwa once you are satisfied with the result.
$ cat myScript.sh
#!/usr/bin/env bash
while IFS= read -r r1 && IFS= read -r r2; do
echo bwa mem "${index}" "$r1" "$r2" -M -t 8 ...
done < fastq_filepaths.txt
$ index=myIndex ./myScript.sh
bwa mem myIndex /path/10-xxx-111-sample_S1_R1.fastq.gz /path/10-xxx-111-sample_S1_R2.fastq.gz -M -t 8 ...
bwa mem myIndex /path/12-xxx-222-sample_S2_R1.fastq.gz /path/12-xxx-222-sample_S2_R2.fastq.gz -M -t 8 ...
bwa mem myIndex /path/13-xxx-333-sample_S3_R1.fastq.gz /path/13-xxx-333-sample_S3_R2.fastq.gz -M -t 8 ...
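The same two-lines-at-a-time read can also recover ${name} from the R1 path and check that the two lines really belong together. This is a sketch only: the function name pair_reads is hypothetical, and it assumes every file name ends in R1.fastq.gz or R2.fastq.gz as in the listing above.

```shell
#!/usr/bin/env bash
# Read file $1 two lines at a time and print "name<TAB>r1<TAB>r2" per pair.
# Returns 1 as soon as an R1 line is not followed by its matching R2 line.
pair_reads() {
    local r1 r2 name
    while IFS= read -r r1 && IFS= read -r r2; do
        name="${r1##*/}"              # drop the directory
        name="${name%R1.fastq.gz}"    # e.g. 10-xxx-111-sample_S1_
        if [ "${r2##*/}" != "${name}R2.fastq.gz" ]; then
            echo "unpaired lines: $r1 / $r2" >&2
            return 1
        fi
        printf '%s\t%s\t%s\n' "$name" "$r1" "$r2"
    done < "$1"
}
```

The output can be fed to a second loop, for example pair_reads fastq_filepaths.txt | while IFS=$'\t' read -r name r1 r2; do ...; done, which makes all three variables available to the bwa mem command.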

Run netlogo in parallel mpi using Sun Grid Engine

#!/bin/bash
#$ -N new
#$ -q all.q
#$ -pe mpi 30
unset SGE_ROOT
/opt/mpi/1.8.1/bin/mpirun -np $NSLOTS -hostfile $TMPDIR/machines /home/abhishekb/netlogo/netlogo-5.2.0/netlogo-headless.sh \
--model /home/abhishekb/scale_med/try4.nlogo \
--experiment experiment1 \
--table /home/abhishekb/Trash/anything.csv
Error:
The: Command not found.
queuing: Command not found.
time-to-exit: Command not found.
Badly placed ()'s.
Output:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
--------------------------------------------------------------------------
A hostfile was provided that contains at least one node not
present in the allocation:
hostfile: /tmp/8396.1.all.q/machines
node: compute-0-1
If you are operating in a resource-managed environment, then only
nodes that are in the allocation can be used in the hostfile. You
may find relative node syntax to be a useful alternative to
specifying absolute node names see the orte_hosts man page for
further information.
--------------------------------------------------------------------------
PE file:
rm: cannot remove `/tmp/8396.1.all.q/rsh': No such file or directory
Earlier, I used to run the below:
#!/bin/bash
#$ -N new
#$ -q all.q
#$ -pe mpi 30
/home/abhishekb/netlogo/netlogo-5.2.0/netlogo-headless.sh \
--model /home/abhishekb/std_low/try4.nlogo \
--experiment experiment1 \
--table /home/abhishekb/Trash/anything.csv \
--threads 30
which simply runs on just one core (as checked on the HPC side), even though it reserves 30 slots.
Edit:
Doc for submitting jobs:
http://it.iiitd.edu.in/HPC_final_doc.pdf (please refer to pages 4 and 5, section 10, "Job submission steps").
The job was submitted with qsub <filename.sh>

Insert string after match in variable

I am trying to build a workaround for a problem.
We have a GTK+ program that calls a bash script, which in turn calls rdesktop.
On one machine, we discovered that the rdesktop call needs one extra parameter.
Since I didn't write any of this code and cannot modify the GTK part, I can only edit the bash script that sits between the two calls.
I have a variable called CMD that contains something like:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:HP_Officejet_Pro_8600 -a 16 -u -p -d -g 80% 192.168.0.5
I need to edit this line on the fly so that, when the printer parameter exists, ="MS Publisher Imagesetter" is appended after the printer name.
The best I have accomplished so far is:
ladb@luisdesk ~ $ input="rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:HP_Officejet_Pro_8600 -a 16 -u -p -d -g 80% 192.168.0.5"
ladb@luisdesk ~ $ echo $input | sed s/'printer:.*a /=\"MS Publisher Imagesetter\" '/
Which returns:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r ="MS Publisher Imagesetter" 16 -u -p -d -g 80% 192.168.0.5
That is almost it, but I need to append the string, not replace the match.
Any help?
Edit: I pasted incomplete examples; fixed now.
Edit2:
With the help of the answers, I ended up with:
echo "$input" | sed 's/\(printer:\)\([^ ]*\)/\1\2="MS Publisher Imagesetter"/'
If you want the output to look like:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:"HP_Officejet_Pro_8600 MS Publisher Imagesetter" -a 16 -u -p -d -g 80% 192.168.0.5
This sed will do it. It captures the printer: prefix first, then the existing printer name, and wraps the name plus the appended string in quotes. If that is not the layout you want, adjust the replacement to put the quotes and spacing where you need them:
input="rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:HP_Officejet_Pro_8600 -a 16 -u -p -d -g 80% 192.168.0.5"
echo "$input" | sed 's/\(printer:\)\([^ ]*\)/\1"\2 MS Publisher Imagesetter"/'
output:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:"HP_Officejet_Pro_8600 MS Publisher Imagesetter" -a 16 -u -p -d -g 80% 192.168.0.5
You can use this:
sed 's/printer:[^=]\+=/\0 "MS Publisher Imagesetter"/' <<< "$input"
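The \0 in the replacement stands for the whole match (in GNU sed; & is the portable spelling). Note that this pattern only matches when the printer name is already followed by an =.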
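Since the edit has to happen inside the wrapper script, the substitution can be wrapped in a small function and applied to CMD before rdesktop is called. This is a minimal sketch: the function name add_printer_driver is hypothetical, and it assumes the printer name contains no spaces and the driver string contains no / or & characters.

```shell
#!/usr/bin/env bash
# Append ="<driver>" to the first printer:<name> argument of a command line.
# A command line without a printer: argument is returned unchanged.
add_printer_driver() {
    local cmd="$1" driver="${2:-MS Publisher Imagesetter}"
    # & stands for the matched text, i.e. printer:<name>
    printf '%s\n' "$cmd" | sed "s/printer:[^ ]*/&=\"$driver\"/"
}
```

In the script this would be used as CMD=$(add_printer_driver "$CMD"), which turns printer:HP_Officejet_Pro_8600 into printer:HP_Officejet_Pro_8600="MS Publisher Imagesetter".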

Resources