Using bash, how to iterate through lines in a txt file and sequentially pair up every two lines

Hi, I am attempting to use bash to iterate through a .txt file which contains the following lines. This is a smaller subset of the full list of fastq files, but all samples follow the same pattern.
/path/10-xxx-111-sample_S1_R1.fastq.gz
/path/10-xxx-111-sample_S1_R2.fastq.gz
/path/12-xxx-222-sample_S2_R1.fastq.gz
/path/12-xxx-222-sample_S2_R2.fastq.gz
/path/13-xxx-333-sample_S3_R1.fastq.gz
/path/13-xxx-333-sample_S3_R2.fastq.gz
And the aim is to pair every two lines and use the paths to provide information to further code in bash.
bwa mem ${index} ${r1} ${r2} -M -t 8 \
-R "#RG\tID:FlowCell.${name}\tSM:${name}\tPL:illumina\tLB:${Job}.${name}" | \
samtools sort -O bam -o ${bam}/${name}_bwa_output.bam
The first R1 and R2 should correspond to ${r1} and ${r2} respectively, and in sequential order.
The ${name}'s are contained in another file and consist of "10-xxx-111-sample_S1_" type information.
Any help in iterating through this text file to inform the downstream code would be really appreciated.
Intended output: First two lines of the .txt file will inform downstream code. e.g.
bwa mem ${index} /path/10-xxx-111-sample_S1_R1.fastq.gz /path/10-xxx-111-sample_S1_R2.fastq.gz -M -t 8 \
Following this, the next two lines will inform the downstream code and so forth. e.g.
bwa mem ${index} /path/12-xxx-222-sample_S2_R1.fastq.gz /path/12-xxx-222-sample_S2_R2.fastq.gz -M -t 8 \

Why not read 2 lines at a time? Remove the echo before bwa... once you're satisfied with the result.
$ cat myScript.sh
#!/usr/bin/env bash
while IFS= read -r r1 && IFS= read -r r2; do
    echo bwa mem "${index}" "$r1" "$r2" -M -t 8 ...
done < fastq_filepaths.txt
$ index=myIndex ./myScript.sh
bwa mem myIndex /path/10-xxx-111-sample_S1_R1.fastq.gz /path/10-xxx-111-sample_S1_R2.fastq.gz -M -t 8 ...
bwa mem myIndex /path/12-xxx-222-sample_S2_R1.fastq.gz /path/12-xxx-222-sample_S2_R2.fastq.gz -M -t 8 ...
bwa mem myIndex /path/13-xxx-333-sample_S3_R1.fastq.gz /path/13-xxx-333-sample_S3_R2.fastq.gz -M -t 8 ...
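If you also need ${name} for the read-group string, it can be derived from the R1 path itself rather than read from the separate file. A minimal sketch, assuming the "10-xxx-111-sample_S1_" naming convention described above and that ${index}, ${Job} and ${bam} are already set in the environment:
#!/usr/bin/env bash
while IFS= read -r r1 && IFS= read -r r2; do
    name=${r1##*/}            # strip the directory part
    name=${name%R1.fastq.gz}  # strip the R1 suffix -> e.g. 10-xxx-111-sample_S1_
    bwa mem "${index}" "$r1" "$r2" -M -t 8 \
        -R "@RG\tID:FlowCell.${name}\tSM:${name}\tPL:illumina\tLB:${Job}.${name}" | \
        samtools sort -O bam -o "${bam}/${name}_bwa_output.bam"
done < fastq_filepaths.txt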

Related

Running different tasks on individual resource sets within same node

I asked about an issue I had with this using a different approach (Having issues running mpi4py on large HPC system. Receiving startup errors and sometimes variable errors); however, I'm currently attempting two other approaches, with no success. All examples below still put the same task on each of the six resource sets.
Background: I'm attempting to distribute predictions across resource sets on a node. Each resource set contains 1 GPU and 7 CPUs, and there are six sets per node. Once a resource-set task completes, it should move on to the next prediction in a list (part00.lst through part05.lst; in theory, one per resource set).
The first approach looks something like this (a submission bash script calls this using jsrun -r6 -g1 -a1 -c7 -b packed:7 -d packed -l gpu-cpu):
#!/bin/bash
output=/path/  ## where completed predictions will be collected
for i in {0..5}; do
    target=part0${i}.lst
    ........  ## the singularity job script to execute, using the $target and $output variables
done
The next attempt uses simultaneous job steps via UNIX backgrounding (which others have adapted to do similar things to what I want, but for different jobs and tasks). Here I created six separate bash files, each with its corresponding input file ($target, i.e. part00.lst through part05.lst):
#!/bin/bash
## Various submission flags here
for i in {0..5}; do
    jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_0${i}.sh &
done
wait
I also attempted just hardcoding the six separate bash files:
#!/bin/bash
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_00.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_01.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_02.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_03.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_04.sh &
jsrun -r 6 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_05.sh &
wait
Thanks for any help! I'm still quite new to all of this!
Okay, attempt number two using simultaneous job steps/UNIX process backgrounding was nearly correct!
It now works. An example for one node:
Submission script:
#!/bin/bash
## Various submission flags here
for i in {0..5}; do
    jsrun -n 1 -r 1 -g 1 -a 1 -c 7 -brs -d packed -l gpu-cpu bash batch_run_0${i}.sh &
done
wait
It was only a matter of incorrect flags (-n 1 -r 1, not -n 1 -r 6).

Passing multiple arguments to parallel function when uploading to FTP

I'm using ncftpput to upload images to an FTP server.
An example of the script is
# destination, origin
ncftpput -R ftp_server icon_d2/cape_cin ./cape_cin_*.png
ncftpput -R ftp_server icon_d2/t_v_pres ./t_v_pres_*.png
ncftpput -R ftp_server icon_d2/it/cape_cin ./it/cape_cin_*.png
ncftpput -R ftp_server icon_d2/it/t_v_pres ./it/t_v_pres_*.png
I'm trying to parallelize this with GNU parallel but I'm struggling to pass the arguments to ncftpput. I know I'm doing something wrong but somehow cannot find the solution.
If I construct the array of what I need to upload
images_output=("cape_cin" "t_v_pres")
# suffix for naming
projections_output=("" "it/")
# remote folder on server
projections_output_folder=("icon_d2" "icon_d2/it")
# Create a list of all the images to upload
upload_elements=()
for i in "${!projections_output[@]}"; do
    for j in "${images_output[@]}"; do
        upload_elements+=("${projections_output_folder[$i]}/${j} ./${projections_output[$i]}${j}_*.png")
    done
done
Then I can do the upload in serial like this:
for k in "${upload_elements[@]}"; do
    ncftpput -R ftp_server ${k}
done
When using parallel, I'm using --colsep to separate the arguments:
parallel -j 5 --colsep ' ' ncftpput -R ftp_server ::: "${upload_elements[@]}"
but ncftpput gives an error that tells me it is not understanding the structure of the passed argument.
What am I doing wrong?
Try:
parallel -j 5 --colsep ' ' eval ncftpput -R ftp_server ::: "${upload_elements[@]}"
This should do exactly the same:
for k in "${upload_elements[@]}"; do
    echo ncftpput -R ftp_server ${k}
done | parallel -j 5
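To preview what will actually run before uploading anything, GNU parallel can print the composed commands instead of executing them via its --dry-run option, e.g.:
parallel --dry-run -j 5 --colsep ' ' eval ncftpput -R ftp_server ::: "${upload_elements[@]}"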

Output multiple files SGE

I have a .sh script that looks like this:
#$ -t 1-8
#$ -tc 8
# list of tasks
task_list=$( sed "${SGE_TASK_ID}q;d" list_of_jobs.txt )
# python script
./ldsc.py \
    --h2 ${task_list} \
    --ref-ld-chr /baselineLD. \
    --out $cts_name
I am running 8 jobs in parallel, but need each of them to output a separate file using the --out flag.
How can I do this?
The list_of_jobs.txt is a list of eight files (tasks) that get analyzed.
file1.txt
file2.txt
file3.txt
…
file8.txt
Figured it out.
basename="${cts_name##*/}"
and then
--out $basename was the trick!
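Put together, a minimal sketch of the array script; deriving the output name directly from the task's input file is an assumption here (the original derives it from $cts_name):
#!/bin/bash
#$ -t 1-8
#$ -tc 8
# pick this task's input file from the list
task_list=$( sed "${SGE_TASK_ID}q;d" list_of_jobs.txt )
# strip any directory part so each task writes its own separate output file
basename="${task_list##*/}"
./ldsc.py \
    --h2 ${task_list} \
    --ref-ld-chr /baselineLD. \
    --out ${basename}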

Insert string after match in variable

I am trying to make a workaround to solve a problem.
We have a gtk+ program that calls a bash script, which in turn calls rdesktop.
On one machine, we discovered that the rdesktop call needs one extra parameter...
Since I didn't write any of this code and I can't modify the GTK part, I can only edit the bash script that sits between the two calls.
I have a variable called CMD with something that looks like:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:HP_Officejet_Pro_8600 -a 16 -u -p -d -g 80% 192.168.0.5
i need to "live edit" this line for when the printer parameter exists, it append ="MS Publisher Imagesetter" after the printer name.
The best i accompplish so far is
ladb@luisdesk ~ $ input="rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:HP_Officejet_Pro_8600 -a 16 -u -p -d -g 80% 192.168.0.5"
ladb@luisdesk ~ $ echo $input | sed s/'printer:.*a /=\"MS Publisher Imagesetter\" '/
Which returns:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r ="MS Publisher Imagesetter" 16 -u -p -d -g 80% 192.168.0.5
Almost there, but I need to append the string, not replace it.
Help?
Edit: I pasted incomplete examples. Fixed.
Edit2:
With the help of those who responded, I ended up with:
echo "$input" | sed 's/\(printer:\)\([^ ]*\)/\1\2="MS Publisher Imagesetter"/'
If you want the output to look like:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:"HP_Officejet_Pro_8600 MS Publisher Imagesetter" -a 16 -u -p -d -g 80% 192.168.0.5
This sed will do: it matches the printer: part first, then the existing printer name, and quotes both. If that's not what you want, you can adjust the replacement to put the quotes/spacing where you want:
input="rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:HP_Officejet_Pro_8600 -a 16 -u -p -d -g 80% 192.168.0.5"
echo "$input" | sed 's/\(printer:\)\([^ ]*\)/\1"\2 MS Publisher Imagesetter"/'
output:
rdesktop -x m -r disk:USBDISK=/media -r disk:user=/home/user/ -r printer:"HP_Officejet_Pro_8600 MS Publisher Imagesetter" -a 16 -u -p -d -g 80% 192.168.0.5
You can use this:
sed 's/printer:[^=]\+=/\0 "MS Publisher Imagesetter"/' <<< "$input"
The \0 in the replacement pattern outputs the match itself.

Create a new batch .txt file with specified content for every current file in a directory

I have a huge list of files on a cluster and I need to create a .txt file for each "pair". Each pair is specified by filename_R1.fq.gz and filename_R2.fq.gz. For each pair of R1 and R2 files I need to create a text file that contains:
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \
--phred33 \
--fast-local \
-X 1000 \
-p 12 \
-x /usr3/graduate/dhc285/reference_files/21G6 \
-1 $i -2 ${i%_R1.fq.gz}_R2.fq.gz \
| samtools view -bS - > ${i%_R1.fq.gz}.bam
Where the $i variable refers to my filenames. I would also like each file to be named ${i%_R1.fq.gz}.txt. Thanks!
Using GNU Parallel it looks like this:
sge_jobfile() {
i="$1"
cat <<EOF > ${i%_R1.fq.gz}.txt
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \\
--phred33 \\
--fast-local \\
-X 1000 \\
-p 12 \\
-x /usr3/graduate/dhc285/reference_files/21G6 \\
-1 $i -2 ${i%_R1.fq.gz}_R2.fq.gz \\
| samtools view -bS - > ${i%_R1.fq.gz}.bam
EOF
}
export -f sge_jobfile
parallel sge_jobfile ::: *_R1.fq.gz
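The generated .txt job files can then be submitted; a minimal sketch, assuming an SGE-style qsub as the #$ directives suggest:
for f in *_R1.fq.gz; do qsub "${f%_R1.fq.gz}.txt"; done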
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
