parallel with multiple scripts - parallel-processing

I have multiple scripts that are chained together: each one uses the output of the previous one. I have several input files in the directory sample that I would like to process in parallel.
Any idea how this is best done?
sample_folder=${working_dir}/samples
input_bam=${sample_folder}/${sample}.bam
samtools fastq -@40 $input_bam > ${init_fastq}
trim_galore out ${sample_folder} $init_fastq
script.py ${preproc_fastq} > ${out_20}
What I started with:
parallel -j 8 script.py -i {} -o ?? -n8 ::: ./sample/*.bam
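One way this could be approached is to wrap the whole chain in a bash function and let GNU Parallel call that function once per BAM file, the same export -f pattern used for sge_jobfile further down. This is only a sketch: the function names, the thread count, the -o option for trim_galore, the _trimmed.fq output name and the .out result file are all assumptions and need to be adapted to the real scripts.

```shell
#!/usr/bin/env bash
# Sketch only: names, options and output paths below are assumptions.

# Strip the directory and the .bam suffix: ./sample/s1.bam -> s1
sample_name() {
    basename "$1" .bam
}

# Run the three steps for one BAM file, one after another.
process_sample() {
    local input_bam="$1"
    local sample_folder sample init_fastq preproc_fastq out_20
    sample_folder=$(dirname "$input_bam")
    sample=$(sample_name "$input_bam")
    init_fastq="$sample_folder/$sample.fastq"
    preproc_fastq="$sample_folder/${sample}_trimmed.fq"  # assumed Trim Galore output name
    out_20="$sample_folder/$sample.out"                  # hypothetical result file

    # Each step only runs if the previous one succeeded.
    samtools fastq -@ 4 "$input_bam" > "$init_fastq" &&
    trim_galore -o "$sample_folder" "$init_fastq" &&
    script.py "$preproc_fastq" > "$out_20"
}

# Make the functions visible to the shells that GNU Parallel starts.
export -f sample_name process_sample
```

With the functions exported, the call would look something like parallel -j 8 process_sample ::: ./sample/*.bam. Keep in mind that the total load is the number of jobs times the threads each tool uses, so -j and -@ should be chosen together.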

Related

How to run compression in gnu parallel?

Hi I am trying to compress a file with the bgzip command
bgzip -c 001DD.txt > 001DD.txt.gz
I want to run this command in parallel. I tried:
parallel ::: bgzip -c 001DD.txt > 001DD.txt.gz
but it gives me this error:
parallel: Error: Cannot open input file 'bgzip': No such file or directory
You need to chop the big file into smaller chunks and compress these. It can be done this way:
parallel --pipepart -a 001DD.txt --block -1 -k bgzip > 001DD.txt.gz
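Compressing the chunks separately works because concatenated gzip streams still form a valid gzip file, and bgzip writes gzip-compatible output. The sketch below shows the idea with plain gzip, split and background jobs instead of bgzip and GNU Parallel; the temporary directory, the test data and the chunk size are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Demo: compress chunks independently, concatenate, and verify the round trip.
tmp=$(mktemp -d)
seq 1 100000 > "$tmp/data.txt"

# Cut the file into 4 chunks on line boundaries.
split -l 25000 "$tmp/data.txt" "$tmp/chunk."

# Compress every chunk in its own background job.
for c in "$tmp"/chunk.*; do
    gzip -c "$c" > "$c.gz" &
done
wait

# Concatenate the compressed chunks in order (the glob sorts aa, ab, ...).
cat "$tmp"/chunk.*.gz > "$tmp/data.txt.gz"

# Decompress the result and compare it with the original.
if gzip -dc "$tmp/data.txt.gz" | cmp -s - "$tmp/data.txt"; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "$roundtrip"
```

The order of the chunks matters, which is why the parallel command above uses -k to keep the output in input order.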

Is there any way to Run codes between multiple Nodes on HPC

I am trying to run, say, 10 different codes, each saved in its own directory, named 1, 2, 3, ..., 10.
#PBS -l nodes=10:cores=1
This means I have 1 thread on each of 10 different CPUs. Now I need to submit a job so that directory 1 gets 1 thread on 1 CPU, and likewise for directories 2, 3, ..., 10.
The codes are for molecular dynamics, run for several hours, and are independent of each other. I tried GNU Parallel but failed to make use of all 10 CPUs. Maybe GNU Parallel is made to distribute jobs among the cores of a single CPU. I know MPI can do this, but I don't know exactly how. Can anyone please suggest an approach?
I do not have access to a PBS cluster, but Example 2 from
https://www.nas.nasa.gov/hecc/support/kb/using-gnu-parallel-to-package-multiple-jobs-in-a-single-pbs-job_303.html might be what you are looking for:
#PBS -lselect=6:ncpus=4:model=san
#PBS -lwalltime=4:00:00
cd $PBS_O_WORKDIR
seq 64 | parallel -j 4 -u --sshloginfile $PBS_NODEFILE \
"cd $PWD; ./myscript.csh {}"
Adapted to your situation (untested):
#PBS -l place=scatter
#PBS -l nodes=10:cores=1
cd $PBS_O_WORKDIR
seq 10 | parallel -j 1 --sshloginfile $PBS_NODEFILE --wd $PBS_O_WORKDIR ./myscript {}
You need place=scatter because otherwise the same host may be listed twice in $PBS_NODEFILE, and GNU Parallel ignores duplicates.
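Since duplicate hosts are ignored, it can help to check how many distinct hosts the scheduler actually handed out before blaming GNU Parallel. A minimal sketch, assuming the node file lists one host name per line; the helper name count_unique_hosts and the node names in the demo are made up.

```shell
#!/usr/bin/env bash
# Count distinct, non-empty lines in a node file (one host name per line assumed).
count_unique_hosts() {
    awk 'NF && !seen[$0]++ { n++ } END { print n + 0 }' "$1"
}

# Demo with a made-up node file in which node01 is listed twice.
nodefile=$(mktemp)
printf '%s\n' node01 node01 node02 > "$nodefile"
unique_hosts=$(count_unique_hosts "$nodefile")
echo "$unique_hosts distinct hosts"
rm -f "$nodefile"
```

Inside the job script the call would be count_unique_hosts "$PBS_NODEFILE"; for the request above it should report 10 if every job is to get its own host.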

Issue with download multiple file with names in BASH

I'm trying to download multiple files in parallel using xargs. It works well if I download the files without giving them names: echo ${links[@]} | xargs -P 8 -n 1 wget. Is there a way to download with a file name, as in wget -O [filename] [URL], but in parallel?
Below is my work. Thank you.
links=(
"https://apod.nasa.gov/apod/image/1901/sombrero_spitzer_3000.jpg"
"https://apod.nasa.gov/apod/image/1901/orionred_WISEantonucci_1824.jpg"
"https://apod.nasa.gov/apod/image/1901/20190102UltimaThule-pr.png"
"https://apod.nasa.gov/apod/image/1901/UT-blink_3d_a.gif"
"https://apod.nasa.gov/apod/image/1901/Jan3yutu2CNSA.jpg"
)
names=(
"file1.jpg"
"file2.jpg"
"file3.jpg"
"file4.jpg"
"file5.jpg"
)
echo ${links[@]} ${names[@]} | xargs -P 8 -n 1 wget
With GNU Parallel you can do:
parallel wget -O {2} {1} ::: "${links[@]}" :::+ "${names[@]}"
If a download fails, GNU Parallel can also retry commands with --retries 3.
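If you would rather stay with xargs, the attempt from the question can probably be made to work by feeding xargs one name/URL pair at a time and using -n 2. The sketch below is a dry run (echo prints the commands instead of downloading); the helper name pairs is made up, only the first two links from the question are used, and it assumes neither names nor URLs contain spaces.

```shell
#!/usr/bin/env bash
links=(
"https://apod.nasa.gov/apod/image/1901/sombrero_spitzer_3000.jpg"
"https://apod.nasa.gov/apod/image/1901/orionred_WISEantonucci_1824.jpg"
)
names=(
"file1.jpg"
"file2.jpg"
)

# Print one "name url" pair per line, matching the arrays by index.
pairs() {
    local i
    for i in "${!links[@]}"; do
        printf '%s %s\n' "${names[$i]}" "${links[$i]}"
    done
}

# Dry run: xargs takes two arguments per command and echo shows the result.
planned=$(pairs | xargs -n 2 echo wget -O)
echo "$planned"
```

To download for real, drop the echo and add -P 8: pairs | xargs -P 8 -n 2 wget -O.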

parallel check md5 file

I have an md5sum file containing lots of lines. I want to use GNU Parallel to speed up the md5sum checking process. When md5sum is given no input file, it reads the checksum lines from stdin. I tried this:
cat checksums.md5 | parallel md5sum -c {}
But getting this error:
md5sum 445350b414a8031d9dd6b1e68a6f2367 testing.gz: No such file or directory
How can I parallelize the md5sum checking?
Assuming checksums.md5 has the format:
d41d8cd98f00b204e9800998ecf8427e My file name
Run:
cat checksums.md5 | parallel --pipe -N1 md5sum -c
If your files are small, pass more lines to each md5sum call, e.g. -N100.
If that does not speed up your processing, make sure your disks are fast enough: md5sum can process roughly 500 MB/s, and iostat -dkx 1 can tell you if your disks are a bottleneck.
You need option --pipe. In this mode parallel splits stdin into blocks and supplies each block to the command via stdin, see man parallel for details:
cat checksums.md5 | parallel --pipe md5sum -c -
The default block size is 1 MB; it can be changed with the --block option.
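What --pipe does can be imitated by hand, which may make the mechanism clearer: cut the checksum list into blocks of whole lines and give each block to its own md5sum -c on stdin. The sketch below assumes GNU md5sum and uses made-up test files in a temporary directory; split and background jobs stand in for GNU Parallel.

```shell
#!/usr/bin/env bash
# Demo data: four small files and a checksum list for them.
tmp=$(mktemp -d)
for i in 1 2 3 4; do
    echo "content $i" > "$tmp/file$i.txt"
done
md5sum "$tmp"/file*.txt > "$tmp/checksums.md5"

# Cut the list into blocks of 2 lines each.
split -l 2 "$tmp/checksums.md5" "$tmp/block."

# Check every block with its own md5sum process, reading the block from stdin.
for b in "$tmp"/block.*; do
    md5sum -c - < "$b" > "$b.log" &
done
wait

# Count the files that verified successfully.
ok_count=$(cat "$tmp"/block.*.log | grep -c ': OK$')
echo "$ok_count files OK"
```

This also shows why md5sum -c {} fails: -c expects a file (or stdin) holding checksum lines, not a checksum line as its argument.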

Create a new batch .txt file with specified content for every current file in a directory

I have a huge list of files on a cluster and I need to create a .txt file for each "pair". Each pair consists of filename_R1.fq.gz and filename_R2.fq.gz. For each pair of R1 and R2 files I need to create a text file that contains:
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \
--phred33 \
--fast-local \
-X 1000 \
-p 12 \
-x /usr3/graduate/dhc285/reference_files/21G6 \
-1 $i -2 ${i%_R1.fq.gz}_R2.fq.gz \
| samtools view -bS - > ${i%_R1.fq.gz}.bam
Here $i stands for the name of the R1 file. I would also like each file to be named ${i%_R1.fq.gz}.txt. Thanks!
Using GNU Parallel it looks like this:
sge_jobfile() {
local i="$1"
cat <<EOF > "${i%_R1.fq.gz}.txt"
#!/bin/bash
#$ -N align.$i
#$ -j y
#$ -l h_rt=4:00:00
#$ -pe omp 12
bowtie2 \\
--phred33 \\
--fast-local \\
-X 1000 \\
-p 12 \\
-x /usr3/graduate/dhc285/reference_files/21G6 \\
-1 $i -2 ${i%_R1.fq.gz}_R2.fq.gz \\
| samtools view -bS - > ${i%_R1.fq.gz}.bam
EOF
}
export -f sge_jobfile
parallel sge_jobfile ::: *_R1.fq.gz
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. With such a fixed split, a CPU that finishes its 8 jobs early sits idle while the others are still working.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
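The difference between the two strategies can be estimated with a little arithmetic. The sketch below is a toy model, not a measurement: the job durations are made up (8 slow jobs of 4 time units followed by 24 fast jobs of 1 unit), and "dynamic" simply hands the next job to whichever CPU has the least work so far, which approximates starting a new job whenever one finishes.

```shell
#!/usr/bin/env bash
# Made-up durations: jobs 0-7 take 4 units, jobs 8-31 take 1 unit.
durations=()
for ((j = 0; j < 32; j++)); do
    if ((j < 8)); then durations+=(4); else durations+=(1); fi
done
cpus=4

# Largest value among the arguments.
max_of() {
    local m=0 v
    for v in "$@"; do
        if ((v > m)); then m=$v; fi
    done
    echo "$m"
}

# Fixed split: jobs 0-7 on CPU 0, jobs 8-15 on CPU 1, and so on.
static=(0 0 0 0)
for ((j = 0; j < 32; j++)); do
    c=$((j / 8))
    static[c]=$((static[c] + durations[j]))
done

# Dynamic: each job goes to the CPU that becomes free first.
dynamic=(0 0 0 0)
for ((j = 0; j < 32; j++)); do
    best=0
    for ((c = 1; c < cpus; c++)); do
        if ((dynamic[c] < dynamic[best])); then best=$c; fi
    done
    dynamic[best]=$((dynamic[best] + durations[j]))
done

static_makespan=$(max_of "${static[@]}")
dynamic_makespan=$(max_of "${dynamic[@]}")
echo "static: $static_makespan, dynamic: $dynamic_makespan"
```

For these durations the fixed split takes 32 time units because one CPU gets all the slow jobs, while the dynamic assignment finishes in 14. With equally long jobs the two come out the same.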
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
