How to run compression in GNU parallel?

Hi, I am trying to compress a file with the bgzip command:
bgzip -c 001DD.txt > 001DD.txt.gz
I want to run this command in parallel. I tried:
parallel ::: bgzip -c 001DD.txt > 001DD.txt.gz
but it gives me this error:
parallel: Error: Cannot open input file 'bgzip': No such file or directory

You need to chop the big file into smaller chunks and compress these. It can be done this way:
parallel --pipepart -a 001DD.txt --block -1 -k bgzip > 001DD.txt.gz
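A quick breakdown of why this works (the flag descriptions follow man parallel):
# --pipepart : read blocks directly from the file on disk instead of piping it through the shell (fast)
# --block -1 : split the file into one block per jobslot, so every job gets one chunk
# -k         : keep the output in input order
parallel --pipepart -a 001DD.txt --block -1 -k bgzip > 001DD.txt.gz
Concatenated gzip streams are themselves a valid gzip file, which is why the independently compressed chunks can simply be appended into 001DD.txt.gz.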

Related

Passing multiple arguments to parallel function when uploading to FTP

I'm using ncftpput to upload images to an FTP server.
An example of the script is
# destination, origin
ncftpput -R ftp_server icon_d2/cape_cin ./cape_cin_*.png
ncftpput -R ftp_server icon_d2/t_v_pres ./t_v_pres_*.png
ncftpput -R ftp_server icon_d2/it/cape_cin ./it/cape_cin_*.png
ncftpput -R ftp_server icon_d2/it/t_v_pres ./it/t_v_pres_*.png
I'm trying to parallelize this with GNU parallel but I'm struggling to pass the arguments to ncftpput. I know I'm doing something wrong but somehow cannot find the solution.
If I construct the array of what I need to upload
images_output=("cape_cin" "t_v_pres")
# prefix for naming
projections_output=("" "it/")
# remote folder on server
projections_output_folder=("icon_d2" "icon_d2/it")
# Create a list of all the images to upload
upload_elements=()
for i in "${!projections_output[@]}"; do
  for j in "${images_output[@]}"; do
    upload_elements+=("${projections_output_folder[$i]}/${j} ./${projections_output[$i]}${j}_*.png")
  done
done
Then I can do the upload in serial like this:
for k in "${upload_elements[@]}"; do
  ncftpput -R ftp_server ${k}
done
When using parallel, I'm using --colsep to separate the arguments:
parallel -j 5 --colsep ' ' ncftpput -R ftp_server ::: "${upload_elements[@]}"
but ncftpput gives an error telling me it does not understand the structure of the passed arguments.
What am I doing wrong?
Try:
parallel -j 5 --colsep ' ' eval ncftpput -R ftp_server ::: "${upload_elements[@]}"
GNU parallel quotes its arguments, so the *.png glob in each element would otherwise reach ncftpput unexpanded; eval makes the shell expand it first.
This should do exactly the same:
for k in "${upload_elements[@]}"; do
  echo ncftpput -R ftp_server ${k}
done | parallel -j 5
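To check what will actually be executed before touching the server, --dry-run prints each composed command without running it (same arrays as above assumed):
parallel -j 5 --dry-run --colsep ' ' eval ncftpput -R ftp_server ::: "${upload_elements[@]}"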

parallel with multiple scripts

I have multiple scripts that are connected and use each other's output. I have several input files in the samples directory that I would like to process in parallel.
Any idea how this is best done?
sample_folder=${working_dir}/samples
input_bam=${sample_folder}/${sample}.bam
samtools fastq -@40 $input_bam > ${init_fastq}
trim_galore -o ${sample_folder} $init_fastq
script.py ${preproc_fastq} > ${out_20}
What I started with:
parallel -j 8 script.py -i {} -o ?? -n8 ::: ./sample/*.bam
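One common pattern (a sketch, assuming each sample is independent; the intermediate file names, the trim_galore output name, and the per-job thread count are assumptions to adapt) is to wrap the whole per-sample pipeline in a shell function and hand parallel only the .bam files:

process_sample() {
  local input_bam=$1
  local sample=$(basename "$input_bam" .bam)
  # BAM back to FASTQ; fewer threads per job since several jobs run at once
  samtools fastq -@ 5 "$input_bam" > "${sample}.fastq"
  # adapter/quality trimming; -o sets the output directory
  trim_galore -o "$sample_folder" "${sample}.fastq"
  # downstream script on the trimmed reads (output file name is an assumption)
  script.py "${sample_folder}/${sample}_trimmed.fq" > "${sample}_out20.txt"
}
export -f process_sample
export sample_folder
parallel -j 8 process_sample ::: ./sample/*.bam

export -f makes the function visible to the shells that parallel spawns.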

Issue with downloading multiple files with names in BASH

I'm trying to download multiple files in parallel using xargs. Things work well if I just download the files without giving them names: echo ${links[@]} | xargs -P 8 -n 1 wget. Is there a way to download with a filename, like wget -O [filename] [URL], but in parallel?
Below is my work. Thank you.
links=(
"https://apod.nasa.gov/apod/image/1901/sombrero_spitzer_3000.jpg"
"https://apod.nasa.gov/apod/image/1901/orionred_WISEantonucci_1824.jpg"
"https://apod.nasa.gov/apod/image/1901/20190102UltimaThule-pr.png"
"https://apod.nasa.gov/apod/image/1901/UT-blink_3d_a.gif"
"https://apod.nasa.gov/apod/image/1901/Jan3yutu2CNSA.jpg"
)
names=(
"file1.jpg"
"file2.jpg"
"file3.jpg"
"file4.jpg"
"file5.jpg"
)
echo ${links[@]} ${names[@]} | xargs -P 8 -n 1 wget
With GNU Parallel you can do:
parallel wget -O {2} {1} ::: "${links[@]}" :::+ "${names[@]}"
Here :::+ pairs each link with its name, so {1} is the URL and {2} the output filename. If a download fails, GNU Parallel can also retry commands with --retries 3.
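If you would rather stay with xargs, one option (a sketch using the same two arrays) is to print name/URL pairs, two tokens per line, and let xargs append both to wget -O:

for i in "${!links[@]}"; do
  printf '%s %s\n' "${names[$i]}" "${links[$i]}"
done | xargs -P 8 -n 2 wget -O

Each invocation then becomes wget -O file1.jpg https://...; note that xargs splits on whitespace, so this breaks if a name or URL contains spaces.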

parallel check md5 file

I have an md5sum file containing lots of lines. I want to use GNU parallel to accelerate the md5sum checking process. When md5sum -c is given no file argument, it reads the checksum lines from stdin. I tried this:
cat checksums.md5 | parallel md5sum -c {}
But getting this error:
md5sum 445350b414a8031d9dd6b1e68a6f2367 testing.gz: No such file or directory
How can I parallelize the md5sum checking?
Assuming checksums.md5 has the format:
d41d8cd98f00b204e9800998ecf8427e My file name
Run:
cat checksums.md5 | parallel --pipe -N1 md5sum -c
If your files are small, use -N100 so each job checks 100 lines and the per-job startup cost is amortized.
If that does not speed up your processing, make sure your disks are fast enough: md5sum can process 500 MB/s. iostat -dkx 1 can tell you if your disks are the bottleneck.
You need the --pipe option. In this mode parallel splits stdin into blocks and supplies each block to the command on its stdin; see man parallel for details:
cat checksums.md5 | parallel --pipe md5sum -c -
By default the block size is 1 MB; it can be changed with the --block option.
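Either way, to see only the failures, filter out the "OK" lines that md5sum prints for every good file:
cat checksums.md5 | parallel --pipe -N100 md5sum -c | grep -v ': OK$'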

Different results with MACS2 when peak calling with .bed or .bam

I got the following problem:
I use MACS2 (2.1.0.20140616) with the following short command line:
macs2 callpeak -t file.bam -f bam -g 650000000 -n Test -B --nomodel -q 0.01
It seems to work as I want, but when I convert the .bam file into .bed via
bedtools bamtobed -i file.bam > file.bed
and use MACS2 on this, I get a lot more peaks. As far as I understand, the .bed file should contain the same information as the .bam file, so that's kinda odd.
Any suggestions as to what the problem is?
Thanks!
