Linux for loop for 2 inputs and 4 outputs - bash

I need help to write a for loop.
Input: file01_R1.fastq, file01_R2.fastq. I have 100 such pairs, e.g. file02_R1.fastq, file02_R2.fastq, and so on.
Output: file01_R1_PE.fastq, file01_R1_SE.fastq, file01_R2_PE.fastq, file01_R2_SE.fastq
I need to write a for loop so that I can run an executable for all 100 files. Any help please!

I assume that given the file
file01_R1.fastq
you want to run:
Trimmomatic file01_R1.fastq file01_R2.fastq -o file01_R1_PE.fastq file01_R1_SE.fastq file01_R2_PE.fastq file01_R2_SE.fastq
Using GNU Parallel it looks like this:
parallel Trimmomatic {} {= s/_R1/_R2/ =} -o {= s/_R1/_R1_PE/ =} {= s/_R1/_R1_SE/ =} {= s/_R1/_R2_PE/ =} {= s/_R1/_R2_SE/ =} ::: *_R1.fastq
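For the first pair of files, the replacement strings expand that line to exactly the Trimmomatic call shown above:
Trimmomatic file01_R1.fastq file01_R2.fastq -o file01_R1_PE.fastq file01_R1_SE.fastq file01_R2_PE.fastq file01_R2_SE.fastq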
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new job whenever one finishes, keeping the CPUs active and thus saving time.
Installation
For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

for file in *
do
some_command_that_does_something_unspecified "$file"
done
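If you prefer to stay with plain bash for the paired files in the question, a minimal sketch might look like this (it assumes the Trimmomatic command line shown in the answer above and derives the other file names from each _R1 file with parameter expansion):
for r1 in *_R1.fastq
do
    r2=${r1/_R1/_R2}              # matching reverse-read file, e.g. file01_R2.fastq
    base=${r1%_R1.fastq}          # e.g. file01
    Trimmomatic "$r1" "$r2" -o \
        "${base}_R1_PE.fastq" "${base}_R1_SE.fastq" \
        "${base}_R2_PE.fastq" "${base}_R2_SE.fastq"
done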

Store the '_R1.fastq' files in an array and loop over the '_R2.fastq' files:
r1_files=( *_R1.fastq )   # bash variable names cannot contain dots, so use an ordinary identifier
i=0
for r2 in *_R2.fastq
do
command "${r1_files[i]}" "$r2"
((i++))
done
Name the output files as required by Trimmomatic

Related

parallel execution of script on cluster for multiple inputs

I have a script pytonscript.py that I want to run on 500 samples. I have 50 CPUs available and want to run the script in parallel using 1 CPU for each sample, so that 50 samples are constantly running with 1 CPU each. Any ideas how to set this up without typing 500 lines with the different inputs? I know how to write a loop over the samples, but not how to run 50 samples in parallel. I guess GNU parallel is a way?
Input samples in folder samples:
sample1
sample2
sample3
...
sample500
pytonscript.py -i samples/sample1.sam.bz2 -o output_folder
What about GNU xargs?
printf '%s\0' samples/sample*.sam.bz2 |
xargs -0L1 -P 50 pytonscript.py -o output_dir -i
This launches a new instance of the python script for each file, concurrently, maintaining a maximum of 50 at once.
If the wildcard glob expansion isn't specific enough, you can use bash's extglob: shopt -s extglob; # samples/sample+([0-9]).sam.bz2
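Since the question mentions GNU parallel: an equivalent invocation would be something like the sketch below, assuming the same samples/ layout (-j 50 caps the number of concurrent jobs at 50):
parallel -j 50 pytonscript.py -i {} -o output_folder ::: samples/sample*.sam.bz2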

Mapping a script over all lines of stdin

Is there a more idiomatic way of doing the following:
cat some_lines.txt | while read x; do ./process_line.sh $x; done
ie. applying a script to each line of stdin?
I could include the while read x; boilerplate in the script itself, but that doesn't really feel right either.
If you're running an external process and have GNU xargs, consider:
xargs -n1 -d $'\n' ./process_line.sh <some_lines.txt
If you don't like the while read loop's verbosity, and are running a shell function (where a fork() isn't natively needed, and thus where using an external tool like xargs or GNU parallel has a substantial performance cost), you can avoid it by wrapping the loop in a function:
for_each_line() {
local line
while IFS= read -r line; do
"$@" "$line" </dev/null   # </dev/null stops the command from consuming the loop's stdin
done
}
...can be run as:
process_line() {
echo "Processing line: $1"
}
for_each_line process_line <some_lines.txt
GNU Parallel is made for this kind of task - provided there is no problem in running the processing in parallel:
cat some_lines.txt | parallel ./process_line.sh {}
By default it will run one job per cpu-core. This can be adjusted with --jobs.
There is an overhead of around 5 ms per job when running it through GNU Parallel. One of the benefits you get is that the output from the different jobs is guaranteed not to be jumbled together, so you can use the output as if the jobs had not been run in parallel:
cat some_lines.txt | parallel ./process_line.sh {} | do_post_processing
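For example, to cap it at 8 concurrent jobs while reading the lines straight from the file:
parallel --jobs 8 ./process_line.sh {} < some_lines.txt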

GNU Parallel: Argument list too long when calling function

I created a script to verify a (big) number of items and it was doing the verification in a serial way (one after the other) with the end result of the script taking about 9 hours to complete. Looking around about how to improve this, I found GNU parallel but I'm having problems making it work.
The list of items is in a text file so I was doing the following:
readarray items < ${ALL_ITEMS}
export -f process_item
parallel process_item ::: "${items[@]}"
Problem is, I receive an error:
GNU parallel: Argument list too long
I understand by looking at similar posts 1, 2, 3 that this is more a Linux limitation than a GNU parallel one. From the answers to those posts I also tried to extrapolate a workaround by piping the items to head but the result is that only a few items (the parameter passed to head) are processed.
I have been able to make it work using xargs:
cat "${ALL_ITEMS}" | xargs -n 1 -P ${THREADS} -I {} bash -c 'process_item "$#"' _ {}
but I've seen GNU parallel has other nice features I'd like to use.
Any idea how to make this work with GNU parallel? By the way, the number of items is about 2.5 million and growing every day (the script runs as a cron job).
Thanks
From man parallel:
parallel [options] [command [arguments]] < list_of_arguments
So:
export -f process_item
parallel process_item < ${ALL_ITEMS}
probably does what you want.
You can pipe the file to parallel, or just use the -a (--arg-file) option. The following are equivalent:
cat "${ALL_ITEMS}" | parallel process_item
parallel process_item < "${ALL_ITEMS}"
parallel -a "${ALL_ITEMS}" process_item
parallel --arg-file "${ALL_ITEMS}" process_item
parallel process_item :::: "${ALL_ITEMS}"
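Putting the pieces together, a minimal end-to-end sketch (process_item here is just a stand-in for the real verification function, and items.txt is a hypothetical path for ${ALL_ITEMS}; exporting the function is what lets the shells spawned by parallel see it):
#!/bin/bash
ALL_ITEMS=items.txt          # hypothetical path; the real script sets this elsewhere
process_item() {
    echo "verifying $1"      # placeholder for the real verification logic
}
export -f process_item       # make the function visible to the sub-shells parallel starts
parallel process_item :::: "${ALL_ITEMS}"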

How to use gnu-parallel for processing a script with two inputs?

I am trying to run a Python script with two inputs as follows. I have ~300 of these input pairs, so I wonder if somebody could advise how to run them with parallel.
The single run looks like:
python stable.py KOG_1.fan KOG_1.fasta > KOG_1.stable
My test with parallel which is not working:
ls *.fan; ls *.fasta | parallel python stable.py {} {} > {.}.stable
but how do I specify that it has to run with _1.fan and _1.fasta, then _2.fan and _2.fasta, and so on, up to _300.fan and _300.fasta.
This is not really a Python question, it's a question about GNU parallel. You could try this if all files are prefixed with "KOG_":
seq 1 300 | parallel python stable.py KOG_{}.fan KOG_{}.fasta ">" KOG_{.}.stable
The quotes around the redirect (">") are important, unless you want all of the output in one file.
To handle generic prefixes:
ls *fan *fasta | parallel --max-lines=2 python stable.py {1} {2} ">" {1.}.stable
This uses the --max-lines option to take 2 lines per command. Of course this works only if the *.fan and *.fasta files match up, i.e. there must be the same number of each, and the numbers need to match up, otherwise you'll end up pairing files that shouldn't be paired. If that is a problem, you can figure out a command that will more robustly feed pairs to parallel, as in the sketch below.
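A more robust way to feed the pairs is GNU parallel's --link option (called --xapply in older releases), which pairs up arguments from two input sources positionally, roughly like this:
parallel --link python stable.py {1} {2} ">" {1.}.stable ::: *.fan ::: *.fasta
This still assumes the sorted *.fan and *.fasta listings line up one-to-one.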
Try:
parallel python stable.py {} {.}.fasta '>' {.}.stable ::: *fan
I recommend you split this task into two steps:
Create a jobs file containing all the commands you want to run with parallel.
You need to create a text file jobs.txt similar to the one presented below:
python stable.py KOG_1.fan KOG_1.fasta > KOG_1.stable
python stable.py KOG_2.fan KOG_2.fasta > KOG_2.stable
python stable.py KOG_3.fan KOG_3.fasta > KOG_3.stable
python stable.py KOG_4.fan KOG_4.fasta > KOG_4.stable
...
python stable.py KOG_300.fan KOG_300.fasta > KOG_300.stable
If all your files are prefixed with KOG, you can build up this file this way:
for I in `seq 300`; do echo "python stable.py KOG_$I.fan KOG_$I.fasta > KOG_$I.stable" >> jobs.txt; done;
Run parallel using the jobs file
Once you have the jobs file, you just need to run the following command:
parallel -j4 < jobs.txt
Note that -j4 indicates that at most 4 commands from your jobs file will be running in parallel. You can adjust that according to the number of cores available on your computer.

Parallel nested for loop in bash

I am trying to run a C executable through bash. The executable will take a different argument in each iteration, and I want to do it in parallel since I have 12 cores available.
I tried
w=1;
for i in {1..100}
do
l=$(($i-1));
for j in {12*l..12*i}
do
./run $w/100 > "$w"_out &
done
expr=$w % 12;
if ["$expr" -eq "0"]
then wait;
fi;
done
run is the C executable. I want to run it with an increasing argument w in each step, and I want to wait until all processes are done whenever 12 of the cores are in use. So basically, I will run 12 executables at the same time, then wait until they are completed, and then move to the next 12.
Hope I made my point clear.
Cheers.
Use GNU parallel instead:
parallel ./myscript {1} ::: {1..100}
You can specify the number of parallel processes with the -P option, but it defaults to the number of cores in the system.
You can also specify -k to keep the output in input order.
To redirect the output to individual files, you can specify the output redirection, but you have to quote it, so that it is not parsed by the shell. For example:
parallel ./run {1} '>' {1}_out ::: {1..10}
is equivalent to running ./run 1 > 1_out through ./run 10 > 10_out.
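Applied to the question's 12 cores and ./run invocation, that might look like this (a sketch; the $w/100 argument and the output file naming are taken from the question):
parallel -P 12 ./run {}/100 '>' {}_out ::: {1..100}
If installing GNU parallel is not an option, a plain-bash version of the batching described in the question (start 12 background jobs, then wait for the whole batch) could be:
for w in {1..100}
do
    ./run "$w"/100 > "${w}_out" &   # one background job per argument
    if (( w % 12 == 0 )); then
        wait                        # wait for the current batch of 12 before starting the next
    fi
done
wait                                # wait for the final, partial batch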
