I created a script to verify a (big) number of items and it was doing the verification in a serial way (one after the other) with the end result of the script taking about 9 hours to complete. Looking around about how to improve this, I found GNU parallel but I'm having problems making it work.
The list of items is in a text file so I was doing the following:
readarray items < ${ALL_ITEMS}
export -f process_item
parallel process_item ::: "${items[@]}"
Problem is, I receive an error:
GNU parallel: Argument list too long
I understand by looking at similar posts 1, 2, 3 that this is more a Linux limitation than a GNU parallel one. From the answers to those posts I also tried to extrapolate a workaround by piping the items to head but the result is that only a few items (the parameter passed to head) are processed.
I have been able to make it work using xargs:
cat "${ALL_ITEMS}" | xargs -n 1 -P ${THREADS} -I {} bash -c 'process_item "$@"' _ {}
but I've seen GNU parallel has other nice features I'd like to use.
Any idea how to make this work with GNU parallel? By the way, the number of items is about 2.5 million and growing every day (the script runs as a cron job).
Thanks
From man parallel:
parallel [options] [command [arguments]] < list_of_arguments
So:
export -f process_item
parallel process_item < ${ALL_ITEMS}
probably does what you want.
You can pipe the file to parallel, or just use the -a (--arg-file) option. The following are equivalent:
cat "${ALL_ITEMS}" | parallel process_item
parallel process_item < "${ALL_ITEMS}"
parallel -a "${ALL_ITEMS}" process_item
parallel --arg-file "${ALL_ITEMS}" process_item
parallel process_item :::: "${ALL_ITEMS}"
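To make the difference concrete, here is a minimal sketch. The process_item body and the three-item list are hypothetical stand-ins, and the parallel call is guarded so the script still runs where GNU Parallel is not installed. getconf ARG_MAX prints the limit on the combined size of arguments and environment for a single command, which is what the ::: form hits with millions of items; feeding the items on stdin never builds such a command line.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real verification function.
process_item() { printf 'checked %s\n' "$1"; }
export -f process_item

# Hypothetical item list; in the question this is the file named by ALL_ITEMS.
ALL_ITEMS=$(mktemp)
printf '%s\n' item-a item-b item-c > "$ALL_ITEMS"

# Upper bound (in bytes) on arguments plus environment for a single exec.
limit=$(getconf ARG_MAX)
echo "ARG_MAX on this system: $limit bytes"

# Run only if the parallel found on PATH is really GNU Parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # -k (--keep-order) prints results in input order.
    parallel -k process_item < "$ALL_ITEMS"
else
    echo "GNU Parallel not found; skipping the demo run" >&2
fi
rm -f "$ALL_ITEMS"
```

Note that exported bash functions are only visible to GNU Parallel when it starts jobs with bash, so this assumes SHELL points at bash.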
Related
I have a script pytonscript.py that I want to run on 500 samples. I have 50 CPUs available and want to run the script in parallel using 1 CPU for each sample, so that 50 samples are constantly running with 1 CPU each. Any ideas how to set this up without typing 500 lines with the different inputs? I know how to make a loop for each sample, but not how to make 50 samples running in parallel. I guess GNU parallel is a way?
Input samples in folder samples:
sample1
sample2
sample3
...
sample500
pytonscript.py -i samples/sample1.sam.bz2 -o output_folder
What about GNU xargs?
printf '%s\0' samples/sample*.sam.bz2 |
xargs -0L1 -P 50 pytonscript.py -o output_dir -i
This launches a new instance of the python script for each file, concurrently, maintaining a maximum of 50 at once.
If the wildcard glob expansion isn't specific enough, you can use bash's extglob: shopt -s extglob; # samples/sample+([0-9]).sam.bz2
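As a quick illustration of the extglob suggestion, the sketch below builds a throwaway directory with hypothetical file names and compares the plain glob with the stricter +([0-9]) pattern. The file names are made up for the demo; only the pattern itself comes from the answer above.

```shell
#!/usr/bin/env bash
# extglob must be enabled before the line that uses the pattern is parsed.
shopt -s extglob nullglob

# Hypothetical sample directory for the demo.
dir=$(mktemp -d)
touch "$dir"/sample1.sam.bz2 "$dir"/sample22.sam.bz2 "$dir"/sample_old.sam.bz2

# Plain glob: matches anything between "sample" and ".sam.bz2".
plain=( "$dir"/sample*.sam.bz2 )
# extglob: +([0-9]) matches one or more digits only.
strict=( "$dir"/sample+([0-9]).sam.bz2 )

echo "plain glob matched ${#plain[@]} files"
echo "extglob matched ${#strict[@]} files"
rm -r "$dir"
```

Here the plain glob also picks up sample_old.sam.bz2, while the extglob pattern only accepts names where digits follow "sample".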
Is there a more idiomatic way of doing the following:
cat some_lines.txt | while read x; do ./process_line.sh $x; done
ie. applying a script to each line of stdin?
I could include the while read x; boilerplate in the script itself, but that doesn't really feel right either.
If you're running an external process and have GNU xargs, consider:
xargs -n1 -d $'\n' ./process_line.sh <some_lines.txt
If you don't like the while read loop's verbosity, and are running a shell function (where a fork() isn't natively needed, and thus where using an external tool like xargs or GNU parallel has a substantial performance cost), you can avoid it by wrapping the loop in a function:
for_each_line() {
  local line
  while IFS= read -r line; do
    # Run the given command with the line appended as its last argument.
    "$@" "$line" </dev/null
  done
}
...can be run as:
process_line() {
echo "Processing line: $1"
}
for_each_line process_line <some_lines.txt
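One detail of for_each_line worth spelling out is the </dev/null redirection: without it, a command that reads its own stdin swallows the remaining input lines, and the loop ends after the first one. A small sketch of that effect follows; greedy and unsafe_loop are hypothetical names introduced only for this demonstration.

```shell
#!/usr/bin/env bash
# Same helper as above: the callback gets /dev/null as its stdin.
for_each_line() {
  local line
  while IFS= read -r line; do
    "$@" "$line" </dev/null
  done
}

# Hypothetical variant without the redirection, for comparison only.
unsafe_loop() {
  local line
  while IFS= read -r line; do
    "$@" "$line"
  done
}

# Hypothetical callback that drains whatever is on its stdin.
greedy() { cat >/dev/null; echo "got: $1"; }

safe=$(printf '%s\n' a b c | for_each_line greedy)
unsafe=$(printf '%s\n' a b c | unsafe_loop greedy)

echo "with </dev/null:"; echo "$safe"
echo "without:";         echo "$unsafe"
```

With the redirection all three lines are processed; without it, the callback consumes lines b and c itself, so the loop only sees the first line.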
GNU Parallel is made for this kind of task, provided there is no problem in running the processing in parallel:
cat some_lines.txt | parallel ./process_line.sh {}
By default it will run one job per cpu-core. This can be adjusted with --jobs.
Running a job through GNU Parallel adds an overhead on the order of 5 ms per job. One benefit is that the output from different jobs is guaranteed not to be jumbled together, so you can use the output as if the jobs had not been run in parallel:
cat some_lines.txt | parallel ./process_line.sh {} | do_post_processing
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
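A rough way to see the difference between the two strategies is to simulate them. The sketch below is a simplified model, not a measurement of GNU Parallel: it assumes hypothetical durations (job k takes k seconds), ignores startup overhead, and compares a fixed split of 8 consecutive jobs per CPU with handing each next job to whichever CPU is free first.

```shell
#!/usr/bin/env bash
cpus=4

# Hypothetical workload: job k takes k seconds, for k = 1..32.
durations=()
for ((k = 1; k <= 32; k++)); do durations+=("$k"); done

# Static split: each CPU gets 8 consecutive jobs; total time is the busiest CPU.
per_cpu=$(( ${#durations[@]} / cpus ))
static=0
for ((c = 0; c < cpus; c++)); do
  load=0
  for ((j = c * per_cpu; j < (c + 1) * per_cpu; j++)); do
    load=$(( load + durations[j] ))
  done
  if (( load > static )); then static=$load; fi
done

# Dynamic: the next job always goes to the CPU that becomes free first.
busy=(0 0 0 0)   # time at which each CPU is free again
for d in "${durations[@]}"; do
  min=0
  for ((c = 1; c < cpus; c++)); do
    if (( busy[c] < busy[min] )); then min=$c; fi
  done
  busy[min]=$(( busy[min] + d ))
done
dynamic=0
for b in "${busy[@]}"; do
  if (( b > dynamic )); then dynamic=$b; fi
done

echo "static split finishes after  ${static}s"
echo "dynamic refill finishes after ${dynamic}s"
```

With uneven job lengths the static split leaves some CPUs idle while one works through the longest jobs, which is the waste that refilling free slots avoids.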
Installation
For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
I need help to write a for loop.
Input: file01_R1.fastq, file01_R2.fastq. I have 100 files e.g., file02_R1.fastq, file02_R2.fastq and so on.
Output: file01_R1_PE.fastq, file01_R1_SE.fastq, file01_R2_PE.fastq, file01_R2_SE.fastq
I need to write a for loop so that I can run an executable for all 100 files. Any help please!
I assume that given the file
file01_R1.fastq
you want to run:
Trimmomatic file01_R1.fastq file01_R2.fastq -o file01_R1_PE.fastq file01_R1_SE.fastq file01_R2_PE.fastq file01_R2_SE.fastq
Using GNU Parallel it looks like this:
parallel Trimmomatic {} {= s/_R1/_R2/ =} -o {= s/_R1/_R1_PE/ =} {= s/_R1/_R1_SE/ =} {= s/_R1/_R2_PE/ =} {= s/_R1/_R2_SE/ =} ::: *_R1.fastq
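Each {= s/_R1/.../ =} above rewrites the input name with a Perl substitution. To check the resulting file names without running anything, the same substitutions can be sketched with plain bash parameter expansion. build_cmd is a hypothetical helper used only for illustration; it prints the command line instead of executing it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive all Trimmomatic file names from the R1 name
# and print the command that would be run.
build_cmd() {
  local r1=$1
  printf 'Trimmomatic %s %s -o %s %s %s %s\n' \
    "$r1" \
    "${r1/_R1/_R2}" \
    "${r1/_R1/_R1_PE}" \
    "${r1/_R1/_R1_SE}" \
    "${r1/_R1/_R2_PE}" \
    "${r1/_R1/_R2_SE}"
}

build_cmd file01_R1.fastq
```

GNU Parallel's own --dry-run option serves a similar purpose: it prints the commands it would run instead of running them.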
for file in *
do
some_command_that_does_something_unspecified "$file"
done
Store the 'R1.fastq' files in an array and loop over the 'R2.fastq' files:
# some_command is a placeholder for the real executable.
r1_files=( *_R1.fastq )
i=0
for r2_file in *_R2.fastq
do some_command "${r1_files[i++]}" "$r2_file"
done
Name the output files as required by Trimmomatic
I have written a script that accept two files as input. I want to run all in parallel at the same time on different CPUs.
inputs:
x00.A x00.B
x01.A x01.B
...
x30.A x30.B
instead of running 30 times:
./script x00.A x00.B
./script x01.A x01.B
...
./script x30.A x30.B
I wanted to use paste and seq to generate and send them to the script.
paste & seq | xargs -n1 -P 30 ./script
But I do not know how to combine letters and numbers using paste and seq commands.
for num in $(seq -f %02.f 0 30); do
./script x$num.A x$num.B &
done
wait
Although I personally prefer to not use GNU seq or BSD jot but (ksh/bash) builtins:
num=-1; while (( ++num <= 30 )); do
  id=$(printf '%02d' "$num")   # zero-pad so the names match x00 ... x30
  ./script "x$id.A" "x$id.B" &
done
wait
The final wait is just needed to make sure they all finish, after having run spread across your available CPU cores in the background. So, if you need the output of ./script, you must keep the wait.
Putting them into the background with & is the simplest way for parallelism. If you really want to exercise any sort of control over lots of backgrounded jobs like that, you will need some sort of framework like GNU Parallel instead.
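For a modest amount of control without an extra tool, the number of simultaneous background jobs can be capped with wait -n, which needs bash 4.3 or newer. This is only a rough sketch: work is a hypothetical stand-in for ./script, the cap of 4 is an assumed value, and it lacks the output handling and error reporting a framework like GNU Parallel provides.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for ./script: pretend to work, then report its inputs.
work() { sleep 0.1; echo "done $1 $2"; }

max_jobs=4   # assumed cap on simultaneous jobs
running=0
results=$(
    for num in 00 01 02 03 04 05 06 07 08 09; do
        if (( running >= max_jobs )); then
            wait -n                      # block until any one job exits
            running=$(( running - 1 ))
        fi
        work "x$num.A" "x$num.B" &
        running=$(( running + 1 ))
    done
    wait                                 # collect the remaining jobs
)
echo "$results"
```

The order of the output lines is not guaranteed, since jobs finish whenever they finish.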
You can use pure bash for generating the sequence:
printf "%s %s\n" x{00..30}.{A..B} | xargs -n2 -P 30 ./script
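To see what the brace expansion produces and how xargs hands it over, here is a shortened sketch with three pairs instead of 31. The sh -c wrapper is a hypothetical stand-in for ./script that just echoes its two arguments; zero-padded ranges such as {00..02} assume bash 4 or newer.

```shell
#!/usr/bin/env bash
# Brace expansion yields x00.A x00.B x01.A x01.B ...; printf prints two per line.
pairs=$(printf '%s %s\n' x{00..02}.{A..B})
echo "$pairs"

# xargs passes two arguments per invocation (-n2), up to 3 at a time (-P 3).
# Output order is not guaranteed with -P, hence the sort.
runs=$(printf '%s %s\n' x{00..02}.{A..B} |
       xargs -n2 -P 3 sh -c 'echo "script got: $1 $2"' _ | sort)
echo "$runs"
```

Each invocation receives both files of a pair, which is what the script in the question expects.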
Happy holidays!
I have a script that I want to run on a number of files
my_script file_name
but I have many so I have written some code that is meant to process multiple at the same time by first creating 5 'equal' lists of the files I want to process followed by this
my_function() {
while read i; do
my_script $i
done < $1
}
my_function list_1 &
my_function list_2 &
my_function list_3 &
my_function list_4 &
my_function list_5 &
wait
This works for the first file in each list but then finishes. If I change the function to a simple echo it works fine
my_function() {
while read i; do
echo $i
done < $1
}
it prints all the files in each list as I would expect.
Why does it not work if I use 'my_script'?? And is there a 'nicer' way of doing this?
GNU Parallel is made for this:
parallel my_script ::: files*
You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/
You can install GNU Parallel in just 10 seconds with:
wget -O - pi.dk/3 | sh
Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Edit:
If the reason for not installing GNU Parallel is not covered by
http://oletange.blogspot.dk/2013/04/why-not-install-gnu-parallel.html
would you then be kind to elaborate why?
There must be an exit statement in my_script. Replace the exit statement(s) with return statement(s).
Another thing to check is whether the same file appears in more than one list. That can cause contention during processing: one process is already working on the file when another tries to open it. Check for duplicate entries with:
sort list_[1-5] | uniq -d
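As a small illustration of that check, the sketch below uses two hypothetical list files in a temporary directory, with one entry deliberately repeated across them.

```shell
#!/usr/bin/env bash
# Hypothetical lists for the demo; b.txt appears in both.
dir=$(mktemp -d)
printf '%s\n' a.txt b.txt > "$dir/list_1"
printf '%s\n' b.txt c.txt > "$dir/list_2"

# sort merges the lists so equal entries are adjacent; uniq -d prints
# only entries that occur more than once.
dupes=$(sort "$dir"/list_[1-2] | uniq -d)
echo "$dupes"
rm -r "$dir"
```

An empty result means no file name is shared between the lists.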
As an alternative to GNU parallel, there is https://github.com/mauvilsa/run_parallel which is simply a function in bash, so it does not require root access or compiling.
To use it, first source the file
source run_parallel.inc.sh
Then in your example, execute it as
run_parallel -T 5 my_function 'list_{%}'
It could also do the splitting of the lists for you as
run_parallel -T 5 -l full_list -n split my_function '{#}'
To see the usage explanation and some examples, execute run_parallel without any arguments.