Mapping a script over all lines of stdin - bash

Is there a more idiomatic way of doing the following:
cat some_lines.txt | while read x; do ./process_line.sh $x; done
ie. applying a script to each line of stdin?
I could include the while read x; boilerplate in the script itself, but that doesn't really feel right either.

If you're running an external process and have GNU xargs, consider:
xargs -n1 -d $'\n' ./process_line.sh <some_lines.txt
If you don't like the while read loop's verbosity, and are running a shell function (where a fork() isn't natively needed, and thus where using an external tool like xargs or GNU parallel has a substantial performance cost), you can avoid it by wrapping the loop in a function:
for_each_line() {
  local line
  while IFS= read -r line; do
    "$@" "$line" </dev/null
  done
}
...can be run as:
process_line() {
  echo "Processing line: $1"
}
for_each_line process_line <some_lines.txt

GNU Parallel is made for this kind of task, provided there is no problem with running the processing in parallel:
cat some_lines.txt | parallel ./process_line.sh {}
By default it will run one job per CPU core. This can be adjusted with --jobs.
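For example, to cap or raise the number of simultaneous jobs (the counts below are just illustrations; --jobs also accepts a percentage of your cores):
parallel --jobs 8 ./process_line.sh {} < some_lines.txt
parallel --jobs 200% ./process_line.sh {} < some_lines.txt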
There is an overhead of running it through GNU Parallel on the order of 5 ms per job. One of the benefits you get is that the output from the different jobs is guaranteed not to be jumbled together, so you can use the output as if the jobs had not been run in parallel:
cat some_lines.txt | parallel ./process_line.sh {} | do_post_processing
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU; GNU Parallel instead spawns a new job whenever one finishes, keeping the CPUs active and thus saving time.
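As a toy illustration of this dynamic scheduling (the sleep lengths are arbitrary), you can feed 32 short jobs of varying duration into 4 job slots and watch them complete out of lockstep:
seq 32 | parallel --jobs 4 'sleep $(( RANDOM % 3 )); echo job {} finished'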
Installation
For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

GNU Parallel: Argument list too long when calling function

I created a script to verify a (big) number of items, and it was doing the verification serially (one after the other), with the end result that the script takes about 9 hours to complete. Looking around for how to improve this, I found GNU Parallel, but I'm having problems making it work.
The list of items is in a text file so I was doing the following:
readarray items < ${ALL_ITEMS}
export -f process_item
parallel process_item ::: "${items[@]}"
Problem is, I receive an error:
GNU parallel: Argument list too long
I understand by looking at similar posts 1, 2, 3 that this is more a Linux limitation than a GNU parallel one. From the answers to those posts I also tried to extrapolate a workaround by piping the items to head but the result is that only a few items (the parameter passed to head) are processed.
I have been able to make it work using xargs:
cat "${ALL_ITEMS}" | xargs -n 1 -P ${THREADS} -I {} bash -c 'process_item "$#"' _ {}
but I've seen GNU parallel has other nice features I'd like to use.
Any idea how to make this work with GNU parallel? By the way, the number of items is about 2.5 million and growing every day (the script runs as a cron job).
Thanks
From man parallel:
parallel [options] [command [arguments]] < list_of_arguments
So:
export -f process_item
parallel process_item < ${ALL_ITEMS}
probably does what you want.
You can pipe the file to parallel, or just use the -a (--arg-file) option. The following are equivalent:
cat "${ALL_ITEMS}" | parallel process_item
parallel process_item < "${ALL_ITEMS}"
parallel -a "${ALL_ITEMS}" process_item
parallel --arg-file "${ALL_ITEMS}" process_item
parallel process_item :::: "${ALL_ITEMS}"

Linux for loop for 2 inputs and 4 outputs

I need help to write a for loop.
Input: file01_R1.fastq, file01_R2.fastq. I have 100 such pairs, e.g. file02_R1.fastq, file02_R2.fastq, and so on.
Output: file01_R1_PE.fastq, file01_R1_SE.fastq, file01_R2_PE.fastq, file01_R2_SE.fastq
I need to write a for loop so that I can run an executable for all 100 files. Any help please!
I assume that given the file
file01_R1.fastq
you want to run:
Trimmomatic file01_R1.fastq file01_R2.fastq -o file01_R1_PE.fastq file01_R1_SE.fastq file01_R2_PE.fastq file01_R2_SE.fastq
Using GNU Parallel it looks like this:
parallel Trimmomatic {} {= s/_R1/_R2/ =} -o {= s/_R1/_R1_PE/ =} {= s/_R1/_R1_SE/ =} {= s/_R1/_R2_PE/ =} {= s/_R1/_R2_SE/ =} ::: *_R1.fastq
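If you want to check what will be run before running anything, add --dry-run to print the composed commands instead of executing them:
parallel --dry-run Trimmomatic {} {= s/_R1/_R2/ =} -o {= s/_R1/_R1_PE/ =} {= s/_R1/_R1_SE/ =} {= s/_R1/_R2_PE/ =} {= s/_R1/_R2_SE/ =} ::: *_R1.fastq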
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU; GNU Parallel instead spawns a new job whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
for file in *
do
  some_command_that_does_something_unspecified "$file"
done
Store the '*_R1.fastq' files in an array and run a loop over all the '*_R2.fastq' files:
r1_files=( *_R1.fastq )
i=0
for r2_file in *_R2.fastq; do
  command "${r1_files[$((i++))]}" "$r2_file"
done
Name the output files as required by Trimmomatic.
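Alternatively, a plain loop over the R1 files can derive all the other names with parameter expansion (a sketch assuming the Trimmomatic invocation shown above):
for r1 in *_R1.fastq; do
  base=${r1%_R1.fastq}
  Trimmomatic "$r1" "${base}_R2.fastq" -o \
    "${base}_R1_PE.fastq" "${base}_R1_SE.fastq" \
    "${base}_R2_PE.fastq" "${base}_R2_SE.fastq"
done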

read multiple files in bash

I have two .txt files that I want to read line by line simultaneously in a .sh script. Both .txt files have the same number of lines. Inside the loop I want to use the sed command to change the full_sample_name and sample_name in another file.
I know how this works if you just read one file, but I cannot get it to work for two files.
#! /bin/bash
FULL_SAMPLE="file1.txt"
SAMPLE="file2.txt"
while read ... && ...
do
sed -e "s/\<full_sample_name\>/$FULL_SAMPLE/g" -e "s/\<sample_name\>/$SAMPLE/g" pipeline.sh > $SAMPLE.sh
done < ...?
Charles provided a very good answer.
You could use paste to join the lines of the files with some delimiter (that shouldn't appear in the files):
paste -d ":" file1.txt file2.txt | while IFS=":" read -r full samp; do
do_stuff_with "$full" and "$samp"
done
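If it is hard to pick a delimiter that is guaranteed not to appear in the files, paste's default tab separator is often a reasonable choice (assuming the values themselves contain no tabs):
paste file1.txt file2.txt | while IFS=$'\t' read -r full samp; do
  do_stuff_with "$full" and "$samp"
done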
#!/bin/bash
full_sample_file="file1.txt"
sample_file="file2.txt"
while read -r -u 3 full_sample_name && read -r -u 4 sample_name; do
  sed -e "s/\<full_sample_name\>/$full_sample_name/g" \
      -e "s/\<sample_name\>/$sample_name/g" \
      pipeline.sh >"$sample_name.sh"
done 3<"$full_sample_file" 4<"$sample_file" # automatically closed on loop exit
In this case, I'm assigning file descriptor 3 to file1.txt and file descriptor 4 to file2.txt.
By the way, with bash 4.1 or newer, you no longer need to handle file descriptors manually:
# opening explicitly, since even if opened on the loop, these need
# to be explicitly closed.
exec {full_sample_fd}<file1.txt
exec {sample_fd}<file2.txt
while read -r -u "$full_sample_fd" full_sample_name \
&& read -r -u "$sample_fd" sample_name; do
: do stuff here with "$full_sample_name" and "$sample_name"
done
# close the files explicitly
exec {full_sample_fd}>&- {sample_fd}>&-
One more note: you could make this a bit more efficient (and also more correct, if your sample_name and full_sample_name values aren't guaranteed to evaluate to themselves when interpreted as regular expressions, if your input file contains no literal NULs [which, as a shell script, it shouldn't], and if the angle brackets are intended to be literal rather than word-boundary regex characters) by not using sed at all: just read the file to be converted into a shell variable and do the replacements there!
exec {full_sample_fd}<file1.txt
exec {sample_fd}<file2.txt
IFS= read -r -d '' input_file <pipeline.sh
while read -r -u "$full_sample_fd" full_sample_name \
&& read -r -u "$sample_fd" sample_name; do
output=${input_file//'<full_sample_name>'/${full_sample_name}}
output=${output//'<sample_name>'/${sample_name}}
printf '%s' "$output" >"${sample_name}.sh"
done
# close the files explicitly
exec {full_sample_fd}>&- {sample_fd}>&-
With GNU Parallel it will look like this:
#! /bin/bash
do_sed() {
  sed -e "s/\<full_sample_name\>/$1/g" -e "s/\<sample_name\>/$2/g" pipeline.sh > "$2".sh
}
export -f do_sed
parallel --xapply do_sed {1} {2} :::: file1.txt file2.txt
The added benefit is that the jobs run in parallel. Depending on your storage system this may speed up the processing: on a RAID6 I have seen a 6x speedup by running 10 jobs in parallel. YMMV, so the only way to know for sure is to test and measure.
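One way to test and measure is to time the same run with different job counts (the counts here are only examples):
time parallel --xapply -j 1 do_sed {1} {2} :::: file1.txt file2.txt
time parallel --xapply -j 10 do_sed {1} {2} :::: file1.txt file2.txt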
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU; GNU Parallel instead spawns a new job whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Generating sequences of numbers and characters with bash

I have written a script that accepts two files as input. I want to run all of them in parallel at the same time on different CPUs.
inputs:
x00.A x00.B
x01.A x01.B
...
x30.A x30.B
instead of running 30 times:
./script x00.A x00.B
./script x01.A x01.B
...
./script x30.A x30.B
I wanted to use paste and seq to generate and send them to the script.
paste & seq | xargs -n1 -P 30 ./script
But I do not know how to combine letters and numbers using paste and seq commands.
for num in $(seq -f %02.f 0 30); do
  ./script x$num.A x$num.B &
done
wait
Although I personally prefer not to use GNU seq or BSD jot, and use the (ksh/bash) builtins instead:
num=-1; while (( ++num <= 30 )); do
  printf -v padded '%02d' "$num"   # zero-pad to match the x00 .. x30 file names
  ./script "x$padded.A" "x$padded.B" &
done
wait
The final wait is just needed to make sure they all finish, after having run spread across your available CPU cores in the background. So, if you need the output of ./script, you must keep the wait.
Putting them into the background with & is the simplest way for parallelism. If you really want to exercise any sort of control over lots of backgrounded jobs like that, you will need some sort of framework like GNU Parallel instead.
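If all you want is a cap on how many background jobs run at once, newer bash can do that without a framework; here is a rough sketch (wait -n needs bash 4.3 or later, and the limit of 4 is arbitrary):
max_jobs=4
for num in $(seq -f %02.f 0 30); do
  ./script "x$num.A" "x$num.B" &
  while (( $(jobs -rp | wc -l) >= max_jobs )); do
    wait -n   # block until any one background job finishes
  done
done
wait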
You can use pure bash for generating the sequence:
printf "%s %s\n" x{00..30}.{A..B} | xargs -n1 -P 30 ./script
Happy holidays!

running parallel bash background

I have a script that I want to run on a number of files
my_script file_name
but I have many, so I have written some code that is meant to process several at the same time by first creating 5 'equal' lists of the files I want to process, followed by this:
my_function() {
  while read i; do
    my_script $i
  done < $1
}
my_function list_1 &
my_function list_2 &
my_function list_3 &
my_function list_4 &
my_function list_5 &
wait
This works for the first file in each list but then finishes. If I change the function to a simple echo it works fine
my_function() {
  while read i; do
    echo $i
  done < $1
}
it prints all the files in each list as I would expect.
Why does it not work if I use 'my_script'?? And is there a 'nicer' way of doing this?
GNU Parallel is made for this:
parallel my_script ::: files*
You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/
You can install GNU Parallel in just 10 seconds with:
wget -O - pi.dk/3 | sh
Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Edit:
If the reason for not installing GNU Parallel is not covered by
http://oletange.blogspot.dk/2013/04/why-not-install-gnu-parallel.html
would you then be kind enough to elaborate why?
There must be an exit statement in my_script. Replace the exit statement(s) with return statement(s).
Another thing to check is the possibility that the same file is contained in more than one list. There may be contention issues in processing: the file is already being processed and another process attempts to open the same file. Check for any duplicate files with:
sort file_[1-5] | uniq -d
As an alternative to GNU parallel, there is https://github.com/mauvilsa/run_parallel which is simply a function in bash, so it does not require root access or compiling.
To use it, first source the file
source run_parallel.inc.sh
Then in your example, execute it as
run_parallel -T 5 my_function 'list_{%}'
It could also do the splitting of the lists for you as
run_parallel -T 5 -l full_list -n split my_function '{#}'
To see the usage explanation and some examples, execute run_parallel without any arguments.
