GNU parallel with nested for loops and multiple commands

I am trying to run 10 instances of a Bash function simultaneously with GNU Parallel.
The Bash function downloads tiles of an image and stitches them together: first each column is assembled from its rows, then the columns are joined into a single image file.
function DOWNLOAD_PAGE {
  for PAGE in {0041..0100}
  do
    for COLUMN in {0..1}
    do
      for ROW in {0..2}
      do
        wget -O "$PAGE"_"$COLUMN"_"$ROW".jpg "http://www.webb$PAGE$COLUMN$ROW"
      done
      convert "$PAGE"_"$COLUMN"_*.jpg -append "$PAGE"__"$COLUMN".jpg
    done
    convert "$PAGE"__*.jpg +append "$PAGE"_done.jpg
  done
}
Unfortunately, the apparently obvious solutions - the first one being
export -f DOWNLOAD_PAGE
parallel -j10 DOWNLOAD_PAGE
do not work.
Is there a way to do this using GNU Parallel?

Parts of your function can be parallelized and others cannot: e.g., you cannot append the images before you have downloaded them.
function DOWNLOAD_PAGE {
  export PAGE=$1
  for COLUMN in {0..1}
  do
    parallel wget -O "$PAGE"_"$COLUMN"_{}.jpg "http://www.webb$PAGE$COLUMN{}" ::: {0..2}
    convert "$PAGE"_"$COLUMN"_*.jpg -append "$PAGE"__"$COLUMN".jpg
  done
  convert "$PAGE"__*.jpg +append "$PAGE"_done.jpg
}
export -f DOWNLOAD_PAGE
parallel -j10 DOWNLOAD_PAGE ::: {0041..0100}
A more parallelized version (but harder to read):
function DOWNLOAD_PAGE {
  export PAGE=$1
  parallel -I // --arg-sep /// parallel wget -O "$PAGE"_//_{}.jpg "http://www.webb$PAGE//{}"\; convert "$PAGE"_"//"_\*.jpg -append "$PAGE"__"//".jpg ::: {0..2} /// {0..1}
  convert "$PAGE"__*.jpg +append "$PAGE"_done.jpg
}
export -f DOWNLOAD_PAGE
parallel -j10 DOWNLOAD_PAGE ::: {0041..0100}
Your understanding of what GNU Parallel does is somewhat misguided. Consider walking through the tutorial http://www.gnu.org/software/parallel/parallel_tutorial.html and then try to understand how the examples work: http://www.gnu.org/software/parallel/man.html#example__working_as_xargs_n1_argument_appending
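If it is unclear what GNU Parallel will actually run, the --dry-run option prints the generated command lines without executing anything; for example:
parallel --dry-run -j10 DOWNLOAD_PAGE ::: {0041..0043}
# prints one command line per argument:
# DOWNLOAD_PAGE 0041
# DOWNLOAD_PAGE 0042
# DOWNLOAD_PAGE 0043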

Related

How to give file input from a dir and produce the output in a different dir using GNU parallel?

I am trying to use parallel to sort and index BAM files with samtools, producing the output in a given output_dir, but I am facing some problems.
So far I have tried the following, which works, but it creates an unwanted directory named "1" within output_dir and also leaves result files within input_dir:
parallel --results output_dir 'samtools sort -o {.}.sorted.bam {}' ::: input_dir/*.bam
This, from comments, is not working:
parallel 'samtools sort -o output_dir/{.}.sorted.bam {}' ::: input_dir/*.bam
I get the error
“[E::hts_open_format] Failed to open file output_dir/input_dir/A-8_20181222_0036.sorted.bam”
Note: This is just one tool (samtools) I am asking but I will be using other tools that produce output using --output / -o flag.
If your question is "how can I add a different directory instead of the input directory", just put it verbatim before the {/.} token. (You had {.}, but we also want to trim the directory name.)
parallel 'samtools sort -o output_dir/{/.}.sorted.bam {}' ::: input_dir/*.bam
See the manual for more ideas; there are many transformations you can perform on the input token.
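To get a feel for the most common replacement strings, you can let --dry-run show how each one expands (the file name here is made up):
parallel --dry-run echo {} {.} {/} {/.} {//} ::: input_dir/sample.bam
# echo input_dir/sample.bam input_dir/sample sample.bam sample input_dir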

Parallelize for loop in bash

I have the following snippet in my bash script
#!/bin/bash
for ((i=100; i>=70; i--))
do
  convert test.png -quality "$i" -sampling-factor 1x1 test_libjpeg_q"$i".jpg
done
How can I execute the for loop in parallel using all CPU cores? I have seen GNU parallel being used, but here I need the output filename to follow a specific naming scheme, as shown above.
You can use parallel like this:
parallel \
'convert test.png -quality {} -sampling-factor 1x1 test_libjpeg_q{}.jpg' ::: {100..70}
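GNU Parallel defaults to one job per CPU core, so the line above already uses all cores; -j N overrides the job count if needed. As a sketch of a further possibility: a second ::: source makes parallel run all combinations of the inputs, which would sweep the quality range over several images at once (the output naming here is illustrative):
parallel 'convert {2} -quality {1} -sampling-factor 1x1 {2.}_q{1}.jpg' \
  ::: {100..70} ::: *.png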

GNU Parallel: Argument list too long when calling function

I created a script to verify a (big) number of items, and it was doing the verification serially (one item after the other), with the end result that the script takes about 9 hours to complete. Looking around for how to improve this, I found GNU parallel, but I'm having problems making it work.
The list of items is in a text file so I was doing the following:
readarray items < ${ALL_ITEMS}
export -f process_item
parallel process_item ::: "${items[@]}"
Problem is, I receive an error:
GNU parallel: Argument list too long
I understand by looking at similar posts 1, 2, 3 that this is more a Linux limitation than a GNU parallel one. From the answers to those posts I also tried to extrapolate a workaround by piping the items to head, but the result is that only a few items (the number passed to head) are processed.
I have been able to make it work using xargs:
cat "${ALL_ITEMS}" | xargs -n 1 -P ${THREADS} -I {} bash -c 'process_item "$#"' _ {}
but I've seen GNU parallel has other nice features I'd like to use.
Any idea how to make this work with GNU parallel? By the way, the number of items is about 2.5 million and growing every day (the script runs as a cron job).
Thanks
From man parallel:
parallel [options] [command [arguments]] < list_of_arguments
So:
export -f process_item
parallel process_item < "${ALL_ITEMS}"
probably does what you want.
You can pipe the file to parallel, or just use the -a (--arg-file) option. The following are equivalent:
cat "${ALL_ITEMS}" | parallel process_item
parallel process_item < "${ALL_ITEMS}"
parallel -a "${ALL_ITEMS}" process_item
parallel --arg-file "${ALL_ITEMS}" process_item
parallel process_item :::: "${ALL_ITEMS}"
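As a sketch of why this route beats xargs (assuming process_item is your exported function; the log file name is arbitrary): --joblog records the exit code and runtime of every item, and --retries re-runs failed items, both handy for a 2.5-million-item cron job:
export -f process_item
parallel --retries 2 --joblog items.log -a "${ALL_ITEMS}" process_item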

How to automate file transformation with simultaneous execution?

I am working on transforming a lot of image files (PNG) into text files. I have the basic code to do this one by one, which is really time consuming. My process involves converting the image files into a black and white format and then using tesseract to transform those into a text file. This process works great, but it would take days for me to accomplish my task if done file by file.
Here is my code:
for f in $1
do
  echo "Processing $f file..."
  convert $f -resample 200 -colorspace Gray ${f%.*}BW.png
  echo "OCR'ing $f"
  tesseract ${f%.*}BW.png ${f%.*} -l tla -psm 6
  echo "Removing black and white for $f"
  rm ${f%.*}BW.png
done
echo "Done!"
Is there a way to perform this process on each file at the same time? That is, how would I be able to run this process simultaneously instead of one by one? My goal is to significantly reduce the amount of time it would take for me to transform these images into text files.
Thanks in advance.
You could make the body of your for loop a function, then call the function multiple times, sending each call to the background so the next one can start immediately.
function my_process {
  echo "Processing $1 file..."
  convert $1 -resample 200 -colorspace Gray ${1%.*}BW.png
  echo "OCR'ing $1"
  tesseract ${1%.*}BW.png ${1%.*} -l tla -psm 6
  echo "Removing black and white for $1"
  rm ${1%.*}BW.png
}
for file in "${files[@]}"
do
  # & at the end sends it to the background.
  my_process "$file" &
done
I want to thank contributors @Songy and @shellter.
To answer my question... I ended up using GNU Parallel to run these processes 5 at a time. Here is the code that I used:
parallel -j 5 convert {} -resample 200 -colorspace Gray {.}BW.png ::: *.png
parallel -j 5 tesseract {} {} -l tla -psm 6 ::: *BW.png
rm *BW.png
I am now in the process of splitting my dataset in order to run this command simultaneously with different subgroups of my (very large) pool of images.
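As a variant (an untested sketch, with the same tesseract options assumed), all three steps could be chained per file in a single parallel invocation, which would also remove the need to split the dataset by hand:
parallel -j 5 'convert {} -resample 200 -colorspace Gray {.}BW.png &&
  tesseract {.}BW.png {.} -l tla -psm 6 && rm {.}BW.png' ::: *.png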
Cheers

Prepend header to file without changing the file

Background
The enscript command can apply syntax highlighting to various types of source files, including SQL statements, shell scripts, PHP code, HTML files, and more. I am using enscript to generate 300dpi images of source code for a technical manual to:
Generate content for the book based on actual source code.
Distribute the source code along with the book, without any modification.
Run and test the scripts while writing the book.
Problem
The following shell script performs the conversion almost as desired:
#!/bin/bash
DIRNAME=$(dirname $1)
FILENAME=$(basename $1)
# Remove the extension from the filename.
BASENAME=${FILENAME%%.*}
FILETYPE=${FILENAME##*.}
LIGHTGRAY="#f3f3f3"
enscript --escapes --color -f Courier10 -X ps -B -1 --highlight=$FILETYPE \
  $2 -h -o - $1 | \
gs -dSAFER -sDEVICE=pngalpha -dGraphicsAlphaBits=4 -dNOPAUSE -r300 \
  -sOutputFile=$BASENAME.png -dBackgroundColor=16$LIGHTGRAY > /dev/null && \
convert -trim $BASENAME.png $BASENAME-trimmed.png && \
mv $BASENAME-trimmed.png $BASENAME.png
The problem is that the background is not a light gray colour. According to the enscript man page, the --escapes (-e) option indicates that the file (i.e., $1) has enscript-specific control sequences embedded within it.
Adding the control sequences means having to duplicate code, which defeats the purpose of having a single source.
Solution
The enscript documentation implies that it should be possible to concatenate two files (a "header" and the target) before running the script, creating a third file:
^#shade{0.85} -- header line
#!/bin/bash -- start of source file
Then delete the third file once the command completes.
Questions
Q.1. What is a more efficient way to pipe the control sequences and the source file to the enscript program without using a third file?
Q.2. What other options are available to automate syntax highlighting for a book, while honouring the single source requirements I have described? (For example, write the book in LyX and use LaTeX commands for import and syntax highlighting.)
Q1 You can use braces '{}' to do I/O redirection:
{ echo "^#shade{0.85}"; cat $1; } |
enscript --color -f Courier10 -X ps -B -1 --highlight=$FILETYPE $2 -h -o - |
gs -dSAFER -sDEVICE=pngalpha -dGraphicsAlphaBits=4 -dNOPAUSE -r300 \
  -sOutputFile=$BASENAME.png -dBackgroundColor=16$LIGHTGRAY > /dev/null &&
convert -trim $BASENAME.png $BASENAME-trimmed.png &&
mv $BASENAME-trimmed.png $BASENAME.png
This assumes that enscript reads its standard input when not given an explicit file name; if not, you may need to use an option (perhaps '-i -') or some more serious magic, possibly even 'process substitution' in bash.
You could also use parentheses to run a sub-shell:
(echo "^#shade{0.85}"; cat $1) | ...
Note that the semi-colon after cat is necessary with braces and not necessary with parentheses (and a space is necessary after the open brace) - such are the mysteries of shell scripting.
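For completeness, the process-substitution route mentioned above would look roughly like this (a sketch only; it assumes enscript is happy reading the named pipe that bash substitutes for the file name):
enscript --escapes --color -f Courier10 -X ps -B -1 --highlight=$FILETYPE \
  $2 -h -o - <(echo '^#shade{0.85}'; cat "$1") | ...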
Q2 I don't have any alternatives to offer. When I produced a book (20 years ago now, using troff), I wrote a program to convert source into the necessary markup, so that the book was produced from the source code, but by an automated process.
(Is 300 dpi sufficiently high resolution?)
Edit
To work around the enscript program interpreting the escape sequence embedded in the conversion script itself:
{ cat ../../enscript-header.txt $1; } |
Q2: Use LaTeX with the listings package.
