How to automate file transformation with simultaneous execution? - bash

I am working on transforming a lot of image files (png) into text files. I have the basic code to do this one by one, which is really time-consuming. My process involves converting the image files into a black and white format and then using tesseract to transform those into a text file. This process works great, but it would take days for me to accomplish my task if done file by file.
Here is my code:
for f in $1
do
    echo "Processing $f file..."
    convert "$f" -resample 200 -colorspace Gray "${f%.*}BW.png"
    echo "OCR'ing $f"
    tesseract "${f%.*}BW.png" "${f%.*}" -l tla -psm 6
    echo "Removing black and white for $f"
    rm "${f%.*}BW.png"
done
echo "Done!"
Is there a way to perform this process on each file at the same time, that is, how would I be able to run this process simultaneously instead of one by one? My goal is to significantly reduce the amount of time it would take for me to transform these images into text files.
Thanks in advance.

You could make the body of your for loop a function, then call that function for each file and send each call to the background, so the next one can start immediately.
my_process() {
    echo "Processing $1 file..."
    convert "$1" -resample 200 -colorspace Gray "${1%.*}BW.png"
    echo "OCR'ing $1"
    tesseract "${1%.*}BW.png" "${1%.*}" -l tla -psm 6
    echo "Removing black and white for $1"
    rm "${1%.*}BW.png"
}
for file in "${files[@]}"
do
    # The & at the end sends the call to the background.
    my_process "$file" &
done
wait   # block until all background jobs have finished
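One caveat: this starts a background job for every file at once, which can overwhelm the machine on a large batch. A minimal sketch that caps the number of concurrent jobs (assuming bash 4.3+ for wait -n; the cap of 5 is arbitrary):
max_jobs=5
for file in "${files[@]}"
do
    # If max_jobs are already running, block until one of them exits.
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n
    done
    my_process "$file" &
done
wait   # wait for the remaining jobs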

I want to thank contributors @Songy and @shellter.
To answer my question... I ended up using GNU Parallel in order to run these processes 5 at a time. Here is the code that I used:
parallel -j 5 convert {} "-resample 200 -colorspace Gray" {.}BW.png ::: *.png ; parallel -j 5 tesseract {} {} -l tla -psm 6 ::: *BW.png ; rm *BW.png
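For reference, the two passes can also be collapsed into one function that Parallel runs per file; a sketch, assuming GNU Parallel is started from bash (so the exported function is visible to the shells it spawns) and the same custom tla tesseract language as above:
ocr_one() {
    convert "$1" -resample 200 -colorspace Gray "${1%.*}BW.png"
    tesseract "${1%.*}BW.png" "${1%.*}" -l tla -psm 6
    rm "${1%.*}BW.png"
}
export -f ocr_one   # make the function available to Parallel's subshells
parallel -j 5 ocr_one ::: *.png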
I am now in the process of splitting my dataset in order to run this command simultaneously with different subgroups of my (very large) pool of images.
Cheers

Related

Only cropping/appending a subset of images from an image sequence

I have a numbered image sequence that I need to crop and append, but only certain frame ranges.
For example, a sequence of 100 images named as follows:
frame001.jpg
frame002.jpg
frame003.jpg
...
Sometimes I might only need to crop and append images 20-30; other times, 5-75.
How can I specify a range? Simply outputting to a PNG.
For example, if you want to pick the jpg files in the range of 20-30
and generate a png file appending them, would you please try:
#!/bin/bash
declare -a input                                 # an array to store jpg filenames
for i in $(seq 20 30); do                        # loop from 20 to 30
    input+=( "$(printf "frame%03d.jpg" "$i")" )  # append each filename to the array
done
echo convert -append "${input[@]}" "output.png"  # generate a png file appending the files
If the output command looks good, drop echo.
If you are unsure how to run a bash script and prefer a one-liner, please try instead:
declare -a input; for i in $(seq 20 30); do input+=( "$(printf "frame%03d.jpg" "$i")" ); done; echo convert -append "${input[@]}" "output.png"
[Edit]
If you want to crop the images with e.g. 720x480+300+200,
then please try:
#!/bin/bash
declare -a input
for i in $(seq 20 30); do
    input+=( "$(printf "frame%03d.jpg" "$i")" )
done
convert "${input[@]}" -crop 720x480+300+200 -append "output.png"
The order of options and filenames doesn't matter here, but I have followed
the modern style of ImageMagick usage to place the input filenames first.
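As an aside, bash 4+ supports zero-padded brace expansion, which can replace the array loop entirely; a sketch, assuming the frames really use three-digit numbering as above:
# {020..030} expands to 020 021 ... 030, preserving the zero padding (bash 4+).
convert frame{020..030}.jpg -crop 720x480+300+200 -append output.png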

Processing images using ImageMagick through bash scripts

I have a folder of images. I want to iterate through the folder, apply the same ImageMagick convert function to each file, and save the output to a separate folder.
The way I'm doing it currently is this:
#!/bin/bash
mkdir "Folder2"
for f in Folder1/*.png
do
echo "convert -brightness-contrast 10x60" "Folder1/$f" "Folder2/"${f%.*}"_suffix.png"
done
Then I copy and paste that terminal output into a new bash script that ends up looking like this:
#!/bin/bash
convert -brightness-contrast 10x60 Folder1/file1.png Folder2/file1_suffix.png
convert -brightness-contrast 10x60 Folder1/file2.png Folder2/file2_suffix.png
convert -brightness-contrast 10x60 Folder1/file3.png Folder2/file3_suffix.png
I tried to write a single bash script for this task, but there was some weirdness with the variable handling, and this two-script method got me what I needed... but I suspect there's an easier/simpler way, possibly even a one-line solution.
It's enough to change your first script to execute the commands instead of just echoing them (note that $f already contains the Folder1/ prefix, which was the likely source of the weirdness):
#!/bin/bash
mkdir "Folder2"
for f in Folder1/*.png
do
    name=${f##*/}   # strip the Folder1/ prefix for the output path
    convert -brightness-contrast 10x60 "$f" "Folder2/${name%.*}_suffix.png"
done
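For the one-line solution the question hoped for, ImageMagick's mogrify can write its output to a separate folder; a sketch, with the caveat that -path keeps the original filenames, so there is no _suffix renaming:
# mogrify normally edits in place, but -path redirects the output to Folder2.
mkdir -p Folder2
mogrify -path Folder2 -brightness-contrast 10x60 Folder1/*.png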
My crystal ball tells me that there are spaces in the filenames, causing the "weirdness with the variable handling". In that case you need a workaround for the spaces. For example, you may try the following script:
#!/bin/bash
hasspaces="^(.+[^\'\"])([ ])(.+)$"
function escapespaces {
    declare -n name=$1   # nameref (bash 4.3+): modifies the caller's variable
    while [[ $name =~ $hasspaces ]] ; do
        name=${BASH_REMATCH[1]}'\'${BASH_REMATCH[2]}${BASH_REMATCH[3]}
        echo 'Escaped string: '\'$name\'
    done
}
mkdir Folder2
while read -r entry; do
    echo "File '$entry'"
    escapespaces entry
    echo "File '$entry'"
    tmp=${entry#Folder1/}
    eval "convert -brightness-contrast 10x60" "$entry" "Folder2/"${tmp%.*}"_suffix.png"
done <<<"$(eval "ls -1 Folder1/*.png")"
If this does not work, by all means let me know so I can request a refund for my crystal ball! Also, if you can give more details on the "weirdness in variable handling", we could try to help with those other weirdnesses :-)
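For completeness, the more common space-safe pattern avoids eval and manual escaping altogether; a sketch, assuming the same folder layout as above:
# NUL-delimited find output survives any filename, spaces included.
mkdir -p Folder2
find Folder1 -name '*.png' -print0 | while IFS= read -r -d '' f; do
    name=${f##*/}
    convert -brightness-contrast 10x60 "$f" "Folder2/${name%.*}_suffix.png"
done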
Check out this answer (and the other answers on that question). You can use the same approach in your example:
find Folder1 -name "*.png" | sed -e 'p;s/Folder1/Folder2/g' -e 's/\.png$/_suffix.png/' | xargs -n2 convert -brightness-contrast 10x60
Note: the p in the first sed expression is what does the trick.
find lists all the files in Folder1 whose names match the *.png pattern
sed -e 'p;s/Folder1/Folder2/g' will (a) print the input line as-is and (b) replace Folder1 with Folder2
-e 's/\.png$/_suffix.png/' replaces the .png suffix with the _suffix.png suffix
xargs -n2 passes the arguments to convert two at a time (the first being the original path printed by p, the second the one transformed by the s commands)
convert ... is your command, taking two arguments: input file and output file.
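One caveat: xargs splits on whitespace by default, so this pipeline breaks on filenames containing spaces. A sketch of a more tolerant variant, assuming GNU xargs for the -d option:
# -d '\n' makes xargs split on newlines only, so names with spaces survive.
find Folder1 -name "*.png" | sed -e 'p;s/Folder1/Folder2/g' -e 's/\.png$/_suffix.png/' | xargs -d '\n' -n2 convert -brightness-contrast 10x60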

Linux tools: How to apply a filter on a PDF document?

I am a researcher and I have to read research papers. Unfortunately, the characters are not dark enough, so the papers are hard to read when printed. Note that the printer's cartridge is not the problem; the characters are simply not printed dark enough (the text is already black: take a look at a sample).
This is how the characters look in Photoshop:
[image: the light characters on a transparent background]
Note that the background is transparent when you import a PDF document into Photoshop.
I use this awful solution:
First, I import the PDF document into Photoshop. The pages are imported as individual images with transparent backgrounds.
Then, for each page I apply either of these two methods:
Method 1: Copy the layer over itself multiple times, so that the image gets darker
Method 2: Apply a Min filter to the image
This is how it looks after conversion (left: Min filter, right: layer duplication):
[image: the same characters darkened by each method]
This solves my problem for printing a single page, and I can read the printed contents easily. However, it is hard to convert every page of every PDF paper using Photoshop! Is there any wiser solution/tool/application?
Here is what I need:
1. How to convert PDF to high-quality image (either in Linux or Windows, with any tool).
2. How to apply Min Filter (or any better filter) on the image files automatically. (e.g. a script or whatever)
Thanks!
It can be done using four tools:
pdftoppm: Convert PDF file to separate png images
octave (MATLAB's open-source alternative): Generate an Octave script that applies the min filter to the images; running the script writes out darkened copies of the pages.
pdflatex: Create a single .tex file that imports all the images, then compile the .tex file to obtain a single PDF
bash!: A bash script that automates the process
Here is a fully automated solution as a single bash script:
pdftoppm -rx 300 -ry 300 -png "$1" img
# Create an octave script that applies the min filter to all pages.
echo "#!/usr/bin/octave -qf" > script.m
echo "pkg load image" >> script.m
for i in img*.png
do
    echo "i = imread('$i');" >> script.m
    echo "i(:,:,1) = ordfilt2(i(:,:,1),1,ones(3,3));" >> script.m
    echo "i(:,:,2) = ordfilt2(i(:,:,2),1,ones(3,3));" >> script.m
    echo "i(:,:,3) = ordfilt2(i(:,:,3),1,ones(3,3));" >> script.m
    echo "imwrite(i,'p$i')" >> script.m
done
# Run the octave script
chmod 755 script.m
./script.m
# Convert the png images to a single PDF:
# create a latex file that includes all the image files
echo "\documentclass{article}" > f.tex
echo "\usepackage[active,tightpage]{preview}" >> f.tex
echo "\usepackage{graphicx}" >> f.tex
echo "\PreviewMacro[{*[][]{}}]{\includegraphics}" >> f.tex
echo "\begin{document}" >> f.tex
echo -n "%" >> f.tex
for i in pimg*.png
do
    echo "\newpage" >> f.tex
    echo "\includegraphics{"$i"}" >> f.tex
done
echo "\end{document}" >> f.tex
# Compile the latex document
pdflatex -synctex=1 -interaction=nonstopmode f
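As an aside (not part of the original recipe), newer ImageMagick builds can apply a 3x3 minimum filter directly, which could replace the whole Octave step; a sketch:
# -statistic Minimum 3x3 is ImageMagick's min filter, same effect as ordfilt2 above.
for i in img*.png
do
    convert "$i" -statistic Minimum 3x3 "p$i"
done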

Unix: Combine PDF files and images into a PDF file?

My friend is asking this question: he is using a Mac and cannot get pdflatex working (having no dev CD, related here). Anyway, my first idea:
$ pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf [only pdfs]
$ convert 1.png 2.png myfile.pdf [only images]
Now, without LaTeX or iPad's Notes Plus, I don't know how to combine images and PDF files. So how can I combine PDF files and images in Unix?
You could run a loop that identifies PDFs and images, converting the images to PDF with ImageMagick. When you're done, assemble everything with pdftk.
This is a Bash-only script.
#!/bin/bash
# Convert arguments into a list
N=0
for file in "$@"; do
    files[$N]=$file
    N=$(( N + 1 ))
done
# The last element of the list is our destination filename
N=$(( N - 1 ))
LAST=${files[$N]}
unset files[$N]
N=$(( N - 1 ))
# Check all files in the input array, converting image types
T=0
for i in $( seq 0 $N ); do
    file=${files[$i]}
    case ${file##*.} in
    jpg|png|gif|tif)
        temp="tmpfile.$T.pdf"
        convert "$file" "$temp"
        tmp[$T]=$temp
        uses[$i]=$temp
        T=$(( T + 1 ))
        # Or also: tmp=("${tmp[@]}" "$temp")
        ;;
    pdf)
        uses[$i]=$file
        ;;
    esac
done
# Now assemble the PDF files
pdftk "${uses[@]}" cat output "$LAST"
# Remove the temporary files. Disabled because you never know :-)
echo "I would remove ${tmp[@]}"
# rm "${tmp[@]}"
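A usage sketch (the script name combine.sh is hypothetical; the last argument is the output file):
bash combine.sh cover.png chapter1.pdf figure.jpg combined.pdf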
I am gathering some info here.
Unix Commandline
More about Pdftk here.
Merging png images into one pdf file
Combine all files in a folder as pdf
Mac
Because the moderator on Apple SE removed the useful thread "Merge PDF files and images to single PDF file in Mac?"
here
-- I am collecting the tips for Mac here -- sorry, but the moderator is very intolerant about collecting newbie things.
https://apple.stackexchange.com/questions/16226/what-software-is-available-preferably-free-to-create-and-edit-pdf-files-on-mac
https://apple.stackexchange.com/questions/812/how-can-i-combine-two-pdfs-in-preview
https://apple.stackexchange.com/questions/11163/how-do-i-combine-two-or-more-images-to-get-a-single-pdf-file
https://apple.stackexchange.com/questions/69659/ipad-pdf-software-to-edit-merge-annotate-etc-well-pdf-documents-like-in-deskto

How do I determine if a gif is animated?

I have a large number of files with the .gif extension. I would like to move all animated gifs to another directory. How can I do this using the Linux shell?
Basically, if identify returns more than one line for a GIF, it's likely animated because it contains more than one image. You may get false positives, however.
Example use in shell:
for i in *.gif; do
    if [ "$(identify "$i" | wc -l)" -gt 1 ] ; then
        echo move "$i"
    else
        echo dont move "$i"
    fi
done
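Once the dry run prints the right files, the echo can be replaced with an actual mv; a sketch, where the destination directory animated/ is a placeholder:
mkdir -p animated
for i in *.gif; do
    # More than one line from identify means more than one frame.
    if [ "$(identify "$i" | wc -l)" -gt 1 ] ; then
        mv "$i" animated/
    fi
done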
