I have a small bash script to OCR PDF files (a slightly modified version of this script). The basic flow for each file is:
For each page in PDF FILE:
Convert the page to a TIFF image (ImageMagick)
OCR the image (tesseract)
Append the result to a text file
Script:
FILES=/home/tgr/OCR/input/*.pdf
for f in $FILES
do
  FILENAME=$(basename "$f")
  ENDPAGE=$(pdfinfo $f | grep "^Pages: *[0-9]\+$" | sed 's/.* //')
  OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
  RESOLUTION=1400
  touch $OUTPUT
  for i in `seq 1 $ENDPAGE`; do
    convert -monochrome -density $RESOLUTION $f\[$(($i - 1 ))\] page.tif
    echo processing file $f, page $i
    tesseract page.tif tempoutput -l ces
    cat tempoutput.txt >> $OUTPUT
  done
  rm tempoutput.txt
  rm page.tif
done
Because of the high resolution and the fact that tesseract can only utilize one core, the process is extremely slow (it takes approx. 3 minutes to convert one PDF file).
Because I have thousands of PDF files, I think I could use parallel to use all 4 cores, but I don't understand how to apply it. In the examples I see:
Nested for-loops like this:
(for x in `cat xlist` ; do
  for y in `cat ylist` ; do
    do_something $x $y
  done
done) | process_output
can be written like this:
parallel do_something {1} {2} :::: xlist ylist | process_output
Unfortunately I was not able to figure out how to apply this. How do I parallelize my script?
Since you have thousands of PDF files, it is probably enough simply to parallelize the processing of the PDF files and not parallelize the processing of the pages in a single file.
function convert_func {
  f=$1
  FILENAME=$(basename "$f")
  ENDPAGE=$(pdfinfo $f | grep "^Pages: *[0-9]\+$" | sed 's/.* //')
  OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
  RESOLUTION=1400
  touch $OUTPUT
  for i in `seq 1 $ENDPAGE`; do
    convert -monochrome -density $RESOLUTION $f\[$(($i - 1 ))\] $$.tif
    echo processing file $f, page $i
    tesseract $$.tif $$ -l ces
    cat $$.txt >> $OUTPUT
  done
  rm $$.txt
  rm $$.tif
}
export -f convert_func
parallel convert_func ::: /home/tgr/OCR/input/*.pdf
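By default GNU parallel starts one job per CPU core; if you want to be explicit about using your 4 cores, or want a progress/ETA display, the standard -j and --eta options can be added to the same call (just a variation of the command above):
parallel -j 4 --eta convert_func ::: /home/tgr/OCR/input/*.pdf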
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial or http://www.gnu.org/software/parallel/parallel_tutorial.html). Your command line will love you for it.
Read the EXAMPLEs (LESS=+/EXAMPLE: man parallel).
You can have a script like this.
#!/bin/bash
function convert_func {
  local FILE=$1 RESOLUTION=$2 PAGE_INDEX=$3 OUTPUT=$4
  local TEMP0=$(exec mktemp --suffix ".00.$PAGE_INDEX.tif")
  local TEMP1=$(exec mktemp --suffix ".01.$PAGE_INDEX")
  echo convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP0" ## Just for debugging purposes.
  convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP0"
  echo "processing file $FILE, page $PAGE_INDEX" ## I think you mean to place this before the line above.
  tesseract "$TEMP0" "$TEMP1" -l ces
  cat "$TEMP1".txt >> "$OUTPUT" ## Lines may be mixed up from different processes here and a workaround may still be needed, but it may no longer be necessary if outputs are small enough.
  rm -f "$TEMP0" "$TEMP1"
}
export -f convert_func
FILES=(/home/tgr/OCR/input/*.pdf)
for F in "${FILES[#]}"; do
FILENAME=${F##*/}
ENDPAGE=$(exec pdfinfo "$F" | grep '^Pages: *[0-9]\+$' | sed 's/.* //')
OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
RESOLUTION=1400
touch "$OUTPUT" ## This may no longer be necessary. Or probably you mean to truncate it instead e.g. : > "$OUTPUT"
for (( I = 1; I <= ENDPAGE; ++I )); do
printf "%s\xFF%s\xFF%s\xFF%s\x00" "$F" "$RESOLUTION" "$I" "$OUTPUT"
done | parallel -0 -C $'\xFF' -j 4 -- convert_func '{1}' '{2}' '{3}' '{4}'
done
It exports a function that's importable by parallel, properly sanitizes the arguments, and uses unique temporary files to make parallel processing possible.
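The printf ... | parallel -0 -C $'\xFF' pipeline just streams NUL-terminated records whose fields are separated by the byte 0xFF; a quick stand-alone illustration of that mechanism (the letters are placeholders, not part of the script):
printf 'a\xFFb\xFFc\x00x\xFFy\xFFz\x00' | parallel -0 -C $'\xFF' echo 'got: {1} {2} {3}'
This should print "got: a b c" and "got: x y z", which is how convert_func receives its arguments for each page.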
Update: this version holds the output of each page in its own temporary file before concatenating them into one main output file.
#!/bin/bash
shopt -s nullglob
function convert_func {
  local FILE=$1 RESOLUTION=$2 PAGE_INDEX=$3 OUTPUT=$4 TEMPLISTFILE=$5
  local TEMP_TIF=$(exec mktemp --suffix ".01.$PAGE_INDEX.tif")
  local TEMP_TXT_BASE=$(exec mktemp --suffix ".02.$PAGE_INDEX")
  echo "processing file $FILE, page $PAGE_INDEX"
  echo convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP_TIF" ## Just for debugging purposes.
  convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP_TIF"
  tesseract "$TEMP_TIF" "$TEMP_TXT_BASE" -l ces
  echo "$PAGE_INDEX"$'\t'"${TEMP_TXT_BASE}.txt" >> "$TEMPLISTFILE"
  rm -f "$TEMP_TIF"
}
export -f convert_func
FILES=(/home/tgr/OCR/input/*.pdf)
for F in "${FILES[#]}"; do
FILENAME=${F##*/}
ENDPAGE=$(exec pdfinfo "$F" | grep '^Pages: *[0-9]\+$' | sed 's/.* //')
BASENAME=${FILENAME%.*}
OUTPUT="/home/tgr/OCR/output/$BASENAME.txt"
RESOLUTION=1400
TEMPLISTFILE=$(exec mktemp --suffix ".00.$BASENAME")
: > "$TEMPLISTFILE"
for (( I = 1; I <= ENDPAGE; ++I )); do
printf "%s\xFF%s\xFF%s\xFF%s\x00" "$F" "$RESOLUTION" "$I" "$OUTPUT"
done | parallel -0 -C $'\xFF' -j 4 -- convert_func '{1}' '{2}' '{3}' '{4}' "$TEMPLISTFILE"
while IFS=$'\t' read -r __ FILE; do
cat "$FILE"
rm -f "$FILE"
done < <(exec sort -n "$TEMPLISTFILE") > "$OUTPUT"
rm -f "$TEMPLISTFILE"
done
Related
I'm trying to perform a certain operation on each file in a directory, but there is a problem with the order it goes through them. It should do one file at a time. The long line (unzipping, grepping, zipping) works fine on a single file without a script, so there is a problem with the loop. Any ideas?
The script should grep through each zipped file and look for word1 or word2. If at least one of them exists, then:
unzip file
grep word1 and word2 and save it to file_done
remove unzipped file
zip file_done to /donefiles/ with original name
remove file_done from original directory
#!/bin/bash
for file in *.gz; do
    counter=$(zgrep -c 'word1\|word2' $file)
    if [[ $counter -gt 0 ]]; then
        echo $counter
        for file in *.gz; do
            filenoext=${file::-3}
            filedone=${filenoext}_done
            echo $file
            echo $filenoext
            echo $filedone
            gunzip $file | grep 'word1\|word2' $filenoext > $filedone | rm -f $filenoext | gzip -f -c $filedone > /donefiles/$file | rm -f $filedone
        done
    else
        echo "nothing to do here"
    fi
done
The code snippet you've provided has a few problems, e.g. an unneeded nested for loop and an erroneous pipeline
(the whole line gunzip $file | grep 'word1\|word2' $filenoext > $filedone | rm -f $filenoext | gzip...).
Note also that your code will only work correctly if the *.gz files don't have spaces (or special characters) in their names.
Also, zgrep -c 'word1\|word2' will also match strings like line_starts_withword1_orword2_.
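The difference is easy to see with a quick test (just an illustration; the word-boundary form is what the final version further below uses):
echo 'line_starts_withword1_orword2_' | grep -c 'word1\|word2'         # prints 1: substring match
echo 'line_starts_withword1_orword2_' | grep -c -P '\b(word1|word2)\b' # prints 0: no whole-word match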
Here is the working version of the script:
#!/bin/bash
for file in *.gz; do
    counter=$(zgrep -c -E 'word1|word2' $file)   # counter is the number of lines containing word1/word2 in $file
    if [[ $counter -gt 0 ]]; then
        name=$(basename $file .gz)
        zcat $file | grep -E 'word1|word2' > ${name}_done
        gzip -f -c ${name}_done > /donefiles/$file
        rm -f ${name}_done
    else
        echo 'nothing to do here'
    fi
done
What we can improve here is:
since we are unzipping the file anyway to check for word1|word2, we may write the result to a temp file and avoid unzipping twice
we don't need to count how many times word1 or word2 occur in the file, we may just check for their presence
${name}_done can be a temp file that is cleaned up automatically
we can use a while loop to handle file names with spaces
#!/bin/bash
tmp=`mktemp /tmp/gzip_demo.XXXXXX` # create temp file for us
trap "rm -f \"$tmp\"" EXIT INT TERM QUIT HUP # clean $tmp upon exit or termination
find . -maxdepth 1 -mindepth 1 -type f -name '*.gz' | while read f; do
    # quotes around $f are now required in case of spaces in it
    s=$(basename "$f")                                    # short name w/o dir
    gunzip -f -c "$f" | grep -P '\b(word1|word2)\b' > "$tmp"
    [ -s "$tmp" ] && gzip -f -c "$tmp" > "/donefiles/$s"  # create archive if anything is found
done
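If the file names may also contain leading blanks, backslashes or even newlines, the read loop can be made more defensive with NUL-delimited names; a sketch of the same logic (assuming GNU find and bash, reusing the $tmp file from above):
find . -maxdepth 1 -mindepth 1 -type f -name '*.gz' -print0 | while IFS= read -r -d '' f; do
    s=$(basename "$f")                                    # short name w/o dir
    gunzip -f -c "$f" | grep -P '\b(word1|word2)\b' > "$tmp"
    [ -s "$tmp" ] && gzip -f -c "$tmp" > "/donefiles/$s"
done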
It looks like you have an inner loop inside the outer one:
#!/bin/bash
for file in *.gz; do
    counter=$(zgrep -c 'word1\|word2' $file)
    if [[ $counter -gt 0 ]]; then
        echo $counter
        for file in *.gz; do   #<<< HERE
            filenoext=${file::-3}
            filedone=${filenoext}_done
            echo $file
            echo $filenoext
            echo $filedone
            gunzip $file | grep 'word1\|word2' $filenoext > $filedone | rm -f $filenoext | gzip -f -c $filedone > /donefiles/$file | rm -f $filedone
        done
    else
        echo "nothing to do here"
    fi
done
The inner loop goes through all the files in the directory if one of them contains word1 or word2. You probably want this:
#!/bin/bash
for file in *.gz; do
    counter=$(zgrep -c 'word1\|word2' $file)
    if [[ $counter -gt 0 ]]; then
        echo $counter
        filenoext=${file::-3}
        filedone=${filenoext}_done
        echo $file
        echo $filenoext
        echo $filedone
        gunzip $file | grep 'word1\|word2' $filenoext > $filedone | rm -f $filenoext | gzip -f -c $filedone > /donefiles/$file | rm -f $filedone
    else
        echo "nothing to do here"
    fi
done
I currently use a bash script and PDFgrep to rename files to a certain structure. However, in order to stop overwriting when the new file has a duplicate name, I want to add a number at the end of the name. Keep in mind that there may be 3 or 4 duplicate names. What's the best way to do this?
#!/bin/bash
if [ $# -ne 1 ]; then
    echo Usage: Renamer file
    exit 1
fi
f="$1"
id1=$(pdfgrep -m 1 -i "MR# : " "$f" | grep -oE "[M][0-9][0-9]+") || continue
id2=$(pdfgrep -m 1 -i "Visit#" "$f" | grep -oE "[V][0-9][0-9]+") || continue
{ read today; read dob; read dop; } < <(pdfgrep -i " " "$f" | grep -oE "[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]")
dobsi=$(echo $dob | sed -e 's/\//-/g')
dopsi=$(echo $dop | sed -e 's/\//-/g')
mv -- "$f" "${id1}_${id2}_$(printf "$dobsi")_$(printf "$dopsi")_1.pdf"
Use a loop that checks if the destination filename exists, and increments a counter if it does. Replace the mv line with this:
prefix="${id1}_{id2}_${dob}_${dop}"
counter=0
while true
do
if [ "$counter" -ne 0 ]
then target="${prefix}_${counter}.pdf"
else target="${prefix}.pdf"
fi
if [ ! -e "$target" ]
then
mv -- "$f" "$target"
break
fi
((counter++))
done
Note that this suffers from a TOCTTOU problem if the duplicate file is created between the ! -e "$target" test and the mv. I thought it would be possible to replace the existence check with mv -n; but while this won't overwrite the file, it still treats the mv as successful, so you can't test the result to see if you need to increment the counter.
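If the source and destination are on the same filesystem, one race-free alternative is to let ln do the existence check: creating the hard link fails if the target already exists, so the check and the creation are a single atomic step. A rough sketch of that idea (not a drop-in replacement):
counter=0
while true
do
    if [ "$counter" -ne 0 ]
    then target="${prefix}_${counter}.pdf"
    else target="${prefix}.pdf"
    fi
    # ln without -f refuses to clobber an existing $target
    if ln -- "$f" "$target" 2>/dev/null
    then
        rm -- "$f"
        break
    fi
    ((counter++))
done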
I'm new at Bashing and wrote a script to check my photo files, but I find it very slow and it gets a few empty returns checking 17000+ photos. Is there any way to use all 4 CPUs to run this script and so speed it up?
Please help.
#!/bin/bash
readarray -t array < ~/Scripts/ourphotos.txt
totalfiles="${#array[@]}"
echo $totalfiles
i=0
ii=0
check1=""
while :
do
    check=${array[$i]}
    if [[ ! -r $( echo $check ) ]] ; then
        if [ $check = $check1 ]; then
            echo "empty "$check
        else
            unset array[$i]
            ii=$((ii + 1 ))
        fi
    fi
    if [ $totalfiles = $i ]; then
        break
    fi
    i=$(( i + 1 ))
done
if [ $ii -gt "1" ]; then
    notify-send -u critical $ii" files have been deleted or are unreadable"
fi
It's a filesystem operation so multiple cores will hardly help.
A simplification might help:
while read file; do
    i=$((i+1)); [ -e "$file" ] || ii=$((ii+1));
done < "$HOME/Scripts/ourphotos.txt"
#...
Two points:
you don't need to keep the whole file in memory (no arrays needed)
$( echo $check ) forks a process. You generally want to avoid forking and execing in loops.
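For completeness, here is a rough sketch of how that simplified loop could finish with the same notification as the original script (the counters are initialised explicitly; this is just an illustration):
i=0 ii=0
while read file; do
    i=$((i+1)); [ -e "$file" ] || ii=$((ii+1));
done < "$HOME/Scripts/ourphotos.txt"
if [ "$ii" -gt 0 ]; then
    notify-send -u critical "$ii files have been deleted or are unreadable"
fi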
This is an old question, but a common problem lacking an evidence-based solution.
awk '{print "[ -e "$1" ] && echo "$2}' | parallel # 400 files/s
awk '{print "[ -e "$1" ] && echo "$2}' | bash # 6000 files/s
while read file; do [ -e $file ] && echo $file; done # 12000 files/s
xargs find # 200000 files/s
parallel --xargs find # 250000 files/s
xargs -P2 find # 400000 files/s
xargs -P96 find # 800000 files/s
I tried this on a few different systems and the results were not consistent, but xargs -P (parallel execution) was consistently the fastest. I was surprised that xargs -P was faster than GNU parallel (not reported above, but sometimes much faster), and I was surprised that parallel execution helped so much — I thought that file I/O would be the limiting factor and parallel execution wouldn't matter much.
Also noteworthy is that xargs find is about 20x faster than the accepted solution, and much more concise. For example, here is a rewrite of OP's script:
#!/bin/bash
total=$(wc -l ~/Scripts/ourphotos.txt | awk '{print $1}')
# tr '\n' '\0' | xargs -0 handles spaces and other funny characters in filenames
found=$(cat ~/Scripts/ourphotos.txt | tr '\n' '\0' | xargs -0 -P4 find | wc -l)
if [ $total -ne $found ]; then
    ii=$(expr $total - $found)
    notify-send -u critical $ii" files have been deleted or are unreadable"
fi
I have made (with my little knowledge of Bash and an extensive use of a search engine) a Bash script to reorder the pages of a big PDF file:
#!/bin/bash
file=originalfile.pdf;
newfile=$(basename $file .pdf)-2.pdf;
tmpfile=$(mktemp --suffix=.pdf);
blankfile=$(mktemp --suffix=.pdf);
cp -f $file $newfile;
cp -f $file $tmpfile;
numberofpages=`pdftk $file dump_data | grep "NumberOfPages" | sed 's:.*\([0-9][0-9*]\).*:\1:'`;
echo "" | ps2pdf -sPAPERSIZE=a4 - $blankfile;
while (( $numberofpages % 4 != 0 ));
do
    ((numberofpages++));
    pdftk A=$newfile B=$blankfile cat A B output $tmpfile;
    cp -f $tmpfile $newfile;
done;
neworder=`
for (( a=1, b=3, c=4, d=2 ;
       a <= numberofpages ;
       ((a+=4)), ((b+=4)), ((c+=4)), ((d+=4)) ));
do
    echo -n "$a $b $c $d ";
done`;
pdftk $tmpfile cat $neworder output $newfile;
I wanted to make a Nautilus-Actions script out of it so it could be "installed" and used by a regular user. By regular user, I mean someone unable to type any command line and unable to follow a few steps to copy the script to a specified place.
Unfortunately the script didn't work and I came up with this new script thanks to the help of people commenting below:
#!/bin/bash
file=originalfile.pdf;
newfile=$(basename $file .pdf)-2.pdf;
tmpfile=$(mktemp --suffix=.pdf);
blankfile=$(mktemp --suffix=.pdf);
cp -f $file $newfile;
cp -f $file $tmpfile;
numberofpages=`pdftk $file dump_data | grep "NumberOfPages" | sed 's:.*\([0-9][0-9*]\).*:\1:'`;
echo "" | ps2pdf -sPAPERSIZE=a4 - $blankfile;
while (( $numberofpages % 4 != 0 )); # NOTE: replace % by %% in Nautilus-Actions
do
    ((numberofpages++));
    pdftk A=$newfile B=$blankfile cat A B output $tmpfile;
    cp -f $tmpfile $newfile;
done;
a=0;
neworder=$(
    while [ $a -lt $numberofpages ];
    do
        echo -n "$(($a + 1)) $(($a + 3)) $(($a + 4)) $(($a + 2)) ";
        ((a+=4));
    done;
);
pdftk $tmpfile cat $neworder output $newfile;
I did paste everything into the Path entry of Nautilus-Actions and it finally worked. The newly created Nautilus-Action could then be exported as a .desktop file (and therefore imported very easily by any user).
If I ask Nautilus-Actions to display the output, it seems that Nautilus-Actions executes the command line inside a /bin/sh -c 'myscript...' command.
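(That seems relevant: the C-style for (( ... )) loop from my first version is a bash extension, and a quick test like the one below fails when /bin/sh is a strictly POSIX shell such as dash, while the while form works. The loop body here is just an illustration.)
sh -c 'for (( i = 1; i <= 3; ++i )); do echo "$i"; done'    # syntax error when sh is dash
bash -c 'for (( i = 1; i <= 3; ++i )); do echo "$i"; done'  # prints 1 2 3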
Could you explain to me why I had to change so many things in order to make it work? Especially why I had to change the for into a while?
Note: I completely revamped the question since it was a mess.
AIM: To find files with a word count less than 1000 and move them another folder. Loop until all under 1k files are moved.
STATUS: It will only move one file, then error with "Unable to move file as it doesn't exist." For some reason $INPUT_SMALL doesn't seem to update with the new file name.
What am I doing wrong?
Current Script:
Check for input files already under 1k and move to Split folder
INPUT_SMALL=$( ls -S /folder1/ | grep -i reply | tail -1 )
INPUT_COUNT=$( cat /folder1/$INPUT_SMALL 2>/dev/null | wc -l )
function moveSmallInput() {
    while [[ $INPUT_SMALL != "" ]] && [[ $INPUT_COUNT -le 1003 ]]
    do
        echo "Files smaller than 1k have been found in input folder, these will be moved to the split folder to be processed."
        mv /folder1/$INPUT_SMALL /folder2/
    done
}
I assume you are looking for files that have the word reply somewhere in the path. My solution is:
wc -w $(find /folder1 -type f -path '*reply*') | \
while read wordcount filename
do
    if [[ $wordcount -lt 1003 ]]
    then
        printf "%4d %s\n" $wordcount $filename
        #mv "$filename" /folder2
    fi
done
Run the script once, if the output looks correct, then uncomment the mv command and run it for real this time.
Update
The above solution has trouble with files with embedded spaces. The problem occurs when the find command hands its output to the wc command. After a little bit of thinking, here is my revised solution:
find /folder1 -type f -path '*reply*' | \
while read filename
do
    set $(wc -w "$filename")    # $1 = word count, $2 = filename
    wordcount=$1
    if [[ $wordcount -lt 1003 ]]
    then
        printf "%4d %s\n" $wordcount $filename
        #mv "$filename" /folder2
    fi
done
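Another way to sidestep the word-splitting issue entirely is to feed the file to wc on stdin, so wc never prints the file name at all; a small sketch of that variant:
find /folder1 -type f -path '*reply*' | \
while read -r filename
do
    wordcount=$(wc -w < "$filename")    # stdin form: wc prints only the count
    if [[ $wordcount -lt 1003 ]]
    then
        printf "%4d %s\n" "$wordcount" "$filename"
        #mv "$filename" /folder2
    fi
done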
A somewhat shorter version
#!/bin/bash
find ./folder1 -type f | while read f
do
    (( $(wc -w "$f" | awk '{print $1}' ) < 1000 )) && cp "$f" folder2
done
I left cp instead of mv for safety reasons. Change it to mv after validating.
If you also want to filter with reply, use @Hai's version of the find command.
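That combination would look roughly like this (a sketch merging the loop above with the -path filter from the earlier answer):
find ./folder1 -type f -path '*reply*' | while read f
do
    (( $(wc -w "$f" | awk '{print $1}' ) < 1000 )) && cp "$f" folder2
done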
Your variables INPUT_SMALL and INPUT_COUNT are not functions, they're just values you assigned once. You either need to move them inside your while loop or turn them into functions and evaluate them each time (rather than just expanding the variable values, as you are now).
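For example, re-evaluating them on every iteration could look roughly like this (a sketch based on your original function, keeping its ls/grep approach):
function moveSmallInput() {
    while true
    do
        INPUT_SMALL=$( ls -S /folder1/ | grep -i reply | tail -1 )
        [[ -z $INPUT_SMALL ]] && break
        INPUT_COUNT=$( cat /folder1/$INPUT_SMALL 2>/dev/null | wc -l )
        [[ $INPUT_COUNT -le 1003 ]] || break
        echo "Files smaller than 1k have been found in input folder, these will be moved to the split folder to be processed."
        mv /folder1/$INPUT_SMALL /folder2/
    done
}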