Tesseract OCR large number of files - parallel-processing

I have around 135,000 .TIF files (1.2 KB to 1.4 KB each) sitting on my hard drive. I need to extract text from those files. If I run tesseract as a cron job I get 500 to 600 per hour at the most. Can anyone suggest strategies so I can get at least 500 per minute?
UPDATE:
Below is my code after implementing the suggestions given by @Mark, but I still don't seem to get beyond 20 files per minute.
#!/bin/bash
cd /mnt/ramdisk/input

function tess()
{
    # Skip files whose output already exists, so the job can be restarted.
    if [ -f "/mnt/ramdisk/output/$2.txt" ]
    then
        echo skipping "$2"
        return
    fi
    tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan "$1" "/mnt/ramdisk/output/$2" > /dev/null 2>&1
}
export -f tess

find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}

You need GNU Parallel. Here I process 500 TIF files of 3kB each in 37s on an iMac. By way of comparison, the same processing takes 160s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available CPU cores on your machine.
If you want to see what it would do without actually doing anything, use parallel --dry-run.
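A dry run simply prints the command line GNU Parallel would run for each input file, without executing anything. For example, with a couple of hypothetical filenames:
parallel --dry-run 'tesseract {} {.} > /dev/null 2>&1' ::: page-0001.tif page-0002.tif
tesseract page-0001.tif page-0001 > /dev/null 2>&1
tesseract page-0002.tif page-0002 > /dev/null 2>&1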
As you have 135,000 files, the expanded glob will probably overflow your command-line length limit. You can check the limit with sysctl (on macOS/BSD; on Linux, getconf ARG_MAX gives the same information) like this:
sysctl -a kern.argmax
kern.argmax: 262144
So you need to pump the filenames into GNU Parallel on its stdin and separate them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file after processing to a subdirectory called processed so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF like this:
#!/bin/bash
doit() {
    if [ -f "${2}.txt" ]; then
        echo Skipping $1...
        return
    fi
    tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
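The mv-to-a-subdirectory approach mentioned above could look something like this minimal sketch (the processed directory name is just an illustration): on success, each TIF is moved out of the input directory so a restart only sees unprocessed files.
#!/bin/bash
mkdir -p processed
doit() {
    # OCR the file, then move it aside only if tesseract succeeded.
    tesseract "$1" "$2" > /dev/null 2>&1 && mv "$1" processed/
}
export -f doit
parallel --bar doit {} {.} ::: *.tif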
If you have millions of files, you could consider using multiple machines in parallel. Just make sure you have ssh logins to each of the machines on your network, then run across 4 machines, including the localhost, like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.
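For the OCR job above, a fuller remote invocation might look like the following sketch. remote1 to remote3 are placeholder hostnames, and --trc {.}.txt is GNU Parallel's shorthand for --transfer --return {.}.txt --cleanup, i.e. ship each input file to the remote host, fetch back the named result file, and remove the temporaries afterwards:
find . -iname \*.tif -print0 |
    parallel -0 -S :,remote1,remote2,remote3 --trc {.}.txt 'tesseract {} {.} > /dev/null 2>&1'
This assumes tesseract and the required language data are installed on each remote machine.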

Related

rsync - How to copy a certain number of files, pause, repeat

I have a situation which I have failed to find a solution for.
I have a process which generates ~10,000 xml files into one directory. Those files get rsync'd (with a delete on the source once copied) to a server which runs a process every 5 minutes to import them. The problem is that the volume of files is such that it takes longer than 5 minutes to process them and I can't change that timing. What I would like to do is come up with a script which would allow me to rsync the first 2500 files in the directory, wait 5 minutes, rsync the next 2500, etc. The numbers of files vary, so I'd want it to just keep going through until all the files have been copied. The order of the files doesn't matter, they could be listed alphabetically or by date or just random. Does anyone have any examples of how to do this?
Thanks!
If I understood your problem correctly, you need something like:
while true; do
    ls | shuf -n 2500 > /tmp/sync_files           # pick 2500 random files
    rsync -av `cat /tmp/sync_files` /destination/ # sync the files
    xargs rm < /tmp/sync_files                    # delete the files
    sleep 300                                     # sleep 5 minutes
done
The code picks 2500 random files, syncs them to the destination, then removes them (if the filenames contain spaces or other odd characters, the removal should be done with a for loop and rm instead), and finally sleeps for 5 minutes. Let me know if I got your problem right.
Randomness is optional, and we want to stop once all the files have been transmitted. Using the output of ls sometimes gives strange results, so that would make it:
#!/bin/bash
qty=2500
sleeptime=300
typeset -i i
i=0
for f in * ; do
    rsync -av "$f" /destination/
    rm "$f"
    i=$i+1
    if [ $i = $qty ] ; then
        sleep $sleeptime
        i=0
    fi
done
But then you do an rsync per file, which may also not be what you want.
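If you want to keep the batching but avoid one rsync call per file, a sketch using rsync's --files-from and --remove-source-files options could look like this. /source/dir is a placeholder, and it assumes the XML filenames contain no newlines and sit directly in the source directory:
#!/bin/bash
qty=2500
sleeptime=300
cd /source/dir || exit 1

# Build one list file per batch of $qty names.
printf '%s\n' *.xml > /tmp/all_files
split -l "$qty" /tmp/all_files /tmp/batch_

for batch in /tmp/batch_*; do
    # Transfer this batch; each source file is deleted once it has been copied.
    rsync -av --files-from="$batch" --remove-source-files . /destination/
    sleep "$sleeptime"
done
rm -f /tmp/all_files /tmp/batch_*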

gnu parallel to parallelize a for loop

I have seen several questions about this topic, but I lack the ability to translate this to my specific problem. I have a for loop that loops through sub directories and then executes a .sh script on a compressed text file inside each directory. I want to parallelize this process, but I'm struggling to apply gnu parallel.
Here is my loop:
for d in ./*/ ; do (cd "$d" && script.sh); done
I understand I need to input a list into parallel, so i have been trying this:
ls -d */ | parallel cd && script.sh
While this appears to get started, I get an error when gzip tries to unzip one of the txt files inside the directory, saying the file does not exist:
gzip: *.txt.gz: No such file or directory
However, when I run the original for loop, I have no issues aside from it taking a century to finish. Also, I only get the gzip error once when using parallel, which is so weird considering I have over 1000 sub-directories.
My questions are:
How do I get parallel to work in my case? How do I get parallel to parallelize the application of a .sh script to thousands of files in their own sub-directories? i.e. what is the solution to my problem? I've got to make progress.
What am I missing? Syntax, loop, bad script? I want to learn.
Is parallel actually attempting to run all these .sh scripts in parallel? Why don't I get an error for every .txt.gz file?
Is parallel the best option for the application? Is there another option that is better suited to my needs?
Two problems:
In:
ls -d */ | parallel cd && script.sh
what is parallelized is just cd, not script.sh. script.sh is executed only once, after all the parallel cd jobs have run, and only if there was no error. It is the same as:
ls -d */ | parallel cd
if [ $? -eq 0 ]; then script.sh; fi
You do not pass the target directory to cd. So what parallel executes is a bare cd, which simply changes the current directory to your home directory. The final script.sh is executed in the current directory (from where you invoked the command), where there are probably no *.txt.gz files, hence the error.
You can check yourself the effect of the first problem with:
$ mkdir /tmp/foobar && cd /tmp/foobar && mkdir a b c
$ ls -d */ | parallel cd && pwd
/tmp/foobar
The output of pwd is printed only once, even if you have more than one input directory. You can fix it by quoting the command and then check the second problem with:
$ ls -d */ | parallel 'cd && pwd'
/homes/myself
/homes/myself
/homes/myself
You should see as many pwd outputs as there are input directories but it is always the same output: your home directory. You can fix the second problem by using the {} replacement string that is substituted with the current input. Check it with:
$ ls -d */ | parallel 'cd {} && pwd'
/tmp/foobar/a
/tmp/foobar/b
/tmp/foobar/c
Now, you should have all input directories properly listed in the output.
For your specific problem this should work:
ls -d */ | parallel 'cd {} && script.sh'
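If you would rather avoid parsing the output of ls, you can let the shell expand the glob and pass the directories to GNU Parallel directly, which behaves the same way here:
parallel 'cd {} && script.sh' ::: */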

How to monitor multiple files through a shell script

I want to monitor Apache and Tomcat logs through a shell script.
I can monitor a single file through a script, but how do I monitor multiple files with the same script?
I have written a sample script for a single file.
#!/bin/bash
file=/root/logs_flow/apache_access_log
current=`date +%s`
last_modified=`stat -c "%Y" "$file"`
if [ $(($current-$last_modified)) -gt 180 ]; then
    mail -s "$file is not updating properly" ramacn11@xx.xx.xxx
else
    mail -s "$file is updating properly" ramacn11@xx.xxx.xxx
fi
I want to monitor apache_error_log and the Tomcat logs with the same script.
An easy solution, starting from what you already have, would be to call your script with the file to monitor as an argument:
script.sh /root/logs_flow/apache_access_log
Then inside you put
file=$1
Now you can put a bunch of these in cron:
* * * * * script.sh /root/logs_flow/apache_access_log
* * * * * script.sh /some/other/file.log
You might want to expand your script a bit to check if the argument is passed and if it's a valid filename.
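Such a check might look like this minimal sketch placed at the top of the script (the usage message is just illustrative):
#!/bin/bash
# Require exactly one argument and make sure it names an existing file.
if [ $# -ne 1 ] || [ ! -f "$1" ]; then
    echo "Usage: $0 /path/to/logfile" >&2
    exit 1
fi
file=$1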
You can list files that have or haven't been updated in a period of time using the find command, which will be more portable than processing the output of stat, which varies by operating system.
The following will output the names of the specified logs that have a modification time more than 3 minutes ago (this uses BSD find's unit suffix; with GNU find the equivalent test is -not -mmin -3):
find httpd.log tomcat.log -not -mtime -3m
Or, for easier file-list management, you could use a bash array:
#!/usr/bin/env bash
files=(
    /root/logs_flow/apache_access_log
    /var/log/tomcat.log
    /var/log/www/apache-*.log # This is an expanding glob.
)
find "${files[@]}" -not -mtime -3m
Files in the array will be listed if they are more than 3 minutes old.
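To tie that back to the original alerting script, a sketch could mail the list of stale logs whenever it is non-empty. you@example.com is a placeholder recipient, and -mtime -3m is the BSD find syntax used above:
#!/usr/bin/env bash
files=(
    /root/logs_flow/apache_access_log
    /var/log/tomcat.log
)
# Collect logs not modified in the last 3 minutes and alert if any were found.
stale=$(find "${files[@]}" -not -mtime -3m)
if [ -n "$stale" ]; then
    printf '%s\n' "$stale" | mail -s "Logs not updating" you@example.com
fi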
To read from multiple log files at once, one could do:
tail -f /home/user/log_A -f /home/user/log_B | egrep -v "^$|="
Note: The egrep -v "^$|=" part is to remove header lines and empty lines from the output of the tail command. You can remove that if you want to keep the headers.

run windows program under wine using gnu parallel

I have a very basic script to run multiple copies of a Windows population genetics program (msvar.exe) under Wine. It uses find to look through multiple folders for an initiation file (INTFILE) and then starts an instance of msvar.exe in each directory using that initiation file. Different folders have different parameters in the initiation file, so I can run a series of simulations by adding the "&" operator. Here it is:
for i in $(find /home/msvartest -name INTFILE -type f)
do (
    cd $(dirname $(realpath $i));
    # wine explorer /desktop=name msvar.exe;
    wineconsole --backend=user msvar.exe;
) &
done
At the moment I run up to 20 copies of msvar.exe at once, each under its own wineconsole (or wine explorer window), on my dual hexa-core machine. Each run can take 3 or 4 days, but the program only uses a single core, so I need to run the simulations in parallel. It looks like GNU Parallel would be a better way to run msvar.exe and would allow me to run more simulations across remote computers. I unsuccessfully tried to get GNU Parallel working with wineconsole following the suggestions in "Run wine in parallel with gnu-parallel - needs {%} slot substitution to work". Is anybody able to help, or even better knock up a script I could use?
Thanks for your help.
I think your command is going to get horribly long and unwieldy unless you use an exported function like this:
#!/bin/bash
doit() {
    ...
    ...
}
export -f doit
parallel -j 10 doit ::: {0..99}
So, for your example that will look something like (untested):
#!/bin/bash
doit() {
    echo Processing $1
    cd $(dirname $(realpath "$1"));
    WINEPREFIX=$HOME/slot{%} wineconsole --backend=user msvar.exe
}
export -f doit
find /home/msvartest -name INTFILE -type f | parallel --dry-run doit
Unfortunately I don't have your environment set up to test this but it should be close and easy to correct if there are minor errors. Try and see what it does, then remove the --dry-run to let it actually do something.
If you have spaces in your filenames, you should use -print0 with your find command and also add -0 after parallel, but that just complicates things for the moment. Note also that {%} written inside the body of an exported function is not substituted by GNU Parallel, so a more robust variant passes the slot number to the function as a second argument:
#!/bin/bash
doit() {
    echo Processing $1
    cd $(dirname $(realpath "$1"));
    WINEPREFIX=$HOME/slot$2 wineconsole --backend=user msvar.exe
}
export -f doit
find /home/msvartest -name INTFILE -type f | parallel doit {} {%}
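To keep one simulation per physical core on the dual hexa-core machine described in the question, you could cap the number of job slots; the -j value below is an assumption, so adjust it to taste:
find /home/msvartest -name INTFILE -type f | parallel -j 12 --bar doit {} {%}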

Running a limited number of child processes in parallel in bash? [duplicate]

I have a large set of files for which some heavy processing needs to be done.
This processing is single-threaded, uses a few hundred MiB of RAM (on the machine used to start the job), and takes a few minutes to run.
My current use case is to start a hadoop job on the input data, but I've had this same problem in other cases before.
In order to fully utilize the available CPU power I want to be able to run several of those tasks in parallel.
However, a very simple example shell script like this will trash the system performance due to excessive load and swapping:
find . -type f | while read name ;
do
    some_heavy_processing_command ${name} &
done
So what I want is essentially similar to what "gmake -j4" does.
I know bash supports the "wait" command, but that only waits until all child processes have completed. In the past I've created scripts that run a 'ps' command and then grep the child processes out by name (yes, I know ... ugly).
What is the simplest/cleanest/best solution to do what I want?
Edit: Thanks to Frederik: Yes indeed this is a duplicate of How to limit number of threads/sub-processes used in a function in bash
The "xargs --max-procs=4" works like a charm.
(So I voted to close my own question)
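For reference, the xargs approach mentioned above might look like this, assuming some_heavy_processing_command accepts one filename per invocation (--max-procs=4 is the GNU long form of -P 4):
find . -type f -print0 | xargs -0 -n 1 --max-procs=4 some_heavy_processing_command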
I know I'm late to the party with this answer but I thought I would post an alternative that, IMHO, makes the body of the script cleaner and simpler. (Clearly you can change the values 2 & 5 to be appropriate for your scenario.)
function max2 {
    while [ `jobs | wc -l` -ge 2 ]
    do
        sleep 5
    done
}

find . -type f | while read name ;
do
    max2; some_heavy_processing_command ${name} &
done
wait
#! /usr/bin/env bash

set -o monitor
# enable job control: background jobs run in their own process groups

trap add_next_job CHLD
# execute add_next_job when we receive a child-complete signal

todo_array=($(find . -type f)) # places output into an array

index=0
max_jobs=2

function add_next_job {
    # if there are still jobs to do, add one
    if [[ $index -lt ${#todo_array[*]} ]]   # ${#todo_array[*]} is the number of elements
    then
        echo adding job ${todo_array[$index]}
        do_job ${todo_array[$index]} &      # replace do_job with the command you want
        index=$(($index+1))
    fi
}

function do_job {
    echo "starting job $1"
    sleep 2
}

# add initial set of jobs
while [[ $index -lt $max_jobs ]]
do
    add_next_job
done

# wait for all jobs to complete
wait
echo "done"
Having said that, Fredrik makes the excellent point that xargs does exactly what you want...
With GNU Parallel it becomes simpler:
find . -type f | parallel some_heavy_processing_command {}
Learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
I think I found a handier solution using make:
#!/usr/bin/make -f

THIS := $(lastword $(MAKEFILE_LIST))
TARGETS := $(shell find . -name '*.sh' -type f)

.PHONY: all $(TARGETS)

all: $(TARGETS)

$(TARGETS):
	some_heavy_processing_command $@

$(THIS): ; # Avoid trying to remake this makefile
Save it as e.g. 'test.mak' and add execute rights. If you call ./test.mak it will run some_heavy_processing_command one by one. With ./test.mak -j 4 it will run four subprocesses at once. You can also use it in a more sophisticated way: run ./test.mak -j 5 -l 1.5 and it will run at most 5 subprocesses while the system load is under 1.5, but it will limit the number of processes if the system load exceeds 1.5.
It is more flexible than xargs, and make is part of the standard distribution, unlike parallel.
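For example, assuming the file is saved as test.mak as described:
chmod +x test.mak
./test.mak -j 4          # run up to four jobs at once
./test.mak -j 5 -l 1.5   # at most 5 jobs, throttled when load exceeds 1.5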
This code worked quite well for me, but I noticed one issue: if max_jobs is greater than the number of elements in the array, the script never exits.
To prevent that scenario, I've added the following right after the max_jobs declaration.
if [ $max_jobs -gt ${#todo_array[*]} ]; then
    # more elements in the array than max jobs: set max_jobs to the number of array elements
    max_jobs=${#todo_array[*]}
fi
Another option:
PARALLEL_MAX=...
function start_job() {
    while [ $(ps --no-headers -o pid --ppid=$$ | wc -l) -gt $PARALLEL_MAX ]; do
        sleep .1 # Wait for background tasks to complete.
    done
    "$@" &
}
start_job some_big_command1
start_job some_big_command2
start_job some_big_command3
start_job some_big_command4
...
Here is a very good function I used to control the maximum number of jobs from bash or ksh. NOTE: the - 1 in the pgrep pipeline subtracts the wc -l subprocess.
function jobmax
{
    typeset -i MAXJOBS=$1
    sleep .1
    while (( ($(pgrep -P $$ | wc -l) - 1) >= $MAXJOBS ))
    do
        sleep .1
    done
}

nproc=5
for i in {1..100}
do
    sleep 1 &
    jobmax $nproc
done
wait # Wait for the rest
