Run a Windows program under Wine using GNU Parallel

I have a very basic script to run multiple copies of a Windows population genetics program (msvar.exe) under Wine. It uses "find" to look through multiple folders for an initiation file (INTFILE) and then starts an instance of msvar.exe in each directory using that initiation file. Different folders have different parameters in the initiation file, so I can run a series of simulations by backgrounding each instance with "&". Here it is:
for i in $(find /home/msvartest -name INTFILE -type f)
do (
cd $(dirname $(realpath $i));
# wine explorer /desktop=name msvar.exe;
wineconsole --backend=user msvar.exe;
) &
done
At the moment I run up to 20 copies of msvar.exe at once, each under its own wineconsole (or wine explorer window), on my dual hexacore machine. Each run instance can take 3 or 4 days, but the program only runs on a single core, so I need to run the simulations in parallel. It looks like GNU Parallel would be a better way to run msvar.exe and would allow me to run more simulations over remote computers. I unsuccessfully tried to get GNU Parallel working with wineconsole following the suggestions in Run wine in parallel with gnu-parallel - needs {%} slot substitution to work. Is anybody able to help, or even better knock up a script I could use?
Thanks for your help.

I think your command is going to get horribly long and unwieldy unless you use an exported function like this:
#!/bin/bash
doit() {
...
...
}
export -f doit
parallel -j 10 doit ::: {0..99}
So, for your example that will look something like (untested):
#!/bin/bash
doit() {
echo Processing $1
cd $(dirname $(realpath "$1"));
WINEPREFIX=$HOME/slot{%} wineconsole --backend=user msvar.exe
}
export -f doit
find /home/msvartest -name INTFILE -type f | parallel --dry-run doit
Unfortunately I don't have your environment set up to test this but it should be close and easy to correct if there are minor errors. Try and see what it does, then remove the --dry-run to let it actually do something.
If you have spaces in your filenames, you should use -print0 with your find command and also add -0 after parallel but that just complicates things for the moment.
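For reference, a hedged sketch of that null-separated variant (untested) would be:
find /home/msvartest -name INTFILE -type f -print0 | parallel -0 --dry-run doit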

Since {%} is only substituted on the command line and not inside the exported function, an updated version passes the job slot in as a second argument instead:
#!/bin/bash
doit() {
echo Processing $1
cd $(dirname $(realpath "$1"));
WINEPREFIX=$HOME/slot$2 wineconsole --backend=user msvar.exe
}
export -f doit
find /home/msvartest -name INTFILE -type f | parallel doit {} {%}

Related

Tesseract OCR large number of files

I have around 135,000 .TIF files (1.2 KB to 1.4 KB each) sitting on my hard drive. I need to extract text out of those files. If I run tesseract as a cron job I am getting 500 to 600 per hour at the most. Can anyone suggest strategies so I can get at least 500 per minute?
UPDATE:
Below is my code after implementing the suggestions given by @Mark, but I still don't seem to get beyond 20 files per minute.
#!/bin/bash
cd /mnt/ramdisk/input
function tess()
{
if [ -f /mnt/ramdisk/output/$2.txt ]
then
echo skipping $2
return
fi
tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan $1 /mnt/ramdisk/output/$2 > /dev/null 2>&1
}
export -f tess
find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}
You need GNU Parallel. Here I process 500 TIF files of 3kB each in 37s on an iMac. By way of comparison, the same processing takes 160s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available cores on your machine.
If you want to see what it would do without actually doing anything, use parallel --dry-run.
As you have 135,000 files it will probably overflow your command line length - you can check with sysctl like this:
sysctl -a kern.argmax
kern.argmax: 262144
So you need to pump the filenames into GNU Parallel on its stdin and separate them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file after processing to a subdirectory called processed so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF like this:
#!/bin/bash
doit() {
if [ -f "${2}.txt" ]; then
echo Skipping $1...
return
fi
tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
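If you preferred the mv approach instead, a minimal sketch along the same lines (untested, and assuming you are happy for finished TIFs to end up in a processed subdirectory) might be:
#!/bin/bash
doit() {
   # move each TIF into processed/ only if tesseract succeeded
   tesseract "$1" "$2" > /dev/null 2>&1 && mv "$1" processed/
}
export -f doit
mkdir -p processed
parallel --bar doit {} {.} ::: *.tif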
If you have millions of files, you could consider using multiple machines in parallel. Just make sure you have ssh logins to each of the machines on your network, and then run across 4 machines, including the localhost, like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.
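For example, combining that with the doit function above would look something like this (a hedged sketch, untested; it assumes the TIF files are visible at the same path on every machine, uses --env to ship the exported function to the remote shells, and --workdir . to run in the same directory remotely):
find . -iname \*.tif -print0 | parallel -0 --bar --env doit --workdir . -S :,remote1,remote2,remote3 doit {} {.}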

GNU Parallel to parallelize a for loop

I have seen several questions about this topic, but I lack the ability to translate this to my specific problem. I have a for loop that loops through subdirectories and then executes a .sh script on a compressed text file inside each directory. I want to parallelize this process, but I'm struggling to apply GNU Parallel.
Here is my loop:
for d in ./*/ ; do (cd "$d" && script.sh); done
I understand I need to input a list into parallel, so I have been trying this:
ls -d */ | parallel cd && script.sh
While this appears to get started, I get an error when gzip tries to unzip one of the txt files inside the directory, saying the file does not exist:
gzip: *.txt.gz: No such file or directory
However, when I run the original for loop, I have no issues aside from it taking a century to finish. Also, I only get the gzip error once when using parallel, which is so weird considering I have over 1000 sub-directories.
My questions are:
How do I get Parallel to work in my case? How do I get Parallel to parallelize the application of a .sh script to 1000s of files in their own subdirectories? i.e., what is the solution to my problem? I've got to make progress.
What am I missing? Syntax, loop, bad script? I want to learn.
Is Parallel actually attempting to run all these .sh scripts in parallel? Why don't I get an error for every .txt.gz file?
Is parallel the best option for the application? Is there another option that is better suited to my needs?
Two problems:
In:
ls -d */ | parallel cd && script.sh
what is parallelized is just cd, not script.sh. script.sh is only executed once, after all the parallel cd jobs have run, and only if there was no error. It is the same as:
ls -d */ | parallel cd
if [ $? -eq 0 ]; then script.sh; fi
You do not pass the target directory to cd. So, what is executed by parallel is just cd, which just changes the current directory to your home directory. The final script.sh is executed in the current directory (from where you invoked the command) where there are probably no *.txt.gz files, thus the error.
You can check yourself the effect of the first problem with:
$ mkdir /tmp/foobar && cd /tmp/foobar && mkdir a b c
$ ls -d */ | parallel cd && pwd
/tmp/foobar
The output of pwd is printed only once, even if you have more than one input directory. You can fix it by quoting the command and then check the second problem with:
$ ls -d */ | parallel 'cd && pwd'
/homes/myself
/homes/myself
/homes/myself
You should see as many pwd outputs as there are input directories but it is always the same output: your home directory. You can fix the second problem by using the {} replacement string that is substituted with the current input. Check it with:
$ ls -d */ | parallel 'cd {} && pwd'
/tmp/foobar/a
/tmp/foobar/b
/tmp/foobar/c
Now, you should have all input directories properly listed in the output.
For your specific problem this should work:
ls -d */ | parallel 'cd {} && script.sh'
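If the directory names can contain spaces or other special characters, a safer variant along the lines used earlier (untested) pipes null-terminated names from find instead of parsing ls:
find . -mindepth 1 -maxdepth 1 -type d -print0 | parallel -0 'cd {} && script.sh'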

Parse filenames into touch command on OS X

I have a large number of photos on my machine where I'd like to parse the standard naming convention I have created for each file, and then pipe it to the touch command.
For example, I have these files:
2016-08-06-00h28m34.jpg
2016-08-06-00h28m35.jpg
2016-08-06-00h28m36.jpg
I would like to generate (and then run) the following commands:
touch -t 201608060028.34 2016-08-06-00h28m34.jpg
touch -t 201608060028.35 2016-08-06-00h28m35.jpg
touch -t 201608060028.36 2016-08-06-00h28m36.jpg
I can do this manually in a text editor, but it's extremely time-consuming, due to the number of files in each directory. I could also do it in C# and run it over my LAN, but that seems like overkill. Heck, I can even do this in SQL Server, but ... it's OS X and I'm sure there's a simple command-line thing I'm missing.
I've looked at Windows version of Unix touch command to modify a file's modification date from filename part, and Split Filename Up to Define Variables, but I can't seem to figure out how to add in the period for the seconds portion of the script, plus I don't want to add the batch script to each of the hundreds of folders I have.
Any assistance will be greatly appreciated.
Simple Option
I presume you are trying to set the filesystem time to match the EXIF capture time of thousands of photos. There is a tool for that, which runs on OSX, Linux (and Windows if you must). It is called jhead and I installed it on OSX using homebrew with:
brew install jhead
There may be other ways to install it - jhead website.
Please make a backup before trying this, or try it out on a small subset of your files, as I may have misunderstood your needs!
Basically the command to set the filesystem timestamp to match the EXIF timestamp on a single file is:
jhead -ft SomeFile.jpg
So, if you wanted to set the timestamps for all files in $HOME/photos/tmp and all subdirectories, you would do:
find $HOME/photos/tmp -iname \*.jpg -exec jhead -ft {} \;
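And since this thread is about GNU Parallel anyway, you could spread the jhead runs across all your cores with something like this hedged, untested sketch:
find $HOME/photos/tmp -iname \*.jpg -print0 | parallel -0 jhead -ft {}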
Option not requiring any extra software
Failing that, you could do it with Perl which is installed on OSX by default anyway:
find . -name \*.jpg | perl -lne 'my $a=$_; s/.*(\d{4})-(\d+)-(\d+)-(\d+)h(\d+)m(\d+).*/$1$2$3$4$5.$6/ && print "touch -t $_\ \"$a\"" '
which gives this sort of output on my machine:
touch -t 201608060028.34 "./2016-08-06-00h28m34.jpg"
touch -t 201608060028.35 "./2016-08-06-00h28m35.jpg"
touch -t 201501060028.35 "./tmp/2015-01-06-00h28m35.jpg"
and if that looks good on your machine, you could send those commands into bash to be executed like this:
find . -name \*.jpg | perl -lne 'my $a=$_;s/.*(\d{4})-(\d+)-(\d+)-(\d+)h(\d+)m(\d+).*/$1$2$3$4$5.$6/ && print "touch -t $_\ \"$a\"" ' | bash -x
And, for the Perl purists out there, yes, I know Perl could do the touch itself and save invoking a whole touch process per file, but that would require modules and explanation and a heap of other extraneous stuff that is not really necessary for a one-off, or occasional operation.

Portable way to build up arguments for a utility in shell?

I'm writing a shell script that's meant to run on a range of machines. Some of these machines have bash 2 or bash 3. Some are running BusyBox 1.18.4 where /bin/bash exists but
/bin/bash --version doesn't return anything at all
foo=( "hello" "world" ) complains about a syntax error near the unexpected "(" both with and without the extra spaces just inside the parens ... so arrays seem either limited or missing
There are also more modern or more fully featured Linux and bash versions.
What is the most portable way for a bash script to build up arguments at run time for calling some utility like find? I can build up a string but feel that arrays would be a better choice. Except there's that second bullet point above...
Let's say my script is foo and you call it like so: foo -o 1 .jpg .png
Here's some pseudo-code
#!/bin/bash
# handle option -o here
shift $(expr $OPTIND - 1)
# build up parameters for find here
parameters=(my-directory -type f -maxdepth 2)
if [ -n "$1" ]; then
parameters+=(-iname "*$1" -print)
shift
fi
while [ $# -gt 0 ]; do
parameters+=(-o -iname "*$1" -print)
shift
done
find "${parameters[@]}" | some-while-loop
If you need to use mostly-POSIX sh, such as the BusyBox ash shell installed as /bin/bash, you can build up the positional parameters directly with set:
$ set -- hello
$ set -- "$@" world
$ printf '%s\n' "$@"
hello
world
For a more apt example:
$ set -- /etc -name '*b*'
$ set -- "$@" -type l -exec readlink {} +
$ find "$@"
/proc/mounts
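Applied to the find-building loop in the question, a hedged sketch (untested; my-directory and some-while-loop are just the stand-ins from your pseudo-code) might look like this:
#!/bin/sh
# handle option -o here, then shift it off as in the question
# The for loop's word list ("$@") is expanded once before the first iteration,
# so it is safe to overwrite the positional parameters inside the loop
# and use them to accumulate the find arguments.
first=yes
for ext do
  if [ "$first" = yes ]; then
    set -- my-directory -type f -maxdepth 2 -iname "*$ext" -print
    first=no
  else
    set -- "$@" -o -iname "*$ext" -print
  fi
done
[ "$first" = yes ] && set -- my-directory -type f -maxdepth 2
find "$@" | some-while-loop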
While your question involves more than just Bash, you may benefit from reading the Wooledge Bash FAQ on the subject:
http://mywiki.wooledge.org/BashFAQ/050
It mentions the use of "set --" for older shells, but also gives a lot of background information. When building a list of arguments, it's easy to create a system that works in simple cases but fails when the data has special characters, so reading up on the subject is probably worthwhile.

How to run fswatch to call a program with static arguments?

I used to use fswatch v0.0.2 like so (in this instance, to run the django test suite when a file changed):
$>fswatch . 'python manage.py test'
this works fine.
I wanted to exclude some files that were causing the test to run more than once per save (Sublime Text was saving a .tmp file, and I suspect .pyc files were also causing this).
So I upgraded fswatch to enable the -e mode.
However, the way fswatch works has changed, which is causing me trouble - it now expects its output to be piped to another program, like so:
$>fswatch . | xargs -n1 program
I can't figure out how to pass arguments to the program here. For example, this does not work:
$>fswatch . | xargs -n1 python manage.py test
nor does this:
$>fswatch . | xargs -n1 'python manage.py test'
How can I do this without packaging up my command in a bash script?
The fswatch documentation (the Texinfo manual, the wiki, or the README) has examples of how this is done:
$ fswatch [opts] -0 -o path ... | xargs -0 -n1 -I{} your full command goes here
Pitfalls:
xargs -0, fswatch -0: use them to make sure paths with newlines are interpreted correctly.
fswatch -o: use it to have fswatch "bubble" all the events in the set into a single one printing only the number of records in the set.
-I{}: specifying a placeholder is the trick you missed; it is what lets xargs interpret your command arguments correctly in those cases where you do not want the record (in this case, since -o was used, the number of records in the set) to be passed down to the command being executed.
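Applied to your command, that would be something like this (untested):
fswatch -0 -o . | xargs -0 -n1 -I{} python manage.py test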
An alternative answer that does not fight xargs' default reason for being - passing the output on as arguments to the command to be run:
fswatch . | (while read; do python manage.py test; done)
That is still a bit wordy/syntaxy, so I have created a super simple bash script, fswatch-do, that simplifies things for me:
#!/bin/bash
(while read; do "$@"; done)
usage:
fswatch -r -o -e 'pyc' somepath | fswatch-do python manage.py test someapp.SomeAppTestCase
