How to pipe multiple files into tesseract-ocr from a loop - bash

I am looking for a way to sequentially add files (PNG input files) to an OCR'ed PDF (via tesseract-3).
The idea is to scan a PNG, optimize it (optipng) and feed it via a stream to tesseract, which adds it to an ever-growing PDF.
The time between scans is 20-40 seconds, and the scans run into the hundreds, which is why I want to use the wait time between scans to do the OCR already.
I imagine this to work like this:
while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng $scannumber.png
    check_for_finishing_condition   # all this works fine already
    sleep 30s
    # do some magic piping into a single tesseract instance here
done # or here?
The inspiration for this comes from here:
https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-ocr-streaming-images-to-pdf-using-tesseract
Thanks very much for any hint,
Joost
Edits:
OS: OpenSuse Tumbleweed
Scan: more of a series of "image acquisitions" resulting in a single PNG each (not a real scanner); going on for several hours at least.
FollowUp:
This kind of works when doing
while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng $scannumber.png
    check_for_finishing_condition   # all this works fine already
    sleep 30s
    echo "$scannumber.png"
done | tesseract -l deu+eng -c stream_filelist=true - Result pdf
However, the PDF is corrupted if you try to open it between scan additions or if you stop the loop with e.g. Ctrl-C. I do not see a way to get an uncorrupted PDF.

Try this:
while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng $scannumber.png
    check_for_finishing_condition   # all this works fine already
    sleep 30s
    echo "$scannumber.png"          # hand the finished PNG's name to tesseract
done | tesseract -c stream_filelist=true - - pdf > output.pdf
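If you need a PDF that stays openable while the loop is still running, a different approach (not from the answer above, just a sketch assuming pdfunite from poppler-utils is available) is to OCR every scan into its own single-page PDF and merge what exists so far into a snapshot file:

while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng $scannumber.png
    check_for_finishing_condition
    # OCR this page on its own; each per-page PDF stays valid even if the loop is interrupted
    tesseract "$scannumber.png" "page_$scannumber" -l deu+eng pdf
    # merge everything OCR'ed so far into a snapshot you can open at any time
    # (assumes scan numbers are zero-padded so the glob sorts correctly)
    pdfunite page_*.pdf Result_so_far.pdf
    sleep 30s
done

The trade-off is one tesseract start-up per scan instead of a single streaming instance, which should be acceptable given the 20-40 second gap between scans.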

Related

Image Conversion - RAW to png/raw for game (Pac The Man X)

So I have a raw image and I am just curious if I can edit such an image and save it as RGB-32 packed transparent interlaced raw, and what program I could use. Here is the specification:
Format of RAW image
I have tried using Photoshop but then the game crashes. Is it even possible? I should get a file without a thumbnail. I also tried GIMP, free converters and a RAW viewer, but no luck. Any suggestions?
Edit:
Used Photoshop (interleaved with transparency format); the game starts but the images are just a bunch of pixels.
File that I try to prepare (221bits)
We are still not getting a handle on what output format you are really trying to achieve. Let's try generating a file from scratch, to see if we can get there.
So, let's just use simple commands that are available on a Mac and generate some test images from first principles. Start with exactly the same ghost.raw image you shared in your question. We will take the first 12 bytes as the header, and then generate a file full of red pixels and see if that works:
# Grab first 12 bytes from "ghost.raw" and start a new file "red.raw"
head -c 12 ghost.raw > red.raw
# Now generate 512x108 pixels, where red=ff, green=00, blue=01, alpha=fe and append to "red.raw"
perl -E 'say "ff0001fe" x (512*108)' | xxd -r -p >> red.raw
So you can try using red.raw in place of ghost.raw and tell me what happens.
Now try generating a blue file just the same:
# Grab first 12 bytes from "ghost.raw" and start a new file "blue.raw"
head -c 12 ghost.raw > blue.raw
# Now generate 512x108 pixels, where red=00, green=01, blue=ff, alpha=fe and append to "blue.raw"
perl -E 'say "0001fffe" x (512*108)' | xxd -r -p >> blue.raw
And then try blue.raw.
Original Answer
AFAIK, your image is actually 512 pixels wide by 108 pixels tall in RGBA8888 format with a 12-byte header at the start - making 12 + 4*(512 * 108) bytes.
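As a quick sanity check (just a sketch, using the same macOS-style commands as above; on Linux stat -c%s would replace stat -f%z), you can confirm that the file length matches that formula before converting:

# expected size: 12-byte header + 4 bytes per RGBA pixel
echo $((12 + 4 * 512 * 108))   # 221196
stat -f%z ghost.raw            # should print the same number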
You can convert it to PNG or JPEG with ImageMagick like this:
magick -size 512x108+12 -depth 8 RGBA:ghost.raw result.png
I still don't understand from your question or comments what format you actually want - so if you clarify that, I am hopeful we can get you an answer.
Try using online converters. They help most of the time.
Websites like these can possibly help:
https://www.freeconvert.com/raw-to-png
https://cloudconvert.com/raw-to-png
https://www.zamzar.com/convert/raw-to-png/
Some sites ask you for details and some are straightforward conversions.

How can I compare the file sizes match between duplicate directories?

I need to compare two directories to validate a backup.
Say my directory looks like the following:
user@main_server:~/mydir/           user@backup_server:~/mydir/
Filename      Filesize              Filename      Filesize
file1000.txt  4182410737            file1000.txt  4182410737
file1001.txt  8241410737            -                          <-- missing on backup_server!
...                                 ...
file9999.txt  2410418737            file9999.txt  1111111111   <-- size != main_server
Is there a quick one liner that would get me close to output like:
Invalid Backup Files:
file1001.txt
file9999.txt
(with the goal of instructing the backup script to refetch these files)
I've tried to get variations of the following to no avail.
[main_server] $ rsync -n ~/mydir/ user@backup_server:~/mydir
I cannot use rsync to back up the directories itself because it takes way too long (8-24 hrs). Instead I run multiple threads of scp to fetch files in batches. This regularly completes in under 1 hr. However, occasionally I find a few files that were somehow missed (perhaps due to a dropped connection).
Speed is a priority, so file sizes should be sufficient. But I'm open to including a checksum, provided it doesn't slow the process down like I find with rsync.
Here's my test process:
# Generate Large Files (1GB)
for i in {1..100}; do head -c 1073741824 </dev/urandom >foo-$i ; done
# SCP them from src to dest
for i in {1..100}; do ( scp ~/mydir/foo-$i user@backup_server:~/mydir/ & ) ; sleep 0.1 ; done
# Confirm destination has everything from source
# This is the point of the question. I've tried:
rsync -Sa ~/mydir/ user@backup_server:~/mydir
# Way too slow
What do you recommend?
By default, rsync uses the quick-check method, which only transfers files that differ in size or last-modified time. As you report that the sizes are unchanged, that would seem to indicate that the timestamps differ. Two options to handle this are:
Use -p with scp (or -t with rsync) to preserve timestamps when transferring files.
Use --size-only to ignore timestamps and transfer only files that differ in size.
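A minimal sketch of how the second option could produce the list you asked for (assuming the same paths as in your test; -r recurses, -n makes it a dry run, and -v prints the names of the files that would be re-transferred):

rsync -rvn --size-only ~/mydir/ user@backup_server:~/mydir/

The output still includes rsync's header and summary lines (and possibly directory names), but the file names in between are exactly the missing or size-mismatched files to feed back to your backup script.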

How do I improve the performance of an read-write intensive imagemagick script?

I use a bash script to process a bunch of images for a timelapse movie. The method is called shutter drag, and I am creating a moving average over the images. The following script works fine:
#! /bin/bash
totnum=10000
seqnum=40
skip=1
num=$(((totnum-seqnum)/1))
i=1
j=1
while [ $i -le $num ]; do
    echo $i
    i1=$i
    i2=$((i+1))
    i3=$((i+2))
    i4=$((i+3))
    i5=$((i+4))
    ...
    i37=$((i+36))
    i38=$((i+37))
    i39=$((i+38))
    i40=$((i+39))
    convert $i1.jpg $i2.jpg $i3.jpg $i4.jpg $i5.jpg ... \
        $i37.jpg $i38.jpg $i39.jpg $i40.jpg \
        -evaluate-sequence mean ~/timelapse/Images/Shutterdrag/$j.jpg
    i=$((i+$skip))
    j=$((j+1))
done
However, I noticed that this script takes a very long time to process a lot of images with a large averaging window (about 1 s per image). I guess this is caused by a lot of reading and writing in the background.
Is it possible to increase the speed of this script? For example, by keeping the images in memory and, with every iteration, dropping the first image and loading only the newest one.
I discovered the mpr:{label} feature of ImageMagick, but I guess this is not the right approach, as the memory is cleared after the convert command?
Suggestion 1 - RAMdisk
If you want to put all your files on a RAMdisk before you start, it should help the I/O speed enormously.
So, to make a 1GB RAMdisk, use:
sudo mkdir /RAMdisk
sudo mount -t tmpfs -o size=1024m tmpfs /RAMdisk
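Then copy your source images onto it and work from there (a sketch; the source path is just a placeholder for wherever your JPEGs actually live):

cp ~/timelapse/Images/*.jpg /RAMdisk/
cd /RAMdisk
# ... run the processing loop here ...
# copy the results somewhere permanent, then release the RAM
sudo umount /RAMdisk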
Suggestion 2 - Use MPC format
So, assuming you have done the previous step, convert all your JPEGs to MPC format files on the RAMdisk. The MPC file can be DMA'ed straight into memory without your CPU needing to do costly JPEG decoding, because MPC is the same format ImageMagick uses in memory, just on disk.
I would do that with GNU Parallel like this:
parallel -X mogrify -path /RAMdisk -format MPC ::: *.jpg
The -X passes as many files as possible to mogrify without creating loads of processes. The -path says where the output files must go. The -format MPC makes mogrify convert the input files to MPC format (Magick Pixel Cache) files, which your subsequent convert commands in the loop can read by pure DMA rather than expensive JPEG decoding.
If you don't have, or don't like, GNU Parallel, just omit the leading parallel -X and the :::.
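In that case the equivalent single-process invocation is simply:

mogrify -path /RAMdisk -format MPC *.jpg

It does the same conversion, just one process handling all the files.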
Suggestion 3 - Use GNU Parallel
You could also run @chepner's code in parallel...
for ...; do
echo convert ...
done | parallel
Essentially, I am echoing all the commands instead of running them and the list of echoed commands is then run by GNU Parallel. This could be especially useful if you cannot compile ImageMagick with OpenMP as Eric suggested.
You can play around with switches such as --eta after parallel to see how long it will take to finish, or --progress. Also, experiment with -j2 or -j4 depending on how big your machine is.
I did some benchmarks, just for fun. First, I made 250 JPEG images of random noise at 640x480, and ran chepner's code "as-is" - that took 2 minutes 27 seconds.
Then, I used the same set of images, but changed the loop to this:
for ((i=1, j=1; i <= num; i+=skip, j+=1)); do
    echo convert "${files[@]:i:seqnum}" -evaluate-sequence mean ~/timelapse/Images/Shutterdrag/$j.jpg
done | parallel
The time went down to 35 seconds.
Then I put the loop back how it was and changed all the input files to MPC instead of JPEG; the time went down to 36 seconds.
Finally, I used MPC format and GNU Parallel as above and the time dropped to 19 seconds.
I didn't use a RAMdisk as I am on a different OS from you (and have extremely fast NVME disks), but that should help you enormously too. You could write your output files to RAMdisk too, and also in MPC format.
Good luck and let us know how you get on please!
There is nothing you can do in bash to speed this up; everything except the actual IO that convert has to do is pretty trivial. However, you can simplify the script greatly:
#! /bin/bash
totnum=10000
seqnum=40
skip=1
num=$(((totnum-seqnum)/1))
# Could use files=(*.jpg), but they probably won't be sorted correctly
for ((i=1; i<=totnum; i++)); do
    files+=($i.jpg)
done

for ((i=1, j=1; i <= num; i+=skip, j+=1)); do
    convert "${files[@]:i:seqnum}" -evaluate-sequence mean ~/timelapse/Images/Shutterdrag/$j.jpg
done
Storing the files in a RAM disk would certainly help, but that's beyond the scope of this site. (Of course, if you have enough RAM, the OS should probably be keeping a file in disk cache after it is read the first time so that subsequent reads are much faster without having to preload a RAM disk.)

How to zgrep the last line of a gz file without tail

Here is my problem: I have a set of big gz log files, and the very first piece of info on each line is a datetime text, e.g.: 2014-03-20 05:32:00.
I need to check which set of log files holds specific data.
For the init I simply do a:
zgrep -m 1 '^20140320-04' 20140320-0{3,4}*gz
BUT HOW to do the same with the last line, without processing the whole file as zcat would (too heavy):
zcat foo.gz | tail -1
Additional info: those logs are created with the date-time of their initial record, so if I want to query logs at 14:00:00 I also have to search in files created BEFORE 14:00:00, as a file could be created at 13:50:00 and closed at 14:10:00.
The easiest solution would be to alter your log rotation to create smaller files.
The second easiest solution would be to use a compression tool that supports random access.
Projects like dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data, which allow a program aware of that extra information to seek within the file. While sync flush points exist in the standard, vanilla gzip does not add such markers either by default or by option.
Files compressed by these random-access-friendly utilities are slightly larger (by perhaps 2-20%) due to the markers themselves, but fully support decompression with gzip or another utility that is unaware of these markers.
You can learn more at this question about random access in various compression formats.
There's also a "Blasted Bioinformatics" blog by Peter Cock with several posts on this topic, including:
BGZF - Blocked, Bigger & Better GZIP! – gzip with random access (like dictzip)
Random access to BZIP2? – An investigation (result: can't be done, though I do it below)
Random access to blocked XZ format (BXZF) – xz with improved random access support
Experiments with xz
xz (an LZMA compression format) actually has random access support on a per-block level, but you will only get a single block with the defaults.
File creation
xz can concatenate multiple archives together, in which case each archive gets its own block. GNU split can do this easily:
split -b 50M --filter 'xz -c' big.log > big.log.sp.xz
This tells split to break big.log into 50MB chunks (before compression) and run each one through xz -c, which outputs the compressed chunk to standard output. We then collect that standard output into a single file named big.log.sp.xz.
To do this without GNU, you'd need a loop:
split -b 50M big.log big.log-part
for p in big.log-part*; do xz -c $p; done > big.log.sp.xz
rm big.log-part*
Parsing
You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:
SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
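If you do this often, the two commands pair nicely as a tiny helper function (a sketch, with the same assumptions as above, including the 36-byte overhead):

xzlast() {
    local size
    size=$(xz --verbose --list "$1" | awk 'END { print $5 + 36 }')
    tail -c "$size" "$1" | unxz -c | tail -n1
}

xzlast big.log.sp.xz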
Side note
Version 5.1.1 introduced support for the --block-size flag:
xz --block-size=50M big.log
However, I have not been able to extract a specific block since it doesn't include full headers between blocks. I suspect this is nontrivial to do from the command line.
Experiments with gzip
gzip also supports concatenation. I (briefly) tried mimicking this process for gzip without any luck. gzip --verbose --list doesn't give enough information and it appears the headers are too variable to find.
This would require adding sync flush points, and since their size varies on the size of the last buffer in the previous compression, that's too hard to do on the command line (use dictzip or another of the previously discussed tools).
I did apt-get install dictzip and played with dictzip, but just a little. It doesn't work without arguments, creating a (massive!) .dz archive that neither dictunzip nor gunzip could understand.
Experiments with bzip2
bzip2 has headers we can find. This is still a bit messy, but it works.
Creation
This is just like the xz procedure above:
split -b 50M --filter 'bzip2 -c' big.log > big.log.sp.bz2
I should note that this is considerably slower than xz (48 min for bzip2 vs 17 min for xz vs 1 min for xz -0) as well as considerably larger (97M for bzip2 vs 25M for xz -0 vs 15M for xz), at least for my test log file.
Parsing
This is a little harder because we don't have the nice index. We have to guess at where to go, and we have to err on the side of scanning too much, but with a massive file, we'd still save I/O.
My guess for this test was 50000000 (out of the original 52428800, a pessimistic guess that isn't pessimistic enough for e.g. an H.264 movie.)
GUESS=50000000
LAST=$(tail -c$GUESS big.log.sp.bz2 \
|grep -abo 'BZh91AY&SY' |awk -F: 'END { print '$GUESS'-$1 }')
tail -c $LAST big.log.sp.bz2 |bunzip2 -c |tail -n1
This takes just the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off of the end of the file. Just that part is decompressed and thrown into tail.
Because this has to query the compressed file twice and has an extra scan (the grep call seeking the header, which examines the whole guessed space), this is a suboptimal solution. See also the below section on how slow bzip2 really is.
Perspective
Given how fast xz is, it's easily the best bet; using its fastest option (xz -0) is quite fast to compress or decompress and creates a smaller file than gzip or bzip2 on the log file I was testing with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.
              ————— No Random Access ——————    ——————— Random Access ———————
FORMAT         SIZE    RATIO   WRITE    READ    SIZE    RATIO   WRITE    SEEK
——————————    ——————————————————————————————   ——————————————————————————————
(original)    7211M   1.0000       -    0:06   7211M   1.0000       -    0:00
bzip2           96M   0.0133   48:31    3:15     97M   0.0134   47:39    0:00
gzip            79M   0.0109    0:59    0:22
dictzip        605M   0.0839    1:36           (fail)
xz -0           25M   0.0034    1:14    0:12     25M   0.0035    1:08    0:00
xz              14M   0.0019   16:32    0:11     14M   0.0020   16:44    0:00
Timing tests were not comprehensive, I did not average anything and disk caching was in use. Still, they look correct; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it allows an otherwise non-multithreaded utility to consume multiple threads).
Well, you can randomly access a gzipped file if you have previously created an index for each file ...
I've developed a command line tool which creates indexes for gzip files which allow for very quick random access inside them:
https://github.com/circulosmeos/gztool
The tool has two options that may be of interest for you:
-S supervises a still-growing file and creates an index for it as it grows - this can be useful for gzipped rsyslog files, as it reduces the index creation time to practically zero.
-t tails a gzip file: this way you can do: $ gztool -t foo.gz | tail -1
Please note that if the index doesn't exist yet, this will take the same time as a complete decompression; but as the index is reusable, subsequent searches will be greatly reduced in time!
This tool is based on zran.c demonstration code from original zlib, so there's no out-of-the-rules magic!

ImageMagick crop huge image

I am trying to create tiles from a huge image, say 40000x40000.
I found a script online for ImageMagick that crops the tiles. It works fine on smaller images, say 10000x5000.
Once I go any bigger it ends up using too much memory and the computer dies.
I have added the limit options but they don't seem to take effect.
I have the monitor in there but it does not help, as the script just slows down and locks up the machine.
It seems to just gobble up around 50 GB of swap and then kill the machine.
I think the problem is that as it crops each tile it keeps them all in memory. What I think I need is for it to write each tile to disk as it creates it, not store them all up in memory.
Here is the script so far:
#!/bin/bash
file=$1
function tile() {
    convert -monitor -limit memory 2GiB -limit map 2GiB -limit area 2GB $file -scale ${s}%x -crop 256x256 \
        -set filename:tile "%[fx:page.x/256]_%[fx:page.y/256]" \
        +repage +adjoin "${file%.*}_${s}_%[filename:tile].png"
}
s=100
tile
s=50
tile
After a lot more digging and some help from the guys on the ImageMagick forum I managed to get it working.
The trick to getting it working is the .mpc format. Since this is the native image format used by ImageMagick, it does not need to decode the initial image; it just cuts out the piece that it needs. This is what the second script below does.
Let's say you have a 50000x50000 .tif image called myLargeImg.tif. First convert it to the native image format using the following command:
convert -monitor -limit area 2mb myLargeImg.tif myLargeImg.mpc
Then run the bash script below to create the tiles. Create a file named tiler.sh in the same folder as the MPC image and put the following script in it:
#!/bin/bash
src=$1
width=`identify -format %w $src`
limit=$((width / 256))
echo "count = $limit * $limit = "$((limit * limit))" tiles"
limit=$((limit-1))
for x in `seq 0 $limit`; do
    for y in `seq 0 $limit`; do
        tile=tile-$x-$y.png
        echo -n $tile
        w=$((x * 256))
        h=$((y * 256))
        convert -debug cache -monitor $src -crop 256x256+$w+$h $tile
    done
done
In your console/terminal, run the command below and watch the tiles appear one at a time in your folder.
sh ./tiler.sh myLargeImg.mpc
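Note that MPC is really a pair of files - myLargeImg.mpc plus a (large) myLargeImg.cache holding the raw pixel data - so once the tiles are generated you may want to delete both:

rm myLargeImg.mpc myLargeImg.cache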
libvips has an operator that can do exactly what you want very quickly. There's a chapter in the docs introducing dzsave and explaining how it works.
It can also do it in relatively little memory: I regularly process 200,000 x 200,000 pixel slide images using less than 1GB of memory.
See this answer, but briefly:
$ time convert -crop 512x512 +repage huge.tif x/image_out_%d.tif
real 0m5.623s
user 0m2.060s
sys 0m2.148s
$ time vips dzsave huge.tif x --depth one --tile-size 512 --overlap 0 --suffix .tif
real 0m1.643s
user 0m1.668s
sys 0m1.000s
You may try the gdal_translate utility from the GDAL project. Don't be scared off by the "geospatial" in the project name. GDAL is an advanced library for accessing and processing raster data in various formats. It is dedicated to geospatial users, but it can be used to process regular images as well, without any problems.
Here is a simple script to generate 256x256-pixel tiles from a large in.tif file of dimensions 40000x40000 pixels:
#!/bin/bash
width=40000
height=40000
y=0
while [ $y -lt $height ]
do
    x=0
    while [ $x -lt $width ]
    do
        outtif=t_${y}_$x.tif
        gdal_translate -srcwin $x $y 256 256 in.tif $outtif
        let x=$x+256
    done
    let y=$y+256
done
GDAL binaries are available for download for most Unix-like systems as well as Windows.
ImageMagick is simply not made for this kind of task. In situations like yours I recommend using the VIPS library and the associated frontend Nip2.
VIPS has been designed specifically to deal with very large images.
http://www.vips.ecs.soton.ac.uk/index.php?title=VIPS
