I'm trying to do the following:
GIVEN:
A scanned or photographed document: a form that is filled in by many different programs.
I'm trying to recognize only a small part of the data, as shown here:
All symbols are digits except the first one in the top-most field, which is a letter.
THE PROBLEM is that I tried Tesseract and Google ML OCR, but the results are very poor, perhaps because the input is single symbols in cells rather than normal text. I don't know.
So I decided to try writing my own simple recognition module.
a) I transform the image to grayscale and then to black and white.
b) Unfortunately, there is no guarantee that the fields are in exactly the same places every time, and they are not the same size because of the scan/photo.
So I'm trying to find the positions of the fields dynamically.
But in the test photos I received, there is no guarantee that the lines are straight.
Also the scanned/shot resolution is not always the same.
It would be great if someone could give me advice on the following:
Dynamically finding the fields (currently my success rate is about 50%, depending on the photo).
How to handle the non-straight lines.
How to detect a single cell's content/symbol.
A good way to recognize single symbols/digits (comparing against source templates, etc.).
And maybe a better black-and-white transformation than a simple threshold (a common alternative is sketched below).
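On that last point, one commonly suggested alternative to a global threshold is adaptive thresholding, which computes a local threshold per neighborhood and copes much better with the uneven lighting of a photo. A minimal sketch with OpenCV in Python (the file name, block size, and offset are assumptions to tune for your scans):

import cv2

gray = cv2.imread("form.jpg", cv2.IMREAD_GRAYSCALE)
bw = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # threshold from a local Gaussian-weighted mean
    cv2.THRESH_BINARY,
    31,   # blockSize: neighborhood size, must be odd (assumed value)
    10)   # C: constant subtracted from the local mean (assumed value)
cv2.imwrite("form_bw.png", bw)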
Try removing the rectangular frames around the numbers before recognition.
Using morphological operations such as closing/opening, you can close the frame at the bottom of the picture while preserving the numbers. For example, with ImageMagick:
convert input.jpg -threshold 90% -fuzz 25% -fill black -floodfill +0+0 white -fill white -floodfill +0+0 black out.png
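If it helps, here is a rough sketch of that frame-removal idea in Python with OpenCV (the kernel lengths and file names are assumptions to adjust to your cell sizes):

import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
# Invert so strokes are white on black, then binarize with Otsu.
_, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Opening with long thin kernels keeps only long horizontal/vertical runs,
# i.e. the frame lines (40 px is an assumed minimum line length).
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
h_lines = cv2.morphologyEx(bw, cv2.MORPH_OPEN, h_kernel)
v_lines = cv2.morphologyEx(bw, cv2.MORPH_OPEN, v_kernel)

# Subtract the frame, keeping the digit strokes.
digits = cv2.subtract(bw, cv2.bitwise_or(h_lines, v_lines))
cv2.imwrite("digits.png", digits)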
I have a scanned image, scanned from a printed Word (docx) file. I want the scanned image to look like the original Word file, i.e. to remove noise and enhance it: fully whiten the white parts and fully blacken the black parts, without changing the colorful parts of the file.
There are a number of ways you could approach this. The simplest would be to apply a levels filter with the black point raised a bit and the white point lowered a bit. This can be done to all 3 color channels or selectively to a subset. Since you're going for pure black and white and there's no color cast on the image, I would apply the same settings to all 3 color channels. It works like this:
destVal = (srcVal - blackPt) / (whitePt - blackPt);
This will slightly change the colored parts of the image, probably resulting in making them slightly more or less saturated.
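As a minimal sketch (Pillow and NumPy assumed; the black and white points are placeholder values), applying that mapping to all three channels might look like:

import numpy as np
from PIL import Image

def apply_levels(path, black_pt=0.0, white_pt=0.66):
    # Work in [0, 1] so the formula above applies directly per channel.
    pixels = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    out = (pixels - black_pt) / (white_pt - black_pt)
    return Image.fromarray((np.clip(out, 0.0, 1.0) * 255).astype(np.uint8))

apply_levels("scan.png").save("cleaned.png")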
I tried this in a photo editing app and was disappointed with the results. I was able to remove most of the noise by bringing the white point down to about 66%. However, the logo in the upper left is so wispy that it ended up turning very white as well. The black point didn't really need to be moved.
I think you're going to have a tough time with that logo. You could isolate it from your other changes, though, and that might help. A simple circular area around it where you just ignore any processing would probably do the trick.
But I got to thinking - this was made with Word. Do you have a copy of Word? It probably wouldn't be too difficult to put together a layout that's nearly identical. It still wouldn't help with the logo. But what you could do is lay out the text the same way and export it to a PDF or other image format. (Or if you can find the original template, just use it directly.) Then you could write some code to process your scanned copy: wherever a pixel is grayscale (red = green = blue), use the corresponding pixel from the version you made; otherwise use the pixel from the scan. That would get you all the stamps and signatures, while keeping the text nice and sharp. Perhaps you could even find the organization's logo online. In fact, Wikipedia even has a copy of their logo.
You'd probably need to have some sort of threshold for the grayscale because some pixels might be close but have a slight color cast. One option might be something like this:
if ((fabs(red - green) < threshold) && (fabs(red - blue) < threshold))
{
    destVal = recreationVal; // The output is the same as the copy you made manually
}
else
{
    destVal = scannedVal; // The output is the same as the scan
}
You may find this eats away at some of the colored marks, so you could do a second pass over the output where any pixel that's adjacent to a colored pixel brings in the corresponding pixel from the original scan.
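A sketch of that two-pass merge in Python (Pillow and NumPy assumed; the threshold and the one-pixel growth of the colored region are the parts to tune):

import numpy as np
from PIL import Image

scan = np.asarray(Image.open("scan.png").convert("RGB"), dtype=np.int16)
recreation = np.asarray(Image.open("recreation.png").convert("RGB"), dtype=np.int16)

threshold = 16  # assumed tolerance for "close enough to gray"
r, g, b = scan[..., 0], scan[..., 1], scan[..., 2]
gray_mask = (np.abs(r - g) < threshold) & (np.abs(r - b) < threshold)

# Second pass: grow the colored (non-gray) region by one pixel in each
# direction so marks adjacent to color keep their scanned pixels.
colored = ~gray_mask
grown = colored.copy()
grown[1:, :] |= colored[:-1, :]
grown[:-1, :] |= colored[1:, :]
grown[:, 1:] |= colored[:, :-1]
grown[:, :-1] |= colored[:, 1:]

out = np.where(grown[..., None], scan, recreation).astype(np.uint8)
Image.fromarray(out).save("merged.png")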
I have a PDF file that I render using the Adobe PDF engine as well as another PDF RIP, and I want to identify places where they render differently. The problem is that they both render with slightly different anti-aliasing, color, and very minimal positional differences; I am more concerned with larger differences. I'd like to compare the attached files match1.png and match2.png, which should match. The files diff1.png and diff2.png should NOT match: they should fail, as one has a missing letter. I've worked with ImageMagick's compare, but haven't gotten good enough results to ignore the slight rendering differences and focus only on the major ones. I think the algorithm below could work, but I wasn't sure how to read PNG files and compare them pixel by pixel.
sample image files
Compare each pixel in ImageA to the same pixel in ImageB, and also to each neighboring pixel in ImageB.
Find the nearest matching pixel from all the tested pixels in ImageB.
Store the color difference of this closest-matching pixel as the difference index for that pixel in ImageA.
Repeat this calculation for each pixel in ImageA.
Normalize the difference values so that the smallest difference is zero: un-normalized values might be (3, 8, 5, 18) and normalized values (0, 5, 2, 15), or use some other normalizing function.
Then count how many pixels are over a threshold value, say 5. This count of 'bad' pixels can be used to estimate the probability that there is a rendering error in one of the images.
This is just a sample algorithm. I wasn't sure how to process PNG files pixel by pixel. I'd be open to a PHP solution or a solution using command-line tools such as ImageMagick.
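A minimal sketch of those steps in Python (Pillow and NumPy assumed; a PHP port would follow the same structure). Note that np.roll wraps at the image borders, so edge pixels are approximate; cropping a 1-pixel border would be cleaner:

import numpy as np
from PIL import Image

a = np.asarray(Image.open("match1.png").convert("RGB"), dtype=np.float32)
b = np.asarray(Image.open("match2.png").convert("RGB"), dtype=np.float32)

# For each pixel in A, the color distance to the best of B's 3x3 neighborhood.
best = np.full(a.shape[:2], np.inf)
for dy in (-1, 0, 1):
    for dx in (-1, 0, 1):
        shifted = np.roll(np.roll(b, dy, axis=0), dx, axis=1)
        dist = np.sqrt(((a - shifted) ** 2).sum(axis=2))
        best = np.minimum(best, dist)

best -= best.min()            # normalize: smallest difference becomes zero
bad = int((best > 5).sum())   # threshold of 5 taken from the step above
print(bad, "pixels differ beyond the threshold")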
If you flicker the two match images (alternately display one, then the other, back and forth) you will see that the two orange colors are different. It is not just a difference due to anti-aliasing at the edges.
So using ImageMagick 6, compare will show quite a lot of change (as red). If using ImageMagick 7, use magick compare. See http://www.imagemagick.org/Usage/compare/#compare.
compare -metric rmse match1.png match2.png match_diff.png
2304.18 (0.0351595)
Here the difference is only 3.5%, but the difference image shows a lot of red, since nearly every pixel has a slightly different value.
Another way to see the difference is with -compose difference. The whiter the result, the more difference.
convert match1.png match2.png -compose difference -composite match_diff2.png
This does not show much difference, so all values are only slightly different. But by stretching the dynamic range, you can see where it is most different.
One way to check neighboring pixels is to repeat the compare, shifting one image 1 pixel in each direction (using -roll +X+Y) and trimming off 1 pixel all around each image with -shave XxY. You can do that in a loop and save each result, then use -evaluate-sequence min to find the minimum (closest pixel values) at each pixel location. That is equivalent to searching a 3x3 neighborhood of compares. The drawback is that you won't know which shift produced the minimum at each pixel.
P.S. You can also use compare -metric AE -fuzz 5%. The fuzz value lets the compare ignore values within that percentage of each other. This does not seem to work in ImageMagick 6.9.9.10, but it does work in ImageMagick 7.0.6.10.
magick compare -metric AE -fuzz 5% match1.png match2.png match_diff3.png
219487
This says that there are 219487 pixels mismatched by any amount. Here is where they differ.
I would blur the images slightly to remove any fine detail, then look for the maximum difference. A simple difference metric is dE76. Spatial CIELAB does something very similar (if a little fancier).
Here's an implementation using php-vips:
#!/usr/bin/env php
<?php
require __DIR__ . '/vendor/autoload.php';
use Jcupitt\Vips;
$a = Vips\Image::newFromFile($argv[1]);
$b = Vips\Image::newFromFile($argv[2]);
$a = $a->gaussblur(2);
$b = $b->gaussblur(2);
$diff = $a->dE76($b);
# output a 500 pixel across difference map
$scale = 500.0 / $diff->width;
$diff->multiply(5)->resize($scale)->writeToFile("map.png");
echo("maximum image difference is " . $diff->max() . " dE76\n");
For your images, I see:
$ ./measure_diff.php match1.png match2.png
maximum image difference is 13.739426612854 dE76
$ ./measure_diff.php diff1.png diff2.png
maximum image difference is 55.040554046631 dE76
The 13 in the first pair is just differences in colour rendering; the 55 in the second is the extra comma. It's very visible on the map image:
I'm working with a library of photos, and I'd like some way of calculating a 'detail' value for each image as a whole. By detail, I mean the amount of contrasting colours, shades and edges, such that an image that's a single flat colour would have 0 detail, and a photo filled with lots of small elements would have a high detail value.
This doesn't have to be super-precise - I'd like to identify images that have a detail value higher than some threshold, and treat them differently. A reasonable guess is good enough.
I have a few ideas that might be feasible:
Save the image as a JPG at a given size and compression level, and check the resulting file size. This basically uses the compression algorithm as the check - detailed images make large files. Seems slow, expensive, and crude, but it wouldn't require a lot of custom work (a sketch follows this list).
Sub-divide the image into a grid, sample points within each square, and compare how unique their values are. It seems like it could work, but would require a fine grid and a lot of samples in order to be useful.
Use an edge-detecting filter like unsharp-mask: Take the original, and a copy sharpened by a known amount, then take the average colour of each. If they are very different, the filter has done a 'lot of work' and therefore the image has a lot of edges (and so a lot of detail). This seems promising, but I'm not sure if it would actually work!
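For what it's worth, the first idea is only a few lines if you render to an in-memory buffer. A sketch in Python with Pillow (the resize dimensions and JPEG quality are arbitrary fixed settings; the file name is a placeholder):

import io
from PIL import Image

def detail_score(path, size=(512, 512), quality=75):
    img = Image.open(path).convert("RGB").resize(size)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.tell()  # bigger file at fixed size/quality => more detail

print(detail_score("photo.jpg"))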
Processing will be done out-of-band, so performance isn't a huge issue. If it takes a few seconds per image, that's fine. I'm using RMagick (ImageMagick) and Ruby.
Am I missing something? Is there an easier way?
You could try measuring the entropy and see how that works for your images - it's hard to tell how well it will perform for your needs though.
You get ImageMagick to measure it like this:
identify -format '%[entropy]' input.jpg
You can also measure it using the convert tool like this:
convert -size 100x100 xc:white gradient:red-blue +append -print '%[entropy]' null:
Or you could do, say, a Canny edge detection and then calculate the mean of the resulting black-and-white image, which tells you what percentage of the pixels are edges, as an integer between 0-100, like this:
convert input.jpg -canny 0x1+10%+30% -format "%[fx:int(mean*100)]" info:
12
As suggested by Kurt, you could also look at the number of unique colours divided by the number of pixels, which will obviously give small results for large blocks of constant colour and larger results for photograph-type images.
identify -format "%k different colors, %w (width), %h (height)\n" image.jpg
I liked the edge detection suggestion -
convert input.jpg -canny 0x1+10%+30% -format "%[fx:int(mean*100)]" info:
However, this has some big drawbacks. The biggest is that the values you get are hugely dependent on the size of the image.
For reference, I am interested in using this to get an objective measure of the level of detail in Tibetan thangkas. The price of a thangka depends heavily on the amount of time an artist takes to paint it, and that, in turn, depends heavily on the amount of detail in the painting. The more painstaking little details, the longer it takes to paint the thangka. One thangka with a lot of detail can take over a year; another of the same size but with much less detail can take just a week.
To be useful, this measure has to be scaled based on the size of the photograph: it should give the same answer for a gigapixel photograph as for a megapixel photograph, as long as there is enough resolution to capture the smallest details in the painting.
My first pass with ImageMagick is:
magick convert pic.jpg -canny 0x1+10%+30% -format "%[fx:mean*sqrt(w*h)]" info:
When I test with this image, it works very well.
multi-star picture
I get the same value for this image when I scale it up by a factor of 2, and a pretty close value when I scale it down by 10% - but the number is smaller, which you would expect, since some of the detail is eliminated when you scale it down.
Unfortunately, when I try it on an actual thangka image it does not work as well:
Actual thangka
When I scale this up I get very different results - by as much as a factor of 1.5. I need to find a better measure.
I'm trying to take a PDF and convert it to a TIFF, but make all grayscale into pure black. In other words, if it's not white, it should be black. The tiffg3 device is 1-bit, but it's still taking gray and trying to fake it.
Example command:
gs -dQUIET -dNOPAUSE -dBATCH -r200 -sPAPERSIZE=letter -sDEVICE=tiffg3 -sOutputFile=out.tiff in.pdf
Example input:
Example output:
Desired output:
The 1-bit device uses halftoning (aka screening) to represent shades of gray using only black and white pixels. That's what it's intended to do; it's not intended to change colours at all.
There are various ways you could fake what you want in PostScript (redefining setgray is one method; setting a transfer function is another), but PDF is not a programming language, so that approach doesn't really work here.
You could use a custom ICC profile to colour correct gray values so that they are all black. Ghostscript versions 9 and above use Little CMS as the colour management system, and have a 'default' Gray ICC profile which you can override. Presumably it would be easy enough to construct a profile which maps anything other than white to pure black. However, this is not my field. You also need to consider how this is going to affect (for example) images.
There are additional controls based on object type in Ghostscript, so you could optionally only apply this conversion to text, or linework.
Ideally you should go back to the original document, alter it there, and make a new PDF.
You can use convert from ImageMagick for that purpose, using the -level flag. For example, to convert an image to pure black/white, splitting the gray scale in half:
convert -level 50%,50% file.pdf blackwhite.pdf
and then pass it to GS. My suggestion for fax, though, is to leave part of the grayscale in place, since many documents rely on it for legibility:
convert -density 816x784 -level 55%,95% file.pdf blackwhite.pdf
This one leaves 40% of the grayscale in place: it converts to black every value in the range 0-55 (0 being black and 100 white) and pushes everything in the range 95-100 to white; you can play with those numbers to find your best match. The -density flag will help later when converting to tiffg3 in GS, giving better quality in the resulting TIFF.
I'm trying to build a script to go through my original, high-res photos and replace the old, low-res ones I uploaded to Flickr before I had a pro account.
For many of them I can just use Exif info such as date taken to determine a match. But some are really old, and either the original file didn't have Exif info, or it got clobbered by whatever stupid resizing software I used at the time.
So, unable to rely on metadata, I'm forced to resort to the content itself. The problem is that the originals are at different resolutions from the ones on Flickr (which is the whole point of this endeavour). So is there a way for me to compare them with some sort of fuzzy similarity measure that would allow me to set a threshold for requiring human input or not?
I guess knowing one image is a resized version of the other can yield better results than general similarity. A solution in any language will do, but Ruby would be a plus :)
Interesting problem, btw :)
Slow-ish solution - excellent chance of success
Use a scale-invariant feature detector to find corresponding features in both images. If the features are matched with a high score at similar locations, then you have your match.
I'd recommend SIFT, which generates a scale- and rotation-invariant 128-element descriptor for each feature found in an image. SURF (available in OpenCV) is another (faster) feature-point detector.
You can match features across two images by brute force (compare each descriptor to every descriptor in the other image), which is O(n^2) but pretty fast (especially in the VLFeat SIFT implementation). But if you need to compare the features in one image against several images (which you might have to), you should build a tree of the features and query it with the other image's features. K-d trees are useful here, and OpenCV has a nice implementation.
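As a rough illustration (a sketch, not a production pipeline), here is SIFT matching with a FLANN k-d tree in Python using OpenCV; the file names are placeholders, and the 0.7 ratio is Lowe's usual suggestion:

import cv2

img1 = cv2.imread("original.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("flickr.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# FLANN with k-d trees avoids the O(n^2) brute-force comparison.
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
matches = flann.knnMatch(des1, des2, k=2)

# Lowe's ratio test: keep matches clearly better than the runner-up.
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
print(len(good), "good matches out of", len(matches))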
Fast solution - might work
Downsample your high-res image to the low-res dimensions and use a similarity measure like SAD (sum of absolute differences, where the score is the summed difference between blocks of, say, 3x3 pixels around corresponding pixels in both images) to determine a match.
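A quick sketch of that idea in Python (Pillow and NumPy assumed; for simplicity this uses a whole-image mean absolute difference rather than per-block SAD, and the threshold is a made-up starting point):

import numpy as np
from PIL import Image

def similarity_score(original_path, lowres_path):
    low = Image.open(lowres_path).convert("L")
    high = Image.open(original_path).convert("L").resize(low.size, Image.LANCZOS)
    a = np.asarray(high, dtype=np.float32)
    b = np.asarray(low, dtype=np.float32)
    return float(np.abs(a - b).mean())  # lower means more similar

score = similarity_score("original.jpg", "flickr.jpg")
needs_human = score > 12  # hypothetical threshold for flagging manual review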
I'd recommend scripting a solution off of ImageMagick. The following (from the documentation on comparing images with IM) would output a comparative value that you can use.
convert image1 image2 \
-compose difference -composite -colorspace gray miff:- |\
identify -verbose - |\
sed -n '/^.*Mean: */{s//scale=2;/;s/(.*)//;s/$/*100\/32768/;p;q;}' | bc
Compute the normalized color histograms of both images and compare them using some method (histogram intersection, for example). Normalized histograms are needed because the images have different resolutions. If the histograms are very dissimilar, they are not the same picture. If they are similar, you have one of two cases: (i) they are the same picture, or (ii) they are different pictures that happen to have similar global color distributions.
For case (ii), split the images into rectangular tiles and repeat the process, comparing corresponding tiles; this accounts for local properties of the image. Rank the results and pick the best match.
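A sketch of the tiled histogram comparison in Python (Pillow and NumPy assumed; the tile and bin counts are arbitrary starting points):

import numpy as np
from PIL import Image

def tile_histograms(path, tiles=4, bins=32):
    img = np.asarray(Image.open(path).convert("RGB"))
    h, w = img.shape[:2]
    hists = []
    for ty in range(tiles):
        for tx in range(tiles):
            tile = img[ty * h // tiles:(ty + 1) * h // tiles,
                       tx * w // tiles:(tx + 1) * w // tiles]
            for c in range(3):  # one histogram per colour channel
                hist, _ = np.histogram(tile[..., c], bins=bins, range=(0, 256))
                hists.append(hist / max(hist.sum(), 1))  # normalize for resolution
    return hists

def intersection(h1, h2):
    # Histogram intersection: 1.0 means identical distributions.
    return float(np.mean([np.minimum(p, q).sum() for p, q in zip(h1, h2)]))

score = intersection(tile_histograms("original.jpg"), tile_histograms("flickr.jpg"))
print("tile histogram intersection:", round(score, 3))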