I have grey scale images with an unknown number of handwritten digits (0-9) on them.
I am trying to build a machine learning model that determines:
The x,y coordinate for each digit.
The digit label (i.e. 0-9).
Example
(I couldn't upload the greyscale images, so suppose . denotes "black background", and the numbers represent themselves):
Image1: Image2: Image3:
7....... .2...... ........
........ .....3.. ........
....1... ........ ........
........ ....2... ........
Thus, letting f denote my machine learning model/function we should have:
f(Image1) = [ label0:[],
              label1:[(x=4,y=2)],
              label2:[],
              label3:[],
              label4:[],
              label5:[],
              label6:[],
              label7:[(x=0,y=0)],
              label8:[],
              label9:[],
            ]
f(Image2) = [ label0:[],
              label1:[],
              label2:[(x=1,y=0), (x=4,y=3)],
              label3:[(x=5,y=1)],
              label4:[],
              label5:[],
              label6:[],
              label7:[],
              label8:[],
              label9:[],
            ]
f(Image3) = [ label0:[],
              label1:[],
              label2:[],
              label3:[],
              label4:[],
              label5:[],
              label6:[],
              label7:[],
              label8:[],
              label9:[],
            ]
I'm attempting to apply deep learning methods using Keras to solve both problems at the same time, but I'm struggling to set up my labels, as there is an unknown number of labels for each image.
Does anyone have any ideas about how I could set up such a problem for deep learning? Should I break the problem into 2 stages (location, then classification; but then the location problem still has an unknown number of labels)? Thanks!
You can divide this problem into two parts.
In the first part you need a method for detecting whether a region of the image contains a digit. For this you can use a method called "sliding windows" (watch this video by Andrew Ng explaining this method). Assume your image is 200x200 and each digit is around 20x20. Create a 20x20 window and, in each iteration, move it right by 20 pixels (or fewer/more); when the window reaches the right edge of the image, move it back to the left side and 20 pixels down (or fewer/more). After each move, crop the image under the window and use a neural network to check whether the crop contains a digit. If it does, save the window's x, y coordinates and the cropped image to an array.
The second part should be easy: pass the collected digit crops to a neural network that determines each digit's label.
So you should train two neural networks: one for detecting whether there is a digit in an image, and a second for determining the digit's label.
There is a second way to find digits in the image: train a neural network that determines the number of digits in the image (this might be difficult), and then use k-means (with the number of clusters set to the number of digits returned by the network) to find the positions of the digits, provided they are not too close to each other. I did this in one project and it worked, but the images need a plain background, and you have to build an array of the positions of pixels whose brightness exceeds some threshold.
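To make the two-network pipeline concrete, here is a minimal sketch of the sliding-window loop. The names sliding_windows, detect_and_classify, is_digit and classify_digit are my own; the two callables are placeholders for the trained networks (for example thin wrappers around a Keras model's predict), not anything from a real library.

```python
import numpy as np

def sliding_windows(image, window=20, stride=20):
    """Yield (x, y, crop) for every window position, left to right, top to bottom."""
    height, width = image.shape
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield x, y, image[y:y + window, x:x + window]

def detect_and_classify(image, is_digit, classify_digit, window=20, stride=20):
    """Return a list of (x, y, label) for every window the detector accepts.

    is_digit and classify_digit are hypothetical callables standing in for
    the two trained networks (detector and digit classifier).
    """
    results = []
    for x, y, crop in sliding_windows(image, window, stride):
        if is_digit(crop):
            results.append((x, y, classify_digit(crop)))
    return results

# Toy stand-ins so the sketch runs without trained models.
image = np.zeros((200, 200), dtype=np.uint8)
image[40:60, 100:120] = 255                    # pretend this bright patch is a digit
found = detect_and_classify(
    image,
    is_digit=lambda crop: crop.mean() > 128,   # placeholder detector
    classify_digit=lambda crop: 7,             # placeholder classifier
)
```

A stride smaller than the window gives overlapping crops, which helps when digits do not line up with the grid, at the cost of more network evaluations and possible duplicate detections.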
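A rough sketch of the threshold-plus-k-means idea, assuming bright digits on a dark, plain background. The digit count k would come from the counting network; here it is just passed in. The k-means below is a plain NumPy implementation of Lloyd's algorithm written for illustration, and all function names are my own.

```python
import numpy as np

def bright_pixel_positions(image, threshold=128):
    """Return an (N, 2) array of (x, y) positions whose brightness exceeds threshold."""
    ys, xs = np.nonzero(image > threshold)
    return np.column_stack((xs, ys)).astype(float)

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns the k cluster centres as (x, y) rows."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every point to every centre, then nearest-centre labels.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

# Two synthetic "digits" as bright blobs on a black background.
image = np.zeros((100, 100), dtype=np.uint8)
image[10:20, 10:20] = 255
image[70:80, 60:70] = 255
centers = kmeans(bright_pixel_positions(image), k=2)
```

Each centre is an estimate of one digit's position; a crop around it can then be passed to the classifier.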
Related
Need your help.
For example, I have a game. It is 500x500 pixels and shows random pictures. When a particular picture of a smiling dog appears, you need to press space within 1 second.
Let's assume that we already have that particular image of the smiling dog in memory (let's call it the target image from now on). All that is left is to compare the scanned image to the target one. So we need to: scan this 500x500 area in real time (one to ten times per second), compare the scan result to the target image, and then do something depending on the result of the comparison.
I need to do it as fast, as reliably and as efficiently (in terms of computing power) as possible.
Also, I need to be able to move the game window without breaking the screen capture (the capture has to be anchored to the game window, not to the whole screen).
So what are your suggestions for handling the scanning and comparison phases of this program in the fastest, most reliable and most efficient way possible?
/*
EDIT: I think I need to use OpenCV for the screen scan and comparison. But which language would be preferable in terms of speed and efficient use of computing power? C++ or Python?
*/
Also, I need this program to work on Windows and Mac OS in the future; for now it is just Windows.
Thanks in advance.
/*
EDIT: What do you guys think of this method:
Let our scan result image be "X", and let our target image (the image that we already have, aka the "image of the smiling dog", which we want X to be identical to) be "Y".
First we downscale Y from 500x500 to 50x50, then we downscale X from 500x500 to 50x50, then we convert both images to grey scale. Then we run a normalized sum of squared differences, or even a normalized cross-correlation, to detect similarities between X and Y. What do you think? Will it be faster than my suggestion:
{
Downsample the captured image to 50x50 pixels. Then compare every 25th pixel, one by one, to the downsampled target image (if the 1st pixel is the same as in the target image, proceed to the next pixel; if not, wait until the next scanning event and scan the first pixel again). If all pixels are identical, then we are good, and we have found that exact picture.
}
*/
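For reference, the downscale-then-compare idea from the second edit could be sketched like this in plain NumPy. This only illustrates the two similarity measures; it is not a benchmark of either approach. The function names, the 0.9 threshold and the block-averaging downscale are my own assumptions, and the screen capture itself is left out.

```python
import numpy as np

def to_gray(rgb):
    """Convert an (H, W, 3) RGB array to a float grey-scale image."""
    return rgb.astype(float) @ np.array([0.299, 0.587, 0.114])

def downscale(gray, factor=10):
    """Shrink by averaging factor x factor blocks (500x500 -> 50x50 for factor=10)."""
    h = gray.shape[0] - gray.shape[0] % factor
    w = gray.shape[1] - gray.shape[1] % factor
    blocks = gray[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def normalized_ssd(a, b):
    """Sum of squared differences over sqrt(sum(a^2) * sum(b^2)); 0 means identical."""
    ssd = ((a - b) ** 2).sum()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    if denom == 0:
        return 0.0 if ssd == 0 else float("inf")
    return float(ssd / denom)

def normalized_cross_correlation(a, b):
    """Zero-mean normalized cross-correlation; 1 means identical up to gain and offset."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def matches_target(scan_rgb, target_small, threshold=0.9):
    """Compare a freshly captured frame against the pre-shrunk grey target."""
    scan_small = downscale(to_gray(scan_rgb))
    return normalized_cross_correlation(scan_small, target_small) >= threshold
```

The target only needs to be converted and shrunk once, so the per-frame cost is one downscale of the capture plus a 50x50 comparison.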
I'm building a photographic film scanner. The electronic hardware is done; now I have to finish the mechanical advance mechanism, and then I'm almost done.
I'm using a line scan sensor, so it's one pixel wide by 2000 high. The data stream I will be sending to the PC over USB with an FTDI FIFO bridge will be just 1-byte pixel values. The scanner will pull through an entire strip of 36 frames, so I will end up scanning the entire strip. To begin with I'm willing to split them up manually in Photoshop, but I would like to implement something in my program to do this for me. I'm using C++ in VS. So, basically, I need a way for the PC to detect the near-black strips between the images on the film, isolate the images and save them as individual files.
Could someone give me some advice for this?
That sounds pretty simple compared to the things you've already implemented; you could
calculate an average pixel value per row, and call the resulting signal s(n) (n being the row number).
set a threshold for s(n), setting everything below that threshold to 0 and everything above to 1
Assuming you don't know the exact pixel height of the black bars and the negatives, search for periodicities in s(n). What I describe in the following is total overkill, but that's how I roll:
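use FFTW to calculate a discrete Fourier transform of s(n), call it S(f) (f being the frequency, i.e. 1/period).
find argmax(abs(S(f))) over f > 0 (skip f = 0, which only holds the average of s(n)); that f represents the distance between two black bars: number of rows / f is the bar distance.
S(f) is complex, and thus has an argument (phase): atan2(imag(S(f_max)), real(S(f_max))). Dividing that phase by 2*pi and multiplying by the bar distance gives the offset of the periodic pattern within one period, from which you get the positions of the bars.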
To calculate the width of the bars, you could do the same with the second highest peak of abs(S(f)), but it'll probably be easier to just count the average length of 0 around the calculated center positions of the black bars.
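Here is how the steps above might look as a NumPy sketch (the question uses C++, so treat this as pseudocode for the logic). It assumes the scan is stored as a 2-D array with one scan line per row, that the inter-frame bars are darker than the threshold, and that frames are evenly spaced; the function names are made up for illustration. It uses NumPy's FFT in place of FFTW and also shows the simpler "count the zeros" alternative.

```python
import numpy as np

def row_signal(strip, threshold):
    """Average each row to get s(n), then binarise: 0 = dark (bar), 1 = image."""
    s = strip.mean(axis=1)
    return (s > threshold).astype(int)

def bar_period(binary):
    """Estimate the bar-to-bar distance in rows from the strongest non-DC FFT peak."""
    spectrum = np.abs(np.fft.rfft(binary))
    f_max = 1 + int(np.argmax(spectrum[1:]))   # skip f = 0, the average of s(n)
    return len(binary) / f_max

def dark_runs(binary):
    """Return (start, end) row ranges, end exclusive, of consecutive dark rows."""
    padded = np.concatenate(([1], binary, [1]))
    steps = np.diff(padded)
    starts = np.flatnonzero(steps == -1)
    ends = np.flatnonzero(steps == 1)
    return list(zip(starts.tolist(), ends.tolist()))
```

The frames are the row ranges between consecutive dark runs; the period from the FFT is mainly useful as a sanity check when a very dark frame produces a spurious run.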
To get the exact width of the image strip, only take the pixels in which the image border may lie: r_left(x) would be the signal representing the few pixels in which the actual image might border on the filmstrip material (x being the coordinate along that row). Now, use a simplistic high pass filter (e.g. f(x) := r_left(x) - r_left(x-1)) to find the sharpest edge in that region (argmax(abs(f(x)))). Use the average of these edges as the border location.
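The first-difference edge search can be sketched like this, assuming region is a small 2-D array holding, for each row, only the few pixels where the border may lie. The names sharpest_edge and border_location are made up for illustration.

```python
import numpy as np

def sharpest_edge(r):
    """Index x that maximises |f(x)| with f(x) = r(x) - r(x-1)."""
    f = np.diff(np.asarray(r, dtype=float))   # f[i] = r[i+1] - r[i]
    return int(np.argmax(np.abs(f))) + 1      # shift so the index refers to x, not x-1

def border_location(region):
    """Average the per-row edge positions to get a single border coordinate."""
    return float(np.mean([sharpest_edge(row) for row in region]))
```

For example, sharpest_edge([10, 10, 12, 11, 200, 205]) picks index 4, where the jump from 11 to 200 happens.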
By the way, if you want to write a source block that takes your scanned image as input and outputs a stream of pixel row vectors, using GNU Radio would offer you a nice method of having a flow graph of connected signal processing blocks that does exactly what you want, without you having to care about getting data from A to B.
I forgot to add: Use the resulting coordinates with something like OpenCV, or any other library capable of reading images, specifying sub-images by coordinates, and saving them as new images.
I want to read a barcode from a scanned image that I printed. The image format is not relevant. I found that the scanned images are of very low quality, and I can understand why normal barcodes fail.
My idea is to create a non-standard and very simple barcode at the top of each printed page. It will be 20 squares in a row forming a simple binary code. Filled = 1, open = 0. It will be large enough on an A4 page to make detection easy.
At this stage I need to load the image and find the barcode somewhere at the top. It will not be in exactly the same spot each time it is scanned in. Then I need to step into each block and build the ID.
Any knowledge or links to info would be awesome.
If you can preset a region of interest that contains the code and nothing else, then detection is pretty easy. Scan a few rays across this region and find the white/black and black/white transitions. Then, knowing where the "cells" should be, you know their polarity.
For this to work, you need to frame your cells with two black ones on both ends to make sure to know where it starts/stops (if the scale is fixed, you can do with just a start cell, but I would not recommend this).
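A minimal sketch of decoding a single ray under those assumptions: the ray is one row of grey values taken from the region of interest, the code is framed by one black cell at each end, and "open" cells read as white at the sampling position. The function name decode_ray and the default threshold are illustrative only.

```python
import numpy as np

def decode_ray(ray, n_bits=20, threshold=128):
    """Decode one horizontal ray; returns a list of bits, or None if nothing dark is found."""
    black = np.asarray(ray) < threshold
    dark = np.flatnonzero(black)
    if dark.size == 0:
        return None
    # The first and last dark pixels are taken as the outer edges of the framing cells.
    start, stop = int(dark[0]), int(dark[-1]) + 1
    cell = (stop - start) / (n_bits + 2)       # +2 for the two framing cells
    bits = []
    for i in range(1, n_bits + 1):
        lo = int(round(start + i * cell))
        hi = max(int(round(start + (i + 1) * cell)), lo + 1)
        # Majority vote over the pixels of this cell: mostly black means 1.
        bits.append(int(black[lo:hi].mean() > 0.5))
    return bits
```

Running this on a few rays and keeping the result that most of them agree on makes it more robust against scanner noise and handwriting crossing the code.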
You could have a look at https://github.com/zxing/zxing. I would suggest to use a 1D bar code, but wide enough to match the low resolution of the scanner.
You could also invent your own bar code encoding and try to parse it yourself. Use thick bars for 1 and thin lines for 0. A thick bar would be, for instance, 2 white pixels followed by 4 black pixels. A thin line would be 2 white pixels, 2 black pixels and 2 white pixels. The last two pixels encode the bit value.
The pixel should be the size of the scanned image pixel.
You then process the image scan line by scan line, trying to locate the bar code.
We locate the bar code by comparing a given pixel value sequence with a pattern. This is performed by computing a score function. The sum of squared difference is a good pick. When computing the score we ignore the two pixels encoding the bit value.
When the score is below a threshold, we have found a matching pattern. It is good to add parity bits to the encoded value so that its validity can be checked.
Computing a sum of squared differences on a sliding window can be optimized.
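One possible way to do that optimization, as a sketch: expand (x - p)^2 into x^2 - 2xp + p^2, so the score at every window position comes from two correlations plus a constant instead of a loop over windows. The mask marks which pattern pixels count; the two bit-value pixels get weight 0. The function name and the mask convention are my own choices.

```python
import numpy as np

def sliding_ssd(line, pattern, mask=None):
    """Masked sum of squared differences of pattern against every window of line."""
    line = np.asarray(line, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    mask = np.ones_like(pattern) if mask is None else np.asarray(mask, dtype=float)
    # sum(mask * (x - p)^2) = sum(mask * x^2) - 2 * sum(mask * x * p) + sum(mask * p^2)
    xx = np.correlate(line ** 2, mask, mode="valid")
    xp = np.correlate(line, mask * pattern, mode="valid")
    pp = (mask * pattern ** 2).sum()
    return xx - 2.0 * xp + pp

# One 6-pixel symbol: 2 white, 2 black, then the 2 pixels that carry the bit.
pattern = np.array([255, 255, 0, 0, 0, 0])
mask = np.array([1, 1, 1, 1, 0, 0])            # ignore the two bit-value pixels
line = np.full(40, 128)                        # grey background for the demo
line[7:13] = [255, 255, 0, 0, 255, 255]        # a thin line (bit 0) starting at x = 7
scores = sliding_ssd(line, pattern, mask)
```

Positions where scores falls below the chosen threshold are candidate symbols; the bit is then read from the two masked pixels.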
I am currently working on OCR software and my idea is to use templates to try to recognize data inside invoices.
However scanned invoices can have several 'flaws' with them:
Not all invoices, based on a single template, are correctly aligned under the scanner.
People can write on invoices
etc.
Example of invoice: (Have to google it, sadly cannot add a more concrete version as client data is confidential obviously)
I find my data in the invoices based on the x-values of the text.
However I need to know the scale of the invoice and the offset from left/right, before I can do any real calculations with all data that I have retrieved.
What have I tried so far?
1) Making the image monochrome and using the left and right bounds of the first appearance of a black pixel. This fails because people can write on invoices.
2) Dividing the invoice up into vertical sections and using the sections that have the highest number of black pixels. This fails because the distribution is not always uniform among similar templates.
I could really use your help on (1) how to identify important points in invoices and (2) on what I should focus as the important points.
I hope the question is clear enough as it is quite hard to explain.
Detecting rotation
I would suggest you start by detecting straight lines.
Look (perhaps randomly) for small areas with high contrast, i.e. mostly white but a fair amount of very black pixels as well. Then try to fit a line to these black pixels, e.g. using least squares method. Drop the outliers, and fit another line to the remaining points. Iterate this as required. Evaluate how good that fit is, i.e. how many of the pixels in the observed area are really close to the line, and how far that line extends beyond the observed area. Do this process for a number of regions, and you should get a weighted list of lines.
For each line, you can compute the direction of the line itself and the direction orthogonal to that. One of these numbers can be chosen from the interval [0°, 90°); the other will be 90° plus that value, so storing one is enough. Take all these directions, and find one angle which best matches all of them. You can do that using a sliding window of e.g. 5°: slide across that (cyclic) region and find a position where the maximal number of lines are within the window, then compute the average or median of the angles within that window. All of this computation can be done taking the weights of the lines into account.
Once you have found the direction of lines, you can rotate your image so that the lines are perfectly aligned to the coordinate axes.
Detecting translation
Assuming the image wasn't scaled at any point, you can then try to use an FFT-based correlation of the image to match it to the template. Convert both images to gray, and pad them with zeros until the originals take up at most 1/2 the edge length of the padded image, which preferably should be a power of two. FFT both images in both directions, multiply one transform element-wise by the complex conjugate of the other (a plain product would give a convolution rather than a correlation) and iFFT back. The resulting image will encode how much the two images would agree for a given shift relative to one another. Simply find the maximum, and you know how to make them match.
Added text will cause no problems at all. This method will work best for large areas, like the company logo and gray background boxes. Thin lines will provide a poorer match, so in those cases you might have to blur the picture before doing the correlation, to broaden the features. You don't have to use the blurred image for further processing; once you know the offset you can return to the rotated but unblurred version.
Now you know both rotation and translation, and assumed no scaling or shearing, so you know exactly which portion of the template corresponds to which portion of the scan. Proceed.
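A NumPy sketch of the FFT-based correlation, under the same no-scaling assumption. It additionally assumes that the features are bright on a dark background, so a scanned page with dark ink on white paper would need inverting first. The function names are illustrative.

```python
import numpy as np

def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    return 1 << (n - 1).bit_length()

def find_shift(scan, template):
    """Return (dy, dx) such that scan[y + dy, x + dx] best matches template[y, x]."""
    # Zero-pad so the originals take up at most half of each padded edge.
    h = next_power_of_two(2 * max(scan.shape[0], template.shape[0]))
    w = next_power_of_two(2 * max(scan.shape[1], template.shape[1]))
    f_scan = np.fft.rfft2(scan.astype(float), s=(h, w))
    f_template = np.fft.rfft2(template.astype(float), s=(h, w))
    # Multiplying by the complex conjugate gives a correlation, not a convolution.
    corr = np.fft.irfft2(f_scan * np.conj(f_template), s=(h, w))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    dy, dx = int(dy), int(dx)
    # Peaks in the upper half of each axis correspond to negative shifts.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return dy, dx
```

Blurring both inputs before calling find_shift, as suggested for thin lines, only changes what is passed in; the returned offset is then applied to the unblurred scan.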
If rotation is solved already, I'd just sum up all pixel color values horizontally and vertically to a single horizontal / vertical "line". This should provide clear spikes where you have horizontal and vertical lines in the form.
p.s. Generated a corresponding horizontal image with Gimp's scaling capabilities, attached below (it's a bit hard to see because it's only one pixel high and may get scaled down because it's > 700 px wide; the url is http://i.stack.imgur.com/Zy8zO.png ).
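In code, the row and column sums could be sketched as follows, assuming an 8-bit grey image with dark lines on a light background and rotation already corrected. The 0.8 fraction used to pick the spikes is an arbitrary value and the function names are my own.

```python
import numpy as np

def projection_profiles(gray):
    """Sum the darkness along each row and each column."""
    darkness = 255.0 - gray.astype(float)
    row_profile = darkness.sum(axis=1)   # one value per row: spikes at horizontal lines
    col_profile = darkness.sum(axis=0)   # one value per column: spikes at vertical lines
    return row_profile, col_profile

def line_positions(profile, fraction=0.8):
    """Indices whose profile value reaches the given fraction of the maximum."""
    peak = profile.max()
    if peak <= 0:
        return np.array([], dtype=int)
    return np.flatnonzero(profile >= fraction * peak)
```

The spacing between detected lines can then be compared with the template to estimate scale and offset.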
I have an image that I used to analyze in LabView using a method called Rake. Basically, what that method does is it finds all the significant edges along parallel lines on an image.
http://zone.ni.com/reference/en-XX/help/370281P-01/imaqvision/imaq_rake_3/ (as seen on the last image at the bottom of the link).
The beauty of this function is that it will give you all edge points that are larger than a certain edge strength, and each edge will only generate one edge point (thickness of the edge line is 1 pixel)
I want to use OpenCV to do something similar. The way I imagine doing this is:
- deconstructing the Canny operator with a filter of my choice,
- hysteresis thresholding of the edge values with two thresholds,
- followed by non-maximum suppression,
- reading the pixels along that line and marking all pixels that are larger than my threshold.
The problem is that Canny comes as a bundle and I can't find the non-maximum suppression function by itself.
Does anybody know of a way to do something similar to the operation I've described?
Thanks
Not sure if I understand this question fully, but about the unbundled non-maximum suppression part:
One simple way for 2d non-maximum suppression is this:
dilate the image. Dilation in OpenCV sets the value of each pixel to the max() of the local neighborhood. Repeat a few times or use a larger kernel to get the desired radius.
Then compare the dilated image with the original and set all pixels with differing values to zero.
The remaining pixels are local maxima.
# some code I once used in OpenCV/Python
# given an image, sets all pixels to zero, unless they are local maxima
import cv2

def supressNonMaxima(img):
    # 3 iterations with the default 3x3 kernel: max() with radius of 3
    localMax = cv2.dilate(img, None, iterations=3)
    # pixels strictly below their neighbourhood maximum are not local maxima
    mask = img < localMax
    img[mask] = 0
    return img