Convolutional Neural Networks - Theory

Convolutional Neural Networks - Theory - filter

I am sorry for asking this stupid question, but after a bit thinking, I still don't get it yet:
According to Jordi Torres (see here), if we look at an image with 28x28 = 784 pixels, then one way to implement this is to let one neuron of a hidden layer learn about 5x5 = 25 pixels of the input layer:
However, as he explains it:
Analyzing a little bit the concrete case we have proposed, we note that, if we have an input of 28×28 pixels and a window of 5×5, this defines a space of 24×24 neurons in the first hidden layer because we can only move the window 23 neurons to the right and 23 neurons to the bottom before hitting the right (or bottom) border of the input image. We would like to point out to the reader that the assumption we have made is that the window moves forward 1 pixel away, both horizontally and vertically when a new row starts. Therefore, in each step, the new window overlaps the previous one except in this line of pixels that we have advanced.
I really don't get why we need a space of 24x24 neurons in the first hidden layer? Since I take 5x5 windows (which have 25 pixels out of 784 in them), I thought we would need 785/25 = 32 neurons at all. I mean, doesn't one neuron of the hidden layer learn the property of 25 pixels?
Apparently not, but I am really confused.

You're assuming non-overlapping 5x5 segments, but that's not the case. In this example, the first output is derived from rows 1-5, columns 1-5 of the input. The next one uses rows 1-5, columns 2-6, on to rows 1-5, columns 24-28, then rows 2-6, columns 1-5, etc. etc. until rows 24-28, columns 24-28. This is referred to as a "stride" of 1.

Related

Speeding up a pre-computed seam carving algorithm

I'm writing a JS seam carving library. It works great, I can rescale a 1024x1024 image very cleanly in real time as fast as I can drag it around. It looks great! But in order to get that performance I need to pre-compute a lot of data and it takes about 10 seconds. I'm trying to remove this bottleneck and am looking for ideas here.
Seam carving works by removing the lowest energy "squiggly" line of pixels from an image. e.g. If you have a 10x4 image a horizontal seam might look like this:
........x.
.x.....x.x
x.xx..x...
....xx....
So if you resize it to 10x3 you remove all the 'X' pixels. The general idea is that the seams go around the things that look visually important to you, so instead of just normal scaling where everything gets squished, you're mostly removing things that look like whitespace, and the important elements in a picture are unaffected.
The process of calculating energy levels, removing them, and re-calculating is rather expensive, so I pre-compute it in node.js and generate a .seam file.
Each seam in the .seam file is basically: starting position, direction, direction, direction, direction, .... So for the above example you'd have:
starting position: 2
seam direction: -1 1 0 1 0 -1 -1 -1 1
This is quite compact and allows me to generate .seam files in ~60-120kb for a 1024x1024 image depending on settings.
Now, in order to get fast rendering I generate a 2D grids that represents the order in which pixels should be removed. So:
(figure A):
........1.
.1.....1.1
1.11..1...
....11....
contains 1 seam of info, then we can add a 2nd seam:
(figure B):
2...2....2
.222.2.22.
......2...
and when merged you get:
2...2...12
.122.2.1.1
1211..122.
....112...
For completeness we can add seams 3 & 4:
(figures C & D):
33.3..3...
..3.33.333
4444444444
and merge them all into:
(figure E):
2343243412
3122424141
1211331224
4434112333
You'll notice that the 2s aren't all connected in this merged version, because the merged version is based on the original pixel positions, whereas the seam is based on the pixel positions at the moment the seam is calculated which, for this 2nd seam, is a 10x3px image.
This allows the front-end renderer to basically just loop over all the pixels in an image and filter them against this grid by number of desired pixels to remove. It runs at 100fps on my computer, meaning that it's perfectly suitable for single resizes on most devices. yay!
Now the problem that I'm trying to solve:
The decoding step from seams that go -1 1 0 1 0 -1 -1 -1 1 to the pre-computed grid of which pixels to remove is slow. The basic reason for this is that whenever one seam is removed, all the seams from there forward get shifted.
The way I'm currently calculating the "shifting" is by splicing each pixel of a seam out of a 1,048,576 element array (for a 1024x1024 px image, where each index is x * height + y for horizontal seams) that stores the original pixel positions. It's veeerrrrrryyyyy slow running .splice a million times...
This seems like a weird leetcode problem, in that perhaps there's a data structure that would allow me to know "how many pixels above this one have already been excluded by a seam" so that I know the "normalized index". But... I can't figure it out, anything I can think of requires too many re-writes to make this any faster.
Or perhaps there might be a better way to encode the seam data, but using 1-2 bits per pixel of the seam is very efficient, and anything else I can come up with would make those files huge.
Thanks for taking the time to read this!
[edit and tl;dr] -- How do I efficiently merge figures A-D into figure E? Alternatively, any ideas that yield figure E efficiently, from any compressed format

If I understand correct your current algorithm is:
while there are pixels in Image:
seam = get_seam(Image)
save(seam)
Image = remove_seam_from_image(Image, seam)
You then want to construct an array containing the numbers of each seam.
To do so, you could make a 1024x1024 array where each value is the index of that element of the array (y*width+x). Call this Indices.
A modified algorithm then gives you what you want
Let Indices have the dimensions of Image and be initialized to [0, len(Image)0
Let SeamNum have the dimensions of Image and be initialized to -1
seam_num = 0
while there are pixels in Image:
seam = get_seam(Image)
Image = remove_seam_from_image(Image, seam)
Indices = remove_seam_from_image_and_write_seam_num(Indices, seam, SeamNum, seam_num)
seam_num++
remove_seam_from_image_and_write_seam_num is conceptually identical to remove_seam_from_image except that as it walks seam to remove each pixel from Indices it writes seam_num to the location in SeamNum indicated by the pixel's value in Indices.
The output is the SeamNum array you're looking for.

Detecting individual images in an array of images

I'm building a photographic film scanner. The electronic hardware is done now I have to finish the mechanical advance mechanism then I'm almost done.
I'm using a line scan sensor so it's one pixel width by 2000 height. The data stream I will be sending to the PC over USB with a FTDI FIFO bridge will be just 1 byte values of the pixels. The scanner will pull through an entire strip of 36 frames so I will end up scanning the entire strip. For the beginning I'm willing to manually split them up in Photoshop but I would like to implement something in my program to do this for me. I'm using C++ in VS. So, basically I need to find a way for the PC to detect the near black strips in between the images on the film, isolate the images and save them as individual files.
Could someone give me some advice for this?

That sounds pretty simple compared to the things you've already implemented; you could
calculate an average pixel value per row, and call the resulting signal s(n) (n being the row number).
set a threshold for s(n), setting everything below that threshold to 0 and everything above to 1
Assuming you don't know the exact pixel height of the black bars and the negatives, search for periodicities in s(n). What I describe in the following is total overkill, but that's how I roll:
use FFTw to calculate a discrete fourier transform of s(n), call it S(f) (f being the frequency, i.e. 1/period).
find argmax(abs(S(f))); that f represents the distance between two black bars: number of rows / f is the bar distance.
S(f) is complex, and thus has an argument; arctan(imag(S(f_max))/real(S(f_max)))*number of rows will give you the position of the bars.
To calculate the width of the bars, you could do the same with the second highest peak of abs(S(f)), but it'll probably be easier to just count the average length of 0 around the calculated center positions of the black bars.
To get the exact width of the image strip, only take the pixels in which the image border may lie: r_left(x) would be the signal representing the few pixels in which the actual image might border to the filmstrip material, x being the coordinate along that row). Now, use a simplistic high pass filter (e.g. f(x):= r_left(x)-r_left(x-1)) to find the sharpest edge in that region (argmax(abs(f(x)))). Use the average of these edges as the border location.
By the way, if you want to write a source block that takes your scanned image as input and outputs a stream of pixel row vectors, using GNU Radio would offer you a nice method of having a flow graph of connected signal processing blocks that does exactly what you want, without you having to care about getting data from A to B.
I forgot to add: Use the resulting coordinates with something like openCV, or any other library capable of reading images and specifying sub-images by coordinates as well as saving to new images.

"Barcode" reading from scanned image

I want to read a barcode from a scanned image that I printed. The image format is not relevant. I found that the scanned images are of very low quality and can understand why it normal barcodes fail.
My idea is to create a non standard and very simple barcode at the top of each page printed. It will be 20 squares in a row forming a simple binary code.Filled = 1, open = 0. It will be large enough on aA4 to make detection easy.
At this stage I need to load the image and find the barcode somewhere at the top. It will not be exactly at the same spot as it is scanned in. Step into each block and build the ID.
Any knowledge or links to info would be awesome.

If you can preset a region of interest that contains the code and nothing else, then detection is pretty easy. Scan a few rays across this region and find the white/black and black/white transitions. Then, knowing where the "cells" should be, you known their polarity.
For this to work, you need to frame your cells with two black ones on both ends to make sure to know where it starts/stops (if the scale is fixed, you can do with just a start cell, but I would not recommend this).

You could have a look at https://github.com/zxing/zxing. I would suggest to use a 1D bar code, but wide enough to match the low resolution of the scanner.
You could also invent your own bar code encoding and try to parse it your self. Use thick bars for 1 and thin lines for 0. A thick bar would be for instance 2 white pixels, 4 black pixels. A thin line would be 2 white pixels, 2 black pixels and 2 white pixels. The last two pixels encode the bit value.
The pixel should be the size of the scanned image pixel.
You then process the image scan line by scan line, trying to locate the bar code.
We locate the bar code by comparing a given pixel value sequence with a pattern. This is performed by computing a score function. The sum of squared difference is a good pick. When computing the score we ignore the two pixels encoding the bit value.
When the score is below a threshold, we found a matching pattern. It is good to add parity bits to the encoded value so that it's validity can be checked.
Computing a sum of square on a sliding window can be optimized.

XSLT create complex SVG visualisation minimizing line crossings etc

This is no actual single coding problem, rather a problem of the right approach to a complex issue.
So, I have built a rather complex svg visualisation of my XML data using xslt. It looks like this:
(source: erksst.de)
This is just a small sample of the whole data. There are two or three rows. Each row could contain up to 160 yellow boxes.
(The yellow boxes are letter collections, the blue/grey boxes single letters, the lines represent their way of dissemination.)
It works well so far but I want to optimize it:
(1) minimize the number of line crossing
(2) minimize the number of lines crossing a blue/grey box
(3) minimize the lines being too near to another line.
To achieve this there are things to vary:
(a) The broadest row (in the sample it is the third) is fix. It can't be moved. But the other (two) can be moved in the range of the width of the broadest row. I.e. in my example the yellow box of the second row could be moved some 160 pixels to the right.
(b) Furthermore, in the two smaller rows the margin between the yellow boxes could be varied. In my example there is just one per line. But of course there could be more than one yellow box in the two smaller rows.
(c) The order of the yellow boxes within a row could be altered.
So, many possibilites to realize this visualisation.
The problem is the performance time.
I have started with the line crossing problem by using a function which kind of pre-builds the visualisation and calculates the number of crossings.
The variation with the smallest number of crossings is actually built in the output.
The problem is the time it needs. The transformation with just 100 possibilites and my hole XML data took 90 seconds. Doesn't sound like much, but taking into account that 100 variations are just a very small part of all theoritically possible options and that the visualition should at some point in the future build on the fly on server for a user's selection of the data 90 seconds simply is way too much.
I have allready reduced the visualition template for the calculate line crossings functions to all what is necessary leaving asside all captions and so on. That did help, but not as much as expected.
The lines are drawn as follows: First, all boxes are drawn keeping their id from the original data. Then I go back to my data, look where connections are and build the lines.

You could transform your XML into the DOT language (plain TXT format) by XSLT and process it by GraphViz. I solved some similar issue (although not so huge as yours seems to be) this way.

Invoice / OCR: Detect two important points in invoice image

I am currently working on OCR software and my idea is to use templates to try to recognize data inside invoices.
However scanned invoices can have several 'flaws' with them:
Not all invoices, based on a single template, are correctly aligned under the scanner.
People can write on invoices
etc.
Example of invoice: (Have to google it, sadly cannot add a more concrete version as client data is confidential obviously)
I find my data in the invoices based on the x-values of the text.
However I need to know the scale of the invoice and the offset from left/right, before I can do any real calculations with all data that I have retrieved.
What have I tried so far?
1) Making the image monochrome and use the left and right bounds of the first appearance of a black pixel. This fails due to the fact that people can write on invoices.
2) Divide the invoice up in vertical sections, use the sections that have the highest amount of black pixels. Fails due to the fact that the distribution is not always uniform amongst similar templates.
I could really use your help on (1) how to identify important points in invoices and (2) on what I should focus as the important points.
I hope the question is clear enough as it is quite hard to explain.

Detecting rotation
I would suggest you start by detecting straight lines.
Look (perhaps randomly) for small areas with high contrast, i.e. mostly white but a fair amount of very black pixels as well. Then try to fit a line to these black pixels, e.g. using least squares method. Drop the outliers, and fit another line to the remaining points. Iterate this as required. Evaluate how good that fit is, i.e. how many of the pixels in the observed area are really close to the line, and how far that line extends beyond the observed area. Do this process for a number of regions, and you should get a weighted list of lines.
For each line, you can compute the direction of the line itself and the direction orthogonal to that. One of these numbers can be chosen from an interval [0°, 90°), the other will be 90° plus that value, so storing one is enough. Take all these directions, and find one angle which best matches all of them. You can do that using a sliding window of e.g. 5°: slide accross that (cyclic) region and find a value where the maximal number of lines are within the window, then compute the average or median of the angles within that window. All of this computation can be done taking the weights of the lines into account.
Once you have found the direction of lines, you can rotate your image so that the lines are perfectly aligned to the coordinate axes.
Detecting translation
Assuming the image wasn't scaled at any point, you can then try to use a FFT-based correlation of the image to match it to the template. Convert both images to gray, pad them with zeros till the originals take up at most 1/2 the edge length of the padded image, which preferrably should be a power of two. FFT both images in both directions, multiply them element-wise and iFFT back. The resulting image will encode how much the two images would agree for a given shift relative to one another. Simply find the maximum, and you know how to make them match.
Added text will cause no problems at all. This method will work best for large areas, like the company logo and gray background boxes. Thin lines will provide a poorer match, so in those cases you might have to blur the picture before doing the correlation, to broaden the features. You don't have to use the blurred image for further processing; once you know the offset you can return to the rotated but unblurred version.
Now you know both rotation and translation, and assumed no scaling or shearing, so you know exactly which portion of the template corresponds to which portion of the scan. Proceed.

If rotation is solved already, I'd just sum up all pixel color values horizontally and vertically to a single horizontal / vertical "line". This should provide clear spikes where you have horizontal and vertical lines in the form.
p.s. Generated a corresponding horizontal image with Gimp's scaling capabilities, attached below (it's a bit hard to see because it's only one pixel high and may get scaled down because it's > 700 px wide; the url is http://i.stack.imgur.com/Zy8zO.png ).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio