ORB Feature Descriptor Official Paper Explanation - feature-extraction

I was just reading the official ORB paper by Ethan Rublee and I find the section "4.3 Learning Good Binary Features" hard to understand.
I was searching the Internet to dig deeper into it and found the paragraph below, but I still haven't found a practical explanation of it. Can anyone explain it to me in simple terms?
"Given a local image patch in size of m × m, and suppose the local window
(i.e., the box filter used in BRIEF) used for intensity test is of size r × r , there are N = (m − r )2 such local windows.
Each two of them can define an intensity test, so we have C2N bit features. In the original implementation of ORB, m is set to 31, generating 228,150 binary tests. After removing tests that overlap, we finally have a set of 205,590 candidate bit features. Based on a training set, ORB selects at most 256 bits according to Greedy algorithm."
What I'm getting from the official paper and from the above paragraph is this:
We have a patch of size 31x31 and select a window of size 5x5, so we have N = (31-5)^2 = 676 possible sub-windows. I'm not getting the lines marked in bold. What does it mean that, by removing tests that overlap, we get 205,590 bit features?

Imagine a small image of size 31x31 (the patch) and a small 5x5 window. In how many different positions can this window be placed inside the image? If you slide it 1 pixel at a time, it can be placed in (31-5)^2 = 676 different positions, right? Combining the central pixels of those 676 windows two at a time, you have 676!/(2!*(676-2)!) = 228,150 combinations. In the case of the ORB descriptor they were not interested in sliding the window 1 pixel at a time, because the heavy overlap between nearby windows would make the tests very noisy (the windows are too close to each other). So they removed overlapping windows by sliding 5 pixels at a time and used the central pixels to create the binary tests, which reduced the total number of combinations to 205,590.
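As a quick sanity check of those numbers (assuming the 31x31 patch and 5x5 window from the question), the pair count falls straight out of the binomial coefficient:
m = 31; r = 5;
N = (m - r)^2        % number of 5x5 window positions at stride 1: 676
nchoosek(N, 2)       % every unordered pair of centres is one binary test: 228150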

Related

Starting point for image recognition?

I have a set of 274 color images (each one is 200x150 pixels). Each image is visually distinct. I would like to build an app which accepts an up/down-scaled version of one of the base set of images and determines the closest match.
I'm a senior software engineer but am totally new to image recognition. I'd really appreciate any recommendations as to where to start.
If you're comparing extremely similar images, it's in theory sufficient to calculate the Euclidean distance between the two images. The images must be the same size for this, so it is often necessary to rescale one of them (generally the larger image is scaled down). Note that aliasing issues can happen here, so pay some attention to your downsampling algorithm. There's also an issue if your images don't have the same aspect ratio.
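As a rough sketch of that direct comparison in MATLAB (the file names are placeholders, and imresize assumes the Image Processing Toolbox):
A = im2double(imread('ref.png'));        % reference image
B = im2double(imread('query.png'));      % up/down-scaled query image
B = imresize(B, [size(A,1) size(A,2)]);  % bring the query to the reference size
d = sqrt(sum((A(:) - B(:)).^2));         % pixel-wise Euclidean distance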
However, this is almost never done in practice since it's extremely slow. For N images of size WxH and 3 color channels, it requires N x W x H x 3 comparisons, which quickly gets unworkable (consider that many users can have over 1000 images of size >1000x1000).
Generally we attempt to reduce the image to a smaller array that captures the image information much more compactly, called a visual descriptor, for example reducing a 1024x1024x3 image to a 128-element vector. This only needs to be calculated once for the reference images, and the result can be stored in an appropriate data structure. Then we can compare the descriptor for the query image against the descriptors for the reference images.
The cost of calculating the distance for our dataset of N images with a descriptor of length L is then N x L instead of the original N x W x H x 3.
So the issue is to find efficient descriptors that are (a) cheap to compute and (b) capture the image accurately. This is still an active area of research, but I can suggest some:
Histograms are probably the simplest way to do this, although they do very poorly with any illumination change and incorporate only color information, no spatial information. Make sure you normalise your histogram before doing any comparison (a minimal sketch follows this list)
Perceptual hashing works well with very similar images or slightly cropped images. See here
GIST descriptors are powerful, but more complex, see here
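Going back to the histogram suggestion above, here is a minimal sketch of a size-normalised color histogram descriptor in MATLAB (8 bins per channel is an arbitrary choice, and histcounts assumes a reasonably recent release):
function h = colorHistogramDescriptor(I)
    % Coarse RGB histogram: 8 bins per channel, 24 values in total
    I = im2double(I);
    edges = linspace(0, 1, 9);
    h = [];
    for c = 1:3
        ch = I(:,:,c);
        h = [h, histcounts(ch(:), edges)];
    end
    h = h / sum(h);      % normalise so image size does not matter
end
Two descriptors can then be compared with something as simple as norm(h1 - h2).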

What are some common focus stacking algorithms?

I want to write my own focus stacking software but haven't been able to find a suitable explanation of any algorithm for extracting the in-focus portions of each image in the stack.
For those who are not familiar with focus stacking, this Wikipedia article does a nice job of explaining the idea.
Can anyone point me in the right direction for finding an algorithm? Even some key words to search would be helpful.
I realise this is over a year old but for anyone who is interested...
I have had a fair bit of experience in machine vision and this is how I would do it:
Load every image in memory
Perform a Gaussian blur on each image on one of the channels (maybe Green):
The simplest Gaussian kernel is:
1 2 1
2 4 2
1 2 1
The idea is to loop through every pixel and look at the pixels immediately adjacent. The pixel that you are looping through is multiplied by 4, the neighboring pixels are multiplied by the corresponding values in the kernel above, and the products are summed (divide the sum by 16 so that the kernel sums to 1). A rough MATLAB sketch of the whole procedure is given after the last step below.
You can make a larger Gaussian kernel by using the equation:
exp(-(x^2 + y^2) / (2*c^2))
where c is the strength of the blur
Perform a Laplacian Edge Detection kernel on each Gaussian Blurred image but do not apply a threshold
The simplest Laplacian operator is:
-1 -1 -1
-1 8 -1
-1 -1 -1
Same deal as the Gaussian: slide the kernel over the entire image and generate a result.
An equation to work out larger kernels is here:
(-1 / (pi*c^4)) * (1 - (x^2 + y^2)/(2*c^2)) * exp(-(x^2 + y^2)/(2*c^2))
Take the absolute value of the Laplacian of Gaussian result. This will quantify the strength of edges with respect to the size and strength of your kernel.
Now create a blank image, loop through each pixel and find the strongest edge in the LoG (i.e. the highest value in the image stack) and take the RGB value for that pixel from the corresponding image.
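For concreteness, here is a rough MATLAB sketch of the steps above (a sketch only, separate from the complete example linked below; file names are placeholders and the stack images are assumed to be aligned and the same size):
files = {'stack1.jpg', 'stack2.jpg', 'stack3.jpg'};   % placeholder stack
n = numel(files);
g   = [1 2 1; 2 4 2; 1 2 1] / 16;        % 3x3 Gaussian kernel, normalised
lap = [-1 -1 -1; -1 8 -1; -1 -1 -1];     % 3x3 Laplacian kernel
for k = 1:n
    rgb = im2double(imread(files{k}));
    green = rgb(:,:,2);                           % blur one channel only
    blurred = conv2(green, g, 'same');
    edges = abs(conv2(blurred, lap, 'same'));     % |Laplacian of Gaussian|
    if k == 1
        stack = zeros([size(rgb) n]);
        focus = zeros([size(green) n]);
    end
    stack(:,:,:,k) = rgb;
    focus(:,:,k)   = edges;
end
[~, best] = max(focus, [], 3);            % index of the sharpest slice per pixel
[rows, cols] = ndgrid(1:size(best,1), 1:size(best,2));
out = zeros(size(rgb));
for c = 1:3
    ch = squeeze(stack(:,:,c,:));         % H x W x n for this channel
    out(:,:,c) = ch(sub2ind(size(ch), rows, cols, best));
end
imwrite(out, 'Outsharp_sketch.bmp');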
Here is an example in MATLAB that I have created:
http://www.magiclantern.fm/forum/index.php?topic=11886.0
You are free to use it for whatever you like. It will create a file called Outsharp.bmp which is what you are after.
To better your output image you could:
- Compensate for differences in lightness levels between images (i.e. histogram matching or simple level adjustment)
- Create a custom algorithm to reject image noise
- Manually adjust the stack after you have generated it
- Apply a Gaussian blur (be sure to divide the result by 16) on the focus map so that the individual images are better merged
Good luck!

How to detect boundaries of a pattern [duplicate]

Possible Duplicate:
Detecting thin lines in blurry image
So as the title says, I am trying to detect boundaries of patterns. In the images attached, you can basically see three different patterns.
Close stripe lines
One thick L shaped line
The area between 1 & 2
I am trying to separate these three into, say, 3 separate images. Depending on where the answers go, I will upload more images if needed. Either ideas or code would be helpful.
You can solve (for some values of "solve") this problem using morphology. First, to make the image more uniform, remove irrelevant minima. One way to do this is using the h-dome transform for regional minima, which suppresses minima of height < h. Now, we want to join the thin lines. That is accomplished by a morphological opening with a horizontal line of length l. If the lines were merged, then the regional minima of the current image is the background. So we can fill holes to obtain the relevant components. The following code summarizes these tasks:
f = rgb2gray(imread('http://i.stack.imgur.com/02X9Z.jpg'));   % input image
hm = imhmin(f, h);                            % suppress regional minima shallower than h
o = imopen(hm, strel('line', l, 0));          % opening with a horizontal line of length l
result = imfill(~imregionalmin(o), 'holes');  % minima of o form the background; fill the holes
Now, you need to determine h and l. The parameter h is expected to be easier since it is not related to the scale of the input, and in your example, values in the range [10, 30] work fine. To determine l, maybe a granulometry analysis could help. Another way is to check whether the result contains two significant connected components, corresponding to the bigger L shape and the region of the thin lines. There is no need to increase l one by one; you could perform something that resembles a binary search.
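As a hedged sketch of that last idea, here is a simple scan over l (rather than a true binary search) that stops once the result contains exactly two significant components; the value of h and the area threshold are illustrative assumptions:
f  = rgb2gray(imread('http://i.stack.imgur.com/02X9Z.jpg'));
hm = imhmin(f, 30);
for l = 5:2:41
    o  = imopen(hm, strel('line', l, 0));
    bw = imfill(~imregionalmin(o), 'holes');
    cc = bwconncomp(bw);
    areas = cellfun(@numel, cc.PixelIdxList);
    if nnz(areas > 500) == 2      % two "significant" components found
        break;
    end
end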
Here are the hm, o, and result images with h = 30 and l = 15 (any l in [13, 19] works equally well here). This approach gives flexibility in parameter choice, making it easier to pick good values.
To calculate the area in the space between the two largest components, we could merge them and simply count the black pixels inside the new connected component.
You can pass a window (10x10 pixels?) and collect features for that window. The features could be something as simple as the cumulative gradients (edges) within that window. This would distinguish the various areas as long as the window is big enough.
Then, using each window as a data point, you can do some clustering, or if the patterns don't vary that much you can apply some simple thresholds to determine which data points belong to which pattern (the larger gradient sums belong to the small stripe lines: more edges; the smallest gradient sums belong to the thickest lines: only one edge; and those in between belong to the other "in-between" pattern).
Once you have this classification, you can create separate images if need be.
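A small sketch of that window-based labelling in MATLAB, assuming the Image Processing Toolbox (the 10x10 block size matches the suggestion above, but the thresholds are purely illustrative):
I = im2double(rgb2gray(imread('pattern.jpg')));    % placeholder file name
[gmag, ~] = imgradient(I);
blockSum = blockproc(gmag, [10 10], @(b) sum(b.data(:)));   % one feature per block
labels = ones(size(blockSum));        % 1 = thick L-shaped line (smallest sums)
labels(blockSum > 20) = 2;            % 2 = in-between region
labels(blockSum > 60) = 3;            % 3 = dense thin stripes (largest sums)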
Just throwing out ideas. You can binarize the image and do connected component labelling. Then perform some analysis on the connected components such as width to discriminate between the regions.

Visual similarity search algorithm

I'm trying to build a utility like this http://labs.ideeinc.com/multicolr,
but I don't know which algorithm they are using. Does anyone know?
All they are doing is matching histograms.
So build a histogram for each of your images. Normalize the histograms by the size of the image. A histogram is a vector with as many elements as colors. You don't need 32, 24, and maybe not even 16 bits of accuracy, and that will just slow you down. For performance reasons, I would map the histograms down to 4, 8, and 10-12 bits.
Do a fuzzy least-distance comparison between all the 4-bit histograms and your sample colors.
Then take that set and do the 8 bit histogram compare.
Then maybe go up to a 10 or 12 bit histogram compare with the remaining set. This will be the highest performance search, because you are comparing the total set with a very small number of calculations, to find a small subset.
Then you work on the small subset with a higher number of calculations, etc.
The real big trick is to find the best algorithm for matching similar histograms.
Start with the distance calculation. In 3 dimensions I think it was:
SQRT((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)
I'm doing this from memory, so look it up to make sure.
For your purposes, you will have more than 3 dimensions, so you will have more terms. A 4-bit histogram would have 16 terms, an 8-bit one would have 256 terms, etc. Remember that this kind of math is slow, so don't actually do the SQRT part. If you normalize the size of your images small enough, say down to 10,000 pixels, then you know you will only ever have to do x^2 for values 0..10,000. Pre-calculate a lookup table of x^2 where x goes from 0..10,000. Then your calculations will go fast.
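A minimal sketch of that lookup-table trick, assuming integer histogram counts and images normalised to at most 10,000 pixels:
sq = (0:10000) .^ 2;                                % precomputed x^2 table
histDistSq = @(h1, h2) sum(sq(abs(h1 - h2) + 1));   % +1 for 1-based indexing
% d2 = histDistSq(histA, histB) gives the squared distance with no multiplications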
When you select a color from the palette, just make a histogram with that color = 10,000. When you select 2 colors, make a histogram with color1 = 5,000, color2 = 5,000, etc.
In the end you will have to add in fudge factors to make the application match the real world, but you will find these with testing.
I'd suggest you do some kind of clustering of the colors present in the images in your database. I mean, for each image in your database:
collect the colors of each pixel in the image
perform clustering (let's say k-means clustering with 5 clusters) on the collected colors
store the clustered colors as representative descriptor of the image
When the user provides a set of one or more query colors, you do some kind of greedy matching, choosing the best match between the given colors and the color descriptor (the 5 representative colors) of each image in your database.
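A sketch of the per-image clustering step described above (kmeans is from the Statistics and Machine Learning Toolbox, and the file name is a placeholder):
I = im2double(imread('database_image.jpg'));
pixels = reshape(I, [], 3);                         % one RGB row per pixel
[~, centers] = kmeans(pixels, 5, 'MaxIter', 200);   % 5 representative colors
% store 'centers' (5x3) as the image's color descriptor; at query time, greedily
% match each query color to its nearest centre and sum the distances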
What is the size of your image collection? Depending on the size, search indexing can be a bigger problem than the algorithm itself.
Probably just creating a histogram of the colors used in the images, then doing a best fit to the user-selected colors.

Fitting tag cloud to available space

Given words and their frequencies and an area of screen real estate, what are good approaches to fitting a tag cloud to the space? The two variables I can think of to manipulate are:
Font sizes (both absolute and the gradient)
Number of words
Every approach I can think of requires iteration, like setting an upper bound on the number of words and then using binary search on font sizes until the words just fit the area. I'd rather have an analytical solution.
One complication of my situation is that the clouds are resizable, so the algorithm needs to be able to handle 100x100 pixels or 1000x1000 pixels reasonably well.
Edit: I should have said this is for a rich client application, not the web (hence the possibility of resizing). Also, I was hoping to hear some experience like "nobody ever looks at more than 100 words in a tag cloud, so don't bother displaying them".
What we do in Software Cartographer is
have a maximum font size,
map Math.sqrt(term.frequency) to this range (since words are 2D areas; see the sketch at the end of this answer),
only show the top 30 (or so) terms,
exclude any fine print, ie font size smaller than 6 pt,
sort the terms in the cloud alphabetically.
Alternatives
Instead of showing the top 30, choose the top k such that there are no scroll bars.
Instead of mapping the most frequent word to the max font size, use a global mapping so that word sizes are comparable between clouds (this depends on your use case).
To the best of my knowledge, no empirical studies on term clouds are available (maybe Jonathan Feinberg, of Wordle fame, knows more in that regard).
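A small sketch of the sqrt mapping from the list above; the term frequencies and the 36 pt maximum are made-up example values:
freqs = [120 85 60 33 12 9 4 2];               % illustrative term frequencies
maxFontPt = 36;                                % assumed maximum font size
sorted = sort(freqs, 'descend');
top = sorted(1:min(30, numel(sorted)));        % keep at most the top 30 terms
sizes = maxFontPt * sqrt(top) / sqrt(top(1));  % sqrt because words are 2D areas
keep  = sizes >= 6;                            % exclude fine print below 6 pt
top   = top(keep);  sizes = sizes(keep);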
This sounds like the knapsack problem, but inverted and with more variables. There is no trivial complete solution, but it is likely you will be able to find a heuristic algorithm that comes close to the optimal solution in most cases.
PS: You can only make this work reliably with font sizes measured in pixels. Font sizes measured in pixels are a Bad Thing (TM) in good web design.
You could create a predetermined set of incidence ranges, which could then relate to a font size in your cloud. For example:
0 - 100: 1 em
101 - 500: 1.2 em
501 - 1000: 1.4 em bold
1001 - 1500: 1.8 em bold
1501 - 2000: 2.0 em bold italic/underlined/flashing/whatever etc...
You could scale the cloud by adding a fixed offset to all the ranges based on the size of the container.
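A minimal sketch of that lookup, with the container adjustment expressed as a simple multiplier (how exactly to apply the "fixed offset" is left open above, so this is just one assumption):
function em = tagFontSize(count, containerPx)
    edges = [100 500 1000 1500 Inf];                % upper bound of each incidence range
    sizes = [1.0 1.2 1.4 1.8 2.0];                  % em per range, as listed above
    em = sizes(find(count <= edges, 1, 'first'));
    em = em * (containerPx / 1000);                 % assumed 1000 px reference width
end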

Resources