What is this vehicle locating algorithm doing exactly?

This is an algorithm I found in a paper on vehicle detection in satellite images; it describes a sliding window technique.
I have a basic idea of what it does, but I don't understand how the vehicle locating windows are created. Any help would be great, thanks.
HDNN = Hybrid Deep Neural Network

Here's my interpretation.
1. Divide the image into 32 x 32 pixel windows.
2. Calculate the center of each window and collect all of these windows.
3. Divide the image into 48 x 48 pixel windows and repeat step 2.
4. Divide the image into 64 x 64 pixel windows and repeat step 2.
5. Filter all of the windows and feed the ones that pass the filter into the HDNN. The windows are filtered by:
   rotating every window 11 times, by 0, 4.5, 9, ..., 45 degrees;
   shrinking or enlarging the non-rotated windows to multiple scales: 0.8, 0.9, 1.0, 1.1, 1.2, 1.3.
6. For each generated window we get four preprocessed images: Gray, Gradient, Gradient1 and Gradient2 (see (a), (b), (c) of Fig. 1). These preprocessed images are normalized to 48 x 48 size and a [0, 255] gray range.
7. All of these rotated, shrunk or enlarged preprocessed images are stored in their database and sent to the HDNN.
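For what it's worth, here is a minimal sketch of how I read the windowing and preprocessing steps (steps 1-4 and 6 above), using OpenCV. The function names and the non-overlapping stride are my own assumptions, not taken from the paper:

import cv2

def generate_windows(image, sizes=(32, 48, 64)):
    # Tile the image with square blocks of each size, record each block's
    # center, and keep the cropped patch (non-overlapping stride assumed).
    windows = []
    h, w = image.shape[:2]
    for s in sizes:
        for y in range(0, h - s + 1, s):
            for x in range(0, w - s + 1, s):
                center = (x + s // 2, y + s // 2)
                windows.append((center, image[y:y + s, x:x + s]))
    return windows

def preprocess(patch):
    # Normalize a candidate patch to 48 x 48 grayscale in [0, 255],
    # roughly matching the paper's description of the Gray channel.
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (48, 48))
    return cv2.normalize(resized, None, 0, 255, cv2.NORM_MINMAX)

The rotated and rescaled variants described in step 5 would then be generated from each cropped window before this normalization.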

Related

Align feature map with ego motion (problem of zooming ratio)

I want to align the feature map using ego motion, as mentioned in the paper An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds.
I use VoxelNet as the backbone, which shrinks the image by a factor of 8. The size of my voxels is 0.1 m x 0.1 m x 0.2 m (height).
So given an input bird's-eye-view image of size 1408 x 1024,
the extracted feature map size would be 176 x 128, shrunk by a factor of 8.
The ego translation of the car between the "images" (point clouds, actually) is 1 meter in both the x and y directions. Am I right to adjust the feature map by 1.25 pixels?
1m/0.1m = 10 # meters to pixels
10/8 = 1.25 # shrink ratio of the network
However, through experiments, I found the feature maps align better if I adjust the feature map by only 1/32 of a pixel for the 1 meter translation in the real world.
P.S. I am using the function torch.nn.functional.affine_grid to perform the translation, which takes a 2x3 affine matrix as input.
It's caused by the function torch.nn.functional.affine_grid I used; I didn't fully understand this function before using it.
These illustrations are very helpful for showing what this function actually does (in comparison with the affine transformations in NumPy).
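The point that tripped me up, as a minimal sketch: F.affine_grid expects the translation in coordinates normalized to [-1, 1] over the whole feature map, not in pixels, so a pixel shift has to be rescaled before it goes into the 2x3 matrix. The helper name and sign convention below are my own; the sign depends on whether you think of moving the content or the sampling grid:

import torch
import torch.nn.functional as F

def shift_feature_map(fmap, dx_pixels, dy_pixels):
    # affine_grid works in [-1, 1] normalized coordinates, so a shift of
    # p pixels along an axis of size S corresponds to 2 * p / S, not p.
    n, c, h, w = fmap.shape
    tx = 2.0 * dx_pixels / w
    ty = 2.0 * dy_pixels / h
    theta = torch.tensor([[1.0, 0.0, tx],
                          [0.0, 1.0, ty]]).unsqueeze(0).repeat(n, 1, 1)
    grid = F.affine_grid(theta, fmap.size(), align_corners=False)
    # grid_sample reads the input at the transformed positions, so the
    # visible content moves in the opposite direction of (tx, ty).
    return F.grid_sample(fmap, grid, align_corners=False)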

How does the Viola-Jones (Haar cascade) sliding/scaling detector work during training and during detection?

I have read the original Viola-Jones article, the Wikipedia article, the OpenCV manual and these SO answers:
How does the Viola-Jones face detection method work?
Defining an (initial) set of Haar Like Features
I am trying to implement my own version of the two detectors in the original article (the AdaBoost 200-feature version and the final cascade version), but something is missing in my understanding, specifically regarding how the 24 x 24 detector works on the entire image and (maybe) its sub-images at (maybe) different scales. To my understanding, during detection:
(1) The integral image is computed twice, for image variance normalization: once as is, and once on the squared image:
The variance of an image sub-window can be computed quickly using a
pair of integral images.
(2) The 24 x 24 square detector is moved across the normalized image in steps of 1 pixel, deciding for each square whether it is a face (or a different object) or not:
The final detector is scanned across the image at multiple scales and
locations.
(3) Then the image is scaled down by a factor of 1.25, and we go back to (1).
This is done 12 times, until the smaller side of the image rectangle is 24 pixels long (288 in the original image divided by 1.25^(12 - 1) is just over 24):
a 384 by 288 pixel image is scanned at 12 scales each a factor of 1.25
larger than the last.
But then I see this quote in the article:
Scaling is achieved by scaling the detector itself, rather
than scaling the image. This process makes sense because the features
can be evaluated at any scale with the same cost. Good detection
results were obtained using scales which are a factor of
1.25 apart.
And I see this quote in the Wikipedia article:
Instead of scaling the image itself (e.g. pyramid-filters), we scale the features.
And I'm thrown off as to what exactly is going on. How can the 24 x 24 detector be scaled? The entire set of features calculated on this area (whether in the 200-feature version or the ~6K-feature cascade version) is based on originally exploring the 162K possible features of this 24 x 24 rectangle. And why did I come away thinking that the pyramid paradigm still holds for this algorithm?
What should change in the above algorithm? Is it the image that is scaled or the 24 x 24 detector, and if it is the detector, how exactly is that done?
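As I read the quoted passages, "scaling the detector" means scaling the rectangles of each Haar feature and evaluating them directly on the original integral image, so no image pyramid is needed. A rough sketch of that idea (the helper names are my own, and the area normalization is an assumption to keep thresholds comparable across scales):

import numpy as np

def integral_image(img):
    # Summed-area table with a leading row/column of zeros for easy lookups.
    padded = np.pad(img.astype(np.int64), ((1, 0), (1, 0)))
    return padded.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    # Sum of pixels in the rectangle with top-left (x, y) and size (w, h).
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def eval_scaled_feature(ii, rects, x0, y0, scale):
    # rects: list of (x, y, w, h, weight) defined on the 24 x 24 base window.
    # "Scaling the detector" means scaling this geometry, not the image,
    # and the cost per feature is the same at every scale.
    total = 0.0
    for x, y, w, h, weight in rects:
        sx, sy = x0 + int(round(x * scale)), y0 + int(round(y * scale))
        sw, sh = int(round(w * scale)), int(round(h * scale))
        total += weight * rect_sum(ii, sx, sy, sw, sh)
    # Area normalization (an assumption here) keeps the feature value
    # comparable to the threshold learned at the 24 x 24 base scale.
    return total / (scale * scale)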

How to determine if an image needs to be rotated

I am trying to find a way to determine whether an image needs to be rotated in order for the text to be horizontally aligned, and if it does need to be rotated, by how many degrees.
I am sending the images to tesseract and for tesseract to be effective, the text in the images needs to be horizontally aligned.
I'm looking for a way do this without depending on the "Orientation" metadata in the image.
I've thought of the following ways to do this:
Rotate the image 90 degrees clockwise four times and send all four images to tesseract. This isn't ideal because of the need to process one image 4 times.
Use the Hough line transform to see if the lines are vertical or horizontal. If they are vertical, rotate the image. This way the image might still need to be rotated 180 degrees, so I'm unsure how effective this would be.
I'm wondering if there are other ways to accomplish this using OpenCV, ImageMagick or any other image processing techniques.
If you have a thousand images labeled as horizontal or vertical, you can resize them to 224x224 and then fine-tune a convolutional neural network, like AlexNet or VGG, for this task. If you want to know how many right rotations to make for the image, you can set the labels to the number of clockwise rotations, i.e. 0, 1, 2, 3.
http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
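A minimal sketch of that setup, using PyTorch and torchvision instead of the Caffe recipe linked above (the framework choice, the weights enum and the training step are my own assumptions, not part of the answer):

import torch
import torch.nn as nn
from torchvision import models

# Pretrained AlexNet with a 4-way head: one class per number of
# clockwise 90-degree rotations needed (0, 1, 2, 3).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 4)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):
    # images: (N, 3, 224, 224) float tensor, labels: rotation count per image
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()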
Attempting OCR on all 4 orientations seems like a reasonable choice, and I doubt you will find a more reliable heuristic.
If speed is an issue, you could OCR a small part of the image first. Select a rectangular region that has the proper amount of edge pixels and white/black ratio for text, then send that to tesseract in different orientations. With a small region, you could even try smaller steps than 90°, or combine it with another heuristic like Hough.
If you remember the most likely orientation based on previous images, and stop once an orientation is successfully processed by tesseract, you probably do not even have to try most orientations in most cases.
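A rough sketch of the try-all-four-orientations idea in Python, scoring each candidate by Tesseract's word confidences via pytesseract (the scoring heuristic and the function name are my own, not a standard recipe):

import cv2
import pytesseract
from pytesseract import Output

ROTATIONS = [None, cv2.ROTATE_90_CLOCKWISE, cv2.ROTATE_180,
             cv2.ROTATE_90_COUNTERCLOCKWISE]

def clockwise_rotations_needed(img):
    # Try each 90-degree orientation and keep the one whose mean word
    # confidence from Tesseract is highest.
    best_k, best_score = 0, -1.0
    for k, rot in enumerate(ROTATIONS):
        candidate = img if rot is None else cv2.rotate(img, rot)
        data = pytesseract.image_to_data(candidate, output_type=Output.DICT)
        confs = [float(c) for c in data["conf"] if float(c) >= 0]
        score = sum(confs) / len(confs) if confs else 0.0
        if score > best_score:
            best_k, best_score = k, score
    return best_k  # number of clockwise 90-degree rotations to apply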
You can figure this out in a terminal with tesseract's psm option.
tesseract --psm 0 "infile" "outfile" will create outfile.osd which contains the info:
Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 27.93
Script: Latin
Script confidence: 6.55
man tesseract
...
--psm N
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
...
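If you want to drive this from Python rather than the terminal, a small sketch using pytesseract's OSD wrapper, which runs the same --psm 0 analysis; the parsing helper is my own:

import re
import pytesseract
from PIL import Image

def osd_rotation(path):
    # image_to_osd returns the same text block shown above;
    # the "Rotate:" line is the clockwise rotation to apply.
    osd = pytesseract.image_to_osd(Image.open(path))
    match = re.search(r"Rotate:\s*(\d+)", osd)
    return int(match.group(1)) if match else 0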

Processing images bigger than 4096px in Core Image for iOS 8

I am trying to write a Core Image filter with a custom kernel for iOS 8. My code works correctly for images smaller than 4096x4096. For bigger images I get a "tile" effect; in other words, it looks like the image is split into several sub-images and the filter runs independently on each sub-image.
Here is a simple sample kernel:
kernel vec4 effectFoo(__sampler source_image) {
    return sample(source_image, samplerTransform(source_image, destCoord() + vec2(200.0, 200.0)));
}
I apply it to a big image of size 6000x4500. I expect the image to be translated by a (200, 200) offset, with extrapolation by the border color on two sides. However, the result is "tiled": each tile has been translated and extrapolated independently.
Using logging in the ROI callback, I found evidence that Core Image splits the image into 4 tiles of different sizes. The biggest tile was 4096x4096.
Thus there are two problems for me:
I can't get coordinates in either the full input image or the full output image.
Core Image extrapolates tiles without respect to neighbouring tiles.

Not getting GLCM matrix dimensions for an 8-bit grayscale image

As theory states, a GLCM is said to have dimensions of 2^x by 2^x, where x is the grayscale bit depth of the image. My problem is that I get an 8 by 8 matrix instead of a 2^8 by 2^8 matrix when I run it on an 8-bit grayscale image.
Could someone please help me out?
According to MATLAB documentation,
graycomatrix calculates the GLCM from a scaled version of the image.
By default, if I is a binary image, graycomatrix scales the image to
two gray-levels. If I is an intensity image, graycomatrix scales the
image to eight gray-levels. You can specify the number of gray-levels
graycomatrix uses to scale the image by using the 'NumLevels'
parameter, and the way that graycomatrix scales the values using the
'GrayLimits' parameter — see Parameters.
In short, you need to run the function as follows:
glcm = graycomatrix(I, 'NumLevels', 2^8);
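For comparison, the same idea expressed with scikit-image in Python, assuming its graycomatrix function (named greycomatrix in older releases); passing levels=256 explicitly makes the quantization intent clear:

import numpy as np
from skimage.feature import graycomatrix

# Stand-in 8-bit image; with levels=256 the GLCM is 256 x 256 per
# (distance, angle) pair rather than a coarsely quantized 8 x 8.
img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
glcm = graycomatrix(img, distances=[1], angles=[0], levels=256)
print(glcm.shape)  # (256, 256, 1, 1)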
