I want to align the feature map using ego motion, as mentioned in the paper An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds
I use VoxelNet as backbone, which will shrink the image for 8 times. The size of my voxel is 0.1m x 0.1m x 0.2m(height)
So given an input bird-eye-view image size of 1408 x 1024,
the extracted feature map size would be 176 x 128, shrunk for 8 times.
The ego translation of the car between the "images"(point clouds actually) is 1 meter in both x and y direction. Am I right to adjust the feature map for 1.25 pixels?
1m/0.1m = 10 # meter to pixel
10/8 = 1.25 # shrink ratio of the network
However, though experiments, I found the feature maps align better if I adjust the feature map with only 1/32 pixel for the 1 meter translation in real world.
Ps. I am using the function torch.nn.functional.affine_grid to perform the translation, which takes a 2x3 affine matrix as input.
It's caused by the function torch.nn.functional.affine_grid I used.
I didn't fully understand this function before I use it.
These vivid images would be very helpful on showing what this function actually do(with comparison to the affine transformations in Numpy.
Related
I have read the original Viola Jones article, the Wikipedia article, the openCV manual and these SO answers:
How does the Viola-Jones face detection method work?
Defining an (initial) set of Haar Like Features
I am trying to implement my own version of the two detectors in the original article (the Adaboost-200-features version and the final cascade version), but something is missing in my understanding, specifically with how the 24 x 24 detector works on the entire image and (maybe) its sub-images on (maybe) different scales. To my understanding, during detection:
(1) The image integral is computed twice, for image variance normalization, once as is, once squared:
The variance of an image sub-window can be computed quickly using a
pair of integral images.
(2) The 24 x 24 square detector is moved across the normalized image in steps of 1 pixel, deciding for each square whether it is a face (or a different object) or not:
The final detector is scanned across the image at multiple scales and
locations.
(3) Then the image is scaled to be 1.25 smaller, and we go back to (1)
This is done 12 times until the smaller side of the image rectangle is 24 pixels long (288 in original image divided by (1.25 ^ (12 - 1)) is just over 24):
a 384 by 288 pixel image is scanned at 12 scales each a factor of 1.25
larger than the last.
But then I see this quote in the article:
Scaling is achieved by scaling the detector itself, rather
than scaling the image. This process makes sense because the features
can be evaluated at any scale with the same cost. Good detection
results were obtained using scales which are a factor of
1.25 apart.
And I see this quote in the Wikipedia article:
Instead of scaling the image itself (e.g. pyramid-filters), we scale the features.
And I'm thrown off understanding what exactly is going on. How can the 24 x 24 detector be scaled? The entire set of features calculated on this area (whether in the 200-features version or the ~6K-cascade version) is based on originally exploring the 162K possible features of this 24 x 24 rectangle. And why did I get that the pyramid-paradigm still holds for this algorithm?
What should change in the above algorithm? Is it the image which is scaled or the 24 x 24 detector, and if it is the detector how is it exactly done?
I am trying to compute the integral image (aka summed area table) of a texture I have in the GPU memory (a camera capture), the goal being to compute the adaptive threshold of said image. I'm using OpenGL ES 2.0, and still learning :).
I did a test with a simple gaussian blur shader (vertical/horizontal pass), which is working fine, but I need a way bigger variable average area for it to give satisfactory results.
I did implement a version of that algorithm on CPU before, but I'm a bit confused on how to implement that on a GPU.
I tried to do a (completely incorrect) test with just something like this for every fragment :
#version 100
#extension GL_OES_EGL_image_external : require
precision highp float;
uniform sampler2D u_Texture; // The input texture.
varying lowp vec2 v_TexCoordinate; // Interpolated texture coordinate per fragment.
uniform vec2 u_PixelDelta; // Pixel delta
void main()
{
// get neighboring pixels values
float center = texture2D(u_Texture, v_TexCoordinate).r;
float a = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, 0.0)).r;
float b = texture2D(u_Texture, v_TexCoordinate + vec2(0.0, u_PixelDelta.y * 1.0)).r;
float c = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, u_PixelDelta.y * 1.0)).r;
// compute value
float pixValue = center + a + b - c;
// Result stores value (R) and original gray value (G)
gl_FragColor = vec4(pixValue, center, center, 1.0);
}
And then another shader to get the area that I want and then get the average. This is obviously wrong as there's multiple execution units operating at the same time.
I know that the common way of computing a prefix sum on a GPU is to do it in two pass (vertical/horizontal, as discussed here on this thread or or here), but isn't there a problem here as there is a data dependency on each cell from the previous (top or left) one ?
I can't seem to understand the order in which the multiple execution units on a GPU will process the different fragments, and how a two-pass filter can solve that issue. As an example, if I have some values like this :
2 1 5
0 3 2
4 4 7
The two pass should give (first columns then rows):
2 1 5 2 3 8
2 4 7 -> 2 6 13
6 8 14 6 14 28
How can I be sure that, as an example, the value [0;2] will be computed as 6 (2 + 4) and not 4 (0 + 4, if the 0 hasn't been computed yet) ?
Also, as I understand that fragments are not pixels (If I'm not mistaken), would the values I store back in one of my texture in the first pass be the same in another pass if I use the exact same coordinates passed from the vertex shader, or will they be interpolated in some way ?
Tommy and Bartvbl address your questions about a summed-area table, but your core problem of an adaptive threshold may not need that.
As part of my open source GPUImage framework, I've done some experimentation with optimizing blurs over large radii using OpenGL ES. Generally, increasing blur radii leads to a significant increase in texture sampling and calculations per pixel, with an accompanying slowdown.
However, I found that for most blur operations you can apply a surprisingly effective optimization to cap the number of blur samples. If you downsample the image before blurring, blur at a smaller pixel radius (radius / downsampling factor), and then linearly upsample, you can arrive at a blurred image that is the equivalent of one blurred at a much larger pixel radius. In my tests, these downsampled, blurred, and then upsampled images look almost identical to the ones blurred based on the original image resolution. In fact, precision limits can lead to larger-radii blurs done at a native resolution breaking down in image quality past a certain size, where the downsampled ones maintain the proper image quality.
By adjusting the downsampling factor to keep the downsampled blur radius constant, you can achieve near constant-time blurring speeds in the face of increasing blur radii. For a adaptive threshold, the image quality should be good enough to use for your comparisons.
I use this approach in the Gaussian and box blurs within the latest version of the above-linked framework, so if you're running on Mac, iOS, or Linux, you can evaluate the results by trying out one of the sample applications. I have an adaptive threshold operation based on a box blur that uses this optimization, so you can see if the results there are what you want.
AS per the above, it's not going to be fantastic on a GPU. But assuming the cost of shunting data between the GPU and CPU is more troubling it may still be worth persevering.
The most obvious prima facie solution is to split horizontal/vertical as discussed. Use an additive blending mode, create a quad that draws the whole source image then e.g. for the horizontal step on a bitmap of width n issue a call that requests the quad be drawn n times, the 0th time at x = 0, the mth time at x = m. Then ping pong via an FBO, switching the target of buffer of the horizontal draw into the source texture for the vertical.
Memory accesses are probably O(n^2) (i.e. you'll probably cache quite well, but that's hardly a complete relief) so it's a fairly poor solution. You could improve it by divide and conquer by doing the same thing in bands — e.g. for the vertical step, independently sum individual rows of 8, after which the error in every row below the final is the failure to include whatever the sums are on that row. So perform a second pass to propagate those.
However an issue with accumulating in the frame buffer is clamping to avoid overflow — if you're expecting a value greater than 255 anywhere in the integral image then you're out of luck because the additive blending will clamp and GL_RG32I et al don't reach ES prior to 3.0.
The best solution I can think of to that, without using any vendor-specific extensions, is to split up the bits of your source image and combine channels after the fact. Supposing your source image were 4 bit and your image less than 256 pixels in both directions, you'd put one bit each in the R, G, B and A channels, perform the normal additive step, then run a quick recombine shader as value = A + (B*2) + (G*4) + (R*8). If your texture is larger or smaller in size or bit depth then scale up or down accordingly.
(platform specific observation: if you're on iOS then you've hopefully already got a CVOpenGLESTextureCache in the loop, which means you have CPU and GPU access to the same texture store, so you might well prefer to kick this step off to GCD. iOS is amongst the platforms supporting EXT_shader_framebuffer_fetch; if you have access to that then you can write any old blend function you like and at least ditch the combination step. Also you're guaranteed that preceding geometry has completed before you draw so if each strip writes its totals where it should and also to the line below then you can perform the ideal two-pixel-strips solution with no intermediate buffers or state changes)
What you attempt to do cannot be done in a fragment shader. GPU's are by nature very different to CPU's by executing their instructions in parallel, in massive numbers at the same time. Because of this, OpenGL does not make any guarantees about execution order, because the hardware physically doesn't allow it to.
So there is not really any defined order other than "whatever the GPU thread block scheduler decides".
Fragments are pixels, sorta-kinda. They are pixels that potentially end up on screen. If another triangle ends up in front of another, the previous calculated colour value is discarded. This happens regardless of whatever colour was stored at that pixel in the colour buffer previously.
As for creating the summed area table on the GPU, I think you may first want to look at GLSL "Compute Shaders", which are specifically made for this sort of thing.
I think you may be able to get this to work by creating a single thread for each row of pixels in the table, then have every thread "lag behind" by 1 pixel compared to the previous row.
In pseudocode:
int row_id = thread_id()
for column_index in (image.cols + image.rows):
int my_current_column_id = column_index - row_id
if my_current_column_id >= 0 and my_current_column_id < image.width:
// calculate sums
The catch of this method is that all threads should be guaranteed to execute their instructions simultaneously without getting ahead of one another. This is guaranteed in CUDA, but I'm not sure whether it is in OpenGL compute shaders. It may be a starting point for you, though.
It may look surprising for the beginner but the prefix sum or SAT calculation is suitable for parallelization. As the Hensley algorithm is the most intuitive to understand (also implemented in OpenGL), more work-efficient parallel methods are available, see CUDA scan. The paper from Sengupta discuss parallel method which seems state-of-the-art efficient method with reduce and down swap phases. These are valuable materials but they do not enter OpenGL shader implementations in detail. The closest document is the presentation you have found (it refers to Hensley publication), since it has some shader snippets. This is the job which is doable entirely in fragment shader with FBO Ping-Pong. Note that the FBO and its texture need to have internal format set to high precision - GL_RGB32F would be best but I am not sure if it is supported in OpenGL ES 2.0.
I try to build a structured light environment to do 3D scanning.
As far as I know, if I choose to use gray code to reconstruct a 3D model, I have to implement specific patterns that were encode in power 2(2^x, x = 0 ~ 10).
That is said, the patterns must be at least 1024 x 1024 in resolution.
What if my DLP projector only support resolution up to 800 x 480? It projects Moire pattern when the gray code pattern resolution becomes too high(I tried). What should I do?
My friends suggest that I create 1024 x 1024 patterns, and "crop" them into 800 x 480,
but I thought the gray code should follow specific sequence and patterns, my friends suggestion will create several image that is not symmetry.
Does anyone have the same experience like me?
----------2015.8.4 Update Question----------
I was thinking that if my projector can't perfectly projects high resolution patterns, can I just let it projects the patterns with low resolution, for instance, from 2^0 to 2^6?
Or the gray code strictly demands patterns from 2^0 to 2^10? Otherwise gray code is not available?
you can not directly scale down to your resolution
because it would distort the pattern make it useless
instead you can:
crop it to your resolution
but you need to handle that in the scanning part too because you would not have the full pattern available
use nearest usable power of 2 resolution
like 512x256 and create pattern for it. The rest of the space is unused (wasting pixels)
use bullet #2 + scale up to fit your resolution better
so create pattern 512x256 and linearly scale to fit to 800x480 as much as you can so:
800/512 = 1.5625
480/256 = 1.8750
use the smaller scale (512x256 * 1.5625 -> 800x400) so scale the pattern by 1.5625 and use that as a pattern image
this is scaled by nearest neighbor to avoid subpixel grayscale colors which are harder to detect. This will waste less pixels but it can lower the precision of 3D scan !!!
This is how I generate my pattern in C++ and VCL:
// [generate pattern xs*ys power of 2 resolution]
// clear buffer
bmp->Canvas->Brush->Color=clBlack;
bmp->Canvas->FillRect(TRect(0,0,xs,ys));
int x,y,a,da;
for (da=0;1<<da<xs;da++); // number of bits per x resolution
for (a=0,y=0;y<ys;y++,a=(y*da)/ys)
for (x=0;x<xs;x++)
if (int((x>>a)&1)==0) pyx[ys-1-y][x]=0x00FFFFFF;
bmp->SaveToFile("3D_scann_pattern0.bmp");
bmp is VCL bitmap
xs,ys is the resolution of a bitmap
p[ys][xs] is direct 32bit pixel access to bitmap
This is slight differently encoded then your pattern !!!
[Notes]
If you need precision use bullet #2
If you need to cover bigger area use bullet #3
You can also scale in y axis differently then in x axis as this is just 1D encoding
I have to remove gaussian noise from this image (before, I had to filter it and add the noise). Then, I have to use function "o" and my grade is based on how low result of this function will be. I am trying and trying different things, but I can't remove this noise so I can get a good grade :/ any help please?
img=imread('liftingbody.png');
img=double(img)/255;
maska1=[1 1 1; 1 5 1; 1 1 1]/13;
odfiltrowany=imfilter(img,maska1);
zaszumiony=imnoise(odfiltrowany,'gaussian');
nowy=wiener2(zaszumiony);
nowy4=medfilt2(nowy);
o=1/512.*sqrt(sum(sum(img-nowy4).^2));
subplot(311); imshow(img);
subplot(312); imshow(zaszumiony);
subplot(313); imshow(nowy);
Try convoluting a Gaussian filter with your noisy image to remove Gaussian noise like below:
nowx=conv2(zaszumiony,fspecial('gaussian',[3 3],1.5),'same')/(sum(sum(fspecial('gaussian',[3 3],1.5))));
It should reduce your o function somewhat.
Try playing around with the strength of the filter (i.e. the 1.5 value) and the size of the kernel (i.e. [3 3] value) to reduce the noise to a minimum.
Adding to #ALM865's answer, you can also use imfilter. In fact, this is the recommended function that you use for images as imfilter has optimizations in place specifically for images. conv2 is the more general function for any 2D signal.
I have also answered how to choose the standard deviation and ultimately the size of your a Gaussian filter / kernel here: By which measures should I set the size of my Gaussian filter in MATLAB?
In essence, once you choose which standard deviation you want, you find a floor(6*sigma) + 1 x floor(6*sigma) + 1 Gaussian kernel to use in your filtering operation. Assuming that sigma = 2, you would get a 13 x 13 kernel. As ALM865 has said, you can create a Gaussian kernel using fspecial. You specify the 'gaussian' flag, followed by the size of the kernel and the standard deviation after. As such:
sigma = 2;
width = 6*sigma + 1;
kernel = fspecial('gaussian', [width width], sigma);
out = imfilter(zaszumiony, kernel, 'replicate');
imfilter takes in the image you want to filter, the convolution kernel you want to use to filter the image, and an optional flag that specifies what happens along the image pixel borders when the kernel doesn't fit completely inside the image. 'replicate' means that it simply copies the pixels along the borders, thus replicating them. There are other options, such as padding with a value (usually zero), circular padding and symmetric padding.
Play around with the standard deviation until you get what you believe is a good result.
I have an image (logical values), like this
I need to get this image resampled from pixel to mm or cm; this is the code I use to get the resampling:
function [ Ires ] = imresample3( I, pixDim )
[r,c]=size(I);
x=1:1:c;
y=1:1:r;
[X,Y]=meshgrid(x,y);
rn=r*pixDim;
cn=c*pixDim;
xNew=1:pixDim:cn;
yNew=1:pixDim:rn;
[Xnew,Ynew]=meshgrid(xNew,yNew);
Id=double(I);
Ires=interp2(X,Y,Id,Xnew,Ynew);
end
What I get is a black image. I suspect that this code does something that is not what I have in mind: it seems to take only the upper-left part of the image.
What I want is, instead, to have the same image on a mm/cm scale: what I expect is that every white pixel should be mapped from the original position to the new position (in mm/cm); what happen is certainly not what I expect.
I'm not sure that interp2 is the right command to use.
I don't want to resize the image, I just want to go from pixel world to mm/cm world.
pixDim is of course the dimension of the image pixel, obtained dividing the height of the ear in cm by the height of the ear in mm (and it is on average 0.019 cm).
Any ideas?
EDIT: I was quite sure that the code had no sense, but someone told me to do that way...anyway, if I have two edged ears, I need first to scale both the the real dimension and then perform some operations on them. What I mean with "real dimension" is that if one has size 6.5x3.5cm and the other has size 6x3.2cm, I need to perform operations on this dimensions.
I don't get how can I move from the pixel dimension to cm dimension BEFORE doing operation.
I want to move from one world to the other because I want to get rid of the capturing distance (because I suppose that if a picture of the ear is taken near and the other is taken far, they should have different size in pixel dimension).
Am I correct? There is a way to do it? I thought I can plot the ear scaling the axis, but then I suppose I cannot subtract one from the other, right?
Matlab does not use units. To apply your factor of 0.019cm/pixel you have to scale by a factor of 0.019 to have a 1cm grid, but this would cause any artefact below a size of 1cm to be lost.
Best practice is to display the data using multiple axis, one for cm and one for pixels. It's explained here: http://www.mathworks.de/de/help/matlab/creating_plots/using-multiple-x-and-y-axes.html
Any function processing the data should be independent of the scale or use the scale factor as an input argument, everything else is a sign of some serious algorithmic issues.