How do successive convolutional layers work? (filters)

If my first convolution has 64 filters and my second has 32 filters,
will I have:
1 image -> Conv(64 filters) -> 64 filtered images -> Conv(32 filters) -> 64 x 32 = 2048 filtered images
Or:
1 image -> Conv(64 filters) -> 64 filtered images -> Conv(32 filters) -> 32 filtered images
If it is the second answer: what goes on between the 64 filtered images and the second Conv?
Thanks for your answer; I can't find a good tutorial that explains this clearly, it is always rushed...

Your first point is correct. Convolutions are essentially ways of altering and extracting features from data. We do this by creating m images, each looking at a certain frame of the original image. In the second convolutional layer, we then make n images for each of the m images produced by the first layer.
So m * n would be the total number of images.
To further this point: a convolution works by making feature maps of an image. When you have successive convolutional layers, you are making feature maps of feature maps. I.e. if I start with 1 image and my first convolutional layer has 20 filters, then I have 20 images (more specifically, feature maps) at the end of convolution 1. Then let's say I add a second convolution with 10 filters. I am then making 10 feature maps for every one of those images, so that would be 20 * 10 = 200 feature maps.
Let's say, for example, you have a 50x50 pixel image and a convolutional layer with a filter of size 5x5. What happens (if you don't have padding or anything else) is that you "slide" the filter across the image and compute a weighted sum of the pixels at each position of the slide. You would then get an output feature map of size 46x46, since the 5x5 filter fits into (50 - 5) + 1 = 46 positions in each direction. Let's say you do this 20 times (i.e. a 5x5x20 convolution): you would then have as output 20 feature maps of size 46x46. In the diagram mentioned in the VGG neural network post below, the diagram only shows the number of feature maps to be made from the incoming feature maps, NOT the total count of feature maps.
I hope this explanation was thorough!
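To make the sliding-window arithmetic concrete, here is a minimal PyTorch sketch of that single layer (stride 1, no padding, sizes as in the example above):
import torch
import torch.nn as nn

x = torch.randn(1, 1, 50, 50)                                     # one single-channel 50x50 image
conv = nn.Conv2d(in_channels=1, out_channels=20, kernel_size=5)   # 20 filters of size 5x5

y = conv(x)
print(y.shape)   # torch.Size([1, 20, 46, 46])
# Each spatial dimension shrinks to (50 - 5) / 1 + 1 = 46 without padding.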

Here we have the architecture of VGG-16.
In VGG-16 we have convolutions with 64, 128, 256 and 512 filters.
And in the architecture we see that we don't have 64 images, then 64*128 images, etc.,
but just 64 images, 128 images, etc.
So the right answer was not the first but the second. And it implies my second question:
"What goes on between the 64 filtered images and the second Conv?"
I think that between a 64 conv and a 32 conv there is finally only 1 filter for every two-channel slice, so it divides the thickness of the conv by 2.
And between a 64 conv and a 128 conv there are 2 filters for every one-channel slice, so it multiplies the thickness of the conv by 2.
Am I right?
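To see what the second convolution actually receives, here is a minimal PyTorch sketch (the image size and kernel size are arbitrary; only the filter counts match the question):
import torch
import torch.nn as nn

x = torch.randn(1, 1, 224, 224)        # one grayscale input image

conv1 = nn.Conv2d(in_channels=1,  out_channels=64, kernel_size=3, padding=1)
conv2 = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3, padding=1)

y1 = conv1(x)
y2 = conv2(y1)

print(y1.shape)            # torch.Size([1, 64, 224, 224])
print(y2.shape)            # torch.Size([1, 32, 224, 224]) -> 32 maps, not 64*32
print(conv2.weight.shape)  # torch.Size([32, 64, 3, 3]): each of the 32 filters
                           # spans all 64 incoming maps and sums over them
Nothing is dropped between the 64 filtered images and the second Conv: each filter of the second layer simply has depth 64, so it reads all 64 maps and collapses them into one output map.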

Related

Align feature map with ego motion (problem of zooming ratio)

I want to align the feature map using ego motion, as mentioned in the paper An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds.
I use VoxelNet as the backbone, which shrinks the image by a factor of 8. The size of my voxels is 0.1m x 0.1m x 0.2m (height).
So given an input bird-eye-view image size of 1408 x 1024,
the extracted feature map size would be 176 x 128, shrunk by a factor of 8.
The ego translation of the car between the "images" (point clouds, actually) is 1 meter in both the x and y directions. Am I right to shift the feature map by 1.25 pixels?
1m/0.1m = 10 # meter to pixel
10/8 = 1.25 # shrink ratio of the network
However, through experiments, I found the feature maps align better if I shift the feature map by only 1/32 of a pixel for the 1 meter translation in the real world.
P.S. I am using the function torch.nn.functional.affine_grid to perform the translation, which takes a 2x3 affine matrix as input.
It's caused by the function torch.nn.functional.affine_grid I used.
I didn't fully understand this function before I used it.
These vivid images are very helpful for showing what this function actually does (with a comparison to the affine transformations in NumPy).
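For reference, here is a small sketch (an illustration, not the paper's method) of how a pixel shift can be expressed in the normalized coordinates that torch.nn.functional.affine_grid expects; the 176 x 128 feature-map size is taken from the question, and the sign of the shift may need flipping depending on which way you want the content to move:
import torch
import torch.nn.functional as F

feat = torch.randn(1, 64, 128, 176)     # (N, C, H, W); 64 channels is arbitrary
shift_x_px, shift_y_px = 1.25, 1.25     # desired shift in feature-map pixels

# affine_grid works in normalized coordinates where the full width/height
# spans [-1, 1], so a shift of p pixels corresponds to 2 * p / size.
N, C, H, W = feat.shape
tx = 2.0 * shift_x_px / W
ty = 2.0 * shift_y_px / H

theta = torch.tensor([[[1.0, 0.0, tx],
                       [0.0, 1.0, ty]]])                  # (1, 2, 3) affine matrix
grid = F.affine_grid(theta, feat.shape, align_corners=False)
shifted = F.grid_sample(feat, grid, align_corners=False)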

How to determine the number of bytes necessary to store an uncompressed grayscale image of size 8000 × 3400 pixels?

This is all of the information I was provided in the practice question. I am trying to figure out how to calculate it when prompted to do so on an exam...
How to determine the number of bytes necessary to store an uncompressed grayscale image of size 8000 × 3400 pixels?
I am also curious how the calculation changes if the image is a compressed binary image.
"I am trying to figure out how to calculate it when prompted to do so on an exam."
There are 8 bits to make 1 byte, so once you know how many bits-per-pixel (bpp) you have, this is a very simple calculation.
For 8 bits per pixel greyscale, just multiply the width by the height.
8000 * 3400 = 27200000 bytes.
For 1 bit per pixel black&white, multiply the width by the height and then divide by 8.
(8000 * 3400) / 8 = 3400000 bytes.
It's critical that the image is uncompressed, and that there's no padding at the end of each raster line. Otherwise the count will be off.
The first thing to work out is how many pixels you have. That is easy, it is just the width of the image multiplied by the height:
N = w * h
So, in your case:
N = 8000 * 3400 = 27200000 pixels
Next, in general you need to work out how many samples (S) you have at each of those 27200000 pixel locations in the image. That depends on the type of the image:
if the image is greyscale, you will have a single grey value at each location, so S=1
if the image is greyscale and has transparency as well, you will have a grey value plus a transparency (alpha) value at each location, so S=2
if the image is colour, you will have three samples for each pixel - one Red sample, one Green sample and one Blue sample, so S=3
if the image is colour and has transparency as well, you will get the 3 RGB values plus a transparency (alpha) value for each pixel, so S=4
there are others, but let's not get too complicated
The final piece of the jigsaw is how big each sample is, or how much storage it takes, i.e. the bytes per sample (B).
8-bit data takes 1 byte per sample, so B=1
16-bit data takes 2 bytes per sample, so B=2
32-bit floating point or integer data take 4 bytes per sample, so B=4
there are others, but let's not get too complicated
So, the actual answer for an uncompressed greyscale image is:
storage required = w * h * S * B
and in your specific case:
storage required = 8000 * 3400 * 1 * 1 = 27200000 bytes
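If it helps to have the formula in executable form, here is a tiny Python helper (the function name is just for illustration):
def image_bytes(width, height, samples_per_pixel=1, bytes_per_sample=1):
    # Uncompressed storage, no padding or header: w * h * S * B.
    return width * height * samples_per_pixel * bytes_per_sample

print(image_bytes(8000, 3400))           # 27200000 bytes, 8-bit greyscale
print(image_bytes(8000, 3400, 3, 1))     # 81600000 bytes, 8-bit RGB
print(image_bytes(8000, 3400) // 8)      # 3400000 bytes, 1-bit black & white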
If the image were compressed, the only thing you should hope and expect is that it takes less storage. The actual amount required will depend on:
how repetitive/predictable the image is - the more predictable the image is, in general, the better it will compress
how many colours the image contains - fewer colours generally means better compression
which image file format you require (PNG, JPEG, TIFF, GIF)
which compression algorithm you use (RLE, LZW, DCT)
how long you are prepared to wait for compression and decompression - the longer you can wait, the better you can compress in general
what losses/inaccuracies you are prepared to tolerate to save space - if you are prepared to accept a lower quality version of your image, you can get a smaller file

Size of Input and ConvNet

In the CS231n course on Convolutional Neural Networks, in the ConvNet notes:
INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.
CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as [32x32x12] if we decided to use 12 filters.
From the document, I understand that the INPUT will contain images of 32 (width) x 32 (height) x 3 (depth). But later, the result of the CONV layer is [32x32x12] if we decide to use 12 filters.
Where is the 3 as in depth of the image?
Please help me out here, thank you in advance.
It gets "distributed" to each feature map (result after convolution with filter).
Before thinking about 12 filters, just think of one. That is, you are applying convolution with a filter of [filter_width * filter_height * input_channel_number]. And because your input_channel_number is the same as filter channel, you basically applying input_channel_number of 2d convolution independently on each input channel and then sum them together. And the result is a 2D feature map.
Now you can repeat this 12 times to get 12 feature maps and stack them together to get your [32 x 32 x 12] feature volume. And that's why your filter size is a 4D vector with [filter_width * filter_height * input_channel_number * output_channel_number], in your case this should be something like [3x3x3x12] (please note the ordering may vary between different framework, but operation is the same)
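A quick PyTorch check of those shapes (an illustrative sketch; padding=1 keeps the 32x32 spatial size):
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # the [32x32x3] RGB input
conv = nn.Conv2d(in_channels=3, out_channels=12, kernel_size=3, padding=1)

print(conv.weight.shape)  # torch.Size([12, 3, 3, 3]): out_ch x in_ch x kh x kw
print(conv(x).shape)      # torch.Size([1, 12, 32, 32]): the input depth 3 is
                          # absorbed into each filter; output depth = filter count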
So, this is fun. I have read the document again and found the answer, which was just a little scrolling further down. Before, I thought the filter was, for example, 32 x 32 (with no depth). The truth is:
A typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels).
During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position.

Direct3D9 - Convert an A2R10G10B10 image to an A8R8G8B8 image

Before starting:
A2B10G10R10 (2 bits for the alpha, 10 bits for each color channel)
A8B8G8R8 (8 bits for every channel)
Correct me if I'm wrong, but is it right that the A2B10G10R10 pixel format cannot be displayed directly on screens?
If so, I would like to convert my A2B10G10R10 image to a displayable A8B8G8R8 one, either using OpenCV, Direct3D9, or even manually, but I'm really bad when it comes to bitwise operations, which is why I need your help.
So here I am:
// Get the texture bits pointer
offscreenSurface->LockRect(&memDesc, NULL, 0);
// Copy the texture bits to a cv::Mat
cv::Mat m(desc.Height, desc.Width, CV_8UC4, memDesc.pBits, memDesc.Pitch);
// Convert from A2B10G10R10 to A8B8G8R8
???
Here is how I think I should do it for every 32-bit pack:
Copy the original 2 bits into the first converted 8 bits
Scale each of the other channels from the original 10 bits down to the converted 8 bits (how do I do that?)
Note:
cv::cvtColor doesn't seem to offer the format conversion I need
I can't use IDirect3DDevice9::StretchRect method
Even Google seems to be lost on this subject
So to sum up, the question is:
How do I convert an A2B10G10R10 pixel format texture to an A8B8G8R8 one?
Thanks. Best regards.
I'm not sure why you are using legacy Direct3D 9 instead of DirectX 11. In any case, the naming scheme between Direct3D 9 era D3DFMT and the modern DXGI_FORMAT is flipped, so it can be a bit confusing.
D3DFMT_A8B8G8R8 is the same as DXGI_FORMAT_R8G8B8A8_UNORM
D3DFMT_A2B10G10R10 is the same as DXGI_FORMAT_R10G10B10A2_UNORM
D3DFMT_A8R8G8B8 is the same as DXGI_FORMAT_B8G8R8A8_UNORM
There is no direct equivalent of D3DFMT_A2R10G10B10 in DXGI but you can swap the red/blue channels to get it.
There's also a long-standing bug in the deprecated D3DX9, D3DX10, and D3DX11 helper libraries where the DDS file format's DDPIXELFORMAT has the red and blue masks backwards for both 10:10:10:2 formats. My DDS texture readers solve this by flipping the mapping of the masks to the formats on read, and by always writing DDS files using the more modern DX10 header where I explicitly use DXGI_FORMAT_R10G10B10A2_UNORM. See this post for more details.
The biggest problem with converting 10:10:10:2 to 8:8:8:8 is that you are losing 2 bits of data from the R, G, B color channels. You can do a naïve bit-shift, but the results are usually crap. To handle the color conversion where you are losing precision, you want to use something like error diffusion or ordered dithering.
Furthermore, for the 2-bit alpha you don't want 3 (binary 11) to map to 192 (binary 11000000), because 3 (11) is fully opaque in 2-bit alpha while 255 (11111111) is fully opaque in 8-bit alpha.
Take a look at DirectXTex which is an open source library that does conversions for every DXGI_FORMAT and can handle legacy conversions of most D3DFMT. It implements all the stuff I just mentioned.
The library uses float4 intermediate values because it's built on DirectXMath and that provides a more general solution than having a bunch of special-case conversion combinations. For special-case high-performance use, you could write a direct 10-bit to 8-bit converter with all the dithering, but that's a pretty unusual situation.
With all that discussion of format image conversion out of the way, you can in fact render a 10:10:10:2 texture onto a 8:8:8:8 render target for display. You can use 10:10:10:2 as a render target backbuffer format as well, and it will get converted to 8:8:8:8 as part of the present. Hardware support for 10:10:10:2 is optional on Direct3D 9, but required for Direct3D Feature Level 10 or better cards when using DirectX 11. You can even get true 10-bit display scan-out when using the "exclusive" full screen rendering mode, and Windows 10 is implementing HDR display out natively later this year.
There's a general solution to this, and it's nice to be able to do it at the drop of a hat without needing to incorporate a bulky library or introduce new rendering code (sometimes that's not even practical).
First, rip it apart. I can never keep track of which order the RGBA fields are in, so I just try it every way until one works, a strategy which reliably works every time... eventually. But you may as well trust your analysis for your first attempt. The docs I found say D3D lists them from MSB to LSB, so in this case we have %AA RRRRRRRRRR GGGGGGGGGG BBBBBBBBBB (but I have no idea if that's right).
b = (src >>  0) & 0x3FF;  // bits  0-9  : blue  (10 bits)
g = (src >> 10) & 0x3FF;  // bits 10-19 : green (10 bits)
r = (src >> 20) & 0x3FF;  // bits 20-29 : red   (10 bits)
a = (src >> 30) & 0x003;  // bits 30-31 : alpha (2 bits)
Next, you fix the precision. Naive bit-shift frequently works fine. If the results are 8 bits per channel, you're no worse off than you are with most images. A shift down from 10 bits to 3 would look bad without dithering but from 10 to 8 can look alright.
r >>= 2; g >>= 2; b >>= 2;
Now the alpha component does get tricky because it's shifting the other way. As #chuck-walbourn said you need to consider how you want the alpha values to map. Here's what you probably want:
%00 -> %00000000
%01 -> %01010101
%10 -> %10101010
%11 -> %11111111
Although a lookup table with size 4 probably makes the most sense here, there's a more general way of doing it. What you do is shove your small value to the top of the big value and then replicate it. Here it is with your scenario and one other more interesting scenario:
%Aa -> %Aa Aa Aa Aa
%xyz -> %xyz xyz xy
Let's examine what would happen for xyz with a lookup table:
%000 -> %000 000 00 (0)
%001 -> %001 001 00 (36) +36
%010 -> %010 010 01 (73) +37
%011 -> %011 011 01 (109) +36
%100 -> %100 100 10 (146) +37
%101 -> %101 101 10 (182) +36
%110 -> %110 110 11 (219) +37
%111 -> %111 111 11 (255) +36
As you can see, we get some good characteristics with this approach. Naturally having a %00000000 and %11111111 result is of paramount importance.
Next we pack the results:
dst = (a<<24)|(r<<16)|(g<<8)|(b<<0);
And then we decide if we need to optimize it or look at what the compiler does and scratch our heads.
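Putting the pieces together, here is a Python sketch of the same steps (just an illustration of the bit manipulation; it assumes the layout above, i.e. alpha in the top two bits and blue in the lowest ten, so verify it against your actual surface format):
def a2r10g10b10_to_a8r8g8b8(src):
    # Naive per-pixel conversion: truncate 10-bit colour to 8 bits (no
    # dithering) and bit-replicate the 2-bit alpha so 0b11 becomes 0xFF.
    b = (src >>  0) & 0x3FF
    g = (src >> 10) & 0x3FF
    r = (src >> 20) & 0x3FF
    a = (src >> 30) & 0x3

    r >>= 2; g >>= 2; b >>= 2                 # 10 -> 8 bits per colour channel
    a = (a << 6) | (a << 4) | (a << 2) | a    # 2 -> 8 bits: %Aa -> %AaAaAaAa

    return (a << 24) | (r << 16) | (g << 8) | b

assert a2r10g10b10_to_a8r8g8b8(0xFFFFFFFF) == 0xFFFFFFFF   # all-ones stays all-ones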

Compressing/packing "don't care" bits into 3 states

At the moment I am working on an on screen display project with black, white and transparent pixels. (This is an open source project: http://code.google.com/p/super-osd; that shows the 256x192 pixel set/clear OSD in development but I'm migrating to a white/black/clear OSD.)
Since each pixel is black, white or transparent I can use a simple 2 bit/4 state encoding where I store the black/white selection and the transparent selection. So I would have a truth table like this (x = don't care):
B/W T
x 0 pixel is transparent
0 1 pixel is black
1 1 pixel is white
However as can be clearly seen this wastes one bit when the pixel is transparent. I'm designing for a memory constrained microcontroller, so whenever I can save memory it is good.
So I'm trying to think of a way to pack these 3 states into some larger unit (say, a byte). I am open to using lookup tables to decode and encode the data, so a complex algorithm can be used, but it cannot depend on the states of the pixels before or after the current unit/byte (this rules out any proper data compression algorithm), and the size must be consistent; that is, a scene of all transparent pixels must take the same space as a scene of random noise. I was imagining something on the level of densely packed decimal, which packs 3 x 4-bit (0-9) BCD digits in only 10 bits, with something like 24 states remaining out of the 1024, which is great. So does anyone have any ideas?
Any suggestions? Thanks!
In a byte (256 possible values) you can store 5 of your three-state values. One way to look at it: three to the fifth power is 243, slightly less than 256. The fact that it's only slightly less also shows that you're hardly wasting any fraction of a bit.
For encoding five of your 3-state "digits" into a byte, think of forming a base-3 number from the five digits in succession -- the resulting value is guaranteed to be less than 243 and is therefore directly storable in a byte. Similarly, for decoding, convert the byte's value back from base 3 to recover the five digits.
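A small Python sketch of that base-3 packing (the function names are just for illustration):
def pack5(trits):
    # Pack five base-3 digits (each 0, 1 or 2) into one byte (0..242).
    value = 0
    for t in trits:              # interpret the sequence as a base-3 number
        value = value * 3 + t
    return value

def unpack5(byte):
    # Recover the five base-3 digits from a packed byte.
    trits = []
    for _ in range(5):
        trits.append(byte % 3)
        byte //= 3
    return trits[::-1]           # the first digit packed is the most significant

# e.g. transparent=0, black=1, white=2 for five pixels
assert unpack5(pack5([1, 2, 0, 1, 2])) == [1, 2, 0, 1, 2]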
