Fast Converting RGBA to ARGB - performance

I am trying to convert an RGBA buffer into ARGB. Is there any way to improve the following algorithm, or is there a faster way to perform this operation?
Note that the alpha value is not important once the data is in the ARGB buffer; it should always end up as 0xFF.
int y, x, pixel;
for (y = 0; y < height; y++)
{
for (x = 0; x < width; x++)
{
pixel = rgbaBuffer[y * width + x];
argbBuffer[(height - y - 1) * width + x] = (pixel & 0xff00ff00) | ((pixel << 16) & 0x00ff0000) | ((pixel >> 16) & 0xff);
}
}

I will focus only on the swap function:
typedef unsigned int Color32;
inline Color32 Color32Reverse(Color32 x)
{
return
// Source is in format: 0xAARRGGBB
((x & 0xFF000000) >> 24) | //______AA
((x & 0x00FF0000) >> 8) | //____RR__
((x & 0x0000FF00) << 8) | //__GG____
((x & 0x000000FF) << 24); //BB______
// Return value is in format: 0xBBGGRRAA
}

Assuming that the code is not buggy (just inefficient), I guess that all you want to do is swap every second (even-numbered) byte (and, of course, flip the buffer vertically), is that right?
So you can achieve some optimizations by:
Avoiding the shift and masking operations
Optimizing the loop, e.g. economizing on the index calculations
I would rewrite the code as follows:
int y, x;
for (y = 0; y < height; y++)
{
unsigned char *pRGBA= (unsigned char *)(rgbaBuffer+y*width);
unsigned char *pARGB= (unsigned char *)(argbBuffer+(height-y-1)*width);
for (x = 4*(width-1); x>=0; x-=4)
{
pARGB[x ] = pRGBA[x+2];
pARGB[x+1] = pRGBA[x+1];
pARGB[x+2] = pRGBA[x ];
pARGB[x+3] = 0xFF;
}
}
Please note that the more complex index calculation is performed in the outer loop only. There are four accesses to both rgbaBuffer and argbBuffer for each pixel, but I think this is more than offset by avoiding the bitwise operations and the index calculations. An alternative would be (as in your code) to fetch/store one pixel (int) at a time and do the processing locally (this economizes on memory accesses), but unless you have some efficient way to swap the two bytes and set the alpha locally (e.g. some inline assembly, so that you make sure that everything is performed at register level), it won't really help.

The code you provided is very strange, since it shuffles the color components not as rgba->argb but as rgba->rabg.
I've made a correct and optimized version of this routine.
int pixel;
int size = width * height;
for (unsigned int * rgba_ptr = rgbaBuffer, * argb_ptr = argbBuffer + size - 1; argb_ptr >= argbBuffer; rgba_ptr++, argb_ptr--)
{
// *argb_ptr = *rgba_ptr >> 8 | 0xff000000; // - this version doesn't change endianness
*argb_ptr = __builtin_bswap32(*rgba_ptr) >> 8 | 0xff000000; // This does
}
The first thing I did was simplify your shuffling expression. It is obvious that XRGB is just RGBA >> 8.
I also removed the calculation of the array index on each iteration and used pointers as loop variables.
This version is about 2 times faster than the original on my machine.
You can also use SSE for shuffling if this code is intended for x86 CPUs.

I am very late to this one, but I had the exact same problem when generating video on the fly. By reusing the buffer, I could get away with setting only the R, G, B values for every frame and setting the A only once.
See the code below:
byte[] _workingBuffer = null;
byte[] GetProcessedPixelData(SKBitmap bitmap)
{
ReadOnlySpan<byte> sourceSpan = bitmap.GetPixelSpan();
if (_workingBuffer == null || _workingBuffer.Length != bitmap.ByteCount)
{
// Alloc buffer
_workingBuffer = new byte[sourceSpan.Length];
// Set all the alpha
for (int i = 0; i < sourceSpan.Length; i += 4) _workingBuffer[i] = byte.MaxValue;
}
Stopwatch w = Stopwatch.StartNew();
for (int i = 0; i < sourceSpan.Length; i += 4)
{
// A
// Dont set alpha here. The alpha is already set in the buffer
//_workingBuffer[i] = byte.MaxValue;
//_workingBuffer[i] = sourceSpan[i + 3];
// R
_workingBuffer[i + 1] = sourceSpan[i];
// G
_workingBuffer[i + 2] = sourceSpan[i + 1];
// B
_workingBuffer[i + 3] = sourceSpan[i + 2];
}
Debug.Print("Copied " + sourceSpan.Length + " in " + w.Elapsed.TotalMilliseconds);
return _workingBuffer;
}
This got me to around 15 milliseconds on an iPhone for a (1920 * 1080 * 4) buffer, which is ~8 MB.
This was not nearly fast enough for me. My final solution was instead to do an offset memory copy (Buffer.BlockCopy in C#), since the alpha is not important.
byte[] _workingBuffer = null;
byte[] GetProcessedPixelData(SKBitmap bitmap)
{
ReadOnlySpan<byte> sourceSpan = bitmap.GetPixelSpan();
byte[] sourceArray = sourceSpan.ToArray();
if (_workingBuffer == null || _workingBuffer.Length != bitmap.ByteCount)
{
// Alloc buffer
_workingBuffer = new byte[sourceSpan.Length];
// Set first byte. This is the alpha component of the first pixel
_workingBuffer[0] = byte.MaxValue;
}
// Converts RGBA to ARGB in ~2 ms instead of ~15 ms
//
// Copies the whole buffer with a offset of 1
// R G B A R G B A R G B A
// Originally the source buffer has: R1, G1, B1, A1, R2, G2, B2, A2, R3, G3, B3, A3
// A R G B A R G B A R G B A
// After the copy it looks like: 0, R1, G1, B1, A1, R2, G2, B2, A2, R3, G3, B3, A3
// So essentially we get the wrong alpha for every pixel. But all alphas should be 255 anyways.
// The first byte is set in the alloc
Buffer.BlockCopy(sourceArray, 0, _workingBuffer, 1, sourceSpan.Length - 1);
// Below is an inefficient method of converting RGBA to ARGB. Takes ~15 ms on iPhone 12 Pro Max for a 8mb buffer (1920 * 1080 * 4 bytes)
/*
for (int i = 0; i < sourceSpan.Length; i += 4)
{
// A
// Dont set alpha here. The alpha is already set in the buffer
//_workingBuffer[i] = byte.MaxValue;
//_workingBuffer[i] = sourceSpan[i + 3];
byte sR = sourceSpan[i];
byte sG = sourceSpan[i + 1];
byte sB = sourceSpan[i + 2];
if (sR == 0 && sG == byte.MaxValue && sB == 0)
continue;
// R
_workingBuffer[i + 1] = sR;
// G
_workingBuffer[i + 2] = sG;
// B
_workingBuffer[i + 3] = sB;
}
*/
return _workingBuffer;
}
The comments in the code explain how this works. On the same iPhone it takes ~2 ms, which is sufficient for my use case.

Use assembly, the following is for Intel.
This example swaps Red and Blue.
void* b = pixels;
UINT len = textureWidth*textureHeight;
__asm
{
mov ecx, len // Set loop counter to pixels memory block size
mov ebx, b // Set ebx to pixels pointer
label:
mov al,[ebx+0] // Load Red to al
mov ah,[ebx+2] // Load Blue to ah
mov [ebx+0],ah // Swap Red
mov [ebx+2],al // Swap Blue
add ebx,4 // Move by 4 bytes to next pixel
dec ecx // Decrease loop counter
jnz label // If not zero jump to label
}

(pixel << 24) | (pixel >> 8) rotates a 32-bit integer 8 bits to the right, which would convert a 32-bit RGBA value to ARGB. This works because:
pixel << 24 discards the RGB portion of RGBA off the left side, resulting in A000.
pixel >> 8 discards the A portion of RGBA off the right side, resulting in 0RGB.
A000 | 0RGB == ARGB.
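The rotation described above can be sketched in C as follows. This is a minimal illustration, not a tuned implementation: the function names are made up for this example, and it assumes each pixel is handled as a packed 32-bit value laid out as 0xRRGGBBAA (the byte order in memory depends on the machine's endianness). The second helper drops the source alpha and forces it to 0xFF, which is what the question asks for.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a packed 0xRRGGBBAA value right by 8 bits, giving 0xAARRGGBB. */
static uint32_t rgba_to_argb(uint32_t pixel)
{
    return (pixel << 24) | (pixel >> 8);
}

/* Same shuffle, but ignore the source alpha and force it to 0xFF. */
static uint32_t rgba_to_argb_opaque(uint32_t pixel)
{
    return (pixel >> 8) | 0xFF000000u;
}

/* Convert a whole buffer of packed pixels; src and dst may be the same buffer. */
static void convert_rgba_to_argb(const uint32_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = rgba_to_argb_opaque(src[i]);
}
```

Because each output pixel depends only on the input pixel at the same index, the conversion can run in place.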

Related

Is it possible to know the brightness of a picture in Flutter?

I am building an application which has a Camera inside.
After I take a photo, I want to analyze it to determine the brightness of the picture; if it is bad, I have to take the photo again.
This is my code right now. It's a JavaScript function that I found and am rewriting in Dart:
Thanks to @Abion47
EDIT 1
for (int i = 0; i < pixels.length; i++) {
int pixel = pixels[i];
int b = (pixel & 0x00FF0000) >> 16;
int g = (pixel & 0x0000FF00) >> 8;
int r = (pixel & 0x000000FF);
avg = ((r + g + b) / 3).floor();
colorSum += avg;
}
brightness = (colorSum / (width * height)).round();
// I tried with this other code
//brightness = (colorSum / pixels.length).round();
return brightness;
But I get a lower brightness for white than for black; the numbers are a little weird.
Do you know a better way to determine the brightness?
SOLUTION:
After further investigation we found the solution: we had an error in the image decoding, so we used a function from the image package to do it.
Here is our final code:
Image image = decodeImage(file.readAsBytesSync());
var data = image.getBytes();
var colorSum = 0;
for(var x = 0; x < data.length; x += 4) {
int r = data[x];
int g = data[x + 1];
int b = data[x + 2];
int avg = ((r + g + b) / 3).floor();
colorSum += avg;
}
var brightness = (colorSum / (image.width * image.height)).floor();
return brightness;
Hope it helps you.
There are several things wrong with your code.
First, you are getting a range error because you are attempting to access a pixel that doesn't exist. This is probably due to width and/or height being greater than the image's actual width or height. There are a lot of ways to try and get these values, but for this application it doesn't actually matter since the end result is to get an average value across all pixels in the image, and you don't need the width or height of the image for that.
Second, you are fetching the color values by serializing the color value into a hex string and then parsing the individual channel substrings. Your substring is going to result in incorrect values because:
foo.substring(a, b) takes the substring of foo from a to b, exclusive. That means that a and b are indices, not lengths, and the resulting string will not include the character at b. So assuming hex is "01234567", when you do hex.substring(0, 2), you get "01", and then you do hex.substring(3, 5) you get "34" while hex.substring(6, 8) gets you "67". You need to do hex.substring(0, 2) followed by hex.substring(2, 4) and hex.substring(4, 6) to get the first three channels.
That being said, you are fetching the wrong channels. The image package stores its pixel values in ABGR format, meaning the first two characters in the hex string are going to be the alpha channel, which is unimportant when calculating image brightness. Instead, you want the second, third, and fourth channels for the blue, green, and red values respectively.
And having said all that, this is an extremely inefficient way to do this anyway when the preferred way to retrieve channel data from an integer color value is with bitwise operations on the integer itself. (Never convert a number to a string or vice versa unless you absolutely have to.)
So in summary, what you want will likely be something akin to the following:
final pixels = image.data;
double colorSum = 0;
for (int i = 0; i < pixels.length; i++) {
int pixel = pixels[i];
int b = (pixel & 0x00FF0000) >> 16;
int g = (pixel & 0x0000FF00) >> 8;
int r = (pixel & 0x000000FF);
final avg = (r + g + b) / 3;
colorSum += avg;
}
return colorSum / pixels.length;
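For comparison, the same bitwise channel extraction can be sketched in C. This is only an illustration under assumptions: the function name is hypothetical, the pixels are taken to be packed 32-bit values in the 0xAABBGGRR layout described above, and the channels are summed and divided once at the end rather than averaged per pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Mean brightness (0..255) of packed 0xAABBGGRR pixels.
 * The alpha channel is ignored. Returns 0 for an empty buffer. */
static double average_brightness(const uint32_t *pixels, size_t count)
{
    if (count == 0)
        return 0.0;

    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t p = pixels[i];
        uint32_t r = p & 0xFFu;         /* lowest byte */
        uint32_t g = (p >> 8) & 0xFFu;
        uint32_t b = (p >> 16) & 0xFFu;
        sum += r + g + b;
    }
    /* Three channels per pixel, so divide by 3 * count. */
    return (double)sum / (3.0 * (double)count);
}
```

Dividing by the number of pixels actually read, rather than by a separately obtained width and height, avoids the range problem mentioned in the first point.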

OCR algorithm (GOCR) to 32F429IDISCOVERY board

I'm trying to port an OCR algorithm (specifically GOCR) to the 32F429IDISCOVERY board and I'm still getting nothing back...
I'm recording an image from an OV7670 camera in RGB565 format to the SDRAM of the board; it is then converted to greyscale and passed to the algorithm itself.
From this and other forums I got the impression that GOCR is a very good algorithm, and it seemed to work very well on a PC, but I just can't get it to work on the board.
Does anyone have some experience with implementing OCR or GOCR? I am not sure where the problem is because it behaves in a very weird way. The code stops in a different part of the algorithm almost every time...
Calling the OCR algorithm:
void ocr_algorithm(char *output_str) {
job_t job1, *job; /* fixme, dont want global variables for lib */
job=OCR_JOB=&job1;
int linecounter;
const char *line;
uint8_t r,g,b;
uint32_t n,i,buffer;
char *p_pic;
uint32_t *image = (uint32_t*) SDRAM_START_ADR;
setvbuf(stdout, (char *) NULL, _IONBF, 0); /* not buffered */
job_init(job); /* init cfg and db */
job_init_image(job); /* single image */
p_pic = malloc(IMG_ROWS*IMG_COLUMNS);
// Converting RGB565 to grayscale
i=0;
for (n = 0; n < IMG_ROWS*IMG_COLUMNS; n++) {
if (n % 2 == 0){
buffer = image[i] & 0xFFFF;
}
else{
buffer = (image[i] >> 16) & 0xFFFF;
i++;
}
r = (uint8_t) ((buffer >> 11) & 0x1F);
g = (uint8_t) ((buffer >> 5) & 0x3F);
b = (uint8_t) (buffer & 0x1F);
// RGB888
r = ((r * 527) + 23) >> 6;
g = ((g * 259) + 33) >> 6;
b = ((b * 527) + 23) >> 6;
// Greyscale
p_pic[n] = 0.299*r + 0.587*g + 0.114*b;
}
//read_picture;
job->src.p.p = p_pic;
job->src.p.x = IMG_ROWS;
job->src.p.y = IMG_COLUMNS;
job->src.p.bpp = 1;
/* call main loop */
pgm2asc(job);
//print output
strcpy(output_str, "");
linecounter = 0;
line = getTextLine(&(job->res.linelist), linecounter++);
while (line) {
strcat(output_str, line);
strcat(output_str, "\n");
line = getTextLine(&(job->res.linelist), linecounter++);
}
free_textlines(&(job->res.linelist));
job_free_image(job);
free(p_pic);
}

Is this part of a real IFFT process really optimal?

When calculating (I)FFT it is possible to calculate "N*2 real" data points using an ordinary complex (I)FFT of N data points.
Not sure about my terminology here, but this is how I've read it described.
There are several posts about this on Stack Overflow already.
This can speed things up a bit when only dealing with such "real" data which is often the case when dealing with for example sound (re-)synthesis.
This increase in speed is offset by the need for a pre-processing step that somehow... uhh... fidaddles? the data to achieve this. Look I'm not even going to try to convince anyone I fully understand this but thanks to previously mentioned threads, I came up with the following routine, which does the job nicely (thank you!).
However, on my microcontroller this costs a bit more than I'd like even though trigonometric functions are already optimized with LUTs.
But the routine itself just looks like it should be possible to optimize mathematically to minimize processing. To me it seems similar to plain 2D rotation. I just can't quite wrap my head around it, but it feels like this could be done with fewer trigonometric calls and fewer arithmetic operations.
I was hoping perhaps someone else might easily see what I don't and provide some insight into how this math may be simplified.
This particular routine is for use with IFFT, before the bit-reversal stage.
pseudo-version:
INPUT
MAG_A/B = 0 TO 1
PHA_A/B = 0 TO 2PI
INDEX = 0 TO PI/2
r = MAG_A * sin(PHA_A)
i = MAG_B * sin(PHA_B)
rsum = r + i
rdif = r - i
r = MAG_A * cos(PHA_A)
i = MAG_B * cos(PHA_B)
isum = r + i
idif = r - i
r = -cos(INDEX)
i = -sin(INDEX)
rtmp = r * isum + i * rdif
itmp = i * isum - r * rdif
OUTPUT rsum + rtmp
OUTPUT itmp + idif
OUTPUT rsum - rtmp
OUTPUT itmp - idif
original working code, if that's your poison:
void fft_nz_set(fft_complex_t complex[], unsigned bits, unsigned index, int32_t mag_lo, int32_t pha_lo, int32_t mag_hi, int32_t pha_hi) {
unsigned size = 1 << bits;
unsigned shift = SINE_TABLE_BITS - (bits - 1);
unsigned n = index; // index for mag_lo, pha_lo
unsigned z = size - index; // index for mag_hi, pha_hi
int32_t rsum, rdif, isum, idif, r, i;
r = smmulr(mag_lo, sine(pha_lo)); // mag_lo * sin(pha_lo)
i = smmulr(mag_hi, sine(pha_hi)); // mag_hi * sin(pha_hi)
rsum = r + i; rdif = r - i;
r = smmulr(mag_lo, cosine(pha_lo)); // mag_lo * cos(pha_lo)
i = smmulr(mag_hi, cosine(pha_hi)); // mag_hi * cos(pha_hi)
isum = r + i; idif = r - i;
r = -sinetable[(1 << SINE_BITS) - (index << shift)]; // cos(pi_c * (index / size) / 2)
i = -sinetable[index << shift]; // sin(pi_c * (index / size) / 2)
int32_t rtmp = smmlar(r, isum, smmulr(i, rdif)) << 1; // r * isum + i * rdif
int32_t itmp = smmlsr(i, isum, smmulr(r, rdif)) << 1; // i * isum - r * rdif
complex[n].r = rsum + rtmp;
complex[n].i = itmp + idif;
complex[z].r = rsum - rtmp;
complex[z].i = itmp - idif;
}
// For reference, this would be used as follows to generate a sawtooth (after IFFT)
void synth_sawtooth(fft_complex_t *complex, unsigned fft_bits) {
unsigned fft_size = 1 << fft_bits;
fft_sym_dc(complex, 0, 0); // sets dc bin [0]
for(unsigned n = 1, z = fft_size - 1; n <= fft_size >> 1; n++, z--) {
// calculation of amplitude/index (sawtooth) for both n and z
fft_sym_magnitude(complex, fft_bits, n, 0x4000000 / n, 0x4000000 / z);
}
}

How do convolution matrices work?

How do those matrices work? Do I need to multiply every single pixel? How about the upper-left, upper-right, bottom-left and bottom-right pixels, where there are no surrounding pixels? And does the matrix work from left to right and then from top to bottom, or from top to bottom first and then left to right?
Why does this kernel (Edge enhance): http://i.stack.imgur.com/d755G.png
turn the image into this: http://i.stack.imgur.com/NRdkK.jpg
The Convolution filter is applied to every single pixel.
On the edges there are a few things you can do (all leave a type of border or shrink the image):
skip the edges and crop 1 pixel from the edge of the image
substitute 0 or 255 for any of the pixels that are out of bounds for the image
use a cubic spline (or other interpolation method) between 0 (or 255) and the value of the images edge pixel to come up with a substitute.
The order in which you apply the convolution does not matter (upper left to bottom right is most common); you should get the same results no matter the order.
However, a common mistake when applying a convolution matrix is to overwrite the current pixel you are examining with the new value. This will affect the value you come up with for the pixel next to the current one. A better method would be to create a buffer to hold the computed values, so that previous applications of the convolution filter do not affect current application of the matrix.
From your example images it is hard to tell why the filter applied creates the black and white version without seeing the original image.
Below is a step by step example of applying a convolution kernel to an image (1D for simplicity).
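A 1D pass can be sketched in C roughly as follows. This is an illustrative sketch, not code from the original answer: the function name and the 3-tap kernel are assumptions, samples outside the image are treated as 0 (one of the border options listed above), and the kernel is applied without flipping, as is usual in image filtering. The result goes into a separate buffer so that already-filtered values never feed into later ones.

```c
#include <stddef.h>

/* Apply a 3-tap kernel to in[0..n-1], treating out-of-range samples as 0.
 * kernel[0] weights the left neighbour, kernel[1] the current sample,
 * kernel[2] the right neighbour. Results are written to out, never to in. */
static void convolve1d_3tap(const int *in, int *out, size_t n, const int kernel[3])
{
    for (size_t i = 0; i < n; i++) {
        int left  = (i > 0)     ? in[i - 1] : 0;
        int right = (i + 1 < n) ? in[i + 1] : 0;
        out[i] = kernel[0] * left + kernel[1] * in[i] + kernel[2] * right;
    }
}
```

Step by step, with the input {10, 10, 10, 50, 50, 50} and the kernel {-1, 1, 0}, each output is the current sample minus its left neighbour: 10 - 0, 10 - 10, 10 - 10, 50 - 10, 50 - 50, 50 - 50. That gives {10, 0, 0, 40, 0, 0}: flat regions go to 0, the jump from 10 to 50 stands out as 40, and the leading 10 is an artifact of the zero padding at the border.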
As for the edge enhancement kernel in your post, notice the +1 next to the -1. Think about what that will do. If the region is constant, the two pixels under the +/-1 will add to zero (black). If the two pixels are different, they will have a non-zero value. So what you are seeing is that pixels next to each other that are different get highlighted, while ones that are the same get set to black. The bigger the difference, the brighter (more white) the pixel in the filtered image.
Yes, you multiply every pixel by that matrix. The traditional method is to find the relevant pixels relative to the pixel being convolved, multiply them by the factors, and average the result. So for a 3x3 blur of:
1, 1, 1,
1, 1, 1,
1, 1, 1
This matrix means you take the relevant values of the various components and multiply them by the factors, then divide by the number of elements. So you would take that 3 by 3 box, add up all the red values, then divide by 9. You'd take the 3 by 3 box, add up all the green values, then divide by 9. You'd take the 3 by 3 box, add up all the blue values, then divide by 9.
This means a couple of things. First, you need a second giant chunk of memory to perform this operation. And you do every pixel you can.
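As a rough sketch of that traditional method, here it is in C for a single 8-bit channel. The function name is hypothetical, and copying the border pixels unchanged is just one of several possible edge policies, chosen here to keep the example short. For RGB the same loop would run once per channel.

```c
#include <stddef.h>
#include <stdint.h>

/* Traditional 3x3 box blur of one 8-bit channel using a second buffer.
 * Each interior pixel becomes the mean of its 3x3 neighbourhood;
 * border pixels are copied unchanged. src and dst must not overlap. */
static void box_blur_3x3(const uint8_t *src, uint8_t *dst, size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            if (x == 0 || y == 0 || x + 1 == width || y + 1 == height) {
                dst[y * width + x] = src[y * width + x];
                continue;
            }
            unsigned sum = 0;
            for (size_t dy = 0; dy < 3; dy++)
                for (size_t dx = 0; dx < 3; dx++)
                    sum += src[(y + dy - 1) * width + (x + dx - 1)];
            dst[y * width + x] = (uint8_t)(sum / 9); /* divide by the number of elements */
        }
    }
}
```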
However, that's only for the traditional method, and the traditional method is actually needlessly convoluted (get it?). If you return the results in a corner, you never actually need any additional memory and can always do the entire operation within the memory footprint you started with.
public static void convolve(int[] pixels, int offset, int stride, int x, int y, int width, int height, int[][] matrix, int parts) {
int index = offset + x + (y*stride);
for (int j = 0; j < height; j++, index += stride) {
for (int k = 0; k < width; k++) {
int pos = index + k;
pixels[pos] = convolve(pixels,stride,pos, matrix, parts);
}
}
}
private static int crimp(int color) {
return (color >= 0xFF) ? 0xFF : (color < 0) ? 0 : color;
}
private static int convolve(int[] pixels, int stride, int index, int[][] matrix, int parts) {
int redSum = 0;
int greenSum = 0;
int blueSum = 0;
int pixel, factor;
for (int j = 0, m = matrix.length; j < m; j++, index+=stride) {
for (int k = 0, n = matrix[j].length; k < n; k++) {
pixel = pixels[index + k];
factor = matrix[j][k];
redSum += factor * ((pixel >> 16) & 0xFF);
greenSum += factor * ((pixel >> 8) & 0xFF);
blueSum += factor * ((pixel) & 0xFF);
}
}
return 0xFF000000 | ((crimp(redSum / parts) << 16) | (crimp(greenSum / parts) << 8) | (crimp(blueSum / parts)));
}
With the kernel traditionally returning the value to the center-most pixel, the image blurs around the edges but more or less remains where it started. This seemed like a good idea, but it's actually problematic. The correct way to do it is to put the result pixel in the upper-left corner. Then you can simply, and with no extra memory, iterate over the entire image with a scanline, going one pixel at a time and returning the value, without causing errors. The bulk of the color weight is shifted up and left by one pixel. But it's one pixel, and you can shift it back down and to the right if you iterate backwards with the result pixel in the bottom-right. Though this might be trouble for cache hits.
However, a lot of modern architectures have GPUs now, so the entire image can be processed simultaneously, making it kind of a moot point. But it is strange that one of the most important algorithms in graphics is weird in requiring this, as that makes the easiest way to do the operation impossible, and a memory hog.
That is why people like Matt on this question say things like "However, a common mistake when applying a convolution matrix is to overwrite the current pixel you are examining with the new value." -- Really, this is the correct way to do it; the error is writing the result pixel to the center rather than the upper-left corner. Unlike the upper-left corner, you will need the center pixel again. You won't ever need the upper-left corner again (assuming you are iterating left->right, top->bottom), so it's safe to store your value there.
"This will affect the value you come up with for the pixel next to the current one." -- If you wrote it to the upper left corner as you processed it as a scan, you would overwrite data that you do not ever need again. Using a bunch of extra memory isn't a better solution.
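To make the upper-left argument concrete, here is a small C sketch of an in-place 3x3 box blur on a single 8-bit channel. It is an illustration under assumptions rather than the answer's own code: the function name is made up, and the rightmost two columns and bottom two rows are simply left untouched because no full 3x3 window starts there.

```c
#include <stddef.h>
#include <stdint.h>

/* In-place 3x3 box blur of one 8-bit channel. The window whose top-left
 * corner is (x, y) writes its result back to (x, y). Scanning left to
 * right, top to bottom, no later window ever reads (x, y) again, so no
 * second buffer is needed. The image content shifts up and left by one. */
static void box_blur_3x3_inplace(uint8_t *pixels, size_t width, size_t height)
{
    if (width < 3 || height < 3)
        return;
    for (size_t y = 0; y + 2 < height; y++) {
        for (size_t x = 0; x + 2 < width; x++) {
            unsigned sum = 0;
            for (size_t dy = 0; dy < 3; dy++)
                for (size_t dx = 0; dx < 3; dx++)
                    sum += pixels[(y + dy) * width + (x + dx)];
            pixels[y * width + x] = (uint8_t)(sum / 9);
        }
    }
}
```

The safety argument is visible in the indices: every later window starts either further right on the same row or on a lower row, so the pixel just written always lies outside it.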
As such, here's likely the fastest Java blur you'd ever see.
private static void applyBlur(int[] pixels, int stride) {
int v0, v1, v2, r, g, b;
int pos;
pos = 0;
try {
while (true) {
v0 = pixels[pos];
v1 = pixels[pos+1];
v2 = pixels[pos+2];
r = ((v0 >> 16) & 0xFF) + ((v1 >> 16) & 0xFF) + ((v2 >> 16) & 0xFF);
g = ((v0 >> 8 ) & 0xFF) + ((v1 >> 8) & 0xFF) + ((v2 >> 8) & 0xFF);
b = ((v0 ) & 0xFF) + ((v1 ) & 0xFF) + ((v2 ) & 0xFF);
r/=3;
g/=3;
b/=3;
pixels[pos++] = r << 16 | g << 8 | b;
}
}
catch (ArrayIndexOutOfBoundsException e) { }
pos = 0;
try {
while (true) {
v0 = pixels[pos];
v1 = pixels[pos+stride];
v2 = pixels[pos+stride+stride];
r = ((v0 >> 16) & 0xFF) + ((v1 >> 16) & 0xFF) + ((v2 >> 16) & 0xFF);
g = ((v0 >> 8 ) & 0xFF) + ((v1 >> 8) & 0xFF) + ((v2 >> 8) & 0xFF);
b = ((v0 ) & 0xFF) + ((v1 ) & 0xFF) + ((v2 ) & 0xFF);
r/=3;
g/=3;
b/=3;
pixels[pos++] = r << 16 | g << 8 | b;
}
}
catch (ArrayIndexOutOfBoundsException e) { }
}

Cuda Bayer/CFA demosaicing example

I've written a CUDA4 Bayer demosaicing routine, but it's slower than single-threaded CPU code, running on a 16-core GTS250.
Blocksize is (16,16) and the image dims are a multiple of 16 - but changing this doesn't improve it.
Am I doing anything obviously stupid?
--------------- calling routine ------------------
uchar4 *d_output;
size_t num_bytes;
cudaGraphicsMapResources(1, &cuda_pbo_resource, 0);
cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes, cuda_pbo_resource);
// Do the conversion, leave the result in the PBO for display
kernel_wrapper( imageWidth, imageHeight, blockSize, gridSize, d_output );
cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0);
--------------- cuda -------------------------------
texture<uchar, 2, cudaReadModeElementType> tex;
cudaArray *d_imageArray = 0;
__global__ void convertGRBG(uchar4 *d_output, uint width, uint height)
{
uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
uint i = __umul24(y, width) + x;
// input is GR/BG output is BGRA
if ((x < width) && (y < height)) {
if ( y & 0x01 ) {
if ( x & 0x01 ) {
d_output[i].x = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2; // B
d_output[i].y = (tex2D(tex,x,y)); // G in B
d_output[i].z = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2; // R
} else {
d_output[i].x = (tex2D(tex,x,y)); //B
d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4; // G
d_output[i].z = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4; // R
}
} else {
if ( x & 0x01 ) {
// odd col = R
d_output[i].x = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4; // B
d_output[i].z = (tex2D(tex,x,y)); //R
d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4; // G
} else {
d_output[i].x = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2; // B
d_output[i].y = (tex2D(tex,x,y)); // G in R
d_output[i].z = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2; // R
}
}
}
}
void initTexture(int imageWidth, int imageHeight, uchar *imagedata)
{
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cutilSafeCall( cudaMallocArray(&d_imageArray, &channelDesc, imageWidth, imageHeight) );
uint size = imageWidth * imageHeight * sizeof(uchar);
cutilSafeCall( cudaMemcpyToArray(d_imageArray, 0, 0, imagedata, size, cudaMemcpyHostToDevice) );
cutFree(imagedata);
// bind array to texture reference with point sampling
tex.addressMode[0] = cudaAddressModeClamp;
tex.addressMode[1] = cudaAddressModeClamp;
tex.filterMode = cudaFilterModePoint;
tex.normalized = false;
cutilSafeCall( cudaBindTextureToArray(tex, d_imageArray) );
}
There aren't any obvious bugs in your code, but there are several obvious performance opportunities:
1) for best performance, you should use texture to stage into shared memory - see the 'SobelFilter' SDK sample.
2) As written, the code is writing bytes to global memory, which is guaranteed to incur a large performance hit. You can use shared memory to stage results before committing them to global memory.
3) There is a surprisingly big performance advantage to sizing blocks in a way that matches the hardware's texture cache attributes. On Tesla-class hardware, the optimal block size for kernels using the same addressing scheme as your kernel is 16x4 (64 threads per block).
For workloads like this, it may be hard to compete with optimized CPU code. SSE2 can do 16 byte-sized operations in a single instruction, and CPUs are clocked about 5 times as fast.
Based on an answer on the Nvidia forums, here (for the search engines) is a slightly more optimised version which writes a 2x2 block of pixels in each thread, although the difference in speed isn't measurable on my setup.
Note that it should be called with a grid size half the size of the image:
dim3 blockSize(16, 16); // for example
dim3 gridSize((width/2) / blockSize.x, (height/2) / blockSize.y);
__global__ void d_convertGRBG(uchar4 *d_output, uint width, uint height)
{
uint x = 2 * (__umul24(blockIdx.x, blockDim.x) + threadIdx.x);
uint y = 2 * (__umul24(blockIdx.y, blockDim.y) + threadIdx.y);
uint i = __umul24(y, width) + x;
// input is GR/BG output is BGRA
if ((x < width-1) && (y < height-1)) {
// x+1, y+1:
d_output[i+width+1] = make_uchar4( (tex2D(tex,x+2,y+1)+tex2D(tex,x,y+1))/2, // B
(tex2D(tex,x+1,y+1)), // G in B
(tex2D(tex,x+1,y+2)+tex2D(tex,x+1,y))/2, // R
0xff);
// x, y+1:
d_output[i+width] = make_uchar4( (tex2D(tex,x,y+1)), //B
(tex2D(tex,x+1,y+1) + tex2D(tex,x-1,y+1)+tex2D(tex,x,y+2)+tex2D(tex,x,y))/4, // G
(tex2D(tex,x+1,y+2) + tex2D(tex,x+1,y)+tex2D(tex,x-1,y+2)+tex2D(tex,x-1,y))/4, // R
0xff);
// x+1, y:
d_output[i+1] = make_uchar4( (tex2D(tex,x,y-1) + tex2D(tex,x+2,y-1)+tex2D(tex,x,y+1)+tex2D(tex,x+2,y+1))/4, // B
(tex2D(tex,x+2,y) + tex2D(tex,x,y)+tex2D(tex,x+1,y+1)+tex2D(tex,x+1,y-1))/4, // G
(tex2D(tex,x+1,y)), //R
0xff);
// x, y:
d_output[i] = make_uchar4( (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2, // B
(tex2D(tex,x,y)), // G in R
(tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2, // R
0xff);
}
}
There are many ifs and elses in the code. If you structure the code to eliminate all the conditional statements, you will get a huge performance boost, as branching is a performance killer. It is indeed possible to remove the branches. There are exactly 30 cases which you will have to code explicitly. I have implemented it on the CPU and it does not contain any conditional statements. I am thinking of writing a blog post explaining it. Will post it once it's done.

Resources