I'm writing a program to blur a 16-bit grayscale image in CUDA.
With a Gaussian blur of sigma = 20 or 30 the program takes a long time, while it is fast with sigma = 2.0 or 3.0.
I've read on some websites that a Gaussian blur implemented with an FFT is better for large kernel sizes or large sigma values.
Is that really true?
Which algorithm should I use: a simple Gaussian blur or a Gaussian blur with FFT?
My code for the Gaussian blur is below. Is there anything wrong with it?
__global__ void gaussian_blur(
    unsigned short* const blurredChannel, // return value: blurred channel (either red, green, or blue)
    const unsigned short* const inputChannel, // red, green, or blue channel from the original image
    int rows,
    int cols,
    const float* const filterWeight, // gaussian filter weights. The weights look like a bell shape.
    int filterWidth // number of pixels in x and y directions for calculating average blurring
    )
{
    int r = blockIdx.y * blockDim.y + threadIdx.y; // current row
    int c = blockIdx.x * blockDim.x + threadIdx.x; // current column
    if ((r >= rows) || (c >= cols))
    {
        return; // this thread falls outside the image
    }
    int half = filterWidth / 2;
    float blur = 0.f; // will contain the blurred value
    int width = cols - 1;
    int height = rows - 1;
    for (int i = -half; i <= half; ++i) // rows
    {
        for (int j = -half; j <= half; ++j) // columns
        {
            // Clamp filter to the image border
            int h = min(max(r + i, 0), height);
            int w = min(max(c + j, 0), width);
            // Blur is a product of current pixel value and weight of that pixel.
            // Remember that the sum of all weights equals 1, so we are averaging all pixels by their weight.
            int idx = w + cols * h; // current pixel index
            float pixel = static_cast<float>(inputChannel[idx]);
            idx = (i + half) * filterWidth + j + half;
            float weight = filterWeight[idx];
            blur += pixel * weight;
        }
    }
    blurredChannel[c + r * cols] = static_cast<unsigned short>(blur);
}
void createFilter(float *gKernel, double sigma, int radius)
{
    double r, s = 2.0 * sigma * sigma;
    // sum is for normalization
    double sum = 0.0;
    // generate a (2*radius+1) x (2*radius+1) kernel, e.g. 9*9 for radius = 4
    int m = 0;
    for (int x = -radius; x <= radius; x++)
    {
        for (int y = -radius; y <= radius; y++)
        {
            r = std::sqrt(static_cast<double>(x*x + y*y));
            gKernel[m] = (exp(-(r*r)/s))/(3.14 * s);
            sum += gKernel[m];
            m++;
        }
    }
    // normalize the kernel
    m = 0;
    for (int i = 0; i < (radius*2 + 1); ++i)
    {
        for (int j = 0; j < (radius*2 + 1); ++j)
        {
            gKernel[m++] /= sum;
        }
    }
}
int main()
{
    cudaError_t cudaStatus;
    const int size = 81;
    float gKernel[size];
    float *dev_p = 0;
    cudaStatus = cudaMalloc((void**)&dev_p, size * sizeof(float));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
    }
    cudaStatus = cudaMemcpy(dev_p, gKernel, size * sizeof(float), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
    }
    /* The code that reads the image into an unsigned short buffer and copies it from host to device is not shown here because it is long. */
    /* So, suppose unsigned short *d_img contains the image data. */
    cudaMalloc((void**)&d_img, length * sizeof(unsigned short));
    cudaMalloc((void**)&d_blur_img, length * sizeof(unsigned short));
    static const int BLOCK_WIDTH = 32;
    int image_width = 1580, image_height = 1050;
    int x = static_cast<int>(ceilf(static_cast<float>(image_width) / BLOCK_WIDTH));
    int y = static_cast<int>(ceilf(static_cast<float>(image_height) / BLOCK_WIDTH));
    const dim3 grid (x, y, 1); // number of blocks
    const dim3 block(BLOCK_WIDTH, BLOCK_WIDTH, 1);
    /* After blurring the image, the buffer is copied from device to host and the GPU memory is freed. */
    return 0;
}

Short answer: both algorithms are good with respect to image blurring, so feel free to pick the best (fastest) one for your use case.
Kernel size and sigma value are directly correlated: the greater the sigma, the larger the kernel (and thus the more operations-per-pixel to get the final result).
If you implemented a naive convolution, then you should try a separable convolution implementation instead; it will reduce the computation time by an order of magnitude already.
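To make the separable idea concrete, here is a minimal CPU sketch in plain Python (not CUDA, and not the asker's code). It assumes the common rule of thumb radius = ceil(3 * sigma), which is my assumption rather than something stated above, and the helper names are made up for illustration. With sigma = 20 that gives 121 taps, so the naive 2D filter needs 121 * 121 = 14641 multiply-adds per pixel while the two 1D passes need 2 * 121 = 242.

```python
import math

def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1D Gaussian weights; radius defaults to ceil(3 * sigma) (an assumption)."""
    if radius is None:
        radius = int(math.ceil(3.0 * sigma))
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def clamp(v, lo, hi):
    return max(lo, min(v, hi))

def blur_rows(img, k):
    """Filter every row with the 1D kernel k, clamping reads at the border."""
    rows, cols, half = len(img), len(img[0]), len(k) // 2
    return [[sum(k[j + half] * img[r][clamp(c + j, 0, cols - 1)]
                 for j in range(-half, half + 1))
             for c in range(cols)]
            for r in range(rows)]

def transpose(img):
    return [list(col) for col in zip(*img)]

def blur_separable(img, sigma):
    """Horizontal pass followed by a vertical pass (done as a row pass on the transpose)."""
    k = gaussian_kernel_1d(sigma)
    horizontal = blur_rows(img, k)
    return transpose(blur_rows(transpose(horizontal), k))

def blur_naive(img, sigma):
    """Reference 2D filter using the outer product of the same 1D kernel."""
    k = gaussian_kernel_1d(sigma)
    half = len(k) // 2
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(-half, half + 1):
                for j in range(-half, half + 1):
                    h = clamp(r + i, 0, rows - 1)
                    w = clamp(c + j, 0, cols - 1)
                    acc += k[i + half] * k[j + half] * img[h][w]
            out[r][c] = acc
    return out
```

Both functions should agree up to floating-point rounding, because the 2D Gaussian is the outer product of two 1D Gaussians. On the GPU the same split would become two kernels (or two launches of one kernel) with an intermediate buffer.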
Now some more insight: the two approaches implement almost the same Gaussian blurring operation. Why almost? Because taking the FFT of an image implicitly periodizes it. Hence, at the border of the image, the convolution kernel sees an image that has been wrapped around its edge. This is called circular convolution (because of the wrapping). A direct Gaussian blur, on the other hand, implements a simple linear convolution.
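The border difference is easy to see on a 1D signal. The sketch below (plain Python, function names are mine) applies the same symmetric 3-tap kernel once with clamped borders, as in the question's kernel, and once with wrap-around indexing, which is what multiplying FFTs computes when the image is not padded first.

```python
def convolve_clamped(signal, kernel):
    """Direct filtering; reads past the ends are clamped to the border sample."""
    n, half = len(signal), len(kernel) // 2
    return [sum(kernel[j + half] * signal[min(max(i + j, 0), n - 1)]
                for j in range(-half, half + 1))
            for i in range(n)]

def convolve_circular(signal, kernel):
    """Same filter with wrap-around indexing (the behaviour of an unpadded FFT blur)."""
    n, half = len(signal), len(kernel) // 2
    return [sum(kernel[j + half] * signal[(i + j) % n]
                for j in range(-half, half + 1))
            for i in range(n)]

# A single bright sample at the right edge of an otherwise dark signal.
signal = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0]
kernel = [0.25, 0.5, 0.25]

clamped = convolve_clamped(signal, kernel)
circular = convolve_circular(signal, kernel)
# With wrap-around, the bright right edge leaks into the first sample;
# with clamping, the first sample stays dark. Interior samples agree.
```

Padding the image by at least the kernel radius before the FFT (and cropping afterwards) is the usual way to keep the wrapped content out of the visible result.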


Maximum float value in 10-bit image in WIC

I'm trying to convert a HDR image float array I load to a 10-bit DWORD with WIC.
The type of the loading file is GUID_WICPixelFormat128bppPRGBAFloat and I got an array of 4 floats per color.
When I try to convert these to 10 bit as follows:
struct RGBX
{
    unsigned int b : 10;
    unsigned int g : 10;
    unsigned int r : 10;
    int a : 2;
} rgbx;
(which is the format requested by the NVIDIA encoding library for 10-bit rgb),
then I assume I have to divide each of the floats by 1024.0f in order to fit them into the 10 bits of a DWORD.
However, I notice that some of the floats are > 1, which means that their range is not [0,1] as it is when the image is 8-bit.
What would their range be? How do I store a floating-point color in a 10-bit integer?
I'm trying to use the NVidia's HDR encoder which requires an ARGB10 like the above structure.
How is the 10 bit information of a color stored as a floating point number?
Btw I tried to convert with WIC but conversion from GUID_WICPixelFormat128bppPRGBAFloat to GUID_WICPixelFormat32bppR10G10B10A2 fails.
HRESULT ConvertFloatTo10(const float* f, int wi, int he, std::vector<DWORD>& out)
{
    CComPtr<IWICBitmap> b;
    wbfact->CreateBitmapFromMemory(wi, he, GUID_WICPixelFormat128bppPRGBAFloat, wi * 16, wi * he * 16, (BYTE*)f, &b);
    CComPtr<IWICFormatConverter> wf;
    wf->Initialize(b, GUID_WICPixelFormat32bppR10G10B10A2, WICBitmapDitherTypeNone, 0, 0, WICBitmapPaletteTypeCustom);
    // This last call fails with 0x88982f50 : The component cannot be found.
}
Edit: I found a paper (https://hal.archives-ouvertes.fr/hal-01704278/document), is this relevant to this question?
Floating-point color content that is greater than the 0..1 range is High Dynamic Range (HDR) content. If you trivially convert it to 10:10:10:2 UNORM then you are using 'clipping' for values over 1. This doesn't give good results.
SDR 10:10:10 or 8:8:8
You should instead use tone-mapping which converts the HDR signal to a SDR (Standard Dynamic Range a.k.a. 0..1) before or as part of doing the conversion to 10:10:10:2.
There are many different approaches to tone-mapping, but a common 'generic' solution is the Reinhard tone-mapping operator. Here's an implementation using DirectXTex.
std::unique_ptr<ScratchImage> timage(new (std::nothrow) ScratchImage);
if (!timage)
{
    wprintf(L"\nERROR: Memory allocation failed\n");
    return 1;
}
// Compute max luminosity across all images
XMVECTOR maxLum = XMVectorZero();
hr = EvaluateImage(image->GetImages(), image->GetImageCount(), image->GetMetadata(),
    [&](const XMVECTOR* pixels, size_t w, size_t y)
    {
        for (size_t j = 0; j < w; ++j)
        {
            static const XMVECTORF32 s_luminance = { { { 0.3f, 0.59f, 0.11f, 0.f } } };
            XMVECTOR v = *pixels++;
            v = XMVector3Dot(v, s_luminance);
            maxLum = XMVectorMax(v, maxLum);
        }
    });
if (FAILED(hr))
{
    wprintf(L" FAILED [tonemap maxlum] (%08X%ls)\n", static_cast<unsigned int>(hr), GetErrorDesc(hr));
    return 1;
}
maxLum = XMVectorMultiply(maxLum, maxLum);
hr = TransformImage(image->GetImages(), image->GetImageCount(), image->GetMetadata(),
    [&](XMVECTOR* outPixels, const XMVECTOR* inPixels, size_t w, size_t y)
    {
        for (size_t j = 0; j < w; ++j)
        {
            XMVECTOR value = inPixels[j];
            const XMVECTOR scale = XMVectorDivide(
                XMVectorAdd(g_XMOne, XMVectorDivide(value, maxLum)),
                XMVectorAdd(g_XMOne, value));
            const XMVECTOR nvalue = XMVectorMultiply(value, scale);
            // Keep the original alpha, replace only RGB
            value = XMVectorSelect(value, nvalue, g_XMSelect1110);
            outPixels[j] = value;
        }
    }, *timage);
if (FAILED(hr))
{
    wprintf(L" FAILED [tonemap apply] (%08X%ls)\n", static_cast<unsigned int>(hr), GetErrorDesc(hr));
    return 1;
}
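Stripped of the DirectXMath plumbing, the scaling applied above is v * (1 + v / maxLum^2) / (1 + v) per channel. The following plain-Python sketch shows that formula followed by a 10-bit UNORM quantization; the function names are mine and the rounding mode is an assumption, so treat it as an illustration rather than what DirectXTex does internally.

```python
def reinhard_extended(value, max_lum):
    """Reinhard-style tone map: value * (1 + value / max_lum^2) / (1 + value)."""
    white_sq = max_lum * max_lum
    return value * (1.0 + value / white_sq) / (1.0 + value)

def to_unorm10(value):
    """Clamp to [0, 1] and quantize to a 10-bit code (0..1023), rounding to nearest."""
    clamped = min(max(value, 0.0), 1.0)
    return int(round(clamped * 1023.0))

# Example: an HDR channel value of 4.0 in an image whose maximum luminance is 4.0.
code = to_unorm10(reinhard_extended(4.0, 4.0))
```

Note that the scale factor for 10 bits is a multiplication by 1023, not a division by 1024: a value at the image's maximum luminance maps to 1.0 and therefore to the top code, while values above 1.0 that skip the tone map are simply clipped.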
UPDATE: If you are trying to convert HDR floating-point content to an "HDR10" signal, then you need to do:
Color-space rotate from Rec.709 or P3D65 to Rec.2020.
Normalize for 'paper white' / 10,000 nits.
Apply the ST.2084 gamma curve.
Quantize to 10-bit.
// HDTV to UHDTV (Rec.709 color primaries into Rec.2020)
const XMMATRIX c_from709to2020 =
{
    0.6274040f, 0.0690970f, 0.0163916f, 0.f,
    0.3292820f, 0.9195400f, 0.0880132f, 0.f,
    0.0433136f, 0.0113612f, 0.8955950f, 0.f,
    0.f, 0.f, 0.f, 1.f
};
// DCI-P3-D65 https://en.wikipedia.org/wiki/DCI-P3 to UHDTV (DCI-P3-D65 color primaries into Rec.2020)
const XMMATRIX c_fromP3D65to2020 =
{
    0.753845f, 0.0457456f, -0.00121055f, 0.f,
    0.198593f, 0.941777f, 0.0176041f, 0.f,
    0.047562f, 0.0124772f, 0.983607f, 0.f,
    0.f, 0.f, 0.f, 1.f
};
// Custom Rec.709 into Rec.2020
const XMMATRIX c_fromExpanded709to2020 =
{
    0.6274040f, 0.0457456f, -0.00121055f, 0.f,
    0.3292820f, 0.941777f, 0.0176041f, 0.f,
    0.0433136f, 0.0124772f, 0.983607f, 0.f,
    0.f, 0.f, 0.f, 1.f
};
inline float LinearToST2084(float normalizedLinearValue)
{
    const float ST2084 = pow((0.8359375f + 18.8515625f * pow(abs(normalizedLinearValue), 0.1593017578f)) / (1.0f + 18.6875f * pow(abs(normalizedLinearValue), 0.1593017578f)), 78.84375f);
    return ST2084; // Don't clamp between [0..1], so we can still perform operations on scene values higher than 10,000 nits
}
// You can adjust this up to 10000.f
float paperWhiteNits = 200.f;
hr = TransformImage(image->GetImages(), image->GetImageCount(), image->GetMetadata(),
    [&](XMVECTOR* outPixels, const XMVECTOR* inPixels, size_t w, size_t y)
    {
        const XMVECTOR paperWhite = XMVectorReplicate(paperWhiteNits);
        for (size_t j = 0; j < w; ++j)
        {
            XMVECTOR value = inPixels[j];
            XMVECTOR nvalue = XMVector3Transform(value, c_from709to2020);
            // Some people prefer the look of using c_fromP3D65to2020
            // or c_fromExpanded709to2020 instead.
            // Convert to ST.2084
            // (c_MaxNitsFor2084 is the 10,000-nit normalization constant, defined elsewhere)
            nvalue = XMVectorDivide(XMVectorMultiply(nvalue, paperWhite), c_MaxNitsFor2084);
            XMFLOAT4A tmp;
            XMStoreFloat4A(&tmp, nvalue);
            tmp.x = LinearToST2084(tmp.x);
            tmp.y = LinearToST2084(tmp.y);
            tmp.z = LinearToST2084(tmp.z);
            nvalue = XMLoadFloat4A(&tmp);
            // Keep the original alpha, replace only RGB
            value = XMVectorSelect(value, nvalue, g_XMSelect1110);
            outPixels[j] = value;
        }
    }, *timage);
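For a single channel, the normalize / ST.2084 / quantize steps can be sketched in plain Python as below. The constants are the ones used in LinearToST2084 above; the function names, the final clamp and the round-to-nearest quantization are my assumptions, and the color-space rotation to Rec.2020 is deliberately left out.

```python
# ST.2084 (PQ) constants, matching the values in LinearToST2084 above.
M1 = 0.1593017578125
M2 = 78.84375
C1 = 0.8359375
C2 = 18.8515625
C3 = 18.6875

def linear_to_st2084(normalized):
    """Encode a linear value, where 1.0 stands for 10,000 nits, with the PQ curve."""
    p = abs(normalized) ** M1
    return ((C1 + C2 * p) / (1.0 + C3 * p)) ** M2

def encode_hdr10_channel(scene_value, paper_white_nits=200.0):
    """Scale by paper white, normalize to 10,000 nits, apply PQ, quantize to 10 bits."""
    nits = scene_value * paper_white_nits
    pq = linear_to_st2084(nits / 10000.0)
    pq = min(max(pq, 0.0), 1.0)  # clamp only at the very end, just before quantizing
    return int(round(pq * 1023.0))

# A scene value of 1.0 with a 200-nit paper white lands well below the top code,
# leaving the upper part of the 10-bit range for highlights.
paper_white_code = encode_hdr10_channel(1.0)
```

This also answers the range question: the float values are not limited to [0,1]; with this scheme a value of 1.0 means "paper white" and anything above it is a highlight, up to 10,000 / paperWhiteNits before the curve saturates.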
You should really take a look at texconv.
Reinhard et al., "Photographic tone reproduction for digital images", ACM Transactions on Graphics, Volume 21, Issue 3 (July 2002). ACM DL.
@ChuckWalbourn's answer is helpful; however, I don't want to tone-map to [0,1], as there is no point in tone-mapping to SDR and then going to 10-bit HDR.
What I think is correct is to scale to [0,4] instead, by first using g_XMFour:
const XMVECTOR scale = XMVectorDivide(
XMVectorAdd(g_XMFour, XMVectorDivide(v, maxLum)),
XMVectorAdd(g_XMFour, v));
then using a specialized 10-bit store which scales by 255 instead of 1023:
void XMStoreUDecN4a(DirectX::PackedVector::XMUDECN4* pDestination, DirectX::FXMVECTOR V)
{
    using namespace DirectX;
    static const XMVECTOR Scale = { 255.0f, 255.0f, 255.0f, 3.0f };
    XMVECTOR N = XMVectorClamp(V, XMVectorZero(), g_XMFour);
    N = XMVectorMultiply(N, Scale);
    pDestination->v = ((uint32_t)DirectX::XMVectorGetW(N) << 30) |
        (((uint32_t)DirectX::XMVectorGetZ(N) & 0x3FF) << 20) |
        (((uint32_t)DirectX::XMVectorGetY(N) & 0x3FF) << 10) |
        (((uint32_t)DirectX::XMVectorGetX(N) & 0x3FF));
}
And then a specialized 10-bit load which divides by 255 instead of 1023:
DirectX::XMVECTOR XMLoadUDecN4a(DirectX::PackedVector::XMUDECN4* pSource)
{
    using namespace DirectX;
    fourx vectorOut;
    uint32_t Element;
    Element = pSource->v & 0x3FF;
    vectorOut.r = (float)Element / 255.f;
    Element = (pSource->v >> 10) & 0x3FF;
    vectorOut.g = (float)Element / 255.f;
    Element = (pSource->v >> 20) & 0x3FF;
    vectorOut.b = (float)Element / 255.f;
    vectorOut.a = (float)(pSource->v >> 30) / 3.f;
    const DirectX::XMVECTORF32 j = { vectorOut.r,vectorOut.g,vectorOut.b,vectorOut.a };
    return j;
}

How to spread the audio spectrum into a grid

I'm trying to use Processing to take an audio input and create an audio spectrum that is broken into multiple rows and fits uniformly across the width of the sketch.
I want the ellipses to be spread out in a grid-like fashion, with each one representing a different part of the spectrum.
import ddf.minim.analysis.*;
import ddf.minim.*;
Minim minim;
FFT fft;
AudioInput mic;
void setup()
{
  size(512, 512, P3D);
  minim = new Minim(this);
  mic = minim.getLineIn();
  fft = new FFT(mic.bufferSize(), mic.sampleRate());
}
void draw()
{
  for(int i = 0; i < fft.specSize(); i++)
  {
    float size = fft.getBand(i);
    float x = map(i, 0, fft.specSize(), 0, height);
    float y = i;
    ellipse(x, y, size, size );
  }
}
The FFT data is a 1D signal and you want to visualise it as a 2D grid.
If you know how many rows and columns you want your grid to have, you can use arithmetic to calculate the x and y grid location based on the index.
Let's say you have 100 elements and you want to display them in a 10x10 grid:
take the 1D array counter modulo (%) the number of columns to calculate the 2D x index, and divide (/) it by the number of columns to calculate the 2D y index:
for(int i = 0 ; i < 100; i++){
  println(i, i % 10, i / 10);
}
Here's a longer commented example:
// fft data placeholder
float[] values = new float[100];
// fill with 100 random values
for(int i = 0 ; i < values.length; i++){
  values[i] = random(0.0,1.0);
}
// how many rows/cols
int rows = 10;
int cols = 10;
// how large will a grid element be (including spacing)
float widthPerSquare = (width / cols);
// grid elements offset from top left
float offsetX = widthPerSquare * 0.5;
float offsetY = widthPerSquare * 0.5;
// traverse data
for(int i = 0; i < values.length; i++){
  // calculate x,y indices (both use the number of columns)
  int gridX = i % cols;
  int gridY = i / cols;
  // calculate on screen x,y position based on grid element size
  float x = offsetX + (gridX * widthPerSquare);
  float y = offsetY + (gridY * widthPerSquare);
  // set the size to only be 75% of the grid element (to leave some spacing)
  float size = values[i] * widthPerSquare * 0.75;
  //fill(values[i] * 255);
  ellipse(x, y, size, size);
}
In your case, let's say fft.specSize() is around 512 and you want to draw a square grid, you could do something like this:
import ddf.minim.analysis.*;
import ddf.minim.*;
Minim minim;
FFT fft;
AudioInput mic;
int rows;
int cols;
float xSpacing;
float ySpacing;
void setup()
{
  size(512, 512, P3D);
  minim = new Minim(this);
  mic = minim.getLineIn();
  fft = new FFT(mic.bufferSize(), mic.sampleRate());
  // define your own grid size or use an estimation based on square root of your FFT data
  rows = cols = (int)sqrt(fft.specSize());
  println(rows,rows * rows);
  xSpacing = width / cols;
  ySpacing = height / rows;
}
void draw()
{
  // analyse the current audio buffer so getBand() returns fresh values
  fft.forward(mic.mix);
  for(int i = 0; i < fft.specSize(); i++)
  {
    float size = fft.getBand(i) * 90;
    float x = (i % cols) * xSpacing;
    float y = (i / cols) * ySpacing;
    ellipse(x, y, size, size );
  }
}
Notice that the example isn't applying the offset and the grid is 22 x 22 (484 != 512),
but hopefully it will give you some ideas.
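If you want every band to get a cell (rather than a 22 x 22 grid that has room for only 484 of roughly 512 bands), one option is to round the grid size up instead of down. Here is a small sketch of that arithmetic in Python; grid_shape and cell_of are made-up helper names, and the same two lines translate directly to Processing using ceil() and sqrt().

```python
import math

def grid_shape(count):
    """Smallest near-square grid (rows, cols) with at least `count` cells."""
    cols = math.ceil(math.sqrt(count))
    rows = math.ceil(count / cols)
    return rows, cols

def cell_of(index, cols):
    """Map a 1D index to (column, row) grid indices."""
    return index % cols, index // cols

rows, cols = grid_shape(512)
last_column, last_row = cell_of(511, cols)
```

The last row is then only partly filled, which is usually less noticeable than missing the highest bands.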
The other thing to bear in mind is the contents of that FFT array.
You might want to scale it logarithmically to account for how we perceive sound.
Check out Processing > Examples > Contributed Libraries > Minim > Analysis > SoundSpectrum and have a look at logAverages(). Playing with minBandwidth and bandsPerOctave might help you get a nicer visualisation.
If you want to go a bit deeper into visualisation, check out wakjah's excellent answer and, if you have time, go through Dan Ellis' amazing Music Signal Computing course.

Finding center in random data based on adjacent points

Say I am given a 2D array which contains black and white pixels.
I want to find the "center" of the data points based on the adjacent pixels.
That means the densest parts have the highest impact, and small, loose, scattered or thin parts have only a small impact.
Here is a sample image for my use case:
What is the best algorithm in this scenario to find the center?
The following function calculates the weighted center of a given image.
The image is represented as an array of boolean. Black is represented as 'true' and white as 'false'.
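double[] weightedCenter(boolean[][] img){
    int W = img.length;
    int H = img[0].length;
    double centerX = 0;
    double centerY = 0;
    double total = 0; // sum of all weights
    for(int i=0;i<W;i++){
        for(int j=0;j<H;j++){
            // only black pixels carry weight
            double w = img[i][j] ? nbs(img, i, j) : 0;
            centerX += w * i;
            centerY += w * j;
            total += w;
        }
    }
    // normalize by the total weight (assumes at least one black pixel)
    centerX /= total;
    centerY /= total;
    return new double[]{centerX, centerY};
}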
The weight for each black pixel is calculated (as requested) based on the number of immediate black neighbours.
double nbs(boolean[][] img, int x, int y){
    int W = img.length;
    int H = img[0].length;
    int[] offset = {-1, 0, 1};
    double nb0 = 0; // cells of the 3x3 neighbourhood that lie inside the image
    double nb1 = 0; // black cells among them
    for(int xOff : offset){
        for(int yOff : offset){
            int x2 = x + xOff;
            int y2 = y + yOff;
            if(x2 < 0 || x2 >= W || y2 < 0 || y2 >= H)
                continue;
            nb0++;
            if(img[x2][y2])
                nb1++;
        }
    }
    return nb1 / nb0;
}
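For comparison, here is the same idea as a compact Python sketch: each black pixel is weighted by the fraction of black cells in its 3x3 neighbourhood, and the center is the weighted mean of the coordinates, that is, the weighted sums divided by the total weight. The names and the choice to return None for an all-white image are mine.

```python
def neighbour_weight(img, x, y):
    """Fraction of in-bounds cells in the 3x3 neighbourhood of (x, y) that are black."""
    w, h = len(img), len(img[0])
    in_bounds = 0
    black = 0
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < w and 0 <= y2 < h:
                in_bounds += 1
                if img[x2][y2]:
                    black += 1
    return black / in_bounds

def weighted_center(img):
    """Weighted mean position of the black pixels, or None if there are none."""
    sum_x = sum_y = total = 0.0
    for x in range(len(img)):
        for y in range(len(img[0])):
            if img[x][y]:
                weight = neighbour_weight(img, x, y)
                sum_x += weight * x
                sum_y += weight * y
                total += weight
    if total == 0.0:
        return None
    return sum_x / total, sum_y / total
```

An isolated pixel still gets a small non-zero weight here because the neighbourhood includes the pixel itself; skipping the (0, 0) offset would give isolated pixels no influence at all, and a larger neighbourhood would favour dense regions even more strongly.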

ray tracer objects stretch when off center

I am writing a ray tracer for my computer graphics class. So far I only have spheres and a shadow ray implemented. The current problem is that when I move my sphere off center, it stretches. Here is the code that I use to determine whether a ray intersects a sphere:
bool Sphere::onSphere(Ray r)
{
    float b = (r.dir*2).innerProduct(r.pos + centre*-1);
    float c = (r.pos + centre*-1).innerProduct(r.pos + centre*-1) - radius*radius;
    return b*b - 4*c >= 0;
}
Here is the code that I use to spawn each ray:
for(int i = -cam.width/2; i < cam.width/2; i++)
{
    for(int j = -cam.height/2; j < cam.height/2; j++)
    {
        float normi = (float)i;
        float normj = (float)j;
        Vector pixlePos = cam.right*normi + cam.up*normj + cam.forward*cam.dist + cam.pos*1;
        Vector direction = pixlePos + cam.pos*-1;
        Vector colour = recursiveRayTrace(Ray(pixlePos, direction), 30, 1, 0);
        float red = colour.getX()/255;
        float green = colour.getY()/255;
        float blue = colour.getZ()/255;
        fwrite (&red, sizeof(float), 1, myFile);
        fwrite (&green, sizeof(float), 1, myFile);
        fwrite (&blue, sizeof(float), 1, myFile);
    }
}
Vector Scene::recursiveRayTrace(Ray r, float maxDist, int maxBounces, int bounces)
{
    if(maxBounces < bounces)
        return Vector(0,0,0);
    int count = 0;
    for(int i = 0; i < spheres.size(); i++)
    {
        Vector colour(ambiant.colour);
        for(int j = 0; j < lights.size(); j++)
        {
            Vector intersection(r.pos + r.dir*spheres.at(i).getT(r));
            Ray nRay(intersection, lights.at(j).centre + intersection*-1);
            colour = colour + lights.at(j).colour;
        }
        return colour;
    }
    return Vector(0,0,0);
}
What I get is a sphere that is stretched in the direction of the vector from the center of the image to the center of the circle. I'm not looking for anyone to do my homework; I am just having a really hard time debugging this one. Any hints are appreciated :) Thanks!
Edit: cam.dist is the distance from the camera to the view plane
The stretching is actually a natural consequence of perspective viewing and it is exaggerated if you have a very wide field of view. In other words moving the camera back from your image plane should make it seem more natural.
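One further thing that may be worth checking in the question's code: onSphere tests b*b - 4*c, which is the discriminant of the ray/sphere quadratic only when the ray direction has unit length. The direction passed in is pixlePos - cam.pos, which is not normalized and gets longer the further a pixel is from the image center. A small Python sketch of the difference (the function names are mine, and the numbers are just an illustrative case):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def hits_sphere(origin, direction, centre, radius):
    """Discriminant test for |origin + t * direction - centre|^2 = radius^2, any direction length."""
    oc = sub(origin, centre)
    a = dot(direction, direction)
    b = 2.0 * dot(direction, oc)
    c = dot(oc, oc) - radius * radius
    return b * b - 4.0 * a * c >= 0.0

def hits_sphere_unit_only(origin, direction, centre, radius):
    """Same test with a = 1 hard-coded, which is only valid for a unit-length direction."""
    oc = sub(origin, centre)
    b = 2.0 * dot(direction, oc)
    c = dot(oc, oc) - radius * radius
    return b * b - 4.0 * c >= 0.0

origin = (0.0, 0.0, 0.0)
centre = (3.0, 0.0, 10.0)
direction = (4.2, 0.0, 10.0)  # passes just outside the sphere, not normalized

correct = hits_sphere(origin, direction, centre, 1.0)
shortcut = hits_sphere_unit_only(origin, direction, centre, 1.0)
```

With an unnormalized direction the shortcut reports hits for rays that actually miss, and the error grows with the length of the direction, so the sphere would look larger toward the edges of the image. Normalizing the direction before building the Ray, or including the a term, removes that effect; whatever remains is the ordinary perspective distortion described above.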

How to join some objects in a digital image?

I'm looking for an algorithm to join objects, for example to combine an apple into a tree in a digital image, along with a demo in Matlab. Please point me to some material on that. Thanks for reading and helping me!
I'm not sure I understand your question, but if you are looking to overlap images, as Photoshop layers do, you can use some characteristic of the image to determine the degree of transparency.
For example, consider two RGB images, where image A will be overlapped by image B. To do that, we'll use the brightness of image B to determine the degree of transparency (255 = 100%).
Intensity = pixel / 255;
NewPixel = (PixelA * (1 - Intensity)) + (PixelB * Intensity);
As intensity is a percentage and each pixel is multiplied by the complement of this percentage, the resulting sum will never overflow over 255 (max graylevel)
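As a quick check of the formula on a single pixel, here is a small Python sketch (blend_pixel is a made-up name, and pixels are plain tuples of 0..255 channel values):

```python
def blend_pixel(pixel_a, pixel_b):
    """Blend B over A, using B's average brightness as its opacity."""
    intensity = (sum(pixel_b) / len(pixel_b)) / 255.0
    return tuple(int(a * (1.0 - intensity) + b * intensity)
                 for a, b in zip(pixel_a, pixel_b))

# A black pixel in B is fully transparent, a white one fully opaque.
unchanged = blend_pixel((10, 20, 30), (0, 0, 0))
replaced = blend_pixel((10, 20, 30), (255, 255, 255))
```

Because the two weights always add up to 1, each output channel stays between the two input values, which is why no clamping is needed.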
int WidthA = imageA.Width * channels;
int WidthB = imageB.Width * channels;
int width = Min(imageA.Width, imageB.Width) * channels;
int height = Min(imageA.Height, imageB.Height);
for (int y = 0; y < height; y++)
{
    // Point at the start of row y in each image
    byte *ptrA = imageA.Buffer + (y * WidthA);
    byte *ptrB = imageB.Buffer + (y * WidthB);
    for (int x = 0; x < width; x += channels, ptrA += channels, ptrB += channels)
    {
        // Take the intensity of the pixel. If RGB (channels = 3), intensity = (R+G+B) / 3. If grayscale, the pixel value is the intensity itself
        int avg = 0;
        for (int j = 0; j < channels; ++j)
        {
            avg += ptrB[j];
        }
        // Obtain the intensity as a value between 0 and 1 (0..100%)
        double intensity = ((double)avg / channels) / 255;
        for (int j = 0; j < channels; ++j)
        {
            // Write into image A the pixel obtained by multiplying the image A pixel
            // by (100% - intensity) and adding the image B pixel multiplied by the intensity
            ptrA[j] = (byte) ((ptrA[j] * (1.0 - intensity)) + (intensity * ptrB[j]));
        }
    }
}
You can also change this algorithm to overlap image A over B, or to place the overlay at a different position. I'm assuming here that image B coordinate (0, 0) overlaps image A coordinate (0, 0).
But once again, I'm not sure if this is what you are looking for.
