FFTW multi-dimensional FFT - fftw

I am trying to utilize the multidimensional FFT fftwf_plan_dft_r2c_2d from a single array of data. I have a M data points, N number of times:
float *input = (float*)malloc( M * N * sizeof( float ) );
// load M*N data points data
fftwf_complex *outputFFT = (fftwf_complex*)fftwf_malloc( N * ((M/2) + 1) * sizeof( fftwf_complex ) );
fftwf_plan forwardFFTPlan = fftwf_plan_dft_r2c_2d( N, M, input, outputFFT, FFTW_ESTIMATE );
fftwf_execute( forwardFFTPlan );
fftwf_destroy_plan( forwardFFTPlan );
// Plot M data points, ignore the rest. Plotting magnitude of the data sqrtf( ([REAL] * [REAL]) + ([IMAG] * [IMAG]) )
The FFT results are not correct (validated through working MATLAB script), but if I just take a 1D FFT of M data points:
for( int i = 0; i < N; i++ )
{
float *input = (float*)malloc( M * sizeof( float ) );
// load M data points
fftwf_complex *outputFFT = (fftwf_complex*)fftwf_malloc( ((M/2) + 1) * sizeof( fftwf_complex ) );
fftwf_plan forwardFFTPlan = fftwf_plan_dft_r2c_1d( M, input, outputFFT, FFTW_ESTIMATE );
fftwf_execute( forwardFFTPlan );
fftwf_destroy_plan( forwardFFTPlan );
}
// Plot M data points, ignore the rest. Plotting magnitude of the data sqrtf( ([REAL] * [REAL]) + ([IMAG] * [IMAG]) )
the result is correct. The data being loaded into the array is the same. What am I not understanding about multi-dimensional FFTs? I read through the help pages and I "thought" I was doing it right, but obviously I am missing something...

After asking around, I found my issue: fftwf_plan_dft_r2c_2d performs a 2D FFT which I was reading differently. I needed to use fftwf_plan_many_dft_r2c instead, which allowed me to specify the strides and sizes to perform a 1D FFT multiple times.

Related

How to get a random number in Metal shader?

How would I go about getting a random number in a Metal shader?
I searched for "random" in The Metal Shading Language Specification, but found nothing.
It looks like there's not one built in. This example code for MetalShaderShowcase/AAPLWoodShader.metal defines its own simple rand function.
// Generate a random float in the range [0.0f, 1.0f] using x, y, and z (based on the xor128 algorithm)
float rand(int x, int y, int z)
{
int seed = x + y * 57 + z * 241;
seed= (seed<< 13) ^ seed;
return (( 1.0 - ( (seed * (seed * seed * 15731 + 789221) + 1376312589) & 2147483647) / 1073741824.0f) + 1.0f) / 2.0f;
}
So I was working on a Random Number Generator for another project and was wanting to package it into a neat framework for a while.
Your question pushed me to do just that. If you don't mind the shameless plug, here is a very simple framework that will generate a random number for you in a metal shader based on (up to) three seeds that you give it. The code is based on the following research paper that describes how to create random numbers on parallel processors for Monte Carlo simulations. It also has a (theoretical) period of 2^121 so it should be good for most reasonable calculations that can be done on a GPU.
All you have to call in your shader is an intializer, then you call rand(), like so:
// Initialize a random number generator, seeds 2 and 3 are optional
Loki rng = Loki(seed1, seed2, seed3);
// get a random float [0,1)
float random_float = rng.rand();
I also included a sample project in the repo so you can see how it is used.
Instead of computing the random number on the GPU, you can also compute a bunch of random numbers on the CPU and pass them into a the shader using a uniform / MTLBuffer.
Please take a look at [pcg-random], it's very simple and fast, more importantly it's fast. And it's super easy to modify their C code for Metal. https://www.pcg-random.org/
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;
void pcg32_srandom_r(thread pcg32_random_t* rng, uint64_t initstate, uint64_t initseq)
{
rng->state = 0U;
rng->inc = (initseq << 1u) | 1u;
pcg32_random_r(rng);
rng->state += initstate;
pcg32_random_r(rng);
}
uint32_t pcg32_random_r(thread pcg32_random_t* rng)
{
uint64_t oldstate = rng->state;
rng->state = oldstate * 6364136223846793005ULL + rng->inc;
uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
uint32_t rot = oldstate >> 59u;
return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}
How do I use it?
float randomF(thread pcg32_random_t* rng)
{
//return pcg32_random_r(rng)/float(UINT_MAX);
return ldexp(float(pcg32_random_r(rng)), -32);
}
pcg32_random_t rng;
pcg32_srandom_r(&rng, pos_grid.x*int_time, pos_grid.y*int_time);
auto randomFloat = randomF(&rng);

Is this part of a real IFFT process really optimal?

When calculating (I)FFT it is possible to calculate "N*2 real" data points using a ordinary complex (I)FFT of N data points.
Not sure about my terminology here, but this is how I've read it described.
There are several posts about this on stackoverflow already.
This can speed things up a bit when only dealing with such "real" data which is often the case when dealing with for example sound (re-)synthesis.
This increase in speed is offset by the need for a pre-processing step that somehow... uhh... fidaddles? the data to achieve this. Look I'm not even going to try to convince anyone I fully understand this but thanks to previously mentioned threads, I came up with the following routine, which does the job nicely (thank you!).
However, on my microcontroller this costs a bit more than I'd like even though trigonometric functions are already optimized with LUTs.
But the routine itself just looks like it should be possible to optimize mathematically to minimize processing. To me it seems similar to plain 2d rotation. I just can't quite wrap my head around it, but it just feels like this could be done with fewer both trigonometric calls and arithmetic operations.
I was hoping perhaps someone else might easily see what I don't and provide some insight into how this math may be simplified.
This particular routine is for use with IFFT, before the bit-reversal stage.
pseudo-version:
INPUT
MAG_A/B = 0 TO 1
PHA_A/B = 0 TO 2PI
INDEX = 0 TO PI/2
r = MAG_A * sin(PHA_A)
i = MAG_B * sin(PHA_B)
rsum = r + i
rdif = r - i
r = MAG_A * cos(PHA_A)
i = MAG_B * cos(PHA_B)
isum = r + i
idif = r - i
r = -cos(INDEX)
i = -sin(INDEX)
rtmp = r * isum + i * rdif
itmp = i * isum - r * rdif
OUTPUT rsum + rtmp
OUTPUT itmp + idif
OUTPUT rsum - rtmp
OUTPUT itmp - idif
original working code, if that's your poison:
void fft_nz_set(fft_complex_t complex[], unsigned bits, unsigned index, int32_t mag_lo, int32_t pha_lo, int32_t mag_hi, int32_t pha_hi) {
unsigned size = 1 << bits;
unsigned shift = SINE_TABLE_BITS - (bits - 1);
unsigned n = index; // index for mag_lo, pha_lo
unsigned z = size - index; // index for mag_hi, pha_hi
int32_t rsum, rdif, isum, idif, r, i;
r = smmulr(mag_lo, sine(pha_lo)); // mag_lo * sin(pha_lo)
i = smmulr(mag_hi, sine(pha_hi)); // mag_hi * sin(pha_hi)
rsum = r + i; rdif = r - i;
r = smmulr(mag_lo, cosine(pha_lo)); // mag_lo * cos(pha_lo)
i = smmulr(mag_hi, cosine(pha_hi)); // mag_hi * cos(pha_hi)
isum = r + i; idif = r - i;
r = -sinetable[(1 << SINE_BITS) - (index << shift)]; // cos(pi_c * (index / size) / 2)
i = -sinetable[index << shift]; // sin(pi_c * (index / size) / 2)
int32_t rtmp = smmlar(r, isum, smmulr(i, rdif)) << 1; // r * isum + i * rdif
int32_t itmp = smmlsr(i, isum, smmulr(r, rdif)) << 1; // i * isum - r * rdif
complex[n].r = rsum + rtmp;
complex[n].i = itmp + idif;
complex[z].r = rsum - rtmp;
complex[z].i = itmp - idif;
}
// For reference, this would be used as follows to generate a sawtooth (after IFFT)
void synth_sawtooth(fft_complex_t *complex, unsigned fft_bits) {
unsigned fft_size = 1 << fft_bits;
fft_sym_dc(complex, 0, 0); // sets dc bin [0]
for(unsigned n = 1, z = fft_size - 1; n <= fft_size >> 1; n++, z--) {
// calculation of amplitude/index (sawtooth) for both n and z
fft_sym_magnitude(complex, fft_bits, n, 0x4000000 / n, 0x4000000 / z);
}
}

How to make sum calculations without using atomic in CUDA

In the below code, how can I calculate sum_array value without using atomicAdd.
Kernel method
__global__ void calculate_sum( int width,
int height,
int *pntrs,
int2 *sum_array )
{
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if ( row >= height || col >= width ) return;
int idx = pntrs[ row * width + col ];
//atomicAdd( &sum_array[ idx ].x, col );
//atomicAdd( &sum_array[ idx ].y, row );
sum_array[ idx ].x += col;
sum_array[ idx ].y += row;
}
Launch Kernel
dim3 dimBlock( 16, 16 );
dim3 dimGrid( ( width + ( dimBlock.x - 1 ) ) / dimBlock.x,
( height + ( dimBlock.y - 1 ) ) / dimBlock.y );
Reduction is a general name for this kind of problems. Look at the presentation for further explanation or use Google for other examples.
General way to solve this is to make parallel sum of global memory segments inside the thread blocks and store the results in global memory. Afterwards, copy the partial results to CPU memory space, sum the partial results using CPU, and copy the result back to GPU memory. You can avoid coping of memory by execution of another parallel sum for the partial results.
Another approach is to use highly optimized libraries for CUDA such as Thrust or CUDPP which contain functions doing the stuff.
My Cuda is very very rusty, but this is roughly how you do it (courtesy of "Cuda by Example", which I would strongly suggest you to read):
https://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0
Do a better partitioning of the array you need to sum: threads in CUDA are lightweight, but not so much that you can spawn one for just two sums and hope to get any performance benefit in return.
At this point each thread will be tasked to sum over a slice of your data: create an array of shared int as big as the number of your threads, where each thread will save the partial sum it computed.
Synchronize the threads and reduce the shared memory array:
(please, take it as pseudocode)
// Code to sum over a slice, essentially a loop over each thread subset
// and accumulate over "localsum" (a local variable)
...
// Save the result in the shared memory
partial[threadidx] = localsum;
// Synchronize the threads:
__syncthreads();
// From now on partial is filled with the result of all computations: you can reduce partial
// we'll do it the illiterate way, using a single thread (it can be easily parallelized)
if(threadidx == 0) {
for(i = 1; i < nthreads; ++i) {
partial[0] += partial[i];
}
}
and off you go: partial[0] will hold your sum (or computation).
See the dot product example in "CUDA by example" for a more rigorous discussion of the topic and a reduction algorithm that runs in about O(log(n)).
Hope this helps

Data structures and algorithms for adaptive "uniform" mesh?

I need a data structure for storing float values at an uniformly sampled 3D mesh:
x = x0 + ix*dx where 0 <= ix < nx
y = y0 + iy*dy where 0 <= iy < ny
z = z0 + iz*dz where 0 <= iz < nz
Up to now I have used my Array class:
Array3D<float> A(nx, ny,nz);
A(0,0,0) = 0.0f; // ix = iy = iz = 0
Internally it stores the float values as an 1D array with nx * ny * nz elements.
However now I need to represent an mesh with more values than I have RAM,
e.g. nx = ny = nz = 2000.
I think many neighbour nodes in such an mesh may have similar values so I was thinking if there was some simple way that I could "coarsen" the mesh adaptively.
For instance if the 8 (ix,iy,iz) nodes of an cell in this mesh have values that are less than 5% apart; they are "removed" and replaced by just one value; the mean of the 8 values.
How could I implement such a data structure in a simple and efficient way?
EDIT:
thanks Ante for suggesting lossy compression. I think this could work the following way:
#define BLOCK_SIZE 64
struct CompressedArray3D {
CompressedArray3D(int ni, int nj, int nk) {
NI = ni/BLOCK_SIZE + 1;
NJ = nj/BLOCK_SIZE + 1;
NK = nk/BLOCK_SIZE + 1;
blocks = new float*[NI*NJ*NK];
compressedSize = new unsigned int[NI*NJ*NK];
}
void setBlock(int I, int J, int K, float values[BLOCK_SIZE][BLOCK_SIZE][BLOCK_SIZE]) {
unsigned int csize;
blocks[I*NJ*NK + J*NK + K] = compress(values, csize);
compressedSize[I*NJ*NK + J*NK + K] = csize;
}
float getValue(int i, int j, int k) {
int I = i/BLOCK_SIZE;
int J = j/BLOCK_SIZE;
int K = k/BLOCK_SIZE;
int ii = i - I*BLOCK_SIZE;
int jj = j - J*BLOCK_SIZE;
int kk = k - K*BLOCK_SIZE;
float *compressedBlock = blocks[I*NJ*NK + J*NK + K];
unsigned int csize = compressedSize[I*NJ*NK + J*NK + K];
float values[BLOCK_SIZE][BLOCK_SIZE][BLOCK_SIZE];
decompress(compressedBlock, csize, values);
return values[ii][jj][kk];
}
// number of blocks:
int NI, NJ, NK;
// number of samples:
int ni, nj, nk;
float** blocks;
unsigned int* compressedSize;
};
For this to be useful I need a lossy compression that is:
extremely fast, also on small datasets (e.g. 64x64x64)
compress quite hard > 3x, never mind if it looses quite a bit of info.
Any good candidates?
It sounds like you're looking for a LOD (level of detail) adaptive mesh. It's a recurring theme in video games and terrain simulation.
For terrain, see here: http://vterrain.org/LOD/Papers/ -- look for the ROAM video which is IIRC not only adaptive by distance, but also by view direction.
For non-terrain entities, there is a huge body of work (here's one example: Generic Adaptive Mesh Refinement).
I would suggest to use OctoMap to handle large 3D data.
And to extend it as shown here to handle geometrical properties.

Least Squares solution to simultaneous equations

I am trying to fit a transformation from one set of coordinates to another.
x' = R + Px + Qy
y' = S - Qx + Py
Where P,Q,R,S are constants, P = scale*cos(rotation). Q=scale*sin(rotation)
There is a well known 'by hand' formula for fitting P,Q,R,S to a set of corresponding points.
But I need to have an error estimate on the fit - so I need a least squares solution.
Read 'Numerical Recipes' but I'm having trouble working out how to do this for data sets with x and y in them.
Can anyone point to an example/tutorial/code sample of how to do this ?
Not too bothered about the language.
But - just use built in feature of Matlab/Lapack/numpy/R probably not helpful !
edit:
I have a large set of old(x,y) new(x,y) to fit to. The problem is overdetermined (more data points than unknowns) so simple matrix inversion isn't enough - and as I said I really need the error on the fit.
The following code should do the trick. I used the following formula for the residuals:
residual[i] = (computed_x[i] - actual_x[i])^2
+ (computed_y[i] - actual_y[i])^2
And then derived the least-squares formulae based on the general procedure described at Wolfram's MathWorld.
I tested out this algorithm in Excel and it performs as expected. I Used a collection of ten random points which were then rotated, translated and scaled by a randomly generated transformation matrix.
With no random noise applied to the output data, this program produces four parameters (P, Q, R, and S) which are identical to the input parameters, and an rSquared value of zero.
As more and more random noise is applied to the output points, the constants start to drift away from the correct values, and the rSquared value increases accordingly.
Here is the code:
// test data
const int N = 1000;
float oldPoints_x[N] = { ... };
float oldPoints_y[N] = { ... };
float newPoints_x[N] = { ... };
float newPoints_y[N] = { ... };
// compute various sums and sums of products
// across the entire set of test data
float Ex = Sum(oldPoints_x, N);
float Ey = Sum(oldPoints_y, N);
float Exn = Sum(newPoints_x, N);
float Eyn = Sum(newPoints_y, N);
float Ex2 = SumProduct(oldPoints_x, oldPoints_x, N);
float Ey2 = SumProduct(oldPoints_y, oldPoints_y, N);
float Exxn = SumProduct(oldPoints_x, newPoints_x, N);
float Exyn = SumProduct(oldPoints_x, newPoints_y, N);
float Eyxn = SumProduct(oldPoints_y, newPoints_x, N);
float Eyyn = SumProduct(oldPoints_y, newPoints_y, N);
// compute the transformation constants
// using least-squares regression
float divisor = Ex*Ex + Ey*Ey - N*(Ex2 + Ey2);
float P = (Exn*Ex + Eyn*Ey - N*(Exxn + Eyyn))/divisor;
float Q = (Exn*Ey + Eyn*Ex + N*(Exyn - Eyxn))/divisor;
float R = (Exn - P*Ex - Q*Ey)/N;
float S = (Eyn - P*Ey + Q*Ex)/N;
// compute the rSquared error value
// low values represent a good fit
float rSquared = 0;
float x;
float y;
for (int i = 0; i < N; i++)
{
x = R + P*oldPoints_x[i] + Q*oldPoints_y[i];
y = S - Q*oldPoints_x[i] + P*oldPoints_y[i];
rSquared += (x - newPoints_x[i])^2;
rSquared += (y - newPoints_y[i])^2;
}
To find P, Q, R, and S, then you can use least squares. I think the confusing thing is that that usual description of least squares uses x and y, but they don't match the x and y in your problem. You just need translate your problem carefully into the least squares framework. In your case the independent variables are the untransformed coordinates x and y, the dependent variables are the transformed coordinates x' and y', and the adjustable parameters are P, Q, R, and S. (If this isn't clear enough, let me know and I'll post more detail.)
Once you've found P, Q, R, and S, then scale = sqrt(P^2 + Q^2) and you can then find the rotation from sin rotation = Q / scale and cos rotation = P / scale.
You can use the levmar program to calculate this. Its tested and integrated into multiple products including mine. Its licensed under the GPL, but if this is a non-opensource project, he will change the license for you (for a fee)
Define the 3x3 matrix T(P,Q,R,S) such that (x',y',1) = T (x,y,1). Then compute
A = \sum_i |(T (x_i,y_i,1)) - (x'_i,y'_i,1)|^2
and minimize A against (P,Q,R,S).
Coding this yourself is a medium to large sized project unless you can guarntee that the data are well conditioned, especially when you want good error estimates out of the procedure. You're probably best off using an existing minimizer that supports error estimates..
Particle physics type would use minuit either directly from CERNLIB (with the coding most easily done in fortran77), or from ROOT (with the coding in c++, or it should be accessible though the python bindings). But that is a big installation if you don't have one of these tools already.
I'm sure that others can suggest other minimizers.
Thanks eJames, thats almost exaclty what I have. I coded it from an old army surveying manual that was based on an earlier "Instructions to Surveyors" note that must be 100years old! (It uses N and E for North and East rather than x/y )
The goodness of fit parameter will be very useful - I can interactively throw out selected points if they make the fit worse.
FindTransformation(vector<Point2D> known,vector<Point2D> unknown) {
{
// sums
for (unsigned int ii=0;ii<known.size();ii++) {
sum_e += unknown[ii].x;
sum_n += unknown[ii].y;
sum_E += known[ii].x;
sum_N += known[ii].y;
++n;
}
// mean position
me = sum_e/(double)n;
mn = sum_n/(double)n;
mE = sum_E/(double)n;
mN = sum_N/(double)n;
// differences
for (unsigned int ii=0;ii<known.size();ii++) {
de = unknown[ii].x - me;
dn = unknown[ii].y - mn;
// for P
sum_deE += (de*known[ii].x);
sum_dnN += (dn*known[ii].y);
sum_dee += (de*unknown[ii].x);
sum_dnn += (dn*unknown[ii].y);
// for Q
sum_dnE += (dn*known[ii].x);
sum_deN += (de*known[ii].y);
}
double P = (sum_deE + sum_dnN) / (sum_dee + sum_dnn);
double Q = (sum_dnE - sum_deN) / (sum_dee + sum_dnn);
double R = mE - (P*me) - (Q*mn);
double S = mN + (Q*me) - (P*mn);
}
One issue is that numeric stuff like this is often tricky. Even when the algorithms are straightforward, there's often problems that show up in actual computation.
For that reason, if there is a system you can get easily that has a built-in feature, it might be best to use that.

Resources