cuFFT runs slowly - any way to accelerate? - performance

I am using cufft to calculate 1D fft along each row for a matrix, and an array. The matrix size is 512 (x) X 720 (y), and the size of the array is 512 X 1. Which means the fft is applied on each row that has 512 elements for 720 times to the matrix, and is applied once for the array.
However, this operation turns out really slow, about one second basically. Is it normal, or any chance I can accelerate the code?
Here is my code (from NVIDIA sample code):
void FFTSinoKernel(cufftComplex* boneSinoF,
cufftComplex* kernelF,
int nChanDetX, // 512
int nView) // 720
{
cufftHandle plan;
// fft sino
cufftPlan1d(&plan, nChanDetX, CUFFT_C2C, nView);
cufftExecC2C(plan, boneSinoF, boneSinoF, CUFFT_FORWARD);
// fft kernel
cufftPlan1d(&plan, nChanDetX, CUFFT_C2C, 1);
cufftExecC2C(plan, kernelF, kernelF, CUFFT_FORWARD);
cufftDestroy(plan);
}
I was trying to usecufftExecR2C(), but I think that function has bug, because my DC component shifts 1or 2 units with each row. So I have filed a but report. But for now the cufftExecC2C() gives me the right results, so I decide to stick to it.
UPDATE:
Interestingly, I found if I call this function again, it will accelerate significantly, less than 10 ms. So whenever the cufft gets called the first, time, it is slow. Afterwards, it becomes much faster. I don't understand why the first time is slow, and how to avoid it. Anyone has any similar experience? Thanks.

Move the FFT initialization (plan creation) outside of the performance critical loop. The setup code has to allocate memory and calculate O(N) transcendental functions, which can be much slower than the O(NlogN) simple arithmetic ops inside the FFT computation itself.

Related

How to minimize a cost function with Matlab when input variable is a large image: increase speed and prevent memory crash

I am trying to implement a differential phase integration method described in this paper:
Thüring, Thomas, et al. "Non-linear regularized phase retrieval for unidirectional X-ray differential phase contrast radiography." Optics express 19.25 (2011): 25545-25558.
Basically, it's a way to integrate a differential image across the columns only, while imposing some constraints on continuity across the rows to prevent stripe noise.
From a mathematical point of view, I want to minimize the following equation:
where ||.|| is the L2 norm, Dx is the derivative along the columns, Dy is the derivative across the rows, A is the unknown integrated matrix, lambda is a user-defined parameter and phi is the differential profile I measured. Note that for the Dy operator the L1 norm can also be used.
I wrote down a code using fminunc as Matlab solver
pdiff=imresize(diff(padarray(p,[0,1],'replicate','post'),1,2),[128,128]);
noise = 0.02 * randn(size(pdiff));
pdiff_noise = pdiff + noise ;
% normal integration
integratedProfile=cumsum(pdiff_noise,2);
options=optimoptions(#fminunc,'Display','iter-detailed','UseParallel',true,'MaxIterations',35);
% regularized integration
startingPoint=zeros(size(pdiff_noise));
fun=#(x)costFunction(pdiff_noise,x);
integratedProfile_optmized=fminunc(fun,startingPoint,options);
function difference=costFunction(ep,op)
L=0.2;
dep_o=diff(padarray(op,[0,1],'replicate','post'),1,2);
dep_v=diff(padarray(op,[1,0],'replicate','post'),1,1);
difference=sum(sum((ep-dep_o).^2))+L*sum(sum(dep_v.^2));
end
It works using a 128x128 differential image.
The problem arises as soon as I try to work with a larger image. In particular, when I use a 256x256 matrix takes forever to make each iteration even using the parallel option and takes almost the entire RAM.
When I move to a matrix that is 512x512 I get this error
Requested 262144x262144 (512.0GB) array exceeds maximum
array size preference.
Error in fminusub (line 165)
H = eye(sizes.nVar);
Error in fminunc (line 446)
[x,FVAL,GRAD,HESSIAN,EXITFLAG,OUTPUT] =
fminusub(funfcn,x, ...
Error in Untitled (line 13)
integratedProfile_optmized=fminunc(fun,startingPoint,options);
Unfortunately, my final goal is to process approximately 3000 images of 500x500 size.
I think I have understood that the crash problem is related to the size of the matrix and to the fact that each pixel is a variable. Therefore, Matlab needs to calculate a huge hessian that doesn't fit into the memory.
However, I don't really know how to solve it while also speeding up the processing.
Do you have any suggestions on how to work with large images? Is there another solver that may work in a faster way? Any mathematical approach to making the problem easier?
Thanks!

Why is reshape so fast? (Spoiler: Copy-on-Write)

I have a big matrix A which is 1GB of double values, when I reshape it to different dimensions, it's incredible fast.
A=rand(128,1024,1024);
tic;B=reshape(A,1024,128,1024);toc
Elapsed time is 0.000011 seconds.
How can it be that fast? Another observation, MATLAB uses less memory than it should after running that code and storing two matrices of 1GB each: Memory used by MATLAB: 1878 MB (1.969e+09 bytes)
Explanation of the good performance
Matlab uses copy-on-write whenever possible. If you write expressions like B=A, MATLAB does not copy A, instead both variables A and B are references to the same data structure. Only if one of the two variables will be modified, MATLAB will create a copy.
Now to the special case of reshape. Here it looks like A and B are not the same, but in memory they are. The underlying array which holds the data is unaffected by the reshape operation, nothing has to be moved: all(A(:)==B(:)). Everything MATLAB has to do when calling reshape is to create a new reference and annotate it with the new dimensions of the matrix. Reshaping a matrix is nothing more than creating a new reference to the input data, which is annotated with the new dimensions. The runtime of reshape is less than 1µs or roughly the time two simple assignments like B=A require. For all practical applications a zero time operation.
>> tic;for i=1:1000;B=reshape(A,1024,128,1024);end;toc
Elapsed time is 0.000724 seconds.
>> tic;for i=1:1000;B=A;end;toc
Elapsed time is 0.000307 seconds.
It is unknown how large such a reference really is, but we can assume it to be within a few bytes.
Other zero cost operations
Functions known to have practically zero cost (both runtime and memory):
B=reshape(A,sz)
B=A(:)
B=A.' - only for Vectors
B=A' - only for Vectors of real numbers, without the attribute complex. Use .' instead.
B=permute(A,p) - only for the cases where all(A(:)==B(:)).1
B=ipermute(A,p) - only for the cases where all(A(:)==B(:)).1
B=squeeze(A) 1
shiftdim - only for the cases where all(A(:)==B(:)), which are:1
used to remove leading singleton dimensions.
used with negative second input
used without second input argument.
Functions which are "expensive", regardless of the fact that they don't touch the representation in memory (all(A(:)==B(:)) is true)
Left sided indexing: B(1:numel(A))=A; 2
Right sided indexing other than (:), including B=A(1:end); and B=A(:,:,:); 2
1 Significantly slower runtime than reshape between 1µs and 1ms. Probably because of some constant computation overhead. Memory consumption is practically zero and the runtime is independent from the input size. Operations without this annotation have a runtime below 1µs and roughly equivalent to reshape.
2 Zero cost in OCTAVE
Originally used MATLAB 2013b when writing this post. Confirmed the numbers with MATLAB 2019b.

GPGPU computation with MATLAB does not scale properly

I've been experimenting with the GPU support of Matlab (v2014a). The notebook I'm using to test my code has a NVIDIA 840M build in.
Since I'm new to GPU computing with Matlab, I started out with a few simple examples and observed a strange scalability behavior. Instead of increasing the size of my problem, I simply put a forloop around my computation. I expected the time for the computation, to scale with the number of iterations, since the problem size itself does not increase. This was also true for smaller numbers of iterations, however at a certain point the time does not scale as expected, instead I observe a huge increase in computation time. Afterwards, the problem continues to scale again as expected.
The code example started from a random walk simulation, but I tried to produce an example that is easy and still shows the problem.
Here's what my code does. I initialize two matrices as sin(alpha)and cos(alpha). Then I loop over the number of iterations from 2**1to 2**15. I then repead the computation sin(alpha)^2 and cos(alpha)^2and add them up (this was just to check the result). I perform this calculation as often as the number of iterations suggests.
function gpu_scale
close all
NP = 10000;
NT = 1000;
ALPHA = rand(NP,NT,'single')*2*pi;
SINALPHA = sin(ALPHA);
COSALPHA = cos(ALPHA);
GSINALPHA = gpuArray(SINALPHA); % move array to gpu
GCOSALPHA = gpuArray(COSALPHA);
PMAX=15;
for P = 1:PMAX;
for i=1:2^P
GX = GSINALPHA.^2;
GY = GCOSALPHA.^2;
GZ = GX+GY;
end
end
The following plot, shows the computation time in a log-log plot for the case that I always double the number of iterations. The jump occurs when doubling from 1024 to 2048 iterations.
The initial bump for two iterations might be due to initialization and is not really relevant anyhow.
I see no reason for the jump between 2**10 and 2**11 computations, since the computation time should only depend on the number of iterations.
My question: Can somebody explain this behavior to me? What is happening on the software/hardware side, that explains this jump?
Thanks in advance!
EDIT: As suggested by Divakar, I changed the way I time my code. I wasn't sure I was using gputimeit correctly. however MathWorks suggests another possible way, namely
gd= gpuDevice();
tic
% the computation
wait(gd);
Time = toc;
Using this way to measure my performance, the time is significantly slower, however I don't observe the jump in the previous plot. I added the CPU performance for comparison and keept both timings for the GPU (wait / no wait), which can be seen in the following plot
It seems, that the observed jump "corrects" the timining in the direction of the case where I used wait. If I understand the problem correctly, then the good performance in the no wait case is due to the fact, that we do not wait for the GPU to finish completely. However, then I still don't see an explanation for the jump.
Any ideas?

Tips for improving performance of a 2d image 'tracing' CUDA kernel?

Can you give me some tips to optimize this CUDA code?
I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060 although I'm testing it now on a GTX 260 which has the same compute capability) and I have several kernels like the one below. The number of threads I need to execute this kernel is given by long SUM and depends on size_t M and size_t N which are the dimensions of a rectangular image received as parameter it can vary greatly from 50x50 to 10000x10000 in pixels or more. Although I'm mostly interested in working the bigger images with Cuda.
Now each image has to be traced in all directions and angles and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229080 threads computing that kernel below which is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays into the global memory of the device one after another since I need to access them for the calculations all of length SUM. Like this
cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
...etc
So each value of every array can be accessed by one thread. All are done before the kernel calls. According to the Cuda Profiler on Nsight the highest memcopy duration is 246.016 us for a 500x500 image so that is not taking so long.
But the kernels like the one I copied below are taking too long for any practical use (3.25 seconds according to the Cuda profiler for the kernel below for a 500x500 image and 5.052 seconds for the kernel with the highest duration) so I need to see if I can optimize them.
I arrange the data this way
First the block dimension
dim3 dimBlock(256,1,1);
then the number of blocks per Grid
dim3 dimGrid((SUM+255)/256);
For a number of 895 blocks for a 500x500 image.
I'm not sure how to use coalescing and shared memory in my case or even if it's a good idea to call the kernel several times with different portions of the data. The data is independent one from the other so I could in theory call that kernel several times and not with the 229080 threads all at once if needs be.
Now take into account that the outer for loop
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
depends on
tendbegin_cuda[idHilo]
the value of which depends on each thread but most threads have similar values for it.
According to the Cuda Profiler the Global Store Efficiency is of 0.619 and the Global Load Efficiency is 0.951 for this kernel. Other kernels have similar values .
Is that good? bad? how can I interpret those values? Sadly the devices of compute capability 1.3 don't provide other useful info for assessing the code like the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis is "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency" but I'm not sure how I can optimize those.
void __global__ t21_trazo(long SUM,int cT, double Bn, size_t M, size_t N, float* imagen_cuda, double* vector_trazo_cuda, long* xb_cuda, long* yb_cuda, long* xinc_cuda, long* yinc_cuda, long* tbegin_cuda, long* tendbegin_cuda){
long xi;
long yi;
int t;
int k;
int a;
int ji;
long idHilo=blockIdx.x*blockDim.x+threadIdx.x;
int neighborhood[31];
int v=0;
if(idHilo<SUM){
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
yi = yb_cuda[idHilo] + floor((double)t*yinc_cuda[idHilo]);
neighborhood[v]=floor(xi/Bn);
ji=floor(yi/Bn);
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
{
if(tendbegin_cuda[idHilo]>30 && v==30){
if(t==0)
vector_trazo_cuda[20+idHilo*31]=0;
for(k=1;k<=15;k++)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
for(a=0;a<30;a++)
neighborhood[a]=neighborhood[a+1];
v=v-1;
}
v=v+1;
}
}
}
}
EDIT:
Changing the DP flops for SP flops only slightly improved the duration. Loop unrolling the inner loops practically didn't help.
Sorry for the unstructured answer, I'm just going to throw out some generally useful comments with references to your code to make this more useful to others.
Algorithm changes are always number one for optimizing. Is there another way to solve the problem that requires less math/iterations/memory etc.
If precision is not a big concern, use floating point (or half precision floating point with newer architectures). Part of the reason it didn't affect your performance much when you briefly tried is because you're still using double precision calculations on your floating point data (fabs takes double, so if you use with float, it converts your float to a double, does double math, returns a double and converts to float, use fabsf).
If you don't need to use the absolute full precision of float use fast math (compiler option).
Multiply is much faster than divide (especially for full precision/non-fast math). Calculate 1/var outside the kernel and then multiply instead of dividing inside kernel.
Don't know if it gets optimized out, but you should use increment and decrement operators. v=v-1; could be v--; etc.
Casting to an int will truncate toward zero. floor() will truncate toward negative infinite. you probably don't need explicit floor(), also, floorf() for float as above. when you use it for the intermediate computations on integer types, they're already truncated. So you're converting to double and back for no reason. Use the appropriately typed function (abs, fabs, fabsf, etc.)
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to
vector_trazo_cuda[20+idHilo*31] +=
fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
.
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];
The above line is needlessly complicated. In essence you are doing this,
convert t to double,
convert xinc_cuda to double and multiply,
floor it (returns double),
convert xb_cuda to double and add,
convert to long.
The new line will store the same result in much, much less time (also better because if you exceed the precision of double in the previous case, you would be rounding to a nearest power of 2). Also, those four lines should be outside the for loop...you don't need to recompute them if they don't depend on t. Together, i wouldn't be surprised if this cuts your run time by a factor of 10-30.
Your structure results in a lot of global memory reads, try to read once from global, handle calculations on local memory, and write once to global (if at all possible).
Compile with -lineinfo always. Makes profiling easier, and i haven't been able to assess any overhead whatsoever (using kernels in the 0.1 to 10ms execution time range).
Figure out with the profiler if you're compute or memory bound and devote time accordingly.
Try to allow the compiler use registers when possible, this is a big topic.
As always, don't change everything at once. I typed all this out with compiling/testing so i may have an error.
You may be running too many threads simultaneously. The optimum performance seems to come when you run the right number of threads: enough threads to keep busy, but not so many as to over-fragment the local memory available to each simultaneous thread.
Last fall I built a tutorial to investigate optimization of the Travelling Salesman problem (TSP) using CUDA with CUDAFY. The steps I went through in achieving a several-times speed-up from a published algorithm may be useful in guiding your endeavours, even though the problem domain is different. The tutorial and code is available at CUDA Tuning with CUDAFY.

CUDA Add Rows of a Matrix

I'm trying to add the rows of a 4800x9600 matrix together, resulting in a matrix 1x9600.
What I've done is split the 4800x9600 into 9,600 matrices of length 4800 each. I then perform a reduction on the 4800 elements.
The trouble is, this is really slow...
Anyone got any suggestions?
Basically, I'm trying to implement MATLAB's sum(...) function.
Here is the code which I've verified works fine, it's just it's really slow:
void reduceRows(Matrix Dresult,Matrix DA)
{
//split DA into chunks
Matrix Dchunk;
Dchunk.h=1;Dchunk.w=DA.h;
cudaMalloc((void**)&Dchunk.data,Dchunk.h*Dchunk.w*sizeof(float));
Matrix DcolSum;
DcolSum.h=1;DcolSum.w=1;
//cudaMalloc((void**)&DcolSum.data,DcolSum.h*DcolSum.w*sizeof(float));
int i;
for(i=0;i<DA.w;i++) //loop over each column
{
//printf("%d ",i);
cudaMemcpy(Dchunk.data,&DA.data[i*DA.h],DA.h*sizeof(float),cudaMemcpyDeviceToDevice);
DcolSum.data=&Dresult.data[i];
reduceTotal(DcolSum,Dchunk);
}
cudaFree(Dchunk.data);
}
Matrix is defined as:
typedef struct{
long w;
long h;
float* data;
}Matrix;
ReduceTotal() just calls the standard NVIDIA reduction, sums all the elements in Dchunk and puts the answer in DcolSum.
I'm about to do all this on the CPU if I can't find an answer... ;(
Many thanks in advance,
Instead of looping over each column, parallelize on the columns. Each of 4600 threads sums the 9600 entries in its column, and puts the sum in the appropriate place in the result vector.
If you're looking for a library to make working with Cuda simpler, I highly recommend Thrust: http://code.google.com/p/thrust/
Using Thrust, I would create a functor to hold your matrix's pointer in device memory, and then map it over a sequence of column indices. The operator() of the functor would take an index, sum up everything in that column of the matrix, and return the sum. Then you would have your sum sitting in a thrust::device_vector without any memory copies (or even direct CUDA calls).
Your functor might look something like:
struct ColumnSumFunctor {
const Matrix matrix;
// Make a functor to sum the matrix
ColumnSumFunctor(const Matrix& matrix);
// Compute and return the sum of the specified column
__device__
int operator()(const int& column) const;
};
Reduction is very basic operation in GPGPU, it's supposed to be fast, and 9600 times of reduction shouldn't be slow either.
What graphics card are you using?
I suggest you split it into 9600 arrays, each time you reduce an array of 4800 elements into one result. Instead of reduceTotal, I suggest you use CUDPP to perform the reduction operation, CUDPP is like the STL for CUDA. It's implemented with concern on performance.
http://code.google.com/p/cudpp/
I think your problem is that you are launching 9600X2 kernels. This should be an easy algorithm to express as a single kernel.
The most naive way to implement it would not coalesce memory, but it could well be faster than the way you are doing it now.
Once you've got the naive way working, then coalesce your memory reads: e.g. have every thread in a block read 16 consecutive floats into shared memory, syncthreads, then accumulate the relevant 16 floats into a register, synthreads, then repeat
The Computing SDK has lots of examples of reduction techniques.

Resources