Halide: total size for allocation Func is constant but larger than 2^31-1

I'm working on a Halide project that processes bursts of images. My initial data set is 9 bursts of 4208x3120-pixel images in uint16_t format. I got this constant_allocation_size error at runtime:
"Error:
Total size for allocation layer_0 is constant but exceeds 2^31 - 1.
Aborted (core dumped)".
where layer_0 is a down-sample function scheduled with compute_root(). The sample code looks like this:
Func my_process(Func input, ...) {
    // <do something here...>
    output.compute_root().parallel(y).vectorize(x, 16);
}
Func raw2gray(Func input, std::string name) {
    // ...
    compute_root();
}
Func imgs_mirror = BoundaryConditions::mirror_interior(imgs, 0, imgs.width(), 0, imgs.height());
Func layer_0 = raw2gray(imgs_mirror, "layer_0");
Func out_imgs = my_process(layer_0, ...);
Does anyone know how to shrink my constant allocation size or how to increase this 2^31-1 upper limit?

I figured it out myself. I'm posting my solution here in case anyone else comes across a similar issue.
The problem was caused by how I specified my function boundaries. Since I thought it would be sufficient to put BoundaryConditions on the input image, I didn't worry about boundaries in my later stages. However, since my later function processes the image hierarchically and manipulates pixel coordinates along the way, the required region traced back to the original image by bounds inference grew surprisingly large, which made the mirrored image balloon to several gigabytes.
After adding a clamp each time I manipulate coordinates, the problem is solved.
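For illustration, here is a minimal sketch of the fix, assuming a pyramid stage that offsets coordinates; the function name, offsets, and sizes below are made up for the example, not my actual code:

Halide::Func make_level(Halide::Func input, int width, int height) {
    Halide::Var x("x"), y("y");
    Halide::Func level("level");
    // Without these clamps, bounds inference traces 2*x + 1 back
    // through every level, and the mirrored input grows to gigabytes.
    Halide::Expr xi = Halide::clamp(2 * x + 1, 0, width - 1);
    Halide::Expr yi = Halide::clamp(2 * y + 1, 0, height - 1);
    level(x, y) = input(xi, yi);
    return level;
}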

If you really do need more than 2^31-1 elements, you can add Target::LargeBuffers to the compilation target you use when compiling your pipeline (JIT or AOT); this increases the limit to 2^63-1 elements.
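For the JIT case, a minimal sketch of adding the feature (assuming out_imgs is the pipeline's output Func and width/height are its extents):

// Enable 64-bit buffer indexing before realizing the pipeline.
Halide::Target t = Halide::get_jit_target_from_environment();
t = t.with_feature(Halide::Target::LargeBuffers);
Halide::Buffer<uint16_t> result = out_imgs.realize({width, height}, t);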

Related

Scheduling prefetch in a Halide RDom update stage

I've been trying to recreate a hand-tuned C function in Halide. It is a series of histograms computed over vertical scanlines of the source image, so I'm using a one-dimensional RDom to iterate over the source image.
RDom reductionY(0, input.height());
parade(x,y,c) = Halide::cast<uint16_t>(0);
parade(x, input(x, reductionY, c), c) += Halide::cast<uint16_t>(1);
To increase locality, I'm wrapping the RDom in another Func so I can schedule it with compute_at.
wrapper(x,y,c) = parade(x, y, c);
parade.update(0).reorder(c, reductionY, x);
parade.update(0).split(x, x_outer, x_inner, THREADWIDTH);
parade.compute_at(wrapper, x_outer);
This (plus some vectorization/parallelization I've stripped out for this question) closely matches my hand-tuned original. One thing the original benefits from that I can't figure out how to schedule is prefetching the first read of each vertical line from input in the update(0) stage. If I schedule
parade.update(0).prefetch(inputParam, x_inner, 3);
it seems to prefetch every pixel to be read. My hope is to issue a single prefetch for the first pixel to be read.
At first glance, the code you posted doesn't seem complete: parade is computed at the x_outer dimension of wrapper, but wrapper has never been split to create such a dimension. Seeing the exact code would help, and you may also find both print_loop_nest and compiling to a lowered statement file useful for seeing the exact structure and figuring out where you want the prefetch to be executed.
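For example, a minimal sketch of the missing split, reusing the question's names (THREADWIDTH and the x/y/c Vars are assumed to be in scope):

// Create the x_outer/x_inner loop levels on wrapper so that
// parade.compute_at(wrapper, x_outer) has a loop to anchor to.
Halide::Var x_outer("x_outer"), x_inner("x_inner");
wrapper.split(x, x_outer, x_inner, THREADWIDTH);
parade.compute_at(wrapper, x_outer);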
Quickly, though, I don't believe prefetches can be issued for only a subset of the used data—logically, they apply to the whole block of the data to be used at a given granularity. Do you observe poor performance due to prefetching the whole column rather than a single pixel? Explicitly prefetching a single pixel seems likely to help only insofar as it may trigger the hardware prefetcher to speculatively fetch the whole column.
If this is a case where a known-better approach is not representable in the current Halide model, however, you should share it with the halide-dev list or as an issue on GitHub with a simple reproducer for your target platform (x86?).

cuFFT runs slowly - any way to accelerate?

I am using cuFFT to calculate a 1D FFT along each row of a matrix and of an array. The matrix size is 512 (x) X 720 (y), and the size of the array is 512 X 1. That means the FFT is applied to each 512-element row, 720 times for the matrix and once for the array.
However, this operation turns out to be really slow, roughly one second. Is that normal, or is there any chance I can accelerate the code?
Here is my code (from NVIDIA sample code):
void FFTSinoKernel(cufftComplex* boneSinoF,
                   cufftComplex* kernelF,
                   int nChanDetX, // 512
                   int nView)     // 720
{
    cufftHandle plan;

    // FFT of the sinogram (batch of nView rows)
    cufftPlan1d(&plan, nChanDetX, CUFFT_C2C, nView);
    cufftExecC2C(plan, boneSinoF, boneSinoF, CUFFT_FORWARD);

    // FFT of the kernel; note this overwrites the first plan handle,
    // so the first plan is never destroyed
    cufftPlan1d(&plan, nChanDetX, CUFFT_C2C, 1);
    cufftExecC2C(plan, kernelF, kernelF, CUFFT_FORWARD);

    cufftDestroy(plan);
}
I was trying to use cufftExecR2C(), but I think that function has a bug: my DC component shifts by 1 or 2 units with each row, so I have filed a bug report. For now cufftExecC2C() gives me the right results, so I've decided to stick with it.
UPDATE:
Interestingly, I found that if I call this function again, it runs significantly faster, in less than 10 ms. So whenever cuFFT gets called the first time, it is slow; afterwards it becomes much faster. I don't understand why the first call is slow or how to avoid it. Has anyone had a similar experience? Thanks.
Move the FFT initialization (plan creation) outside of the performance-critical loop. The setup code has to allocate memory and compute O(N) transcendental functions, which can be much slower than the O(N log N) simple arithmetic ops inside the FFT computation itself. (The very first CUDA call in a process also pays a one-time context-initialization cost, which contributes to the slow first run.)
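A hedged sketch of that suggestion; the struct and function names below are illustrative, not from the original code. The plans are created once, reused on every call, and destroyed at shutdown:

struct FftPlans {
    cufftHandle sinoPlan;   // batched plan for the 720 matrix rows
    cufftHandle kernelPlan; // single-row plan for the array
};

void FftPlansInit(FftPlans* p, int nChanDetX, int nView) {
    // Done once at startup; this is the slow, allocating part.
    cufftPlan1d(&p->sinoPlan, nChanDetX, CUFFT_C2C, nView);
    cufftPlan1d(&p->kernelPlan, nChanDetX, CUFFT_C2C, 1);
}

void FFTSinoKernelFast(FftPlans* p, cufftComplex* boneSinoF, cufftComplex* kernelF) {
    // Only the transforms themselves run in the hot path.
    cufftExecC2C(p->sinoPlan, boneSinoF, boneSinoF, CUFFT_FORWARD);
    cufftExecC2C(p->kernelPlan, kernelF, kernelF, CUFFT_FORWARD);
}

void FftPlansDestroy(FftPlans* p) {
    cufftDestroy(p->sinoPlan);
    cufftDestroy(p->kernelPlan);
}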

OSX AudioUnit SMP

I'd like to know if anyone has experience writing a HAL AudioUnit render callback that takes advantage of multi-core processors and/or symmetric multiprocessing.
My scenario is the following:
A single audio component of sub-type kAudioUnitSubType_HALOutput (together with its render callback) takes care of additively synthesizing n sinusoidal partials with independent, individually varying, live-updated amplitude and phase values. In itself it is a rather straightforward brute-force nested-loop method (per partial, per frame, per channel).
However, upon reaching a certain upper limit for the number of partials "n", the processor gets overloaded and starts producing drop-outs, while three other processors remain idle.
Aside from the general observation that additive synthesis is "processor expensive" in comparison to, say, wavetable synthesis, I need to know whether this can be resolved the right way, that is, by taking advantage of multiprocessing on a multi-processor or multi-core machine. Breaking the render thread into sub-threads does not seem the right approach, since the render callback is already a time-constrained thread in itself, and the final output has to be sample-accurate in terms of latency. Has anyone had positive experience with valid methods of resolving such an issue?
System: 10.7.x
CPU: quad-core i7
Thanks in advance,
CA
This is challenging because OS X is not designed for something like this. There is a single audio thread; it's the highest-priority thread in the OS, and there's no way to create user threads at this priority (much less get the support of a team of systems engineers who tune it for performance, as with the audio render thread).
I don't claim to understand the particulars of your algorithm, but if it's possible to break it up such that some tasks can be performed in parallel on larger blocks of samples (enabling absorption of periods of occasional thread starvation), you certainly could spawn other high-priority threads that process in parallel. You'd need to use some kind of lock-free data structure to exchange samples between these threads and the audio thread. Convolution reverbs often do this to allow reasonable latency while still operating on huge block sizes. I'd look into how those are implemented...
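As an illustration of the lock-free exchange idea, here is a minimal single-producer/single-consumer ring-buffer sketch in C++; it is not from the original answer, just one common way such structures are built:

#include <atomic>
#include <cstddef>

// One producer thread (the worker) and one consumer (the audio
// thread); N must be a power of two.
template <size_t N>
class SpscRing {
    float buf_[N];
    std::atomic<size_t> head_{0}; // advanced by the producer
    std::atomic<size_t> tail_{0}; // advanced by the consumer
public:
    bool push(float v) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N)
            return false; // full; the producer must retry later
        buf_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(float* v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == t)
            return false; // empty; the consumer outputs silence instead
        *v = buf_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
};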
Have you looked into the Accelerate.framework? You should be able to improve the efficiency by performing operations on vectors instead of using nested for-loops.
If you have vectors (of length n) for the sinusoidal partials, the amplitude values, and the phase values, you could apply a vDSP_vadd or vDSP_vmul operation, then vDSP_sve.
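A hedged sketch of what that could look like for one output sample, assuming per-partial phase and amplitude arrays (the names here are invented); vvsinf is from the vForce part of Accelerate:

#include <Accelerate/Accelerate.h>

// Compute one frame: *outSample = sum over partials of sinf(phases[p]) * amps[p].
void render_sample(const float* phases, const float* amps,
                   float* scratch, int n, float* outSample) {
    vvsinf(scratch, phases, &n);                                 // scratch[p] = sinf(phases[p])
    vDSP_vmul(scratch, 1, amps, 1, scratch, 1, (vDSP_Length)n);  // scratch[p] *= amps[p]
    vDSP_sve(scratch, 1, outSample, (vDSP_Length)n);             // *outSample = sum(scratch)
}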
As far as I know, AU threading is handled by the host. A while back, I tried a few ways to multithread an AU render using various methods (GCD, OpenCL, etc.) and they were all either a no-go or unpredictable. There is (or at least WAS... I have not checked recently) a built-in AU called 'deferred renderer', I believe, which threads the input and output separately, but I seem to remember there was latency involved, so that might not help.
Also, if you are testing in AU Lab, I believe it is set up specifically to call on only a single thread (I think that is still the case), so you might need to tinker with another test host to see if it still chokes when the load is distributed.
Sorry I couldn't help more, but I thought those few bits of info might be helpful.
Sorry for replying to my own question; I don't know another way of adding relevant information. Edit doesn't seem to work, and a comment is way too short.
First of all, sincere thanks to jtomschroeder for pointing me to the Accelerate.framework.
This would work perfectly for so-called overlap/add resynthesis based on the IFFT. Yet I haven't found a key to vectorizing the kind of process I'm using, which is called "oscillator-bank resynthesis" and is notorious for taxing the processor (F. R. Moore: Elements of Computer Music). Each momentary phase and amplitude has to be interpolated "on the fly" and the last value stored in the control struct for further interpolation. The direction of time and the time stretch depend on live input. Not all partials exist all the time, and the placement of breakpoints is arbitrary and possibly irregular. Of course, my primary concern is organizing the data in a way that minimizes the number of math operations...
If someone could point me at an example of positive practice, I'd be very grateful.
// Here's the simplified code snippet:
OSStatus AdditiveRenderProc(void                       *inRefCon,
                            AudioUnitRenderActionFlags *ioActionFlags,
                            const AudioTimeStamp       *inTimeStamp,
                            UInt32                      inBusNumber,
                            UInt32                      inNumberFrames,
                            AudioBufferList            *ioData)
{
    // local variables' declaration and behaviour-setting conditional statements
    // some local variables are here for debugging convenience
    // {... ... ...}

    // Get the time-breakpoint parameters out of the gen struct
    AdditiveGenerator *gen = (AdditiveGenerator *)inRefCon;

    // compute interpolated values for each partial's each frame
    // {deltaf[p]... ampf[p][frame]... ...}

    // here comes the brute-force "processor eater" (single channel only!)
    Float32 *buf = (Float32 *)ioData->mBuffers[channel].mData;
    for (UInt32 frame = 0; frame < inNumberFrames; frame++) {
        buf[frame] = 0.;
        for (UInt32 p = 0; p < candidates; p++) {
            if (gen->partialFrequencyf[p] < NYQUISTF)
                buf[frame] += sinf(phasef[p]) * ampf[p][frame];
            phasef[p] += (gen->previousPartialPhaseIncrementf[p] + deltaf[p] * frame);
            if (phasef[p] > TWO_PI) phasef[p] -= TWO_PI;
        }
        buf[frame] *= ovampf[frame];
    }
    for (UInt32 p = 0; p < candidates; p++) {
        // store the updated parameters back to the gen struct
        // {... ... ...}
        ;
    }
    return noErr;
}

Does copying data to the input of FFTW waste a lot of time?

I have an array A of 512*1024*127 (row, column, page), and I want to do a 2D FFT on every page.
When I create an FFTW plan, for example:
fftwf_plan mFFTPalen = fftwf_plan_dft_r2c_2d(1024, 512, in, out, FFTW_ESTIMATE);
I want to use this one plan to do all 127 of the 2D FFTs. But that means I have to copy the data into the "in" array 127 times and copy the FFT result from the "out" array 127 times, which I think is a waste of time:
for (int plane = 0; plane < 127; plane++)
{
    // copy one 512x1024 page into the plan's input buffer
    memcpy(in, A + plane * 512 * 1024, sizeof(float) * 512 * 1024);
    fftwf_execute(mFFTPalen);
    // copy the r2c result back out
    memcpy(complexData, out, sizeof(float) * 513 * 512 * 2);
}
Can anyone tell me whether I'm doing this the right way?
Use fftwf_plan_many_dft_r2c (the real-to-complex variant of fftwf_plan_many_dft) from the Advanced Interface to save all this copying. Only do this if you are certain you have a performance problem, though; otherwise it's just a waste of effort.
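A hedged sketch of the advanced interface for this case, assuming A is a contiguous float array of 127 pages with the same (1024, 512) logical dimensions used in the plan above:

#include <fftw3.h>

const int n[2] = {1024, 512};            // dimensions of one 2D page
const int howmany = 127;                 // number of pages
const int icount = 1024 * 512;           // real input elements per page
const int ocount = 1024 * (512 / 2 + 1); // complex outputs per r2c page
fftwf_complex* out = fftwf_alloc_complex((size_t)howmany * ocount);
fftwf_plan plan = fftwf_plan_many_dft_r2c(
    2, n, howmany,
    A,   NULL, 1, icount,   // input: unit stride, pages icount apart
    out, NULL, 1, ocount,   // output: unit stride, pages ocount apart
    FFTW_ESTIMATE);
fftwf_execute(plan);        // all 127 transforms, no per-page memcpy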

Tips for improving performance of a 2d image 'tracing' CUDA kernel?

Can you give me some tips to optimize this CUDA code?
I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060, although I'm testing it now on a GTX 260, which has the same compute capability) and I have several kernels like the one below. The number of threads I need to execute this kernel is given by long SUM and depends on size_t M and size_t N, the dimensions of a rectangular image received as a parameter; these can vary greatly, from 50x50 to 10000x10000 pixels or more, although I'm mostly interested in handling the bigger images with CUDA.
Each image has to be traced in all directions and angles, and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229080 threads computing the kernel below, which is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays, all of length SUM, into the global memory of the device one after another, since I need to access them for the calculations, like this:
cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
...etc
So each value of every array can be accessed by one thread. All copies are done before the kernel calls. According to the CUDA profiler in Nsight, the longest memcpy takes 246.016 us for a 500x500 image, so the copies are not the problem.
But kernels like the one below are taking too long for any practical use (3.25 seconds according to the CUDA profiler for the kernel below on a 500x500 image, and 5.052 seconds for the kernel with the highest duration), so I need to see if I can optimize them.
I arrange the data this way: first the block dimension,
dim3 dimBlock(256,1,1);
then the number of blocks per grid,
dim3 dimGrid((SUM+255)/256);
which gives 895 blocks for a 500x500 image.
I'm not sure how to use coalescing or shared memory in my case, or even whether it's a good idea to call the kernel several times with different portions of the data. The data items are independent of one another, so in theory I could call the kernel several times rather than with all 229080 threads at once, if need be.
Now take into account that the outer for loop
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
depends on
tendbegin_cuda[idHilo]
the value of which depends on each thread, though most threads have similar values for it.
According to the CUDA profiler, the Global Store Efficiency is 0.619 and the Global Load Efficiency is 0.951 for this kernel; other kernels have similar values.
Is that good? Bad? How should I interpret those values? Sadly, devices of compute capability 1.3 don't provide other useful info for assessing the code, such as the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis are "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency", but I'm not sure how to optimize for those.
__global__ void t21_trazo(long SUM, int cT, double Bn, size_t M, size_t N,
                          float* imagen_cuda, double* vector_trazo_cuda,
                          long* xb_cuda, long* yb_cuda,
                          long* xinc_cuda, long* yinc_cuda,
                          long* tbegin_cuda, long* tendbegin_cuda)
{
    long xi;
    long yi;
    int t;
    int k;
    int a;
    int ji;
    long idHilo = blockIdx.x * blockDim.x + threadIdx.x;
    int neighborhood[31];
    int v = 0;

    if (idHilo < SUM) {
        for (t = 15; t <= tendbegin_cuda[idHilo] - 15; t++) {
            xi = xb_cuda[idHilo] + floor((double)t * xinc_cuda[idHilo]);
            yi = yb_cuda[idHilo] + floor((double)t * yinc_cuda[idHilo]);
            neighborhood[v] = floor(xi / Bn);
            ji = floor(yi / Bn);
            if (fabs((double)neighborhood[v]) < M && fabs((double)ji) < N) {
                if (tendbegin_cuda[idHilo] > 30 && v == 30) {
                    if (t == 0)
                        vector_trazo_cuda[20 + idHilo * 31] = 0;
                    for (k = 1; k <= 15; k++)
                        vector_trazo_cuda[20 + idHilo * 31] =
                            vector_trazo_cuda[20 + idHilo * 31] +
                            fabs(imagen_cuda[ji * M + (neighborhood[v - (15 + k)])] -
                                 imagen_cuda[ji * M + (neighborhood[v - (15 - k)])]);
                    for (a = 0; a < 30; a++)
                        neighborhood[a] = neighborhood[a + 1];
                    v = v - 1;
                }
                v = v + 1;
            }
        }
    }
}
EDIT:
Changing the DP flops to SP flops only slightly improved the duration, and unrolling the inner loops practically didn't help.
Sorry for the unstructured answer; I'm just going to throw out some generally useful comments with references to your code to make this more useful to others.
Algorithm changes are always number one for optimization. Is there another way to solve the problem that requires less math, fewer iterations, less memory, etc.?
If precision is not a big concern, use single-precision floating point (or half precision on newer architectures). Part of the reason it didn't affect your performance much when you briefly tried it is that you're still doing double-precision calculations on your floating-point data (fabs takes a double, so if you use it with a float, it converts your float to a double, does double math, returns a double, and converts back to float; use fabsf).
If you don't need the absolute full precision of float, use fast math (a compiler option).
Multiplication is much faster than division (especially at full precision/without fast math). Calculate 1/var outside the kernel and then multiply instead of dividing inside the kernel.
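For example (a sketch; invBn is a name invented here):

// Host side: compute the reciprocal once and pass it to the kernel.
double invBn = 1.0 / Bn;
// Device side: multiply instead of dividing on every iteration.
neighborhood[v] = (int)(xi * invBn);
ji = (int)(yi * invBn);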
I don't know if it gets optimized out, but you should use the increment and decrement operators: v=v-1; could be v--; etc.
Casting to an int truncates toward zero, while floor() truncates toward negative infinity; you probably don't need explicit floor() here (and for floats it's floorf(), as above). When you use these in intermediate computations on integer types, the values are already truncated, so you're converting to double and back for no reason. Use the appropriately typed function (abs, fabs, fabsf, etc.).
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to
vector_trazo_cuda[20+idHilo*31] +=
fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];
The above line is needlessly complicated. In essence you are doing this,
convert t to double,
convert xinc_cuda to double and multiply,
floor it (returns double),
convert xb_cuda to double and add,
convert to long.
The new line will store the same result in much, much less time (it's also better because, if you exceed the precision of double in the previous case, you would be rounding to a nearest power of 2). Also, those four lines should be outside the for loop... you don't need to recompute them if they don't depend on t. Together, I wouldn't be surprised if this cuts your run time by a factor of 10-30.
Your structure results in a lot of global memory reads; try to read once from global memory, do the calculations in local memory/registers, and write once to global (if at all possible).
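As a sketch of that read-once idea applied to this kernel (mirroring the question's names), hoisting the per-thread loads out of the t loop:

// Each global value is loaded once into a register instead of on
// every iteration of the t loop.
long xb   = xb_cuda[idHilo];
long yb   = yb_cuda[idHilo];
long xinc = xinc_cuda[idHilo];
long yinc = yinc_cuda[idHilo];
long tend = tendbegin_cuda[idHilo] - 15;
for (int t = 15; t <= tend; t++) {
    long xi = xb + t * xinc;
    long yi = yb + t * yinc;
    // ... rest of the loop body as before ...
}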
Always compile with -lineinfo. It makes profiling easier, and I haven't been able to measure any overhead whatsoever (using kernels in the 0.1 to 10 ms execution-time range).
Figure out with the profiler if you're compute or memory bound and devote time accordingly.
Try to allow the compiler to use registers when possible; this is a big topic.
As always, don't change everything at once. I typed all this out without compiling/testing, so I may have an error.
You may be running too many threads simultaneously. The optimum performance seems to come when you run the right number of threads: enough to keep the device busy, but not so many as to over-fragment the local memory available to each simultaneous thread.
Last fall I built a tutorial investigating optimization of the Travelling Salesman Problem (TSP) using CUDA with CUDAFY. The steps I went through to achieve a several-times speed-up over a published algorithm may be useful in guiding your endeavours, even though the problem domain is different. The tutorial and code are available at CUDA Tuning with CUDAFY.
