openCL reduction, and passing 2d array - parallel-processing

Here is the loop I want to convert to openCL.
for(n=0; n < LargeNumber; ++n) {
for (n2=0; n2< SmallNumber; ++n2) {
A[n]+=B[n2][n];
}
Re+=A[n];
}
And here is what I have so far, although, I know it is not correct and missing some things.
__kernel void openCL_Kernel( __global int *A,
__global int **B,
__global int *C,
__global _int64 Re,
int D)
{
int i=get_global_id(0);
int ii=get_global_id(1);
A[i]+=B[ii][i];
//barrier(..); ?
Re+=A[i];
}
I'm a complete beginner to this type of thing. First of all I know that I can't pass a global double pointer to an openCL kernel. If you can, wait a few days or so before posting the solution, I want to figure this out for myself, but if you can help point me in the right direction I would be grateful.

Concerning your problem with passing doublepointers: That kind of problem is typically solved by copying the whole matrix (or whatever you are working on) into one continous block of memory and, if the blocks have different lengths passing another array, which contains the offsets for the individual rows ( so your access would look something like B[index[ii]+i]).
Now for your reduction down to Re: since you didn't mention what kind of device you are working on I'm going to assume its GPU. In that case I would avoid doing the reduction in the same kernel, since its going to be slow as hell the way you posted it (you would have to serialize the access to Re over thousands of threads (and the access to A[i] too).
Instead I would write want kernel, which sums all B[*][i] into A[i] and put the reduction from A into Re in another kernel and do it in several steps, that is you use a reduction kernel which operates on n element and reduces them to something like n / 16 (or any other number). Then you iteratively call that kernel until you are down to one element, which is your result (I'm making this description intentionally vague, since you said you wanted to figure thinks out yourself).
As a sidenote: You realize that the original code doesn't exactly have a nice memory access pattern? Assuming B is relatively large (and much larger then A due to the second dimension) having the inner loop iterate over the outer index is going to create a lot of cachemisses. This is even worse when porting to the gpu, which is very sensitive about coherent memory access
So reordering it like this may massively increase performance:
for (n2=0; n2< SmallNumber; ++n2)
for(n=0; n < LargeNumber; ++n)
A[n]+=B[n2][n];
for(n=0; n < LargeNumber; ++n)
Re+=A[n];
This is particulary true if you have a compiler which is good at autovectorization, since it might be able to vectorize that construct, but it's very unlikely to be able to do so for the original code (and if it can't prove that A and B[n2] can't refer to the same memory it can't turn the original code into this).

Related

Problem with maximum index value of a vector [duplicate]

Why does the following work? I thought that writing to an an index of a vector object beyond the end of the vector object would cause a segmentation fault.
#include <iostream>
#include <vector>
using namespace std;
int main() {
vector<int> x(1);
x[10] = 1;
cout << x[10] << endl;
}
What are the implications of this? Is there a safer way to initialize a vector of exactly n elements and write only to those? Should I always use push_back()?
Somebody implementing std::vector might easily decide to give it a minimum size of 10 or 20 elements or so, on the theory that the memory manager likely has a large enough minimum allocation size that it will use (about) the same amount of memory anyway.
As far as avoiding reading/writing past the end, one possibility is to avoid using indexing whenever possible, and using .at() to do indexing when you truly can't avoid it.
I find that I can usually avoid doing indexing at all by using range-based for loops and/or standard algorithms for most tasks. For a trivial example, consider something like this:
int main() {
vector<int> x(1);
x.push_back(1);
for (auto i : x)
cout << i << "\n";
}
.at() does work, but I rarely find it useful or necessary--I suspect I use it less than once a year on average.
So under the covers what actually happens when you try to address an element of an array or vector that is outside the container's bounds is a memory dereference to a piece of memory that is not part of the container. However, reads and writes to this location can "work" or appear to work because it is just more memory that you are reading/writing to. You are doing something very very bad. You will generally see random junk when accessing memory outside of your bounds because it can belong to something else or is leftovers from a previous process because the memory controller will not naturally zero out memory on its own. So the best practice is to use the size of the container to check your reads and writes if you are ever in doubt as to if you are outside the bounds of the container. You can use vector.size() to find the current size of the container.

Max size in array 2-dimentional in C++

I want to execute large computational program in 3 and 2 dimension with size of array[40000][40000] or more ,this code can explain my problem a bit,I comment vector because it have same problem when I run it it goes to lib of vector, how to increase memory of compiler or delete(clean) some part of it when program running?
#include<iostream>
#include<cstdlib>
#include<vector>
using namespace std;
int main(){
float array[40000][40000];
//vector< vector<double> > array(1000,1000);
cout<<"bingo"<<endl;
return 0;
}
A slightly better option than vector (and far better than vector-of-vector1), which like vector, uses dynamic allocation for the contents (and therefore doesn't overflow the stack), but doesn't invite resizing:
std::unique_ptr<float[][40000]> array{ new float[40000][40000] };
Conveniently, float[40000][40000] still appears, making it fairly obvious what is going on here even to a programmer unfamiliar with incomplete array types.
1 vector<vector<T> > is very bad, since it would have many different allocations, which all have to be separately initialized, and the resulting storage would be discontiguous. Slightly better is a combination of vector<T> with vector<T*>, with the latter storing pointers created one row apart into a single large buffer managed by the former.

OSX AudioUnit SMP

I'd like to know if someone has experience in writing a HAL AudioUnit rendering callback taking benefits of multi-core processors and/or symmetric multiprocessing?
My scenario is the following:
A single audio component of sub-type kAudioUnitSubType_HALOutput (together with its rendering callback) takes care of additively synthesizing n sinusoid partials with independent individually varying and live-updated amplitude and phase values. In itself it is a rather straightforward brute-force nested loop method (per partial, per frame, per channel).
However, upon reaching a certain upper limit for the number of partials "n", the processor gets overloaded and starts producing drop-outs, while three other processors remain idle.
Aside from general discussion about additive synthesis being "processor expensive" in comparison to let's say "wavetable", I need to know if this can be resolved right way, which involves taking advantage of multiprocessing on a multi-processor or multi-core machine? Breaking the rendering thread into sub-threads does not seem the right way, since the render callback is already a time-constraint thread in itself, and the final output has to be sample-acurate in terms of latency. Has someone had positive experience and valid methods in resolving such an issue?
System: 10.7.x
CPU: quad-core i7
Thanks in advance,
CA
This is challenging because OS X is not designed for something like this. There is a single audio thread - it's the highest priority thread in the OS, and there's no way to create user threads at this priority (much less get the support of a team of systems engineers who tune it for performance, as with the audio render thread). I don't claim to understand the particulars of your algorithm, but if it's possible to break it up such that some tasks can be performed in parallel on larger blocks of samples (enabling absorption of periods of occasional thread starvation), you certainly could spawn other high priority threads that process in parallel. You'd need to use some kind of lock-free data structure to exchange samples between these threads and the audio thread. Convolution reverbs often do this to allow reasonable latency while still operating on huge block sizes. I'd look into how those are implemented...
Have you looked into the Accelerate.framework? You should be able to improve the efficiency by performing operations on vectors instead of using nested for-loops.
If you have vectors (of length n) for the sinusoidal partials, the amplitude values, and the phase values, you could apply a vDSP_vadd or vDSP_vmul operation, then vDSP_sve.
As far as I know, AU threading is handled by the host. A while back, I tried a few ways to multithread an AU render using various methods, (GCD, openCL, etc) and they were all either a no-go OR unpredictable. There is (or at leas WAS... i have not checked recently) a built in AU called 'deferred renderer' I believe, and it threads the input and output separately, but I seem to remember that there was latency involved, so that might not help.
Also, If you are testing in AULab, I believe that it is set up specifically to only call on a single thread (I think that is still the case), so you might need to tinker with another test host to see if it still chokes when the load is distributed.
Sorry I couldn't help more, but I thought those few bits of info might be helpful.
Sorry for replying my own question, I don't know the way of adding some relevant information otherwise. Edit doesn't seem to work, comment is way too short.
First of all, sincere thanks to jtomschroeder for pointing me to the Accelerate.framework.
This would perfectly work for so called overlap/add resynthesis based on IFFT. Yet I haven't found a key to vectorizing the kind of process I'm using which is called "oscillator-bank resynthesis", and is notorious for its processor taxing (F.R. Moore: Elements of Computer Music). Each momentary phase and amplitude has to be interpolated "on the fly" and last value stored into the control struct for further interpolation. Direction of time and time stretch depend on live input. All partials don't exist all the time, placement of breakpoints is arbitrary and possibly irregular. Of course, my primary concern is organizing data in a way to minimize the number of math operations...
If someone could point me at an example of positive practice, I'd be very grateful.
// Here's the simplified code snippet:
OSStatus AdditiveRenderProc(
void *inRefCon,
AudioUnitRenderActionFlags *ioActionFlags,
const AudioTimeStamp *inTimeStamp,
UInt32 inBusNumber,
UInt32 inNumberFrames,
AudioBufferList *ioData)
{
// local variables' declaration and behaviour-setting conditional statements
// some local variables are here for debugging convenience
// {... ... ...}
// Get the time-breakpoint parameters out of the gen struct
AdditiveGenerator *gen = (AdditiveGenerator*)inRefCon;
// compute interpolated values for each partial's each frame
// {deltaf[p]... ampf[p][frame]... ...}
//here comes the brute-force "processor eater" (single channel only!)
Float32 *buf = (Float32 *)ioData->mBuffers[channel].mData;
for (UInt32 frame = 0; frame < inNumberFrames; frame++)
{
buf[frame] = 0.;
for(UInt32 p = 0; p < candidates; p++){
if(gen->partialFrequencyf[p] < NYQUISTF)
buf[frame] += sinf(phasef[p]) * ampf[p][frame];
phasef[p] += (gen->previousPartialPhaseIncrementf[p] + deltaf[p]*frame);
if (phasef[p] > TWO_PI) phasef[p] -= TWO_PI;
}
buf[frame] *= ovampf[frame];
}
for(UInt32 p = 0; p < candidates; p++){
//store the updated parameters back to the gen struct
//{... ... ...}
;
}
return noErr;
}

Tips for improving performance of a 2d image 'tracing' CUDA kernel?

Can you give me some tips to optimize this CUDA code?
I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060 although I'm testing it now on a GTX 260 which has the same compute capability) and I have several kernels like the one below. The number of threads I need to execute this kernel is given by long SUM and depends on size_t M and size_t N which are the dimensions of a rectangular image received as parameter it can vary greatly from 50x50 to 10000x10000 in pixels or more. Although I'm mostly interested in working the bigger images with Cuda.
Now each image has to be traced in all directions and angles and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229080 threads computing that kernel below which is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays into the global memory of the device one after another since I need to access them for the calculations all of length SUM. Like this
cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
...etc
So each value of every array can be accessed by one thread. All are done before the kernel calls. According to the Cuda Profiler on Nsight the highest memcopy duration is 246.016 us for a 500x500 image so that is not taking so long.
But the kernels like the one I copied below are taking too long for any practical use (3.25 seconds according to the Cuda profiler for the kernel below for a 500x500 image and 5.052 seconds for the kernel with the highest duration) so I need to see if I can optimize them.
I arrange the data this way
First the block dimension
dim3 dimBlock(256,1,1);
then the number of blocks per Grid
dim3 dimGrid((SUM+255)/256);
For a number of 895 blocks for a 500x500 image.
I'm not sure how to use coalescing and shared memory in my case or even if it's a good idea to call the kernel several times with different portions of the data. The data is independent one from the other so I could in theory call that kernel several times and not with the 229080 threads all at once if needs be.
Now take into account that the outer for loop
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
depends on
tendbegin_cuda[idHilo]
the value of which depends on each thread but most threads have similar values for it.
According to the Cuda Profiler the Global Store Efficiency is of 0.619 and the Global Load Efficiency is 0.951 for this kernel. Other kernels have similar values .
Is that good? bad? how can I interpret those values? Sadly the devices of compute capability 1.3 don't provide other useful info for assessing the code like the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis is "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency" but I'm not sure how I can optimize those.
void __global__ t21_trazo(long SUM,int cT, double Bn, size_t M, size_t N, float* imagen_cuda, double* vector_trazo_cuda, long* xb_cuda, long* yb_cuda, long* xinc_cuda, long* yinc_cuda, long* tbegin_cuda, long* tendbegin_cuda){
long xi;
long yi;
int t;
int k;
int a;
int ji;
long idHilo=blockIdx.x*blockDim.x+threadIdx.x;
int neighborhood[31];
int v=0;
if(idHilo<SUM){
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
yi = yb_cuda[idHilo] + floor((double)t*yinc_cuda[idHilo]);
neighborhood[v]=floor(xi/Bn);
ji=floor(yi/Bn);
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
{
if(tendbegin_cuda[idHilo]>30 && v==30){
if(t==0)
vector_trazo_cuda[20+idHilo*31]=0;
for(k=1;k<=15;k++)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
for(a=0;a<30;a++)
neighborhood[a]=neighborhood[a+1];
v=v-1;
}
v=v+1;
}
}
}
}
EDIT:
Changing the DP flops for SP flops only slightly improved the duration. Loop unrolling the inner loops practically didn't help.
Sorry for the unstructured answer, I'm just going to throw out some generally useful comments with references to your code to make this more useful to others.
Algorithm changes are always number one for optimizing. Is there another way to solve the problem that requires less math/iterations/memory etc.
If precision is not a big concern, use floating point (or half precision floating point with newer architectures). Part of the reason it didn't affect your performance much when you briefly tried is because you're still using double precision calculations on your floating point data (fabs takes double, so if you use with float, it converts your float to a double, does double math, returns a double and converts to float, use fabsf).
If you don't need to use the absolute full precision of float use fast math (compiler option).
Multiply is much faster than divide (especially for full precision/non-fast math). Calculate 1/var outside the kernel and then multiply instead of dividing inside kernel.
Don't know if it gets optimized out, but you should use increment and decrement operators. v=v-1; could be v--; etc.
Casting to an int will truncate toward zero. floor() will truncate toward negative infinite. you probably don't need explicit floor(), also, floorf() for float as above. when you use it for the intermediate computations on integer types, they're already truncated. So you're converting to double and back for no reason. Use the appropriately typed function (abs, fabs, fabsf, etc.)
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to
vector_trazo_cuda[20+idHilo*31] +=
fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
.
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];
The above line is needlessly complicated. In essence you are doing this,
convert t to double,
convert xinc_cuda to double and multiply,
floor it (returns double),
convert xb_cuda to double and add,
convert to long.
The new line will store the same result in much, much less time (also better because if you exceed the precision of double in the previous case, you would be rounding to a nearest power of 2). Also, those four lines should be outside the for loop...you don't need to recompute them if they don't depend on t. Together, i wouldn't be surprised if this cuts your run time by a factor of 10-30.
Your structure results in a lot of global memory reads, try to read once from global, handle calculations on local memory, and write once to global (if at all possible).
Compile with -lineinfo always. Makes profiling easier, and i haven't been able to assess any overhead whatsoever (using kernels in the 0.1 to 10ms execution time range).
Figure out with the profiler if you're compute or memory bound and devote time accordingly.
Try to allow the compiler use registers when possible, this is a big topic.
As always, don't change everything at once. I typed all this out with compiling/testing so i may have an error.
You may be running too many threads simultaneously. The optimum performance seems to come when you run the right number of threads: enough threads to keep busy, but not so many as to over-fragment the local memory available to each simultaneous thread.
Last fall I built a tutorial to investigate optimization of the Travelling Salesman problem (TSP) using CUDA with CUDAFY. The steps I went through in achieving a several-times speed-up from a published algorithm may be useful in guiding your endeavours, even though the problem domain is different. The tutorial and code is available at CUDA Tuning with CUDAFY.

CUDA Add Rows of a Matrix

I'm trying to add the rows of a 4800x9600 matrix together, resulting in a matrix 1x9600.
What I've done is split the 4800x9600 into 9,600 matrices of length 4800 each. I then perform a reduction on the 4800 elements.
The trouble is, this is really slow...
Anyone got any suggestions?
Basically, I'm trying to implement MATLAB's sum(...) function.
Here is the code which I've verified works fine, it's just it's really slow:
void reduceRows(Matrix Dresult,Matrix DA)
{
//split DA into chunks
Matrix Dchunk;
Dchunk.h=1;Dchunk.w=DA.h;
cudaMalloc((void**)&Dchunk.data,Dchunk.h*Dchunk.w*sizeof(float));
Matrix DcolSum;
DcolSum.h=1;DcolSum.w=1;
//cudaMalloc((void**)&DcolSum.data,DcolSum.h*DcolSum.w*sizeof(float));
int i;
for(i=0;i<DA.w;i++) //loop over each column
{
//printf("%d ",i);
cudaMemcpy(Dchunk.data,&DA.data[i*DA.h],DA.h*sizeof(float),cudaMemcpyDeviceToDevice);
DcolSum.data=&Dresult.data[i];
reduceTotal(DcolSum,Dchunk);
}
cudaFree(Dchunk.data);
}
Matrix is defined as:
typedef struct{
long w;
long h;
float* data;
}Matrix;
ReduceTotal() just calls the standard NVIDIA reduction, sums all the elements in Dchunk and puts the answer in DcolSum.
I'm about to do all this on the CPU if I can't find an answer... ;(
Many thanks in advance,
Instead of looping over each column, parallelize on the columns. Each of 4600 threads sums the 9600 entries in its column, and puts the sum in the appropriate place in the result vector.
If you're looking for a library to make working with Cuda simpler, I highly recommend Thrust: http://code.google.com/p/thrust/
Using Thrust, I would create a functor to hold your matrix's pointer in device memory, and then map it over a sequence of column indices. The operator() of the functor would take an index, sum up everything in that column of the matrix, and return the sum. Then you would have your sum sitting in a thrust::device_vector without any memory copies (or even direct CUDA calls).
Your functor might look something like:
struct ColumnSumFunctor {
const Matrix matrix;
// Make a functor to sum the matrix
ColumnSumFunctor(const Matrix& matrix);
// Compute and return the sum of the specified column
__device__
int operator()(const int& column) const;
};
Reduction is very basic operation in GPGPU, it's supposed to be fast, and 9600 times of reduction shouldn't be slow either.
What graphics card are you using?
I suggest you split it into 9600 arrays, each time you reduce an array of 4800 elements into one result. Instead of reduceTotal, I suggest you use CUDPP to perform the reduction operation, CUDPP is like the STL for CUDA. It's implemented with concern on performance.
http://code.google.com/p/cudpp/
I think your problem is that you are launching 9600X2 kernels. This should be an easy algorithm to express as a single kernel.
The most naive way to implement it would not coalesce memory, but it could well be faster than the way you are doing it now.
Once you've got the naive way working, then coalesce your memory reads: e.g. have every thread in a block read 16 consecutive floats into shared memory, syncthreads, then accumulate the relevant 16 floats into a register, synthreads, then repeat
The Computing SDK has lots of examples of reduction techniques.

Resources