CUDA Add Rows of a Matrix

I'm trying to add the rows of a 4800x9600 matrix together, resulting in a 1x9600 matrix.
What I've done is split the 4800x9600 matrix into 9,600 vectors of 4800 elements each; I then perform a reduction on each one.
The trouble is, this is really slow...
Anyone got any suggestions?
Basically, I'm trying to implement MATLAB's sum(...) function.
Here is the code, which I've verified works correctly; it's just really slow:
void reduceRows(Matrix Dresult, Matrix DA)
{
    //split DA into chunks
    Matrix Dchunk;
    Dchunk.h=1; Dchunk.w=DA.h;
    cudaMalloc((void**)&Dchunk.data, Dchunk.h*Dchunk.w*sizeof(float));

    Matrix DcolSum;
    DcolSum.h=1; DcolSum.w=1;
    //cudaMalloc((void**)&DcolSum.data, DcolSum.h*DcolSum.w*sizeof(float));

    int i;
    for(i=0; i<DA.w; i++) //loop over each column
    {
        //printf("%d ",i);
        cudaMemcpy(Dchunk.data, &DA.data[i*DA.h], DA.h*sizeof(float), cudaMemcpyDeviceToDevice);
        DcolSum.data = &Dresult.data[i];
        reduceTotal(DcolSum, Dchunk);
    }
    cudaFree(Dchunk.data);
}
Matrix is defined as:
typedef struct {
    long w;
    long h;
    float* data;
} Matrix;
reduceTotal() just calls the standard NVIDIA reduction; it sums all the elements in Dchunk and puts the answer in DcolSum.
I'm about to do all this on the CPU if I can't find an answer... ;(
Many thanks in advance,

Instead of looping over each column, parallelize over the columns: each of the 9600 threads sums the 4800 entries in its column and puts the sum in the appropriate place in the result vector (see the sketch below).
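A minimal sketch of that idea (my illustration, not the answerer's code), assuming the column-major layout from the question, i.e. element (row, col) of DA lives at data[col*h + row]:

// One thread per column: thread 'col' sums the 4800 entries of its column.
// Note: with this layout the per-thread reads are strided, so this is the
// "naive" version; a coalesced variant appears further down the thread.
__global__ void sumColumns(const float* data, float* result, long h, long w)
{
    long col = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (col >= w) return;

    float sum = 0.0f;
    for (long row = 0; row < h; ++row)
        sum += data[col * h + row];
    result[col] = sum;
}

// Launch, covering all 9600 columns:
// sumColumns<<<(9600 + 255) / 256, 256>>>(DA.data, Dresult.data, 4800, 9600);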
If you're looking for a library to make working with Cuda simpler, I highly recommend Thrust: http://code.google.com/p/thrust/
Using Thrust, I would create a functor to hold your matrix's pointer in device memory, and then map it over a sequence of column indices. The operator() of the functor would take an index, sum up everything in that column of the matrix, and return the sum. Then you would have your sum sitting in a thrust::device_vector without any memory copies (or even direct CUDA calls).
Your functor might look something like:
struct ColumnSumFunctor {
    const Matrix matrix;

    // Make a functor to sum the matrix
    ColumnSumFunctor(const Matrix& matrix);

    // Compute and return the sum of the specified column
    __device__
    float operator()(const int& column) const;
};
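A hedged sketch of how the functor body and the call site might look (my elaboration of the above, not Thrust's or the answerer's code; it assumes the Matrix struct from the question with column-major storage, and returns float since the matrix holds floats):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

// Matrix struct as defined in the question; data points to device memory and
// element (row, col) lives at data[col * h + row].
typedef struct { long w; long h; float* data; } Matrix;

struct ColumnSumFunctor {
    Matrix matrix;

    ColumnSumFunctor(const Matrix& m) : matrix(m) {}

    // Compute and return the sum of the specified column.
    __device__
    float operator()(const long& column) const {
        float sum = 0.0f;
        for (long row = 0; row < matrix.h; ++row)
            sum += matrix.data[column * matrix.h + row];
        return sum;
    }
};

// Map the functor over the column indices 0..w-1; the sums land in a
// thrust::device_vector without any extra memory copies.
void sumAllColumns(const Matrix& dA, thrust::device_vector<float>& colSums)
{
    colSums.resize(dA.w);
    thrust::transform(thrust::counting_iterator<long>(0),
                      thrust::counting_iterator<long>(dA.w),
                      colSums.begin(),
                      ColumnSumFunctor(dA));
}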

Reduction is a very basic operation in GPGPU; it's supposed to be fast, and doing 9600 reductions shouldn't be slow either.
What graphics card are you using?
I suggest you split it into 9600 arrays and each time reduce an array of 4800 elements to one result. Instead of reduceTotal, I suggest you use CUDPP to perform the reduction; CUDPP is like the STL for CUDA, and it's implemented with performance in mind.
http://code.google.com/p/cudpp/

I think your problem is that you are launching 9600 x 2 kernels. This should be an easy algorithm to express as a single kernel.
The most naive way to implement it would not coalesce memory, but it could well be faster than the way you are doing it now.
Once you've got the naive way working, coalesce your memory reads: e.g. have every thread in a block read 16 consecutive floats into shared memory, syncthreads, then accumulate the relevant 16 floats into a register, syncthreads, then repeat.
The GPU Computing SDK has lots of examples of reduction techniques.
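For the column-major layout in the question there is also a simpler coalesced pattern than the 16-float tiling sketched above (this is my own sketch, not the answerer's scheme): give each column to one block, let the threads stride through the column's contiguous 4800 floats, and finish with a shared-memory block reduction.

__global__ void sumColumnsCoalesced(const float* data, float* result, long h)
{
    extern __shared__ float partial[];        // blockDim.x floats
    long col = blockIdx.x;                    // one block per column

    // Threads of the block read consecutive (coalesced) elements of the column.
    float sum = 0.0f;
    for (long row = threadIdx.x; row < h; row += blockDim.x)
        sum += data[col * h + row];

    partial[threadIdx.x] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction (blockDim.x must be a power of two).
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        result[col] = partial[0];
}

// Launch: one block per column, e.g.
// sumColumnsCoalesced<<<9600, 256, 256 * sizeof(float)>>>(DA.data, Dresult.data, 4800);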

Related

Is there any efficient way to compact a sparse array in OpenCL / CUDA? [duplicate]

I'm trying to construct a parallel algorithm with CUDA that takes an array of integers and removes all of the 0's with or without keeping the order.
Example:
Global Memory: {0, 0, 0, 0, 14, 0, 0, 17, 0, 0, 0, 0, 13}
Host Memory Result: {17, 13, 14, 0, 0, ...}
The simplest way is to use the host to remove the 0's in O(n) time. But considering I have around 1000 elements, it probably will be faster to leave everything on the GPU and condense it first, before sending it.
The preferred method would be to create an on-device stack, such that each thread can pop and push (in any order) onto or off of the stack. However, I don't think CUDA has an implementation of this.
An equivalent (but much slower) method would be to keep attempting to write, until all threads have finished writing:
__global__ void kernelRemoveSpacing(int *array, int *outArray, int arraySize) {
    if (array[threadIdx.x] == 0)
        return;
    for (int i = 0; i < arraySize; i++) {
        outArray[i] = array[threadIdx.x];
        __threadfence();
        // If we were the lucky thread we won!
        // kill the thread and continue re-incarnated in a different slot
        if (outArray[i] == array[threadIdx.x])
            return;
    }
}
This method's only benefit is that it runs in O(f(x)) time, where f(x) is the average number of non-zero values in the array (f(x) ~= ln(n) for my implementation, thus O(ln(n)) time, but with a high constant).
Finally, a sorting algorithm such as quicksort or mergesort would also solve the problem, and in fact runs in O(ln(n)) relative time. I think there might be an even faster algorithm, since we do not need to waste time ordering (swapping) zero-zero element pairs and non-zero/non-zero element pairs (the order does not need to be kept).
So I'm not quite sure which method would be the fastest, and I still
think there's a better way of handling this. Any suggestions?
What you are asking for is a classic parallel algorithm called stream compaction (1).
If Thrust is an option, you may simply use thrust::copy_if. This is a stable algorithm: it preserves the relative order of all elements.
Rough sketch:
#include <thrust/copy.h>

template<typename T>
struct is_non_zero {
    __host__ __device__
    auto operator()(T x) const -> bool {
        return x != 0;
    }
};

// ... your input and output vectors here
// (output must already be large enough; copy_if returns an iterator one past
//  the last element it wrote)
thrust::copy_if(input.begin(), input.end(), output.begin(), is_non_zero<int>());
If Thrust is not an option, you may implement stream compaction yourself (there is plenty of literature on the topic). It's a fun and reasonably simple exercise, while also being a basic building block for more complex parallel primitives.
(1) Strictly speaking, it's not exactly stream compaction in the traditional sense, as stream compaction is traditionally a stable algorithm but your requirements do not include stability. This relaxed requirement could perhaps lead to a more efficient implementation?
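Along the lines of that footnote, here is a minimal sketch (my own illustration, not from either answer) of a hand-rolled, non-stable compaction: every thread holding a non-zero value grabs an output slot with atomicAdd. It does not preserve order, but it is simple and often adequate for small inputs like the ~1000 elements mentioned in the question.

__global__ void compactNonZero(const int* in, int* out, int* outCount, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] != 0) {
        int slot = atomicAdd(outCount, 1);  // claim the next free output slot
        out[slot] = in[i];                  // survivor order is not preserved
    }
}
// outCount must be zeroed before the launch; afterwards it holds the number of
// non-zero elements copied to out.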
Stream compaction is a well-known problem for which a lot of code has been written (Thrust and Chagg, to cite two libraries that implement stream compaction on CUDA).
If you have a relatively new CUDA-capable device that supports intrinsic functions such as __ballot (compute capability >= 3.0), it is worth trying a small CUDA procedure that performs stream compaction much faster than Thrust.
You can find the code and a minimal doc here:
https://github.com/knotman90/cuStreamComp
It uses the ballot intrinsic in a single-kernel fashion to perform the compaction.
Edit:
I wrote an article explaining the inner workings of this approach. You can find it here if you are interested.
With this answer, I'm only trying to provide more details to Davide Spataro's approach.
As you mentioned, stream compaction consists of removing undesired elements in a collection depending on a predicate. For example, considering an array of integers and the predicate p(x)=x>5, the array A={6,3,2,11,4,5,3,7,5,77,94,0} is compacted to B={6,11,7,77,94}.
The general idea of stream compaction approaches is that each computational thread is assigned to a different element of the array to be compacted. Each such thread must decide whether to write its corresponding element to the output array, depending on whether it satisfies the relevant predicate. The main problem of stream compaction is thus letting each thread know at which position its corresponding element must be written in the output array.
The approach in [1,2] is an alternative to Thrust's copy_if mentioned above and consists of three steps:
Step #1. Let P be the number of launched threads and N, with N > P, the size of the vector to be compacted. The input vector is divided into sub-vectors of size S equal to the block size. The __syncthreads_count(pred) block intrinsic is exploited, which counts the number of threads in a block satisfying the predicate pred. As a result of the first step, each element of the array d_BlockCounts, which has size N/P, contains the number of elements meeting the predicate pred in the corresponding block.
Step #2. An exclusive scan operation is performed on the array d_BlockCounts. As a result of the second step, each thread knows how many elements the previous blocks write to the output. Accordingly, it knows the position where to write its corresponding element, up to an offset related to its own block.
Step #3. Each thread computes the mentioned offset using warp intrinsic functions and eventually writes to the output array. It should be noted that the execution of step #3 is related to warp scheduling. As a consequence, the order of the elements in the output array does not necessarily reflect the order of the elements in the input array.
Of the three steps above, the second is performed by CUDA Thrust’s exclusive_scan primitive and is computationally significantly less demanding than the other two.
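As an illustration of step #1 only, here is a hedged sketch (my reconstruction, not the authors' code) of the per-block counting kernel built on __syncthreads_count; the kernel and variable names are assumptions.

// Each block counts how many of its elements satisfy the predicate
// (here: "is non-zero") and writes that count to d_BlockCounts[blockIdx.x].
__global__ void blockCountKernel(const int* d_in, int* d_BlockCounts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (d_in[i] != 0);
    int count = __syncthreads_count(pred);   // block-wide count, returned to every thread
    if (threadIdx.x == 0)
        d_BlockCounts[blockIdx.x] = count;
}
// Step #2 would then be an exclusive scan over d_BlockCounts (e.g. with
// thrust::exclusive_scan), and step #3 computes per-thread offsets with warp
// intrinsics such as __ballot and __popc before writing the output.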
For an array of 2097152 elements, the mentioned approach executed in 0.38 ms on an NVIDIA GTX 960 card, in contrast to 1.0 ms for CUDA Thrust's copy_if. The mentioned approach appears to be faster for two reasons:
1) It is specifically tailored to cards supporting warp intrinsic functions;
2) The approach does not guarantee the output ordering.
It should be noted that we have also tested the approach against the code available at inkc.sourceforge.net. Although the latter code is arranged in a single kernel call (it does not employ any CUDA Thrust primitive), it does not perform better than the three-kernel version.
The full code is available here and is slightly optimized as compared to the original Davide Spataro's routine.
[1] M. Billeter, O. Olsson, U. Assarsson, “Efficient stream compaction on wide SIMD many-core architectures,” Proc. of the Conf. on High Performance Graphics, New Orleans, LA, Aug. 1-3, 2009, pp. 159-166.
[2] D.M. Hughes, I.S. Lim, M.W. Jones, A. Knoll, B. Spencer, “InK-Compact: in-kernel stream compaction and its application to multi-kernel data visualization on General-Purpose GPUs,” Computer Graphics Forum, vol. 32, no. 6, pp. 178-188, 2013.

Max size of a 2-dimensional array in C++

I want to execute a large computational program in 2 and 3 dimensions with arrays of size array[40000][40000] or more. The code below illustrates my problem; I commented out the vector version because it has the same problem (when I run it, it dies inside the vector library). How can I increase the memory available to the program, or delete (clean) some part of it while the program is running?
#include <iostream>
#include <cstdlib>
#include <vector>
using namespace std;

int main(){
    float array[40000][40000];
    //vector< vector<double> > array(1000,1000);
    cout << "bingo" << endl;
    return 0;
}
A slightly better option than vector (and far better than vector-of-vector (1)), which, like vector, uses dynamic allocation for the contents (and therefore doesn't overflow the stack), but doesn't invite resizing:
std::unique_ptr<float[][40000]> array{ new float[40000][40000] };
Conveniently, float[40000][40000] still appears, making it fairly obvious what is going on here even to a programmer unfamiliar with incomplete array types.
(1) vector<vector<T> > is very bad, since it would have many different allocations, which all have to be separately initialized, and the resulting storage would be discontiguous. Slightly better is a combination of vector<T> with vector<T*>, with the latter storing pointers created one row apart into a single large buffer managed by the former (see the sketch below).
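A small sketch of that vector<T> + vector<T*> arrangement (illustrative only; note that 40000x40000 floats is about 6.4 GB, so the machine needs that much memory regardless of how the array is allocated):

#include <vector>

int main() {
    const std::size_t rows = 40000, cols = 40000;

    std::vector<float>  buffer(rows * cols);   // one large contiguous allocation (~6.4 GB)
    std::vector<float*> row(rows);             // one pointer per row into the buffer
    for (std::size_t i = 0; i < rows; ++i)
        row[i] = buffer.data() + i * cols;

    row[123][456] = 1.0f;                      // indexed like a 2D array
    return 0;
}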

A good hashing function for a non-uniform sequence of uniformly distributed 4-bit values?

I have a very specific problem:
I have uniformly random values spread on a 15x50 grid and the sample I want to hash corresponds to a square of 5x5 cells centered around any possible grid position.
The number of samples can thus vary from 25 (away from borders, most cases) to 20, 15 (near a border) down to a minimum of 9 (in a corner).
So even though the cell values are random, the location introduces a deterministic variation in the sequence length.
The hash table size is a small number, typically between 20 and 50.
The function will operate on a large set of randomly generated grids (a few hundreds/thousands), and might be called a few thousands times per grid. The positions on the grid can be considered random.
I would like a function that could spread the 15x50 possible samples as evenly as possible.
I have tried the following pseudo-code:
int32 hash = 0;
int i = 0; // I guess i could take any initial value and even be left uninitialized, but fixing one makes the function deterministic
foreach (value in block)
{
    hash ^= (value << (i % 28));
    i++;
}
hash %= table_size;
but the results, though not grossly imbalanced, do not seem very smooth to me. Maybe the sample is too small, but the circumstances make it difficult to run the code on a bigger sample, and I would rather not write a complete test harness if some computer-savvy person has an answer ready for me :).
I am not sure pairing the values two by two and using a general purpose byte hashing strategy would be the best solution, especially since the number of values might be odd.
I have thought of using a 17th value to represent off-grid cells, but that seems to introduce a bias (the sequences from cells near a border will have a lot of "off grid" values).
I am not sure either what would be the best way to test the efficiency of various solutions (how many grids shall I generate to have an idea of the performances, for instance).
http://www.partow.net/programming/hashfunctions/
Here are a few different hash functions from experts in various fields. The functions are designed for 8-bit values, but I am sure you can extend them to your case. I don't know exactly what to suggest, but I think any of them should work better than your current idea.
The problem with the approach you propose is that the values are cyclic in the field 2^n; if you take mod 64 at the end, for example, you throw most of the values away and only the last 3 values remain in the final result.
Despite your scepticism I would just shove them through a standard hash function.
If they are well randomised (and relatively independent, you don't say) to begin with, you probably don't need to do too much work. Fowler-Noll-Vo (FNV) is a good candidate in these circumstances.
FNV operates on a series of 8-bit inputs, and your input is (logically) 4-bit.
I would start without even bothering to pack 'two by two' as you describe.
If you feel like trying that, just logically pad odd-length series with the message length (reduced to a 4-bit value, obviously).
I wouldn't expect that packing to improve the hash. It may save you a tiny number of cycles because it swaps a relatively expensive * for a << and a |.
Try both and report back!
Here are implementations of packed and 'normal' versions of FNV1a in C:
#include <inttypes.h>
#include <stddef.h> /* size_t */

static const uint32_t sFNVOffsetBasis = 2166136261;
static const uint32_t sFNVPrime       = 16777619;

const uint32_t FNV1aPacked4Bit(const uint8_t*const pBytes, const size_t pSize) {
    uint32_t rHash = sFNVOffsetBasis;
    for(size_t i = 0; i + 1 < pSize; i += 2){ /* stop before the last element when pSize is odd */
        rHash = rHash^(pBytes[i]|(pBytes[i+1]<<4));
        rHash = rHash*sFNVPrime;
    }
    if(pSize%2){ /* Length is odd. The loop missed the last element. */
        rHash = rHash^(pBytes[pSize-1]|((pSize&0x1E)<<3));
        rHash = rHash*sFNVPrime;
    }
    return rHash;
}

const uint32_t FNV1a(const uint8_t*const pBytes, const size_t pSize) {
    uint32_t rHash = sFNVOffsetBasis;
    for(size_t i = 0; i < pSize; ++i){
        rHash = (rHash^pBytes[i])*sFNVPrime;
    }
    return rHash;
}
NB: I've edited it to skip the first bit when adding in the length. Obviously the bottom bit of an odd length is 100% biased to 1. I don't know how length is distributed. It may be wiser to put it in at the start than the end.

Tips for improving performance of a 2d image 'tracing' CUDA kernel?

Can you give me some tips to optimize this CUDA code?
I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060, although I'm testing it now on a GTX 260, which has the same compute capability), and I have several kernels like the one below. The number of threads I need to execute this kernel is given by long SUM and depends on size_t M and size_t N, the dimensions of a rectangular image received as a parameter; these can vary greatly, from 50x50 to 10000x10000 pixels or more, although I'm mostly interested in working with the bigger images in CUDA.
Now each image has to be traced in all directions and angles, and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229080 threads computing the kernel below, which is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays, all of length SUM, into the global memory of the device one after another, since I need to access them for the calculations, like this:
cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
...etc
So each value of every array can be accessed by one thread. All the copies are done before the kernel calls. According to the CUDA profiler in Nsight, the highest memcpy duration is 246.016 us for a 500x500 image, so that is not taking too long.
But kernels like the one I copied below are taking too long for any practical use (according to the CUDA profiler, 3.25 seconds for the kernel below and 5.052 seconds for the kernel with the highest duration, for a 500x500 image), so I need to see if I can optimize them.
I arrange the data this way
First the block dimension
dim3 dimBlock(256,1,1);
then the number of blocks per Grid
dim3 dimGrid((SUM+255)/256);
That works out to 895 blocks for a 500x500 image.
I'm not sure how to use coalescing and shared memory in my case or even if it's a good idea to call the kernel several times with different portions of the data. The data is independent one from the other so I could in theory call that kernel several times and not with the 229080 threads all at once if needs be.
Now take into account that the outer for loop
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
depends on
tendbegin_cuda[idHilo]
the value of which depends on each thread but most threads have similar values for it.
According to the CUDA profiler, the Global Store Efficiency is 0.619 and the Global Load Efficiency is 0.951 for this kernel. Other kernels have similar values.
Is that good? Bad? How can I interpret those values? Sadly, devices of compute capability 1.3 don't provide other useful info for assessing the code, like the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis are "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency", but I'm not sure how I can optimize those.
void __global__ t21_trazo(long SUM, int cT, double Bn, size_t M, size_t N,
                          float* imagen_cuda, double* vector_trazo_cuda,
                          long* xb_cuda, long* yb_cuda, long* xinc_cuda, long* yinc_cuda,
                          long* tbegin_cuda, long* tendbegin_cuda){

    long xi;
    long yi;
    int t;
    int k;
    int a;
    int ji;
    long idHilo = blockIdx.x*blockDim.x + threadIdx.x;

    int neighborhood[31];
    int v = 0;

    if(idHilo < SUM){
        for(t = 15; t <= tendbegin_cuda[idHilo]-15; t++){

            xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
            yi = yb_cuda[idHilo] + floor((double)t*yinc_cuda[idHilo]);

            neighborhood[v] = floor(xi/Bn);
            ji = floor(yi/Bn);

            if(fabs((double)neighborhood[v]) < M && fabs((double)ji) < N)
            {
                if(tendbegin_cuda[idHilo] > 30 && v == 30){

                    if(t == 0)
                        vector_trazo_cuda[20+idHilo*31] = 0;

                    for(k = 1; k <= 15; k++)
                        vector_trazo_cuda[20+idHilo*31] = vector_trazo_cuda[20+idHilo*31] +
                            fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])] -
                                 imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);

                    for(a = 0; a < 30; a++)
                        neighborhood[a] = neighborhood[a+1];

                    v = v-1;
                }
                v = v+1;
            }
        }
    }
}
EDIT:
Changing the DP flops for SP flops only slightly improved the duration. Loop unrolling the inner loops practically didn't help.
Sorry for the unstructured answer, I'm just going to throw out some generally useful comments with references to your code to make this more useful to others.
Algorithm changes are always number one for optimizing. Is there another way to solve the problem that requires less math/iterations/memory etc.
If precision is not a big concern, use single-precision float (or half-precision float on newer architectures). Part of the reason it didn't affect your performance much when you briefly tried it is that you're still using double-precision calculations on your floating-point data (fabs takes a double, so if you use it with a float, it converts your float to a double, does double math, returns a double, and converts back to float; use fabsf instead).
If you don't need the absolute full precision of float, use fast math (a compiler option).
Multiply is much faster than divide (especially for full precision/non-fast math). Calculate 1/var outside the kernel and then multiply instead of dividing inside kernel.
Don't know if it gets optimized out, but you should use increment and decrement operators. v=v-1; could be v--; etc.
Casting to an int truncates toward zero; floor() truncates toward negative infinity. You probably don't need the explicit floor() at all (and use floorf() for float, as above). When you use it for intermediate computations on integer types, they're already truncated, so you're converting to double and back for no reason. Use the appropriately typed function (abs, fabs, fabsf, etc.).
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to
vector_trazo_cuda[20+idHilo*31] +=
fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];
The above line is needlessly complicated. In essence you are doing this,
convert t to double,
convert xinc_cuda to double and multiply,
floor it (returns double),
convert xb_cuda to double and add,
convert to long.
The new line will store the same result in much, much less time (also better because if you exceed the precision of double in the previous case, you would be rounding to the nearest power of 2). Also, those four loads (xb_cuda, yb_cuda, xinc_cuda, and yinc_cuda at idHilo) should be hoisted outside the for loop; you don't need to recompute them since they don't depend on t. Together, I wouldn't be surprised if this cuts your run time by a factor of 10-30.
Your structure results in a lot of global memory reads; try to read once from global memory, handle the calculations in local memory or registers, and write once to global (if at all possible), as in the sketch below.
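For instance, the accumulation into vector_trazo_cuda in your inner loop re-reads and re-writes global memory on every iteration; a hedged rewrite of just that fragment (my illustration, not tested against your full kernel) accumulates in a register and touches global memory once:

float acc = 0.0f;                              // accumulate in a register
for (int k = 1; k <= 15; ++k)
    acc += fabsf(imagen_cuda[ji*M + neighborhood[v-(15+k)]]
               - imagen_cuda[ji*M + neighborhood[v-(15-k)]]);
vector_trazo_cuda[20 + idHilo*31] += acc;      // single global read-modify-write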
Compile with -lineinfo always. It makes profiling easier, and I haven't been able to measure any overhead whatsoever (using kernels in the 0.1 to 10 ms execution time range).
Figure out with the profiler whether you're compute or memory bound and devote time accordingly.
Try to let the compiler use registers when possible; this is a big topic.
As always, don't change everything at once. I typed all this out without compiling/testing, so I may have an error.
You may be running too many threads simultaneously. The optimum performance seems to come when you run the right number of threads: enough threads to keep busy, but not so many as to over-fragment the local memory available to each simultaneous thread.
Last fall I built a tutorial to investigate optimization of the Travelling Salesman problem (TSP) using CUDA with CUDAFY. The steps I went through in achieving a several-times speed-up from a published algorithm may be useful in guiding your endeavours, even though the problem domain is different. The tutorial and code is available at CUDA Tuning with CUDAFY.

openCL reduction, and passing 2d array

Here is the loop I want to convert to openCL.
for(n=0; n < LargeNumber; ++n) {
    for (n2=0; n2 < SmallNumber; ++n2) {
        A[n] += B[n2][n];
    }
    Re += A[n];
}
And here is what I have so far, although, I know it is not correct and missing some things.
__kernel void openCL_Kernel( __global int *A,
                             __global int **B,
                             __global int *C,
                             __global _int64 Re,
                             int D)
{
    int i  = get_global_id(0);
    int ii = get_global_id(1);

    A[i] += B[ii][i];
    //barrier(..); ?

    Re += A[i];
}
I'm a complete beginner to this type of thing. First of all I know that I can't pass a global double pointer to an openCL kernel. If you can, wait a few days or so before posting the solution, I want to figure this out for myself, but if you can help point me in the right direction I would be grateful.
Concerning your problem with passing double pointers: that kind of problem is typically solved by copying the whole matrix (or whatever you are working on) into one contiguous block of memory and, if the rows have different lengths, passing another array which contains the offsets of the individual rows (so your access would look something like B[index[ii]+i]); a small host-side sketch follows.
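For example, a hedged host-side sketch of that flattening (illustrative only, not the full solution you asked to work out yourself): copy the rows of B into one contiguous buffer and keep an offsets array, so the kernel can index it as B_flat[offsets[ii] + i].

#include <vector>

// Flatten a ragged/2D container into one contiguous buffer plus row offsets.
std::vector<int> flatten(const std::vector<std::vector<int>>& B,
                         std::vector<int>& offsets)
{
    std::vector<int> flat;
    offsets.clear();
    for (const auto& rowVec : B) {
        offsets.push_back(static_cast<int>(flat.size()));   // start index of this row
        flat.insert(flat.end(), rowVec.begin(), rowVec.end());
    }
    return flat;   // pass flat.data() and offsets.data() to the kernel as two 1D buffers
}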
Now for your reduction down to Re: since you didn't mention what kind of device you are working on, I'm going to assume it's a GPU. In that case I would avoid doing the reduction in the same kernel, since it's going to be slow as hell the way you posted it (you would have to serialize the access to Re over thousands of threads, and the access to A[i] too).
Instead I would write one kernel which sums all B[*][i] into A[i], and put the reduction from A into Re in another kernel, done in several steps: that is, you use a reduction kernel which operates on n elements and reduces them to something like n/16 (or any other number). Then you iteratively call that kernel until you are down to one element, which is your result (I'm making this description intentionally vague, since you said you wanted to figure things out yourself).
As a side note: you realize that the original code doesn't exactly have a nice memory access pattern? Assuming B is relatively large (and much larger than A due to the second dimension), having the inner loop iterate over the outer index is going to create a lot of cache misses. This is even worse when porting to the GPU, which is very sensitive to coherent memory access.
So reordering it like this may massively increase performance:
for (n2=0; n2 < SmallNumber; ++n2)
    for(n=0; n < LargeNumber; ++n)
        A[n] += B[n2][n];

for(n=0; n < LargeNumber; ++n)
    Re += A[n];
This is particularly true if you have a compiler which is good at autovectorization, since it might be able to vectorize that construct, but it's very unlikely to be able to do so for the original code (and if it can't prove that A and B[n2] can't refer to the same memory, it can't turn the original code into this).
