Is there any parallel algorithm that can be implemented in OpenACC to find the median of a vector?
Finding the median requires having a sorted representation of the data.
OpenACC doesn't have a good way to achieve this. However, you could use Thrust to sort the data on the GPU and then continue working using OpenACC. It is likely that, with some fiddling, you can get this to work using device pointers with OpenACC without moving memory between the CPU and GPU.
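A minimal sketch of that interop, assuming the NVIDIA HPC (formerly PGI) compiler so that Thrust and the OpenACC data region share the same device memory; the function name median_of, the float element type, and the n >= 2 assumption are my own choices, not anything OpenACC prescribes:

#include <thrust/execution_policy.h>
#include <thrust/sort.h>

// Sketch: sort the device copy of x in place with Thrust, then pull
// back only the middle element(s) needed for the median. Assumes n >= 2.
float median_of(float *x, int n)
{
    #pragma acc data copyin(x[0:n])
    {
        // host_data exposes the device address of x so Thrust can use it.
        #pragma acc host_data use_device(x)
        {
            thrust::sort(thrust::device, x, x + n);
        }
        // Copy just the two middle elements back to the host.
        #pragma acc update host(x[n/2-1:2])
    }
    return (n % 2) ? x[n/2] : 0.5f * (x[n/2-1] + x[n/2]);
}

The file has to be compiled so that the Thrust call is treated as CUDA C++ (e.g. nvc++ with both -acc and -cuda); the point is that nothing except the two middle elements ever crosses back to the CPU.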
I need a bit of advice from you, and I hope it won't take a lot of your time.
So here is my question:
I have a small square dense matrix, with possible sizes 4x4, 8x8, 16x16,
and I want to invert it using CUDA.
The special part of the question is that I have 1024 idle CUDA threads to perform this task.
So I suspect that the most widespread inversion methods, like Gauss-Jordan, won't work well here, because they are only mildly parallel and would use only about 4-16 threads out of the 1024 available.
But how else can I invert these matrices using all available threads?
Thank you for your attention!
There are at least two ready-made options for this sort of problem:
Use the batched solvers shipping in recent versions of the CUBLAS library (a sketch follows after this list)
Use the BSD-licensed Gauss-Jordan elimination device code functions which NVIDIA distributes to registered developers. These were intended to invert small matrices using one thread per matrix.
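For the first option, a hedged sketch using cublasSmatinvBatched, the direct batched inversion routine in CUBLAS, which is limited to n <= 32 and so covers the 4x4 through 16x16 cases; even a batch of one needs the pointer-array calling convention. The helper name invert_small is mine and error checking is omitted:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

// Invert one n x n float matrix (column-major, n <= 32) that already
// lives on the device. The batched API wants device arrays of matrix
// pointers, even for a batch of one.
void invert_small(cublasHandle_t handle, int n, float *d_A, float *d_Ainv)
{
    float **d_Aarray, **d_Carray;
    cudaMalloc(&d_Aarray, sizeof(float *));
    cudaMalloc(&d_Carray, sizeof(float *));
    cudaMemcpy(d_Aarray, &d_A, sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Carray, &d_Ainv, sizeof(float *), cudaMemcpyHostToDevice);

    int *d_info;
    cudaMalloc(&d_info, sizeof(int));

    // Direct inversion of small matrices; cuBLAS launches one kernel
    // for the whole batch and uses many threads internally.
    cublasSmatinvBatched(handle, n, (const float **)d_Aarray, n,
                         d_Carray, n, d_info, 1);

    int info = 0;
    cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost);
    if (info != 0)
        std::printf("inversion failed, info = %d\n", info);

    cudaFree(d_Aarray);
    cudaFree(d_Carray);
    cudaFree(d_info);
}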
[This answer was assembled from comments and added as a community wiki entry to get the question off the unanswered queue]
I have been wondering when to use a parallel prefix sum instead of a sequential buildup. The algorithm I am using constructs parallel sums, but I read somewhere that for a small number of elements (typically fewer than 100), it's better to go with the sequential algorithm. This raises the question of whether there is a certain threshold above which a parallel implementation might yield some gain over the sequential one. I am using OpenCL and have implemented the parallel prefix sum following Blelloch (1990).
It depends, as usual. On the implementation, the device, and the size of the data.
GPU Gems 3, chapter 39 has some pretty graphs that show where their specific implementations have thresholds. They didn't implement the algorithm naively, of course - it's an optimized version using shared memory, unrolled loops, and bank-conflict avoidance.
Once you have an implementation, you'll just have to benchmark it to find the threshold.
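One way to run that benchmark, sketched in C++17 for brevity (the same harness shape applies to an OpenCL kernel versus a host loop): time both paths over a sweep of sizes and look for the crossover. This assumes a standard library whose parallel execution policies actually run multithreaded, e.g. libstdc++ built against TBB:

#include <chrono>
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

// Time one invocation of f in microseconds.
template <class F>
double time_us(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

int main()
{
    for (std::size_t n = 1 << 4; n <= (1u << 24); n <<= 2) {
        std::vector<int> in(n, 1), out(n);
        double seq = time_us([&] {
            std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);
        });
        double par = time_us([&] {
            std::exclusive_scan(std::execution::par,
                                in.begin(), in.end(), out.begin(), 0);
        });
        std::printf("n = %9zu   seq %10.1f us   par %10.1f us\n", n, seq, par);
    }
}

The size at which the parallel column starts winning is your threshold; expect it to move when you change devices.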
I want to implement disjoint-set data structures and Kruskal's algorithm in OpenCL. I have implemented some code in OpenCL, but don't know how to get started with data structures in OpenCL. The Dijkstra's algorithm example given in the book by Aftab Munshi is hard to understand. Can anyone suggest another source?
I suggest you start with a simple C version of the algorithm, like:
http://prabhakargouda.hubpages.com/hub/Kruskal-algorithm-implementation-in-C
Assess what can be done in parallel. In the above code, there are several nested for loops that are candidates for parallelization. The adjacency matrix is a good structure for parallel work compared with chasing pointers through a tree, so try to leverage that.
Remember that not all phases of the algorithm can be done in parallel, so start with the innermost for loops and take the implementation in stages; a sketch of the key step follows below.
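To make the adjacency-matrix point concrete, here is a plain C++ sketch of the step such implementations repeat: scanning the matrix for the cheapest edge that still joins two different components. That min-reduction over the matrix is the natural first candidate for an OpenCL kernel (one work-item per entry); the helper names are mine:

#include <climits>
#include <vector>

// Disjoint-set find with path halving.
int find_root(std::vector<int> &parent, int v)
{
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];
        v = parent[v];
    }
    return v;
}

// One step of the adjacency-matrix Kruskal variant: scan for the
// cheapest edge whose endpoints are still in different components.
// The doubly nested loop is the part to move into an OpenCL kernel.
bool cheapest_edge(const std::vector<std::vector<int>> &cost,
                   std::vector<int> &parent, int &u, int &v)
{
    int best = INT_MAX;
    const int n = (int)cost.size();
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (cost[i][j] != 0 && cost[i][j] < best &&
                find_root(parent, i) != find_root(parent, j)) {
                best = cost[i][j];
                u = i;
                v = j;
            }
    return best != INT_MAX;
}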
Also, note there is no copyright or license associated with the above code, so be careful how you use it.
Remember to give credit where it's due.
I was using the built-in sort function of Matlab:
[temp, Idx] = sort(M,2);
I would like to have the sorted index of each row of M, which is a matrix of size > 50k.
I searched hard but did not find anything. It would be greatly appreciated if you have any comments!
To get a sense of how much room for improvement you have, I would suggest writing a test program in C using qsort, or in C++ using std::sort, and carefully timing it on 7000 inputs of size 7000 (or whatever setup you have in MATLAB).
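A sketch of such a test in C++; note it only sorts the values, while MATLAB's two-output sort also produces the index matrix, so treat the C++ time as a lower bound:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    const int rows = 7000, cols = 7000;
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    // Fill a 7000 x 7000 matrix, stored row by row.
    std::vector<std::vector<double>> M(rows, std::vector<double>(cols));
    for (auto &row : M)
        for (auto &x : row)
            x = dist(gen);

    auto t0 = std::chrono::steady_clock::now();
    for (auto &row : M)
        std::sort(row.begin(), row.end());   // one sort per row, like sort(M,2)
    auto t1 = std::chrono::steady_clock::now();

    std::printf("sorted %d rows of %d doubles in %.2f s\n", rows, cols,
                std::chrono::duration<double>(t1 - t0).count());
}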
I'm going to give you my estimate: MATLAB's sort (on properly vectorized code, like yours) probably runs as fast as C++, and you're just seeing the effect of an operation that takes O(n^2 log n) time overall - n rows, each sorted in O(n log n). MATLAB's marketing material reports that its sort function was faster than C's qsort, but take that with a grain of salt.
The best way to speed up that sort is to get a faster computer. It will speed everything else up too. :)
The fact is, you can rarely speed up a single call to something like a sort. MATLAB is already doing that efficiently, using optimized code internally. (Reread carlosdc's answer.) The things you can sometimes get a boost on are tools that are written in MATLAB itself.
So, what can you do? Short of buying that new computer, you can look at your overall code. A single sort of that size is never that big a problem, but doing that sort over and over again is. Think carefully about the code, about whether you can change the flow or avoid a many-times-repeated sort. An algorithm change is often a FAR bigger source of improvement than the wee bit you would ever get even if you could improve that single sort.
Sorting is fundamentally O(n log n).
As long as you have a reasonably efficient implementation, this is unlikely to change much.
That said, as Andrew Janke's comment suggests, multi-threading can improve things dramatically.
GPU programming can be a way to get massive speedups. If you have R2010b or later, you may be able to use GPU-accelerated versions of built-in functions like sort via the Parallel Computing Toolbox from MathWorks.
Otherwise, write a MEX wrapper around the CUDA Thrust library, which includes a sort; a sketch follows below.
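A hedged sketch of such a wrapper; the file has to be compiled with nvcc against the MEX headers (newer MATLAB releases wrap this as mexcuda), and it returns a sorted copy rather than sorting in place:

#include "mex.h"
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// thrust_sort(x): return a sorted copy of the double vector x, with the
// sort itself running on the GPU via thrust::sort.
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1 || !mxIsDouble(prhs[0]))
        mexErrMsgTxt("expected one double array");

    const mwSize n = mxGetNumberOfElements(prhs[0]);
    const double *in = mxGetPr(prhs[0]);

    // Host -> device copy, device sort, device -> host copy.
    thrust::device_vector<double> d(in, in + n);
    thrust::sort(d.begin(), d.end());

    plhs[0] = mxCreateDoubleMatrix(1, n, mxREAL);
    thrust::copy(d.begin(), d.end(), mxGetPr(plhs[0]));
}

Keep in mind the host-device transfers are part of the cost, so this only pays off for large vectors or when the data can stay on the GPU between calls.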
You could write your own sort function in C/C++ as a MEX file; the MATLAB documentation has examples.
There exist many sorting algorithms that are better than others in edge cases, for example on almost-sorted data, or that offer stability (which does not matter in MATLAB because all its types are value types).
Is your data numeric or strings? For strings there are probably special algorithms for ASCII sort, and sometimes a natural sort is preferable.
For example, I have an array of (x,y) points and I want to organize them in a kd-tree.
Building a kd-tree involves sorting and computing bounding boxes. These algorithms work fine in CUDA, but is there any way to build the kd-tree utilizing as many threads as possible?
I think there should be some tricks:
Usually, a kd-tree is implemented with recursion, but as far as I know, CUDA processors don't have a hardware stack, so recursion should be avoided.
How can I build a kd-tree in CUDA efficiently?
You might want to have a look at the following papers:
Stackless KD-Tree Traversal for High Performance GPU Ray Tracing
Real-Time KD-Tree Construction on Graphics Hardware
They might help you along. Google them and you'll find them available online.
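On the recursion worry specifically: the standard trick, which the second paper develops, is to build the tree level by level with an explicit queue of node ranges instead of a call stack; each level then exposes many independent nodes to the threads. A host-side C++ sketch of that control flow (the storage layout and the GPU kernelization are left out, so this only illustrates the queue idea):

#include <algorithm>
#include <queue>
#include <vector>

struct Point { float x, y; };

// A pending node: a range of points [begin, end) and its split axis.
struct Node { int begin, end, axis; };

// Breadth-first kd-tree construction with an explicit queue instead of
// recursion. On a GPU the queue becomes per-level node lists, and every
// node in a level can be split by an independent group of threads.
void build_kdtree(std::vector<Point> &pts)
{
    std::queue<Node> pending;
    pending.push({0, (int)pts.size(), 0});

    while (!pending.empty()) {
        Node node = pending.front();
        pending.pop();
        if (node.end - node.begin <= 1)
            continue;                        // leaf reached

        // Median split along the current axis; nth_element is O(n).
        int mid = node.begin + (node.end - node.begin) / 2;
        auto cmp = [axis = node.axis](const Point &a, const Point &b) {
            return axis == 0 ? a.x < b.x : a.y < b.y;
        };
        std::nth_element(pts.begin() + node.begin, pts.begin() + mid,
                         pts.begin() + node.end, cmp);

        // Children go back on the queue; no call stack is involved.
        pending.push({node.begin, mid, 1 - node.axis});
        pending.push({mid + 1, node.end, 1 - node.axis});
    }
}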