Are Unbound's block and device radix sorts supported now? - aleagpu

Sort is not trivial to implement and I can't find the module in either documentation or the autocomplete. Is it not supported yet?

Extending Alea GPU with sorting primitives is still pending but in the pipeline. In the meantime you would need to implement your own sorting kernels.

Related

thrust: sorting within a threadblock

I am dispatching a kernel with around 5k blocks. At some point, we need to sort an array within each threadblock. If possible we would like to use a library like thrust.
From the documentation I understand that how sort is executed in thrust depends on the specified execution_policy. However, I don't understand whether I can use execution policies to specify that I would like to use the threads of my current block for sorting. Can someone explain, or point me towards good documentation of execution policies, and tell me whether what I intend to do is feasible?
It turns out that execution policies are basically a bridge design pattern that uses template specialization instead of inheritance to select the appropriate implementation of an algorithm, while exposing a stable interface to the user of the library and avoiding the overhead/necessity of virtual functions. Thank you robert-crovella for the great video.
As for the actual implementation of sorting within a threadblock in thrust, talonmies is right: there simply is no implementation (currently?); I could not find anything in the source code.

Exploitation of GPU using Halide

I'm implementing an algorithm in Halide and comparing it against a hand-tuned CUDA version of the same algorithm.
Accelerating the Halide implementation mostly went well, but it is still a bit slower than the hand-tuned version. So I tried to measure the exact execution time of each Func using nvvp (the NVIDIA Visual Profiler). Doing so, I found that the hand-tuned implementation overlaps the execution of several similar functions, each of which is implemented as a Func in the Halide version. CUDA's stream technology is used to achieve this.
I would like to know whether I can achieve a similar exploitation of the GPU in Halide.
Thanks for reading.
Currently the runtime has no support for CUDA streams. It might be possible to replace the runtime with something that can do this, but there is no extra information passed in to control the concurrency. (The runtime is somewhat designed to be replaceable, but there is a bit of a notion of a single queue and full dependency information is not passed down. It may be possible to reconstruct the dependencies from the inputs and outputs, but that starts to be a lot of work to solve a problem the compiler should be solving itself.)
We're talking about how to express such control in the schedule. One possibility is to use the support being prototyped in the async branch to do this, but we haven't totally figured out how to apply this to GPUs. (The basic idea is scheduling a Func async on a GPU would put it on a different stream. We'd need to use GPU synchronization APIs to handle producer/consumer dependencies.) Ultimately this is something we are interested in exploiting, but work needs to be done.
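For context, the kind of overlap the hand-tuned version gets from streams looks roughly like the sketch below (kernelA/kernelB/kernelC, dA/dB, grid and block are placeholder names, not anything from Halide). Independent kernels launched on different streams may run concurrently, and a cross-stream producer/consumer dependency is expressed with an event, which is exactly the synchronization the Halide async work would need to emit:

```cuda
// Sketch only: two independent kernels overlapped on separate streams,
// then a third kernel that consumes both, gated by an event.
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Independent work on different streams may execute concurrently.
kernelA<<<grid, block, 0, s1>>>(dA);
kernelB<<<grid, block, 0, s2>>>(dB);

// A producer/consumer dependency across streams needs an event.
cudaEvent_t done;
cudaEventCreate(&done);
cudaEventRecord(done, s1);
cudaStreamWaitEvent(s2, done, 0);   // s2 waits until kernelA has finished
kernelC<<<grid, block, 0, s2>>>(dA, dB);

cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```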

Is array.push() O(1) in Haxe?

Many languages have a standard type which can resize itself when needed, like C++'s vector<T> or C#'s List<T>. In Haxe, however, I don't see such a data structure.
Does Array in Haxe work this way? Can it add/remove the last element in (amortized) O(1)?
Technically this of course depends on the platform-specific implementation of Array, but it is safe to assume that push has amortized O(1) cost, as that is pretty straightforward to accomplish (the neko implementation shows this rather nicely).
On all platforms that come with a dynamically sized Array with support for sparseness, Haxe uses the native arrays for its implementation (AFAIK that's Flash, JS and PHP), but I guess if those ever showed poor performance, they would be re-implemented.
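The amortized-O(1) argument is the same geometric-growth trick behind C++'s vector<T> that the question mentions: because capacity grows by a constant factor, only O(log n) of the n pushes trigger a reallocation. A quick C++ sketch (illustrating the general mechanism, not Haxe's implementation) that counts reallocations:

```cpp
#include <vector>
#include <cstddef>

// Count how many push_back calls trigger a reallocation. With geometric
// capacity growth the count is only O(log n), which is what makes each
// individual push amortized O(1).
std::size_t countReallocations(std::size_t n) {
    std::vector<int> v;
    std::size_t reallocs = 0;
    std::size_t cap = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != cap) { ++reallocs; cap = v.capacity(); }
    }
    return reallocs;
}
```

For n = 100000 elements this yields only a few dozen reallocations rather than 100000.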
I would note that there is also List, if random access is not important. But on some platforms it is never faster than Array, only smaller.

Parallel STL algorithms in OS X

I'm working on converting an existing program to take advantage of some parallel functionality of the STL.
Specifically, I've re-written a big loop to work with std::accumulate. It runs nicely.
Now, I want to have that accumulate operation run in parallel.
The documentation I've seen for GCC outlines two specific steps:
Compile with the flag -D_GLIBCXX_PARALLEL
Possibly include the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything. The execution time is the same, and I don't see any indication of multiple core usage when monitoring the system.
I get an error when adding the <parallel/algorithm> header. I thought it would be included with the latest version of GCC (4.7).
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!
See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are generally going to be disappointed in the results. Those algorithms typically only begin to show a scalability advantage when working over very large datasets (e.g. N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).
A second alternative is to use OpenMP, which is supported by both GCC and Clang; it is not STL by any means, but it is cross-platform.
A third alternative is to use Grand Central Dispatch, the official multicore API on OS X; again, hardly STL.
A fourth alternative is to wait for C++17, which will have a parallelism module.
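Until one of those lands, a portable way to parallelize a std::accumulate today is to chunk the range and run each chunk on its own task; this is essentially what C++17's parallel std::reduce later standardized. A minimal sketch using std::async (parallelAccumulate is a hypothetical name, not a library function):

```cpp
#include <numeric>
#include <vector>
#include <future>
#include <thread>
#include <algorithm>
#include <cstddef>

// Split the input into one chunk per hardware thread, accumulate each
// chunk asynchronously, then combine the partial sums.
long long parallelAccumulate(const std::vector<int>& data) {
    unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (data.size() + nThreads - 1) / nThreads;
    std::vector<std::future<long long>> parts;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        std::size_t end = std::min(begin + chunk, data.size());
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin,
                                   data.begin() + end, 0LL);
        }));
    }
    long long total = 0;
    for (auto& f : parts) total += f.get();   // combine partial sums
    return total;
}
```

Note that addition here must be associative for the chunked combine to be valid, which is the same requirement std::reduce imposes.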

scala sort indexedseq in place

How do you sort an IndexedSeq in place in scala? Where's the API function?
Currently there is nothing to sort them in-place.
If you really need that it would be possible to convert the IndexedSeq to an Array[AnyRef] and use Arrays.sort from Java (you have to cast to Array[AnyRef], because Scala's arrays are not covariant like Java's).
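A small Scala sketch of that workaround (element type and values are illustrative; note the copy into the array means it is not truly in-place on the original collection):

```scala
import java.util.{Arrays => JArrays}

// Copy the IndexedSeq into an Array[AnyRef], sort that in place with
// Java's Arrays.sort (natural ordering, so elements must be Comparable),
// then wrap the result back up as an IndexedSeq.
val xs: IndexedSeq[String] = Vector("pear", "apple", "plum")
val arr: Array[AnyRef] = xs.toArray[AnyRef]
JArrays.sort(arr)                 // in-place on the array
val sorted = arr.toIndexedSeq     // elements now in order: apple, pear, plum
```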
Interestingly, a few weeks ago there was a discussion about adding in-place versions of operations like map, filter and sort to Scala's mutable collections.
I hope that after the 2.9 release of parallel collections this could be the next work item on the list to further improve Scala's collections.
It doesn't hurt if people raise their voices in support of it (or supply a working implementation) :-).