Ray Tracer Uniform Grid Traversal in OpenCL

I'm attempting to create a real-time ray tracer using OpenCL; however, I'm very new to OpenCL and how to use it.
As part of accelerating the ray tracing process, I have implemented a Uniform Grid which can be traversed via a form of the DDA algorithm seen here: http://www.cse.yorku.ca/~amana/research/grid.pdf
My problem is that, although I can traverse the grid and grab the cells to check for objects, I am unable to actually access those objects (it sends the kernel build into an endless loop). They exist in the array, but I believe the way the GPU compiler inlines the code means it can't resolve the program.
The following is the piece of code that is having trouble, where grid is a 4D array represented as a 1D array (the fourth dimension holds each object in a cell), toAccess is the list of elements containing object IDs that were encountered during grid traversal, caughtObjs is the size of toAccess (the number of objects encountered), and objList is the list of object IDs I'm trying to get:
for (int i = 0; i < caughtObjs; i++) {
    objList[i] = grid[toAccess[i]]; // Commenting out this line stops the build crash
}
So I'm assuming that using one array to index into another array is bad in OpenCL due to this inlining. What is a common alternative used to combat this problem?
(Also, I'm aware you're meant to check one cell at a time but I'm not doing this because the depth of the loops required was killing OpenCL's build too.)

Related

Mutable data types that use stack allocation

Based on my earlier question, I understand the benefit of using stack allocation. Suppose I have an array of arrays. For example, A is a list of matrices and each element A[i] is a 1x3 matrix. The length of A and the dimension of A[i] are known at run time (given by the user). Each A[i] is a matrix of Float64, and this is also known at run time. However, throughout the program, I will be modifying the values of A[i] element by element. What data structure would still allow me to use stack allocation? I tried StaticArrays but it doesn't allow me to modify a static array.
StaticArrays defines MArray (MVector, MMatrix) types that are fixed-size and mutable. If you use these there's a higher chance of the compiler determining that they can be stack-allocated, but it's not guaranteed. Moreover, since the pattern you're using is that you're passing the mutable state vector into a function which presumably modifies it, it's not going to be valid or helpful to stack allocate that anyway. If you're going to allocate state once and modify it throughout the program, it doesn't really matter if it is heap or stack allocated—stack allocation is only a big win for objects that are allocated, used locally and then don't escape the local scope, so they can be “freed” simply by popping the stack.
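For illustration, a minimal sketch of using MVector (the sizes and values below are placeholders, not taken from your code):

using StaticArrays

# An MVector has a fixed size known to the compiler but is mutable,
# so individual elements can be updated in place (unlike SVector).
v = MVector{3,Float64}(0.0, 0.0, 0.0)
v[2] = 1.5

# Standing in for your A: a list of fixed-size mutable length-3 vectors.
A = [MVector{3,Float64}(0.0, 0.0, 0.0) for _ in 1:4]
A[1][3] = 2.0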
From the code snippet you showed in the linked question, the state vector is allocated in the outer function, test_for_loop, which shouldn't be a big deal since it's done once at the beginning of execution. Using a variably sized state vector to index into an array with a splat (...) might be an issue, however, and that's done in test_function. Using something with a fixed size like MVector might be better for that. It might, however, be better still to use a state tuple and return a new state tuple rather than a mutated one at the end. The compiler is very good at turning that kind of thing into very efficient code because of immutability.
Note that by convention test_function should be called test_function! since it modifies its M argument and even more so if it modifies the state vector.
I would also note that this isn't a great question/answer pair since it's not standalone at all and really just a continuation of your other question. StackOverflow isn't very good for this kind of iterative question/discussion interaction, I'm afraid.

Orthogonal Recursive Bisection in Chapel (Barnes-Hut algorithm)

I'm implementing a distributed version of the Barnes-Hut n-body simulation in Chapel. I've already implemented the sequential and shared memory versions which are available on my GitHub.
I'm following the algorithm outlined here (Chapter 7):
1. Perform orthogonal recursive bisection and distribute bodies so that each process has an equal amount of work
2. Construct a locally essential tree on each process
3. Compute forces and advance bodies
I have a pretty good idea of how to implement the algorithm in C/MPI using MPI_Allreduce for bisection and simple message passing for communication between processes (for body transfer). MPI_Comm_split is also a very handy function that allows me to split the processes at each step of the ORB.
I'm having some trouble performing ORB using the parallel/distributed constructs that Chapel provides. I would need some way to sum (reduce) work across processes (locales in Chapel), to split processes into groups, and to handle process-to-process communication to transfer bodies.
I would be grateful for any advice on how to implement this in Chapel. If another approach would be better for Chapel that would also be great.
After a lot of deadlocks and crashes I did manage to implement the algorithm in Chapel. It can be found here: https://github.com/novoselrok/parallel-algorithms/tree/75312981c4514b964d5efc59a45e5eb1b8bc41a6/nbody-bh/dm-chapel
I was not able to use much of the fancy parallel equipment Chapel provides. I relied only on block distributed arrays with sync elements. I also replicated the SPMD model.
In main.chpl I set up all of the necessary arrays that will be used to transfer data. Each array has a corresponding sync array used for synchronization. Then each worker is started with its share of bodies and the previously mentioned arrays. Worker.chpl contains the bulk of the algorithm.
I replaced the MPI_Comm_split functionality with a custom function determineGroupPartners where I do the same thing manually. As for the MPI_Allreduce I found a nice little pattern I could use everywhere:
var localeSpace = {0..#numLocales};
var localeDomain = localeSpace dmapped Block(boundingBox=localeSpace);

var arr: [localeDomain] SomeType;
var arr$: [localeDomain] sync int; // stores the ranks of intended receivers

var rank = here.id;

for i in 0..#(numLocales-1) {
    var intendedReceiver = (rank + i + 1) % numLocales;
    var partner = ((rank - (i + 1)) + numLocales) % numLocales;

    // Wait until the previous value is read
    if (arr$[rank].isFull) {
        arr$[rank];
    }

    // Store my value
    arr[rank] = valueIWantToSend;
    arr$[rank] = intendedReceiver;

    // Am I the intended receiver?
    while (arr$[partner].readFF() != rank) {}

    // Read partner value
    var partnerValue = arr[partner];
    // Do something with partner value

    arr$[partner]; // empty the sync variable
    // Reset my sync variable; this write blocks until arr$[rank] is empty
    arr$[rank] = -1;
}
This is a somewhat complicated way of implementing FIFO channels (see Julia RemoteChannel, where I got the inspiration for this "pattern").
Overview:
1. Each locale calculates its intended receiver and its partner (the locale it will read a value from)
2. The locale checks whether its previous value was read
3. The locale stores a new value and "locks" it by setting arr$[rank] to the intended receiver's rank
4. The locale waits while its partner stores its value and sets the appropriate intended receiver
5. Once the locale is the intended receiver, it reads the partner's value and does some operation on it
6. The locale then empties/unlocks the value by reading arr$[partner]
7. Finally, it resets its arr$[rank] by writing -1. This also ensures that the value set by the locale was read by the intended receiver
I realize that this might be an overly complicated solution to this problem. There probably exists a better algorithm that fits Chapel's global view of parallel computation. The algorithm I implemented lends itself to the SPMD model of computation.
That being said, I think that Chapel still does a good job performance-wise. Here are the performance benchmarks against Julia and C/MPI. As the number of processes grows the performance improves by quite a lot. I didn't have a chance to run the benchmark on a cluster with >4 nodes, but I think Chapel will end up with respectable benchmarks.

runif function in Cuda

I am trying to implement a Metropolis-Hastings algorithm in Cuda. For this algorithm, I need to be able to generate many uniform random numbers with varying range. Therefore, I would like to have a function called runif(min, max) that returns a uniformly distributed number in this range. This function has to be called multiple times inside another function that actually implements the algorithm.
Based on this post, I tried to put the code shown there into a function (see below). If I understood this correctly, the same state leads to the same sequences of numbers. So, if the state doesn't change, I always get the same output. One alternative would be to generate a new state inside the runif function so that each time the function is called, it is called with another state. As I've heard though, this is not advisable since the function gets slow.
So, what would be the best implementation of such a function? Should I generate a new state inside the function or generate a new one outside each time I call the function? Or is there yet another approach?
__device__ float runif(float a, float b, curandState state)
{
    float myrandf = curand_uniform_double(&state);
    myrandf *= (b - a + 0.999999);
    myrandf += a;
    return myrandf;
}
How it works
The curand_uniform* family of functions accepts a pointer to a curandState object, uses it, and modifies it, so that the next time a curand_uniform* function is called with the same state object you get the desired randomness.
Important thing here is:
In order to get meaningful results you need to write curandState changes back.
Wrong way 1
Right now you are passing curandState by value, so the state changes are lost when the function returns, not to mention the unnecessary time wasted on copying.
Wrong way 2
Creating and initializing a new local state inside the kernel will not only kill performance (and defeat the purpose of using CUDA) but will also give you the wrong distribution.
Right way
In the sample code you've linked, curandState is passed by pointer, which guarantees that the modifications are saved (at whatever location the pointer refers to).
Usually, you would want to allocate and initialize an array of random states once in your program (before launching any kernels that require RNG). Then, in order to generate some numbers, you access this array from kernels, with indices based on thread IDs. Multiple (many) states are required to avoid data races (at least one state per concurrently running curand_uniform* call).
This way you avoid the performance costs of copying and state initialization and get your perfect distribution.
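For illustration, a minimal sketch of that approach (the kernel names, the seed, and the launch setup are placeholders; host-side allocation of the states array with cudaMalloc is omitted):

#include <curand_kernel.h>

// One state per thread, initialized once before any kernel that needs RNG.
__global__ void setup_states(curandState *states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]);
}

// Take the state by pointer so the updated state is visible to the caller.
__device__ float runif(float a, float b, curandState *state)
{
    return a + (b - a) * curand_uniform(state);
}

__global__ void metropolis_step(curandState *states)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState localState = states[id];       // work on a local copy
    float u = runif(0.0f, 1.0f, &localState);   // call as many times as needed
    // ... use u in the acceptance test ...
    states[id] = localState;                    // write the state back
}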
See the cuRAND documentation for more information and sample code.

Dynamically Allocated Jagged Arrays with Smart Pointers

So I've recently become familiar with (and fallen in love with) Boost and C++11 smart pointers. They make memory management SO much easier. And, on top of all that, they can usually still work with legacy code (through the use of the get() call).
However, the big hole I keep running into is multidimensional jagged arrays. The correct way to do it is to have a boost::scoped_array<boost::scoped_array<double>> or vector<vector<double>>, which will clean up nicely. However, you cannot get a double** out of this easily to send to legacy code.
Is there any way to do this, or am I stuck with non-smart jagged arrays?
I'd start with std::vector<std::vector<double>> for storage, unless the structure was highly static.
To produce my array-of-arrays, I'd produce a std::vector<double*> via transformation of my above storage, using syntax like transform_to_vector( storage, []( std::vector<double>& v ) { return v.data(); } ) (transform_to_vector left as an exercise to the reader).
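As a rough sketch of that transformation (make_row_pointers and legacy_function are names I'm making up here, not standard helpers), assuming the storage outlives the pointer vector:

#include <vector>

std::vector<double*> make_row_pointers(std::vector<std::vector<double>>& storage)
{
    std::vector<double*> rows;
    rows.reserve(storage.size());
    for (auto& row : storage)
        rows.push_back(row.data()); // stays valid until a row reallocates or storage resizes
    return rows;
}

// usage (legacy_function is hypothetical and expects a double**):
// auto rows = make_row_pointers(storage);
// legacy_function(rows.data(), rows.size());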
Keeping the two in sync would be a simple matter of wrapping it in a small class.
If the jagged array is relatively fixed in size, I'd take a std::vector<std::size_t> to create my buffer (or maybe a std::initializer_list<std::size_t> -- actually, a template<typename Container>, and I'd just for( : ) over it twice, and let the caller pick what container it provided me), then create a single std::vector<double> with the sum of the sizes, then build a std::vector<double*> at the dictated offsets.
Resizing this gets expensive, which is a disadvantage.
A nice property of using std::vector is that newer APIs have full access to the pretty begin and end values. If you have a single large buffer, you can expose a range view of the sub arrays to new code (a structure containing a double* begin() and double* end(), and while we are at it a double& operator[] and std::size_t size() const { return end()-begin(); }), so they can bask in the glory of full on C++ container-style views while keeping C compatibility for legacy interfaces.
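Such a view could be sketched roughly like this (the struct and member names are mine):

#include <cstddef>

struct row_view {
    double* first;
    double* last;
    double* begin() const { return first; }
    double* end() const { return last; }
    double& operator[](std::size_t i) const { return first[i]; }
    std::size_t size() const { return static_cast<std::size_t>(last - first); }
};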
If you're working in C++11, you should probably work with unique_ptr<T[]> rather than scoped_array<T>. It can do everything that scoped_array can, and then some.
If you want a rectangular array, I recommend using a unique_ptr<double[]> to hold the main data and a unique_ptr<double*[]> to hold the row bases. This would work something like this:
unique_ptr<double[]> data{ new double[5*3] };
unique_ptr<double*[]> rows{ new double*[5] };  // one pointer per row (5 rows of 3)
rows[0] = data.get();
for ( size_t i = 1; i != 5; ++i )
    rows[i] = rows[i-1] + 3;
Then you can pass rows.get() to a function taking double**. This approach can work for a non-rectangular array as well, provided the geometry of the array is known at array creation time so that you can allocate all the data at once and point rows to the proper offsets. (It may not be as straightforward as a simple loop, though.)
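For example, with a hypothetical legacy routine (not part of this answer's API):

// assuming: void legacy_fill(double** m, size_t nrows, size_t ncols);
legacy_fill(rows.get(), 5, 3); // rows.get() yields the double** the legacy code expects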
This will also give you better locality of reference and memory usage, since you only perform two allocations. All of your data will be stored together in memory and there won't be additional overhead for the separate allocations.
If you want to change the geometry of the jagged array after creating it, you will need to come up with a principled way of managing the storage for this solution to be applicable. However, since changing the geometry using scoped_array is awkward (requiring specific uses of swap()), I wouldn't be surprised if this isn't an issue for you.
(Note that this approach can work with scoped_array as well as unique_ptr<T[]>; I'm simply illustrating it using unique_ptr since we're in C++11 now.)

vector --> concurrent_vector migration + OpenGL restriction

I need to speed up some calculations, and the results of those calculations are then used to draw an OpenGL model.
The major speed-up was achieved when I changed std::vector to Concurrency::concurrent_vector and used parallel_for instead of plain for loops.
This vector (or concurrent_vector) is filled in a for (or parallel_for) loop and contains the vertices for OpenGL to visualize.
Using std::vector was fine because the OpenGL rendering procedure relies on the fact that std::vector keeps its items contiguous in memory, which is not the case with concurrent_vector. The code runs something like this:
glVertexPointer(3, GL_FLOAT, 0, &vectorWithVerticesData[0]);
Generating a concurrent_vector and copying it into a std::vector is too expensive, since there are a lot of items.
So, the question is: I'd like to use OpenGL vertex arrays, but I'd also like to use concurrent_vector, which is incompatible with that kind of OpenGL output.
Any suggestions?
You're trying to use a data structure that doesn't store its elements contiguously in an API that requires contiguous storage. Well, one of those has to give, and it's not going to be OpenGL. GL isn't going to walk concurrent_vector's data structure (not if you like performance).
So your only option is to not use non-contiguous containers.
I can only guess at what you're doing (since you didn't provide example code for the generator), so that limits what I can advise. If your parallel_for iterates a fixed number of times (by "fixed", I mean a value that is known immediately before parallel_for executes and doesn't change based on how many times you've iterated), then you can just use a regular vector.
Simply size the vector up front with vector::resize (or the sizing constructor). This will value-initialize the elements, which means that every element exists. You can now perform your parallel_for loop, but instead of using push_back or whatever, you simply copy the element directly into its location in the output. I think parallel_for can iterate over the actual vector iterators, but I'm not positive. Either way, it doesn't matter; you won't get any race conditions unless you try to set the same element from different threads.
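A rough sketch of that pattern with the PPL (Vertex, computeVertex, and buildVertices are placeholders for your own types and functions):

#include <ppl.h>
#include <vector>
#include <cstddef>

struct Vertex { float x, y, z; };

// Stand-in for the real per-vertex calculation you are parallelizing.
Vertex computeVertex(std::size_t i) { return { float(i), 0.0f, 0.0f }; }

void buildVertices(std::size_t vertexCount, std::vector<Vertex>& vertices)
{
    vertices.resize(vertexCount);       // size once up front; every element already exists

    Concurrency::parallel_for(std::size_t(0), vertexCount, [&](std::size_t i) {
        vertices[i] = computeVertex(i); // each iteration writes only its own slot, so no race
    });

    // vertices.data() (or &vertices[0]) is contiguous and can be handed to glVertexPointer.
}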
