Orthogonal Recursive Bisection in Chapel (Barnes-Hut algorithm)

I'm implementing a distributed version of the Barnes-Hut n-body simulation in Chapel. I've already implemented the sequential and shared-memory versions, which are available on my GitHub.
I'm following the algorithm outlined here (Chapter 7):
1. Perform orthogonal recursive bisection and distribute bodies so that each process has an equal amount of work
2. Construct a locally essential tree on each process
3. Compute forces and advance bodies
I have a pretty good idea of how to implement the algorithm in C/MPI: MPI_Allreduce for the bisection and simple message passing for transferring bodies between processes. MPI_Comm_split is also very handy, since it lets me split the processes at each step of ORB.
I'm having some trouble performing ORB with the parallel/distributed constructs that Chapel provides. I would need some way to sum (reduce) work across processes (locales in Chapel), to split processes into groups, and to transfer bodies from process to process.
I would be grateful for any advice on how to implement this in Chapel. If another approach would be a better fit for Chapel, that would also be great.

After a lot of deadlocks and crashes I did manage to implement the algorithm in Chapel. It can be found here: https://github.com/novoselrok/parallel-algorithms/tree/75312981c4514b964d5efc59a45e5eb1b8bc41a6/nbody-bh/dm-chapel
I was not able to use much of the fancy parallel machinery Chapel provides. I relied only on block-distributed arrays with sync elements, essentially replicating the SPMD model.
In main.chpl I set up all of the necessary arrays that will be used to transfer data. Each array has a corresponding sync array used for synchronization. Then each worker is started with its share of bodies and the previously mentioned arrays. Worker.chpl contains the bulk of the algorithm.
I replaced the MPI_Comm_split functionality with a custom function, determineGroupPartners, where I do the same thing manually. As for MPI_Allreduce, I found a nice little pattern I could use everywhere:
var localeSpace = {0..#numLocales};
var localeDomain = localeSpace dmapped Block(boundingBox=localeSpace);

var arr: [localeDomain] SomeType;
var arr$: [localeDomain] sync int; // stores the ranks of intended receivers
var rank = here.id;

for i in 0..#numLocales-1 {
    var intendedReceiver = (rank + i + 1) % numLocales;
    var partner = ((rank - (i + 1)) + numLocales) % numLocales;

    // Wait until the previous value is read
    if (arr$[rank].isFull) {
        arr$[rank];
    }

    // Store my value
    arr[rank] = valueIWantToSend;
    arr$[rank] = intendedReceiver;

    // Am I the intended receiver?
    while (arr$[partner].readFF() != rank) {}

    // Read the partner's value
    var partnerValue = arr[partner];
    // Do something with the partner's value

    arr$[partner]; // read to empty, unlocking the slot
    // Reset: this write blocks until arr$[rank] is empty
    arr$[rank] = -1;
}
This is a somewhat complicated way of implementing FIFO channels (see Julia's RemoteChannel, which is where I got the inspiration for this "pattern").
Overview:
1. First, each locale calculates its intended receiver and its partner (the locale it will read a value from)
2. The locale checks whether the previous value was read
3. The locale stores a new value and "locks" it by setting arr$[rank] to the intended receiver
4. The locale waits while its partner stores its value and sets the appropriate intended receiver
5. Once the locale is the intended receiver, it reads the partner's value and does some operation on it
6. Then the locale empties/unlocks the value by reading arr$[partner]
7. Finally, it resets its arr$[rank] by writing -1. This way we also ensure that the value set by the locale was read by the intended receiver
I realize that this might be an overly complicated solution to this problem. There probably exists a better algorithm that fits Chapel's global view of parallel computation; for instance, a global sum over a Block-distributed array can be written as a single "+ reduce" expression instead of a hand-rolled all-reduce. The algorithm I implemented lends itself to the SPMD model of computation.
That being said, I think Chapel still does a good job performance-wise. Here are the performance benchmarks against Julia and C/MPI. As the number of processes grows, the performance improves by quite a lot. I didn't have a chance to run the benchmark on a cluster with more than 4 nodes, but I think Chapel will end up with respectable numbers.

Related

runif function in CUDA

I am trying to implement a Metropolis-Hastings algorithm in CUDA. For this algorithm, I need to be able to generate many uniform random numbers with varying ranges. Therefore, I would like to have a function called runif(min, max) that returns a uniformly distributed number in that range. This function has to be called multiple times inside another function that actually implements the algorithm.
Based on this post, I tried to put the code shown there into a function (see below). If I understood it correctly, the same state leads to the same sequence of numbers, so if the state doesn't change, I always get the same output. One alternative would be to generate a new state inside the runif function, so that each call uses a different state. As I've heard, though, this is not advisable since it makes the function slow.
So, what would be the best implementation of such a function? Should I generate a new state inside the function, or generate a new one outside each time I call the function? Or is there yet another approach?
__device__ float runif(float a, float b, curandState state)
{
    float myrandf = curand_uniform_double(&state);
    myrandf *= (b - a + 0.999999);
    myrandf += a;
    return myrandf;
}
How it works
The curand_uniform* family of functions accepts a pointer to a curandState object, uses it, and modifies it, so that the next time a curand_uniform* function is called with the same state object you get the randomness you want.
The important thing here is: in order to get meaningful results, you need to write the curandState changes back.
Wrong way 1
Right now you are passing curandState by value, so the state changes are lost when the function returns, not to mention the unnecessary time wasted on copying.
Wrong way 2
Creating and initializing a new local state inside the kernel will not only kill performance (and defeat the purpose of using CUDA) but will also give you the wrong distribution.
Right way
In the sample code you've linked, curandState is passed by pointer, which guarantees that the modifications are saved (wherever the pointer points to).
Usually, you would allocate and initialize an array of random states once in your program (before launching any kernels that require the RNG). Then, in order to generate some numbers, you access this array from kernels, indexed by thread id. Multiple (many) states are required to avoid data races: at least one state per concurrently running curand_uniform* call.
This way you avoid the performance cost of copies and state initialization, and you get your perfect distribution.
See the cuRAND documentation for more information and sample code.
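To make that concrete, here is a minimal sketch of the setup described above; the kernel names, launch configuration, and the Metropolis kernel body are illustrative, not taken from the original post. Each thread gets its own state, initialized once, and runif takes a pointer so the advanced state survives the call:

#include <curand_kernel.h>

// One-time setup: one RNG state per thread.
// Same seed, different subsequence per thread gives independent streams.
__global__ void setupStates(curandState *states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]);
}

// Uniform float in (a, b]: curand_uniform returns a value in (0, 1].
// The state is passed by pointer, so its update is preserved.
__device__ float runif(float a, float b, curandState *state)
{
    return a + (b - a) * curand_uniform(state);
}

__global__ void metropolisStep(curandState *states, float *out, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    curandState local = states[id]; // work on a register-local copy
    out[id] = runif(-1.0f, 1.0f, &local);
    states[id] = local;             // write the advanced state back
}

On the host you would cudaMalloc the states array and launch setupStates once, before any kernel that draws random numbers.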

OSX AudioUnit SMP

I'd like to know if someone has experience writing a HAL AudioUnit rendering callback that takes advantage of multi-core processors and/or symmetric multiprocessing.
My scenario is the following:
A single audio component of sub-type kAudioUnitSubType_HALOutput (together with its rendering callback) takes care of additively synthesizing n sinusoidal partials with independent, individually varying, live-updated amplitude and phase values. In itself it is a rather straightforward brute-force nested-loop method (per partial, per frame, per channel).
However, upon reaching a certain upper limit on the number of partials n, the processor gets overloaded and starts producing drop-outs, while the three other cores remain idle.
Aside from the general discussion about additive synthesis being "processor expensive" compared to, say, wavetable synthesis, I need to know whether this can be resolved the right way, that is, by taking advantage of multiprocessing on a multi-processor or multi-core machine. Breaking the rendering thread into sub-threads does not seem the right way, since the render callback is already a time-constrained thread in itself, and the final output has to be sample-accurate in terms of latency. Has anyone had positive experience with valid methods for resolving such an issue?
System: 10.7.x
CPU: quad-core i7
Thanks in advance,
CA
This is challenging because OS X is not designed for something like this. There is a single audio thread - it's the highest priority thread in the OS, and there's no way to create user threads at this priority (much less get the support of a team of systems engineers who tune it for performance, as with the audio render thread). I don't claim to understand the particulars of your algorithm, but if it's possible to break it up such that some tasks can be performed in parallel on larger blocks of samples (enabling absorption of periods of occasional thread starvation), you certainly could spawn other high priority threads that process in parallel. You'd need to use some kind of lock-free data structure to exchange samples between these threads and the audio thread. Convolution reverbs often do this to allow reasonable latency while still operating on huge block sizes. I'd look into how those are implemented...
Have you looked into the Accelerate.framework? You should be able to improve the efficiency by performing operations on vectors instead of using nested for-loops.
If you have vectors (of length n) for the sinusoidal partials, the amplitude values, and the phase values, you could apply a vDSP_vadd or vDSP_vmul operation, then vDSP_sve.
As far as I know, AU threading is handled by the host. A while back, I tried a few ways to multithread an AU render using various methods (GCD, OpenCL, etc.), and they were all either a no-go or unpredictable. There is (or at least was; I haven't checked recently) a built-in AU called the 'deferred renderer', I believe, which threads the input and output separately, but I seem to remember there was latency involved, so that might not help.
Also, if you are testing in AU Lab, I believe it is set up specifically to call on a single thread (I think that is still the case), so you might need to tinker with another test host to see if it still chokes when the load is distributed.
Sorry I couldn't help more, but I thought those few bits of info might be helpful.
Sorry for replying to my own question; I don't know of another way to add relevant information. Editing doesn't seem to work, and a comment is way too short.
First of all, sincere thanks to jtomschroeder for pointing me to the Accelerate.framework.
This would work perfectly for so-called overlap-add resynthesis based on the IFFT. Yet I haven't found a key to vectorizing the kind of process I'm using, which is called "oscillator-bank resynthesis" and is notorious for taxing the processor (F.R. Moore: Elements of Computer Music). Each momentary phase and amplitude has to be interpolated "on the fly", and the last value stored in the control struct for further interpolation. The direction of time and the time stretch depend on live input. Not all partials exist all the time, and the placement of breakpoints is arbitrary and possibly irregular. Of course, my primary concern is organizing the data in a way that minimizes the number of math operations...
If someone could point me at an example of positive practice, I'd be very grateful.
// Here's the simplified code snippet:
OSStatus AdditiveRenderProc(
    void *inRefCon,
    AudioUnitRenderActionFlags *ioActionFlags,
    const AudioTimeStamp *inTimeStamp,
    UInt32 inBusNumber,
    UInt32 inNumberFrames,
    AudioBufferList *ioData)
{
    // local variables' declaration and behaviour-setting conditional statements
    // some local variables are here for debugging convenience
    // {... ... ...}

    // Get the time-breakpoint parameters out of the gen struct
    AdditiveGenerator *gen = (AdditiveGenerator *)inRefCon;

    // compute interpolated values for each partial's each frame
    // {deltaf[p]... ampf[p][frame]... ...}

    // here comes the brute-force "processor eater" (single channel only!)
    Float32 *buf = (Float32 *)ioData->mBuffers[channel].mData;
    for (UInt32 frame = 0; frame < inNumberFrames; frame++) {
        buf[frame] = 0.;
        for (UInt32 p = 0; p < candidates; p++) {
            if (gen->partialFrequencyf[p] < NYQUISTF)
                buf[frame] += sinf(phasef[p]) * ampf[p][frame];
            phasef[p] += (gen->previousPartialPhaseIncrementf[p] + deltaf[p] * frame);
            if (phasef[p] > TWO_PI) phasef[p] -= TWO_PI;
        }
        buf[frame] *= ovampf[frame];
    }

    for (UInt32 p = 0; p < candidates; p++) {
        // store the updated parameters back to the gen struct
        // {... ... ...}
        ;
    }
    return noErr;
}

How is a prefix sum a bulk-synchronous algorithmic primitive?

Concerning NVIDIA GPUs, the author of the paper High Performance and Scalable GPU Graph Traversal says:
1. A sequence of kernel invocations is bulk-synchronous: each kernel is initially presented with a consistent view of the results from the previous.
2. Prefix sum is a bulk-synchronous algorithmic primitive.
I am unable to understand these two points (I do know GPU-based prefix sum, though). Can someone help me with this concept?
1. A sequence of kernel invocations is bulk-synchronous: each kernel is initially presented with a consistent view of the results from the previous.
This is about the parallel computation model: each processor has its own memory, which is fast (like a cache in a CPU), and performs computations on the values stored there without any synchronization. Then non-blocking communication takes place: each processor puts the data it has computed so far where others can see it, and gets data from its neighbours. Then there's another synchronization step, a barrier, which makes all of them wait for each other.
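As a minimal CUDA illustration of point 1 (the kernels and data here are made up for the example): kernels launched on the same stream execute in order, so the second kernel starts with a consistent view of everything the first one wrote. The gap between the two launches plays the role of the global barrier between BSP supersteps.

// Superstep 1: purely local computation and writes.
__global__ void step1(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i * i;
}

// Superstep 2: reads step1's results; writing to a separate buffer keeps it race-free.
__global__ void step2(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + (i > 0 ? in[i - 1] : 0);
}

// Host side: consecutive launches on the same stream are implicitly ordered.
// step1<<<blocks, threads>>>(d_in, n);
// step2<<<blocks, threads>>>(d_in, d_out, n);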
2. Prefix sum is a bulk-synchronous algorithmic primitive.
I believe that's about the second step of the BSP model: communication/synchronization. That's the way processors store and get data for the next superstep.
The name of the model implies that it is highly concurrent (many, many processes that work synchronously relative to each other). And this is how we get to the second point.
In so far as we want to live up to the name (be highly concurrent), we want to get rid of sequential parts wherever possible. We can achieve that with prefix sum.
Consider the associative prefix-sum operator +. Then an exclusive scan of the sequence [5 2 0 3 1] returns the sequence [0 5 7 7 10]. So now we can replace this sequential pseudocode:
foreach i = 1...n
    foo[i] = foo[i-1] + bar(i);
with this pseudocode, which can now be parallel(!):
foreach (i)
    baz[i] = bar(i);
scan(foo, baz);
That is a very naive version, but it's okay for the explanation.
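For reference, here is a small CUDA/Thrust sketch of the same idea, using the example values from above (the use of Thrust is my choice here, not something from the original answer): the per-element bar(i) work is embarrassingly parallel, and the only sequential dependency is folded into a single library scan call.

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <vector>
#include <cstdio>

int main()
{
    // In the pseudocode above, baz[i] = bar(i) would be computed in parallel;
    // here we simply start from the example values.
    std::vector<int> h = {5, 2, 0, 3, 1};
    thrust::device_vector<int> baz(h.begin(), h.end());
    thrust::device_vector<int> foo(baz.size());

    // Exclusive scan on the device: foo = [0, 5, 7, 7, 10].
    thrust::exclusive_scan(baz.begin(), baz.end(), foo.begin());

    for (size_t i = 0; i < foo.size(); i++)
        printf("%d ", (int)foo[i]);
    printf("\n");
    return 0;
}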

Fastest data structure with default values for undefined indexes?

I'm trying to create a 2D array where accessing a defined index returns its value; however, if an undefined index is accessed, it calls a callback, fills that index with the returned value, and then returns the value.
The array will have negative indexes, too, but I can overcome that by using 4 arrays (one for each quadrant around 0,0).
You can create a matrix that relies on tuples and a dictionary, with the following behavior:

from collections import namedtuple

# A named tuple of coordinates serves as the dictionary key.
MatrixEntry = namedtuple("MatrixEntry", ["x", "y"])

matrix = dict()
defaultValue = 0

# add an entry at (0, 1)
matrix[MatrixEntry(0, 1)] = 10.0

# get the value at (0, 1), falling back to the default for missing keys
key = MatrixEntry(0, 1)
value = matrix.get(key, defaultValue)
Cheers
This question is probably too broad for Stack Overflow: there is no generic "one size fits all" solution, and the results depend a lot on the language used (and its standard library).
There are several problems in this question. First of all, let us consider the 2D array itself: we will say it is simply already part of the language and that it grows dynamically on access. If that isn't the case, the question becomes very language-dependent.
Now, when allocating memory, a language often initializes the slots automatically (again, how this happens and what the best method is are language-dependent; look into RAII). However, I can foresee that the actual calculation of a specific cell might be costly compared to the allocation. In that case, an interesting technique is so-called "two-phase construction". The array is filled with tuples/objects whose default construction sets a bit/boolean to false, indicating that the value is not ready. Then, on access (i.e., a get() method or an operator(), language-dependent), if this bit is false the value is constructed first; otherwise it is just read.
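Here is a minimal C++ sketch of two-phase construction under those assumptions; the grid class, its callback, and all names are illustrative:

#include <vector>
#include <functional>

template <typename T>
class LazyGrid {
    struct Cell {
        bool ready = false;  // first phase: cheap default construction only sets a flag
        T value{};
    };
    std::vector<Cell> cells;
    int width;
    std::function<T(int, int)> compute;  // callback used to fill undefined cells
public:
    LazyGrid(int w, int h, std::function<T(int, int)> f)
        : cells(static_cast<size_t>(w) * h), width(w), compute(std::move(f)) {}

    T &at(int x, int y) {
        Cell &c = cells[static_cast<size_t>(y) * width + x];
        if (!c.ready) {  // second phase: compute the value on first access
            c.value = compute(x, y);
            c.ready = true;
        }
        return c.value;
    }
};

// Usage: LazyGrid<double> g(100, 100, [](int x, int y) { return 0.5 * x + y; });
// g.at(3, 4) invokes the callback once, then returns the cached value on later accesses.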
Another method is to use a dictionary/key-value map, where the key is the coordinates and the value is the cell value. This has the advantage that construct-on-access is inherent to the data structure (though, again, language-dependent). The drawback of maps, however, is that the lookup cost changes from O(1) to O(log n). (The actual time differs widely depending on the language.)
Finally, I hope you understand that how to do this depends on your more specific requirements, the language you use, and other libraries. In the end, there is only a single data structure present in every language: a long sequence of unallocated values. Anything more advanced than that depends on the language.

Reusing memory of immutable state in eager evaluation?

I'm studying purely functional languages and am currently thinking about some immutable data implementations.
Here is some pseudocode:
List a = [1 .. 10000]
List b = NewListWithoutLastElement a
b
When evaluating b, b must be copied in an eager/strict implementation of immutable data.
But in this case, a is not used anywhere afterwards, so the memory of a could safely be reused to avoid the copying cost.
Furthermore, the programmer could force the compiler to always do this by marking the type List with some keyword meaning must-be-disposed-after-use, which would turn any logic that cannot avoid the copying cost into a compile-time error.
This could yield a huge performance gain, because it could be applied to huge object graphs too.
What do you think? Are there any implementations?
This would be possible, but severely limited in scope. Keep in mind that the vast majority of complex values in a functional program are passed to many functions to extract various properties from them; and, most of the time, those functions are themselves arguments to other functions, which means you cannot make any assumptions about them.
For example:
let map2 f g x = f x, g x

let apply f =
    let a = [1 .. 10000]
    f a

// in another file:
apply (map2 NewListWithoutLastElement NewListWithoutFirstElement)
This is fairly standard in functional code, and there is no way to place a must-be-disposed-after-use attribute on a, because no single location has enough knowledge about the rest of the program. Of course, you could try adding that information to the type system, but type inference for this is decidedly non-trivial (not to mention that the types would grow quite large).
Things get even worse when you have compound objects, such as trees, that might share sub-elements between values. Consider this:
let a = binary_tree [ 1; 2; 5; 7; 9 ]
let result_1 = complex_computation_1 (insert a 6)
let result_2 = complex_computation_2 (remove a 5)
In order to allow memory reuse within complex_computation_2, you would need to prove that complex_computation_1 does not alter a, does not store any part of a within result_1, and is done using a by the time complex_computation_2 starts working. While the first two requirements might seem the harder ones, keep in mind that this is a pure functional language: the third requirement actually causes a massive performance drop, because complex_computation_1 and complex_computation_2 can no longer run on different threads!
In practice, this is not an issue in the vast majority of functional languages, for three reasons:
1. They have a garbage collector built specifically for this. It is faster for them to just allocate new memory and reclaim the abandoned memory than to try to reuse existing memory. In the vast majority of cases, this is fast enough.
2. They have data structures that already implement data sharing. For instance, NewListWithoutFirstElement already provides full reuse of the memory of the transformed list without any effort. It's fairly common for functional programmers (and any kind of programmers, really) to choose their data structures based on performance considerations, and rewriting a "remove last" algorithm as a "remove first" algorithm is fairly easy.
3. Lazy evaluation already does something equivalent: a lazy list's tail is initially just a closure that can evaluate the tail if you need it, so there is no memory to reuse yet. On the other hand, this means that reading an element from b in your example would read one element from a, determine whether it's the last one, and return it without really requiring storage (a cons cell would probably be allocated somewhere in there, but this happens all the time in functional programming languages, and short-lived small objects are perfectly fine with the GC).
