OSX AudioUnit SMP

I'd like to know if someone has experience writing a HAL AudioUnit rendering callback that takes advantage of multi-core processors and/or symmetric multiprocessing.
My scenario is the following:
A single audio component of sub-type kAudioUnitSubType_HALOutput (together with its rendering callback) takes care of additively synthesizing n sinusoid partials with independent, individually varying, live-updated amplitude and phase values. In itself it is a rather straightforward brute-force nested-loop method (per partial, per frame, per channel).
However, upon reaching a certain upper limit for the number of partials "n", the processor gets overloaded and starts producing drop-outs, while the other three cores remain idle.
Aside from the general point that additive synthesis is "processor expensive" compared to, say, wavetable synthesis, I need to know whether this can be resolved the right way, which involves taking advantage of multiprocessing on a multi-processor or multi-core machine. Breaking the rendering thread into sub-threads does not seem the right way, since the render callback is already a time-constrained thread in itself, and the final output has to be sample-accurate in terms of latency. Has someone had positive experience with valid methods for resolving such an issue?
System: 10.7.x
CPU: quad-core i7
Thanks in advance,
CA

This is challenging because OS X is not designed for something like this. There is a single audio thread - it's the highest priority thread in the OS, and there's no way to create user threads at this priority (much less get the support of a team of systems engineers who tune it for performance, as with the audio render thread). I don't claim to understand the particulars of your algorithm, but if it's possible to break it up such that some tasks can be performed in parallel on larger blocks of samples (enabling absorption of periods of occasional thread starvation), you certainly could spawn other high priority threads that process in parallel. You'd need to use some kind of lock-free data structure to exchange samples between these threads and the audio thread. Convolution reverbs often do this to allow reasonable latency while still operating on huge block sizes. I'd look into how those are implemented...
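To make the idea concrete, here is a minimal sketch of such a lock-free structure: a single-producer/single-consumer ring buffer (the class name and the power-of-two-capacity trick are my own choices, not from any particular library) that a worker thread can fill while the render callback drains it without taking locks:

#include <atomic>
#include <cstddef>
#include <vector>

class SpscRingBuffer {
public:
    // Capacity must be a power of two so masking can replace modulo.
    explicit SpscRingBuffer(size_t capacity)
        : buffer(capacity), mask(capacity - 1), head(0), tail(0) {}

    // Called only from the worker (producer) thread.
    bool push(float sample) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == buffer.size())
            return false;                        // full: retry later
        buffer[h & mask] = sample;
        head.store(h + 1, std::memory_order_release);
        return true;
    }

    // Called only from the render callback (consumer); never blocks.
    bool pop(float &sample) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (head.load(std::memory_order_acquire) == t)
            return false;                        // empty: render silence
        sample = buffer[t & mask];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

private:
    std::vector<float> buffer;
    const size_t mask;
    std::atomic<size_t> head, tail;
};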

Have you looked into the Accelerate.framework? You should be able to improve the efficiency by performing operations on vectors instead of using nested for-loops.
If you have vectors (of length n) for the sinusoidal partials, the amplitude values, and the phase values, you could apply a vDSP_vadd or vDSP_vmul operation, then vDSP_sve.
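As a rough sketch of what that could look like for the per-frame sum over partials (the function name and buffer layout are mine; vvsinf is the vectorized sine from vForce, which also ships with Accelerate):

#include <Accelerate/Accelerate.h>

// One output sample from n partials: *out = sum(sinf(phases) * amps).
void sum_partials(const float *phases, const float *amps,
                  float *scratch, int n, float *out)
{
    vvsinf(scratch, phases, &n);                  // scratch = sin(phases)
    vDSP_vmul(scratch, 1, amps, 1, scratch, 1,    // scratch *= amps
              (vDSP_Length)n);
    vDSP_sve(scratch, 1, out, (vDSP_Length)n);    // *out = sum(scratch)
}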

As far as I know, AU threading is handled by the host. A while back, I tried a few ways to multithread an AU render using various methods (GCD, OpenCL, etc.), and they were all either a no-go or unpredictable. There is (or at least was; I have not checked recently) a built-in AU called the 'deferred renderer', I believe, which threads the input and output separately, but I seem to remember there was latency involved, so that might not help.
Also, if you are testing in AU Lab, I believe it is set up specifically to only call on a single thread (I think that is still the case), so you might need to tinker with another test host to see if it still chokes when the load is distributed.
Sorry I couldn't help more, but I thought those few bits of info might be helpful.

Sorry for replying to my own question; I don't know how else to add relevant information. Edit doesn't seem to work, and a comment is way too short.
First of all, sincere thanks to jtomschroeder for pointing me to the Accelerate.framework.
This would work perfectly for so-called overlap-add resynthesis based on the IFFT. Yet I haven't found a key to vectorizing the kind of process I'm using, which is called "oscillator-bank resynthesis" and is notorious for taxing the processor (F.R. Moore: Elements of Computer Music). Each momentary phase and amplitude has to be interpolated "on the fly" and the last value stored into the control struct for further interpolation. The direction of time and the time stretch depend on live input. Not all partials exist all the time, and the placement of breakpoints is arbitrary and possibly irregular. Of course, my primary concern is organizing the data in a way that minimizes the number of math operations...
If someone could point me at an example of positive practice, I'd be very grateful.
// Here's the simplified code snippet:
OSStatus AdditiveRenderProc(void                       *inRefCon,
                            AudioUnitRenderActionFlags *ioActionFlags,
                            const AudioTimeStamp       *inTimeStamp,
                            UInt32                     inBusNumber,
                            UInt32                     inNumberFrames,
                            AudioBufferList            *ioData)
{
    // local variables' declaration and behaviour-setting conditional statements
    // some local variables are here for debugging convenience
    // {... ... ...}

    // Get the time-breakpoint parameters out of the gen struct
    AdditiveGenerator *gen = (AdditiveGenerator *)inRefCon;

    // compute interpolated values for each partial's each frame
    // {deltaf[p]... ampf[p][frame]... ...}

    // here comes the brute-force "processor eater" (single channel only!)
    Float32 *buf = (Float32 *)ioData->mBuffers[channel].mData;
    for (UInt32 frame = 0; frame < inNumberFrames; frame++) {
        buf[frame] = 0.;
        for (UInt32 p = 0; p < candidates; p++) {
            if (gen->partialFrequencyf[p] < NYQUISTF)
                buf[frame] += sinf(phasef[p]) * ampf[p][frame];
            phasef[p] += (gen->previousPartialPhaseIncrementf[p] + deltaf[p]*frame);
            if (phasef[p] > TWO_PI) phasef[p] -= TWO_PI;
        }
        buf[frame] *= ovampf[frame];
    }

    for (UInt32 p = 0; p < candidates; p++) {
        // store the updated parameters back to the gen struct
        // {... ... ...}
    }

    return noErr;
}

Related

Orthogonal Recursive Bisection in Chapel (Barnes-Hut algorithm)

I'm implementing a distributed version of the Barnes-Hut n-body simulation in Chapel. I've already implemented the sequential and shared-memory versions, which are available on my GitHub.
I'm following the algorithm outlined here (Chapter 7):
Perform orthogonal recursive bisection and distribute bodies so that each process has an equal amount of work
Construct locally essential tree on each process
Compute forces and advance bodies
I have a pretty good idea of how to implement the algorithm in C/MPI, using MPI_Allreduce for bisection and simple message passing for communication between processes (for body transfer). MPI_Comm_split is also a very handy function that allows me to split the processes at each step of ORB.
I'm having some trouble performing ORB using the parallel/distributed constructs that Chapel provides. I would need some way to sum (reduce) work across processes (locales in Chapel), to split processes into groups, and to have process-to-process communication to transfer bodies.
I would be grateful for any advice on how to implement this in Chapel. If another approach would be better for Chapel that would also be great.
After a lot of deadlocks and crashes I did manage to implement the algorithm in Chapel. It can be found here: https://github.com/novoselrok/parallel-algorithms/tree/75312981c4514b964d5efc59a45e5eb1b8bc41a6/nbody-bh/dm-chapel
I was not able to use much of the fancy parallel equipment Chapel provides. I relied only on block distributed arrays with sync elements. I also replicated the SPMD model.
In main.chpl I set up all of the necessary arrays that will be used to transfer data. Each array has a corresponding sync array used for synchronization. Then each worker is started with its share of bodies and the previously mentioned arrays. Worker.chpl contains the bulk of the algorithm.
I replaced the MPI_Comm_split functionality with a custom function determineGroupPartners, where I do the same thing manually. As for MPI_Allreduce, I found a nice little pattern I could use everywhere:
var localeSpace = {0..#numLocales};
var localeDomain = localeSpace dmapped Block(boundingBox=localeSpace);

var arr: [localeDomain] SomeType;
var arr$: [localeDomain] sync int; // stores the ranks of intended receivers
var rank = here.id;

for i in 0..#numLocales-1 {
    var intendedReceiver = (rank + i + 1) % numLocales;
    var partner = ((rank - (i + 1)) + numLocales) % numLocales;

    // Wait until the previous value is read
    if (arr$[rank].isFull) {
        arr$[rank];
    }

    // Store my value
    arr[rank] = valueIWantToSend;
    arr$[rank] = intendedReceiver;

    // Am I the intended receiver?
    while (arr$[partner].readFF() != rank) {}

    // Read partner value
    var partnerValue = arr[partner];
    // Do something with partner value

    arr$[partner]; // empty
    // Reset write, which blocks until arr$[rank] is empty
    arr$[rank] = -1;
}
This is a somewhat complicated way of implementing FIFO channels (see Julia's RemoteChannel, which is where I got the inspiration for this "pattern").
Overview:
First, each locale calculates its intended receiver and its partner (the locale it will read a value from)
The locale checks whether the previous value was read
The locale stores a new value and "locks" it by setting arr$[rank] to the intended receiver
The locale waits while its partner sets the value and the appropriate intended receiver
Once the locale is the intended receiver, it reads the partner's value and does some operation on it
Then the locale empties/unlocks the value by reading arr$[partner]
Finally, it resets its arr$[rank] by writing -1. This way we also ensure that the value set by the locale was read by the intended receiver
I realize that this might be an overly complicated solution for this problem. There probably exists a better algorithm that would fit Chapel's global view of parallel computation. The algorithm I implemented lends itself to the SPMD model of computation.
That being said, I think Chapel still does a good job performance-wise. Here are the performance benchmarks against Julia and C/MPI. As the number of processes grows, the performance improves by quite a lot. I didn't have a chance to run the benchmark on a cluster with more than 4 nodes, but I think Chapel will end up with respectable benchmarks.

runif function in CUDA

I am trying to implement a Metropolis-Hastings algorithm in CUDA. For this algorithm, I need to be able to generate many uniform random numbers with varying ranges. Therefore, I would like to have a function called runif(min, max) that returns a uniformly distributed number in this range. This function has to be called multiple times inside another function that actually implements the algorithm.
Based on this post, I tried to put the code shown there into a function (see below). If I understood this correctly, the same state leads to the same sequence of numbers. So if the state doesn't change, I always get the same output. One alternative would be to generate a new state inside the runif function, so that each time the function is called it uses another state. As I've heard, though, this is not advisable since the function gets slow.
So, what would be the best implementation of such a function? Should I generate a new state inside the function or generate a new one outside each time I call the function? Or is there yet another approach?
__device__ float runif(float a, float b, curandState state)
{
    float myrandf = curand_uniform_double(&state);
    myrandf *= (b - a + 0.999999);
    myrandf += a;
    return myrandf;
}
How it works
The curand_uniform* family of functions accepts a pointer to a curandState object, uses it and modifies it, so that the next time a curand_uniform* function is called with the same state object you get the desired randomness.
The important thing here is:
In order to get meaningful results you need to write the curandState changes back.
Wrong way 1
Right now you are passing curandState by value, so the state changes are lost after the kernel returns, not to mention the unnecessary time wasted on copying.
Wrong way 2
Creating and initializing a new local state inside the kernel will not only kill performance (and defeat any use of CUDA) but will give you the wrong distribution.
Right way
In the sample code you've linked, curandState is passed by pointer, which guarantees that modifications are saved (wherever the pointer points to).
Usually, you would want to allocate and initialize an array of random states once in your program (before launching any kernels that require the RNG). Then, in order to generate some numbers, you access this array from kernels, with indices based on thread ids. Multiple (many) states are required to avoid data races (at least one state per concurrently running curand_uniform* call).
This way you avoid performance costs of copies and state initialization and get your perfect distribution.
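A minimal sketch of that arrangement (kernel and function names are mine; the key points are curand_init once per thread and passing the state by pointer):

#include <curand_kernel.h>

__global__ void init_states(curandState *states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // Same seed, per-thread subsequence -> independent streams.
    curand_init(seed, id, 0, &states[id]);
}

// Pass the state by pointer so the advance survives the call.
__device__ float runif(float a, float b, curandState *state)
{
    // curand_uniform returns a float in (0, 1].
    return a + (b - a) * curand_uniform(state);
}

__global__ void sampling_kernel(curandState *states, float *out)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = states[id];   // work on a register copy
    out[id] = runif(-1.0f, 1.0f, &local);
    states[id] = local;               // write the advanced state back
}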
See the cuRAND documentation for more information and sample code.

DCF77 decoder vs. noisy signal

I have almost completed my open-source DCF77 decoder project. It all started when I noticed that the standard (Arduino) DCF77 libraries perform very poorly on noisy signals. In particular, I was never able to get the time out of the decoders when the antenna was close to the computer or when my washing machine was running.
My first approach was to add a (digital) exponential filter + trigger to the incoming signal.
Although this improved the situation significantly, it was still not really good. Then I started to read some standard books on digital signal processing and especially the original works of Claude Elwood Shannon. My conclusion was that the proper approach would be to not "decode" the signal at all because it is (except for leap seconds) completely known a priori. Instead it would be more appropriate to match the received data to a locally synthesized signal and just determine the proper phase. This in turn would reduce the effective bandwidth by some orders of magnitude and thus reduce the noise significantly.
Phase detection implies the need for fast convolution. The standard approach for efficient convolution is of course the fast Fourier transform. However, I am implementing for the Arduino / ATmega328, so I have only 2 kB of RAM. So instead of the straightforward FFT approach, I started stacking matched phase-locked-loop filters. I documented the different project stages here:
First try: exponential filter
Start of the better approach: phase lock to the signal / second ticks
Phase lock to the minutes
Decoding minute and hour data
Decoding the whole signal
Adding a local clock to deal with signal loss
Using a locally synthesized signal for faster lock reacquisition after signal loss
I searched the internet quite extensively and found no similar approach. Still, I wonder if there are similar (and maybe better) implementations, or if there exists research on this kind of signal reconstruction.
What I am not searching for: designing optimized codes for getting close to the Shannon limit. I am also not searching for information on the superimposed PRNG code on DCF77. I also do not need hints on "matched filters" as my current implementation is an approximation of a matched filter. Specific hints on Viterbi Decoders or Trellis approaches are not what I am searching for - unless they address the issue of tight CPU and RAM constraints.
What I am searching for: are there any descriptions / implementations of other non-trivial algorithms for decoding signals like DCF77, with limited CPU and RAM in the presence of significant noise? Maybe in some books or papers from the pre internet era?
Have you considered using a chip matched filter to perform your convolution?
http://en.wikipedia.org/wiki/Matched_filter
They are almost trivially easy to implement, as each chip / bit period can be implemented as an add/subtract delay line (use a circular buffer).
A simple one for a square wave (it will also work, though less optimally, with other waveforms) of unknown sequence (but known frequency) can be implemented something like this:
#include <array>

// Circular buffer: insert() returns the element that is overwritten,
// i.e. the sample from `length` insertions ago.
template <int length>
class CircularBuffer {
public:
    CircularBuffer() : element_pos(0) {
        buffer.fill(0);
    }

    int insert(int new_element) {
        int temp = buffer[element_pos];
        buffer[element_pos] = new_element;
        element_pos += 1;
        if (element_pos == length) {
            element_pos = 0;
        }
        return temp;
    }

private:
    std::array<int, length> buffer;
    int element_pos;
};

// Filter class: a running sum over one bit period, i.e. the
// add/subtract delay line described above.
template <int samples_per_bit>
class matchedFilter {
public:
    matchedFilter() : acc(0) {}

    int filterInput(int next_sample) {
        // Add the new sample, subtract the one falling out of the window.
        acc += next_sample - sample_buffer.insert(next_sample);
        return acc;
    }

private:
    int acc;
    CircularBuffer<samples_per_bit> sample_buffer;
};
As you can see, resource-wise this is relatively trivial. If there is a specific waveform you're after, you can cascade these together to give a longer correlation.
The reference to matched filters by Ollie B. is not what I was asking for. I already covered this before in my blog.
However, by now I have received a very good hint by private mail: there exists a paper, "Performance Analysis and Receiver Architectures of DCF77 Radio-Controlled Clocks" by Daniel Engeler. This is the kind of stuff I am searching for.
With further searches starting from the Engeler paper, I found the following German patents: DE3733966A1 - Anordnung zum Empfang stark gestoerter Signale des Senders dcf-77 ("Arrangement for receiving heavily disturbed signals of the DCF-77 transmitter") and DE4219417C2 - Schmalbandempfänger für Datensignale ("Narrow-band receiver for data signals").

Tips for improving performance of a 2d image 'tracing' CUDA kernel?

Can you give me some tips to optimize this CUDA code?
I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060, although I'm testing it now on a GTX 260, which has the same compute capability) and I have several kernels like the one below. The number of threads needed to execute this kernel is given by long SUM and depends on size_t M and size_t N, the dimensions of a rectangular image received as a parameter; it can vary greatly, from 50x50 to 10000x10000 pixels or more, although I'm mostly interested in working on the bigger images with CUDA.
Now, each image has to be traced in all directions and angles, and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229,080 threads computing the kernel below, which is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays, all of length SUM, into the global memory of the device one after another, since I need to access them for the calculations, like this:
cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
...etc
So each value of every array can be accessed by one thread. All the copies are done before the kernel calls. According to the CUDA profiler in Nsight, the highest memcpy duration is 246.016 us for a 500x500 image, so that is not taking long.
But kernels like the one I copied below are taking too long for any practical use (3.25 seconds according to the CUDA profiler for the kernel below, and 5.052 seconds for the kernel with the highest duration, for a 500x500 image), so I need to see if I can optimize them.
I arrange the data this way. First, the block dimension:
dim3 dimBlock(256,1,1);
then the number of blocks per grid:
dim3 dimGrid((SUM+255)/256);
That makes 895 blocks for a 500x500 image.
I'm not sure how to use coalescing and shared memory in my case, or even whether it's a good idea to call the kernel several times with different portions of the data. The data are independent of one another, so I could in theory call the kernel several times, not with all 229,080 threads at once, if need be.
Now take into account that the outer for loop
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
depends on
tendbegin_cuda[idHilo]
the value of which depends on each thread but most threads have similar values for it.
According to the CUDA profiler, the Global Store Efficiency is 0.619 and the Global Load Efficiency is 0.951 for this kernel. Other kernels have similar values.
Is that good? Bad? How can I interpret those values? Sadly, devices of compute capability 1.3 don't provide other useful info for assessing the code, like the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis are "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency", but I'm not sure how I can optimize those.
void __global__ t21_trazo(long SUM, int cT, double Bn, size_t M, size_t N,
                          float* imagen_cuda, double* vector_trazo_cuda,
                          long* xb_cuda, long* yb_cuda, long* xinc_cuda,
                          long* yinc_cuda, long* tbegin_cuda, long* tendbegin_cuda)
{
    long xi;
    long yi;
    int t;
    int k;
    int a;
    int ji;
    long idHilo = blockIdx.x*blockDim.x + threadIdx.x;
    int neighborhood[31];
    int v = 0;

    if (idHilo < SUM) {
        for (t = 15; t <= tendbegin_cuda[idHilo]-15; t++) {
            xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
            yi = yb_cuda[idHilo] + floor((double)t*yinc_cuda[idHilo]);
            neighborhood[v] = floor(xi/Bn);
            ji = floor(yi/Bn);
            if (fabs((double)neighborhood[v]) < M && fabs((double)ji) < N) {
                if (tendbegin_cuda[idHilo] > 30 && v == 30) {
                    if (t == 0)
                        vector_trazo_cuda[20+idHilo*31] = 0;
                    for (k = 1; k <= 15; k++)
                        vector_trazo_cuda[20+idHilo*31] = vector_trazo_cuda[20+idHilo*31] +
                            fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])] -
                                 imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
                    for (a = 0; a < 30; a++)
                        neighborhood[a] = neighborhood[a+1];
                    v = v-1;
                }
                v = v+1;
            }
        }
    }
}
EDIT:
Changing the DP flops to SP flops only slightly improved the duration. Unrolling the inner loops practically didn't help.
Sorry for the unstructured answer; I'm just going to throw out some generally useful comments, with references to your code, to make this more useful to others.
Algorithm changes are always number one for optimizing: is there another way to solve the problem that requires less math / fewer iterations / less memory, etc.?
If precision is not a big concern, use single-precision floating point (or half-precision floating point on newer architectures). Part of the reason it didn't affect your performance much when you briefly tried it is that you're still doing double-precision calculations on your floating-point data (fabs takes a double, so if you use it with a float, it converts your float to a double, does double math, returns a double, and converts back to float; use fabsf).
If you don't need the absolute full precision of float, use fast math (a compiler option).
Multiplication is much faster than division (especially at full precision / without fast math). Calculate 1/var once outside the kernel and then multiply instead of dividing inside the kernel.
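For example (a hypothetical tie-in to the kernel above; invBn is my name, not in the original code):

// Host side, once: compute the reciprocal and pass it to the kernel.
double invBn = 1.0 / Bn;
// Device side: multiply instead of dividing.
neighborhood[v] = floor(xi * invBn);   // was: floor(xi / Bn)
ji = floor(yi * invBn);                // was: floor(yi / Bn)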
I don't know if it gets optimized out, but you should use increment and decrement operators: v=v-1; could be v--; etc.
Casting to an int truncates toward zero; floor() truncates toward negative infinity. You probably don't need an explicit floor() at all, and you should use floorf() for floats, as above. When you use these on intermediate computations over integer types, the values are already truncated, so you're converting to double and back for no reason. Use the appropriately typed function (abs, fabs, fabsf, etc.).
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to
vector_trazo_cuda[20+idHilo*31] +=
fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];
The above line is needlessly complicated. In essence you are doing this:
convert t to double,
convert xinc_cuda to double and multiply,
floor it (returns double),
convert xb_cuda to double and add,
convert to long.
The new line stores the same result in much, much less time (it is also better because, if you exceeded the precision of double in the previous case, you would be rounding to the nearest power of 2). Also, the per-thread loads (xb_cuda[idHilo], xinc_cuda[idHilo], and so on) should be outside the for loop; you don't need to re-read them on every iteration, since they don't depend on t. Together, I wouldn't be surprised if this cuts your run time by a factor of 10-30.
Your structure results in a lot of global memory reads; try to read once from global memory, handle the calculations in local memory (registers), and write once to global (if at all possible).
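A hedged sketch of that restructuring for the kernel above (illustrative only, not a drop-in replacement):

// Hoist the per-thread loads into registers before the loop.
long xb   = xb_cuda[idHilo];
long yb   = yb_cuda[idHilo];
long xinc = xinc_cuda[idHilo];
long yinc = yinc_cuda[idHilo];
long tend = tendbegin_cuda[idHilo];
float acc = 0.0f;                      // accumulate in a register

for (int t = 15; t <= tend - 15; t++) {
    long xi = xb + t * xinc;
    long yi = yb + t * yinc;
    // ... same neighborhood logic, accumulating into acc instead of
    // read-modify-writing vector_trazo_cuda on every iteration ...
}

vector_trazo_cuda[20 + idHilo*31] = acc;   // single global store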
Always compile with -lineinfo. It makes profiling easier, and I haven't been able to measure any overhead whatsoever (using kernels in the 0.1 to 10 ms execution-time range).
Figure out with the profiler whether you're compute-bound or memory-bound, and devote your time accordingly.
Try to let the compiler keep values in registers when possible; this is a big topic.
As always, don't change everything at once. I typed all this out without compiling or testing, so I may have made an error.
You may be running too many threads simultaneously. The optimum performance seems to come when you run the right number of threads: enough threads to keep busy, but not so many as to over-fragment the local memory available to each simultaneous thread.
Last fall I built a tutorial to investigate optimization of the Travelling Salesman Problem (TSP) using CUDA with CUDAFY. The steps I went through in achieving a several-times speed-up over a published algorithm may be useful in guiding your endeavours, even though the problem domain is different. The tutorial and code are available at CUDA Tuning with CUDAFY.

OpenCL reduction, and passing a 2D array

Here is the loop I want to convert to OpenCL.
for (n = 0; n < LargeNumber; ++n) {
    for (n2 = 0; n2 < SmallNumber; ++n2) {
        A[n] += B[n2][n];
    }
    Re += A[n];
}
And here is what I have so far, although I know it is not correct and is missing some things.
__kernel void openCL_Kernel(__global int *A,
                            __global int **B,
                            __global int *C,
                            __global _int64 Re,
                            int D)
{
    int i = get_global_id(0);
    int ii = get_global_id(1);

    A[i] += B[ii][i];
    //barrier(..); ?
    Re += A[i];
}
I'm a complete beginner at this type of thing. First of all, I know that I can't pass a global double pointer to an OpenCL kernel. If you can, wait a few days or so before posting the solution; I want to figure this out for myself, but if you can help point me in the right direction I would be grateful.
Concerning your problem with passing double pointers: that kind of problem is typically solved by copying the whole matrix (or whatever you are working on) into one contiguous block of memory and, if the rows have different lengths, passing another array containing the offsets of the individual rows (so your access would look something like B[index[ii]+i]).
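A minimal host-side sketch of that flattening (the names rows, rowLength, flatB, and index are mine, purely illustrative):

#include <stdlib.h>
#include <string.h>

// Flatten rows of possibly different lengths into one contiguous block,
// plus an offset table. In the kernel, B[r][i] becomes flatB[index[r] + i].
int total = 0;
for (int r = 0; r < rows; ++r)
    total += rowLength[r];

int *flatB = (int *)malloc(total * sizeof(int));
int *index = (int *)malloc(rows * sizeof(int));

for (int r = 0, off = 0; r < rows; ++r) {
    index[r] = off;
    memcpy(flatB + off, B[r], rowLength[r] * sizeof(int));
    off += rowLength[r];
}
// flatB and index are then copied to the device as ordinary 1-D buffers.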
Now for your reduction down to Re: since you didn't mention what kind of device you are working on, I'm going to assume it's a GPU. In that case I would avoid doing the reduction in the same kernel, since it's going to be slow as hell the way you posted it (you would have to serialize the access to Re over thousands of threads, and the access to A[i] too).
Instead I would write one kernel which sums all B[*][i] into A[i], and put the reduction from A into Re in another kernel, doing it in several steps: you use a reduction kernel which operates on n elements and reduces them to something like n / 16 (or any other number), then you iteratively call that kernel until you are down to one element, which is your result (I'm keeping this description intentionally vague, since you said you wanted to figure things out yourself).
As a side note: you realize that the original code doesn't exactly have a nice memory access pattern? Assuming B is relatively large (and much larger than A, due to the second dimension), having the inner loop iterate over the outer index is going to create a lot of cache misses. This is even worse when porting to the GPU, which is very sensitive to coherent memory access.
So reordering it like this may massively increase performance:
for (n2 = 0; n2 < SmallNumber; ++n2)
    for (n = 0; n < LargeNumber; ++n)
        A[n] += B[n2][n];

for (n = 0; n < LargeNumber; ++n)
    Re += A[n];
This is particularly true if you have a compiler which is good at autovectorization, since it might be able to vectorize that construct, but it's very unlikely to be able to do so for the original code (and if it can't prove that A and B[n2] don't refer to the same memory, it can't turn the original code into this).
