OSX CoreAudio: Getting inNumberFrames in advance - on initialization?

I'm experimenting with writing a simplistic single-AU play-through, (almost) no-latency tracking phase vocoder prototype in C. It's a standalone program. I want to find out how much processing load a single render callback can safely bear, so I'd rather avoid asynchronous DSP.
My concept is to have only one pre-determined value, the window step (also called hop size or decimation factor, depending on the literature). This number would equal inNumberFrames, which somehow depends on the device sampling rate (and what else?). All other parameters, such as window size and FFT size, would be set in relation to the window step. This seems the simplest method for keeping everything inside one callback.
Is there a machine-independent way to reliably guess or query what inNumberFrames will be before the actual rendering starts, i.e. before calling AudioOutputUnitStart()?
The phase vocoder algorithm is mostly standard and very simple, using vDSP functions for the FFT, plus custom phase integration, and I have no problems with it.
Additional debugging info
This code is monitoring timings within the input callback:
static Float64 prev_stime; // prev. sample time
static UInt64  prev_htime; // prev. host time

printf("inBus: %d\tframes: %d\tHtime: %lld\tStime: %7.2lf\n",
       (unsigned int)inBusNumber,
       (unsigned int)inNumberFrames,
       inTimeStamp->mHostTime - prev_htime,
       inTimeStamp->mSampleTime - prev_stime);

prev_htime = inTimeStamp->mHostTime;
prev_stime = inTimeStamp->mSampleTime;
Curiously enough, the inTimeStamp->mSampleTime difference actually shows the number of rendered frames (the name of the argument seems somewhat misleading). This number is always 512, no matter whether another sampling rate has been set through Audio MIDI Setup.app at runtime, as if the value had been programmatically hard-coded. On one hand, the
inTimeStamp->mHostTime - prev_htime
interval changes dynamically depending on the sampling rate, in a mathematically clear way. As long as the sampling rate is a multiple of 44100 Hz, actual rendering goes on. On the other hand, multiples of 48 kHz produce the rendering error -10863 (kAudioUnitErr_CannotDoInCurrentContext). I must have missed a very important point.

The number of frames is usually the sample rate times the buffer duration. There is an Audio Unit API to request a sample rate and a preferred buffer duration (such as 44100 and 5.8 ms, resulting in 256 frames), but not all hardware on all OS versions honors all requested buffer durations or sample rates.

Assuming audioUnit is an input audio unit:
UInt32 inNumberFrames = 0;
UInt32 propSize = sizeof(UInt32);
AudioUnitGetProperty(audioUnit,
                     kAudioDevicePropertyBufferFrameSize,
                     kAudioUnitScope_Global,
                     0,
                     &inNumberFrames,
                     &propSize);
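If you need to influence that value rather than only read it, you can also try to set the same property before starting the unit. This is only a sketch (assuming an AUHAL-based input/output unit on macOS, error handling omitted); the HAL is free to grant a different size, so always re-query afterwards:

UInt32 desiredFrames = 512;               // requested render quantum
AudioUnitSetProperty(audioUnit,
                     kAudioDevicePropertyBufferFrameSize,
                     kAudioUnitScope_Global,
                     0,
                     &desiredFrames,
                     sizeof(desiredFrames));
// Re-query with AudioUnitGetProperty (as above) to see what was actually
// granted, then derive the hop/FFT sizes from that value before calling
// AudioOutputUnitStart().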

This number would equal inNumberFrames, which somehow depends on the device sampling rate (and what else?)
It depends on what you attempt to set it to. You can set it.
// attempt to set duration
NSTimeInterval _preferredDuration = ...;
NSError *err = nil;
[[AVAudioSession sharedInstance] setPreferredIOBufferDuration:_preferredDuration error:&err];

// now get the actual duration it uses
NSTimeInterval _actualBufferDuration = [[AVAudioSession sharedInstance] IOBufferDuration];
It would use a value roughly around the preferred value you set. The actual value used is a time interval based on a power of 2 and the current sample rate.
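To make that rounding concrete, here is a rough sketch of my reading of the behaviour (not a documented formula): the preferred duration is converted to frames at the current sample rate and snapped to a nearby power of two.

#include <math.h>
#include <stdio.h>

int main(void) {
    double preferredDuration = 0.005;                        // assumed 5 ms request
    double sampleRate = 48000.0;
    double requestedFrames = preferredDuration * sampleRate; // 240 frames
    double actualFrames = pow(2.0, round(log2(requestedFrames))); // -> 256 frames
    printf("actual: %.0f frames = %.2f ms\n",
           actualFrames, 1000.0 * actualFrames / sampleRate);     // ~5.33 ms
    return 0;
}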
If you are looking for consistency across devices, choose a value around 10 ms. The worst-performing reasonable modern device is the iOS iPod touch 16 GB without the rear-facing camera. However, this device can handle around 10 ms callbacks with no problem. On some devices you "can" set the duration quite low and get very fast callbacks, but often it will crackle because the processing is not finished before the next callback happens.

Related

Generate few us short delay in GNU Linux

I am trying to generate a short delay between two writes to hardware registers in GNU C on ARM (Linux).
It looks like the system latency is too high when I use the usleep() or nanosleep() functions.
The following code fragment
struct timespec ts;
ts.tv_sec = 0;
ts.tv_nsec = 1; // 1 nanosecond
//...
do{ } while (nanosleep(&ts, &ts));
results in a delay of over 100 us (comparing runs with the fragment present and commented out).
What is the way around this? Since my desired delay is approximately 2 us, I can possibly live even with a blocking function.
As @Lubo hinted, I cannot rely on a delay generated within my code being exact, since it may be interrupted.
The HW register I am writing needs ~1 us between two consecutive writes.
If I can generate the shortest possible delay of at least 2 us and don't mind getting a longer delay in cases where I get interrupted, I may still be fine. In total I may end up with less delay than in the current state, where every time I get 100 us more than intended.
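Since the question is left open here, a minimal sketch of one common workaround: a calibrated busy-wait against CLOCK_MONOTONIC. No sleep syscall means the scheduler's ~100 us penalty does not apply; preemption can still make the delay longer, never shorter. This is only an illustration, not a vetted solution for this particular hardware:

#include <stdint.h>
#include <time.h>

/* Busy-wait for at least `ns` nanoseconds using the monotonic clock. */
static void delay_ns(uint64_t ns)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    uint64_t t0 = (uint64_t)start.tv_sec * 1000000000ULL + start.tv_nsec;
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((uint64_t)now.tv_sec * 1000000000ULL + now.tv_nsec - t0 < ns);
}

/* usage: delay_ns(2000);   // ~2 us between the two register writes */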

How to properly implement waiting of async computations?

I have a little trouble and I am asking for a hint. I am on the Windows platform, doing calculations in the following manner:
int input = 0;
int output; // junk bytes here
while (true) {
    async_enqueue_upload(input);    // completes instantly, but transfer will take 10us
    async_enqueue_calculate();      // completes instantly, but computation will take 80us
    async_enqueue_download(output); // completes instantly, but transfer will take 10us
    sync_wait_finish();             // must wait while output is fully calculated, and there is no junk
    input = process(output);        // I cannot launch the next step without doing it on the host.
}
I am asking about the wait_finish() part. I must wait for all devices to finish, combine all results, process the data somehow, and upload a new portion that is based on the previous computation step. I need to sync data between steps, so I can't parallelize them. I know this is not a very performant setup. So let's proceed to the question.
I have two ways of checking completion within wait_finish(). The first is to put the thread to sleep until it is woken up by the completion event:
while (!is_completed())
    Sleep(1);
It has very low performance, because the actual calculation takes, say, 100us, while the minimal Windows scheduler timestep is 1ms, so it gives an unsuitable 10x performance loss.
The second way is to check completion in an empty infinite loop:
while( !is_completed() )
{} // do_nothing();
It gives 10x better computation performance. But it is also an unsuitable solution, because it keeps a full CPU core busy with absolutely useless work. How do I make the CPU "sleep" exactly the time I need? (Each step has an equal amount of work.)
How is this case usually solved, when the calculation time is too big for an active spin-wait but too small compared to the scheduler timestep? Also, a related subquestion - how to do that on Linux?
Fortunately, I have succeeded in finding an answer on my own. In short - I should use Linux for that.
My investigation shows the following. On Windows there is a hidden function in ntdll, NtDelayExecution(). It is not exposed through the SDK, but can be loaded in the following manner:
static int (__stdcall *NtDelayExecution)(BOOL Alertable, PLARGE_INTEGER DelayInterval) =
    (int (__stdcall *)(BOOL, PLARGE_INTEGER))
        GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtDelayExecution");
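For reference, here is a sketch of how the sleep_precise() helper used in the benchmark below could wrap this pointer (my assumption, matching the comment in the benchmark: a negative DelayInterval requests a relative wait in 100-ns units):

// Hypothetical wrapper around the loaded NtDelayExecution pointer.
// intervals = number of 100-ns periods to wait.
static void sleep_precise(LONGLONG intervals)
{
    LARGE_INTEGER delay;
    delay.QuadPart = -intervals;      // negative = relative delay
    NtDelayExecution(FALSE, &delay);
}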
It allows setting sleep intervals in 100 ns periods. However, even that did not work well, as shown in the following benchmark:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS); // requires Admin privileges
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
uint64_t hpf = qpf(); // QueryPerformanceFrequency()
uint64_t s0 = qpc();  // QueryPerformanceCounter()
uint64_t n = 0;
while (1) {
    sleep_precise(1); // NtDelayExecution(-1); waits one 100-nanosecond interval
    auto s1 = qpc();
    n++;
    auto passed = s1 - s0;
    if (passed >= hpf) { // once per second, report the measured loop frequency
        std::cout << "freq=" << (n * hpf / passed) << " hz\n";
        s0 = s1;
        n = 0;
    }
}
That yields something less than a 2000 Hz loop rate, and the result varies from run to run. That led me to the Windows thread-switching scheduler, which is totally unsuited for real-time tasks, with its minimum interval of 0.5 ms (+overhead). By the way, does anyone know how to tune that value?
And next was the Linux question: what can it do? So I built a custom tiny kernel 4.14 by means of buildroot and tested the same benchmark code there. I changed qpc() to return clock_gettime() data with the CLOCK_MONOTONIC clock, qpf() to just return the number of nanoseconds in a second, and sleep_precise() to just call clock_nanosleep(). I failed to find out what the practical difference is between CLOCK_MONOTONIC and CLOCK_REALTIME.
And I was quite surprised to get a whopping 18.4 kHz frequency just out of the box, and it was quite stable. While testing several intervals, I found that I can set the loop to almost any frequency up to 18.4 kHz, but also that the actual measured wait time differs from what I asked for by about 1.6x. For example, if I ask to sleep 100 us it actually sleeps for ~160 us, giving a ~6.25 kHz frequency. Nothing else is going on in the system, just the kernel, busybox and this test. I am not an experienced Linux user, and I am still wondering how I can tune this to be more real-time and deterministic. Can I push that maximum frequency even more?
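For completeness, here is my reconstruction of the Linux replacements described above (the names are the benchmark's, not a standard API; sleep_precise() here takes nanoseconds rather than 100-ns units):

#include <stdint.h>
#include <time.h>

// qpc(): monotonic timestamp in nanoseconds
static uint64_t qpc(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

// qpf(): counter "ticks" per second
static uint64_t qpf(void) { return 1000000000ULL; }

// sleep_precise(): relative sleep on CLOCK_MONOTONIC
static void sleep_precise(uint64_t ns)
{
    struct timespec ts;
    ts.tv_sec  = ns / 1000000000ULL;
    ts.tv_nsec = (long)(ns % 1000000000ULL);
    clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
}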

How can I avoid distortion and stuttering in DirectSound?

I have a DirectSound application I'm writing in C, running on Windows 7. The application just captures some sound frames, and plays them back. For sanity-checking the capture results, I'm writing out the PCM data to a file, which I can play in Linux using aplay.
Unfortunately, the sound is choppy and sometimes contains stuttering (and plays at the wrong speed in Linux). Oddly, the amount of distortion observed when playing back the capture file is lower if the PCM data is not being played through the playback buffer at the time of capture.
Here's the initialization of my WAVEFORMATEX:
memset(&wfx, 0, sizeof(WAVEFORMATEX));
wfx.cbSize          = 0;
wfx.wFormatTag      = WAVE_FORMAT_PCM;
wfx.nChannels       = 1;
wfx.nSamplesPerSec  = sampleRate;
wfx.wBitsPerSample  = sampleBitWidth;
wfx.nBlockAlign     = (wfx.nChannels * wfx.wBitsPerSample) / 8;
wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.nBlockAlign;
The sampleRate is 8000, and sampleBitWidth is 16.
I create a capture and play buffer using this same structure, and the capture buffer has 3 notification positions. I start capturing with:
lpDsCaptureBuffer->Start(DSCBSTART_LOOPING);
I then spark off a playback thread that calls WaitForMultipleObjects on the events associated with the notification points. Upon notification, I reset all the events, and copy the 1 or 2 pieces of the capture buffer to a local buffer, and pass those on to a play routine:
void playFromBuff(LPVOID captureBuff, DWORD captureLen) {
    LPVOID playBuff;
    DWORD playLen;
    HRESULT hr;

    hr = lpDsPlaybackBuffer->Lock(0L, captureLen, &playBuff, &playLen, NULL, 0L, 0L);
    memcpy(playBuff, captureBuff, playLen);
    hr = lpDsPlaybackBuffer->Unlock(playBuff, playLen, NULL, 0L);
    hr = lpDsPlaybackBuffer->SetCurrentPosition(0L);
    hr = lpDsPlaybackBuffer->Play(0L, 0L, 0L);
}
(some error-checking omitted).
Note that the playback buffer has no notification positions. Each time I get a chunk from the capture buffer, I lock the playback buffer starting at position 0.
The capture code, guarded by the WaitForMultipleObjects, looks like:
lpDsCaptureBuffer->GetCurrentPosition(NULL,&readPos);
hr = lpDsCaptureBuffer->Lock(...,...,&captureBuff1,&captureLen1,&captureBuff2,&captureLen2,0L);
where the ellipses contain calculations involving the current and previously-seen read positions. I'm omitting those likely-wrong calculations -- I suspect that's where the problem lies.
My notification positions are multiples of 1024. Yet the read positions reported are 1500, 2500, and 3500. So if I see a read position of 1500, does that mean I can read bytes 0 to 1500? And when I next see 2500, does that mean I should read from 1501 to 2500? Why do those read positions not correspond exactly to my notification positions? What's the right algorithm here?
I've tried the simpler alternative of stopping the capture when the capture buffer is full, without other notification positions. But that means, I think, allowing some sound to escape capture.
The DirectSound API is nowadays a compatibility layer on top of another, "real" audio capture API. This means that internally the audio capture fills some buffers (hence the offsets in multiples of 500) and then passes the filled buffers on to DirectSound capture, which in turn reports them to you. This explains why you see read positions as multiples of 500: DirectSound itself has data available in those chunks.
Since you are interested in getting captured data, your assumption is correct that the read position is what matters. You get the notification, and then you know what offset it is safe to read up to. Since the capture API is layered, there is some latency involved, because the layers need to pass chunks of data between one another before making them available to you.
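To make "read up to the reported position" concrete, here is a sketch of the offset arithmetic the question omitted. The variables captureBufferBytes (total capture buffer size) and lastReadPos are my own names, not from the original code; the wraparound branch is the part that is easy to get wrong:

// On each notification, read everything between the previous read position
// and the newly reported one, treating the capture buffer as circular.
DWORD readPos;
static DWORD lastReadPos = 0;            // persists across notifications
lpDsCaptureBuffer->GetCurrentPosition(NULL, &readPos);

DWORD bytesToRead = (readPos >= lastReadPos)
                  ? readPos - lastReadPos                       // no wrap
                  : captureBufferBytes - lastReadPos + readPos; // wrapped

// Lock from the old position; DirectSound returns a second region
// if the range wraps past the end of the buffer.
hr = lpDsCaptureBuffer->Lock(lastReadPos, bytesToRead,
                             &captureBuff1, &captureLen1,
                             &captureBuff2, &captureLen2, 0L);
// ... copy captureLen1 bytes from captureBuff1, then captureLen2 bytes
//     from captureBuff2 (captureBuff2 may be NULL) ...
hr = lpDsCaptureBuffer->Unlock(captureBuff1, captureLen1, captureBuff2, captureLen2);
lastReadPos = readPos;                   // remember where we stopped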

MME Audio Output Buffer Size

I am currently playing around with outputting FP32 samples via the old MME API (waveOutXxx functions). The problem I've bumped into is that if I provide a buffer length that does not evenly divide the sample rate, certain audible clicks appear in the audio stream; when recorded, it looks like some of the samples are lost (I'm generating a sine wave for the test). Currently I am using the "magic" value of 2205 samples per buffer for 44100 sample rate.
The question is, does anybody know the reason for these dropouts and if there is some magic formula that provides a way to compute the "proper" buffer size?
The safe alignment of data buffers is the value of nBlockAlign in the WAVEFORMATEX structure.
Software must process a multiple of nBlockAlign bytes of data at a time. Data written to and read from a device must always start at the beginning of a block. For example, it is illegal to start playback of PCM data in the middle of a sample (that is, on a non-block-aligned boundary).
For PCM formats this is the number of bytes for a single sample across all channels. Non-PCM formats have their own alignments, often equal to the length of a format-specific block, e.g. 20 ms.
Back when waveOutXxx was the primary API for audio, carrying over unaligned bytes was an unreasonable burden for the API and unneeded performance overhead. Nowadays this API is a compatibility layer on top of other audio APIs, and I suppose that unaligned bytes are simply stripped so that the rest of the content can still be played, rather than rejecting the whole buffer over what might be just a small, non-fatal inaccuracy on the caller's part.
If you fill the audio buffer with sine samples and play it looped, it will very easily click unless the buffer length holds a whole number of periods of the frequency, as you said... the audible click is in fact a discontinuity in the wave. A more advanced technique is to fill the buffer dynamically: set a callback notification as the buffer pointer advances and fill the buffer with the appropriate data at the appropriate offset. I would use a larger buffer, as 2205 samples is too short to get an async notification, calculate the data, and write the buffer, all while playing, but that depends on CPU power.
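As an illustration of both answers (my own sketch, not from either poster): choose a buffer length that is a whole number of sample frames (a multiple of nBlockAlign) and, for a looped sine test, a whole number of sine periods, so the end of the buffer lines up with its beginning:

#include <math.h>
#include <stdio.h>

int main(void) {
    double sampleRate = 44100.0;
    double sineFreq   = 441.0;                 // test tone: exactly 100 samples per period
    int    blockAlign = 1 * sizeof(float);     // mono FP32: nChannels * bytes per sample

    // Aim for roughly 50 ms, rounded to a whole number of sine periods
    double targetFrames = 0.050 * sampleRate;              // 2205
    double framesPerCyc = sampleRate / sineFreq;            // 100
    int    frames       = (int)(round(targetFrames / framesPerCyc) * framesPerCyc);

    int bufferBytes = frames * blockAlign;     // already a multiple of nBlockAlign
    printf("frames=%d bytes=%d\n", frames, bufferBytes);    // 2200 frames, 8800 bytes
    return 0;
}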

PID Control Loops with large and unpredictable anomalies

Short Question
Is there a common way to handle very large anomalies (order of magnitude) within an otherwise uniform control region?
Background
I am working on a control algorithm that drives a motor across a generally uniform control region. With no or minimal loading, the PID control works great (fast response, little to no overshoot). The issue I'm running into is that there will usually be at least one high-load location. The position is determined by the user during installation, so there is no reasonable way for me to know when/where to expect it.
When I tune the PID to handle the high-load location, it causes large overshoots in the non-loaded areas (which I fully expected). While it is OK to overshoot mid-travel, there are no mechanical hard stops on the enclosure. The lack of hard stops means that any significant overshoot can/does cause the control arm to be disconnected from the motor (yielding a dead unit).
Things I'm Prototyping
Nested PIDs (very aggressive when far away from the target, conservative when close by)
Fixed gain when far away, PID when close
Conservative PID (works with no load) + an external control that looks for the PID to stall and applies additional energy until either the target is achieved or a rapid rate of change is detected (i.e. leaving the high-load area)
Hardware Limitations
Full travel defined
Hardstops cannot be added (at this point in time)
Update
My answer does not indicate that this is the best solution. It's just my current solution, which I thought I would share.
Initial Solution
stalled_pwm_output = PWM / | ΔE |
PWM = Max PWM value
ΔE = last_error - new_error
The initial relationship successfully ramps up the PWM output based on the lack of change in the motor. See the graph below for the sample output.
This approach makes sense for the situation where the non-aggressive PID stalls. However, it has the unfortunate (and obvious) issue that when the non-aggressive PID is capable of achieving the setpoint and attempts to slow down, the stalled_pwm_output ramps up. This ramp-up causes a large overshoot when traveling to a non-loaded position.
Current Solution
Theory
stalled_pwm_output = (kE * PID_PWM) / | ΔE |
kE = Scaling Constant
PID_PWM = Current PWM request from the non-aggressive PID
ΔE = last_error - new_error
My current relationship still uses the 1/ΔE concept, but uses the non-aggressive PID PWM output to determine the stall_pwm_output. This allows the PID to throttle back the stall_pwm_output when it starts getting close to the target setpoint, yet allows 100% PWM output when stalled. The scaling constant kE is needed to ensure the PWM gets into the saturation point (above 10,000 in graphs below).
Pseudo Code
Note that the result from calc_stall_pwm is added to the PID PWM output in my current control logic.
int calc_stall_pwm(int pid_pwm, int new_error)
{
    int ret = 0;
    int dE = 0;
    static int last_error = 0;
    const int kE = 1;

    // Allow the stall control until the setpoint is achieved
    if (FALSE == motor_has_reached_target())
    {
        // Determine the error delta
        dE = abs(last_error - new_error);
        last_error = new_error;

        // Protect from divide by zero
        dE = (dE == 0) ? 1 : dE;

        // Determine the stall_pwm_output
        ret = (kE * pid_pwm) / dE;
    }

    return ret;
}
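For context, here is a sketch of how the stall term combines with the PID output in the loop, per the note above. pid_update(), MAX_PWM and set_motor_pwm() are placeholders of mine, not the original control code:

// Inside the control loop (surrounding code is hypothetical):
int error   = setpoint - position;
int pid_pwm = pid_update(error);                       // conservative PID output
int pwm_out = pid_pwm + calc_stall_pwm(pid_pwm, error);

// Clamp to the drive's valid range so the combined request stays sane
if (pwm_out > MAX_PWM) pwm_out = MAX_PWM;
set_motor_pwm(pwm_out);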
Output Data
Stalled PWM Output
Note that in the stalled PWM output graph, the sudden PWM drop at ~3400 is a built-in safety feature activated because the motor was unable to reach position within a given time.
Non-Loaded PWM Output
