I have a skeleton audio app which uses kAudioUnitSubType_HALOutput to play audio via an AURenderCallback. I'm generating a simple pure tone just to test things out, but the tone changes pitch noticeably from time to time; sometimes it drifts up or down, and sometimes it changes rapidly. It can be up to a couple of tones out at ~500Hz. Here's the callback:
static OSStatus outputCallback(void *inRefCon, AudioUnitRenderActionFlags *ioActionFlags,
                               const AudioTimeStamp *inTimeStamp, UInt32 inOutputBusNumber,
                               UInt32 inNumberFrames, AudioBufferList *ioData) {
    static const float frequency = 1000;
    static const float rate = 48000;
    static float phase = 0;

    SInt16 *buffer = (SInt16 *)ioData->mBuffers[0].mData;

    for (int s = 0; s < inNumberFrames; s++) {
        buffer[s] = (SInt16)(sinf(phase) * INT16_MAX);
        phase += 2.0 * M_PI * frequency / rate;
    }

    return noErr;
}
I understand that audio devices drift over time (especially cheap ones like the built-in IO), but this is a lot of drift — it's unusable for music. Any ideas?
Recording http://files.danhalliday.com/stackoverflow/audio.png
You're never resetting phase, so its value will increase indefinitely. Since it's stored in a floating-point type, the precision of the stored value will be degraded as the value increases. This is probably the cause of the frequency variations you're describing.
Adding the following lines to the body of the for() loop should significantly mitigate the problem:
if (phase > 2.0 * M_PI)
    phase -= 2.0 * M_PI;
Changing the type of phase from float to double will also help significantly.
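Putting both suggestions together, a corrected version of the callback could look like the following sketch (same structure and names as your original; only the phase handling changes):

// Sketch combining both suggestions: accumulate phase in a double and
// wrap it after each increment so it never grows large enough to lose precision.
static OSStatus outputCallback(void *inRefCon, AudioUnitRenderActionFlags *ioActionFlags,
                               const AudioTimeStamp *inTimeStamp, UInt32 inOutputBusNumber,
                               UInt32 inNumberFrames, AudioBufferList *ioData) {
    static const double frequency = 1000;
    static const double rate = 48000;
    static double phase = 0;                       // double keeps precision over long runs
    const double increment = 2.0 * M_PI * frequency / rate;

    SInt16 *buffer = (SInt16 *)ioData->mBuffers[0].mData;

    for (UInt32 s = 0; s < inNumberFrames; s++) {
        buffer[s] = (SInt16)(sin(phase) * INT16_MAX);
        phase += increment;
        if (phase > 2.0 * M_PI)                    // keep phase in [0, 2*pi)
            phase -= 2.0 * M_PI;
    }

    return noErr;
}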
I'm using stb_image to load an image and upload it to the GPU. If I just upload the image as loaded by stbi_load, I can confirm (with Nvidia Nsight) that the image is correctly stored in GPU memory. However, some images I'd like to resize before uploading them to the GPU, and in that case I get a crash. This is the code:
int textureWidth;
int textureHeight;
int textureChannelCount;
stbi_uc* pixels = stbi_load(fullPath.string().c_str(), &textureWidth, &textureHeight, &textureChannelCount, STBI_rgb_alpha);
if (!pixels) {
    char error[512];
    sprintf_s(error, "Failed to load image %s!", pathToTexture);
    throw std::runtime_error(error);
}

stbi_uc* resizedPixels = nullptr;
uint32_t imageSize = 0;

if (scale > 1.0001f || scale < 0.9999f) {
    stbir_resize_uint8(pixels, textureWidth, textureHeight, 0, resizedPixels, textureWidth * scale, textureHeight * scale, 0, textureChannelCount);
    stbi_image_free(pixels);
    textureWidth *= scale;
    textureHeight *= scale;
    imageSize = textureWidth * textureHeight * textureChannelCount;
} else {
    resizedPixels = pixels;
    imageSize = textureWidth * textureHeight * textureChannelCount;
}
// Upload the image to the gpu
When this code is run with scale set to 1.0f, it works fine. However, when I set the scale to 0.25f, the program crashes in method stbir_resize_uint8. The image I'm providing in both cases is a 1920x1080 RGBA PNG. Alpha channel is set to 1.0f across the whole image.
Which function do I have to use to resize the image?
EDIT: If I allocate the memory myself, the function no longer crashes and works fine. But I thought stb handled all memory allocation internally. Was I wrong?
I see you found and solved the problem in your edit but here's some useful advice anyway:
It seems like the comments in the source (which is also the documentation) don't explicitly mention that you have to allocate memory for the resized image, but it becomes clear when you take a closer look at the function's signature:
STBIRDEF int stbir_resize_uint8( const unsigned char *input_pixels , int input_w , int input_h , int input_stride_in_bytes,
unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
int num_channels);
Think about how you yourself would return the address of a memory chunk that you allocated in a function. The easiest would be to return the pointer directly like so:
unsigned char* allocate_memory( int size )
{ return (unsigned char*) malloc(size); }
However, the return value seems to be reserved for error codes, so your only option is to set the pointer as a side effect. To do that, you'd need to pass a pointer to it (a pointer to a pointer):
int allocate_memory( unsigned char** pointer_to_array, int size )
{
    *pointer_to_array = (unsigned char*) malloc(size);
    /* Check if allocation was successful and do other stuff... */
    return 0;
}
If you take a closer look at the resize function's signature, you'll notice that there's no such parameter passed, so there's no way for it to return the address of internally allocated memory. (unsigned char* output_pixels instead of unsigned char** output_pixels). As a result, you have to allocate the memory for the resized image yourself.
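In practice that means something like the following sketch (variable names are taken from your snippet; it needs <vector>, stb_image.h and stb_image_resize.h, and the channel count of 4 is what STBI_rgb_alpha gives you):

// Sketch: allocate the output buffer yourself, then hand it to the resize call.
// Names (pixels, textureWidth, textureHeight, scale) follow the question's snippet.
int resizedWidth  = (int)(textureWidth * scale);
int resizedHeight = (int)(textureHeight * scale);
const int channels = 4;  // stbi_load with STBI_rgb_alpha always returns 4 channels

std::vector<stbi_uc> resized((size_t)resizedWidth * resizedHeight * channels);

stbir_resize_uint8(pixels, textureWidth, textureHeight, 0,
                   resized.data(), resizedWidth, resizedHeight, 0,
                   channels);

stbi_image_free(pixels);  // the original, full-size buffer is no longer needed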
I hope this helps you in the future.
There is a mention of memory allocation in the docs but as far as I understand, it's about allocations required to perform the resizing, which is unrelated to the output.
My question is: what should I do when I use real-time time stretching?
I understand that changing the rate changes the number of output samples.
For example, if I stretch audio with a coefficient of 2.0, the output buffer is twice as big.
So what should I do when I implement reverb, delay, or real-time time stretching?
For example, my input buffer is 1024 samples. Then I stretch the audio with a coefficient of 2.0. Now my buffer is 2048 samples.
In the code below, which uses the Superpowered time stretcher, everything works as long as I don't change the rate. When I do change the rate, the output sounds distorted and the speed doesn't actually change.
return ^AUAudioUnitStatus(AudioUnitRenderActionFlags *actionFlags,
                          const AudioTimeStamp *timestamp,
                          AVAudioFrameCount frameCount,
                          NSInteger outputBusNumber,
                          AudioBufferList *outputBufferListPtr,
                          const AURenderEvent *realtimeEventListHead,
                          AURenderPullInputBlock pullInputBlock) {

    pullInputBlock(actionFlags, timestamp, frameCount, 0, renderABLCapture);

    Float32 *sampleDataInLeft   = (Float32 *)renderABLCapture->mBuffers[0].mData;
    Float32 *sampleDataInRight  = (Float32 *)renderABLCapture->mBuffers[1].mData;
    Float32 *sampleDataOutLeft  = (Float32 *)outputBufferListPtr->mBuffers[0].mData;
    Float32 *sampleDataOutRight = (Float32 *)outputBufferListPtr->mBuffers[1].mData;

    SuperpoweredAudiobufferlistElement inputBuffer;
    inputBuffer.samplePosition = 0;
    inputBuffer.startSample = 0;
    inputBuffer.samplesUsed = 0;
    inputBuffer.endSample = frameCount;
    inputBuffer.buffers[0] = SuperpoweredAudiobufferPool::getBuffer(frameCount * 8 + 64);
    inputBuffer.buffers[1] = inputBuffer.buffers[2] = inputBuffer.buffers[3] = NULL;

    SuperpoweredInterleave(sampleDataInLeft, sampleDataInRight, (Float32 *)inputBuffer.buffers[0], frameCount);

    timeStretch->setRateAndPitchShift(1.0f, -2);
    timeStretch->setSampleRate(48000);
    timeStretch->process(&inputBuffer, outputBuffers);

    if (outputBuffers->makeSlice(0, outputBuffers->sampleLength)) {
        int numSamples = 0;
        int samplesOffset = 0;

        while (true) {
            Float32 *timeStretchedAudio = (Float32 *)outputBuffers->nextSliceItem(&numSamples);
            if (!timeStretchedAudio) break;

            SuperpoweredDeInterleave(timeStretchedAudio, sampleDataOutLeft + samplesOffset, sampleDataOutRight + samplesOffset, numSamples);
            samplesOffset += numSamples;
        }

        outputBuffers->clear();
    }

    return noErr;
};
So, how can I write my Audio Unit render block when my input and output buffers have a different number of samples (reverb, delay, or time stretch)?
If your process creates more samples than provided by the audio callback input/output buffer size, you have to save those samples and play them later, by mixing in with subsequent output in a later audio unit callback if necessary.
Often circular buffers are used to decouple input, processing, and output sample rates or buffer sizes.
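As a rough illustration of that idea (the names here are made up, not part of Superpowered or Core Audio, and a real render callback would use a preallocated lock-free ring buffer rather than std::deque):

// Sketch: push whatever the time stretcher produced into a FIFO,
// then pop exactly frameCount samples per render callback.
#include <deque>
#include <algorithm>

static std::deque<float> pendingLeft, pendingRight;

// Called with however many samples the stretcher produced this cycle.
void pushStretched(const float *left, const float *right, int numSamples) {
    pendingLeft.insert(pendingLeft.end(), left, left + numSamples);
    pendingRight.insert(pendingRight.end(), right, right + numSamples);
}

// Called once per render callback: always fills exactly frameCount samples,
// padding with silence if the stretcher hasn't produced enough yet.
void popOutput(float *outLeft, float *outRight, int frameCount) {
    int available = (int)std::min(pendingLeft.size(), pendingRight.size());
    int n = std::min(available, frameCount);

    std::copy_n(pendingLeft.begin(), n, outLeft);
    std::copy_n(pendingRight.begin(), n, outRight);
    pendingLeft.erase(pendingLeft.begin(), pendingLeft.begin() + n);
    pendingRight.erase(pendingRight.begin(), pendingRight.begin() + n);

    std::fill(outLeft + n, outLeft + frameCount, 0.0f);   // underrun -> silence
    std::fill(outRight + n, outRight + frameCount, 0.0f);
}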
I made a small "game" to test some stuttering I had noticed in my actual game, and I can't for the life of me figure out why this is happening. I made the simplest possible project I could to test this out, but I still get pretty heavy stuttering. The FPS is still 60, but every few seconds, sometimes more, the game will stutter.
I have tried it on both mobile and a high-end pc, and oddly enough, it's more noticeable on the PC, though it still occurs on mobile.
I can't upload a video of it, since it's gone in the recording, so feel free to compile the project yourself if you want to test it. Here's the code:
public class LagTest extends ApplicationAdapter {
    SpriteBatch batch;
    Texture dot;
    float x;
    float y;
    float speed;
    float dotWidth;
    int screenWidth;

    @Override
    public void create () {
        batch = new SpriteBatch();
        dot = new Texture("dot.png");
        x = 100;
        y = Gdx.graphics.getHeight()/2 - dot.getHeight()/2;
        speed = 500;
        dotWidth = dot.getWidth();
        screenWidth = Gdx.graphics.getWidth();
    }

    @Override
    public void render () {
        Gdx.gl.glClearColor(0.2f, 0.4f, 0.8f, 1);
        Gdx.gl.glClear(GL20.GL_COLOR_BUFFER_BIT);

        batch.begin();
        batch.draw(dot, x, y);
        batch.end();

        if (x < 0) {
            speed = 500;
        }
        if (x > screenWidth - dotWidth) {
            speed = -500;
        }

        x += speed * Gdx.graphics.getDeltaTime();
    }
}
If anyone has a clue as to what could be causing this, I'm all ears.
Edit:
So here's something fun. This only seems to occur in windowed mode, not in fullscreen. This might also be why it works better on mobile. Perhaps this is a bug then?
After trying several different methods (averaging delta / averaging raw delta / using raw delta / lowering the frame rate to 30 / using a fixed delta each frame), getting the same stuttering with each one, and then doing some googling on stuttering in windowed mode:
I would like to propose that the stuttering is not caused by LibGDX itself, but rather is a general problem that occurs in windowed mode and can have a number of different causes close to the hardware. See here for one example and explanation: https://gamedev.stackexchange.com/questions/47356/why-would-a-borderless-full-screen-window-stutter-occasionally
I have a performance problem when using LDS memory on an AMD Radeon HD 6850.
I have two kernels as parts of an N-particle simulation. Each work item has to calculate the force acting on its corresponding particle, based on the positions of the other particles relative to it. The problematic kernel is:
#define UNROLL_FACTOR 8

//Verlet velocity part kernel
__kernel void kernel_velocity(const float deltaTime,
                              __global const float4 *pos,
                              __global float4 *vel,
                              __global float4 *accel,
                              __local float4 *pblock,
                              const float bound)
{
    const int gid = get_global_id(0);    //global id of work item
    const int id = get_local_id(0);      //local id of work item within work group
    const int s_wg = get_local_size(0);  //work group size
    const int n_wg = get_num_groups(0);  //number of work groups

    const float4 myPos = pos[gid];
    const float4 myVel = vel[gid];
    const float4 dt = (float4)(deltaTime, deltaTime, 0.0f, 0.0f);
    float4 acc = (float4)0.0f;

    for (int jw = 0; jw < n_wg; ++jw)
    {
        pblock[id] = pos[jw * s_wg + id];  //cache a particle position; position in array: workgroup no. * workgroup size + local id
        barrier(CLK_LOCAL_MEM_FENCE);      //wait for the others in the work group

        for (int i = 0; i < s_wg; )
        {
            #pragma unroll UNROLL_FACTOR
            for (int j = 0; j < UNROLL_FACTOR; ++j, ++i)
            {
                float4 r = myPos - pblock[i];

                float rSizeSquareInv = native_recip(r.x*r.x + r.y*r.y + 0.0001f);
                float rSizeSquareInvDouble = rSizeSquareInv * rSizeSquareInv;
                float rSizeSquareInvQuadr = rSizeSquareInvDouble * rSizeSquareInvDouble;
                float rSizeSquareInvHept = rSizeSquareInvQuadr * rSizeSquareInvDouble * rSizeSquareInv;

                acc += r * (2.0f * rSizeSquareInvHept - rSizeSquareInvQuadr);
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    acc *= 24.0f / myPos.w;

    //update velocity only
    float4 newVel = myVel + 0.5f * dt * (accel[gid] + acc);

    //write to global memory
    vel[gid] = newVel;
    accel[gid] = acc;
}
The simulation runs fine in terms of results, but the problem is the performance when using local memory to cache the particle positions and relieve the large amount of reading from global memory. In fact, if the line
float4 r = myPos - pblock[i];
is replaced by
float4 r = myPos - pos[jw * s_wg + i];
the kernel runs faster. I don't really get that, since reading from global memory should be much slower than reading from local memory.
Moreover, when the line
float4 r = myPos - pblock[i];
is removed completely and all following occurrences of r are replaced by myPos - pblock[i], the speed is the same as before, as if the line were not there at all. This I understand even less, since accessing private memory through r should be the fastest, yet the compiler somehow "optimizes" this line out.
Global work size is 4608, local worksize is 192. It is run with AMD APP SDK v2.9 and Catalyst drivers 13.12 in Ubuntu 12.04.
Can anyone please help me with this? Is that my fault or is that a problem of the GPU / drivers / ... ? Or is it a feature? :-)
I'm gonna make a wild guess:
When using float4 r = myPos - pos[jw * s_wg + i]; the compiler is smart enough to notice that the barrier put after the initialization of pblock[id] is no longer necessary and removes it. Very likely all these barriers (in the for loop) impact your performance, so removing them is very noticeable.
Yes, but global access costs a lot too... so I'm guessing that, behind the scenes, the cache memories are well utilized. There is also the fact that you use vectors, and the AMD Radeon HD 6850 has a VLIW architecture... maybe that also helps make better use of the caches... maybe.
EDIT:
I've just found an article benchmarking GPU/APU cache and memory latencies. Your GPU is in the list. You might get some more answers there (sorry, I didn't really read it, too tired).
After some more digging it turned out that the code causes some LDS bank conflicts. The reason is that on AMD there are 32 banks, each 4 bytes wide, but a float4 covers 16 bytes, so the half-wavefront accesses different addresses in the same banks. The solution was to use separate __local float* arrays for the x and y coordinates and read them separately, with the array index shifted appropriately as (id + i) % s_wg. Nevertheless, the overall gain in performance is small, most likely due to the overall latencies described in the link provided by @CaptainObvious (one then has to increase the global work size to hide them).
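For reference, the split-component caching could look roughly like this (a sketch only, adapted from the kernel above; the acceleration scaling and velocity update at the end are unchanged and omitted):

// Sketch of the split-component caching described above: x and y live in separate
// __local float arrays (4-byte elements map 1:1 onto the 4-byte banks), and each
// work item starts reading at its own offset, (id + i) % s_wg, to spread bank accesses.
__kernel void kernel_velocity(const float deltaTime,
                              __global const float4 *pos,
                              __global float4 *vel,
                              __global float4 *accel,
                              __local float *pblockX,   // was: __local float4 *pblock
                              __local float *pblockY,
                              const float bound)
{
    const int gid  = get_global_id(0);
    const int id   = get_local_id(0);
    const int s_wg = get_local_size(0);
    const int n_wg = get_num_groups(0);

    const float4 myPos = pos[gid];
    float4 acc = (float4)0.0f;

    for (int jw = 0; jw < n_wg; ++jw)
    {
        const float4 p = pos[jw * s_wg + id];
        pblockX[id] = p.x;                    // cache the components separately
        pblockY[id] = p.y;
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int i = 0; i < s_wg; ++i)
        {
            const int k = (id + i) % s_wg;    // shifted index to spread bank accesses
            const float rx = myPos.x - pblockX[k];
            const float ry = myPos.y - pblockY[k];

            const float rSizeSquareInv = native_recip(rx*rx + ry*ry + 0.0001f);
            const float inv2 = rSizeSquareInv * rSizeSquareInv;
            const float inv4 = inv2 * inv2;
            const float inv7 = inv4 * inv2 * rSizeSquareInv;

            acc.x += rx * (2.0f * inv7 - inv4);
            acc.y += ry * (2.0f * inv7 - inv4);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // ... acceleration scaling and velocity update identical to the original kernel ...
}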
I'm trying to create a simple function that will decrease audio volume in a buffer (like a fade out) each iteration through the buffer. Here's my simple function.
double iterationSum = 1.0;

double iteration(double sample)
{
    iterationSum *= 0.9;

    //and then multiply that sum with the current sample.
    sample *= iterationSum;

    return sample;
}
This works fine at a 44100 Hz sample rate, but the problem I'm having is that if the sample rate is changed to, say, 88200 Hz, it should only reduce the volume by half that step each time, because the sample rate is twice as high and the "fade out" would otherwise finish in half the time. I've tried using a factor like 44100 / 88200 = 0.5, but that doesn't give me half the step at all.
I'm stuck on this simple problem and need some guidance: what can I do to make each call of this function take half the step when the sample rate is changed at runtime?
Regards, Morgan
The most robust way to fade out independent of sample rate is to keep track of the time since the fadeout started, and use an explicit fadeout(time) function.
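For example (a sketch only; the linear shape and the two-second fade length are arbitrary assumptions):

// Sketch of a time-based fade: the gain depends only on elapsed time, so the
// result is identical at 44100 Hz and 88200 Hz. Fade length is an assumption.
double fadeStartTime = 0.0;      // seconds, set when the fade is triggered
double fadeLength    = 2.0;      // seconds, chosen arbitrarily here

double fadeGain(double currentTime)
{
    double t = (currentTime - fadeStartTime) / fadeLength;   // 0..1 over the fade
    if (t <= 0.0) return 1.0;
    if (t >= 1.0) return 0.0;
    return 1.0 - t;              // linear fade; swap in any curve you like
}

double iteration(double sample, double currentTime)
{
    return sample * fadeGain(currentTime);   // currentTime = samplesPlayed / sampleRate
}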
If for some reason you can't do that, you can set your exponential decay rate based on the sample rate, as follows:
double decay_time = 0.01;  // time to fall to ~37% of original amplitude
double sample_time = 1.0 / sampleRate;
double natural_decay_factor = exp(-sample_time / decay_time);
...

double iteration(double sample) {
    iterationSum *= natural_decay_factor;
    ...
}
The reason for the ~37% is that exp(x) = e^x, where e is the base of the natural logarithm, and 1/e ≈ 0.3678. If you want a different decay factor for your decay time, you need to scale it by a constant:
// for decay to 50% amplitude (~ -6dB) over the given decay_time:
double halflife_decay_factor = exp(- log(2) * sample_time / decay_time);
// for decay to 10% amplitude (-20dB) over the given decay_time:
double db20_decay_factor = exp(- log(10) * sample_time / decay_time);
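And if you simply want to keep the decay-per-second that your 0.9-per-sample factor gives at 44100 Hz, one way (a sketch, assuming 0.9 was tuned for 44100 Hz) is to raise it to the power 44100 / sampleRate rather than scaling it linearly:

#include <cmath>

// Sketch: same decay per second at any sample rate.
// 0.9 per sample at 44100 Hz  ==>  0.9^(44100 / sampleRate) per sample elsewhere.
double perSampleFactor(double sampleRate)
{
    return std::pow(0.9, 44100.0 / sampleRate);
}

// e.g. perSampleFactor(44100.0) == 0.9, perSampleFactor(88200.0) ~= 0.9487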
I'm not sure if I understood, but what about something like this:
public void fadeOut(double sampleRate)
{
    //run 1 iteration per sec?
    int defaultIterations = 10;
    double decrement = calculateIteration(sampleRate, defaultIterations);

    for (int i = 0; i < defaultIterations; i++)
    {
        //maybe run each one of these loops every x ms?
        sampleRate = processIteration(sampleRate, decrement);
    }
}

public double calculateIteration(double sampleRate, int numIterations)
{
    return sampleRate / numIterations;
}

private double processIteration(double sampleRate, double decrement)
{
    return sampleRate -= decrement;
}